linux-fscrypt.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/5] btrfs: support fsverity
@ 2021-04-08 18:33 Boris Burkov
  2021-04-08 18:33 ` [PATCH v3 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Boris Burkov @ 2021-04-08 18:33 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fscrypt

This patchset provides support for fsverity in btrfs.

At a high level, we store the verity descriptor and Merkle tree data
in the file system btree with the file's inode as the objectid, and
direct reads/writes to those items to implement the generic fsverity
interface required by fs/verity/.

The first patch is a preparatory patch which adds a notion of
compat_flags to the btrfs_inode and inode_item in order to allow
enabling verity on a file without making the file system unmountable for
older kernels. (It runs afoul of the leaf corruption check otherwise)

The second patch is the bulk of the fsverity implementation. It
implements the fsverity interface and adds verity checks for the typical
file reading case.

The third patch cleans up the corner cases in readpage, covering inline
extents, preallocated extents, and holes.

The fourth patch handles direct io of a veritied file by falling back to
buffered io.

The fifth patch handles crashes mid-verity enable via orphan items

I have tested this patch set in the following ways:
- xfstests auto group
- with a separate fix for btrfs fiemap and some light touches to the
  tests themselves: xfstests generic/572,573,574,575.
- new xfstest for btrfs specific corruptions (e.g. inline extents).
- new xfstest using dmlogwrites and dmsnapshot to exercise orphans.
- manual test with pwrite script to test merkle cache EFBIG cases near
  the size boundary of my filesystem.
- manual test with sleeps in kernel to force orphan vs. unlink race.
- manual end-to-end test with verity signed rpms.
--
changes for v3:
Patch 2: fix bug in overflow logic, fix interface of
get_verity_descriptor, truncate merkle cache items on failure, fix
various code/style issues.
Patch 5: fix extent data leak if verity races with unlink or O_TMPFILE
and removes a legitimate orphan, then system is interrupted such that
the orphan was needed.

changes for v2:
Patch 1: Unchanged.
Patch 2: Return EFBIG if Merkle data past s_maxbytes. Added special
descriptor item for encryption and to handle ERANGE case for
get_verity_descriptor. Improved function comments. Rebased onto subpage
read patches -- modified end_page_read to do verity check before marking
the page uptodate. Changed from full compat to ro_compat; merged sysfs
feature here.
Patch 3: Rebased onto subpage read patches.
Patch 4: Unchanged.
Patch 5: Used to be sysfs feature, now a new patch that handles orphaned
verity data.

Boris Burkov (4):
  btrfs: add compat_flags to btrfs_inode_item
  btrfs: check verity for reads of inline extents and holes
  btrfs: fallback to buffered io for verity files
  btrfs: verity metadata orphan items

Chris Mason (1):
  btrfs: initial fsverity support

 fs/btrfs/Makefile               |   1 +
 fs/btrfs/btrfs_inode.h          |   2 +
 fs/btrfs/ctree.h                |  25 +-
 fs/btrfs/delayed-inode.c        |   2 +
 fs/btrfs/extent_io.c            |  53 +--
 fs/btrfs/file.c                 |   9 +
 fs/btrfs/inode.c                |  25 +-
 fs/btrfs/ioctl.c                |  21 +-
 fs/btrfs/super.c                |   1 +
 fs/btrfs/sysfs.c                |   6 +
 fs/btrfs/tree-log.c             |   1 +
 fs/btrfs/verity.c               | 683 ++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h      |   2 +-
 include/uapi/linux/btrfs_tree.h |  22 +-
 14 files changed, 817 insertions(+), 36 deletions(-)
 create mode 100644 fs/btrfs/verity.c

-- 
2.30.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-04-08 18:33 [PATCH v3 0/5] btrfs: support fsverity Boris Burkov
@ 2021-04-08 18:33 ` Boris Burkov
  2021-04-08 23:40   ` Anand Jain
  2021-04-08 18:33 ` [PATCH v3 2/5] btrfs: initial fsverity support Boris Burkov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2021-04-08 18:33 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fscrypt

The tree checker currently rejects unrecognized flags when it reads
btrfs_inode_item. Practically, this means that adding a new flag makes
the change backwards incompatible if the flag is ever set on a file.

Take up one of the 4 reserved u64 fields in the btrfs_inode_item as a
new "compat_flags". These flags are zero on inode creation in btrfs and
mkfs and are ignored by an older kernel, so it should be safe to use
them in this way.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/btrfs_inode.h          | 1 +
 fs/btrfs/ctree.h                | 2 ++
 fs/btrfs/delayed-inode.c        | 2 ++
 fs/btrfs/inode.c                | 3 +++
 fs/btrfs/ioctl.c                | 7 ++++---
 fs/btrfs/tree-log.c             | 1 +
 include/uapi/linux/btrfs_tree.h | 7 ++++++-
 7 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index c652e19ad74e..e8dbc8e848ce 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -191,6 +191,7 @@ struct btrfs_inode {
 
 	/* flags field from the on disk inode */
 	u32 flags;
+	u64 compat_flags;
 
 	/*
 	 * Counters to keep track of the number of extent item's we may use due
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f2fd73e58ee6..d633c563164b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1754,6 +1754,7 @@ BTRFS_SETGET_FUNCS(inode_gid, struct btrfs_inode_item, gid, 32);
 BTRFS_SETGET_FUNCS(inode_mode, struct btrfs_inode_item, mode, 32);
 BTRFS_SETGET_FUNCS(inode_rdev, struct btrfs_inode_item, rdev, 64);
 BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
+BTRFS_SETGET_FUNCS(inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_generation, struct btrfs_inode_item,
 			 generation, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_sequence, struct btrfs_inode_item,
@@ -1771,6 +1772,7 @@ BTRFS_SETGET_STACK_FUNCS(stack_inode_gid, struct btrfs_inode_item, gid, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_mode, struct btrfs_inode_item, mode, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_rdev, struct btrfs_inode_item, rdev, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_flags, struct btrfs_inode_item, flags, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
 BTRFS_SETGET_FUNCS(timespec_sec, struct btrfs_timespec, sec, 64);
 BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 1a88f6214ebc..ef4e0265dbe3 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1718,6 +1718,7 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_inode_transid(inode_item, trans->transid);
 	btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
 	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
+	btrfs_set_stack_inode_compat_flags(inode_item, BTRFS_I(inode)->compat_flags);
 	btrfs_set_stack_inode_block_group(inode_item, 0);
 
 	btrfs_set_stack_timespec_sec(&inode_item->atime,
@@ -1776,6 +1777,7 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
 	inode->i_rdev = 0;
 	*rdev = btrfs_stack_inode_rdev(inode_item);
 	BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
+	BTRFS_I(inode)->compat_flags = btrfs_stack_inode_compat_flags(inode_item);
 
 	inode->i_atime.tv_sec = btrfs_stack_timespec_sec(&inode_item->atime);
 	inode->i_atime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->atime);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1e0e20ad25e4..3aa96ec27045 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3627,6 +3627,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
 
 	BTRFS_I(inode)->index_cnt = (u64)-1;
 	BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
+	BTRFS_I(inode)->compat_flags = btrfs_inode_compat_flags(leaf, inode_item);
 
 cache_index:
 	/*
@@ -3793,6 +3794,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_inode_transid(&token, item, trans->transid);
 	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
 	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
+	btrfs_set_token_inode_compat_flags(&token, item, BTRFS_I(inode)->compat_flags);
 	btrfs_set_token_inode_block_group(&token, item, 0);
 }
 
@@ -8859,6 +8861,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	ei->defrag_bytes = 0;
 	ei->disk_i_size = 0;
 	ei->flags = 0;
+	ei->compat_flags = 0;
 	ei->csum_bytes = 0;
 	ei->index_cnt = (u64)-1;
 	ei->dir_index = 0;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 3415a9f06c81..2c9cbd2642b1 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -102,8 +102,9 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
  * Export internal inode flags to the format expected by the FS_IOC_GETFLAGS
  * ioctl.
  */
-static unsigned int btrfs_inode_flags_to_fsflags(unsigned int flags)
+static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
 {
+	unsigned int flags = binode->flags;
 	unsigned int iflags = 0;
 
 	if (flags & BTRFS_INODE_SYNC)
@@ -156,7 +157,7 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
 {
 	struct btrfs_inode *binode = BTRFS_I(file_inode(file));
-	unsigned int flags = btrfs_inode_flags_to_fsflags(binode->flags);
+	unsigned int flags = btrfs_inode_flags_to_fsflags(binode);
 
 	if (copy_to_user(arg, &flags, sizeof(flags)))
 		return -EFAULT;
@@ -228,7 +229,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 
 	btrfs_inode_lock(inode, 0);
 	fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
-	old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
+	old_fsflags = btrfs_inode_flags_to_fsflags(binode);
 
 	ret = vfs_ioc_setflags_prepare(inode, old_fsflags, fsflags);
 	if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 72c4b66ed516..fed638f995ba 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3944,6 +3944,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_inode_transid(&token, item, trans->transid);
 	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
 	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
+	btrfs_set_token_inode_compat_flags(&token, item, BTRFS_I(inode)->compat_flags);
 	btrfs_set_token_inode_block_group(&token, item, 0);
 }
 
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 58d7cff9afb1..ae25280316bd 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -574,11 +574,16 @@ struct btrfs_inode_item {
 	/* modification sequence number for NFS */
 	__le64 sequence;
 
+	/*
+	 * flags which aren't checked for corruption at mount
+	 * and can be added in a backwards compatible way
+	 */
+	__le64 compat_flags;
 	/*
 	 * a little future expansion, for more than this we can
 	 * just grow the inode item and version it
 	 */
-	__le64 reserved[4];
+	__le64 reserved[3];
 	struct btrfs_timespec atime;
 	struct btrfs_timespec ctime;
 	struct btrfs_timespec mtime;
-- 
2.30.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-08 18:33 [PATCH v3 0/5] btrfs: support fsverity Boris Burkov
  2021-04-08 18:33 ` [PATCH v3 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
@ 2021-04-08 18:33 ` Boris Burkov
  2021-04-08 22:38   ` kernel test robot
                     ` (3 more replies)
  2021-04-08 18:33 ` [PATCH v3 3/5] btrfs: check verity for reads of inline extents and holes Boris Burkov
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 18+ messages in thread
From: Boris Burkov @ 2021-04-08 18:33 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fscrypt

From: Chris Mason <clm@fb.com>

Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and the other
stores the Merkle tree data itself.

Verity checking is done at the end of IOs to ensure each page is checked
before it is marked uptodate.

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.

Use the new inode compat_flags to store verity on the inode item, so
that we can enable verity on a file, then rollback to an older kernel
and still mount the file system and read the file. Since we can't safely
write the file anymore without ruining the invariants of the Merkle
tree, we mark a ro_compat flag on the file system when a file has verity
enabled.

Signed-off-by: Chris Mason <clm@fb.com>
---
 fs/btrfs/Makefile               |   1 +
 fs/btrfs/btrfs_inode.h          |   1 +
 fs/btrfs/ctree.h                |  23 +-
 fs/btrfs/extent_io.c            |  27 +-
 fs/btrfs/file.c                 |   6 +
 fs/btrfs/inode.c                |   7 +
 fs/btrfs/ioctl.c                |  14 +-
 fs/btrfs/super.c                |   1 +
 fs/btrfs/sysfs.c                |   6 +
 fs/btrfs/verity.c               | 614 ++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h      |   2 +-
 include/uapi/linux/btrfs_tree.h |  15 +
 12 files changed, 706 insertions(+), 11 deletions(-)
 create mode 100644 fs/btrfs/verity.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index cec88a66bd6c..3dcf9bcc2326 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -36,6 +36,7 @@ btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
 btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o
+btrfs-$(CONFIG_FS_VERITY) += verity.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index e8dbc8e848ce..4536548b9e79 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -51,6 +51,7 @@ enum {
 	 * the file range, inode's io_tree).
 	 */
 	BTRFS_INODE_NO_DELALLOC_FLUSH,
+	BTRFS_INODE_VERITY_IN_PROGRESS,
 };
 
 /* in memory btrfs inode */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d633c563164b..78cc4f44b85e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -279,9 +279,10 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
 
-#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
-	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
-	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID)
+#define BTRFS_FEATURE_COMPAT_RO_SUPP				\
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |		\
+	 BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID |	\
+	 BTRFS_FEATURE_COMPAT_RO_VERITY)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -1473,6 +1474,11 @@ do {                                                                   \
 	 BTRFS_INODE_COMPRESS |						\
 	 BTRFS_INODE_ROOT_ITEM_INIT)
 
+/*
+ * Inode compat flags
+ */
+#define BTRFS_INODE_VERITY		(1 << 0)
+
 struct btrfs_map_token {
 	struct extent_buffer *eb;
 	char *kaddr;
@@ -3721,6 +3727,17 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info)
 	return signal_pending(current);
 }
 
+/* verity.c */
+extern const struct fsverity_operations btrfs_verityops;
+int btrfs_drop_verity_items(struct btrfs_inode *inode);
+BTRFS_SETGET_FUNCS(verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
+		   encryption, 8);
+BTRFS_SETGET_FUNCS(verity_descriptor_size, struct btrfs_verity_descriptor_item, size, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_encryption, struct btrfs_verity_descriptor_item,
+			 encryption, 8);
+BTRFS_SETGET_STACK_FUNCS(stack_verity_descriptor_size, struct btrfs_verity_descriptor_item,
+			 size, 64);
+
 /* Sanity test specific functions */
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 void btrfs_test_destroy_inode(struct inode *inode);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5b2a8a314adf..bf784c10040b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -13,6 +13,7 @@
 #include <linux/pagevec.h>
 #include <linux/prefetch.h>
 #include <linux/cleancache.h>
+#include <linux/fsverity.h>
 #include "misc.h"
 #include "extent_io.h"
 #include "extent-io-tree.h"
@@ -2862,15 +2863,28 @@ static void begin_page_read(struct btrfs_fs_info *fs_info, struct page *page)
 	btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE);
 }
 
-static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
+static int end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 {
-	struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+	int ret = 0;
+	struct inode *inode = page->mapping->host;
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 
 	ASSERT(page_offset(page) <= start &&
 		start + len <= page_offset(page) + PAGE_SIZE);
 
 	if (uptodate) {
-		btrfs_page_set_uptodate(fs_info, page, start, len);
+		/*
+		 * buffered reads of a file with page alignment will issue a
+		 * 0 length read for one page past the end of file, so we must
+		 * explicitly skip checking verity on that page of zeros.
+		 */
+		if (!PageError(page) && !PageUptodate(page) &&
+		    start < i_size_read(inode) &&
+		    fsverity_active(inode) &&
+		    !fsverity_verify_page(page))
+			ret = -EIO;
+		else
+			btrfs_page_set_uptodate(fs_info, page, start, len);
 	} else {
 		btrfs_page_clear_uptodate(fs_info, page, start, len);
 		btrfs_page_set_error(fs_info, page, start, len);
@@ -2878,12 +2892,13 @@ static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
 
 	if (fs_info->sectorsize == PAGE_SIZE)
 		unlock_page(page);
-	else if (is_data_inode(page->mapping->host))
+	else if (is_data_inode(inode))
 		/*
 		 * For subpage data, unlock the page if we're the last reader.
 		 * For subpage metadata, page lock is not utilized for read.
 		 */
 		btrfs_subpage_end_reader(fs_info, page, start, len);
+	return ret;
 }
 
 /*
@@ -3059,7 +3074,9 @@ static void end_bio_extent_readpage(struct bio *bio)
 		bio_offset += len;
 
 		/* Update page status and unlock */
-		end_page_read(page, uptodate, start, len);
+		ret = end_page_read(page, uptodate, start, len);
+		if (ret)
+			uptodate = 0;
 		endio_readpage_release_extent(&processed, BTRFS_I(inode),
 					      start, end, uptodate);
 	}
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 42634658815f..e8dcada0d239 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -16,6 +16,7 @@
 #include <linux/btrfs.h>
 #include <linux/uio.h>
 #include <linux/iversion.h>
+#include <linux/fsverity.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -3578,7 +3579,12 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
+	int ret;
 	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+
+	ret = fsverity_file_open(inode, filp);
+	if (ret)
+		return ret;
 	return generic_file_open(inode, filp);
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3aa96ec27045..887e1ca2ed66 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -32,6 +32,7 @@
 #include <linux/sched/mm.h>
 #include <linux/iomap.h>
 #include <asm/unaligned.h>
+#include <linux/fsverity.h>
 #include "misc.h"
 #include "ctree.h"
 #include "disk-io.h"
@@ -5401,7 +5402,9 @@ void btrfs_evict_inode(struct inode *inode)
 
 	trace_btrfs_inode_evict(inode);
 
+
 	if (!root) {
+		fsverity_cleanup_inode(inode);
 		clear_inode(inode);
 		return;
 	}
@@ -5484,6 +5487,7 @@ void btrfs_evict_inode(struct inode *inode)
 	 * to retry these periodically in the future.
 	 */
 	btrfs_remove_delayed_node(BTRFS_I(inode));
+	fsverity_cleanup_inode(inode);
 	clear_inode(inode);
 }
 
@@ -9043,6 +9047,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(path->dentry);
 	u32 blocksize = inode->i_sb->s_blocksize;
 	u32 bi_flags = BTRFS_I(inode)->flags;
+	u32 bi_compat_flags = BTRFS_I(inode)->compat_flags;
 
 	stat->result_mask |= STATX_BTIME;
 	stat->btime.tv_sec = BTRFS_I(inode)->i_otime.tv_sec;
@@ -9055,6 +9060,8 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 		stat->attributes |= STATX_ATTR_IMMUTABLE;
 	if (bi_flags & BTRFS_INODE_NODUMP)
 		stat->attributes |= STATX_ATTR_NODUMP;
+	if (bi_compat_flags & BTRFS_INODE_VERITY)
+		stat->attributes |= STATX_ATTR_VERITY;
 
 	stat->attributes_mask |= (STATX_ATTR_APPEND |
 				  STATX_ATTR_COMPRESSED |
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2c9cbd2642b1..4ec5735cff9b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -26,6 +26,7 @@
 #include <linux/btrfs.h>
 #include <linux/uaccess.h>
 #include <linux/iversion.h>
+#include <linux/fsverity.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "export.h"
@@ -105,6 +106,7 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
 static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
 {
 	unsigned int flags = binode->flags;
+	unsigned int compat_flags = binode->compat_flags;
 	unsigned int iflags = 0;
 
 	if (flags & BTRFS_INODE_SYNC)
@@ -121,6 +123,8 @@ static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
 		iflags |= FS_DIRSYNC_FL;
 	if (flags & BTRFS_INODE_NODATACOW)
 		iflags |= FS_NOCOW_FL;
+	if (compat_flags & BTRFS_INODE_VERITY)
+		iflags |= FS_VERITY_FL;
 
 	if (flags & BTRFS_INODE_NOCOMPRESS)
 		iflags |= FS_NOCOMP_FL;
@@ -148,10 +152,12 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (binode->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
+	if (binode->compat_flags & BTRFS_INODE_VERITY)
+		new_fl |= S_VERITY;
 
 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
-		      new_fl);
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
+		      S_VERITY, new_fl);
 }
 
 static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
@@ -5055,6 +5061,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_subvol_rootref(file, argp);
 	case BTRFS_IOC_INO_LOOKUP_USER:
 		return btrfs_ioctl_ino_lookup_user(file, argp);
+	case FS_IOC_ENABLE_VERITY:
+		return fsverity_ioctl_enable(file, (const void __user *)argp);
+	case FS_IOC_MEASURE_VERITY:
+		return fsverity_ioctl_measure(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f7a4ad86adee..e5282a8f566a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1339,6 +1339,7 @@ static int btrfs_fill_super(struct super_block *sb,
 	sb->s_op = &btrfs_super_ops;
 	sb->s_d_op = &btrfs_dentry_operations;
 	sb->s_export_op = &btrfs_export_ops;
+	sb->s_vop = &btrfs_verityops;
 	sb->s_xattr = btrfs_xattr_handlers;
 	sb->s_time_gran = 1;
 #ifdef CONFIG_BTRFS_FS_POSIX_ACL
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 6eb1c50fa98c..e8de0055c61e 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -267,6 +267,9 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
 #ifdef CONFIG_BTRFS_DEBUG
 BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 #endif
+#ifdef CONFIG_FS_VERITY
+BTRFS_FEAT_ATTR_COMPAT_RO(verity, VERITY);
+#endif
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -284,6 +287,9 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
 #ifdef CONFIG_BTRFS_DEBUG
 	BTRFS_FEAT_ATTR_PTR(zoned),
+#endif
+#ifdef CONFIG_FS_VERITY
+	BTRFS_FEAT_ATTR_PTR(verity),
 #endif
 	NULL
 };
diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
new file mode 100644
index 000000000000..0cc9bdd876e2
--- /dev/null
+++ b/fs/btrfs/verity.c
@@ -0,0 +1,614 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Facebook.  All rights reserved.
+ */
+
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/rwsem.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/iversion.h>
+#include <linux/fsverity.h>
+#include <linux/sched/mm.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "locking.h"
+
+/*
+ * Just like ext4, we cache the merkle tree in pages after EOF in the page
+ * cache.  Unlike ext4, we're storing these in dedicated btree items and
+ * not just shoving them after EOF in the file.  This means we'll need to
+ * do extra work to encrypt them once encryption is supported in btrfs,
+ * but btrfs has a lot of careful code around i_size and it seems better
+ * to make a new key type than try and adjust all of our expectations
+ * for i_size.
+ *
+ * fs verity items are stored under two different key types on disk.
+ *
+ * The descriptor items:
+ * [ inode objectid, BTRFS_VERITY_DESC_ITEM_KEY, offset ]
+ *
+ * At offset 0, we store a btrfs_verity_descriptor_item which tracks the
+ * size of the descriptor item and some extra data for encryption.
+ * Starting at offset 1, these hold the generic fs verity descriptor.
+ * These are opaque to btrfs, we just read and write them as a blob for
+ * the higher level verity code.  The most common size for this is 256 bytes.
+ *
+ * The merkle tree items:
+ * [ inode objectid, BTRFS_VERITY_MERKLE_ITEM_KEY, offset ]
+ *
+ * These also start at offset 0, and correspond to the merkle tree bytes.
+ * So when fsverity asks for page 0 of the merkle tree, we pull up one page
+ * starting at offset 0 for this key type.  These are also opaque to btrfs,
+ * we're blindly storing whatever fsverity sends down.
+ *
+ * This file is just reading and writing the various items whenever
+ * fsverity needs us to.
+ */
+
+/*
+ * Helper function for computing cache index for Merkle tree pages
+ * @inode: verity file whose Merkle items we want.
+ * @merkle_index: index of the page in the Merkle tree (as in
+ *                read_merkle_tree_page).
+ * @ret_index: returned index in the inode's mapping
+ *
+ * Returns: 0 on success, -EFBIG if the location in the file would be beyond
+ * sb->s_maxbytes.
+ */
+static int get_verity_mapping_index(struct inode *inode,
+				    pgoff_t merkle_index,
+				    pgoff_t *ret_index)
+{
+	/*
+	 * the file is readonly, so i_size can't change here.  We jump
+	 * some pages past the last page to cache our merkles.  The goal
+	 * is just to jump past any hugepages that might be mapped in.
+	 */
+	pgoff_t merkle_offset = 2048;
+	u64 index = (i_size_read(inode) >> PAGE_SHIFT) + merkle_offset + merkle_index;
+
+	if (index > inode->i_sb->s_maxbytes >> PAGE_SHIFT)
+		return -EFBIG;
+
+	*ret_index = index;
+	return 0;
+}
+
+
+/*
+ * Drop all the items for this inode with this key_type.
+ * @inode: The inode to drop items for
+ * @key_type: The type of items to drop (VERITY_DESC_ITEM or
+ *            VERITY_MERKLE_ITEM)
+ *
+ * Before doing a verity enable we cleanup any existing verity items.
+ *
+ * This is also used to clean up if a verity enable failed half way
+ * through.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+static int drop_verity_items(struct btrfs_inode *inode, u8 key_type)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	while (1) {
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			goto out;
+		}
+
+		/*
+		 * walk backwards through all the items until we find one
+		 * that isn't from our key type or objectid
+		 */
+		key.objectid = btrfs_ino(inode);
+		key.offset = (u64)-1;
+		key.type = key_type;
+
+		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+		if (ret > 0) {
+			ret = 0;
+			/* no more keys of this type, we're done */
+			if (path->slots[0] == 0)
+				break;
+			path->slots[0]--;
+		} else if (ret < 0) {
+			break;
+		}
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		/* no more keys of this type, we're done */
+		if (key.objectid != btrfs_ino(inode) || key.type != key_type)
+			break;
+
+		/*
+		 * this shouldn't be a performance sensitive function because
+		 * it's not used as part of truncate.  If it ever becomes
+		 * perf sensitive, change this to walk forward and bulk delete
+		 * items
+		 */
+		ret = btrfs_del_items(trans, root, path,
+				      path->slots[0], 1);
+		btrfs_release_path(path);
+		btrfs_end_transaction(trans);
+
+		if (ret)
+			goto out;
+	}
+
+	btrfs_end_transaction(trans);
+out:
+	btrfs_free_path(path);
+	return ret;
+
+}
+
+/*
+ * Insert and write inode items with a given key type and offset.
+ * @inode: The inode to insert for.
+ * @key_type: The key type to insert.
+ * @offset: The item offset to insert at.
+ * @src: Source data to write.
+ * @len: Length of source data to write.
+ *
+ * Write len bytes from src into items of up to 1k length.
+ * The inserted items will have key <ino, key_type, offset + off> where
+ * off is consecutively increasing from 0 up to the last item ending at
+ * offset + len.
+ *
+ * Returns 0 on success and a negative error code on failure.
+ */
+static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
+			   const char *src, u64 len)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_path *path;
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 orig_len = len;
+	u64 copied = 0;
+	unsigned long copy_bytes;
+	unsigned long src_offset = 0;
+	void *data;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	while (len > 0) {
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			break;
+		}
+
+		key.objectid = btrfs_ino(inode);
+		key.offset = offset;
+		key.type = key_type;
+
+		/*
+		 * insert 1K at a time mostly to be friendly for smaller
+		 * leaf size filesystems
+		 */
+		copy_bytes = min_t(u64, len, 1024);
+
+		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
+		if (ret) {
+			btrfs_end_transaction(trans);
+			break;
+		}
+
+		leaf = path->nodes[0];
+
+		data = btrfs_item_ptr(leaf, path->slots[0], void);
+		write_extent_buffer(leaf, src + src_offset,
+				    (unsigned long)data, copy_bytes);
+		offset += copy_bytes;
+		src_offset += copy_bytes;
+		len -= copy_bytes;
+		copied += copy_bytes;
+
+		btrfs_release_path(path);
+		btrfs_end_transaction(trans);
+	}
+
+	btrfs_free_path(path);
+
+	if (!ret && copied != orig_len)
+		ret = -EIO;
+	return ret;
+}
+
+/*
+ * Read inode items of the given key type and offset from the btree.
+ * @inode: The inode to read items of.
+ * @key_type: The key type to read.
+ * @offset: The item offset to read from.
+ * @dest: The buffer to read into. This parameter has slightly tricky
+ *        semantics.  If it is NULL, the function will not do any copying
+ *        and will just return the size of all the items up to len bytes.
+ *        If dest_page is passed, then the function will kmap_atomic the
+ *        page and ignore dest, but it must still be non-NULL to avoid the
+ *        counting-only behavior.
+ * @len: Length in bytes to read.
+ * @dest_page: Copy into this page instead of the dest buffer.
+ *
+ * Helper function to read items from the btree.  This returns the number
+ * of bytes read or < 0 for errors.  We can return short reads if the
+ * items don't exist on disk or aren't big enough to fill the desired length.
+ *
+ * Supports reading into a provided buffer (dest) or into the page cache
+ *
+ * Returns number of bytes read or a negative error code on failure.
+ */
+static ssize_t read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
+			  char *dest, u64 len, struct page *dest_page)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 item_end;
+	u64 copy_end;
+	u64 copied = 0;
+	u32 copy_offset;
+	unsigned long copy_bytes;
+	unsigned long dest_offset = 0;
+	void *data;
+	char *kaddr = dest;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	if (dest_page)
+		path->reada = READA_FORWARD;
+
+	key.objectid = btrfs_ino(inode);
+	key.offset = offset;
+	key.type = key_type;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0) {
+		goto out;
+	} else if (ret > 0) {
+		ret = 0;
+		if (path->slots[0] == 0)
+			goto out;
+		path->slots[0]--;
+	}
+
+	while (len > 0) {
+		leaf = path->nodes[0];
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+		if (key.objectid != btrfs_ino(inode) ||
+		    key.type != key_type)
+			break;
+
+		item_end = btrfs_item_size_nr(leaf, path->slots[0]) + key.offset;
+
+		if (copied > 0) {
+			/*
+			 * once we've copied something, we want all of the items
+			 * to be sequential
+			 */
+			if (key.offset != offset)
+				break;
+		} else {
+			/*
+			 * our initial offset might be in the middle of an
+			 * item.  Make sure it all makes sense
+			 */
+			if (key.offset > offset)
+				break;
+			if (item_end <= offset)
+				break;
+		}
+
+		/* desc = NULL to just sum all the item lengths */
+		if (!dest)
+			copy_end = item_end;
+		else
+			copy_end = min(offset + len, item_end);
+
+		/* number of bytes in this item we want to copy */
+		copy_bytes = copy_end - offset;
+
+		/* offset from the start of item for copying */
+		copy_offset = offset - key.offset;
+
+		if (dest) {
+			if (dest_page)
+				kaddr = kmap_atomic(dest_page);
+
+			data = btrfs_item_ptr(leaf, path->slots[0], void);
+			read_extent_buffer(leaf, kaddr + dest_offset,
+					   (unsigned long)data + copy_offset,
+					   copy_bytes);
+
+			if (dest_page)
+				kunmap_atomic(kaddr);
+		}
+
+		offset += copy_bytes;
+		dest_offset += copy_bytes;
+		len -= copy_bytes;
+		copied += copy_bytes;
+
+		path->slots[0]++;
+		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+			/*
+			 * we've reached the last slot in this leaf and we need
+			 * to go to the next leaf.
+			 */
+			ret = btrfs_next_leaf(root, path);
+			if (ret < 0) {
+				break;
+			} else if (ret > 0) {
+				ret = 0;
+				break;
+			}
+		}
+	}
+out:
+	btrfs_free_path(path);
+	if (!ret)
+		ret = copied;
+	return ret;
+}
+
+/*
+ * fsverity op that begins enabling verity.
+ * fsverity calls this to ask us to setup the inode for enabling.  We
+ * drop any existing verity items and set the in progress bit.
+ */
+static int btrfs_begin_enable_verity(struct file *filp)
+{
+	struct inode *inode = file_inode(filp);
+	int ret;
+
+	if (test_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags))
+		return -EBUSY;
+
+	/*
+	 * ext4 adds the inode to the orphan list here, presumably because the
+	 * truncate done at orphan processing time will delete partial
+	 * measurements.  TODO: setup orphans
+	 */
+	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
+	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
+	if (ret)
+		goto err;
+
+	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
+	return ret;
+
+}
+
+/*
+ * fsverity op that ends enabling verity.
+ * fsverity calls this when it's done with all of the pages in the file
+ * and all of the merkle items have been inserted.  We write the
+ * descriptor and update the inode in the btree to reflect its new life
+ * as a verity file.
+ */
+static int btrfs_end_enable_verity(struct file *filp, const void *desc,
+				  size_t desc_size, u64 merkle_tree_size)
+{
+	struct btrfs_trans_handle *trans;
+	struct inode *inode = file_inode(filp);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_verity_descriptor_item item;
+	int ret;
+
+	if (desc != NULL) {
+		/* write out the descriptor item */
+		memset(&item, 0, sizeof(item));
+		btrfs_set_stack_verity_descriptor_size(&item, desc_size);
+		ret = write_key_bytes(BTRFS_I(inode),
+				      BTRFS_VERITY_DESC_ITEM_KEY, 0,
+				      (const char *)&item, sizeof(item));
+		if (ret)
+			goto out;
+		/* write out the descriptor itself */
+		ret = write_key_bytes(BTRFS_I(inode),
+				      BTRFS_VERITY_DESC_ITEM_KEY, 1,
+				      desc, desc_size);
+		if (ret)
+			goto out;
+
+		/* update our inode flags to include fs verity */
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			goto out;
+		}
+		BTRFS_I(inode)->compat_flags |= BTRFS_INODE_VERITY;
+		btrfs_sync_inode_flags_to_i_flags(inode);
+		ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
+		btrfs_end_transaction(trans);
+	}
+
+out:
+	if (desc == NULL || ret) {
+		/* If we failed, drop all the verity items */
+		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
+		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
+	} else
+		btrfs_set_fs_compat_ro(root->fs_info, VERITY);
+	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
+	return ret;
+}
+
+/*
+ * fsverity op that gets the struct fsverity_descriptor.
+ * fsverity does a two pass setup for reading the descriptor, in the first pass
+ * it calls with buf_size = 0 to query the size of the descriptor,
+ * and then in the second pass it actually reads the descriptor off
+ * disk.
+ */
+static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
+				       size_t buf_size)
+{
+	size_t true_size;
+	ssize_t ret = 0;
+	struct btrfs_verity_descriptor_item item;
+
+	memset(&item, 0, sizeof(item));
+	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
+			     0, (char *)&item, sizeof(item), NULL);
+	if (ret < 0)
+		return ret;
+
+	true_size = btrfs_stack_verity_descriptor_size(&item);
+	if (true_size > INT_MAX)
+		return -EUCLEAN;
+	if (!buf_size)
+		return true_size;
+	if (buf_size < true_size)
+		return -ERANGE;
+
+	ret = read_key_bytes(BTRFS_I(inode),
+			     BTRFS_VERITY_DESC_ITEM_KEY, 1,
+			     buf, buf_size, NULL);
+	if (ret < 0)
+		return ret;
+	if (ret != true_size)
+		return -EIO;
+
+	return true_size;
+}
+
+/*
+ * fsverity op that reads and caches a merkle tree page.  These are stored
+ * in the btree, but we cache them in the inode's address space after EOF.
+ */
+static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
+					       pgoff_t index,
+					       unsigned long num_ra_pages)
+{
+	struct page *p;
+	u64 start = index << PAGE_SHIFT;
+	pgoff_t mapping_index;
+	ssize_t ret;
+	int err;
+
+	err = get_verity_mapping_index(inode, index, &mapping_index);
+	if (err < 0)
+		return ERR_PTR(err);
+again:
+	p = find_get_page_flags(inode->i_mapping, mapping_index, FGP_ACCESSED);
+	if (p) {
+		if (PageUptodate(p))
+			return p;
+
+		lock_page(p);
+		/*
+		 * we only insert uptodate pages, so !Uptodate has to be
+		 * an error
+		 */
+		if (!PageUptodate(p)) {
+			unlock_page(p);
+			put_page(p);
+			return ERR_PTR(-EIO);
+		}
+		unlock_page(p);
+		return p;
+	}
+
+	p = page_cache_alloc(inode->i_mapping);
+	if (!p)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * merkle item keys are indexed from byte 0 in the merkle tree.
+	 * they have the form:
+	 *
+	 * [ inode objectid, BTRFS_MERKLE_ITEM_KEY, offset in bytes ]
+	 */
+	ret = read_key_bytes(BTRFS_I(inode),
+			     BTRFS_VERITY_MERKLE_ITEM_KEY, start,
+			     page_address(p), PAGE_SIZE, p);
+	if (ret < 0) {
+		put_page(p);
+		return ERR_PTR(ret);
+	}
+
+	/* zero fill any bytes we didn't write into the page */
+	if (ret < PAGE_SIZE) {
+		char *kaddr = kmap_atomic(p);
+
+		memset(kaddr + ret, 0, PAGE_SIZE - ret);
+		kunmap_atomic(kaddr);
+	}
+	SetPageUptodate(p);
+	err = add_to_page_cache_lru(p, inode->i_mapping, mapping_index,
+				    mapping_gfp_mask(inode->i_mapping));
+
+	if (!err) {
+		/* inserted and ready for fsverity */
+		unlock_page(p);
+	} else {
+		put_page(p);
+		/* did someone race us into inserting this page? */
+		if (err == -EEXIST)
+			goto again;
+		p = ERR_PTR(err);
+	}
+	return p;
+}
+
+/*
+ * fsverity op that writes a merkle tree block into the btree in 1k chunks.
+ */
+static int btrfs_write_merkle_tree_block(struct inode *inode, const void *buf,
+					u64 index, int log_blocksize)
+{
+	u64 start = index << log_blocksize;
+	u64 len = 1 << log_blocksize;
+	int ret;
+	pgoff_t mapping_index;
+
+	ret = get_verity_mapping_index(inode, index, &mapping_index);
+	if (ret < 0)
+		return ret;
+
+	return write_key_bytes(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY,
+			       start, buf, len);
+}
+
+const struct fsverity_operations btrfs_verityops = {
+	.begin_enable_verity	= btrfs_begin_enable_verity,
+	.end_enable_verity	= btrfs_end_enable_verity,
+	.get_verity_descriptor	= btrfs_get_verity_descriptor,
+	.read_merkle_tree_page	= btrfs_read_merkle_tree_page,
+	.write_merkle_tree_block = btrfs_write_merkle_tree_block,
+};
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 5df73001aad4..fa21c8aac78d 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -288,6 +288,7 @@ struct btrfs_ioctl_fs_info_args {
  * first mount when booting older kernel versions.
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE_VALID	(1ULL << 1)
+#define BTRFS_FEATURE_COMPAT_RO_VERITY		(1ULL << 2)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
@@ -308,7 +309,6 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 #define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
-
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
 	__u64 compat_ro_flags;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index ae25280316bd..2be57416f886 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -118,6 +118,14 @@
 #define BTRFS_INODE_REF_KEY		12
 #define BTRFS_INODE_EXTREF_KEY		13
 #define BTRFS_XATTR_ITEM_KEY		24
+
+/*
+ * fsverity has a descriptor per file, and then
+ * a number of sha or csum items indexed by offset in to the file.
+ */
+#define BTRFS_VERITY_DESC_ITEM_KEY	36
+#define BTRFS_VERITY_MERKLE_ITEM_KEY	37
+
 #define BTRFS_ORPHAN_ITEM_KEY		48
 /* reserve 2-15 close to the inode for later flexibility */
 
@@ -996,4 +1004,11 @@ struct btrfs_qgroup_limit_item {
 	__le64 rsv_excl;
 } __attribute__ ((__packed__));
 
+struct btrfs_verity_descriptor_item {
+	/* size of the verity descriptor in bytes */
+	__le64 size;
+	__le64 reserved[2];
+	__u8 encryption;
+} __attribute__ ((__packed__));
+
 #endif /* _BTRFS_CTREE_H_ */
-- 
2.30.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 3/5] btrfs: check verity for reads of inline extents and holes
  2021-04-08 18:33 [PATCH v3 0/5] btrfs: support fsverity Boris Burkov
  2021-04-08 18:33 ` [PATCH v3 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
  2021-04-08 18:33 ` [PATCH v3 2/5] btrfs: initial fsverity support Boris Burkov
@ 2021-04-08 18:33 ` Boris Burkov
  2021-04-08 18:33 ` [PATCH v3 4/5] btrfs: fallback to buffered io for verity files Boris Burkov
  2021-04-08 18:33 ` [PATCH v3 5/5] btrfs: verity metadata orphan items Boris Burkov
  4 siblings, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2021-04-08 18:33 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fscrypt

The majority of reads receive a verity check after the bio is complete
as the page is marked uptodate. However, there is a class of reads which
are handled with btrfs logic in readpage, rather than by submitting a
bio. Specifically, these are inline extents, preallocated extents, and
holes. Tweak readpage so that if it is going to mark such a page
uptodate, it first checks verity on it.

Now if a veritied file has corruption to this class of EXTENT_DATA
items, it will be detected at read time.

There is one annoying edge case that requires checking for start <
last_byte: if userspace reads to the end of a file with page aligned
size and then tries to keep reading (as cat does), the buffered read
code will try to read the page past the end of the file, and expects it
to be filled with 0s and marked uptodate. That bogus page is not part of
the data hashed by verity, so we have to ignore it.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/extent_io.c | 26 +++++++-------------------
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bf784c10040b..a15f289e29e6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2202,18 +2202,6 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	return bitset;
 }
 
-/*
- * helper function to set a given page up to date if all the
- * extents in the tree for that page are up to date
- */
-static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
-{
-	u64 start = page_offset(page);
-	u64 end = start + PAGE_SIZE - 1;
-	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
-		SetPageUptodate(page);
-}
-
 int free_io_failure(struct extent_io_tree *failure_tree,
 		    struct extent_io_tree *io_tree,
 		    struct io_failure_record *rec)
@@ -3467,14 +3455,14 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 					    &cached, GFP_NOFS);
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1, &cached);
-			end_page_read(page, true, cur, iosize);
+			ret = end_page_read(page, true, cur, iosize);
 			break;
 		}
 		em = __get_extent_map(inode, page, pg_offset, cur,
 				      end - cur + 1, em_cached);
 		if (IS_ERR_OR_NULL(em)) {
 			unlock_extent(tree, cur, end);
-			end_page_read(page, false, cur, end + 1 - cur);
+			ret = end_page_read(page, false, cur, end + 1 - cur);
 			break;
 		}
 		extent_offset = cur - em->start;
@@ -3555,9 +3543,10 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 
 			set_extent_uptodate(tree, cur, cur + iosize - 1,
 					    &cached, GFP_NOFS);
+
 			unlock_extent_cached(tree, cur,
 					     cur + iosize - 1, &cached);
-			end_page_read(page, true, cur, iosize);
+			ret = end_page_read(page, true, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3565,9 +3554,8 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		/* the get_extent function already copied into the page */
 		if (test_range_bit(tree, cur, cur_end,
 				   EXTENT_UPTODATE, 1, NULL)) {
-			check_page_uptodate(tree, page);
 			unlock_extent(tree, cur, cur + iosize - 1);
-			end_page_read(page, true, cur, iosize);
+			ret = end_page_read(page, true, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3577,7 +3565,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		 */
 		if (block_start == EXTENT_MAP_INLINE) {
 			unlock_extent(tree, cur, cur + iosize - 1);
-			end_page_read(page, false, cur, iosize);
+			ret = end_page_read(page, false, cur, iosize);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -3595,7 +3583,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 			*bio_flags = this_bio_flag;
 		} else {
 			unlock_extent(tree, cur, cur + iosize - 1);
-			end_page_read(page, false, cur, iosize);
+			ret = end_page_read(page, false, cur, iosize);
 			goto out;
 		}
 		cur = cur + iosize;
-- 
2.30.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 4/5] btrfs: fallback to buffered io for verity files
  2021-04-08 18:33 [PATCH v3 0/5] btrfs: support fsverity Boris Burkov
                   ` (2 preceding siblings ...)
  2021-04-08 18:33 ` [PATCH v3 3/5] btrfs: check verity for reads of inline extents and holes Boris Burkov
@ 2021-04-08 18:33 ` Boris Burkov
  2021-04-08 18:33 ` [PATCH v3 5/5] btrfs: verity metadata orphan items Boris Burkov
  4 siblings, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2021-04-08 18:33 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fscrypt

Reading the contents with direct IO would circumvent verity checks, so
fallback to buffered reads. For what it's worth, this is how ext4
handles it as well.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/file.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e8dcada0d239..9f8d90bbbe26 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3613,6 +3613,9 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if (fsverity_active(inode))
+		return 0;
+
 	if (check_direct_read(btrfs_sb(inode->i_sb), to, iocb->ki_pos))
 		return 0;
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 5/5] btrfs: verity metadata orphan items
  2021-04-08 18:33 [PATCH v3 0/5] btrfs: support fsverity Boris Burkov
                   ` (3 preceding siblings ...)
  2021-04-08 18:33 ` [PATCH v3 4/5] btrfs: fallback to buffered io for verity files Boris Burkov
@ 2021-04-08 18:33 ` Boris Burkov
  4 siblings, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2021-04-08 18:33 UTC (permalink / raw)
  To: linux-btrfs, kernel-team, linux-fscrypt

If we don't finish creating fsverity metadata for a file, or fail to
clean up already created metadata after a failure, we could leak the
verity items.

To address this issue, we use the orphan mechanism. When we start
enabling verity on a file, we also add an orphan item for that inode.
When we are finished, we delete the orphan. However, if we are
interrupted midway, the orphan will be present at mount and we can
cleanup the half-formed verity state.

There is a possible race with a normal unlink operation: if unlink and
verity run on the same file in parallel, it is possible for verity to
succeed and delete the still legitimate orphan added by unlink. Then, if
we are interrupted and mount in that state, we will never clean up the
inode properly. This is also possible for a file created with O_TMPFILE.
Check nlink==0 before deleting to avoid this race.

A final thing to note is that this is a resurrection of using orphans to
signal orphaned metadata that isn't the inode itself. This makes the
comment discussing deprecating that concept a bit messy in full context.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/inode.c  | 15 ++++++--
 fs/btrfs/verity.c | 89 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 92 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 887e1ca2ed66..939893cb039d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3419,7 +3419,9 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 
 		/*
 		 * If we have an inode with links, there are a couple of
-		 * possibilities. Old kernels (before v3.12) used to create an
+		 * possibilities:
+		 *
+		 * 1. Old kernels (before v3.12) used to create an
 		 * orphan item for truncate indicating that there were possibly
 		 * extent items past i_size that needed to be deleted. In v3.12,
 		 * truncate was changed to update i_size in sync with the extent
@@ -3432,13 +3434,22 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 		 * slim, and it's a pain to do the truncate now, so just delete
 		 * the orphan item.
 		 *
+		 * 2. We were halfway through creating fsverity metadata for the
+		 * file. In that case, the orphan item represents incomplete
+		 * fsverity metadata which must be cleaned up with
+		 * btrfs_drop_verity_items.
+		 *
 		 * It's also possible that this orphan item was supposed to be
 		 * deleted but wasn't. The inode number may have been reused,
 		 * but either way, we can delete the orphan item.
 		 */
 		if (ret == -ENOENT || inode->i_nlink) {
-			if (!ret)
+			if (!ret) {
+				ret = btrfs_drop_verity_items(BTRFS_I(inode));
 				iput(inode);
+				if (ret)
+					goto out;
+			}
 			trans = btrfs_start_transaction(root, 1);
 			if (IS_ERR(trans)) {
 				ret = PTR_ERR(trans);
diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
index 0cc9bdd876e2..b96f7d9698a8 100644
--- a/fs/btrfs/verity.c
+++ b/fs/btrfs/verity.c
@@ -378,6 +378,69 @@ static ssize_t read_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset
 	return ret;
 }
 
+static int add_orphan(struct btrfs_inode *inode)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = inode->root;
+	int ret = 0;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+	ret = btrfs_orphan_add(trans, inode);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
+	btrfs_end_transaction(trans);
+
+out:
+	return ret;
+}
+
+static int del_orphan(struct btrfs_inode *inode)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = inode->root;
+	int ret;
+
+	if (!inode->vfs_inode.i_nlink)
+		return 0;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		return ret;
+	}
+
+	btrfs_end_transaction(trans);
+	return ret;
+}
+
+int btrfs_drop_verity_items(struct btrfs_inode *inode)
+{
+	int ret;
+	struct inode *ino = &inode->vfs_inode;
+	pgoff_t index;
+
+	ret = get_verity_mapping_index(ino, 0, &index);
+	if (ret)
+		return ret;
+	truncate_inode_pages(inode->vfs_inode.i_mapping, index << PAGE_SHIFT);
+
+	ret = drop_verity_items(inode, BTRFS_VERITY_DESC_ITEM_KEY);
+	if (ret)
+		return ret;
+
+	return drop_verity_items(inode, BTRFS_VERITY_MERKLE_ITEM_KEY);
+}
+
 /*
  * fsverity op that begins enabling verity.
  * fsverity calls this to ask us to setup the inode for enabling.  We
@@ -391,17 +454,13 @@ static int btrfs_begin_enable_verity(struct file *filp)
 	if (test_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags))
 		return -EBUSY;
 
-	/*
-	 * ext4 adds the inode to the orphan list here, presumably because the
-	 * truncate done at orphan processing time will delete partial
-	 * measurements.  TODO: setup orphans
-	 */
 	set_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
-	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
+
+	ret = btrfs_drop_verity_items(BTRFS_I(inode));
 	if (ret)
 		goto err;
 
-	ret = drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
+	ret = add_orphan(BTRFS_I(inode));
 	if (ret)
 		goto err;
 
@@ -428,6 +487,7 @@ static int btrfs_end_enable_verity(struct file *filp, const void *desc,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_verity_descriptor_item item;
 	int ret;
+	int keep_orphan = 0;
 
 	if (desc != NULL) {
 		/* write out the descriptor item */
@@ -459,11 +519,20 @@ static int btrfs_end_enable_verity(struct file *filp, const void *desc,
 
 out:
 	if (desc == NULL || ret) {
-		/* If we failed, drop all the verity items */
-		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
-		drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
+		/*
+		 * If verity failed (here or in the generic code), drop all the
+		 * verity items.
+		 */
+		keep_orphan = btrfs_drop_verity_items(BTRFS_I(inode));
 	} else
 		btrfs_set_fs_compat_ro(root->fs_info, VERITY);
+	/*
+	 * If we are handling an error, but failed to drop the verity items,
+	 * we still need the orphan.
+	 */
+	if (!keep_orphan)
+		del_orphan(BTRFS_I(inode));
+
 	clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
 	return ret;
 }
-- 
2.30.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-08 18:33 ` [PATCH v3 2/5] btrfs: initial fsverity support Boris Burkov
@ 2021-04-08 22:38   ` kernel test robot
  2021-04-08 22:50   ` Eric Biggers
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2021-04-08 22:38 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team, linux-fscrypt; +Cc: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3172 bytes --]

Hi Boris,

I love your patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on next-20210408]
[cannot apply to v5.12-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-support-fsverity/20210409-023606
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-rhel-8.3-kselftests (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/dd118218fea47389631a62ec533207ba39e69b41
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Boris-Burkov/btrfs-support-fsverity/20210409-023606
        git checkout dd118218fea47389631a62ec533207ba39e69b41
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   fs/btrfs/super.c: In function 'btrfs_fill_super':
>> fs/btrfs/super.c:1368:6: error: 'struct super_block' has no member named 's_vop'; did you mean 's_op'?
    1368 |  sb->s_vop = &btrfs_verityops;
         |      ^~~~~
         |      s_op


vim +1368 fs/btrfs/super.c

  1354	
  1355	static int btrfs_fill_super(struct super_block *sb,
  1356				    struct btrfs_fs_devices *fs_devices,
  1357				    void *data)
  1358	{
  1359		struct inode *inode;
  1360		struct btrfs_fs_info *fs_info = btrfs_sb(sb);
  1361		int err;
  1362	
  1363		sb->s_maxbytes = MAX_LFS_FILESIZE;
  1364		sb->s_magic = BTRFS_SUPER_MAGIC;
  1365		sb->s_op = &btrfs_super_ops;
  1366		sb->s_d_op = &btrfs_dentry_operations;
  1367		sb->s_export_op = &btrfs_export_ops;
> 1368		sb->s_vop = &btrfs_verityops;
  1369		sb->s_xattr = btrfs_xattr_handlers;
  1370		sb->s_time_gran = 1;
  1371	#ifdef CONFIG_BTRFS_FS_POSIX_ACL
  1372		sb->s_flags |= SB_POSIXACL;
  1373	#endif
  1374		sb->s_flags |= SB_I_VERSION;
  1375		sb->s_iflags |= SB_I_CGROUPWB;
  1376	
  1377		err = super_setup_bdi(sb);
  1378		if (err) {
  1379			btrfs_err(fs_info, "super_setup_bdi failed");
  1380			return err;
  1381		}
  1382	
  1383		err = open_ctree(sb, fs_devices, (char *)data);
  1384		if (err) {
  1385			btrfs_err(fs_info, "open_ctree failed");
  1386			return err;
  1387		}
  1388	
  1389		inode = btrfs_iget(sb, BTRFS_FIRST_FREE_OBJECTID, fs_info->fs_root);
  1390		if (IS_ERR(inode)) {
  1391			err = PTR_ERR(inode);
  1392			goto fail_close;
  1393		}
  1394	
  1395		sb->s_root = d_make_root(inode);
  1396		if (!sb->s_root) {
  1397			err = -ENOMEM;
  1398			goto fail_close;
  1399		}
  1400	
  1401		cleancache_init_fs(sb);
  1402		sb->s_flags |= SB_ACTIVE;
  1403		return 0;
  1404	
  1405	fail_close:
  1406		close_ctree(fs_info);
  1407		return err;
  1408	}
  1409	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 41474 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-08 18:33 ` [PATCH v3 2/5] btrfs: initial fsverity support Boris Burkov
  2021-04-08 22:38   ` kernel test robot
@ 2021-04-08 22:50   ` Eric Biggers
  2021-04-09 18:05     ` Boris Burkov
  2021-04-09 22:45     ` Boris Burkov
  2021-04-08 22:56   ` kernel test robot
  2021-04-08 23:19   ` kernel test robot
  3 siblings, 2 replies; 18+ messages in thread
From: Eric Biggers @ 2021-04-08 22:50 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team, linux-fscrypt

On Thu, Apr 08, 2021 at 11:33:53AM -0700, Boris Burkov wrote:
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index f7a4ad86adee..e5282a8f566a 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1339,6 +1339,7 @@ static int btrfs_fill_super(struct super_block *sb,
>  	sb->s_op = &btrfs_super_ops;
>  	sb->s_d_op = &btrfs_dentry_operations;
>  	sb->s_export_op = &btrfs_export_ops;
> +	sb->s_vop = &btrfs_verityops;
>  	sb->s_xattr = btrfs_xattr_handlers;
>  	sb->s_time_gran = 1;

As the kernel test robot has hinted at, this line needs to be conditional on
CONFIG_FS_VERITY.

> +/*
> + * Helper function for computing cache index for Merkle tree pages
> + * @inode: verity file whose Merkle items we want.
> + * @merkle_index: index of the page in the Merkle tree (as in
> + *                read_merkle_tree_page).
> + * @ret_index: returned index in the inode's mapping
> + *
> + * Returns: 0 on success, -EFBIG if the location in the file would be beyond
> + * sb->s_maxbytes.
> + */
> +static int get_verity_mapping_index(struct inode *inode,
> +				    pgoff_t merkle_index,
> +				    pgoff_t *ret_index)
> +{
> +	/*
> +	 * the file is readonly, so i_size can't change here.  We jump
> +	 * some pages past the last page to cache our merkles.  The goal
> +	 * is just to jump past any hugepages that might be mapped in.
> +	 */
> +	pgoff_t merkle_offset = 2048;
> +	u64 index = (i_size_read(inode) >> PAGE_SHIFT) + merkle_offset + merkle_index;

Would it make more sense to align the page index to 2048, rather than adding
2048?  Or are huge pages not necessarily aligned in the page cache?

> +
> +	if (index > inode->i_sb->s_maxbytes >> PAGE_SHIFT)
> +		return -EFBIG;

There's an off-by-one error here; it's considering the beginning of the page
rather than the end of the page.

> +/*
> + * Insert and write inode items with a given key type and offset.
> + * @inode: The inode to insert for.
> + * @key_type: The key type to insert.
> + * @offset: The item offset to insert at.
> + * @src: Source data to write.
> + * @len: Length of source data to write.
> + *
> + * Write len bytes from src into items of up to 1k length.
> + * The inserted items will have key <ino, key_type, offset + off> where
> + * off is consecutively increasing from 0 up to the last item ending at
> + * offset + len.
> + *
> + * Returns 0 on success and a negative error code on failure.
> + */
> +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> +			   const char *src, u64 len)
> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_path *path;
> +	struct btrfs_root *root = inode->root;
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	u64 orig_len = len;
> +	u64 copied = 0;
> +	unsigned long copy_bytes;
> +	unsigned long src_offset = 0;
> +	void *data;
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	while (len > 0) {
> +		trans = btrfs_start_transaction(root, 1);
> +		if (IS_ERR(trans)) {
> +			ret = PTR_ERR(trans);
> +			break;
> +		}
> +
> +		key.objectid = btrfs_ino(inode);
> +		key.offset = offset;
> +		key.type = key_type;
> +
> +		/*
> +		 * insert 1K at a time mostly to be friendly for smaller
> +		 * leaf size filesystems
> +		 */
> +		copy_bytes = min_t(u64, len, 1024);
> +
> +		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
> +		if (ret) {
> +			btrfs_end_transaction(trans);
> +			break;
> +		}
> +
> +		leaf = path->nodes[0];
> +
> +		data = btrfs_item_ptr(leaf, path->slots[0], void);
> +		write_extent_buffer(leaf, src + src_offset,
> +				    (unsigned long)data, copy_bytes);
> +		offset += copy_bytes;
> +		src_offset += copy_bytes;
> +		len -= copy_bytes;
> +		copied += copy_bytes;
> +
> +		btrfs_release_path(path);
> +		btrfs_end_transaction(trans);
> +	}
> +
> +	btrfs_free_path(path);
> +
> +	if (!ret && copied != orig_len)
> +		ret = -EIO;

The condition '!ret && copied != orig_len' at the end appears to be unnecessary,
since this function doesn't do short writes.

> +/*
> + * fsverity op that gets the struct fsverity_descriptor.
> + * fsverity does a two pass setup for reading the descriptor, in the first pass
> + * it calls with buf_size = 0 to query the size of the descriptor,
> + * and then in the second pass it actually reads the descriptor off
> + * disk.
> + */
> +static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
> +				       size_t buf_size)
> +{
> +	size_t true_size;
> +	ssize_t ret = 0;
> +	struct btrfs_verity_descriptor_item item;
> +
> +	memset(&item, 0, sizeof(item));
> +	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
> +			     0, (char *)&item, sizeof(item), NULL);
> +	if (ret < 0)
> +		return ret;
> +
> +	true_size = btrfs_stack_verity_descriptor_size(&item);
> +	if (true_size > INT_MAX)

true_size is a __le64 on-disk, so it technically should be __u64 here; otherwise
its high 32 bits might be ignored.

> +struct btrfs_verity_descriptor_item {
> +	/* size of the verity descriptor in bytes */
> +	__le64 size;
> +	__le64 reserved[2];
> +	__u8 encryption;
> +} __attribute__ ((__packed__));

The 'reserved' field still isn't validated to be 0 before going ahead and using
the descriptor.  Is that still intentional?  If so, it might be clearer to call
this field 'unused'.

- Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-08 18:33 ` [PATCH v3 2/5] btrfs: initial fsverity support Boris Burkov
  2021-04-08 22:38   ` kernel test robot
  2021-04-08 22:50   ` Eric Biggers
@ 2021-04-08 22:56   ` kernel test robot
  2021-04-08 23:19   ` kernel test robot
  3 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2021-04-08 22:56 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team, linux-fscrypt
  Cc: kbuild-all, clang-built-linux

[-- Attachment #1: Type: text/plain, Size: 4534 bytes --]

Hi Boris,

I love your patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on next-20210408]
[cannot apply to v5.12-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-support-fsverity/20210409-023606
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: arm64-randconfig-r021-20210408 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 56ea2e2fdd691136d5e6631fa0e447173694b82c)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/0day-ci/linux/commit/dd118218fea47389631a62ec533207ba39e69b41
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Boris-Burkov/btrfs-support-fsverity/20210409-023606
        git checkout dd118218fea47389631a62ec533207ba39e69b41
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/btrfs/verity.c:432:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
           if (desc != NULL) {
               ^~~~~~~~~~~~
   fs/btrfs/verity.c:468:9: note: uninitialized use occurs here
           return ret;
                  ^~~
   fs/btrfs/verity.c:432:2: note: remove the 'if' if its condition is always true
           if (desc != NULL) {
           ^~~~~~~~~~~~~~~~~~
   fs/btrfs/verity.c:430:9: note: initialize the variable 'ret' to silence this warning
           int ret;
                  ^
                   = 0
   1 warning generated.


vim +432 fs/btrfs/verity.c

   415	
   416	/*
   417	 * fsverity op that ends enabling verity.
   418	 * fsverity calls this when it's done with all of the pages in the file
   419	 * and all of the merkle items have been inserted.  We write the
   420	 * descriptor and update the inode in the btree to reflect its new life
   421	 * as a verity file.
   422	 */
   423	static int btrfs_end_enable_verity(struct file *filp, const void *desc,
   424					  size_t desc_size, u64 merkle_tree_size)
   425	{
   426		struct btrfs_trans_handle *trans;
   427		struct inode *inode = file_inode(filp);
   428		struct btrfs_root *root = BTRFS_I(inode)->root;
   429		struct btrfs_verity_descriptor_item item;
   430		int ret;
   431	
 > 432		if (desc != NULL) {
   433			/* write out the descriptor item */
   434			memset(&item, 0, sizeof(item));
   435			btrfs_set_stack_verity_descriptor_size(&item, desc_size);
   436			ret = write_key_bytes(BTRFS_I(inode),
   437					      BTRFS_VERITY_DESC_ITEM_KEY, 0,
   438					      (const char *)&item, sizeof(item));
   439			if (ret)
   440				goto out;
   441			/* write out the descriptor itself */
   442			ret = write_key_bytes(BTRFS_I(inode),
   443					      BTRFS_VERITY_DESC_ITEM_KEY, 1,
   444					      desc, desc_size);
   445			if (ret)
   446				goto out;
   447	
   448			/* update our inode flags to include fs verity */
   449			trans = btrfs_start_transaction(root, 1);
   450			if (IS_ERR(trans)) {
   451				ret = PTR_ERR(trans);
   452				goto out;
   453			}
   454			BTRFS_I(inode)->compat_flags |= BTRFS_INODE_VERITY;
   455			btrfs_sync_inode_flags_to_i_flags(inode);
   456			ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
   457			btrfs_end_transaction(trans);
   458		}
   459	
   460	out:
   461		if (desc == NULL || ret) {
   462			/* If we failed, drop all the verity items */
   463			drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY);
   464			drop_verity_items(BTRFS_I(inode), BTRFS_VERITY_MERKLE_ITEM_KEY);
   465		} else
   466			btrfs_set_fs_compat_ro(root->fs_info, VERITY);
   467		clear_bit(BTRFS_INODE_VERITY_IN_PROGRESS, &BTRFS_I(inode)->runtime_flags);
   468		return ret;
   469	}
   470	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45592 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-08 18:33 ` [PATCH v3 2/5] btrfs: initial fsverity support Boris Burkov
                     ` (2 preceding siblings ...)
  2021-04-08 22:56   ` kernel test robot
@ 2021-04-08 23:19   ` kernel test robot
  3 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2021-04-08 23:19 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team, linux-fscrypt; +Cc: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3177 bytes --]

Hi Boris,

I love your patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on next-20210408]
[cannot apply to v5.12-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Boris-Burkov/btrfs-support-fsverity/20210409-023606
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-a005-20210408 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/dd118218fea47389631a62ec533207ba39e69b41
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Boris-Burkov/btrfs-support-fsverity/20210409-023606
        git checkout dd118218fea47389631a62ec533207ba39e69b41
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   fs/btrfs/super.c: In function 'btrfs_fill_super':
>> fs/btrfs/super.c:1368:6: error: 'struct super_block' has no member named 's_vop'; did you mean 's_op'?
    1368 |  sb->s_vop = &btrfs_verityops;
         |      ^~~~~
         |      s_op


vim +1368 fs/btrfs/super.c

  1354	
  1355	static int btrfs_fill_super(struct super_block *sb,
  1356				    struct btrfs_fs_devices *fs_devices,
  1357				    void *data)
  1358	{
  1359		struct inode *inode;
  1360		struct btrfs_fs_info *fs_info = btrfs_sb(sb);
  1361		int err;
  1362	
  1363		sb->s_maxbytes = MAX_LFS_FILESIZE;
  1364		sb->s_magic = BTRFS_SUPER_MAGIC;
  1365		sb->s_op = &btrfs_super_ops;
  1366		sb->s_d_op = &btrfs_dentry_operations;
  1367		sb->s_export_op = &btrfs_export_ops;
> 1368		sb->s_vop = &btrfs_verityops;
  1369		sb->s_xattr = btrfs_xattr_handlers;
  1370		sb->s_time_gran = 1;
  1371	#ifdef CONFIG_BTRFS_FS_POSIX_ACL
  1372		sb->s_flags |= SB_POSIXACL;
  1373	#endif
  1374		sb->s_flags |= SB_I_VERSION;
  1375		sb->s_iflags |= SB_I_CGROUPWB;
  1376	
  1377		err = super_setup_bdi(sb);
  1378		if (err) {
  1379			btrfs_err(fs_info, "super_setup_bdi failed");
  1380			return err;
  1381		}
  1382	
  1383		err = open_ctree(sb, fs_devices, (char *)data);
  1384		if (err) {
  1385			btrfs_err(fs_info, "open_ctree failed");
  1386			return err;
  1387		}
  1388	
  1389		inode = btrfs_iget(sb, BTRFS_FIRST_FREE_OBJECTID, fs_info->fs_root);
  1390		if (IS_ERR(inode)) {
  1391			err = PTR_ERR(inode);
  1392			goto fail_close;
  1393		}
  1394	
  1395		sb->s_root = d_make_root(inode);
  1396		if (!sb->s_root) {
  1397			err = -ENOMEM;
  1398			goto fail_close;
  1399		}
  1400	
  1401		cleancache_init_fs(sb);
  1402		sb->s_flags |= SB_ACTIVE;
  1403		return 0;
  1404	
  1405	fail_close:
  1406		close_ctree(fs_info);
  1407		return err;
  1408	}
  1409	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-04-08 18:33 ` [PATCH v3 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
@ 2021-04-08 23:40   ` Anand Jain
  2021-04-09 18:20     ` Boris Burkov
  0 siblings, 1 reply; 18+ messages in thread
From: Anand Jain @ 2021-04-08 23:40 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team, linux-fscrypt

On 09/04/2021 02:33, Boris Burkov wrote:
> The tree checker currently rejects unrecognized flags when it reads
> btrfs_inode_item. Practically, this means that adding a new flag makes
> the change backwards incompatible if the flag is ever set on a file.


> Take up one of the 4 reserved u64 fields in the btrfs_inode_item as a
> new "compat_flags". These flags are zero on inode creation in btrfs and
> mkfs and are ignored by an older kernel, so it should be safe to use
> them in this way.

I don't see an incompt flags check during mount, how does this patch 
will handle if you mount a disk with an older on-disk btrfs_inode_item 
data structure which has no compat_flags?

Why not update the tree checker (need to fix stable kernel as well) and 
inode flags, so that we spare u64 space in the btrfs_inode_item?

Also, I think we need the incompt flags to check during mount.

Thanks, Anand


> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   fs/btrfs/btrfs_inode.h          | 1 +
>   fs/btrfs/ctree.h                | 2 ++
>   fs/btrfs/delayed-inode.c        | 2 ++
>   fs/btrfs/inode.c                | 3 +++
>   fs/btrfs/ioctl.c                | 7 ++++---
>   fs/btrfs/tree-log.c             | 1 +
>   include/uapi/linux/btrfs_tree.h | 7 ++++++-
>   7 files changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index c652e19ad74e..e8dbc8e848ce 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -191,6 +191,7 @@ struct btrfs_inode {
>   
>   	/* flags field from the on disk inode */
>   	u32 flags;
> +	u64 compat_flags;
>   
>   	/*
>   	 * Counters to keep track of the number of extent item's we may use due
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f2fd73e58ee6..d633c563164b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1754,6 +1754,7 @@ BTRFS_SETGET_FUNCS(inode_gid, struct btrfs_inode_item, gid, 32);
>   BTRFS_SETGET_FUNCS(inode_mode, struct btrfs_inode_item, mode, 32);
>   BTRFS_SETGET_FUNCS(inode_rdev, struct btrfs_inode_item, rdev, 64);
>   BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
> +BTRFS_SETGET_FUNCS(inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
>   BTRFS_SETGET_STACK_FUNCS(stack_inode_generation, struct btrfs_inode_item,
>   			 generation, 64);
>   BTRFS_SETGET_STACK_FUNCS(stack_inode_sequence, struct btrfs_inode_item,
> @@ -1771,6 +1772,7 @@ BTRFS_SETGET_STACK_FUNCS(stack_inode_gid, struct btrfs_inode_item, gid, 32);
>   BTRFS_SETGET_STACK_FUNCS(stack_inode_mode, struct btrfs_inode_item, mode, 32);
>   BTRFS_SETGET_STACK_FUNCS(stack_inode_rdev, struct btrfs_inode_item, rdev, 64);
>   BTRFS_SETGET_STACK_FUNCS(stack_inode_flags, struct btrfs_inode_item, flags, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
>   BTRFS_SETGET_FUNCS(timespec_sec, struct btrfs_timespec, sec, 64);
>   BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
>   BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 1a88f6214ebc..ef4e0265dbe3 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1718,6 +1718,7 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
>   	btrfs_set_stack_inode_transid(inode_item, trans->transid);
>   	btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
>   	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
> +	btrfs_set_stack_inode_compat_flags(inode_item, BTRFS_I(inode)->compat_flags);
>   	btrfs_set_stack_inode_block_group(inode_item, 0);
>   
>   	btrfs_set_stack_timespec_sec(&inode_item->atime,
> @@ -1776,6 +1777,7 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
>   	inode->i_rdev = 0;
>   	*rdev = btrfs_stack_inode_rdev(inode_item);
>   	BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
> +	BTRFS_I(inode)->compat_flags = btrfs_stack_inode_compat_flags(inode_item);
>   
>   	inode->i_atime.tv_sec = btrfs_stack_timespec_sec(&inode_item->atime);
>   	inode->i_atime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->atime);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1e0e20ad25e4..3aa96ec27045 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3627,6 +3627,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
>   
>   	BTRFS_I(inode)->index_cnt = (u64)-1;
>   	BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
> +	BTRFS_I(inode)->compat_flags = btrfs_inode_compat_flags(leaf, inode_item);
>   
>   cache_index:
>   	/*
> @@ -3793,6 +3794,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
>   	btrfs_set_token_inode_transid(&token, item, trans->transid);
>   	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
>   	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
> +	btrfs_set_token_inode_compat_flags(&token, item, BTRFS_I(inode)->compat_flags);
>   	btrfs_set_token_inode_block_group(&token, item, 0);
>   }
>   
> @@ -8859,6 +8861,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>   	ei->defrag_bytes = 0;
>   	ei->disk_i_size = 0;
>   	ei->flags = 0;
> +	ei->compat_flags = 0;
>   	ei->csum_bytes = 0;
>   	ei->index_cnt = (u64)-1;
>   	ei->dir_index = 0;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 3415a9f06c81..2c9cbd2642b1 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -102,8 +102,9 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
>    * Export internal inode flags to the format expected by the FS_IOC_GETFLAGS
>    * ioctl.
>    */
> -static unsigned int btrfs_inode_flags_to_fsflags(unsigned int flags)
> +static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
>   {
> +	unsigned int flags = binode->flags;
>   	unsigned int iflags = 0;
>   
>   	if (flags & BTRFS_INODE_SYNC)
> @@ -156,7 +157,7 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
>   static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
>   {
>   	struct btrfs_inode *binode = BTRFS_I(file_inode(file));
> -	unsigned int flags = btrfs_inode_flags_to_fsflags(binode->flags);
> +	unsigned int flags = btrfs_inode_flags_to_fsflags(binode);
>   
>   	if (copy_to_user(arg, &flags, sizeof(flags)))
>   		return -EFAULT;
> @@ -228,7 +229,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
>   
>   	btrfs_inode_lock(inode, 0);
>   	fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
> -	old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
> +	old_fsflags = btrfs_inode_flags_to_fsflags(binode);
>   
>   	ret = vfs_ioc_setflags_prepare(inode, old_fsflags, fsflags);
>   	if (ret)
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 72c4b66ed516..fed638f995ba 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3944,6 +3944,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
>   	btrfs_set_token_inode_transid(&token, item, trans->transid);
>   	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
>   	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
> +	btrfs_set_token_inode_compat_flags(&token, item, BTRFS_I(inode)->compat_flags);
>   	btrfs_set_token_inode_block_group(&token, item, 0);
>   }
>   
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 58d7cff9afb1..ae25280316bd 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -574,11 +574,16 @@ struct btrfs_inode_item {
>   	/* modification sequence number for NFS */
>   	__le64 sequence;
>   
> +	/*
> +	 * flags which aren't checked for corruption at mount
> +	 * and can be added in a backwards compatible way
> +	 */
> +	__le64 compat_flags;
>   	/*
>   	 * a little future expansion, for more than this we can
>   	 * just grow the inode item and version it
>   	 */
> -	__le64 reserved[4];
> +	__le64 reserved[3];
>   	struct btrfs_timespec atime;
>   	struct btrfs_timespec ctime;
>   	struct btrfs_timespec mtime;
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-08 22:50   ` Eric Biggers
@ 2021-04-09 18:05     ` Boris Burkov
  2021-04-09 23:25       ` Eric Biggers
  2021-04-09 22:45     ` Boris Burkov
  1 sibling, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2021-04-09 18:05 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-btrfs, kernel-team, linux-fscrypt

On Thu, Apr 08, 2021 at 03:50:08PM -0700, Eric Biggers wrote:
> On Thu, Apr 08, 2021 at 11:33:53AM -0700, Boris Burkov wrote:
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index f7a4ad86adee..e5282a8f566a 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -1339,6 +1339,7 @@ static int btrfs_fill_super(struct super_block *sb,
> >  	sb->s_op = &btrfs_super_ops;
> >  	sb->s_d_op = &btrfs_dentry_operations;
> >  	sb->s_export_op = &btrfs_export_ops;
> > +	sb->s_vop = &btrfs_verityops;
> >  	sb->s_xattr = btrfs_xattr_handlers;
> >  	sb->s_time_gran = 1;
> 
> As the kernel test robot has hinted at, this line needs to be conditional on
> CONFIG_FS_VERITY.
> 
> > +/*
> > + * Helper function for computing cache index for Merkle tree pages
> > + * @inode: verity file whose Merkle items we want.
> > + * @merkle_index: index of the page in the Merkle tree (as in
> > + *                read_merkle_tree_page).
> > + * @ret_index: returned index in the inode's mapping
> > + *
> > + * Returns: 0 on success, -EFBIG if the location in the file would be beyond
> > + * sb->s_maxbytes.
> > + */
> > +static int get_verity_mapping_index(struct inode *inode,
> > +				    pgoff_t merkle_index,
> > +				    pgoff_t *ret_index)
> > +{
> > +	/*
> > +	 * the file is readonly, so i_size can't change here.  We jump
> > +	 * some pages past the last page to cache our merkles.  The goal
> > +	 * is just to jump past any hugepages that might be mapped in.
> > +	 */
> > +	pgoff_t merkle_offset = 2048;
> > +	u64 index = (i_size_read(inode) >> PAGE_SHIFT) + merkle_offset + merkle_index;
> 
> Would it make more sense to align the page index to 2048, rather than adding
> 2048?  Or are huge pages not necessarily aligned in the page cache?
> 
> > +
> > +	if (index > inode->i_sb->s_maxbytes >> PAGE_SHIFT)
> > +		return -EFBIG;
> 
> There's an off-by-one error here; it's considering the beginning of the page
> rather than the end of the page.
> 
> > +/*
> > + * Insert and write inode items with a given key type and offset.
> > + * @inode: The inode to insert for.
> > + * @key_type: The key type to insert.
> > + * @offset: The item offset to insert at.
> > + * @src: Source data to write.
> > + * @len: Length of source data to write.
> > + *
> > + * Write len bytes from src into items of up to 1k length.
> > + * The inserted items will have key <ino, key_type, offset + off> where
> > + * off is consecutively increasing from 0 up to the last item ending at
> > + * offset + len.
> > + *
> > + * Returns 0 on success and a negative error code on failure.
> > + */
> > +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> > +			   const char *src, u64 len)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct btrfs_path *path;
> > +	struct btrfs_root *root = inode->root;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_key key;
> > +	u64 orig_len = len;
> > +	u64 copied = 0;
> > +	unsigned long copy_bytes;
> > +	unsigned long src_offset = 0;
> > +	void *data;
> > +	int ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path)
> > +		return -ENOMEM;
> > +
> > +	while (len > 0) {
> > +		trans = btrfs_start_transaction(root, 1);
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			break;
> > +		}
> > +
> > +		key.objectid = btrfs_ino(inode);
> > +		key.offset = offset;
> > +		key.type = key_type;
> > +
> > +		/*
> > +		 * insert 1K at a time mostly to be friendly for smaller
> > +		 * leaf size filesystems
> > +		 */
> > +		copy_bytes = min_t(u64, len, 1024);
> > +
> > +		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
> > +		if (ret) {
> > +			btrfs_end_transaction(trans);
> > +			break;
> > +		}
> > +
> > +		leaf = path->nodes[0];
> > +
> > +		data = btrfs_item_ptr(leaf, path->slots[0], void);
> > +		write_extent_buffer(leaf, src + src_offset,
> > +				    (unsigned long)data, copy_bytes);
> > +		offset += copy_bytes;
> > +		src_offset += copy_bytes;
> > +		len -= copy_bytes;
> > +		copied += copy_bytes;
> > +
> > +		btrfs_release_path(path);
> > +		btrfs_end_transaction(trans);
> > +	}
> > +
> > +	btrfs_free_path(path);
> > +
> > +	if (!ret && copied != orig_len)
> > +		ret = -EIO;
> 
> The condition '!ret && copied != orig_len' at the end appears to be unnecessary,
> since this function doesn't do short writes.
> 
> > +/*
> > + * fsverity op that gets the struct fsverity_descriptor.
> > + * fsverity does a two pass setup for reading the descriptor, in the first pass
> > + * it calls with buf_size = 0 to query the size of the descriptor,
> > + * and then in the second pass it actually reads the descriptor off
> > + * disk.
> > + */
> > +static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
> > +				       size_t buf_size)
> > +{
> > +	size_t true_size;
> > +	ssize_t ret = 0;
> > +	struct btrfs_verity_descriptor_item item;
> > +
> > +	memset(&item, 0, sizeof(item));
> > +	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
> > +			     0, (char *)&item, sizeof(item), NULL);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	true_size = btrfs_stack_verity_descriptor_size(&item);
> > +	if (true_size > INT_MAX)
> 
> true_size is a __le64 on-disk, so it technically should be __u64 here; otherwise
> its high 32 bits might be ignored.
> 
> > +struct btrfs_verity_descriptor_item {
> > +	/* size of the verity descriptor in bytes */
> > +	__le64 size;
> > +	__le64 reserved[2];
> > +	__u8 encryption;
> > +} __attribute__ ((__packed__));
> 
> The 'reserved' field still isn't validated to be 0 before going ahead and using
> the descriptor.  Is that still intentional?  If so, it might be clearer to call
> this field 'unused'.
> 

I should have asked about this last time, sorry I didn't get around to
it. I'm not familiar with the implied semantics of something called
reserved vs unused, so if you wouldn't mind explaining that a bit more,
I would appreciate it.

Thinking out loud, and apologies in advance that this is a bit naive:
Whether or not I validate it in kernel K1 will affect two things: not
accidentally putting junk in s.t. it is hard for K2 to properly use the
field, and it ensures that K1 can never understand the file if K2 uses
the field and we roll back to K1. Is that correct? 100% committing to
the latter seems like a negative, since I'm not sure the future use
can't be compatible. The validation against junk is nice, but I
personally don't know how critical it is. Currently, it feels sufficient
to always zero the item before filling it out and writing it to disk.

> - Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-04-08 23:40   ` Anand Jain
@ 2021-04-09 18:20     ` Boris Burkov
  2021-05-13 17:21       ` David Sterba
  0 siblings, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2021-04-09 18:20 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, kernel-team, linux-fscrypt

On Fri, Apr 09, 2021 at 07:40:44AM +0800, Anand Jain wrote:
> On 09/04/2021 02:33, Boris Burkov wrote:
> > The tree checker currently rejects unrecognized flags when it reads
> > btrfs_inode_item. Practically, this means that adding a new flag makes
> > the change backwards incompatible if the flag is ever set on a file.
> 
> 
> > Take up one of the 4 reserved u64 fields in the btrfs_inode_item as a
> > new "compat_flags". These flags are zero on inode creation in btrfs and
> > mkfs and are ignored by an older kernel, so it should be safe to use
> > them in this way.
> 
> I don't see an incompt flags check during mount, how does this patch will
> handle if you mount a disk with an older on-disk btrfs_inode_item data
> structure which has no compat_flags?

I'm referring to check_inode_item in fs/btrfs/tree-checker.c
Specificall, the last check it does that fails with the string: "unknown
flags detected".

This patch ignores the new compat_flags from a checking perspective, so
it won't complain no matter what an old on-disk format put there. As far
as I can tell from inspecting the code and mkfs, it should be zero,
though I suppose it's possible an older version of btrfs did not zero
it. I think the worst case would be if it weren't zero and we
incorrectly interpreted the file as having a compat_flags flag set.

> 
> Why not update the tree checker (need to fix stable kernel as well) and
> inode flags, so that we spare u64 space in the btrfs_inode_item?

I don't understand this suggestion, could you be more specific?

> 
> Also, I think we need the incompt flags to check during mount.

Same for this one, sorry. What do you think I should check?

> 
> Thanks, Anand

Thank you for the review,
Boris
> 
> 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >   fs/btrfs/btrfs_inode.h          | 1 +
> >   fs/btrfs/ctree.h                | 2 ++
> >   fs/btrfs/delayed-inode.c        | 2 ++
> >   fs/btrfs/inode.c                | 3 +++
> >   fs/btrfs/ioctl.c                | 7 ++++---
> >   fs/btrfs/tree-log.c             | 1 +
> >   include/uapi/linux/btrfs_tree.h | 7 ++++++-
> >   7 files changed, 19 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index c652e19ad74e..e8dbc8e848ce 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -191,6 +191,7 @@ struct btrfs_inode {
> >   	/* flags field from the on disk inode */
> >   	u32 flags;
> > +	u64 compat_flags;
> >   	/*
> >   	 * Counters to keep track of the number of extent item's we may use due
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index f2fd73e58ee6..d633c563164b 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1754,6 +1754,7 @@ BTRFS_SETGET_FUNCS(inode_gid, struct btrfs_inode_item, gid, 32);
> >   BTRFS_SETGET_FUNCS(inode_mode, struct btrfs_inode_item, mode, 32);
> >   BTRFS_SETGET_FUNCS(inode_rdev, struct btrfs_inode_item, rdev, 64);
> >   BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
> > +BTRFS_SETGET_FUNCS(inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
> >   BTRFS_SETGET_STACK_FUNCS(stack_inode_generation, struct btrfs_inode_item,
> >   			 generation, 64);
> >   BTRFS_SETGET_STACK_FUNCS(stack_inode_sequence, struct btrfs_inode_item,
> > @@ -1771,6 +1772,7 @@ BTRFS_SETGET_STACK_FUNCS(stack_inode_gid, struct btrfs_inode_item, gid, 32);
> >   BTRFS_SETGET_STACK_FUNCS(stack_inode_mode, struct btrfs_inode_item, mode, 32);
> >   BTRFS_SETGET_STACK_FUNCS(stack_inode_rdev, struct btrfs_inode_item, rdev, 64);
> >   BTRFS_SETGET_STACK_FUNCS(stack_inode_flags, struct btrfs_inode_item, flags, 64);
> > +BTRFS_SETGET_STACK_FUNCS(stack_inode_compat_flags, struct btrfs_inode_item, compat_flags, 64);
> >   BTRFS_SETGET_FUNCS(timespec_sec, struct btrfs_timespec, sec, 64);
> >   BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
> >   BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
> > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> > index 1a88f6214ebc..ef4e0265dbe3 100644
> > --- a/fs/btrfs/delayed-inode.c
> > +++ b/fs/btrfs/delayed-inode.c
> > @@ -1718,6 +1718,7 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
> >   	btrfs_set_stack_inode_transid(inode_item, trans->transid);
> >   	btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
> >   	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
> > +	btrfs_set_stack_inode_compat_flags(inode_item, BTRFS_I(inode)->compat_flags);
> >   	btrfs_set_stack_inode_block_group(inode_item, 0);
> >   	btrfs_set_stack_timespec_sec(&inode_item->atime,
> > @@ -1776,6 +1777,7 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
> >   	inode->i_rdev = 0;
> >   	*rdev = btrfs_stack_inode_rdev(inode_item);
> >   	BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
> > +	BTRFS_I(inode)->compat_flags = btrfs_stack_inode_compat_flags(inode_item);
> >   	inode->i_atime.tv_sec = btrfs_stack_timespec_sec(&inode_item->atime);
> >   	inode->i_atime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->atime);
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 1e0e20ad25e4..3aa96ec27045 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -3627,6 +3627,7 @@ static int btrfs_read_locked_inode(struct inode *inode,
> >   	BTRFS_I(inode)->index_cnt = (u64)-1;
> >   	BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
> > +	BTRFS_I(inode)->compat_flags = btrfs_inode_compat_flags(leaf, inode_item);
> >   cache_index:
> >   	/*
> > @@ -3793,6 +3794,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
> >   	btrfs_set_token_inode_transid(&token, item, trans->transid);
> >   	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
> >   	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
> > +	btrfs_set_token_inode_compat_flags(&token, item, BTRFS_I(inode)->compat_flags);
> >   	btrfs_set_token_inode_block_group(&token, item, 0);
> >   }
> > @@ -8859,6 +8861,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
> >   	ei->defrag_bytes = 0;
> >   	ei->disk_i_size = 0;
> >   	ei->flags = 0;
> > +	ei->compat_flags = 0;
> >   	ei->csum_bytes = 0;
> >   	ei->index_cnt = (u64)-1;
> >   	ei->dir_index = 0;
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 3415a9f06c81..2c9cbd2642b1 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -102,8 +102,9 @@ static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
> >    * Export internal inode flags to the format expected by the FS_IOC_GETFLAGS
> >    * ioctl.
> >    */
> > -static unsigned int btrfs_inode_flags_to_fsflags(unsigned int flags)
> > +static unsigned int btrfs_inode_flags_to_fsflags(struct btrfs_inode *binode)
> >   {
> > +	unsigned int flags = binode->flags;
> >   	unsigned int iflags = 0;
> >   	if (flags & BTRFS_INODE_SYNC)
> > @@ -156,7 +157,7 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
> >   static int btrfs_ioctl_getflags(struct file *file, void __user *arg)
> >   {
> >   	struct btrfs_inode *binode = BTRFS_I(file_inode(file));
> > -	unsigned int flags = btrfs_inode_flags_to_fsflags(binode->flags);
> > +	unsigned int flags = btrfs_inode_flags_to_fsflags(binode);
> >   	if (copy_to_user(arg, &flags, sizeof(flags)))
> >   		return -EFAULT;
> > @@ -228,7 +229,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
> >   	btrfs_inode_lock(inode, 0);
> >   	fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
> > -	old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
> > +	old_fsflags = btrfs_inode_flags_to_fsflags(binode);
> >   	ret = vfs_ioc_setflags_prepare(inode, old_fsflags, fsflags);
> >   	if (ret)
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index 72c4b66ed516..fed638f995ba 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -3944,6 +3944,7 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
> >   	btrfs_set_token_inode_transid(&token, item, trans->transid);
> >   	btrfs_set_token_inode_rdev(&token, item, inode->i_rdev);
> >   	btrfs_set_token_inode_flags(&token, item, BTRFS_I(inode)->flags);
> > +	btrfs_set_token_inode_compat_flags(&token, item, BTRFS_I(inode)->compat_flags);
> >   	btrfs_set_token_inode_block_group(&token, item, 0);
> >   }
> > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> > index 58d7cff9afb1..ae25280316bd 100644
> > --- a/include/uapi/linux/btrfs_tree.h
> > +++ b/include/uapi/linux/btrfs_tree.h
> > @@ -574,11 +574,16 @@ struct btrfs_inode_item {
> >   	/* modification sequence number for NFS */
> >   	__le64 sequence;
> > +	/*
> > +	 * flags which aren't checked for corruption at mount
> > +	 * and can be added in a backwards compatible way
> > +	 */
> > +	__le64 compat_flags;
> >   	/*
> >   	 * a little future expansion, for more than this we can
> >   	 * just grow the inode item and version it
> >   	 */
> > -	__le64 reserved[4];
> > +	__le64 reserved[3];
> >   	struct btrfs_timespec atime;
> >   	struct btrfs_timespec ctime;
> >   	struct btrfs_timespec mtime;
> > 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-08 22:50   ` Eric Biggers
  2021-04-09 18:05     ` Boris Burkov
@ 2021-04-09 22:45     ` Boris Burkov
  2021-04-09 23:32       ` Eric Biggers
  1 sibling, 1 reply; 18+ messages in thread
From: Boris Burkov @ 2021-04-09 22:45 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-btrfs, kernel-team, linux-fscrypt

On Thu, Apr 08, 2021 at 03:50:08PM -0700, Eric Biggers wrote:
> On Thu, Apr 08, 2021 at 11:33:53AM -0700, Boris Burkov wrote:
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index f7a4ad86adee..e5282a8f566a 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -1339,6 +1339,7 @@ static int btrfs_fill_super(struct super_block *sb,
> >  	sb->s_op = &btrfs_super_ops;
> >  	sb->s_d_op = &btrfs_dentry_operations;
> >  	sb->s_export_op = &btrfs_export_ops;
> > +	sb->s_vop = &btrfs_verityops;
> >  	sb->s_xattr = btrfs_xattr_handlers;
> >  	sb->s_time_gran = 1;
> 
> As the kernel test robot has hinted at, this line needs to be conditional on
> CONFIG_FS_VERITY.
> 
> > +/*
> > + * Helper function for computing cache index for Merkle tree pages
> > + * @inode: verity file whose Merkle items we want.
> > + * @merkle_index: index of the page in the Merkle tree (as in
> > + *                read_merkle_tree_page).
> > + * @ret_index: returned index in the inode's mapping
> > + *
> > + * Returns: 0 on success, -EFBIG if the location in the file would be beyond
> > + * sb->s_maxbytes.
> > + */
> > +static int get_verity_mapping_index(struct inode *inode,
> > +				    pgoff_t merkle_index,
> > +				    pgoff_t *ret_index)
> > +{
> > +	/*
> > +	 * the file is readonly, so i_size can't change here.  We jump
> > +	 * some pages past the last page to cache our merkles.  The goal
> > +	 * is just to jump past any hugepages that might be mapped in.
> > +	 */
> > +	pgoff_t merkle_offset = 2048;
> > +	u64 index = (i_size_read(inode) >> PAGE_SHIFT) + merkle_offset + merkle_index;
> 
> Would it make more sense to align the page index to 2048, rather than adding
> 2048?  Or are huge pages not necessarily aligned in the page cache?
> 

What advantages are there to aligning? I don't have any objection to
doing it besides keeping things as simple as possible.

> > +
> > +	if (index > inode->i_sb->s_maxbytes >> PAGE_SHIFT)
> > +		return -EFBIG;
> 
> There's an off-by-one error here; it's considering the beginning of the page
> rather than the end of the page.
> 

I can't see the error myself, yet..

read_merkle_tree_page is what interacts with the page cache and does it
with read_key_bytes in PAGE_SIZE chunks. So if index == maxbytes >>
PAGE_SHIFT, I take that to mean we match on the start of the last
possible page, which seems fine to read in all of. The next index will
fail.

I think the weird thing is I called get_verity_merkle_index to
write_merkle_tree_block. It doesn't do much there since we aren't
affecting the page cache till we read.

As far as I can see, to make the btrfs implementation behave as
similarly as possible to ext4, it should either interact with the page
cache on the write path, or if that is undesirable (haven't thought it
through carefully yet), it should accurately fail writes with EFBIG that
would later fail as reads.

> > +/*
> > + * Insert and write inode items with a given key type and offset.
> > + * @inode: The inode to insert for.
> > + * @key_type: The key type to insert.
> > + * @offset: The item offset to insert at.
> > + * @src: Source data to write.
> > + * @len: Length of source data to write.
> > + *
> > + * Write len bytes from src into items of up to 1k length.
> > + * The inserted items will have key <ino, key_type, offset + off> where
> > + * off is consecutively increasing from 0 up to the last item ending at
> > + * offset + len.
> > + *
> > + * Returns 0 on success and a negative error code on failure.
> > + */
> > +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> > +			   const char *src, u64 len)
> > +{
> > +	struct btrfs_trans_handle *trans;
> > +	struct btrfs_path *path;
> > +	struct btrfs_root *root = inode->root;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_key key;
> > +	u64 orig_len = len;
> > +	u64 copied = 0;
> > +	unsigned long copy_bytes;
> > +	unsigned long src_offset = 0;
> > +	void *data;
> > +	int ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path)
> > +		return -ENOMEM;
> > +
> > +	while (len > 0) {
> > +		trans = btrfs_start_transaction(root, 1);
> > +		if (IS_ERR(trans)) {
> > +			ret = PTR_ERR(trans);
> > +			break;
> > +		}
> > +
> > +		key.objectid = btrfs_ino(inode);
> > +		key.offset = offset;
> > +		key.type = key_type;
> > +
> > +		/*
> > +		 * insert 1K at a time mostly to be friendly for smaller
> > +		 * leaf size filesystems
> > +		 */
> > +		copy_bytes = min_t(u64, len, 1024);
> > +
> > +		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
> > +		if (ret) {
> > +			btrfs_end_transaction(trans);
> > +			break;
> > +		}
> > +
> > +		leaf = path->nodes[0];
> > +
> > +		data = btrfs_item_ptr(leaf, path->slots[0], void);
> > +		write_extent_buffer(leaf, src + src_offset,
> > +				    (unsigned long)data, copy_bytes);
> > +		offset += copy_bytes;
> > +		src_offset += copy_bytes;
> > +		len -= copy_bytes;
> > +		copied += copy_bytes;
> > +
> > +		btrfs_release_path(path);
> > +		btrfs_end_transaction(trans);
> > +	}
> > +
> > +	btrfs_free_path(path);
> > +
> > +	if (!ret && copied != orig_len)
> > +		ret = -EIO;
> 
> The condition '!ret && copied != orig_len' at the end appears to be unnecessary,
> since this function doesn't do short writes.
> 
> > +/*
> > + * fsverity op that gets the struct fsverity_descriptor.
> > + * fsverity does a two pass setup for reading the descriptor, in the first pass
> > + * it calls with buf_size = 0 to query the size of the descriptor,
> > + * and then in the second pass it actually reads the descriptor off
> > + * disk.
> > + */
> > +static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
> > +				       size_t buf_size)
> > +{
> > +	size_t true_size;
> > +	ssize_t ret = 0;
> > +	struct btrfs_verity_descriptor_item item;
> > +
> > +	memset(&item, 0, sizeof(item));
> > +	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
> > +			     0, (char *)&item, sizeof(item), NULL);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	true_size = btrfs_stack_verity_descriptor_size(&item);
> > +	if (true_size > INT_MAX)
> 
> true_size is a __le64 on-disk, so it technically should be __u64 here; otherwise
> its high 32 bits might be ignored.
> 
> > +struct btrfs_verity_descriptor_item {
> > +	/* size of the verity descriptor in bytes */
> > +	__le64 size;
> > +	__le64 reserved[2];
> > +	__u8 encryption;
> > +} __attribute__ ((__packed__));
> 
> The 'reserved' field still isn't validated to be 0 before going ahead and using
> the descriptor.  Is that still intentional?  If so, it might be clearer to call
> this field 'unused'.
> 
> - Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-09 18:05     ` Boris Burkov
@ 2021-04-09 23:25       ` Eric Biggers
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Biggers @ 2021-04-09 23:25 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team, linux-fscrypt

On Fri, Apr 09, 2021 at 11:05:05AM -0700, Boris Burkov wrote:
> On Thu, Apr 08, 2021 at 03:50:08PM -0700, Eric Biggers wrote:
> > On Thu, Apr 08, 2021 at 11:33:53AM -0700, Boris Burkov wrote:
> > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > > index f7a4ad86adee..e5282a8f566a 100644
> > > --- a/fs/btrfs/super.c
> > > +++ b/fs/btrfs/super.c
> > > @@ -1339,6 +1339,7 @@ static int btrfs_fill_super(struct super_block *sb,
> > >  	sb->s_op = &btrfs_super_ops;
> > >  	sb->s_d_op = &btrfs_dentry_operations;
> > >  	sb->s_export_op = &btrfs_export_ops;
> > > +	sb->s_vop = &btrfs_verityops;
> > >  	sb->s_xattr = btrfs_xattr_handlers;
> > >  	sb->s_time_gran = 1;
> > 
> > As the kernel test robot has hinted at, this line needs to be conditional on
> > CONFIG_FS_VERITY.
> > 
> > > +/*
> > > + * Helper function for computing cache index for Merkle tree pages
> > > + * @inode: verity file whose Merkle items we want.
> > > + * @merkle_index: index of the page in the Merkle tree (as in
> > > + *                read_merkle_tree_page).
> > > + * @ret_index: returned index in the inode's mapping
> > > + *
> > > + * Returns: 0 on success, -EFBIG if the location in the file would be beyond
> > > + * sb->s_maxbytes.
> > > + */
> > > +static int get_verity_mapping_index(struct inode *inode,
> > > +				    pgoff_t merkle_index,
> > > +				    pgoff_t *ret_index)
> > > +{
> > > +	/*
> > > +	 * the file is readonly, so i_size can't change here.  We jump
> > > +	 * some pages past the last page to cache our merkles.  The goal
> > > +	 * is just to jump past any hugepages that might be mapped in.
> > > +	 */
> > > +	pgoff_t merkle_offset = 2048;
> > > +	u64 index = (i_size_read(inode) >> PAGE_SHIFT) + merkle_offset + merkle_index;
> > 
> > Would it make more sense to align the page index to 2048, rather than adding
> > 2048?  Or are huge pages not necessarily aligned in the page cache?
> > 
> > > +
> > > +	if (index > inode->i_sb->s_maxbytes >> PAGE_SHIFT)
> > > +		return -EFBIG;
> > 
> > There's an off-by-one error here; it's considering the beginning of the page
> > rather than the end of the page.
> > 
> > > +/*
> > > + * Insert and write inode items with a given key type and offset.
> > > + * @inode: The inode to insert for.
> > > + * @key_type: The key type to insert.
> > > + * @offset: The item offset to insert at.
> > > + * @src: Source data to write.
> > > + * @len: Length of source data to write.
> > > + *
> > > + * Write len bytes from src into items of up to 1k length.
> > > + * The inserted items will have key <ino, key_type, offset + off> where
> > > + * off is consecutively increasing from 0 up to the last item ending at
> > > + * offset + len.
> > > + *
> > > + * Returns 0 on success and a negative error code on failure.
> > > + */
> > > +static int write_key_bytes(struct btrfs_inode *inode, u8 key_type, u64 offset,
> > > +			   const char *src, u64 len)
> > > +{
> > > +	struct btrfs_trans_handle *trans;
> > > +	struct btrfs_path *path;
> > > +	struct btrfs_root *root = inode->root;
> > > +	struct extent_buffer *leaf;
> > > +	struct btrfs_key key;
> > > +	u64 orig_len = len;
> > > +	u64 copied = 0;
> > > +	unsigned long copy_bytes;
> > > +	unsigned long src_offset = 0;
> > > +	void *data;
> > > +	int ret;
> > > +
> > > +	path = btrfs_alloc_path();
> > > +	if (!path)
> > > +		return -ENOMEM;
> > > +
> > > +	while (len > 0) {
> > > +		trans = btrfs_start_transaction(root, 1);
> > > +		if (IS_ERR(trans)) {
> > > +			ret = PTR_ERR(trans);
> > > +			break;
> > > +		}
> > > +
> > > +		key.objectid = btrfs_ino(inode);
> > > +		key.offset = offset;
> > > +		key.type = key_type;
> > > +
> > > +		/*
> > > +		 * insert 1K at a time mostly to be friendly for smaller
> > > +		 * leaf size filesystems
> > > +		 */
> > > +		copy_bytes = min_t(u64, len, 1024);
> > > +
> > > +		ret = btrfs_insert_empty_item(trans, root, path, &key, copy_bytes);
> > > +		if (ret) {
> > > +			btrfs_end_transaction(trans);
> > > +			break;
> > > +		}
> > > +
> > > +		leaf = path->nodes[0];
> > > +
> > > +		data = btrfs_item_ptr(leaf, path->slots[0], void);
> > > +		write_extent_buffer(leaf, src + src_offset,
> > > +				    (unsigned long)data, copy_bytes);
> > > +		offset += copy_bytes;
> > > +		src_offset += copy_bytes;
> > > +		len -= copy_bytes;
> > > +		copied += copy_bytes;
> > > +
> > > +		btrfs_release_path(path);
> > > +		btrfs_end_transaction(trans);
> > > +	}
> > > +
> > > +	btrfs_free_path(path);
> > > +
> > > +	if (!ret && copied != orig_len)
> > > +		ret = -EIO;
> > 
> > The condition '!ret && copied != orig_len' at the end appears to be unnecessary,
> > since this function doesn't do short writes.
> > 
> > > +/*
> > > + * fsverity op that gets the struct fsverity_descriptor.
> > > + * fsverity does a two pass setup for reading the descriptor, in the first pass
> > > + * it calls with buf_size = 0 to query the size of the descriptor,
> > > + * and then in the second pass it actually reads the descriptor off
> > > + * disk.
> > > + */
> > > +static int btrfs_get_verity_descriptor(struct inode *inode, void *buf,
> > > +				       size_t buf_size)
> > > +{
> > > +	size_t true_size;
> > > +	ssize_t ret = 0;
> > > +	struct btrfs_verity_descriptor_item item;
> > > +
> > > +	memset(&item, 0, sizeof(item));
> > > +	ret = read_key_bytes(BTRFS_I(inode), BTRFS_VERITY_DESC_ITEM_KEY,
> > > +			     0, (char *)&item, sizeof(item), NULL);
> > > +	if (ret < 0)
> > > +		return ret;
> > > +
> > > +	true_size = btrfs_stack_verity_descriptor_size(&item);
> > > +	if (true_size > INT_MAX)
> > 
> > true_size is a __le64 on-disk, so it technically should be __u64 here; otherwise
> > its high 32 bits might be ignored.
> > 
> > > +struct btrfs_verity_descriptor_item {
> > > +	/* size of the verity descriptor in bytes */
> > > +	__le64 size;
> > > +	__le64 reserved[2];
> > > +	__u8 encryption;
> > > +} __attribute__ ((__packed__));
> > 
> > The 'reserved' field still isn't validated to be 0 before going ahead and using
> > the descriptor.  Is that still intentional?  If so, it might be clearer to call
> > this field 'unused'.
> > 
> 
> I should have asked about this last time, sorry I didn't get around to
> it. I'm not familiar with the implied semantics of something called
> reserved vs unused, so if you wouldn't mind explaining that a bit more,
> I would appreciate it.

"reserved" normally means a field is reserved for future use, so unrecognized
values result in an error (like an "incompat" flag).  "unused" usually means the
field is simply not used (like a "compat" flag).

> 
> Thinking out loud, and apologies in advance that this is a bit naive:
> Whether or not I validate it in kernel K1 will affect two things: not
> accidentally putting junk in s.t. it is hard for K2 to properly use the
> field, and it ensures that K1 can never understand the file if K2 uses
> the field and we roll back to K1. Is that correct? 100% committing to
> the latter seems like a negative, since I'm not sure the future use
> can't be compatible. The validation against junk is nice, but I
> personally don't know how critical it is. Currently, it feels sufficient
> to always zero the item before filling it out and writing it to disk.

Which you want depends on what sort of thing you are hoping to use this field
for in the future.

Usually failing to validate flags/fields causes more problems than not
validating them, though.

- Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-09 22:45     ` Boris Burkov
@ 2021-04-09 23:32       ` Eric Biggers
  2021-05-03 18:46         ` Boris Burkov
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Biggers @ 2021-04-09 23:32 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team, linux-fscrypt

On Fri, Apr 09, 2021 at 03:45:17PM -0700, Boris Burkov wrote:
> On Thu, Apr 08, 2021 at 03:50:08PM -0700, Eric Biggers wrote:
> > On Thu, Apr 08, 2021 at 11:33:53AM -0700, Boris Burkov wrote:
> > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > > index f7a4ad86adee..e5282a8f566a 100644
> > > --- a/fs/btrfs/super.c
> > > +++ b/fs/btrfs/super.c
> > > @@ -1339,6 +1339,7 @@ static int btrfs_fill_super(struct super_block *sb,
> > >  	sb->s_op = &btrfs_super_ops;
> > >  	sb->s_d_op = &btrfs_dentry_operations;
> > >  	sb->s_export_op = &btrfs_export_ops;
> > > +	sb->s_vop = &btrfs_verityops;
> > >  	sb->s_xattr = btrfs_xattr_handlers;
> > >  	sb->s_time_gran = 1;
> > 
> > As the kernel test robot has hinted at, this line needs to be conditional on
> > CONFIG_FS_VERITY.
> > 
> > > +/*
> > > + * Helper function for computing cache index for Merkle tree pages
> > > + * @inode: verity file whose Merkle items we want.
> > > + * @merkle_index: index of the page in the Merkle tree (as in
> > > + *                read_merkle_tree_page).
> > > + * @ret_index: returned index in the inode's mapping
> > > + *
> > > + * Returns: 0 on success, -EFBIG if the location in the file would be beyond
> > > + * sb->s_maxbytes.
> > > + */
> > > +static int get_verity_mapping_index(struct inode *inode,
> > > +				    pgoff_t merkle_index,
> > > +				    pgoff_t *ret_index)
> > > +{
> > > +	/*
> > > +	 * the file is readonly, so i_size can't change here.  We jump
> > > +	 * some pages past the last page to cache our merkles.  The goal
> > > +	 * is just to jump past any hugepages that might be mapped in.
> > > +	 */
> > > +	pgoff_t merkle_offset = 2048;
> > > +	u64 index = (i_size_read(inode) >> PAGE_SHIFT) + merkle_offset + merkle_index;
> > 
> > Would it make more sense to align the page index to 2048, rather than adding
> > 2048?  Or are huge pages not necessarily aligned in the page cache?
> > 
> 
> What advantages are there to aligning? I don't have any objection to
> doing it besides keeping things as simple as possible.

It just seems like the logical thing to do, and it's what ext4 and f2fs do; they
align the start of the verity metadata to 65536 bytes so that it's page-aligned
on all architectures.

Actually, you might want to choose a fixed value like that as well (rather than
some constant multiple of PAGE_SIZE) so that your maximum file size isn't
different on different architectures.

Can you elaborate on what sort of huge page scenario you have in mind here?

> 
> > > +
> > > +	if (index > inode->i_sb->s_maxbytes >> PAGE_SHIFT)
> > > +		return -EFBIG;
> > 
> > There's an off-by-one error here; it's considering the beginning of the page
> > rather than the end of the page.
> > 
> 
> I can't see the error myself, yet..
> 
> read_merkle_tree_page is what interacts with the page cache and does it
> with read_key_bytes in PAGE_SIZE chunks. So if index == maxbytes >>
> PAGE_SHIFT, I take that to mean we match on the start of the last
> possible page, which seems fine to read in all of. The next index will
> fail.

The maximum number of pages is s_maxbytes >> PAGE_SHIFT, and you're allowing the
page with that index, which means you're allowing one too many pages.  Hence, an
off-by-one-error.

> 
> I think the weird thing is I called get_verity_merkle_index to
> write_merkle_tree_block. It doesn't do much there since we aren't
> affecting the page cache till we read.
> 
> As far as I can see, to make the btrfs implementation behave as
> similarly as possible to ext4, it should either interact with the page
> cache on the write path, or if that is undesirable (haven't thought it
> through carefully yet), it should accurately fail writes with EFBIG that
> would later fail as reads.
> 

Yes, you need to enforce the limit at write time, not just at read time.  But
make sure you're using the page index, not the block index (to be ready for
Merkle tree block size != PAGE_SIZE in the future).

- Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: initial fsverity support
  2021-04-09 23:32       ` Eric Biggers
@ 2021-05-03 18:46         ` Boris Burkov
  0 siblings, 0 replies; 18+ messages in thread
From: Boris Burkov @ 2021-05-03 18:46 UTC (permalink / raw)
  To: Eric Biggers; +Cc: linux-btrfs, kernel-team, linux-fscrypt

On Fri, Apr 09, 2021 at 04:32:59PM -0700, Eric Biggers wrote:
> On Fri, Apr 09, 2021 at 03:45:17PM -0700, Boris Burkov wrote:
> > On Thu, Apr 08, 2021 at 03:50:08PM -0700, Eric Biggers wrote:
> > > On Thu, Apr 08, 2021 at 11:33:53AM -0700, Boris Burkov wrote:
> > > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > > > index f7a4ad86adee..e5282a8f566a 100644
> > > > --- a/fs/btrfs/super.c
> > > > +++ b/fs/btrfs/super.c
> > > > @@ -1339,6 +1339,7 @@ static int btrfs_fill_super(struct super_block *sb,
> > > >  	sb->s_op = &btrfs_super_ops;
> > > >  	sb->s_d_op = &btrfs_dentry_operations;
> > > >  	sb->s_export_op = &btrfs_export_ops;
> > > > +	sb->s_vop = &btrfs_verityops;
> > > >  	sb->s_xattr = btrfs_xattr_handlers;
> > > >  	sb->s_time_gran = 1;
> > > 
> > > As the kernel test robot has hinted at, this line needs to be conditional on
> > > CONFIG_FS_VERITY.
> > > 
> > > > +/*
> > > > + * Helper function for computing cache index for Merkle tree pages
> > > > + * @inode: verity file whose Merkle items we want.
> > > > + * @merkle_index: index of the page in the Merkle tree (as in
> > > > + *                read_merkle_tree_page).
> > > > + * @ret_index: returned index in the inode's mapping
> > > > + *
> > > > + * Returns: 0 on success, -EFBIG if the location in the file would be beyond
> > > > + * sb->s_maxbytes.
> > > > + */
> > > > +static int get_verity_mapping_index(struct inode *inode,
> > > > +				    pgoff_t merkle_index,
> > > > +				    pgoff_t *ret_index)
> > > > +{
> > > > +	/*
> > > > +	 * the file is readonly, so i_size can't change here.  We jump
> > > > +	 * some pages past the last page to cache our merkles.  The goal
> > > > +	 * is just to jump past any hugepages that might be mapped in.
> > > > +	 */
> > > > +	pgoff_t merkle_offset = 2048;
> > > > +	u64 index = (i_size_read(inode) >> PAGE_SHIFT) + merkle_offset + merkle_index;
> > > 
> > > Would it make more sense to align the page index to 2048, rather than adding
> > > 2048?  Or are huge pages not necessarily aligned in the page cache?
> > > 
> > 
> > What advantages are there to aligning? I don't have any objection to
> > doing it besides keeping things as simple as possible.
> 
> It just seems like the logical thing to do, and it's what ext4 and f2fs do; they
> align the start of the verity metadata to 65536 bytes so that it's page-aligned
> on all architectures.
> 
> Actually, you might want to choose a fixed value like that as well (rather than
> some constant multiple of PAGE_SIZE) so that your maximum file size isn't
> different on different architectures.
> 
> Can you elaborate on what sort of huge page scenario you have in mind here?
> 

The concern was a transparent hugepage at the end of the file that would
interact negatively with these false pages past the end of the file.
Since the indexing was pretty arbitrary, we just wanted it to be past
any final hugepage.

However, I looked into it more closely and it looks like khugepaged's
collapse_file will not collapse pages that would leave a hugepage
hanging out past EOF, so it wasn't a real issue. For consistency, when I
send V4, it will use the same "round to 65536" logic.

> > 
> > > > +
> > > > +	if (index > inode->i_sb->s_maxbytes >> PAGE_SHIFT)
> > > > +		return -EFBIG;
> > > 
> > > There's an off-by-one error here; it's considering the beginning of the page
> > > rather than the end of the page.
> > > 
> > 
> > I can't see the error myself, yet..
> > 
> > read_merkle_tree_page is what interacts with the page cache and does it
> > with read_key_bytes in PAGE_SIZE chunks. So if index == maxbytes >>
> > PAGE_SHIFT, I take that to mean we match on the start of the last
> > possible page, which seems fine to read in all of. The next index will
> > fail.
> 
> The maximum number of pages is s_maxbytes >> PAGE_SHIFT, and you're allowing the
> page with that index, which means you're allowing one too many pages.  Hence, an
> off-by-one-error.

Thinking on it further, I'm not convinced that this is wrong for the
64 bit long case. s_maxbytes is at the end of the last page, and I don't
see any reason you couldn't index it (i.e., xarray doesn't seem opposed
to that index). My rough argument for this is:

"What if maxbytes was 4095? Then maxbytes >> PAGE_SHIFT is 0, and 0 is
the valid index of the last and only page."

However, that logic falls apart for the 32 bit long case where max is
ULONG_MAX << PAGE_SHIFT, which is at the beginning of a page, and that
last index only works for exactly one byte. My code would wrongly
try to read the whole page.

To make the logic uniform for the two cases, I have found things work a
lot nicer if I operate in "file position space" rather than "page index
space" the way that ext4 does.

Have I understood it correctly now, or am I still missing something?

Thanks again for your help.

> 
> > 
> > I think the weird thing is I called get_verity_merkle_index to
> > write_merkle_tree_block. It doesn't do much there since we aren't
> > affecting the page cache till we read.
> > 
> > As far as I can see, to make the btrfs implementation behave as
> > similarly as possible to ext4, it should either interact with the page
> > cache on the write path, or if that is undesirable (haven't thought it
> > through carefully yet), it should accurately fail writes with EFBIG that
> > would later fail as reads.
> > 
> 
> Yes, you need to enforce the limit at write time, not just at read time.  But
> make sure you're using the page index, not the block index (to be ready for
> Merkle tree block size != PAGE_SIZE in the future).
> 
> - Eric

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/5] btrfs: add compat_flags to btrfs_inode_item
  2021-04-09 18:20     ` Boris Burkov
@ 2021-05-13 17:21       ` David Sterba
  0 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2021-05-13 17:21 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Anand Jain, linux-btrfs, kernel-team, linux-fscrypt

On Fri, Apr 09, 2021 at 11:20:32AM -0700, Boris Burkov wrote:
> On Fri, Apr 09, 2021 at 07:40:44AM +0800, Anand Jain wrote:
> > On 09/04/2021 02:33, Boris Burkov wrote:
> > Why not update the tree checker (need to fix stable kernel as well) and
> > inode flags, so that we spare u64 space in the btrfs_inode_item?
> 
> I don't understand this suggestion, could you be more specific?

That's probably the same suggestion I made regarding the existing inode
flags split, with some minimal backport to recognize the compat flags.

> > Also, I think we need the incompt flags to check during mount.
> 
> Same for this one, sorry. What do you think I should check?

The flag is read-only compat, and once it's added to the set
BTRFS_FEATURE_COMPAT_RO_SUPP it's already checked in open_ctree

3290 /*
3291  * Needn't use the lock because there is no other task which will
3292  * update the flag.
3293  */
3294 btrfs_set_super_incompat_flags(disk_super, features);
3295
3296 features = btrfs_super_compat_ro_flags(disk_super) &
3297         ~BTRFS_FEATURE_COMPAT_RO_SUPP;
3298 if (!sb_rdonly(sb) && features) {
3299         btrfs_err(fs_info,
3300 "cannot mount read-write because of unsupported optional features (%llx)",
3301                features);
3302         err = -EINVAL;
3303         goto fail_alloc;
3304 }

So the mount check is there.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-05-13 17:24 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-08 18:33 [PATCH v3 0/5] btrfs: support fsverity Boris Burkov
2021-04-08 18:33 ` [PATCH v3 1/5] btrfs: add compat_flags to btrfs_inode_item Boris Burkov
2021-04-08 23:40   ` Anand Jain
2021-04-09 18:20     ` Boris Burkov
2021-05-13 17:21       ` David Sterba
2021-04-08 18:33 ` [PATCH v3 2/5] btrfs: initial fsverity support Boris Burkov
2021-04-08 22:38   ` kernel test robot
2021-04-08 22:50   ` Eric Biggers
2021-04-09 18:05     ` Boris Burkov
2021-04-09 23:25       ` Eric Biggers
2021-04-09 22:45     ` Boris Burkov
2021-04-09 23:32       ` Eric Biggers
2021-05-03 18:46         ` Boris Burkov
2021-04-08 22:56   ` kernel test robot
2021-04-08 23:19   ` kernel test robot
2021-04-08 18:33 ` [PATCH v3 3/5] btrfs: check verity for reads of inline extents and holes Boris Burkov
2021-04-08 18:33 ` [PATCH v3 4/5] btrfs: fallback to buffered io for verity files Boris Burkov
2021-04-08 18:33 ` [PATCH v3 5/5] btrfs: verity metadata orphan items Boris Burkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).