* [PATCH v5 00/24] fs-verity support for XFS
@ 2024-03-04 19:10 Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 01/24] fsverity: remove hash page spin lock Andrey Albershteyn
                   ` (23 more replies)
  0 siblings, 24 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

Hi all,

Here's v5 of my patchset adding fs-verity support to XFS.

This implementation uses extended attributes to store fs-verity
metadata. The Merkle tree blocks are stored as remote extended
attributes, and the attribute names are offsets into the tree.
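
Conceptually, keying each Merkle tree block by its offset looks like
this (a hypothetical userspace sketch; the function name and types are
invented for illustration, not taken from the patchset):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the scheme described above: each Merkle tree block is
 * stored as one extended attribute, and the attribute "name" is the
 * block's byte offset within the tree.
 */
static uint64_t merkle_block_xattr_name(uint64_t block_index,
					uint32_t tree_blocksize)
{
	return block_index * (uint64_t)tree_blocksize;
}
```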

A few key points of this patchset:
- fs-verity can work with Merkle tree block-based caching (XFS) as
  well as page-based caching (ext4, f2fs, btrfs)
- iomap performs fs-verity verification
- In XFS, fs-verity metadata is stored in extended attributes
- Per-sb workqueue for verification processing
- Inodes with fs-verity enabled have a new on-disk diflag
- xfs_attr_get() can return a buffer with an extended attribute
- xfs_buf can allocate double space for Merkle tree blocks; part of
  the space is used to store the extended attribute data without
  leaf headers
- xfs_buf tracks the verified status of Merkle tree blocks

The patchset consists of six parts:
- [1]: fs-verity spinlock removal, pending in fsverity/for-next
- [2..4]: Parent pointer patches adding binary xattr names
- [5]: Expose FS_XFLAG_VERITY for fs-verity files
- [6..9]: Changes to the fs-verity core
- [10]: Integrate fs-verity into iomap
- [11..24]: Add fs-verity support to XFS

Testing:
The patchset is tested with xfstests -g verity on xfs_1k, xfs_4k,
xfs_1k_quota, xfs_4k_quota, ext4_4k, and ext4_4k_quota, with
KMEMLEAK and KASAN enabled. More testing is on the way.

Changes from V4:
- Mainly fs-verity changes; removed unnecessary functions
- Replaced the XFS workqueue with a per-sb workqueue created in
  fsverity_set_ops()
- Dropped the patch with readahead calculation in bytes
Changes from V3:
- Redid the changes to the fs-verity core, as the previous version
  had an issue on ext4
- Added a block invalidation interface to fs-verity
- Moved memory ordering primitives out of the block status check into
  the filesystem's read block function
- Added fs-verity verification to iomap instead of generic post-read
  processing
Changes from V2:
- Added the FS_XFLAG_VERITY flag
- Changed fs-verity to use Merkle tree blocks instead of expecting
  page references from the filesystem
- Changed the iomap approach to a filesystem-provided bio_set and
  submit_io instead of just callouts to the filesystem
- Allowed xfs_buf to allocate more space for fs-verity extended
  attributes
- Made the xfs_attr module copy fs-verity blocks inside the xfs_buf,
  so XFS can get the data without leaf headers
- Added Merkle tree removal to the error path
- Made scrub aware of the new dinode flag
Changes from V1:
- Added parent pointer patches for easier testing
- Fixed many issues and refactoring points from the V1 review
- Adjusted for recent changes in the fs-verity core (folios, non-4k)
- Dropped disabling of large folios
- Completely new fsverity patches (fix, callout, log_blocksize)
- Changed the verification approach in iomap to match the one in the
  write path: callouts to the filesystem instead of direct fs-verity
  use
- New XFS workqueue for post-read folio verification
- xfs_attr_get() can return the underlying xfs_buf
- xfs_bufs are marked with XBF_VERITY_CHECKED to track verified
  blocks

kernel:
[1]: https://github.com/alberand/linux/tree/fsverity-v5

xfsprogs:
[2]: https://github.com/alberand/xfsprogs/tree/fsverity-v5

xfstests:
[3]: https://github.com/alberand/xfstests/tree/fsverity-v5

v1:
[4]: https://lore.kernel.org/linux-xfs/20221213172935.680971-1-aalbersh@redhat.com/

v2:
[5]: https://lore.kernel.org/linux-xfs/20230404145319.2057051-1-aalbersh@redhat.com/

v3:
[6]: https://lore.kernel.org/all/20231006184922.252188-1-aalbersh@redhat.com/

v4:
[7]: https://lore.kernel.org/linux-xfs/20240212165821.1901300-1-aalbersh@redhat.com/

fs-verity:
[8]: https://www.kernel.org/doc/html/latest/filesystems/fsverity.html

Thanks,
Andrey

Allison Henderson (3):
  xfs: add parent pointer support to attribute code
  xfs: define parent pointer ondisk extended attribute format
  xfs: add parent pointer validator functions

Andrey Albershteyn (21):
  fsverity: remove hash page spin lock
  fs: add FS_XFLAG_VERITY for verity files
  fsverity: pass tree_blocksize to end_enable_verity()
  fsverity: support block-based Merkle tree caching
  fsverity: add per-sb workqueue for post read processing
  fsverity: add tracepoints
  iomap: integrate fs-verity verification into iomap's read path
  xfs: add XBF_VERITY_SEEN xfs_buf flag
  xfs: add XFS_DA_OP_BUFFER to make xfs_attr_get() return buffer
  xfs: add attribute type for fs-verity
  xfs: make xfs_buf_get() take XBF_* flags
  xfs: add XBF_DOUBLE_ALLOC to increase size of the buffer
  xfs: add fs-verity ro-compat flag
  xfs: add inode on-disk VERITY flag
  xfs: initialize fs-verity on file open and cleanup on inode
    destruction
  xfs: don't allow enabling DAX on fs-verity sealed inodes
  xfs: disable direct read path for fs-verity files
  xfs: add fs-verity support
  xfs: make scrub aware of verity dinode flag
  xfs: add fs-verity ioctls
  xfs: enable ro-compat fs-verity flag

 Documentation/filesystems/fsverity.rst |   8 +
 MAINTAINERS                            |   1 +
 fs/btrfs/verity.c                      |   4 +-
 fs/ext4/verity.c                       |   3 +-
 fs/f2fs/verity.c                       |   3 +-
 fs/ioctl.c                             |  11 +
 fs/iomap/buffered-io.c                 |  92 ++++++-
 fs/super.c                             |   7 +
 fs/verity/enable.c                     |   9 +-
 fs/verity/fsverity_private.h           |  11 +-
 fs/verity/init.c                       |   1 +
 fs/verity/open.c                       |   9 +-
 fs/verity/read_metadata.c              |  64 +++--
 fs/verity/signature.c                  |   2 +
 fs/verity/verify.c                     | 180 +++++++++----
 fs/xfs/Makefile                        |   2 +
 fs/xfs/libxfs/xfs_attr.c               |  31 ++-
 fs/xfs/libxfs/xfs_attr.h               |   3 +-
 fs/xfs/libxfs/xfs_attr_leaf.c          |  24 +-
 fs/xfs/libxfs/xfs_attr_remote.c        |  39 ++-
 fs/xfs/libxfs/xfs_btree_mem.c          |   2 +-
 fs/xfs/libxfs/xfs_da_btree.h           |   5 +-
 fs/xfs/libxfs/xfs_da_format.h          |  68 ++++-
 fs/xfs/libxfs/xfs_format.h             |  14 +-
 fs/xfs/libxfs/xfs_log_format.h         |   2 +
 fs/xfs/libxfs/xfs_ondisk.h             |   4 +
 fs/xfs/libxfs/xfs_parent.c             | 113 ++++++++
 fs/xfs/libxfs/xfs_parent.h             |  19 ++
 fs/xfs/libxfs/xfs_sb.c                 |   4 +-
 fs/xfs/scrub/attr.c                    |   4 +-
 fs/xfs/xfs_attr_item.c                 |   6 +-
 fs/xfs/xfs_attr_list.c                 |  14 +-
 fs/xfs/xfs_buf.c                       |   6 +-
 fs/xfs/xfs_buf.h                       |  23 +-
 fs/xfs/xfs_file.c                      |  23 +-
 fs/xfs/xfs_inode.c                     |   2 +
 fs/xfs/xfs_inode.h                     |   3 +-
 fs/xfs/xfs_ioctl.c                     |  22 ++
 fs/xfs/xfs_iops.c                      |   4 +
 fs/xfs/xfs_mount.h                     |   2 +
 fs/xfs/xfs_super.c                     |  12 +
 fs/xfs/xfs_trace.h                     |   4 +-
 fs/xfs/xfs_verity.c                    | 355 +++++++++++++++++++++++++
 fs/xfs/xfs_verity.h                    |  33 +++
 fs/xfs/xfs_xattr.c                     |  10 +
 include/linux/fs.h                     |   2 +
 include/linux/fsverity.h               |  91 ++++++-
 include/trace/events/fsverity.h        | 181 +++++++++++++
 include/uapi/linux/fs.h                |   1 +
 49 files changed, 1395 insertions(+), 138 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_parent.c
 create mode 100644 fs/xfs/libxfs/xfs_parent.h
 create mode 100644 fs/xfs/xfs_verity.c
 create mode 100644 fs/xfs/xfs_verity.h
 create mode 100644 include/trace/events/fsverity.h

-- 
2.42.0


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH v5 01/24] fsverity: remove hash page spin lock
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 02/24] xfs: add parent pointer support to attribute code Andrey Albershteyn
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn, Eric Biggers

The spin lock is not necessary here, as it can be replaced with
memory barriers, which should perform better.

When the Merkle tree block size differs from the page size,
is_hash_block_verified() modifies two things during the check: a
bitmap and the page's PG_checked flag.

Each bit in the bitmap represents the verification status of a
Merkle tree block. The PG_checked flag tells whether the page was
just re-instantiated or was already in the pagecache. Both of these
states are shared between verification threads. A page which was
just re-instantiated cannot have already-verified blocks (bits set
in the bitmap).

The spin lock was used to allow only one thread to modify both of
these states and to keep the order of operations. The only
requirement here is that PG_checked is set strictly after the bitmap
is updated. This way, other threads which see PG_checked=1 (page
cached) know that the bitmap is up to date. Otherwise, if PG_checked
were set before the bitmap is cleared, other threads could see bit=1
and therefore skip verification of that Merkle tree block.

However, there's still the case where one thread is setting a bit in
verify_data_block() while another thread is clearing it in
is_hash_block_verified(). This can happen if two threads get to the
!PageChecked branch and one of the threads is rescheduled before
resetting the bitmap. This is fine, as at worst blocks are
re-verified in each thread.
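
The ordering rule above can be sketched with C11 atomics in userspace
(an analogy, not the kernel implementation; the names here are
invented for the sketch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace analogy of the requirement above, using C11
 * release/acquire in place of smp_wmb()/smp_rmb(): page_checked
 * stands in for PG_checked, block_verified for one bit of the
 * hash_block_verified bitmap.
 */
static atomic_bool page_checked;
static atomic_bool block_verified;

/* First thread to see a new page: clear the bit, then publish. */
static void init_new_page(void)
{
	atomic_store_explicit(&block_verified, false, memory_order_relaxed);
	/* release: the bitmap write above is visible before PG_checked=1 */
	atomic_store_explicit(&page_checked, true, memory_order_release);
}

/* Any thread that sees page_checked=1 also sees an up-to-date bit. */
static bool is_block_verified(void)
{
	if (atomic_load_explicit(&page_checked, memory_order_acquire))
		return atomic_load_explicit(&block_verified,
					    memory_order_relaxed);
	init_new_page();	/* page is new: its blocks are unverified */
	return false;
}
```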

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/verity/fsverity_private.h |  1 -
 fs/verity/open.c             |  1 -
 fs/verity/verify.c           | 48 ++++++++++++++++++------------------
 3 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index a6a6b2749241..b3506f56e180 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -69,7 +69,6 @@ struct fsverity_info {
 	u8 file_digest[FS_VERITY_MAX_DIGEST_SIZE];
 	const struct inode *inode;
 	unsigned long *hash_block_verified;
-	spinlock_t hash_page_init_lock;
 };
 
 #define FS_VERITY_MAX_SIGNATURE_SIZE	(FS_VERITY_MAX_DESCRIPTOR_SIZE - \
diff --git a/fs/verity/open.c b/fs/verity/open.c
index 6c31a871b84b..fdeb95eca3af 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -239,7 +239,6 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
 			err = -ENOMEM;
 			goto fail;
 		}
-		spin_lock_init(&vi->hash_page_init_lock);
 	}
 
 	return vi;
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 904ccd7e8e16..4fcad0825a12 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -19,7 +19,6 @@ static struct workqueue_struct *fsverity_read_workqueue;
 static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
 				   unsigned long hblock_idx)
 {
-	bool verified;
 	unsigned int blocks_per_page;
 	unsigned int i;
 
@@ -43,12 +42,20 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
 	 * re-instantiated from the backing storage are re-verified.  To do
 	 * this, we use PG_checked again, but now it doesn't really mean
 	 * "checked".  Instead, now it just serves as an indicator for whether
-	 * the hash page is newly instantiated or not.
+	 * the hash page is newly instantiated or not.  If the page is new, as
+	 * indicated by PG_checked=0, we clear the bitmap bits for the page's
+	 * blocks since they are untrustworthy, then set PG_checked=1.
+	 * Otherwise we return the bitmap bit for the requested block.
 	 *
-	 * The first thread that sees PG_checked=0 must clear the corresponding
-	 * bitmap bits, then set PG_checked=1.  This requires a spinlock.  To
-	 * avoid having to take this spinlock in the common case of
-	 * PG_checked=1, we start with an opportunistic lockless read.
+	 * Multiple threads may execute this code concurrently on the same page.
+	 * This is safe because we use memory barriers to ensure that if a
+	 * thread sees PG_checked=1, then it also sees the associated bitmap
+	 * clearing to have occurred.  Also, all writes and their corresponding
+	 * reads are atomic, and all writes are safe to repeat in the event that
+	 * multiple threads get into the PG_checked=0 section.  (Clearing a
+	 * bitmap bit again at worst causes a hash block to be verified
+	 * redundantly.  That event should be very rare, so it's not worth using
+	 * a lock to avoid.  Setting PG_checked again has no effect.)
 	 */
 	if (PageChecked(hpage)) {
 		/*
@@ -58,24 +65,17 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
 		smp_rmb();
 		return test_bit(hblock_idx, vi->hash_block_verified);
 	}
-	spin_lock(&vi->hash_page_init_lock);
-	if (PageChecked(hpage)) {
-		verified = test_bit(hblock_idx, vi->hash_block_verified);
-	} else {
-		blocks_per_page = vi->tree_params.blocks_per_page;
-		hblock_idx = round_down(hblock_idx, blocks_per_page);
-		for (i = 0; i < blocks_per_page; i++)
-			clear_bit(hblock_idx + i, vi->hash_block_verified);
-		/*
-		 * A write memory barrier is needed here to give RELEASE
-		 * semantics to the below SetPageChecked() operation.
-		 */
-		smp_wmb();
-		SetPageChecked(hpage);
-		verified = false;
-	}
-	spin_unlock(&vi->hash_page_init_lock);
-	return verified;
+	blocks_per_page = vi->tree_params.blocks_per_page;
+	hblock_idx = round_down(hblock_idx, blocks_per_page);
+	for (i = 0; i < blocks_per_page; i++)
+		clear_bit(hblock_idx + i, vi->hash_block_verified);
+	/*
+	 * A write memory barrier is needed here to give RELEASE semantics to
+	 * the below SetPageChecked() operation.
+	 */
+	smp_wmb();
+	SetPageChecked(hpage);
+	return false;
 }
 
 /*
-- 
2.42.0



* [PATCH v5 02/24] xfs: add parent pointer support to attribute code
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 01/24] fsverity: remove hash page spin lock Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 03/24] xfs: define parent pointer ondisk extended attribute format Andrey Albershteyn
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Allison Henderson, Mark Tinguely, Dave Chinner

From: Allison Henderson <allison.henderson@oracle.com>

Add the new parent attribute type. XFS_ATTR_PARENT is used only for
parent pointer entries; it uses reserved blocks like XFS_ATTR_ROOT.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr.c       | 3 ++-
 fs/xfs/libxfs/xfs_da_format.h  | 5 ++++-
 fs/xfs/libxfs/xfs_log_format.h | 1 +
 fs/xfs/scrub/attr.c            | 2 +-
 fs/xfs/xfs_trace.h             | 3 ++-
 5 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 673a4b6d2e8d..ff67a684a452 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -925,7 +925,8 @@ xfs_attr_set(
 	struct xfs_inode	*dp = args->dp;
 	struct xfs_mount	*mp = dp->i_mount;
 	struct xfs_trans_res	tres;
-	bool			rsvd = (args->attr_filter & XFS_ATTR_ROOT);
+	bool			rsvd = (args->attr_filter & (XFS_ATTR_ROOT |
+							     XFS_ATTR_PARENT));
 	int			error, local;
 	int			rmt_blks = 0;
 	unsigned int		total;
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 060e5c96b70f..5434d4d5b551 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -714,12 +714,15 @@ struct xfs_attr3_leafblock {
 #define	XFS_ATTR_LOCAL_BIT	0	/* attr is stored locally */
 #define	XFS_ATTR_ROOT_BIT	1	/* limit access to trusted attrs */
 #define	XFS_ATTR_SECURE_BIT	2	/* limit access to secure attrs */
+#define	XFS_ATTR_PARENT_BIT	3	/* parent pointer attrs */
 #define	XFS_ATTR_INCOMPLETE_BIT	7	/* attr in middle of create/delete */
 #define XFS_ATTR_LOCAL		(1u << XFS_ATTR_LOCAL_BIT)
 #define XFS_ATTR_ROOT		(1u << XFS_ATTR_ROOT_BIT)
 #define XFS_ATTR_SECURE		(1u << XFS_ATTR_SECURE_BIT)
+#define XFS_ATTR_PARENT		(1u << XFS_ATTR_PARENT_BIT)
 #define XFS_ATTR_INCOMPLETE	(1u << XFS_ATTR_INCOMPLETE_BIT)
-#define XFS_ATTR_NSP_ONDISK_MASK	(XFS_ATTR_ROOT | XFS_ATTR_SECURE)
+#define XFS_ATTR_NSP_ONDISK_MASK \
+			(XFS_ATTR_ROOT | XFS_ATTR_SECURE | XFS_ATTR_PARENT)
 
 /*
  * Alignment for namelist and valuelist entries (since they are mixed
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 16872972e1e9..9cbcba4bd363 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -974,6 +974,7 @@ struct xfs_icreate_log {
  */
 #define XFS_ATTRI_FILTER_MASK		(XFS_ATTR_ROOT | \
 					 XFS_ATTR_SECURE | \
+					 XFS_ATTR_PARENT | \
 					 XFS_ATTR_INCOMPLETE)
 
 /*
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 83c7feb38714..49f91cc85a65 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -494,7 +494,7 @@ xchk_xattr_rec(
 	/* Retrieve the entry and check it. */
 	hash = be32_to_cpu(ent->hashval);
 	badflags = ~(XFS_ATTR_LOCAL | XFS_ATTR_ROOT | XFS_ATTR_SECURE |
-			XFS_ATTR_INCOMPLETE);
+			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT);
 	if ((ent->flags & badflags) != 0)
 		xchk_da_set_corrupt(ds, level);
 	if (ent->flags & XFS_ATTR_LOCAL) {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 56b07d8ed431..d4f1b2da21e7 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -86,7 +86,8 @@ struct xfs_bmap_intent;
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
 	{ XFS_ATTR_SECURE,	"SECURE" }, \
-	{ XFS_ATTR_INCOMPLETE,	"INCOMPLETE" }
+	{ XFS_ATTR_INCOMPLETE,	"INCOMPLETE" }, \
+	{ XFS_ATTR_PARENT,	"PARENT" }
 
 DECLARE_EVENT_CLASS(xfs_attr_list_class,
 	TP_PROTO(struct xfs_attr_list_context *ctx),
-- 
2.42.0



* [PATCH v5 03/24] xfs: define parent pointer ondisk extended attribute format
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 01/24] fsverity: remove hash page spin lock Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 02/24] xfs: add parent pointer support to attribute code Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 04/24] xfs: add parent pointer validator functions Andrey Albershteyn
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Allison Henderson, Dave Chinner

From: Allison Henderson <allison.henderson@oracle.com>

We need to define the parent pointer attribute format before we start
adding support for it into all the code that needs to use it. The EA
format we will use encodes the following information:

        name={parent inode #, parent inode generation, dirent namehash}
        value={dirent name}

The inode/gen gives all the information we need to reliably identify the
parent without requiring child->parent lock ordering, and allows
userspace to do pathname component level reconstruction without the
kernel ever needing to verify the parent itself as part of ioctl calls.
Storing the dirent name hash in the key reduces hash collisions if a
file is hardlinked multiple times in the same directory.

By using the NVLOOKUP mode in the extended attribute code to match
parent pointers using both the xattr name and value, we can identify the
exact parent pointer EA we need to modify/remove in rename/unlink
operations without searching the entire EA space.

By storing the dirent name, we have enough information to be able to
validate and reconstruct damaged directory trees.  Earlier iterations of
this patchset encoded the directory offset in the parent pointer key,
but this format required repair to keep that in sync across directory
rebuilds, which is unnecessary complexity.
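
The fixed-size key described above can be mirrored in a userspace
sketch (layout assumed from the struct added by this patch; the real
fields are big-endian __be64/__be32, and endianness handling is
omitted here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace mirror of the on-disk parent pointer key: plain
 * fixed-width types stand in for the on-disk big-endian fields.
 */
struct parent_name_rec {
	uint64_t p_ino;       /* parent inode number */
	uint32_t p_gen;       /* parent inode generation */
	uint32_t p_namehash;  /* hash of the dirent name */
};

/* 8 + 4 + 4 = 16 bytes; naturally aligned, so no padding is added. */
static size_t parent_rec_size(void)
{
	return sizeof(struct parent_name_rec);
}
```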

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: replace diroffset with the namehash in the pptr key]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_da_format.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 5434d4d5b551..67e8c33c4e82 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -878,4 +878,24 @@ static inline unsigned int xfs_dir2_dirblock_bytes(struct xfs_sb *sbp)
 xfs_failaddr_t xfs_da3_blkinfo_verify(struct xfs_buf *bp,
 				      struct xfs_da3_blkinfo *hdr3);
 
+/*
+ * Parent pointer attribute format definition
+ *
+ * The xattr name encodes the parent inode number, generation and the crc32c
+ * hash of the dirent name.
+ *
+ * The xattr value contains the dirent name.
+ */
+struct xfs_parent_name_rec {
+	__be64	p_ino;
+	__be32	p_gen;
+	__be32	p_namehash;
+};
+
+/*
+ * Maximum size of the dirent name that can be stored in a parent pointer.
+ * This matches the maximum dirent name length.
+ */
+#define XFS_PARENT_DIRENT_NAME_MAX_SIZE		(MAXNAMELEN - 1)
+
 #endif /* __XFS_DA_FORMAT_H__ */
-- 
2.42.0



* [PATCH v5 04/24] xfs: add parent pointer validator functions
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (2 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 03/24] xfs: define parent pointer ondisk extended attribute format Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 05/24] fs: add FS_XFLAG_VERITY for verity files Andrey Albershteyn
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Allison Henderson

From: Allison Henderson <allison.henderson@oracle.com>

Attribute names of parent pointers are not strings, so we need to
modify xfs_attr_namecheck() to verify parent pointer records when the
XFS_ATTR_PARENT flag is set. At the same time, we need to validate
attr values during log recovery if the xattr is really a parent
pointer.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: move functions to xfs_parent.c, adjust for new disk format]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile               |   1 +
 fs/xfs/libxfs/xfs_attr.c      |  10 ++-
 fs/xfs/libxfs/xfs_attr.h      |   3 +-
 fs/xfs/libxfs/xfs_da_format.h |   8 +++
 fs/xfs/libxfs/xfs_parent.c    | 113 ++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_parent.h    |  19 ++++++
 fs/xfs/scrub/attr.c           |   2 +-
 fs/xfs/xfs_attr_item.c        |   6 +-
 fs/xfs/xfs_attr_list.c        |  14 +++--
 9 files changed, 165 insertions(+), 11 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_parent.c
 create mode 100644 fs/xfs/libxfs/xfs_parent.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 76674ad5833e..f8845e65cac7 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -41,6 +41,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_inode_buf.o \
 				   xfs_log_rlimit.o \
 				   xfs_ag_resv.o \
+				   xfs_parent.o \
 				   xfs_rmap.o \
 				   xfs_rmap_btree.o \
 				   xfs_refcount.o \
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index ff67a684a452..f0b625d45aa4 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -26,6 +26,7 @@
 #include "xfs_trace.h"
 #include "xfs_attr_item.h"
 #include "xfs_xattr.h"
+#include "xfs_parent.h"
 
 struct kmem_cache		*xfs_attr_intent_cache;
 
@@ -1515,9 +1516,14 @@ xfs_attr_node_get(
 /* Returns true if the attribute entry name is valid. */
 bool
 xfs_attr_namecheck(
-	const void	*name,
-	size_t		length)
+	struct xfs_mount	*mp,
+	const void		*name,
+	size_t			length,
+	unsigned int		flags)
 {
+	if (flags & XFS_ATTR_PARENT)
+		return xfs_parent_namecheck(mp, name, length, flags);
+
 	/*
 	 * MAXNAMELEN includes the trailing null, but (name/length) leave it
 	 * out, so use >= for the length check.
diff --git a/fs/xfs/libxfs/xfs_attr.h b/fs/xfs/libxfs/xfs_attr.h
index 81be9b3e4004..92711c8d2a9f 100644
--- a/fs/xfs/libxfs/xfs_attr.h
+++ b/fs/xfs/libxfs/xfs_attr.h
@@ -547,7 +547,8 @@ int xfs_attr_get(struct xfs_da_args *args);
 int xfs_attr_set(struct xfs_da_args *args);
 int xfs_attr_set_iter(struct xfs_attr_intent *attr);
 int xfs_attr_remove_iter(struct xfs_attr_intent *attr);
-bool xfs_attr_namecheck(const void *name, size_t length);
+bool xfs_attr_namecheck(struct xfs_mount *mp, const void *name, size_t length,
+		unsigned int flags);
 int xfs_attr_calc_size(struct xfs_da_args *args, int *local);
 void xfs_init_attr_trans(struct xfs_da_args *args, struct xfs_trans_res *tres,
 			 unsigned int *total);
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 67e8c33c4e82..839df0e5401b 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -757,6 +757,14 @@ xfs_attr3_leaf_name(xfs_attr_leafblock_t *leafp, int idx)
 	return &((char *)leafp)[be16_to_cpu(entries[idx].nameidx)];
 }
 
+static inline int
+xfs_attr3_leaf_flags(xfs_attr_leafblock_t *leafp, int idx)
+{
+	struct xfs_attr_leaf_entry *entries = xfs_attr3_leaf_entryp(leafp);
+
+	return entries[idx].flags;
+}
+
 static inline xfs_attr_leaf_name_remote_t *
 xfs_attr3_leaf_name_remote(xfs_attr_leafblock_t *leafp, int idx)
 {
diff --git a/fs/xfs/libxfs/xfs_parent.c b/fs/xfs/libxfs/xfs_parent.c
new file mode 100644
index 000000000000..1d45f926c13a
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_parent.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022-2024 Oracle.
+ * All rights reserved.
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_da_format.h"
+#include "xfs_log_format.h"
+#include "xfs_shared.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_inode.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_trans.h"
+#include "xfs_da_btree.h"
+#include "xfs_attr.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_attr_sf.h"
+#include "xfs_bmap.h"
+#include "xfs_defer.h"
+#include "xfs_log.h"
+#include "xfs_xattr.h"
+#include "xfs_parent.h"
+#include "xfs_trans_space.h"
+
+/*
+ * Parent pointer attribute handling.
+ *
+ * Because the attribute value is a filename component, it will never be longer
+ * than 255 bytes. This means the attribute will always be a local format
+ * attribute, as xfs_attr_leaf_entsize_local_max() for v5 filesystems will
+ * always be larger than this (max is 75% of block size).
+ *
+ * Creating a new parent attribute will always create a new attribute - there
+ * should never, ever be an existing attribute in the tree for a new inode.
+ * ENOSPC behavior is problematic - creating the inode without the parent
+ * pointer is effectively a corruption, so we allow parent attribute creation
+ * to dip into the reserve block pool to avoid unexpected ENOSPC errors from
+ * occurring.
+ */
+
+/* Return true if parent pointer EA name is valid. */
+bool
+xfs_parent_namecheck(
+	struct xfs_mount			*mp,
+	const struct xfs_parent_name_rec	*rec,
+	size_t					reclen,
+	unsigned int				attr_flags)
+{
+	if (!(attr_flags & XFS_ATTR_PARENT))
+		return false;
+
+	/* pptr updates use logged xattrs, so we should never see this flag */
+	if (attr_flags & XFS_ATTR_INCOMPLETE)
+		return false;
+
+	if (reclen != sizeof(struct xfs_parent_name_rec))
+		return false;
+
+	/* Only one namespace bit allowed. */
+	if (hweight32(attr_flags & XFS_ATTR_NSP_ONDISK_MASK) > 1)
+		return false;
+
+	return true;
+}
+
+/* Return true if parent pointer EA value is valid. */
+bool
+xfs_parent_valuecheck(
+	struct xfs_mount		*mp,
+	const void			*value,
+	size_t				valuelen)
+{
+	if (valuelen == 0 || valuelen > XFS_PARENT_DIRENT_NAME_MAX_SIZE)
+		return false;
+
+	if (value == NULL)
+		return false;
+
+	return true;
+}
+
+/* Return true if the ondisk parent pointer is consistent. */
+bool
+xfs_parent_hashcheck(
+	struct xfs_mount		*mp,
+	const struct xfs_parent_name_rec *rec,
+	const void			*value,
+	size_t				valuelen)
+{
+	struct xfs_name			dname = {
+		.name			= value,
+		.len			= valuelen,
+	};
+	xfs_ino_t			p_ino;
+
+	/* Valid dirent name? */
+	if (!xfs_dir2_namecheck(value, valuelen))
+		return false;
+
+	/* Valid inode number? */
+	p_ino = be64_to_cpu(rec->p_ino);
+	if (!xfs_verify_dir_ino(mp, p_ino))
+		return false;
+
+	/* Namehash matches name? */
+	return be32_to_cpu(rec->p_namehash) == xfs_dir2_hashname(mp, &dname);
+}
diff --git a/fs/xfs/libxfs/xfs_parent.h b/fs/xfs/libxfs/xfs_parent.h
new file mode 100644
index 000000000000..fcfeddb645f6
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_parent.h
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022-2024 Oracle.
+ * All Rights Reserved.
+ */
+#ifndef	__XFS_PARENT_H__
+#define	__XFS_PARENT_H__
+
+/* Metadata validators */
+bool xfs_parent_namecheck(struct xfs_mount *mp,
+		const struct xfs_parent_name_rec *rec, size_t reclen,
+		unsigned int attr_flags);
+bool xfs_parent_valuecheck(struct xfs_mount *mp, const void *value,
+		size_t valuelen);
+bool xfs_parent_hashcheck(struct xfs_mount *mp,
+		const struct xfs_parent_name_rec *rec, const void *value,
+		size_t valuelen);
+
+#endif /* __XFS_PARENT_H__ */
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 49f91cc85a65..9a1f59f7b5a4 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -195,7 +195,7 @@ xchk_xattr_listent(
 	}
 
 	/* Does this name make sense? */
-	if (!xfs_attr_namecheck(name, namelen)) {
+	if (!xfs_attr_namecheck(sx->sc->mp, name, namelen, flags)) {
 		xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno);
 		goto fail_xref;
 	}
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 9b4c61e1c22e..703770cf1482 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -591,7 +591,8 @@ xfs_attr_recover_work(
 	 */
 	attrp = &attrip->attri_format;
 	if (!xfs_attri_validate(mp, attrp) ||
-	    !xfs_attr_namecheck(nv->name.i_addr, nv->name.i_len))
+	    !xfs_attr_namecheck(mp, nv->name.i_addr, nv->name.i_len,
+				attrp->alfi_attr_filter))
 		return -EFSCORRUPTED;
 
 	attr = xfs_attri_recover_work(mp, dfp, attrp, &ip, nv);
@@ -731,7 +732,8 @@ xlog_recover_attri_commit_pass2(
 		return -EFSCORRUPTED;
 	}
 
-	if (!xfs_attr_namecheck(attr_name, attri_formatp->alfi_name_len)) {
+	if (!xfs_attr_namecheck(mp, attr_name, attri_formatp->alfi_name_len,
+				attri_formatp->alfi_attr_filter)) {
 		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 				item->ri_buf[1].i_addr, item->ri_buf[1].i_len);
 		return -EFSCORRUPTED;
diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c
index a6819a642cc0..fa74378577c5 100644
--- a/fs/xfs/xfs_attr_list.c
+++ b/fs/xfs/xfs_attr_list.c
@@ -59,6 +59,7 @@ xfs_attr_shortform_list(
 	struct xfs_attr_sf_sort		*sbuf, *sbp;
 	struct xfs_attr_sf_hdr		*sf = dp->i_af.if_data;
 	struct xfs_attr_sf_entry	*sfe;
+	struct xfs_mount		*mp = dp->i_mount;
 	int				sbsize, nsbuf, count, i;
 	int				error = 0;
 
@@ -82,8 +83,9 @@ xfs_attr_shortform_list(
 	     (dp->i_af.if_bytes + sf->count * 16) < context->bufsize)) {
 		for (i = 0, sfe = xfs_attr_sf_firstentry(sf); i < sf->count; i++) {
 			if (XFS_IS_CORRUPT(context->dp->i_mount,
-					   !xfs_attr_namecheck(sfe->nameval,
-							       sfe->namelen))) {
+					   !xfs_attr_namecheck(mp, sfe->nameval,
+							       sfe->namelen,
+							       sfe->flags))) {
 				xfs_dirattr_mark_sick(context->dp, XFS_ATTR_FORK);
 				return -EFSCORRUPTED;
 			}
@@ -177,8 +179,9 @@ xfs_attr_shortform_list(
 			cursor->offset = 0;
 		}
 		if (XFS_IS_CORRUPT(context->dp->i_mount,
-				   !xfs_attr_namecheck(sbp->name,
-						       sbp->namelen))) {
+				   !xfs_attr_namecheck(mp, sbp->name,
+						       sbp->namelen,
+						       sbp->flags))) {
 			xfs_dirattr_mark_sick(context->dp, XFS_ATTR_FORK);
 			error = -EFSCORRUPTED;
 			goto out;
@@ -474,7 +477,8 @@ xfs_attr3_leaf_list_int(
 		}
 
 		if (XFS_IS_CORRUPT(context->dp->i_mount,
-				   !xfs_attr_namecheck(name, namelen))) {
+				   !xfs_attr_namecheck(mp, name, namelen,
+						       entry->flags))) {
 			xfs_dirattr_mark_sick(context->dp, XFS_ATTR_FORK);
 			return -EFSCORRUPTED;
 		}
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 05/24] fs: add FS_XFLAG_VERITY for verity files
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (3 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 04/24] xfs: add parent pointer validator functions Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 22:35   ` Eric Biggers
  2024-03-04 19:10 ` [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity() Andrey Albershteyn
                   ` (18 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

Add the extended flag FS_XFLAG_VERITY, reported for inodes that have
fs-verity enabled.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 Documentation/filesystems/fsverity.rst |  8 ++++++++
 fs/ioctl.c                             | 11 +++++++++++
 include/uapi/linux/fs.h                |  1 +
 3 files changed, 20 insertions(+)

diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index 13e4b18e5dbb..887cdaf162a9 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -326,6 +326,14 @@ the file has fs-verity enabled.  This can perform better than
 FS_IOC_GETFLAGS and FS_IOC_MEASURE_VERITY because it doesn't require
 opening the file, and opening verity files can be expensive.
 
+FS_IOC_FSGETXATTR
+-----------------
+
+Since Linux v6.9, the FS_IOC_FSGETXATTR ioctl sets FS_XFLAG_VERITY (0x00020000)
+in the returned flags when the file has verity enabled. Note that this flag
+cannot be set with FS_IOC_FSSETXATTR, as enabling verity requires input
+parameters. See FS_IOC_ENABLE_VERITY.
+
 .. _accessing_verity_files:
 
 Accessing verity files
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 76cf22ac97d7..38c00e47c069 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -481,6 +481,8 @@ void fileattr_fill_xflags(struct fileattr *fa, u32 xflags)
 		fa->flags |= FS_DAX_FL;
 	if (fa->fsx_xflags & FS_XFLAG_PROJINHERIT)
 		fa->flags |= FS_PROJINHERIT_FL;
+	if (fa->fsx_xflags & FS_XFLAG_VERITY)
+		fa->flags |= FS_VERITY_FL;
 }
 EXPORT_SYMBOL(fileattr_fill_xflags);
 
@@ -511,6 +513,8 @@ void fileattr_fill_flags(struct fileattr *fa, u32 flags)
 		fa->fsx_xflags |= FS_XFLAG_DAX;
 	if (fa->flags & FS_PROJINHERIT_FL)
 		fa->fsx_xflags |= FS_XFLAG_PROJINHERIT;
+	if (fa->flags & FS_VERITY_FL)
+		fa->fsx_xflags |= FS_XFLAG_VERITY;
 }
 EXPORT_SYMBOL(fileattr_fill_flags);
 
@@ -641,6 +645,13 @@ static int fileattr_set_prepare(struct inode *inode,
 	    !(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
 		return -EINVAL;
 
+	/*
+	 * Verity cannot be set through FS_IOC_FSSETXATTR/FS_IOC_SETFLAGS.
+	 * See FS_IOC_ENABLE_VERITY
+	 */
+	if (fa->fsx_xflags & FS_XFLAG_VERITY)
+		return -EINVAL;
+
 	/* Extent size hints of zero turn off the flags. */
 	if (fa->fsx_extsize == 0)
 		fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 48ad69f7722e..b1d0e1169bc3 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -140,6 +140,7 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+#define FS_XFLAG_VERITY		0x00020000	/* fs-verity enabled */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.42.0



* [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (4 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 05/24] fs: add FS_XFLAG_VERITY for verity files Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-05  0:52   ` Eric Biggers
  2024-03-04 19:10 ` [PATCH v5 07/24] fsverity: support block-based Merkle tree caching Andrey Albershteyn
                   ` (17 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

XFS will need to know tree_blocksize to remove the tree in case of an
error. The size is needed to calculate the offsets of particular
Merkle tree blocks.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/btrfs/verity.c        | 4 +++-
 fs/ext4/verity.c         | 3 ++-
 fs/f2fs/verity.c         | 3 ++-
 fs/verity/enable.c       | 6 ++++--
 include/linux/fsverity.h | 4 +++-
 5 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
index 66e2270b0dae..966630523502 100644
--- a/fs/btrfs/verity.c
+++ b/fs/btrfs/verity.c
@@ -621,6 +621,7 @@ static int btrfs_begin_enable_verity(struct file *filp)
  * @desc:              verity descriptor to write out (NULL in error conditions)
  * @desc_size:         size of the verity descriptor (variable with signatures)
  * @merkle_tree_size:  size of the merkle tree in bytes
+ * @tree_blocksize:    the Merkle tree block size
  *
  * If desc is null, then VFS is signaling an error occurred during verity
  * enable, and we should try to rollback. Otherwise, attempt to finish verity.
@@ -628,7 +629,8 @@ static int btrfs_begin_enable_verity(struct file *filp)
  * Returns 0 on success, negative error code on error.
  */
 static int btrfs_end_enable_verity(struct file *filp, const void *desc,
-				   size_t desc_size, u64 merkle_tree_size)
+				   size_t desc_size, u64 merkle_tree_size,
+				   unsigned int tree_blocksize)
 {
 	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
 	int ret = 0;
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index 2f37e1ea3955..da2095a81349 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -189,7 +189,8 @@ static int ext4_write_verity_descriptor(struct inode *inode, const void *desc,
 }
 
 static int ext4_end_enable_verity(struct file *filp, const void *desc,
-				  size_t desc_size, u64 merkle_tree_size)
+				  size_t desc_size, u64 merkle_tree_size,
+				  unsigned int tree_blocksize)
 {
 	struct inode *inode = file_inode(filp);
 	const int credits = 2; /* superblock and inode for ext4_orphan_del() */
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index 4fc95f353a7a..b4461b9f47a3 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -144,7 +144,8 @@ static int f2fs_begin_enable_verity(struct file *filp)
 }
 
 static int f2fs_end_enable_verity(struct file *filp, const void *desc,
-				  size_t desc_size, u64 merkle_tree_size)
+				  size_t desc_size, u64 merkle_tree_size,
+				  unsigned int tree_blocksize)
 {
 	struct inode *inode = file_inode(filp);
 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index c284f46d1b53..04e060880b79 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -274,7 +274,8 @@ static int enable_verity(struct file *filp,
 	 * Serialized with ->begin_enable_verity() by the inode lock.
 	 */
 	inode_lock(inode);
-	err = vops->end_enable_verity(filp, desc, desc_size, params.tree_size);
+	err = vops->end_enable_verity(filp, desc, desc_size, params.tree_size,
+				      params.block_size);
 	inode_unlock(inode);
 	if (err) {
 		fsverity_err(inode, "%ps() failed with err %d",
@@ -300,7 +301,8 @@ static int enable_verity(struct file *filp,
 
 rollback:
 	inode_lock(inode);
-	(void)vops->end_enable_verity(filp, NULL, 0, params.tree_size);
+	(void)vops->end_enable_verity(filp, NULL, 0, params.tree_size,
+				      params.block_size);
 	inode_unlock(inode);
 	goto out;
 }
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 1eb7eae580be..ac58b19f23d3 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -51,6 +51,7 @@ struct fsverity_operations {
 	 * @desc: the verity descriptor to write, or NULL on failure
 	 * @desc_size: size of verity descriptor, or 0 on failure
 	 * @merkle_tree_size: total bytes the Merkle tree took up
+	 * @tree_blocksize: the Merkle tree block size
 	 *
 	 * If desc == NULL, then enabling verity failed and the filesystem only
 	 * must do any necessary cleanups.  Else, it must also store the given
@@ -65,7 +66,8 @@ struct fsverity_operations {
 	 * Return: 0 on success, -errno on failure
 	 */
 	int (*end_enable_verity)(struct file *filp, const void *desc,
-				 size_t desc_size, u64 merkle_tree_size);
+				 size_t desc_size, u64 merkle_tree_size,
+				 unsigned int tree_blocksize);
 
 	/**
 	 * Get the verity descriptor of the given inode.
-- 
2.42.0



* [PATCH v5 07/24] fsverity: support block-based Merkle tree caching
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (5 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity() Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-06  3:56   ` Eric Biggers
  2024-03-11 23:22   ` Darrick J. Wong
  2024-03-04 19:10 ` [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing Andrey Albershteyn
                   ` (16 subsequent siblings)
  23 siblings, 2 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

In the current implementation, fs-verity expects the filesystem to
provide pages filled with Merkle tree blocks. When fs-verity is done
processing the blocks, the page reference is dropped. This doesn't
fit well with the way XFS manages its memory.

To allow XFS to integrate fs-verity, this patch adds the ability for
the fs-verity verification code to take Merkle tree blocks instead
of page references. This way ext4, f2fs, and btrfs can still pass
page references, while XFS can pass references to Merkle tree blocks
stored in its buffer infrastructure.

Another addition is an invalidation function, which tells fs-verity
to mark part of the Merkle tree as not verified. The filesystem uses
it to invalidate a block which has been evicted from memory.

Depending on the Merkle tree block size, fs-verity uses either a
bitmap or the PG_checked page flag to track the "verified" status of
the blocks. With Merkle tree block caching (XFS) there is no page to
flag as verified, so fs-verity always uses the bitmap for
filesystems which use block caching.

Further, this patch allows the filesystem to do additional
processing on verified blocks via fsverity_drop_block() instead of
just dropping a reference. XFS will use this for internal buffer
cache manipulation in later patches. btrfs, ext4, and f2fs simply
drop the reference.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/verity/fsverity_private.h |   8 +++
 fs/verity/open.c             |   8 ++-
 fs/verity/read_metadata.c    |  64 +++++++++++-------
 fs/verity/verify.c           | 125 +++++++++++++++++++++++++++--------
 include/linux/fsverity.h     |  65 ++++++++++++++++++
 5 files changed, 217 insertions(+), 53 deletions(-)

diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index b3506f56e180..dad33e6ff0d6 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
 
 void __init fsverity_init_workqueue(void);
 
+/*
+ * Drop a 'block' obtained with ->read_merkle_tree_block(). Calls back into the
+ * filesystem if ->drop_block() is set; otherwise, drops the page reference
+ * stored in block->context.
+ */
+void fsverity_drop_block(struct inode *inode,
+			 struct fsverity_blockbuf *block);
+
 #endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/open.c b/fs/verity/open.c
index fdeb95eca3af..6e6922b4b014 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -213,7 +213,13 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
 	if (err)
 		goto fail;
 
-	if (vi->tree_params.block_size != PAGE_SIZE) {
+	/*
+	 * If the filesystem passes Merkle tree blocks to fs-verity (e.g. XFS),
+	 * then fs-verity must use the hash_block_verified bitmap, as there is
+	 * no page to mark with PG_checked.
+	 */
+	if (vi->tree_params.block_size != PAGE_SIZE ||
+			inode->i_sb->s_vop->read_merkle_tree_block) {
 		/*
 		 * When the Merkle tree block size and page size differ, we use
 		 * a bitmap to keep track of which hash blocks have been
diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
index f58432772d9e..5da40b5a81af 100644
--- a/fs/verity/read_metadata.c
+++ b/fs/verity/read_metadata.c
@@ -18,50 +18,68 @@ static int fsverity_read_merkle_tree(struct inode *inode,
 {
 	const struct fsverity_operations *vops = inode->i_sb->s_vop;
 	u64 end_offset;
-	unsigned int offs_in_page;
+	unsigned int offs_in_block;
 	pgoff_t index, last_index;
 	int retval = 0;
 	int err = 0;
+	const unsigned int block_size = vi->tree_params.block_size;
+	const u8 log_blocksize = vi->tree_params.log_blocksize;
 
 	end_offset = min(offset + length, vi->tree_params.tree_size);
 	if (offset >= end_offset)
 		return 0;
-	offs_in_page = offset_in_page(offset);
-	last_index = (end_offset - 1) >> PAGE_SHIFT;
+	offs_in_block = offset & (block_size - 1);
+	last_index = (end_offset - 1) >> log_blocksize;
 
 	/*
-	 * Iterate through each Merkle tree page in the requested range and copy
-	 * the requested portion to userspace.  Note that the Merkle tree block
-	 * size isn't important here, as we are returning a byte stream; i.e.,
-	 * we can just work with pages even if the tree block size != PAGE_SIZE.
+	 * Iterate through each Merkle tree block in the requested range and
+	 * copy the requested portion to userspace. Note that we are returning
+	 * a byte stream.
 	 */
-	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
+	for (index = offset >> log_blocksize; index <= last_index; index++) {
 		unsigned long num_ra_pages =
 			min_t(unsigned long, last_index - index + 1,
 			      inode->i_sb->s_bdi->io_pages);
 		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
-						   PAGE_SIZE - offs_in_page);
-		struct page *page;
-		const void *virt;
+						   block_size - offs_in_block);
+		struct fsverity_blockbuf block = {
+			.size = block_size,
+		};
 
-		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
-		if (IS_ERR(page)) {
-			err = PTR_ERR(page);
+		if (!vops->read_merkle_tree_block) {
+			unsigned int blocks_per_page =
+				vi->tree_params.blocks_per_page;
+			unsigned long page_idx =
+				round_down(index, blocks_per_page);
+			struct page *page = vops->read_merkle_tree_page(inode,
+					page_idx, num_ra_pages);
+
+			if (IS_ERR(page)) {
+				err = PTR_ERR(page);
+			} else {
+				block.kaddr = kmap_local_page(page) +
+					((index - page_idx) << log_blocksize);
+				block.context = page;
+			}
+		} else {
+			err = vops->read_merkle_tree_block(inode,
+					index << log_blocksize,
+					&block, log_blocksize, num_ra_pages);
+		}
+
+		if (err) {
 			fsverity_err(inode,
-				     "Error %d reading Merkle tree page %lu",
-				     err, index);
+				     "Error %d reading Merkle tree block %lu",
+				     err, index << log_blocksize);
 			break;
 		}
 
-		virt = kmap_local_page(page);
-		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
-			kunmap_local(virt);
-			put_page(page);
+		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
+			fsverity_drop_block(inode, &block);
 			err = -EFAULT;
 			break;
 		}
-		kunmap_local(virt);
-		put_page(page);
+		fsverity_drop_block(inode, &block);
 
 		retval += bytes_to_copy;
 		buf += bytes_to_copy;
@@ -72,7 +90,7 @@ static int fsverity_read_merkle_tree(struct inode *inode,
 			break;
 		}
 		cond_resched();
-		offs_in_page = 0;
+		offs_in_block = 0;
 	}
 	return retval ? retval : err;
 }
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 4fcad0825a12..de71911d400c 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -13,14 +13,17 @@
 static struct workqueue_struct *fsverity_read_workqueue;
 
 /*
- * Returns true if the hash block with index @hblock_idx in the tree, located in
- * @hpage, has already been verified.
+ * Returns true if the hash block with index @hblock_idx in the tree has
+ * already been verified.
  */
-static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
+static bool is_hash_block_verified(struct inode *inode,
+				   struct fsverity_blockbuf *block,
 				   unsigned long hblock_idx)
 {
 	unsigned int blocks_per_page;
 	unsigned int i;
+	struct fsverity_info *vi = inode->i_verity_info;
+	struct page *hpage = (struct page *)block->context;
 
 	/*
 	 * When the Merkle tree block size and page size are the same, then the
@@ -34,6 +37,12 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
 	if (!vi->hash_block_verified)
 		return PageChecked(hpage);
 
+	/*
+	 * Filesystems which use block-based caching (e.g. XFS) always use
+	 * the bitmap.
+	 */
+	if (inode->i_sb->s_vop->read_merkle_tree_block)
+		return test_bit(hblock_idx, vi->hash_block_verified);
 	/*
 	 * When the Merkle tree block size and page size differ, we use a bitmap
 	 * to indicate whether each hash block has been verified.
@@ -95,15 +104,15 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 	const struct merkle_tree_params *params = &vi->tree_params;
 	const unsigned int hsize = params->digest_size;
 	int level;
+	int err;
+	int num_ra_pages;
 	u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE];
 	const u8 *want_hash;
 	u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE];
 	/* The hash blocks that are traversed, indexed by level */
 	struct {
-		/* Page containing the hash block */
-		struct page *page;
-		/* Mapped address of the hash block (will be within @page) */
-		const void *addr;
+		/* Buffer containing the hash block */
+		struct fsverity_blockbuf block;
 		/* Index of the hash block in the tree overall */
 		unsigned long index;
 		/* Byte offset of the wanted hash relative to @addr */
@@ -144,10 +153,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		unsigned long next_hidx;
 		unsigned long hblock_idx;
 		pgoff_t hpage_idx;
+		u64 hblock_pos;
 		unsigned int hblock_offset_in_page;
 		unsigned int hoffset;
 		struct page *hpage;
-		const void *haddr;
+		struct fsverity_blockbuf *block = &hblocks[level].block;
 
 		/*
 		 * The index of the block in the current level; also the index
@@ -165,29 +175,49 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		hblock_offset_in_page =
 			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
 
+		/* Offset of the Merkle tree block into the tree */
+		hblock_pos = hblock_idx << params->log_blocksize;
+
 		/* Byte offset of the hash within the block */
 		hoffset = (hidx << params->log_digestsize) &
 			  (params->block_size - 1);
 
-		hpage = inode->i_sb->s_vop->read_merkle_tree_page(inode,
-				hpage_idx, level == 0 ? min(max_ra_pages,
-					params->tree_pages - hpage_idx) : 0);
-		if (IS_ERR(hpage)) {
+		num_ra_pages = level == 0 ?
+			min(max_ra_pages, params->tree_pages - hpage_idx) : 0;
+
+		if (inode->i_sb->s_vop->read_merkle_tree_block) {
+			err = inode->i_sb->s_vop->read_merkle_tree_block(
+				inode, hblock_pos, block, params->log_blocksize,
+				num_ra_pages);
+		} else {
+			unsigned int blocks_per_page =
+				vi->tree_params.blocks_per_page;
+			hblock_idx = round_down(hblock_idx, blocks_per_page);
+			hpage = inode->i_sb->s_vop->read_merkle_tree_page(
+				inode, hpage_idx, (num_ra_pages << PAGE_SHIFT));
+
+			if (IS_ERR(hpage)) {
+				err = PTR_ERR(hpage);
+			} else {
+				block->kaddr = kmap_local_page(hpage) +
+					hblock_offset_in_page;
+				block->context = hpage;
+			}
+		}
+
+		if (err) {
 			fsverity_err(inode,
-				     "Error %ld reading Merkle tree page %lu",
-				     PTR_ERR(hpage), hpage_idx);
+				     "Error %d reading Merkle tree block %lu",
+				     err, hblock_idx);
 			goto error;
 		}
-		haddr = kmap_local_page(hpage) + hblock_offset_in_page;
-		if (is_hash_block_verified(vi, hpage, hblock_idx)) {
-			memcpy(_want_hash, haddr + hoffset, hsize);
+
+		if (is_hash_block_verified(inode, block, hblock_idx)) {
+			memcpy(_want_hash, block->kaddr + hoffset, hsize);
 			want_hash = _want_hash;
-			kunmap_local(haddr);
-			put_page(hpage);
+			fsverity_drop_block(inode, block);
 			goto descend;
 		}
-		hblocks[level].page = hpage;
-		hblocks[level].addr = haddr;
 		hblocks[level].index = hblock_idx;
 		hblocks[level].hoffset = hoffset;
 		hidx = next_hidx;
@@ -197,10 +227,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 descend:
 	/* Descend the tree verifying hash blocks. */
 	for (; level > 0; level--) {
-		struct page *hpage = hblocks[level - 1].page;
-		const void *haddr = hblocks[level - 1].addr;
+		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
+		const void *haddr = block->kaddr;
 		unsigned long hblock_idx = hblocks[level - 1].index;
 		unsigned int hoffset = hblocks[level - 1].hoffset;
+		struct page *hpage = (struct page *)block->context;
 
 		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
 			goto error;
@@ -217,8 +248,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 			SetPageChecked(hpage);
 		memcpy(_want_hash, haddr + hoffset, hsize);
 		want_hash = _want_hash;
-		kunmap_local(haddr);
-		put_page(hpage);
+		fsverity_drop_block(inode, block);
 	}
 
 	/* Finally, verify the data block. */
@@ -235,10 +265,8 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		     params->hash_alg->name, hsize, want_hash,
 		     params->hash_alg->name, hsize, real_hash);
 error:
-	for (; level > 0; level--) {
-		kunmap_local(hblocks[level - 1].addr);
-		put_page(hblocks[level - 1].page);
-	}
+	for (; level > 0; level--)
+		fsverity_drop_block(inode, &hblocks[level - 1].block);
 	return false;
 }
 
@@ -362,3 +390,42 @@ void __init fsverity_init_workqueue(void)
 	if (!fsverity_read_workqueue)
 		panic("failed to allocate fsverity_read_queue");
 }
+
+/**
+ * fsverity_invalidate_block() - invalidate Merkle tree block
+ * @inode: inode to which this Merkle tree blocks belong
+ * @block: block to be invalidated
+ *
+ * This function invalidates/clears the "verified" state of a Merkle tree
+ * block in the fs-verity bitmap. The block needs to have ->offset set.
+ */
+void fsverity_invalidate_block(struct inode *inode,
+		struct fsverity_blockbuf *block)
+{
+	struct fsverity_info *vi = inode->i_verity_info;
+	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
+
+	if (block->offset > vi->tree_params.tree_size) {
+		fsverity_err(inode,
+"Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
+			     vi->tree_params.tree_size, block->offset);
+		return;
+	}
+
+	clear_bit(block->offset >> log_blocksize, vi->hash_block_verified);
+}
+EXPORT_SYMBOL_GPL(fsverity_invalidate_block);
+
+void fsverity_drop_block(struct inode *inode,
+		struct fsverity_blockbuf *block)
+{
+	if (inode->i_sb->s_vop->drop_block)
+		inode->i_sb->s_vop->drop_block(block);
+	else {
+		struct page *page = (struct page *)block->context;
+
+		kunmap_local(block->kaddr);
+		put_page(page);
+	}
+	block->kaddr = NULL;
+}
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index ac58b19f23d3..0973b521ac5a 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -26,6 +26,33 @@
 /* Arbitrary limit to bound the kmalloc() size.  Can be changed. */
 #define FS_VERITY_MAX_DESCRIPTOR_SIZE	16384
 
+/**
+ * struct fsverity_blockbuf - Merkle Tree block buffer
+ * @kaddr: virtual address of the block's data
+ * @offset: block's offset into Merkle tree
+ * @size: the Merkle tree block size
+ * @context: filesystem private context
+ *
+ * Buffer containing a single Merkle tree block. These buffers are passed
+ *  - to the filesystem, when fs-verity is building the Merkle tree,
+ *  - from the filesystem, when fs-verity is reading the Merkle tree from disk.
+ * The filesystem sets kaddr together with size to point to memory containing
+ * the Merkle tree block. fs-verity does the same when a Merkle tree block
+ * needs to be written to disk.
+ *
+ * While reading the tree, fs-verity calls ->read_merkle_tree_block followed
+ * by ->drop_block to let the filesystem know that the memory can be freed.
+ *
+ * The context is optional. The filesystem can use this field to pass state
+ * from ->read_merkle_tree_block to ->drop_block.
+ */
+struct fsverity_blockbuf {
+	void *kaddr;
+	u64 offset;
+	unsigned int size;
+	void *context;
+};
+
 /* Verity operations for filesystems */
 struct fsverity_operations {
 
@@ -107,6 +134,32 @@ struct fsverity_operations {
 					      pgoff_t index,
 					      unsigned long num_ra_pages);
 
+	/**
+	 * Read a Merkle tree block of the given inode.
+	 * @inode: the inode
+	 * @pos: byte offset of the block within the Merkle tree
+	 * @block: block buffer for the filesystem to fill in
+	 * @log_blocksize: log2 of the size of the expected block
+	 * @ra_bytes: The number of bytes that should be prefetched starting
+	 *		at @pos if the block at @pos isn't already cached.
+	 *		Implementations may ignore this argument; it's only a
+	 *		performance optimization.
+	 *
+	 * This can be called at any time on an open verity file.  It may be
+	 * called by multiple processes concurrently.
+	 *
+	 * If a block is evicted from memory, the filesystem must use
+	 * fsverity_invalidate_block() to let fs-verity know that the block's
+	 * verification state is no longer valid.
+	 *
+	 * Return: 0 on success, -errno on failure
+	 */
+	int (*read_merkle_tree_block)(struct inode *inode,
+				      u64 pos,
+				      struct fsverity_blockbuf *block,
+				      unsigned int log_blocksize,
+				      u64 ra_bytes);
+
 	/**
 	 * Write a Merkle tree block to the given inode.
 	 *
@@ -122,6 +175,16 @@ struct fsverity_operations {
 	 */
 	int (*write_merkle_tree_block)(struct inode *inode, const void *buf,
 				       u64 pos, unsigned int size);
+
+	/**
+	 * Release the reference to a Merkle tree block
+	 *
+	 * @block: the block to release
+	 *
+	 * This is called when fs-verity is done with a block obtained with
+	 * ->read_merkle_tree_block().
+	 */
+	void (*drop_block)(struct fsverity_blockbuf *block);
 };
 
 #ifdef CONFIG_FS_VERITY
@@ -175,6 +238,8 @@ int fsverity_ioctl_read_metadata(struct file *filp, const void __user *uarg);
 bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset);
 void fsverity_verify_bio(struct bio *bio);
 void fsverity_enqueue_verify_work(struct work_struct *work);
+void fsverity_invalidate_block(struct inode *inode,
+		struct fsverity_blockbuf *block);
 
 #else /* !CONFIG_FS_VERITY */
 
-- 
2.42.0



* [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (6 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 07/24] fsverity: support block-based Merkle tree caching Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-05  1:08   ` Eric Biggers
  2024-03-04 19:10 ` [PATCH v5 09/24] fsverity: add tracepoints Andrey Albershteyn
                   ` (15 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

For XFS, fsverity's global workqueue is not really suitable due to:

1. High priority workqueues are used within XFS to ensure that data
   IO completion cannot stall processing of journal IO completions.
   Hence using a WQ_HIGHPRI workqueue directly in the user data IO
   path is a potential filesystem livelock/deadlock vector.

2. The fsverity workqueue is global - it creates a cross-filesystem
   contention point.

This patch adds a per-filesystem, per-CPU workqueue for fs-verity
work. This allows iomap to queue verification work in the read path
on BIO completion.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/super.c               |  7 +++++++
 include/linux/fs.h       |  2 ++
 include/linux/fsverity.h | 22 ++++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index d6efeba0d0ce..03795ee4d9b9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -637,6 +637,13 @@ void generic_shutdown_super(struct super_block *sb)
 			sb->s_dio_done_wq = NULL;
 		}
 
+#ifdef CONFIG_FS_VERITY
+		if (sb->s_read_done_wq) {
+			destroy_workqueue(sb->s_read_done_wq);
+			sb->s_read_done_wq = NULL;
+		}
+#endif
+
 		if (sop->put_super)
 			sop->put_super(sb);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fbc72c5f112..5863519ffd51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1223,6 +1223,8 @@ struct super_block {
 #endif
 #ifdef CONFIG_FS_VERITY
 	const struct fsverity_operations *s_vop;
+	/* Completion queue for post read verification */
+	struct workqueue_struct *s_read_done_wq;
 #endif
 #if IS_ENABLED(CONFIG_UNICODE)
 	struct unicode_map *s_encoding;
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 0973b521ac5a..45b7c613148a 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -241,6 +241,22 @@ void fsverity_enqueue_verify_work(struct work_struct *work);
 void fsverity_invalidate_block(struct inode *inode,
 		struct fsverity_blockbuf *block);
 
+static inline int fsverity_set_ops(struct super_block *sb,
+				   const struct fsverity_operations *ops)
+{
+	sb->s_vop = ops;
+
+	/* Create per-sb workqueue for post read bio verification */
+	struct workqueue_struct *wq = alloc_workqueue(
+		"pread/%s", (WQ_FREEZABLE | WQ_MEM_RECLAIM), 0, sb->s_id);
+	if (!wq)
+		return -ENOMEM;
+
+	sb->s_read_done_wq = wq;
+
+	return 0;
+}
+
 #else /* !CONFIG_FS_VERITY */
 
 static inline struct fsverity_info *fsverity_get_info(const struct inode *inode)
@@ -318,6 +334,12 @@ static inline void fsverity_enqueue_verify_work(struct work_struct *work)
 	WARN_ON_ONCE(1);
 }
 
+static inline int fsverity_set_ops(struct super_block *sb,
+				   const struct fsverity_operations *ops)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif	/* !CONFIG_FS_VERITY */
 
 static inline bool fsverity_verify_folio(struct folio *folio)
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 09/24] fsverity: add tracepoints
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (7 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-05  0:33   ` Eric Biggers
  2024-03-04 19:10 ` [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path Andrey Albershteyn
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

fs-verity previously had debug printks, but they were removed. This
patch adds tracepoints in the same places where the printks were
used (plus a few additional ones).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 MAINTAINERS                     |   1 +
 fs/verity/enable.c              |   3 +
 fs/verity/fsverity_private.h    |   2 +
 fs/verity/init.c                |   1 +
 fs/verity/signature.c           |   2 +
 fs/verity/verify.c              |   7 ++
 include/trace/events/fsverity.h | 181 ++++++++++++++++++++++++++++++++
 7 files changed, 197 insertions(+)
 create mode 100644 include/trace/events/fsverity.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2ecaaec6a6bf..49888dd5cbbd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8745,6 +8745,7 @@ T:	git https://git.kernel.org/pub/scm/fs/fsverity/linux.git
 F:	Documentation/filesystems/fsverity.rst
 F:	fs/verity/
 F:	include/linux/fsverity.h
+F:	include/trace/events/fsverity.h
 F:	include/uapi/linux/fsverity.h
 
 FT260 FTDI USB-HID TO I2C BRIDGE DRIVER
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index 04e060880b79..945eba0092ab 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -227,6 +227,8 @@ static int enable_verity(struct file *filp,
 	if (err)
 		goto out;
 
+	trace_fsverity_enable(inode, desc, &params);
+
 	/*
 	 * Start enabling verity on this file, serialized by the inode lock.
 	 * Fail if verity is already enabled or is already being enabled.
@@ -255,6 +257,7 @@ static int enable_verity(struct file *filp,
 		fsverity_err(inode, "Error %d building Merkle tree", err);
 		goto rollback;
 	}
+	trace_fsverity_tree_done(inode, desc, &params);
 
 	/*
 	 * Create the fsverity_info.  Don't bother trying to save work by
diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index dad33e6ff0d6..fd8f5a8d1f6a 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -162,4 +162,6 @@ void __init fsverity_init_workqueue(void);
 void fsverity_drop_block(struct inode *inode,
 			 struct fsverity_blockbuf *block);
 
+#include <trace/events/fsverity.h>
+
 #endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/init.c b/fs/verity/init.c
index cb2c9aac61ed..3769d2dc9e3b 100644
--- a/fs/verity/init.c
+++ b/fs/verity/init.c
@@ -5,6 +5,7 @@
  * Copyright 2019 Google LLC
  */
 
+#define CREATE_TRACE_POINTS
 #include "fsverity_private.h"
 
 #include <linux/ratelimit.h>
diff --git a/fs/verity/signature.c b/fs/verity/signature.c
index 90c07573dd77..c1f08bb32ed1 100644
--- a/fs/verity/signature.c
+++ b/fs/verity/signature.c
@@ -53,6 +53,8 @@ int fsverity_verify_signature(const struct fsverity_info *vi,
 	struct fsverity_formatted_digest *d;
 	int err;
 
+	trace_fsverity_verify_signature(inode, signature, sig_size);
+
 	if (sig_size == 0) {
 		if (fsverity_require_signatures) {
 			fsverity_err(inode,
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index de71911d400c..614776e7a2b6 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -118,6 +118,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		/* Byte offset of the wanted hash relative to @addr */
 		unsigned int hoffset;
 	} hblocks[FS_VERITY_MAX_LEVELS];
+	trace_fsverity_verify_block(inode, data_pos);
 	/*
 	 * The index of the previous level's block within that level; also the
 	 * index of that block's hash within the current level.
@@ -215,6 +216,8 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		if (is_hash_block_verified(inode, block, hblock_idx)) {
 			memcpy(_want_hash, block->kaddr + hoffset, hsize);
 			want_hash = _want_hash;
+			trace_fsverity_merkle_tree_block_verified(inode,
+					block, FSVERITY_TRACE_DIR_ASCEND);
 			fsverity_drop_block(inode, block);
 			goto descend;
 		}
@@ -248,6 +251,8 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
 			SetPageChecked(hpage);
 		memcpy(_want_hash, haddr + hoffset, hsize);
 		want_hash = _want_hash;
+		trace_fsverity_merkle_tree_block_verified(inode, block,
+				FSVERITY_TRACE_DIR_DESCEND);
 		fsverity_drop_block(inode, block);
 	}
 
@@ -405,6 +410,8 @@ void fsverity_invalidate_block(struct inode *inode,
 	struct fsverity_info *vi = inode->i_verity_info;
 	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
 
+	trace_fsverity_invalidate_block(inode, block);
+
 	if (block->offset > vi->tree_params.tree_size) {
 		fsverity_err(inode,
 "Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
diff --git a/include/trace/events/fsverity.h b/include/trace/events/fsverity.h
new file mode 100644
index 000000000000..82966ecc5722
--- /dev/null
+++ b/include/trace/events/fsverity.h
@@ -0,0 +1,181 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fsverity
+
+#if !defined(_TRACE_FSVERITY_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FSVERITY_H
+
+#include <linux/tracepoint.h>
+
+struct fsverity_descriptor;
+struct merkle_tree_params;
+struct fsverity_info;
+
+#define FSVERITY_TRACE_DIR_ASCEND	(1ul << 0)
+#define FSVERITY_TRACE_DIR_DESCEND	(1ul << 1)
+#define FSVERITY_HASH_SHOWN_LEN		20
+
+TRACE_EVENT(fsverity_enable,
+	TP_PROTO(struct inode *inode, struct fsverity_descriptor *desc,
+		struct merkle_tree_params *params),
+	TP_ARGS(inode, desc, params),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(u64, data_size)
+		__field(unsigned int, block_size)
+		__field(unsigned int, num_levels)
+		__field(u64, tree_size)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->data_size = desc->data_size;
+		__entry->block_size = params->block_size;
+		__entry->num_levels = params->num_levels;
+		__entry->tree_size = params->tree_size;
+	),
+	TP_printk("ino %lu data size %llu tree size %llu block size %u levels %u",
+		(unsigned long) __entry->ino,
+		__entry->data_size,
+		__entry->tree_size,
+		__entry->block_size,
+		__entry->num_levels)
+);
+
+TRACE_EVENT(fsverity_tree_done,
+	TP_PROTO(struct inode *inode, struct fsverity_descriptor *desc,
+		struct merkle_tree_params *params),
+	TP_ARGS(inode, desc, params),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(unsigned int, levels)
+		__field(unsigned int, tree_blocks)
+		__field(u64, tree_size)
+		__array(u8, tree_hash, 64)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->levels = params->num_levels;
+		__entry->tree_blocks =
+			params->tree_size >> params->log_blocksize;
+		__entry->tree_size = params->tree_size;
+		memcpy(__entry->tree_hash, desc->root_hash, 64);
+	),
+	TP_printk("ino %lu levels %d tree_blocks %d tree_size %lld root_hash %s",
+		(unsigned long) __entry->ino,
+		__entry->levels,
+		__entry->tree_blocks,
+		__entry->tree_size,
+		__print_hex(__entry->tree_hash, 64))
+);
+
+TRACE_EVENT(fsverity_verify_block,
+	TP_PROTO(struct inode *inode, u64 offset),
+	TP_ARGS(inode, offset),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(u64, offset)
+		__field(unsigned int, block_size)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->offset = offset;
+		__entry->block_size =
+			inode->i_verity_info->tree_params.block_size;
+	),
+	TP_printk("ino %lu data offset %lld data block size %u",
+		(unsigned long) __entry->ino,
+		__entry->offset,
+		__entry->block_size)
+);
+
+TRACE_EVENT(fsverity_merkle_tree_block_verified,
+	TP_PROTO(struct inode *inode,
+		 struct fsverity_blockbuf *block,
+		 u8 direction),
+	TP_ARGS(inode, block, direction),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(u64, offset)
+		__field(u8, direction)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->offset = block->offset;
+		__entry->direction = direction;
+	),
+	TP_printk("ino %lu block offset %llu %s",
+		(unsigned long) __entry->ino,
+		__entry->offset,
+		__entry->direction == 0 ? "ascend" : "descend")
+);
+
+TRACE_EVENT(fsverity_invalidate_block,
+	TP_PROTO(struct inode *inode, struct fsverity_blockbuf *block),
+	TP_ARGS(inode, block),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(u64, offset)
+		__field(unsigned int, block_size)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->offset = block->offset;
+		__entry->block_size = block->size;
+	),
+	TP_printk("ino %lu block position %llu block size %u",
+		(unsigned long) __entry->ino,
+		__entry->offset,
+		__entry->block_size)
+);
+
+TRACE_EVENT(fsverity_read_merkle_tree_block,
+	TP_PROTO(struct inode *inode, u64 offset, unsigned int log_blocksize),
+	TP_ARGS(inode, offset, log_blocksize),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__field(u64, offset)
+		__field(u64, index)
+		__field(unsigned int, block_size)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		__entry->offset = offset;
+		__entry->index = offset >> log_blocksize;
+		__entry->block_size = 1 << log_blocksize;
+	),
+	TP_printk("ino %lu tree offset %llu block index %llu block size %u",
+		(unsigned long) __entry->ino,
+		__entry->offset,
+		__entry->index,
+		__entry->block_size)
+);
+
+TRACE_EVENT(fsverity_verify_signature,
+	TP_PROTO(const struct inode *inode, const u8 *signature, size_t sig_size),
+	TP_ARGS(inode, signature, sig_size),
+	TP_STRUCT__entry(
+		__field(ino_t, ino)
+		__dynamic_array(u8, signature, sig_size)
+		__field(size_t, sig_size)
+		__field(size_t, sig_size_show)
+	),
+	TP_fast_assign(
+		__entry->ino = inode->i_ino;
+		memcpy(__get_dynamic_array(signature), signature, sig_size);
+		__entry->sig_size = sig_size;
+		__entry->sig_size_show = (sig_size > FSVERITY_HASH_SHOWN_LEN ?
+			FSVERITY_HASH_SHOWN_LEN : sig_size);
+	),
+	TP_printk("ino %lu sig_size %lu %s%s%s",
+		(unsigned long) __entry->ino,
+		__entry->sig_size,
+		(__entry->sig_size ? "sig " : ""),
+		__print_hex(__get_dynamic_array(signature),
+			__entry->sig_size_show),
+		(__entry->sig_size ? "..." : ""))
+);
+
+#endif /* _TRACE_FSVERITY_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
2.42.0



* [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (8 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 09/24] fsverity: add tracepoints Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 23:39   ` Eric Biggers
  2024-03-04 19:10 ` [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag Andrey Albershteyn
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn, Christoph Hellwig

This patch adds fs-verity verification into iomap's read path. After
a BIO's I/O operation completes, the data is verified against
fs-verity's Merkle tree. The verification work is done in a separate
workqueue.

When FS_VERITY is enabled, the read path's completion work items are
stored side by side with the BIOs.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 92 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 83 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 093c4515b22a..e27a8af4b351 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -6,6 +6,7 @@
 #include <linux/module.h>
 #include <linux/compiler.h>
 #include <linux/fs.h>
+#include <linux/fsverity.h>
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
 #include <linux/uio.h>
@@ -330,6 +331,56 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
 		pos >= i_size_read(iter->inode);
 }
 
+#ifdef CONFIG_FS_VERITY
+struct iomap_fsverity_bio {
+	struct work_struct	work;
+	struct bio		bio;
+};
+static struct bio_set iomap_fsverity_bioset;
+
+static void
+iomap_read_fsverity_end_io_work(struct work_struct *work)
+{
+	struct iomap_fsverity_bio *fbio =
+		container_of(work, struct iomap_fsverity_bio, work);
+
+	fsverity_verify_bio(&fbio->bio);
+	iomap_read_end_io(&fbio->bio);
+}
+
+static void
+iomap_read_fsverity_end_io(struct bio *bio)
+{
+	struct iomap_fsverity_bio *fbio =
+		container_of(bio, struct iomap_fsverity_bio, bio);
+
+	INIT_WORK(&fbio->work, iomap_read_fsverity_end_io_work);
+	queue_work(bio->bi_private, &fbio->work);
+}
+#endif /* CONFIG_FS_VERITY */
+
+static struct bio *iomap_read_bio_alloc(struct inode *inode,
+		struct block_device *bdev, int nr_vecs, gfp_t gfp)
+{
+	struct bio *bio;
+
+#ifdef CONFIG_FS_VERITY
+	if (fsverity_active(inode)) {
+		bio = bio_alloc_bioset(bdev, nr_vecs, REQ_OP_READ, gfp,
+					&iomap_fsverity_bioset);
+		if (bio) {
+			bio->bi_private = inode->i_sb->s_read_done_wq;
+			bio->bi_end_io = iomap_read_fsverity_end_io;
+		}
+		return bio;
+	}
+#endif
+	bio = bio_alloc(bdev, nr_vecs, REQ_OP_READ, gfp);
+	if (bio)
+		bio->bi_end_io = iomap_read_end_io;
+	return bio;
+}
+
 static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		struct iomap_readpage_ctx *ctx, loff_t offset)
 {
@@ -353,6 +404,12 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 	if (iomap_block_needs_zeroing(iter, pos)) {
 		folio_zero_range(folio, poff, plen);
+		if (fsverity_active(iter->inode) &&
+		    !fsverity_verify_blocks(folio, plen, poff)) {
+			folio_set_error(folio);
+			goto done;
+		}
+
 		iomap_set_range_uptodate(folio, poff, plen);
 		goto done;
 	}
@@ -370,28 +427,29 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	    !bio_add_folio(ctx->bio, folio, plen, poff)) {
 		gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
 		gfp_t orig_gfp = gfp;
-		unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
 
 		if (ctx->bio)
 			submit_bio(ctx->bio);
 
 		if (ctx->rac) /* same as readahead_gfp_mask */
 			gfp |= __GFP_NORETRY | __GFP_NOWARN;
-		ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
-				     REQ_OP_READ, gfp);
+
+		ctx->bio = iomap_read_bio_alloc(iter->inode, iomap->bdev,
+				bio_max_segs(DIV_ROUND_UP(length, PAGE_SIZE)),
+				gfp);
+
 		/*
 		 * If the bio_alloc fails, try it again for a single page to
 		 * avoid having to deal with partial page reads.  This emulates
 		 * what do_mpage_read_folio does.
 		 */
 		if (!ctx->bio) {
-			ctx->bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ,
-					     orig_gfp);
+			ctx->bio = iomap_read_bio_alloc(iter->inode,
+					iomap->bdev, 1, orig_gfp);
 		}
 		if (ctx->rac)
 			ctx->bio->bi_opf |= REQ_RAHEAD;
 		ctx->bio->bi_iter.bi_sector = sector;
-		ctx->bio->bi_end_io = iomap_read_end_io;
 		bio_add_folio_nofail(ctx->bio, folio, plen, poff);
 	}
 
@@ -471,6 +529,7 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
  * iomap_readahead - Attempt to read pages from a file.
  * @rac: Describes the pages to be read.
  * @ops: The operations vector for the filesystem.
+ * @wq: Workqueue for post-I/O processing (only needed for fsverity)
  *
  * This function is for filesystems to call to implement their readahead
  * address_space operation.
@@ -1996,10 +2055,25 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 }
 EXPORT_SYMBOL_GPL(iomap_writepages);
 
+#define IOMAP_POOL_SIZE		(4 * (PAGE_SIZE / SECTOR_SIZE))
+
 static int __init iomap_init(void)
 {
-	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
-			   offsetof(struct iomap_ioend, io_inline_bio),
-			   BIOSET_NEED_BVECS);
+	int error;
+
+	error = bioset_init(&iomap_ioend_bioset, IOMAP_POOL_SIZE,
+			    offsetof(struct iomap_ioend, io_inline_bio),
+			    BIOSET_NEED_BVECS);
+#ifdef CONFIG_FS_VERITY
+	if (error)
+		return error;
+
+	error = bioset_init(&iomap_fsverity_bioset, IOMAP_POOL_SIZE,
+			    offsetof(struct iomap_fsverity_bio, bio),
+			    BIOSET_NEED_BVECS);
+	if (error)
+		bioset_exit(&iomap_ioend_bioset);
+#endif
+	return error;
 }
 fs_initcall(iomap_init);
-- 
2.42.0



* [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (9 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-07 22:46   ` Darrick J. Wong
  2024-03-04 19:10 ` [PATCH v5 12/24] xfs: add XFS_DA_OP_BUFFER to make xfs_attr_get() return buffer Andrey Albershteyn
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

One of the essential ideas of fs-verity is that pages which are
already verified won't need to be re-verified if they are still in
the page cache.

XFS stores Merkle tree blocks in extended file attributes. When an
extended attribute is read, its data is put into an xfs_buf.

fs-verity uses the PG_checked flag to track the status of the blocks
in a page. This flag can have two meanings - the page was
re-instantiated, or the only block in the page is verified.

However, in XFS, the data in the buffer is not aligned with the
xfs_buf pages and we don't have a reference to these pages.
Moreover, these pages are released when the value is copied out in
the xfs_attr code. In other words, we cannot directly mark the
underlying xfs_buf's pages as verified the way fs-verity does for
other filesystems.

One way to track that these pages were processed by fs-verity is to
mark the buffer as verified instead. If the buffer is evicted, the
incore XBF_VERITY_SEEN flag is lost. When the xattr is read again,
xfs_attr_get() returns a new buffer without the flag. The xfs_buf's
flag is then used to tell fs-verity whether this buffer was cached
or not.

The second state indicated by PG_checked is whether the only block
in the page is verified. This is not the case for XFS, as there can
be multiple blocks in a single buffer (e.g. 64k page size with 4k
block size). This is handled by fs-verity's bitmap; fs-verity always
uses the bitmap for XFS regardless of the Merkle tree block size.

The meaning of the flag is that the value of the extended attribute
in the buffer has been processed by fs-verity.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/xfs_buf.h | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 73249abca968..2a73918193ba 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -24,14 +24,15 @@ struct xfs_buf;
 
 #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
 
-#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
-#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
-#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
-#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
-#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
-#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
-#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
-#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
+#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
+#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
+#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
+#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
+#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
+#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
+#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
+#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
+#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */
 
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1u << 16)/* inode buffer */
@@ -65,6 +66,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
+	{ XBF_VERITY_SEEN,	"VERITY_SEEN" }, \
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
-- 
2.42.0



* [PATCH v5 12/24] xfs: add XFS_DA_OP_BUFFER to make xfs_attr_get() return buffer
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (10 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 13/24] xfs: add attribute type for fs-verity Andrey Albershteyn
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

With the XBF_VERITY_SEEN flag on an xfs_buf, XFS can track which
buffers contain verified Merkle tree blocks. However, we also need
to expose the buffer to pass a reference of the underlying page to
fs-verity.

This patch adds XFS_DA_OP_BUFFER to tell xfs_attr_get() to
xfs_buf_hold() the underlying buffer and return it as
xfs_da_args->bp. The caller must then xfs_buf_rele() the buffer.
Therefore, XFS holds a reference to the xfs_buf while fs-verity is
verifying its content.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_attr.c        |  5 ++++-
 fs/xfs/libxfs/xfs_attr_leaf.c   |  7 +++++++
 fs/xfs/libxfs/xfs_attr_remote.c | 13 +++++++++++--
 fs/xfs/libxfs/xfs_da_btree.h    |  5 ++++-
 4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index f0b625d45aa4..ca515e8bd2ed 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -252,6 +252,8 @@ xfs_attr_get_ilocked(
  * If the attribute is found, but exceeds the size limit set by the caller in
  * args->valuelen, return -ERANGE with the size of the attribute that was found
  * in args->valuelen.
+ *
+ * With XFS_DA_OP_BUFFER, the caller has to release the buffer args->bp.
  */
 int
 xfs_attr_get(
@@ -270,7 +272,8 @@ xfs_attr_get(
 	args->hashval = xfs_da_hashname(args->name, args->namelen);
 
 	/* Entirely possible to look up a name which doesn't exist */
-	args->op_flags = XFS_DA_OP_OKNOENT;
+	args->op_flags = XFS_DA_OP_OKNOENT |
+					(args->op_flags & XFS_DA_OP_BUFFER);
 
 	lock_mode = xfs_ilock_attr_map_shared(args->dp);
 	error = xfs_attr_get_ilocked(args);
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index ac904cc1a97b..b51f439e4aed 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -2453,6 +2453,13 @@ xfs_attr3_leaf_getvalue(
 		name_loc = xfs_attr3_leaf_name_local(leaf, args->index);
 		ASSERT(name_loc->namelen == args->namelen);
 		ASSERT(memcmp(args->name, name_loc->nameval, args->namelen) == 0);
+
+		/* must be released by the caller */
+		if (args->op_flags & XFS_DA_OP_BUFFER) {
+			xfs_buf_hold(bp);
+			args->bp = bp;
+		}
+
 		return xfs_attr_copy_value(args,
 					&name_loc->nameval[args->namelen],
 					be16_to_cpu(name_loc->valuelen));
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index ff0412828772..4b44866479dc 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -429,9 +429,18 @@ xfs_attr_rmtval_get(
 			error = xfs_attr_rmtval_copyout(mp, bp, args->dp,
 							&offset, &valuelen,
 							&dst);
-			xfs_buf_relse(bp);
-			if (error)
+			xfs_buf_unlock(bp);
+			/* must be released by the caller */
+			if (args->op_flags & XFS_DA_OP_BUFFER)
+				args->bp = bp;
+			else
+				xfs_buf_rele(bp);
+
+			if (error) {
+				if (args->op_flags & XFS_DA_OP_BUFFER)
+					xfs_buf_rele(args->bp);
 				return error;
+			}
 
 			/* roll attribute extent map forwards */
 			lblkno += map[i].br_blockcount;
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index 706baf36e175..1534f4102a47 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -59,6 +59,7 @@ typedef struct xfs_da_args {
 	uint8_t		filetype;	/* filetype of inode for directories */
 	void		*value;		/* set of bytes (maybe contain NULLs) */
 	int		valuelen;	/* length of value */
+	struct xfs_buf	*bp;		/* OUT: xfs_buf which contains the attr */
 	unsigned int	attr_filter;	/* XFS_ATTR_{ROOT,SECURE,INCOMPLETE} */
 	unsigned int	attr_flags;	/* XATTR_{CREATE,REPLACE} */
 	xfs_dahash_t	hashval;	/* hash value of name */
@@ -93,6 +94,7 @@ typedef struct xfs_da_args {
 #define XFS_DA_OP_REMOVE	(1u << 6) /* this is a remove operation */
 #define XFS_DA_OP_RECOVERY	(1u << 7) /* Log recovery operation */
 #define XFS_DA_OP_LOGGED	(1u << 8) /* Use intent items to track op */
+#define XFS_DA_OP_BUFFER	(1u << 9) /* Return underlying buffer */
 
 #define XFS_DA_OP_FLAGS \
 	{ XFS_DA_OP_JUSTCHECK,	"JUSTCHECK" }, \
@@ -103,7 +105,8 @@ typedef struct xfs_da_args {
 	{ XFS_DA_OP_NOTIME,	"NOTIME" }, \
 	{ XFS_DA_OP_REMOVE,	"REMOVE" }, \
 	{ XFS_DA_OP_RECOVERY,	"RECOVERY" }, \
-	{ XFS_DA_OP_LOGGED,	"LOGGED" }
+	{ XFS_DA_OP_LOGGED,	"LOGGED" }, \
+	{ XFS_DA_OP_BUFFER,	"BUFFER" }
 
 /*
  * Storage for holding state during Btree searches and split/join ops.
-- 
2.42.0



* [PATCH v5 13/24] xfs: add attribute type for fs-verity
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (11 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 12/24] xfs: add XFS_DA_OP_BUFFER to make xfs_attr_get() return buffer Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 14/24] xfs: make xfs_buf_get() to take XBF_* flags Andrey Albershteyn
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

The Merkle tree blocks and descriptor are stored in the extended
attributes of the inode. Add a new attribute type for fs-verity
metadata. Add XFS_ATTR_INTERNAL_MASK to skip parent pointer and
fs-verity attributes, as those are only for internal use. While
we're at it, add a few comments in relevant places noting that
internally visible attributes are not supposed to be handled via the
interface defined in xfs_xattr.c.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_da_format.h  | 10 +++++++++-
 fs/xfs/libxfs/xfs_log_format.h |  1 +
 fs/xfs/xfs_ioctl.c             |  5 +++++
 fs/xfs/xfs_trace.h             |  3 ++-
 fs/xfs/xfs_xattr.c             | 10 ++++++++++
 5 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 839df0e5401b..28d4ac6fa156 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -715,14 +715,22 @@ struct xfs_attr3_leafblock {
 #define	XFS_ATTR_ROOT_BIT	1	/* limit access to trusted attrs */
 #define	XFS_ATTR_SECURE_BIT	2	/* limit access to secure attrs */
 #define	XFS_ATTR_PARENT_BIT	3	/* parent pointer attrs */
+#define	XFS_ATTR_VERITY_BIT	4	/* verity merkle tree and descriptor */
 #define	XFS_ATTR_INCOMPLETE_BIT	7	/* attr in middle of create/delete */
 #define XFS_ATTR_LOCAL		(1u << XFS_ATTR_LOCAL_BIT)
 #define XFS_ATTR_ROOT		(1u << XFS_ATTR_ROOT_BIT)
 #define XFS_ATTR_SECURE		(1u << XFS_ATTR_SECURE_BIT)
 #define XFS_ATTR_PARENT		(1u << XFS_ATTR_PARENT_BIT)
+#define XFS_ATTR_VERITY		(1u << XFS_ATTR_VERITY_BIT)
 #define XFS_ATTR_INCOMPLETE	(1u << XFS_ATTR_INCOMPLETE_BIT)
 #define XFS_ATTR_NSP_ONDISK_MASK \
-			(XFS_ATTR_ROOT | XFS_ATTR_SECURE | XFS_ATTR_PARENT)
+			(XFS_ATTR_ROOT | XFS_ATTR_SECURE | XFS_ATTR_PARENT | \
+			 XFS_ATTR_VERITY)
+
+/*
+ * Internal attributes not exposed to the user
+ */
+#define XFS_ATTR_INTERNAL_MASK (XFS_ATTR_PARENT | XFS_ATTR_VERITY)
 
 /*
  * Alignment for namelist and valuelist entries (since they are mixed
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 9cbcba4bd363..407fadfb5c06 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -975,6 +975,7 @@ struct xfs_icreate_log {
 #define XFS_ATTRI_FILTER_MASK		(XFS_ATTR_ROOT | \
 					 XFS_ATTR_SECURE | \
 					 XFS_ATTR_PARENT | \
+					 XFS_ATTR_VERITY | \
 					 XFS_ATTR_INCOMPLETE)
 
 /*
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d0e2cec6210d..ab61d7d552fb 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -352,6 +352,11 @@ static unsigned int
 xfs_attr_filter(
 	u32			ioc_flags)
 {
+	/*
+	 * Only externally visible attributes should be specified here.
+	 * Internally used attributes (such as parent pointers or fs-verity)
+	 * should not be exposed to userspace.
+	 */
 	if (ioc_flags & XFS_IOC_ATTR_ROOT)
 		return XFS_ATTR_ROOT;
 	if (ioc_flags & XFS_IOC_ATTR_SECURE)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d4f1b2da21e7..9d4ae05abfc8 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -87,7 +87,8 @@ struct xfs_bmap_intent;
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
 	{ XFS_ATTR_SECURE,	"SECURE" }, \
 	{ XFS_ATTR_INCOMPLETE,	"INCOMPLETE" }, \
-	{ XFS_ATTR_PARENT,	"PARENT" }
+	{ XFS_ATTR_PARENT,	"PARENT" }, \
+	{ XFS_ATTR_VERITY,	"VERITY" }
 
 DECLARE_EVENT_CLASS(xfs_attr_list_class,
 	TP_PROTO(struct xfs_attr_list_context *ctx),
diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c
index 364104e1b38a..e4c88dde4e44 100644
--- a/fs/xfs/xfs_xattr.c
+++ b/fs/xfs/xfs_xattr.c
@@ -20,6 +20,13 @@
 
 #include <linux/posix_acl_xattr.h>
 
+/*
+ * This file defines the interface for working with externally visible
+ * extended attributes, such as those in the user, system or security
+ * namespaces. This interface should not be used for internally used
+ * attributes (see xfs_attr.c).
+ */
+
 /*
  * Get permission to use log-assisted atomic exchange of file extents.
  *
@@ -244,6 +251,9 @@ xfs_xattr_put_listent(
 
 	ASSERT(context->count >= 0);
 
+	if (flags & XFS_ATTR_INTERNAL_MASK)
+		return;
+
 	if (flags & XFS_ATTR_ROOT) {
 #ifdef CONFIG_XFS_POSIX_ACL
 		if (namelen == SGI_ACL_FILE_SIZE &&
-- 
2.42.0



* [PATCH v5 14/24] xfs: make xfs_buf_get() to take XBF_* flags
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (12 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 13/24] xfs: add attribute type for fs-verity Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 15/24] xfs: add XBF_DOUBLE_ALLOC to increase size of the buffer Andrey Albershteyn
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

Allow passing XBF_* buffer flags to xfs_buf_get(). This will allow
fs-verity to specify a flag that increases the buffer size.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_attr_remote.c | 2 +-
 fs/xfs/libxfs/xfs_btree_mem.c   | 2 +-
 fs/xfs/libxfs/xfs_sb.c          | 2 +-
 fs/xfs/xfs_buf.h                | 3 ++-
 4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4b44866479dc..f15350e99d66 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -526,7 +526,7 @@ xfs_attr_rmtval_set_value(
 		dblkno = XFS_FSB_TO_DADDR(mp, map.br_startblock),
 		dblkcnt = XFS_FSB_TO_BB(mp, map.br_blockcount);
 
-		error = xfs_buf_get(mp->m_ddev_targp, dblkno, dblkcnt, &bp);
+		error = xfs_buf_get(mp->m_ddev_targp, dblkno, dblkcnt, 0, &bp);
 		if (error)
 			return error;
 		bp->b_ops = &xfs_attr3_rmt_buf_ops;
diff --git a/fs/xfs/libxfs/xfs_btree_mem.c b/fs/xfs/libxfs/xfs_btree_mem.c
index 036061fe32cc..07df43decce7 100644
--- a/fs/xfs/libxfs/xfs_btree_mem.c
+++ b/fs/xfs/libxfs/xfs_btree_mem.c
@@ -92,7 +92,7 @@ xfbtree_init_leaf_block(
 	xfbno_t				bno = xfbt->highest_bno++;
 	int				error;
 
-	error = xfs_buf_get(xfbt->target, xfbno_to_daddr(bno), XFBNO_BBSIZE,
+	error = xfs_buf_get(xfbt->target, xfbno_to_daddr(bno), XFBNO_BBSIZE, 0,
 			&bp);
 	if (error)
 		return error;
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d991eec05436..a25949843d8d 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1100,7 +1100,7 @@ xfs_update_secondary_sbs(
 
 		error = xfs_buf_get(mp->m_ddev_targp,
 				 XFS_AG_DADDR(mp, pag->pag_agno, XFS_SB_DADDR),
-				 XFS_FSS_TO_BB(mp, 1), &bp);
+				 XFS_FSS_TO_BB(mp, 1), 0, &bp);
 		/*
 		 * If we get an error reading or writing alternate superblocks,
 		 * continue.  xfs_repair chooses the "best" superblock based
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 2a73918193ba..b5c58287c663 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -257,11 +257,12 @@ xfs_buf_get(
 	struct xfs_buftarg	*target,
 	xfs_daddr_t		blkno,
 	size_t			numblks,
+	xfs_buf_flags_t		flags,
 	struct xfs_buf		**bpp)
 {
 	DEFINE_SINGLE_BUF_MAP(map, blkno, numblks);
 
-	return xfs_buf_get_map(target, &map, 1, 0, bpp);
+	return xfs_buf_get_map(target, &map, 1, flags, bpp);
 }
 
 static inline int
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 15/24] xfs: add XBF_DOUBLE_ALLOC to increase size of the buffer
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (13 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 14/24] xfs: make xfs_buf_get() take XBF_* flags Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 16/24] xfs: add fs-verity ro-compat flag Andrey Albershteyn
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

For fs-verity integration, XFS needs to supply kernel addresses of
Merkle tree blocks to the fs-verity core and track which blocks are
already verified. One way to track the verified status is to set an
xfs_buf flag (the previously added XBF_VERITY_SEEN). When the xfs_buf
is evicted from memory we lose the verified status; otherwise,
fs-verity hits an xfs_buf which is still in cache and contains already
verified blocks.

However, the leaf blocks which are read into the xfs_buf contain leaf
headers. xfs_attr_get() allocates new pages and copies out the data
without the headers. Those newly allocated pages with the extended
attribute data are no longer attached to the buffer.

Add a new flag, XBF_DOUBLE_ALLOC, which makes xfs_buf allocate twice
the memory for the buffer. The additional memory is used for a copy of
the attribute data without any headers. Also, make
xfs_attr_rmtval_get() copy the data into the buffer itself if XFS
asked for an fs-verity block.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_attr_remote.c | 26 ++++++++++++++++++++++++--
 fs/xfs/xfs_buf.c                |  6 +++++-
 fs/xfs/xfs_buf.h                |  2 ++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index f15350e99d66..f1b7842da809 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -395,12 +395,22 @@ xfs_attr_rmtval_get(
 	int			blkcnt = args->rmtblkcnt;
 	int			i;
 	int			offset = 0;
+	int			flags = 0;
+	void			*addr;
 
 	trace_xfs_attr_rmtval_get(args);
 
 	ASSERT(args->valuelen != 0);
 	ASSERT(args->rmtvaluelen == args->valuelen);
 
+	/*
+	 * We also check for _OP_BUFFER as we want to trigger on
+	 * verity blocks only, not on verity_descriptor
+	 */
+	if (args->attr_filter & XFS_ATTR_VERITY &&
+			args->op_flags & XFS_DA_OP_BUFFER)
+		flags = XBF_DOUBLE_ALLOC;
+
 	valuelen = args->rmtvaluelen;
 	while (valuelen > 0) {
 		nmap = ATTR_RMTVALUE_MAPSIZE;
@@ -420,12 +430,23 @@ xfs_attr_rmtval_get(
 			dblkno = XFS_FSB_TO_DADDR(mp, map[i].br_startblock);
 			dblkcnt = XFS_FSB_TO_BB(mp, map[i].br_blockcount);
 			error = xfs_buf_read(mp->m_ddev_targp, dblkno, dblkcnt,
-					0, &bp, &xfs_attr3_rmt_buf_ops);
+					flags, &bp, &xfs_attr3_rmt_buf_ops);
 			if (xfs_metadata_is_sick(error))
 				xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
 			if (error)
 				return error;
 
+			/*
+			 * For fs-verity we allocated more space. That space is
+			 * filled with the same xattr data but without leaf
+			 * headers. Point args->value to that data
+			 */
+			if (flags & XBF_DOUBLE_ALLOC) {
+				addr = xfs_buf_offset(bp, BBTOB(bp->b_length));
+				args->value = addr;
+				dst = addr;
+			}
+
 			error = xfs_attr_rmtval_copyout(mp, bp, args->dp,
 							&offset, &valuelen,
 							&dst);
@@ -526,7 +547,8 @@ xfs_attr_rmtval_set_value(
 		dblkno = XFS_FSB_TO_DADDR(mp, map.br_startblock),
 		dblkcnt = XFS_FSB_TO_BB(mp, map.br_blockcount);
 
-		error = xfs_buf_get(mp->m_ddev_targp, dblkno, dblkcnt, 0, &bp);
+		error = xfs_buf_get(mp->m_ddev_targp, dblkno, dblkcnt,
+				XBF_DOUBLE_ALLOC, &bp);
 		if (error)
 			return error;
 		bp->b_ops = &xfs_attr3_rmt_buf_ops;
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7fc26e64368d..d8a478bda865 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -337,6 +337,9 @@ xfs_buf_alloc_kmem(
 	gfp_t		gfp_mask = GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL;
 	size_t		size = BBTOB(bp->b_length);
 
+	if (flags & XBF_DOUBLE_ALLOC)
+		size *= 2;
+
 	/* Assure zeroed buffer for non-read cases. */
 	if (!(flags & XBF_READ))
 		gfp_mask |= __GFP_ZERO;
@@ -367,12 +370,13 @@ xfs_buf_alloc_pages(
 {
 	gfp_t		gfp_mask = GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOWARN;
 	long		filled = 0;
+	int		mul = (bp->b_flags & XBF_DOUBLE_ALLOC) ? 2 : 1;
 
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
 
 	/* Make sure that we have a page list */
-	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
+	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length * mul), PAGE_SIZE);
 	if (bp->b_page_count <= XB_PAGES) {
 		bp->b_pages = bp->b_page_array;
 	} else {
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b5c58287c663..60b7d58d5da1 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -33,6 +33,7 @@ struct xfs_buf;
 #define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
 #define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
 #define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */
+#define XBF_DOUBLE_ALLOC	(1u << 9) /* double allocated space */
 
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1u << 16)/* inode buffer */
@@ -67,6 +68,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
 	{ XBF_VERITY_SEEN,	"VERITY_SEEN" }, \
+	{ XBF_DOUBLE_ALLOC,	"DOUBLE_ALLOC" }, \
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 16/24] xfs: add fs-verity ro-compat flag
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (14 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 15/24] xfs: add XBF_DOUBLE_ALLOC to increase size of the buffer Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 17/24] xfs: add inode on-disk VERITY flag Andrey Albershteyn
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

To mark inodes with fs-verity enabled, a new XFS_DIFLAG2_VERITY flag
will be added in a later patch. This requires a ro-compat flag to let
older kernels know that a filesystem with fs-verity can not be
modified.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h | 1 +
 fs/xfs/libxfs/xfs_sb.c     | 2 ++
 fs/xfs/xfs_mount.h         | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 2b2f9050fbfb..93d280eb8451 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_VERITY   (1 << 4)		/* fs-verity */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a25949843d8d..1c68785e60cc 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -163,6 +163,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_REFLINK;
 	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
 		features |= XFS_FEAT_INOBTCNT;
+	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_VERITY)
+		features |= XFS_FEAT_VERITY;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
 		features |= XFS_FEAT_FTYPE;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e880aa48de68..f198d7c82552 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -292,6 +292,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_BIGTIME	(1ULL << 24)	/* large timestamps */
 #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
 #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
+#define XFS_FEAT_VERITY		(1ULL << 27)	/* fs-verity */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -355,6 +356,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
 __XFS_HAS_FEAT(bigtime, BIGTIME)
 __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
 __XFS_HAS_FEAT(large_extent_counts, NREXT64)
+__XFS_HAS_FEAT(verity, VERITY)
 
 /*
  * Mount features
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 17/24] xfs: add inode on-disk VERITY flag
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (15 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 16/24] xfs: add fs-verity ro-compat flag Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-07 22:06   ` Darrick J. Wong
  2024-03-04 19:10 ` [PATCH v5 18/24] xfs: initialize fs-verity on file open and cleanup on inode destruction Andrey Albershteyn
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

Add a flag to mark inodes which have fs-verity enabled (i.e. the
descriptor exists and the Merkle tree is built).

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h | 4 +++-
 fs/xfs/xfs_inode.c         | 2 ++
 fs/xfs/xfs_iops.c          | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 93d280eb8451..3ce2902101bc 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1085,16 +1085,18 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
 #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
+#define XFS_DIFLAG2_VERITY_BIT	5	/* inode sealed by fsverity */
 
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
 #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_VERITY	(1 << XFS_DIFLAG2_VERITY_BIT)
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_VERITY)
 
 static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 {
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ea48774f6b76..59446e9e1719 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -607,6 +607,8 @@ xfs_ip2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
 			flags |= FS_XFLAG_COWEXTSIZE;
+		if (ip->i_diflags2 & XFS_DIFLAG2_VERITY)
+			flags |= FS_XFLAG_VERITY;
 	}
 
 	if (xfs_inode_has_attr_fork(ip))
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 66f8c47642e8..0e5cdb82b231 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1241,6 +1241,8 @@ xfs_diflags_to_iflags(
 		flags |= S_NOATIME;
 	if (init && xfs_inode_should_enable_dax(ip))
 		flags |= S_DAX;
+	if (xflags & FS_XFLAG_VERITY)
+		flags |= S_VERITY;
 
 	/*
 	 * S_DAX can only be set during inode initialization and is never set by
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 18/24] xfs: initialize fs-verity on file open and cleanup on inode destruction
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (16 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 17/24] xfs: add inode on-disk VERITY flag Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-07 22:09   ` Darrick J. Wong
  2024-03-04 19:10 ` [PATCH v5 19/24] xfs: don't allow enabling DAX on a fs-verity sealed inode Andrey Albershteyn
                   ` (5 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

fs-verity will read and attach metadata (not the tree itself) from
disk for those inodes which already have fs-verity enabled.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/xfs_file.c  | 8 ++++++++
 fs/xfs/xfs_super.c | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 632653e00906..17404c2e7e31 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -31,6 +31,7 @@
 #include <linux/mman.h>
 #include <linux/fadvise.h>
 #include <linux/mount.h>
+#include <linux/fsverity.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
 
@@ -1228,10 +1229,17 @@ xfs_file_open(
 	struct inode	*inode,
 	struct file	*file)
 {
+	int		error = 0;
+
 	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
 		return -EIO;
 	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
 			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
+
+	error = fsverity_file_open(inode, file);
+	if (error)
+		return error;
+
 	return generic_file_open(inode, file);
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index afa32bd5e282..9f9c35cff9bf 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -49,6 +49,7 @@
 #include <linux/magic.h>
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
+#include <linux/fsverity.h>
 
 static const struct super_operations xfs_super_operations;
 
@@ -663,6 +664,7 @@ xfs_fs_destroy_inode(
 	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 	XFS_STATS_INC(ip->i_mount, vn_rele);
 	XFS_STATS_INC(ip->i_mount, vn_remove);
+	fsverity_cleanup_inode(inode);
 	xfs_inode_mark_reclaimable(ip);
 }
 
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 19/24] xfs: don't allow enabling DAX on a fs-verity sealed inode
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (17 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 18/24] xfs: initialize fs-verity on file open and cleanup on inode destruction Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-07 22:09   ` Darrick J. Wong
  2024-03-04 19:10 ` [PATCH v5 20/24] xfs: disable direct read path for fs-verity files Andrey Albershteyn
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

fs-verity doesn't support DAX. Forbid the filesystem from enabling DAX
on inodes which already have fs-verity enabled. The opposite is
checked when fs-verity is being enabled: it won't be enabled if DAX
is.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/xfs_iops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 0e5cdb82b231..6f97d777f702 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1213,6 +1213,8 @@ xfs_inode_should_enable_dax(
 		return false;
 	if (!xfs_inode_supports_dax(ip))
 		return false;
+	if (ip->i_diflags2 & XFS_DIFLAG2_VERITY)
+		return false;
 	if (xfs_has_dax_always(ip->i_mount))
 		return true;
 	if (ip->i_diflags2 & XFS_DIFLAG2_DAX)
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 20/24] xfs: disable direct read path for fs-verity files
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (18 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 19/24] xfs: don't allow enabling DAX on a fs-verity sealed inode Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-07 22:11   ` Darrick J. Wong
  2024-03-04 19:10 ` [PATCH v5 21/24] xfs: add fs-verity support Andrey Albershteyn
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

The direct I/O path is not supported on verity files. Attempts to use
direct I/O on such files should fall back to the buffered I/O path.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/xfs_file.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 17404c2e7e31..af3201075066 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -281,7 +281,8 @@ xfs_file_dax_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
+	struct inode		*inode = iocb->ki_filp->f_mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret = 0;
 
 	trace_xfs_file_dax_read(iocb, to);
@@ -334,10 +335,18 @@ xfs_file_read_iter(
 
 	if (IS_DAX(inode))
 		ret = xfs_file_dax_read(iocb, to);
-	else if (iocb->ki_flags & IOCB_DIRECT)
+	else if (iocb->ki_flags & IOCB_DIRECT && !fsverity_active(inode))
 		ret = xfs_file_dio_read(iocb, to);
-	else
+	else {
+		/*
+		 * In case fs-verity is enabled, we also fallback to the
+		 * buffered read from the direct read path. Therefore,
+		 * IOCB_DIRECT is set and need to be cleared (see
+		 * generic_file_read_iter())
+		 */
+		iocb->ki_flags &= ~IOCB_DIRECT;
 		ret = xfs_file_buffered_read(iocb, to);
+	}
 
 	if (ret > 0)
 		XFS_STATS_ADD(mp, xs_read_bytes, ret);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 21/24] xfs: add fs-verity support
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (19 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 20/24] xfs: disable direct read path for fs-verity files Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-06  4:55   ` Eric Biggers
  2024-03-07 23:10   ` Darrick J. Wong
  2024-03-04 19:10 ` [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag Andrey Albershteyn
                   ` (2 subsequent siblings)
  23 siblings, 2 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

Add integration with fs-verity. XFS stores fs-verity metadata in
extended attributes. The metadata consists of the verity descriptor
and the Merkle tree blocks.

The descriptor is stored under the "vdesc" extended attribute. The
Merkle tree blocks are stored under binary names which are offsets
into the Merkle tree.

When fs-verity is enabled on an inode, the XFS_IVERITY_CONSTRUCTION
flag is set, meaning that the Merkle tree is being built. The
initialization ends with storing the verity descriptor and setting the
inode's on-disk flag (XFS_DIFLAG2_VERITY).

The verification on read is done in the iomap read path.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/Makefile                 |   1 +
 fs/xfs/libxfs/xfs_attr.c        |  13 ++
 fs/xfs/libxfs/xfs_attr_leaf.c   |  17 +-
 fs/xfs/libxfs/xfs_attr_remote.c |   8 +-
 fs/xfs/libxfs/xfs_da_format.h   |  27 +++
 fs/xfs/libxfs/xfs_ondisk.h      |   4 +
 fs/xfs/xfs_inode.h              |   3 +-
 fs/xfs/xfs_super.c              |  10 +
 fs/xfs/xfs_verity.c             | 355 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_verity.h             |  33 +++
 10 files changed, 464 insertions(+), 7 deletions(-)
 create mode 100644 fs/xfs/xfs_verity.c
 create mode 100644 fs/xfs/xfs_verity.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index f8845e65cac7..8396a633b541 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -130,6 +130,7 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
 xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
+xfs-$(CONFIG_FS_VERITY)		+= xfs_verity.o
 
 # notify failure
 ifeq ($(CONFIG_MEMORY_FAILURE),y)
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index ca515e8bd2ed..cde5352db9aa 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -27,6 +27,7 @@
 #include "xfs_attr_item.h"
 #include "xfs_xattr.h"
 #include "xfs_parent.h"
+#include "xfs_verity.h"
 
 struct kmem_cache		*xfs_attr_intent_cache;
 
@@ -1527,6 +1528,18 @@ xfs_attr_namecheck(
 	if (flags & XFS_ATTR_PARENT)
 		return xfs_parent_namecheck(mp, name, length, flags);
 
+	if (flags & XFS_ATTR_VERITY) {
+		/* Merkle tree pages are stored under u64 indexes */
+		if (length == sizeof(struct xfs_fsverity_merkle_key))
+			return true;
+
+		/* Verity descriptor blocks are held in a named attribute. */
+		if (length == XFS_VERITY_DESCRIPTOR_NAME_LEN)
+			return true;
+
+		return false;
+	}
+
 	/*
 	 * MAXNAMELEN includes the trailing null, but (name/length) leave it
 	 * out, so use >= for the length check.
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index b51f439e4aed..f1f6aefc0420 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -30,6 +30,7 @@
 #include "xfs_ag.h"
 #include "xfs_errortag.h"
 #include "xfs_health.h"
+#include "xfs_verity.h"
 
 
 /*
@@ -519,7 +520,12 @@ xfs_attr_copy_value(
 		return -ERANGE;
 	}
 
-	if (!args->value) {
+	/*
+	 * We don't want to allocate memory for fs-verity Merkle tree blocks
+	 * (the fs-verity descriptor is fine though). They will be stored in
+	 * the underlying xfs_buf.
+	 */
+	if (!args->value && !xfs_verity_merkle_block(args)) {
 		args->value = kvmalloc(valuelen, GFP_KERNEL | __GFP_NOLOCKDEP);
 		if (!args->value)
 			return -ENOMEM;
@@ -538,7 +544,14 @@ xfs_attr_copy_value(
 	 */
 	if (!value)
 		return -EINVAL;
-	memcpy(args->value, value, valuelen);
+	/*
+	 * We won't copy the Merkle tree block to args->value as we want it to
+	 * be in the xfs_buf, and we didn't allocate any memory for args->value.
+	 */
+	if (xfs_verity_merkle_block(args))
+		args->value = value;
+	else
+		memcpy(args->value, value, valuelen);
 	return 0;
 }
 
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index f1b7842da809..a631ddff8068 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -23,6 +23,7 @@
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_health.h"
+#include "xfs_verity.h"
 
 #define ATTR_RMTVALUE_MAPSIZE	1	/* # of map entries at once */
 
@@ -404,11 +405,10 @@ xfs_attr_rmtval_get(
 	ASSERT(args->rmtvaluelen == args->valuelen);
 
 	/*
-	 * We also check for _OP_BUFFER as we want to trigger on
-	 * verity blocks only, not on verity_descriptor
+	 * For fs-verity we want additional space in the xfs_buf. This space is
+	 * used to copy xattr value without leaf headers (crc header).
 	 */
-	if (args->attr_filter & XFS_ATTR_VERITY &&
-			args->op_flags & XFS_DA_OP_BUFFER)
+	if (xfs_verity_merkle_block(args))
 		flags = XBF_DOUBLE_ALLOC;
 
 	valuelen = args->rmtvaluelen;
diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
index 28d4ac6fa156..c30c3c253191 100644
--- a/fs/xfs/libxfs/xfs_da_format.h
+++ b/fs/xfs/libxfs/xfs_da_format.h
@@ -914,4 +914,31 @@ struct xfs_parent_name_rec {
  */
 #define XFS_PARENT_DIRENT_NAME_MAX_SIZE		(MAXNAMELEN - 1)
 
+/*
+ * fs-verity attribute name format
+ *
+ * Merkle tree blocks are stored under extended attributes of the inode. The
+ * name of the attributes are offsets into merkle tree.
+ */
+struct xfs_fsverity_merkle_key {
+	__be64 merkleoff;
+};
+
+static inline void
+xfs_fsverity_merkle_key_to_disk(struct xfs_fsverity_merkle_key *key, loff_t pos)
+{
+	key->merkleoff = cpu_to_be64(pos);
+}
+
+static inline loff_t
+xfs_fsverity_name_to_block_offset(unsigned char *name)
+{
+	struct xfs_fsverity_merkle_key key = {
+		.merkleoff = *(__be64 *)name
+	};
+	loff_t offset = be64_to_cpu(key.merkleoff);
+
+	return offset;
+}
+
 #endif /* __XFS_DA_FORMAT_H__ */
diff --git a/fs/xfs/libxfs/xfs_ondisk.h b/fs/xfs/libxfs/xfs_ondisk.h
index 81885a6a028e..39209943c474 100644
--- a/fs/xfs/libxfs/xfs_ondisk.h
+++ b/fs/xfs/libxfs/xfs_ondisk.h
@@ -194,6 +194,10 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_VALUE(XFS_DQ_BIGTIME_EXPIRY_MIN << XFS_DQ_BIGTIME_SHIFT, 4);
 	XFS_CHECK_VALUE(XFS_DQ_BIGTIME_EXPIRY_MAX << XFS_DQ_BIGTIME_SHIFT,
 			16299260424LL);
+
+	/* fs-verity descriptor xattr name */
+	XFS_CHECK_VALUE(strlen(XFS_VERITY_DESCRIPTOR_NAME),
+			XFS_VERITY_DESCRIPTOR_NAME_LEN);
 }
 
 #endif /* __XFS_ONDISK_H */
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ab46ffb3ac19..d6664e9afa22 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -348,7 +348,8 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
  * inactivation completes, both flags will be cleared and the inode is a
  * plain old IRECLAIMABLE inode.
  */
-#define XFS_INACTIVATING	(1 << 13)
+#define XFS_INACTIVATING		(1 << 13)
+#define XFS_IVERITY_CONSTRUCTION	(1 << 15) /* merkle tree construction */
 
 /* Quotacheck is running but inode has not been added to quota counts. */
 #define XFS_IQUOTAUNCHECKED	(1 << 14)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9f9c35cff9bf..996e6ea91fe1 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -30,6 +30,7 @@
 #include "xfs_filestream.h"
 #include "xfs_quota.h"
 #include "xfs_sysfs.h"
+#include "xfs_verity.h"
 #include "xfs_ondisk.h"
 #include "xfs_rmap_item.h"
 #include "xfs_refcount_item.h"
@@ -1520,6 +1521,11 @@ xfs_fs_fill_super(
 	sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ;
 #endif
 	sb->s_op = &xfs_super_operations;
+#ifdef CONFIG_FS_VERITY
+	error = fsverity_set_ops(sb, &xfs_verity_ops);
+	if (error)
+		return error;
+#endif
 
 	/*
 	 * Delay mount work if the debug hook is set. This is debug
@@ -1729,6 +1735,10 @@ xfs_fs_fill_super(
 		goto out_filestream_unmount;
 	}
 
+	if (xfs_has_verity(mp))
+		xfs_alert(mp,
+	"EXPERIMENTAL fs-verity feature in use. Use at your own risk!");
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;
diff --git a/fs/xfs/xfs_verity.c b/fs/xfs/xfs_verity.c
new file mode 100644
index 000000000000..ea31b5bf6214
--- /dev/null
+++ b/fs/xfs/xfs_verity.c
@@ -0,0 +1,355 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 Red Hat, Inc.
+ */
+#include "xfs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_log_format.h"
+#include "xfs_attr.h"
+#include "xfs_verity.h"
+#include "xfs_bmap_util.h"
+#include "xfs_trans.h"
+#include "xfs_attr_leaf.h"
+
+/*
+ * Make fs-verity invalidate verified status of Merkle tree block
+ */
+static void
+xfs_verity_put_listent(
+	struct xfs_attr_list_context	*context,
+	int				flags,
+	unsigned char			*name,
+	int				namelen,
+	int				valuelen)
+{
+	struct fsverity_blockbuf	block = {
+		.offset = xfs_fsverity_name_to_block_offset(name),
+		.size = valuelen,
+	};
+	/*
+	 * Verity descriptor is smaller than 1024; verity block min size is
+	 * 1024. Exclude verity descriptor
+	 */
+	if (valuelen < 1024)
+		return;
+
+	fsverity_invalidate_block(VFS_I(context->dp), &block);
+}
+
+/*
+ * Iterate over extended attributes in the bp to invalidate Merkle tree blocks
+ */
+static int
+xfs_invalidate_blocks(
+	struct xfs_inode	*ip,
+	struct xfs_buf		*bp)
+{
+	struct xfs_attr_list_context context;
+
+	context.dp = ip;
+	context.resynch = 0;
+	context.buffer = NULL;
+	context.bufsize = 0;
+	context.firstu = 0;
+	context.attr_filter = XFS_ATTR_VERITY;
+	context.put_listent = xfs_verity_put_listent;
+
+	return xfs_attr3_leaf_list_int(bp, &context);
+}
+
+static int
+xfs_get_verity_descriptor(
+	struct inode		*inode,
+	void			*buf,
+	size_t			buf_size)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error = 0;
+	struct xfs_da_args	args = {
+		.dp		= ip,
+		.attr_filter	= XFS_ATTR_VERITY,
+		.name		= (const uint8_t *)XFS_VERITY_DESCRIPTOR_NAME,
+		.namelen	= XFS_VERITY_DESCRIPTOR_NAME_LEN,
+		.value		= buf,
+		.valuelen	= buf_size,
+	};
+
+	/*
+	 * The fact that (returned attribute size) == (provided buf_size) is
+	 * checked by xfs_attr_copy_value() (returns -ERANGE)
+	 */
+	error = xfs_attr_get(&args);
+	if (error)
+		return error;
+
+	return args.valuelen;
+}
+
+static int
+xfs_begin_enable_verity(
+	struct file	    *filp)
+{
+	struct inode	    *inode = file_inode(filp);
+	struct xfs_inode    *ip = XFS_I(inode);
+	int		    error = 0;
+
+	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
+
+	if (IS_DAX(inode))
+		return -EINVAL;
+
+	if (xfs_iflags_test_and_set(ip, XFS_IVERITY_CONSTRUCTION))
+		return -EBUSY;
+
+	return error;
+}
+
+static int
+xfs_drop_merkle_tree(
+	struct xfs_inode		*ip,
+	u64				merkle_tree_size,
+	unsigned int			tree_blocksize)
+{
+	struct xfs_fsverity_merkle_key	name;
+	int				error = 0;
+	u64				offset = 0;
+	struct xfs_da_args		args = {
+		.dp			= ip,
+		.whichfork		= XFS_ATTR_FORK,
+		.attr_filter		= XFS_ATTR_VERITY,
+		.op_flags		= XFS_DA_OP_REMOVE,
+		.namelen		= sizeof(struct xfs_fsverity_merkle_key),
+		/* A NULL value makes xfs_attr_set() remove the attr */
+		.value			= NULL,
+	};
+
+	if (!merkle_tree_size)
+		return 0;
+
+	args.name = (const uint8_t *)&name.merkleoff;
+	for (offset = 0; offset < merkle_tree_size; offset += tree_blocksize) {
+		xfs_fsverity_merkle_key_to_disk(&name, offset);
+		error = xfs_attr_set(&args);
+		if (error)
+			return error;
+	}
+
+	args.name = (const uint8_t *)XFS_VERITY_DESCRIPTOR_NAME;
+	args.namelen = XFS_VERITY_DESCRIPTOR_NAME_LEN;
+	error = xfs_attr_set(&args);
+
+	return error;
+}
+
+static int
+xfs_end_enable_verity(
+	struct file		*filp,
+	const void		*desc,
+	size_t			desc_size,
+	u64			merkle_tree_size,
+	unsigned int		tree_blocksize)
+{
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	struct xfs_da_args	args = {
+		.dp		= ip,
+		.whichfork	= XFS_ATTR_FORK,
+		.attr_filter	= XFS_ATTR_VERITY,
+		.attr_flags	= XATTR_CREATE,
+		.name		= (const uint8_t *)XFS_VERITY_DESCRIPTOR_NAME,
+		.namelen	= XFS_VERITY_DESCRIPTOR_NAME_LEN,
+		.value		= (void *)desc,
+		.valuelen	= desc_size,
+	};
+	int			error = 0;
+
+	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
+
+	/* fs-verity failed, just cleanup */
+	if (desc == NULL)
+		goto out;
+
+	error = xfs_attr_set(&args);
+	if (error)
+		goto out;
+
+	/* Set fsverity inode flag */
+	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_ichange,
+			0, 0, false, &tp);
+	if (error)
+		goto out;
+
+	/*
+	 * Ensure that we've persisted the verity information before we enable
+	 * it on the inode and tell the caller we have sealed the inode.
+	 */
+	ip->i_diflags2 |= XFS_DIFLAG2_VERITY;
+
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	if (!error)
+		inode->i_flags |= S_VERITY;
+
+out:
+	if (error)
+		WARN_ON_ONCE(xfs_drop_merkle_tree(ip, merkle_tree_size,
+						  tree_blocksize));
+
+	xfs_iflags_clear(ip, XFS_IVERITY_CONSTRUCTION);
+	return error;
+}
+
+static int
+xfs_read_merkle_tree_block(
+	struct inode			*inode,
+	u64				pos,
+	struct fsverity_blockbuf	*block,
+	unsigned int			log_blocksize,
+	u64				ra_bytes)
+{
+	struct xfs_inode		*ip = XFS_I(inode);
+	struct xfs_fsverity_merkle_key	name;
+	int				error = 0;
+	struct xfs_da_args		args = {
+		.dp			= ip,
+		.attr_filter		= XFS_ATTR_VERITY,
+		.op_flags		= XFS_DA_OP_BUFFER,
+		.namelen		= sizeof(struct xfs_fsverity_merkle_key),
+		.valuelen		= (1 << log_blocksize),
+	};
+	xfs_fsverity_merkle_key_to_disk(&name, pos);
+	args.name = (const uint8_t *)&name.merkleoff;
+
+	error = xfs_attr_get(&args);
+	if (error)
+		goto out;
+
+	if (!args.valuelen)
+		return -ENODATA;
+
+	block->kaddr = args.value;
+	block->offset = pos;
+	block->size = args.valuelen;
+	block->context = args.bp;
+
+	/*
+	 * Memory barriers are used to order the clearing of the bitmap in
+	 * fsverity_invalidate_block() against the setting of the
+	 * XBF_VERITY_SEEN flag.
+	 *
+	 * Multiple threads may execute this code concurrently on the same
+	 * block. This is safe because we use memory barriers to ensure that
+	 * if a thread sees XBF_VERITY_SEEN, then the fsverity bitmap is
+	 * already up to date.
+	 *
+	 * Invalidating a block in the bitmap again at worst causes a hash
+	 * block to be verified redundantly. That event should be very rare,
+	 * so it's not worth using a lock to avoid.
+	 */
+	if (!(args.bp->b_flags & XBF_VERITY_SEEN)) {
+		/*
+		 * A read memory barrier is needed here to give ACQUIRE
+		 * semantics to the above check.
+		 */
+		smp_rmb();
+		/*
+		 * fs-verity is not aware that the buffer may have been evicted
+		 * from memory. Make fs-verity invalidate the verified status of
+		 * all blocks in the buffer.
+		 *
+		 * Single extended attribute can contain multiple Merkle tree
+		 * blocks:
+		 * - leaf with inline data -> invalidate all blocks in the leaf
+		 * - remote value -> invalidate single block
+		 *
+		 * For example, a leaf on a 64k page size system with a 4k/1k
+		 * block size filesystem will contain multiple Merkle tree blocks.
+		 *
+		 * Only remote value buffers have the XBF_DOUBLE_ALLOC flag.
+		 */
+		if (args.bp->b_flags & XBF_DOUBLE_ALLOC)
+			fsverity_invalidate_block(inode, block);
+		else {
+			error = xfs_invalidate_blocks(ip, args.bp);
+			if (error)
+				goto out;
+		}
+	}
+
+	/*
+	 * A write memory barrier is needed here to give RELEASE
+	 * semantics to the below flag.
+	 */
+	smp_wmb();
+	args.bp->b_flags |= XBF_VERITY_SEEN;
+
+	return error;
+
+out:
+	kvfree(args.value);
+	if (args.bp)
+		xfs_buf_rele(args.bp);
+	return error;
+}
+
+static int
+xfs_write_merkle_tree_block(
+	struct inode		*inode,
+	const void		*buf,
+	u64			pos,
+	unsigned int		size)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_fsverity_merkle_key	name;
+	struct xfs_da_args	args = {
+		.dp		= ip,
+		.whichfork	= XFS_ATTR_FORK,
+		.attr_filter	= XFS_ATTR_VERITY,
+		.attr_flags	= XATTR_CREATE,
+		.namelen	= sizeof(struct xfs_fsverity_merkle_key),
+		.value		= (void *)buf,
+		.valuelen	= size,
+	};
+
+	xfs_fsverity_merkle_key_to_disk(&name, pos);
+	args.name = (const uint8_t *)&name.merkleoff;
+
+	return xfs_attr_set(&args);
+}
+
+static void
+xfs_drop_block(
+	struct fsverity_blockbuf	*block)
+{
+	struct xfs_buf			*bp;
+
+	ASSERT(block != NULL);
+	bp = (struct xfs_buf *)block->context;
+	ASSERT(bp->b_flags & XBF_VERITY_SEEN);
+
+	xfs_buf_rele(bp);
+
+	kunmap_local(block->kaddr);
+}
+
+const struct fsverity_operations xfs_verity_ops = {
+	.begin_enable_verity		= &xfs_begin_enable_verity,
+	.end_enable_verity		= &xfs_end_enable_verity,
+	.get_verity_descriptor		= &xfs_get_verity_descriptor,
+	.read_merkle_tree_block		= &xfs_read_merkle_tree_block,
+	.write_merkle_tree_block	= &xfs_write_merkle_tree_block,
+	.drop_block			= &xfs_drop_block,
+};
diff --git a/fs/xfs/xfs_verity.h b/fs/xfs/xfs_verity.h
new file mode 100644
index 000000000000..deea15cd3cc5
--- /dev/null
+++ b/fs/xfs/xfs_verity.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2022 Red Hat, Inc.
+ */
+#ifndef __XFS_VERITY_H__
+#define __XFS_VERITY_H__
+
+#include "xfs.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include <linux/fsverity.h>
+
+#define XFS_VERITY_DESCRIPTOR_NAME "vdesc"
+#define XFS_VERITY_DESCRIPTOR_NAME_LEN 5
+
+static inline bool
+xfs_verity_merkle_block(
+		struct xfs_da_args *args)
+{
+	if (!(args->attr_filter & XFS_ATTR_VERITY))
+		return false;
+
+	if (!(args->op_flags & XFS_DA_OP_BUFFER))
+		return false;
+
+	return true;
+}
+
+#ifdef CONFIG_FS_VERITY
+extern const struct fsverity_operations xfs_verity_ops;
+#endif	/* CONFIG_FS_VERITY */
+
+#endif	/* __XFS_VERITY_H__ */
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (20 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 21/24] xfs: add fs-verity support Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-07 22:18   ` Darrick J. Wong
  2024-03-04 19:10 ` [PATCH v5 23/24] xfs: add fs-verity ioctls Andrey Albershteyn
  2024-03-04 19:10 ` [PATCH v5 24/24] xfs: enable ro-compat fs-verity flag Andrey Albershteyn
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

fs-verity adds a new inode flag, which causes scrub to fail as the
flag is not yet known to it.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/attr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 9a1f59f7b5a4..ae4227cb55ec 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -494,7 +494,7 @@ xchk_xattr_rec(
 	/* Retrieve the entry and check it. */
 	hash = be32_to_cpu(ent->hashval);
 	badflags = ~(XFS_ATTR_LOCAL | XFS_ATTR_ROOT | XFS_ATTR_SECURE |
-			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT);
+			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT | XFS_ATTR_VERITY);
 	if ((ent->flags & badflags) != 0)
 		xchk_da_set_corrupt(ds, level);
 	if (ent->flags & XFS_ATTR_LOCAL) {
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 23/24] xfs: add fs-verity ioctls
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (21 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-07 22:14   ` Darrick J. Wong
  2024-03-04 19:10 ` [PATCH v5 24/24] xfs: enable ro-compat fs-verity flag Andrey Albershteyn
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

Add the fs-verity ioctls to enable verity, dump metadata (descriptor
and Merkle tree pages), and obtain a file's digest.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/xfs_ioctl.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index ab61d7d552fb..4763d20c05ff 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -43,6 +43,7 @@
 #include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/fileattr.h>
+#include <linux/fsverity.h>
 
 /*
  * xfs_find_handle maps from userspace xfs_fsop_handlereq structure to
@@ -2174,6 +2175,22 @@ xfs_file_ioctl(
 		return error;
 	}
 
+	case FS_IOC_ENABLE_VERITY:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_enable(filp, (const void __user *)arg);
+
+	case FS_IOC_MEASURE_VERITY:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_measure(filp, (void __user *)arg);
+
+	case FS_IOC_READ_VERITY_METADATA:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_read_metadata(filp,
+						    (const void __user *)arg);
+
 	default:
 		return -ENOTTY;
 	}
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v5 24/24] xfs: enable ro-compat fs-verity flag
  2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
                   ` (22 preceding siblings ...)
  2024-03-04 19:10 ` [PATCH v5 23/24] xfs: add fs-verity ioctls Andrey Albershteyn
@ 2024-03-04 19:10 ` Andrey Albershteyn
  2024-03-07 22:16   ` Darrick J. Wong
  23 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-04 19:10 UTC (permalink / raw)
  To: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong, ebiggers
  Cc: Andrey Albershteyn

Finalize fs-verity integration in XFS by adding the fs-verity
ro-compat flag, making the kernel aware of the feature.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 3ce2902101bc..be66f0ab20cb 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -355,10 +355,11 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
 #define XFS_SB_FEAT_RO_COMPAT_VERITY   (1 << 4)		/* fs-verity */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
-		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
-		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
-		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		(XFS_SB_FEAT_RO_COMPAT_FINOBT  | \
+		 XFS_SB_FEAT_RO_COMPAT_RMAPBT  | \
+		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT| \
+		 XFS_SB_FEAT_RO_COMPAT_VERITY)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 05/24] fs: add FS_XFLAG_VERITY for verity files
  2024-03-04 19:10 ` [PATCH v5 05/24] fs: add FS_XFLAG_VERITY for verity files Andrey Albershteyn
@ 2024-03-04 22:35   ` Eric Biggers
  2024-03-07 21:39     ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-04 22:35 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong

On Mon, Mar 04, 2024 at 08:10:28PM +0100, Andrey Albershteyn wrote:
> @@ -641,6 +645,13 @@ static int fileattr_set_prepare(struct inode *inode,
>  	    !(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
>  		return -EINVAL;
>  
> +	/*
> +	 * Verity cannot be set through FS_IOC_FSSETXATTR/FS_IOC_SETFLAGS.
> +	 * See FS_IOC_ENABLE_VERITY
> +	 */
> +	if (fa->fsx_xflags & FS_XFLAG_VERITY)
> +		return -EINVAL;

This makes FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR start failing on files that
already have verity enabled.

An error should only be returned when the new flags contain verity and the old
flags don't.
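
The corrected condition Eric describes can be modeled in a small user-space
sketch (the flag value mirrors the proposed uapi addition, and
check_verity_xflag() is an illustrative stand-in for the logic inside
fileattr_set_prepare(), not the actual kernel code):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define FS_XFLAG_VERITY 0x00020000	/* assumed value from the patchset */

/*
 * Model of the corrected check: only reject an attempt to *turn on* the
 * verity flag via FS_IOC_FSSETXATTR/FS_IOC_SETFLAGS. Writing back the
 * flag unchanged on a file that already has verity enabled must succeed,
 * otherwise round-tripping FS_IOC_FSGETXATTR -> FS_IOC_FSSETXATTR breaks.
 */
static int check_verity_xflag(unsigned int new_xflags, bool old_has_verity)
{
	if ((new_xflags & FS_XFLAG_VERITY) && !old_has_verity)
		return -EINVAL;
	return 0;
}
```

With this shape, FS_IOC_FSSETXATTR on an existing verity file no longer
fails, while enabling verity still has to go through FS_IOC_ENABLE_VERITY.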

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-04 19:10 ` [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path Andrey Albershteyn
@ 2024-03-04 23:39   ` Eric Biggers
  2024-03-07 22:06     ` Darrick J. Wong
  2024-03-07 23:38     ` Dave Chinner
  0 siblings, 2 replies; 94+ messages in thread
From: Eric Biggers @ 2024-03-04 23:39 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong,
	Christoph Hellwig

On Mon, Mar 04, 2024 at 08:10:33PM +0100, Andrey Albershteyn wrote:
> +#ifdef CONFIG_FS_VERITY
> +struct iomap_fsverity_bio {
> +	struct work_struct	work;
> +	struct bio		bio;
> +};

Maybe leave a comment above that mentions that bio must be the last field.

> @@ -471,6 +529,7 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
>   * iomap_readahead - Attempt to read pages from a file.
>   * @rac: Describes the pages to be read.
>   * @ops: The operations vector for the filesystem.
> + * @wq: Workqueue for post-I/O processing (only needed for fsverity)

This should not be here.

> +#define IOMAP_POOL_SIZE		(4 * (PAGE_SIZE / SECTOR_SIZE))
> +
>  static int __init iomap_init(void)
>  {
> -	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> -			   offsetof(struct iomap_ioend, io_inline_bio),
> -			   BIOSET_NEED_BVECS);
> +	int error;
> +
> +	error = bioset_init(&iomap_ioend_bioset, IOMAP_POOL_SIZE,
> +			    offsetof(struct iomap_ioend, io_inline_bio),
> +			    BIOSET_NEED_BVECS);
> +#ifdef CONFIG_FS_VERITY
> +	if (error)
> +		return error;
> +
> +	error = bioset_init(&iomap_fsverity_bioset, IOMAP_POOL_SIZE,
> +			    offsetof(struct iomap_fsverity_bio, bio),
> +			    BIOSET_NEED_BVECS);
> +	if (error)
> +		bioset_exit(&iomap_ioend_bioset);
> +#endif
> +	return error;
>  }
>  fs_initcall(iomap_init);

This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
for these bios, regardless of whether they end up being used or not.  When
PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.

How about allocating the pool when it's known it's actually going to be used,
similar to what fs/crypto/ does for fscrypt_bounce_page_pool?  For example,
there could be a flag in struct fsverity_operations that says whether the
filesystem wants the iomap fsverity bioset, and when fs/verity/ sets up the
fsverity_info
for any file for the first time since boot, it could call into fs/iomap/ to
initialize the iomap fsverity bioset if needed.
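
A minimal single-threaded model of that lazy-allocation pattern looks roughly
like the sketch below (iomap_fsverity_init_bioset() and bioset_ready are
illustrative names, not the real API; the kernel version would need a mutex or
cmpxchg to guard against concurrent first opens, and would call bioset_init()
where the comment indicates):

```c
#include <stdbool.h>

static bool bioset_ready;	/* stand-in for the allocated bioset */

/*
 * Hypothetically called when fs/verity/ sets up the first fsverity_info
 * since boot for a filesystem that opted into the iomap fsverity bioset.
 * The pool is only paid for once a verity file is actually used, similar
 * to fscrypt_bounce_page_pool.
 */
static int iomap_fsverity_init_bioset(void)
{
	if (!bioset_ready) {
		/* the real bioset_init(&iomap_fsverity_bioset, ...) goes here */
		bioset_ready = true;
	}
	return 0;
}
```

Repeated calls are idempotent, so every fsverity_info setup path can call it
unconditionally.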

BTW, errors from builtin initcalls such as iomap_init() get ignored.  So the
error handling logic above does not really work as may have been intended.

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 09/24] fsverity: add tracepoints
  2024-03-04 19:10 ` [PATCH v5 09/24] fsverity: add tracepoints Andrey Albershteyn
@ 2024-03-05  0:33   ` Eric Biggers
  0 siblings, 0 replies; 94+ messages in thread
From: Eric Biggers @ 2024-03-05  0:33 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong

On Mon, Mar 04, 2024 at 08:10:32PM +0100, Andrey Albershteyn wrote:
> diff --git a/include/trace/events/fsverity.h b/include/trace/events/fsverity.h
> new file mode 100644
> index 000000000000..82966ecc5722
> --- /dev/null
> +++ b/include/trace/events/fsverity.h
> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM fsverity
> +
> +#if !defined(_TRACE_FSVERITY_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_FSVERITY_H
> +
> +#include <linux/tracepoint.h>
> +
> +struct fsverity_descriptor;
> +struct merkle_tree_params;
> +struct fsverity_info;
> +
> +#define FSVERITY_TRACE_DIR_ASCEND	(1ul << 0)
> +#define FSVERITY_TRACE_DIR_DESCEND	(1ul << 1)
> +#define FSVERITY_HASH_SHOWN_LEN		20
> +
> +TRACE_EVENT(fsverity_enable,
> +	TP_PROTO(struct inode *inode, struct fsverity_descriptor *desc,
> +		struct merkle_tree_params *params),
> +	TP_ARGS(inode, desc, params),
> +	TP_STRUCT__entry(
> +		__field(ino_t, ino)
> +		__field(u64, data_size)
> +		__field(unsigned int, block_size)
> +		__field(unsigned int, num_levels)
> +		__field(u64, tree_size)
> +	),
> +	TP_fast_assign(
> +		__entry->ino = inode->i_ino;
> +		__entry->data_size = desc->data_size;
> +		__entry->block_size = params->block_size;
> +		__entry->num_levels = params->num_levels;
> +		__entry->tree_size = params->tree_size;
> +	),
> +	TP_printk("ino %lu data size %llu tree size %llu block size %u levels %u",
> +		(unsigned long) __entry->ino,
> +		__entry->data_size,
> +		__entry->tree_size,
> +		__entry->block_size,
> +		__entry->num_levels)
> +);

All pointer parameters to the tracepoints should be const, so that it's clear
that the pointed-to-data isn't being modified.

The desc parameter is not needed for fsverity_enable, since it's only being used
for the file size which is also available in inode->i_size.

> +TRACE_EVENT(fsverity_tree_done,
> +	TP_PROTO(struct inode *inode, struct fsverity_descriptor *desc,
> +		struct merkle_tree_params *params),
> +	TP_ARGS(inode, desc, params),
> +	TP_STRUCT__entry(
> +		__field(ino_t, ino)
> +		__field(unsigned int, levels)
> +		__field(unsigned int, tree_blocks)
> +		__field(u64, tree_size)
> +		__array(u8, tree_hash, 64)
> +	),
> +	TP_fast_assign(
> +		__entry->ino = inode->i_ino;
> +		__entry->levels = params->num_levels;
> +		__entry->tree_blocks =
> +			params->tree_size >> params->log_blocksize;
> +		__entry->tree_size = params->tree_size;
> +		memcpy(__entry->tree_hash, desc->root_hash, 64);
> +	),
> +	TP_printk("ino %lu levels %d tree_blocks %d tree_size %lld root_hash %s",
> +		(unsigned long) __entry->ino,
> +		__entry->levels,
> +		__entry->tree_blocks,
> +		__entry->tree_size,
> +		__print_hex(__entry->tree_hash, 64))
> +);

tree_blocks is using the wrong type (unsigned int instead of unsigned long), and
it doesn't seem very useful since there's already tree_size and the
fsverity_enable event which has block_size.

Also, the way this handles the hash is weird.  It shows 64 bytes, even if it's
shorter, and it doesn't show what algorithm it uses.  That makes the value hard
to use, as the same string could be shown for two hashes that are actually
different.  Maybe take a look at how fsverity-utils prints hashes.

Also, did you perhaps intend to use the file digest instead?  The "Merkle tree
root hash" isn't the actual file digest that userspace sees.  There's one more
level of hashing on top of that.

> +TRACE_EVENT(fsverity_verify_block,
> +	TP_PROTO(struct inode *inode, u64 offset),
> +	TP_ARGS(inode, offset),
> +	TP_STRUCT__entry(
> +		__field(ino_t, ino)
> +		__field(u64, offset)
> +		__field(unsigned int, block_size)
> +	),
> +	TP_fast_assign(
> +		__entry->ino = inode->i_ino;
> +		__entry->offset = offset;
> +		__entry->block_size =
> +			inode->i_verity_info->tree_params.block_size;
> +	),
> +	TP_printk("ino %lu data offset %lld data block size %u",
> +		(unsigned long) __entry->ino,
> +		__entry->offset,
> +		__entry->block_size)
> +);

This should be named fsverity_verify_data_block, since it's invoked when a data
block is verified, not when a hash block is verified.  Or did you perhaps intend
for this to be invoked for all blocks?

Also, please don't use 'offset', as it's ambiguous.  We should follow the
convention used in the pagecache code where 'pos' is used for an offset in
bytes, and 'index' is used for an offset in something else such as blocks.
Likewise in the other tracepoints that are using 'offset'.

Also, the dereference of ->i_verity_info seems a bit out of place.  This probably
should be passed the merkle_tree_params directly.

> +TRACE_EVENT(fsverity_merkle_tree_block_verified,
> +	TP_PROTO(struct inode *inode,
> +		 struct fsverity_blockbuf *block,
> +		 u8 direction),
> +	TP_ARGS(inode, block, direction),
> +	TP_STRUCT__entry(
> +		__field(ino_t, ino)
> +		__field(u64, offset)
> +		__field(u8, direction)
> +	),
> +	TP_fast_assign(
> +		__entry->ino = inode->i_ino;
> +		__entry->offset = block->offset;
> +		__entry->direction = direction;
> +	),
> +	TP_printk("ino %lu block offset %llu %s",
> +		(unsigned long) __entry->ino,
> +		__entry->offset,
> +		__entry->direction == 0 ? "ascend" : "descend")
> +);

It looks like 'offset' is the index of the block in the whole Merkle tree, in
which case it should be called 'index'.  However perhaps it would be more useful
to provide a (level, index_in_level) pair?

Also, fsverity_merkle_tree_block_verified isn't just being invoked when a Merkle
tree block is being verified, but also when an already-verified block is seen.
That might make it confusing to use.  Perhaps it should be defined to be just
for when a block is being verified?

> +TRACE_EVENT(fsverity_invalidate_block,
> +	TP_PROTO(struct inode *inode, struct fsverity_blockbuf *block),
> +	TP_ARGS(inode, block),
> +	TP_STRUCT__entry(
> +		__field(ino_t, ino)
> +		__field(u64, offset)
> +		__field(unsigned int, block_size)
> +	),
> +	TP_fast_assign(
> +		__entry->ino = inode->i_ino;
> +		__entry->offset = block->offset;
> +		__entry->block_size = block->size;
> +	),
> +	TP_printk("ino %lu block position %llu block size %u",
> +		(unsigned long) __entry->ino,
> +		__entry->offset,
> +		__entry->block_size)
> +);

fsverity_invalidate_merkle_tree_block?  And again, call 'offset' something else.

> +TRACE_EVENT(fsverity_read_merkle_tree_block,
> +	TP_PROTO(struct inode *inode, u64 offset, unsigned int log_blocksize),
> +	TP_ARGS(inode, offset, log_blocksize),
> +	TP_STRUCT__entry(
> +		__field(ino_t, ino)
> +		__field(u64, offset)
> +		__field(u64, index)
> +		__field(unsigned int, block_size)
> +	),
> +	TP_fast_assign(
> +		__entry->ino = inode->i_ino;
> +		__entry->offset = offset;
> +		__entry->index = offset >> log_blocksize;
> +		__entry->block_size = 1 << log_blocksize;
> +	),
> +	TP_printk("ino %lu tree offset %llu block index %llu block size %u",
> +		(unsigned long) __entry->ino,
> +		__entry->offset,
> +		__entry->index,
> +		__entry->block_size)
> +);

This tracepoint is never actually invoked.

Also it seems redundant to have both 'index' and 'offset'.

> +TRACE_EVENT(fsverity_verify_signature,
> +	TP_PROTO(const struct inode *inode, const u8 *signature, size_t sig_size),
> +	TP_ARGS(inode, signature, sig_size),
> +	TP_STRUCT__entry(
> +		__field(ino_t, ino)
> +		__dynamic_array(u8, signature, sig_size)
> +		__field(size_t, sig_size)
> +		__field(size_t, sig_size_show)
> +	),
> +	TP_fast_assign(
> +		__entry->ino = inode->i_ino;
> +		memcpy(__get_dynamic_array(signature), signature, sig_size);
> +		__entry->sig_size = sig_size;
> +		__entry->sig_size_show = (sig_size > FSVERITY_HASH_SHOWN_LEN ?
> +			FSVERITY_HASH_SHOWN_LEN : sig_size);
> +	),
> +	TP_printk("ino %lu sig_size %lu %s%s%s",
> +		(unsigned long) __entry->ino,
> +		__entry->sig_size,
> +		(__entry->sig_size ? "sig " : ""),
> +		__print_hex(__get_dynamic_array(signature),
> +			__entry->sig_size_show),
> +		(__entry->sig_size ? "..." : ""))
> +);

Do you actually have plans to use the builtin signature support?  It's been
causing a lot of issues for people, so I've been discouraging people from using
it.  If there is no use case for this tracepoint then we shouldn't add it.

The way it's printing the signature is also weird.  It's incorrectly referring
to it a "hash", and it's only showing the first 20 bytes which might just
contain PKCS#7 boilerplate and not the actual signature.  So I'm not sure what
the purpose of this is.

Also, this tracepoint gets invoked even when there is no signature, which is
confusing.

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-04 19:10 ` [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity() Andrey Albershteyn
@ 2024-03-05  0:52   ` Eric Biggers
  2024-03-06 16:30     ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-05  0:52 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong

On Mon, Mar 04, 2024 at 08:10:29PM +0100, Andrey Albershteyn wrote:
> XFS will need to know tree_blocksize to remove the tree in case of an
> error. The size is needed to calculate offsets of particular Merkle
> tree blocks.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/btrfs/verity.c        | 4 +++-
>  fs/ext4/verity.c         | 3 ++-
>  fs/f2fs/verity.c         | 3 ++-
>  fs/verity/enable.c       | 6 ++++--
>  include/linux/fsverity.h | 4 +++-
>  5 files changed, 14 insertions(+), 6 deletions(-)

How will XFS handle dropping a file's incomplete tree if the system crashes
while it's being built?

I think this is why none of the other filesystems have needed the tree_blocksize
in ->end_enable_verity() yet.  They need to be able to drop the tree from just
the information the filesystem has on-disk anyway.  ext4 and f2fs just truncate
past EOF, while btrfs has some code that finds all the verity metadata items and
deletes them (see btrfs_drop_verity_items() in fs/btrfs/verity.c).

Technically you don't *have* to drop incomplete trees, since it shouldn't cause
a behavior difference.  But it seems like something that should be cleaned up.

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing
  2024-03-04 19:10 ` [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing Andrey Albershteyn
@ 2024-03-05  1:08   ` Eric Biggers
  2024-03-07 21:58     ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-05  1:08 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong

On Mon, Mar 04, 2024 at 08:10:31PM +0100, Andrey Albershteyn wrote:
> For XFS, fsverity's global workqueue is not really suitable due to:
> 
> 1. High priority workqueues are used within XFS to ensure that data
>    IO completion cannot stall processing of journal IO completions.
>    Hence using a WQ_HIGHPRI workqueue directly in the user data IO
>    path is a potential filesystem livelock/deadlock vector.
> 
> 2. The fsverity workqueue is global - it creates a cross-filesystem
>    contention point.
> 
> This patch adds per-filesystem, per-cpu workqueue for fsverity
> work. This allows iomap to add verification work in the read path on
> BIO completion.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>

Should ext4 and f2fs switch over to this by converting
fsverity_enqueue_verify_work() to use it?  I'd prefer not to have to maintain
two separate workqueue strategies as part of the fs/verity/ infrastructure.

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1fbc72c5f112..5863519ffd51 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1223,6 +1223,8 @@ struct super_block {
>  #endif
>  #ifdef CONFIG_FS_VERITY
>  	const struct fsverity_operations *s_vop;
> +	/* Completion queue for post read verification */
> +	struct workqueue_struct *s_read_done_wq;
>  #endif

Maybe s_verity_wq?  Or do you anticipate this being used for other purposes too,
such as fscrypt?  Note that it's behind CONFIG_FS_VERITY and is allocated by an
fsverity_* function, so at least at the moment it doesn't feel very generic.

> diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
> index 0973b521ac5a..45b7c613148a 100644
> --- a/include/linux/fsverity.h
> +++ b/include/linux/fsverity.h
> @@ -241,6 +241,22 @@ void fsverity_enqueue_verify_work(struct work_struct *work);
>  void fsverity_invalidate_block(struct inode *inode,
>  		struct fsverity_blockbuf *block);
>  
> +static inline int fsverity_set_ops(struct super_block *sb,
> +				   const struct fsverity_operations *ops)

This doesn't just set the ops, but also allocates a workqueue too.  A better
name for this function might be fsverity_init_sb.

Also this shouldn't really be an inline function.

> +{
> +	sb->s_vop = ops;
> +
> +	/* Create per-sb workqueue for post read bio verification */
> +	struct workqueue_struct *wq = alloc_workqueue(
> +		"pread/%s", (WQ_FREEZABLE | WQ_MEM_RECLAIM), 0, sb->s_id);

"pread" is short for "post read", I guess?  Should it really be this generic?

> +static inline int fsverity_set_ops(struct super_block *sb,
> +				   const struct fsverity_operations *ops)
> +{
> +	return -EOPNOTSUPP;
> +}

I think it would be better to just not have a !CONFIG_FS_VERITY stub for this.

You *could* make it work like fscrypt_set_ops(), which the ubifs folks added,
where it can be called unconditionally if the filesystem has a declaration for
the operations (but not necessarily a definition).  In that case it would need
to return 0, rather than an error.  But I think I prefer just omitting the stub
and having filesystems guard the call to this by CONFIG_FS_VERITY, as you've
already done in XFS.

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 07/24] fsverity: support block-based Merkle tree caching
  2024-03-04 19:10 ` [PATCH v5 07/24] fsverity: support block-based Merkle tree caching Andrey Albershteyn
@ 2024-03-06  3:56   ` Eric Biggers
  2024-03-07 21:54     ` Darrick J. Wong
  2024-03-11 23:22   ` Darrick J. Wong
  1 sibling, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-06  3:56 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong

On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> index b3506f56e180..dad33e6ff0d6 100644
> --- a/fs/verity/fsverity_private.h
> +++ b/fs/verity/fsverity_private.h
> @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
>  
>  void __init fsverity_init_workqueue(void);
>  
> +/*
> + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> + * block->context.
> + */
> +void fsverity_drop_block(struct inode *inode,
> +			 struct fsverity_blockbuf *block);
> +
>  #endif /* _FSVERITY_PRIVATE_H */

This should be paired with a helper function that reads a Merkle tree block by
calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
being consistent with having a helper function for drop, this would prevent code
duplication between verify_data_block() and fsverity_read_merkle_tree().

I recommend that it look like this:

int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
				    unsigned long ra_bytes,
				    struct fsverity_blockbuf *block);

'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
would be the number of bytes for the filesystem to (optionally) readahead if the
block is not yet cached.  I think that things work out simpler if these values
are measured in bytes, not blocks.  'block' would be at the end because it's an
output, and it can be confusing to interleave inputs and outputs in parameters.

Also, the drop function should be named consistently:

void fsverity_drop_merkle_tree_block(struct inode *inode,
                                     struct fsverity_blockbuf *block);

It would be a bit long, but I think it's helpful for consistency and also
because there are other types of "blocks" that it could be confused with.
(Data blocks, filesystem blocks, internal blocks of the hash function.)
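
A rough sketch of the read helper, just to illustrate the shape (the
page-based branch mirrors the logic that's currently open-coded below;
the exact marshalling of size/log_blocksize into the fs hook is elided):

	int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
					    unsigned long ra_bytes,
					    struct fsverity_blockbuf *block)
	{
		const struct fsverity_operations *vops = inode->i_sb->s_vop;
		int err;

		if (vops->read_merkle_tree_block) {
			err = vops->read_merkle_tree_block(inode, pos,
							   ra_bytes, block);
		} else {
			/* Page-based caching: map the page containing pos */
			struct page *page;

			page = vops->read_merkle_tree_page(inode,
					pos >> PAGE_SHIFT,
					ra_bytes >> PAGE_SHIFT);
			if (IS_ERR(page)) {
				err = PTR_ERR(page);
			} else {
				block->kaddr = kmap_local_page(page) +
					       offset_in_page(pos);
				block->context = page;
				err = 0;
			}
		}
		if (err)
			fsverity_err(inode,
				     "Error %d reading Merkle tree block at %llu",
				     err, pos);
		return err;
	}

That also gives the error message a single home, per the comment further
down about verify_data_block().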

> diff --git a/fs/verity/open.c b/fs/verity/open.c
> index fdeb95eca3af..6e6922b4b014 100644
> --- a/fs/verity/open.c
> +++ b/fs/verity/open.c
> @@ -213,7 +213,13 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
>  	if (err)
>  		goto fail;
>  
> -	if (vi->tree_params.block_size != PAGE_SIZE) {
> +	/*
> +	 * If fs passes Merkle tree blocks to fs-verity (e.g. XFS), then
> +	 * fs-verity should use hash_block_verified bitmap as there's no page
> +	 * to mark it with PG_checked.
> +	 */
> +	if (vi->tree_params.block_size != PAGE_SIZE ||
> +			inode->i_sb->s_vop->read_merkle_tree_block) {

Technically, all filesystems "pass Merkle tree blocks to fs-verity".  Maybe
stick to calling it "block-based Merkle tree caching", as you have in the commit
title, as that seems clearer.

> diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> index f58432772d9e..5da40b5a81af 100644
> --- a/fs/verity/read_metadata.c
> +++ b/fs/verity/read_metadata.c
> @@ -18,50 +18,68 @@ static int fsverity_read_merkle_tree(struct inode *inode,
>  {
>  	const struct fsverity_operations *vops = inode->i_sb->s_vop;
>  	u64 end_offset;
> -	unsigned int offs_in_page;
> +	unsigned int offs_in_block;
>  	pgoff_t index, last_index;
>  	int retval = 0;
>  	int err = 0;
> +	const unsigned int block_size = vi->tree_params.block_size;
> +	const u8 log_blocksize = vi->tree_params.log_blocksize;
>  
>  	end_offset = min(offset + length, vi->tree_params.tree_size);
>  	if (offset >= end_offset)
>  		return 0;
> -	offs_in_page = offset_in_page(offset);
> -	last_index = (end_offset - 1) >> PAGE_SHIFT;
> +	offs_in_block = offset & (block_size - 1);
> +	last_index = (end_offset - 1) >> log_blocksize;
>  
>  	/*
> -	 * Iterate through each Merkle tree page in the requested range and copy
> -	 * the requested portion to userspace.  Note that the Merkle tree block
> -	 * size isn't important here, as we are returning a byte stream; i.e.,
> -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> +	 * Iterate through each Merkle tree block in the requested range and
> +	 * copy the requested portion to userspace. Note that we are returning
> +	 * a byte stream.
>  	 */
> -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> +	for (index = offset >> log_blocksize; index <= last_index; index++) {
>  		unsigned long num_ra_pages =
>  			min_t(unsigned long, last_index - index + 1,
>  			      inode->i_sb->s_bdi->io_pages);
>  		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> -						   PAGE_SIZE - offs_in_page);
> -		struct page *page;
> -		const void *virt;
> +						   block_size - offs_in_block);
> +		struct fsverity_blockbuf block = {
> +			.size = block_size,
> +		};
>  
> -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> -		if (IS_ERR(page)) {
> -			err = PTR_ERR(page);
> +		if (!vops->read_merkle_tree_block) {
> +			unsigned int blocks_per_page =
> +				vi->tree_params.blocks_per_page;
> +			unsigned long page_idx =
> +				round_down(index, blocks_per_page);
> +			struct page *page = vops->read_merkle_tree_page(inode,
> +					page_idx, num_ra_pages);
> +
> +			if (IS_ERR(page)) {
> +				err = PTR_ERR(page);
> +			} else {
> +				block.kaddr = kmap_local_page(page) +
> +					((index - page_idx) << log_blocksize);
> +				block.context = page;
> +			}
> +		} else {
> +			err = vops->read_merkle_tree_block(inode,
> +					index << log_blocksize,
> +					&block, log_blocksize, num_ra_pages);
> +		}
> +
> +		if (err) {
>  			fsverity_err(inode,
> -				     "Error %d reading Merkle tree page %lu",
> -				     err, index);
> +				     "Error %d reading Merkle tree block %lu",
> +				     err, index << log_blocksize);
>  			break;
>  		}
>  
> -		virt = kmap_local_page(page);
> -		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
> -			kunmap_local(virt);
> -			put_page(page);
> +		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
> +			fsverity_drop_block(inode, &block);
>  			err = -EFAULT;
>  			break;
>  		}
> -		kunmap_local(virt);
> -		put_page(page);
> +		fsverity_drop_block(inode, &block);
>  
>  		retval += bytes_to_copy;
>  		buf += bytes_to_copy;
> @@ -72,7 +90,7 @@ static int fsverity_read_merkle_tree(struct inode *inode,
>  			break;
>  		}
>  		cond_resched();
> -		offs_in_page = 0;
> +		offs_in_block = 0;
>  	}
>  	return retval ? retval : err;
>  }

This is one of the two places that would be simplified quite a bit by adding an
fsverity_read_merkle_tree_block() helper function, especially if the loop
variable was changed to measure bytes instead of blocks so that the Merkle tree
block position and readahead amount can more easily be computed in bytes.  E.g.:

static int fsverity_read_merkle_tree(struct inode *inode,
				     const struct fsverity_info *vi,
				     void __user *buf, u64 pos, int length)
{
	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
	[...]

	while (pos < end_pos) {
		// bytes_to_copy = remaining length in current block
		[...]
		err = fsverity_read_merkle_tree_block(inode,
						      pos - offs_in_block,
						      ra_bytes,
						      &block);
		[...]
		pos += bytes_to_copy;
		[...]
	}

> diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> index 4fcad0825a12..de71911d400c 100644
> --- a/fs/verity/verify.c
> +++ b/fs/verity/verify.c
> @@ -13,14 +13,17 @@
>  static struct workqueue_struct *fsverity_read_workqueue;
>  
>  /*
> - * Returns true if the hash block with index @hblock_idx in the tree, located in
> - * @hpage, has already been verified.
> + * Returns true if the hash block with index @hblock_idx in the tree has
> + * already been verified.
>   */
> -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> +static bool is_hash_block_verified(struct inode *inode,
> +				   struct fsverity_blockbuf *block,
>  				   unsigned long hblock_idx)
>  {
>  	unsigned int blocks_per_page;
>  	unsigned int i;
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	struct page *hpage = (struct page *)block->context;
>  
>  	/*
>  	 * When the Merkle tree block size and page size are the same, then the
> @@ -34,6 +37,12 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
>  	if (!vi->hash_block_verified)
>  		return PageChecked(hpage);
>  
> +	/*
> +	 * Filesystems which use block based caching (e.g. XFS) always use
> +	 * bitmap.
> +	 */
> +	if (inode->i_sb->s_vop->read_merkle_tree_block)
> +		return test_bit(hblock_idx, vi->hash_block_verified);
>  	/*
>  	 * When the Merkle tree block size and page size differ, we use a bitmap
>  	 * to indicate whether each hash block has been verified.

This is hard to follow because the code goes from handling page-based caching to
handling block-based caching to handling page-based caching again.  Also, the
comment says that block-based caching uses the bitmap, but it doesn't give any
hint at why the bitmap is being handled differently in that case from the
page-based caching case that also uses the bitmap.

How about doing something like the following where the block-based caching case
is handled first, before either page-based caching case:

	/*
	 * If the filesystem uses block-based caching, then
	 * ->hash_block_verified is always used and the filesystem pushes
	 * invalidations to it as needed.
	 */
	if (inode->i_sb->s_vop->read_merkle_tree_block)
		return test_bit(hblock_idx, vi->hash_block_verified);

	/* Otherwise, the filesystem uses page-based caching. */
	hpage = (struct page *)block->context;

Note, to avoid confusion, the 'hpage' variable shouldn't be set until it's been
verified that the filesystem is using page-based caching.

> @@ -95,15 +104,15 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> verify_data_block(struct inode *inode, struct fsverity_info *vi,
>		  const void *data, u64 data_pos, unsigned long max_ra_pages)

max_ra_pages should become max_ra_bytes.

> +	int num_ra_pages;

unsigned long ra_bytes.  Also, this should be declared in the scope later on
where it's actually needed.

> @@ -144,10 +153,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		unsigned long next_hidx;
>  		unsigned long hblock_idx;
>  		pgoff_t hpage_idx;
> +		u64 hblock_pos;
>  		unsigned int hblock_offset_in_page;

hpage_idx and hblock_offset_in_page should both go away.
fsverity_read_merkle_tree_block() should do the conversion to/from pages.

>  		unsigned int hoffset;
>  		struct page *hpage;
> -		const void *haddr;
> +		struct fsverity_blockbuf *block = &hblocks[level].block;
>  
>  		/*
>  		 * The index of the block in the current level; also the index
> @@ -165,29 +175,49 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>		/* Index of the hash block in the tree overall */
>		hblock_idx = params->level_start[level] + next_hidx;
>
>		/* Byte offset of the hash block within the page */
>  		hblock_offset_in_page =
>  			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
>  
> +		/* Offset of the Merkle tree block into the tree */
> +		hblock_pos = hblock_idx << params->log_blocksize;

The comment for hblock_pos should say:

	/* Byte offset of the hash block in the tree overall */

... so that it's consistent with the one for hblock_idx which says:

	/* Index of the hash block in the tree overall */

> +		num_ra_pages = level == 0 ?
> +			min(max_ra_pages, params->tree_pages - hpage_idx) : 0;
> +
> +		if (inode->i_sb->s_vop->read_merkle_tree_block) {
> +			err = inode->i_sb->s_vop->read_merkle_tree_block(
> +				inode, hblock_pos, block, params->log_blocksize,
> +				num_ra_pages);
> +		} else {
> +			unsigned int blocks_per_page =
> +				vi->tree_params.blocks_per_page;
> +			hblock_idx = round_down(hblock_idx, blocks_per_page);
> +			hpage = inode->i_sb->s_vop->read_merkle_tree_page(
> +				inode, hpage_idx, (num_ra_pages << PAGE_SHIFT));
> +
> +			if (IS_ERR(hpage)) {
> +				err = PTR_ERR(hpage);
> +			} else {
> +				block->kaddr = kmap_local_page(hpage) +
> +					hblock_offset_in_page;
> +				block->context = hpage;
> +			}
> +		}

This has the weirdness with the readahead amount again, where it's still being
measured in pages.  I think it would make more sense to measure it in bytes.  It
would stay bytes when passed to ->read_merkle_tree_block, or be converted to
pages at the last minute for ->read_merkle_tree_page which works in pages.

> +		if (err) {
>  			fsverity_err(inode,
> -				     "Error %ld reading Merkle tree page %lu",
> -				     PTR_ERR(hpage), hpage_idx);
> +				     "Error %d reading Merkle tree block %lu",
> +				     err, hblock_idx);
>  			goto error;
>  		}

This error message should go in fsverity_read_merkle_tree_block().

Doing that would also eliminate the need to add an 'err' variable to
verify_data_block(), which could be confusing because it returns a bool.

>  	/* Descend the tree verifying hash blocks. */
>  	for (; level > 0; level--) {
> -		struct page *hpage = hblocks[level - 1].page;
> -		const void *haddr = hblocks[level - 1].addr;
> +		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
> +		const void *haddr = block->kaddr;
>  		unsigned long hblock_idx = hblocks[level - 1].index;
>  		unsigned int hoffset = hblocks[level - 1].hoffset;
> +		struct page *hpage = (struct page *)block->context;
>  
>  		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
>  			goto error;
> @@ -217,8 +248,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>		if (vi->hash_block_verified)
>			set_bit(hblock_idx, vi->hash_block_verified);
>		else
>  			SetPageChecked(hpage);

Since the page is only valid when vi->hash_block_verified is NULL, remove the
hpage variable and do 'SetPageChecked((struct page *)block->context);'.
Yes, there's no issue until the pointer is actually dereferenced, but it's
confusing to have incorrectly-typed pointer variables sitting around.

> +/**
> + * fsverity_invalidate_block() - invalidate Merkle tree block
> + * @inode: inode to which this Merkle tree blocks belong
> + * @block: block to be invalidated
> + *
> + * This function invalidates/clears "verified" state of Merkle tree block
> + * in the fs-verity bitmap. The block needs to have ->offset set.
> + */
> +void fsverity_invalidate_block(struct inode *inode,
> +		struct fsverity_blockbuf *block)
> +{
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
> +
> +	if (block->offset > vi->tree_params.tree_size) {
> +		fsverity_err(inode,
> +"Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
> +			     vi->tree_params.tree_size, block->offset);
> +		return;
> +	}
> +
> +	clear_bit(block->offset >> log_blocksize, vi->hash_block_verified);
> +}
> +EXPORT_SYMBOL_GPL(fsverity_invalidate_block);

I don't think this function should be using struct fsverity_blockbuf.  It only
uses the 'offset' field, which is orthogonal to what ->read_merkle_tree_block
and ->drop_merkle_tree_block use.

The function name fsverity_invalidate_block() also has the issue where it's not
clear what type of block it's talking about.

How about changing the prototype to:

void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);

Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
can it happen in cases like filesystem corruption?  If it can only happen due to
kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.

Please note that there's also an off-by-one error.  pos > tree_size should be
pos >= tree_size.
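
Putting those points together (the suggested prototype, WARN_ON_ONCE
assuming an out-of-range pos really can only come from a kernel bug, and
the >= fix), the function could end up looking roughly like:

	void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos)
	{
		struct fsverity_info *vi = inode->i_verity_info;

		/* pos beyond the tree would indicate a kernel bug */
		if (WARN_ON_ONCE(pos >= vi->tree_params.tree_size))
			return;

		clear_bit(pos >> vi->tree_params.log_blocksize,
			  vi->hash_block_verified);
	}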

> +void fsverity_drop_block(struct inode *inode,
> +		struct fsverity_blockbuf *block)

fsverity_drop_merkle_tree_block

> +{
> +	if (inode->i_sb->s_vop->drop_block)
> +		inode->i_sb->s_vop->drop_block(block);
> +	else {

Add braces above.

(See line 213 of Documentation/process/coding-style.rst)

> +		struct page *page = (struct page *)block->context;
> +
> +		kunmap_local(block->kaddr);
> +		put_page(page);

put_page((struct page *)block->context), and remove the page variable

> +	}
> +	block->kaddr = NULL;

Also set block->context = NULL
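
Folding all of the above in, the drop helper would look something like:

	void fsverity_drop_merkle_tree_block(struct inode *inode,
					     struct fsverity_blockbuf *block)
	{
		if (inode->i_sb->s_vop->drop_merkle_tree_block) {
			inode->i_sb->s_vop->drop_merkle_tree_block(block);
		} else {
			/* Page-based caching: unmap and release the page */
			kunmap_local(block->kaddr);
			put_page((struct page *)block->context);
		}
		block->kaddr = NULL;
		block->context = NULL;
	}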

> +/**
> + * struct fsverity_blockbuf - Merkle Tree block buffer
> + * @kaddr: virtual address of the block's data
> + * @offset: block's offset into Merkle tree
> + * @size: the Merkle tree block size
> + * @context: filesystem private context
> + *
> + * Buffer containing single Merkle Tree block. These buffers are passed
> + *  - to filesystem, when fs-verity is building merkel tree,
> + *  - from filesystem, when fs-verity is reading merkle tree from a disk.
> + * Filesystems sets kaddr together with size to point to a memory which contains
> + * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
> + * written down to disk.
> + *
> + * While reading the tree, fs-verity calls ->read_merkle_tree_block followed by
> + * ->drop_block to let filesystem know that memory can be freed.
> + *
> + * The context is optional. This field can be used by filesystem to passthrough
> + * state from ->read_merkle_tree_block to ->drop_block.
> + */
> +struct fsverity_blockbuf {
> +	void *kaddr;
> +	u64 offset;
> +	unsigned int size;
> +	void *context;
> +};

If this is only used by ->read_merkle_tree_block and ->drop_merkle_tree_block,
then it can be simplified.  The only fields needed are kaddr and context.

The comment also still incorrectly claims that the struct is used for writes.

> +	/**
> +	 * Read a Merkle tree block of the given inode.
> +	 * @inode: the inode
> +	 * @pos: byte offset of the block within the Merkle tree
> +	 * @block: block buffer for filesystem to point it to the block
> +	 * @log_blocksize: log2 of the size of the expected block
> +	 * @ra_bytes: The number of bytes that should be
> +	 *		prefetched starting at @pos if the page at @pos
> +	 *		isn't already cached.  Implementations may ignore this
> +	 *		argument; it's only a performance optimization.
> +	 *
> +	 * This can be called at any time on an open verity file.  It may be
> +	 * called by multiple processes concurrently.
> +	 *
> +	 * In case that block was evicted from the memory filesystem has to use
> +	 * fsverity_invalidate_block() to let fsverity know that block's
> +	 * verification state is not valid anymore.
> +	 *
> +	 * Return: 0 on success, -errno on failure
> +	 */
> +	int (*read_merkle_tree_block)(struct inode *inode,
> +				      u64 pos,
> +				      struct fsverity_blockbuf *block,
> +				      unsigned int log_blocksize,
> +				      u64 ra_bytes);

How about:
	
	int (*read_merkle_tree_block)(struct inode *inode,
				      u64 pos,
				      unsigned int size,
				      unsigned long ra_bytes,
				      struct fsverity_blockbuf *block);

- size instead of log_blocksize, for consistency with other functions and
  because it's what XFS actually wants

- ra_bytes as unsigned long, since u64 readahead amounts don't really make
  sense.  (This isn't too important though; it could be u64 if you really want.)

- block at the end since it's an output parameter

It also needs to be clearly documented that filesystems must implement either
->read_merkle_tree_page, *or* both ->read_merkle_tree_block and
->drop_merkle_tree_block.

Also, the comment about invalidating the block is still unclear regarding the
timing of when the filesystem must do it.

> +
> +	/**
> +	 * Release the reference to a Merkle tree block
> +	 *
> +	 * @block: the block to release
> +	 *
> +	 * This is called when fs-verity is done with a block obtained with
> +	 * ->read_merkle_tree_block().
> +	 */
> +	void (*drop_block)(struct fsverity_blockbuf *block);

drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 21/24] xfs: add fs-verity support
  2024-03-04 19:10 ` [PATCH v5 21/24] xfs: add fs-verity support Andrey Albershteyn
@ 2024-03-06  4:55   ` Eric Biggers
  2024-03-06  5:01     ` Eric Biggers
  2024-03-07 23:10   ` Darrick J. Wong
  1 sibling, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-06  4:55 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong

On Mon, Mar 04, 2024 at 08:10:44PM +0100, Andrey Albershteyn wrote:
> +static void
> +xfs_verity_put_listent(
> +	struct xfs_attr_list_context	*context,
> +	int				flags,
> +	unsigned char			*name,
> +	int				namelen,
> +	int				valuelen)
> +{
> +	struct fsverity_blockbuf	block = {
> +		.offset = xfs_fsverity_name_to_block_offset(name),
> +		.size = valuelen,
> +	};
> +	/*
> +	 * Verity descriptor is smaller than 1024; verity block min size is
> +	 * 1024. Exclude verity descriptor
> +	 */
> +	if (valuelen < 1024)
> +		return;
> +

Is there no way to directly check whether it's the verity descriptor?  The
'valuelen < 1024' check is fragile because it will break if support for smaller
Merkle tree block sizes is ever added.  (Silently, because this is doing
invalidation which is hard to test and we need to be super careful with.)

If you really must introduce the assumption that the Merkle tree block size is
at least 1024, this needs to be documented in the comment in
fsverity_init_merkle_tree_params() that explains the reasoning behind the
current restrictions on the Merkle tree block size.

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 21/24] xfs: add fs-verity support
  2024-03-06  4:55   ` Eric Biggers
@ 2024-03-06  5:01     ` Eric Biggers
  0 siblings, 0 replies; 94+ messages in thread
From: Eric Biggers @ 2024-03-06  5:01 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, djwong

On Tue, Mar 05, 2024 at 08:55:43PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:44PM +0100, Andrey Albershteyn wrote:
> > +static void
> > +xfs_verity_put_listent(
> > +	struct xfs_attr_list_context	*context,
> > +	int				flags,
> > +	unsigned char			*name,
> > +	int				namelen,
> > +	int				valuelen)
> > +{
> > +	struct fsverity_blockbuf	block = {
> > +		.offset = xfs_fsverity_name_to_block_offset(name),
> > +		.size = valuelen,
> > +	};
> > +	/*
> > +	 * Verity descriptor is smaller than 1024; verity block min size is
> > +	 * 1024. Exclude verity descriptor
> > +	 */
> > +	if (valuelen < 1024)
> > +		return;
> > +
> 
> Is there no way to directly check whether it's the verity descriptor?  The
> 'valuelen < 1024' check is fragile because it will break if support for smaller
> Merkle tree block sizes is ever added.  (Silently, because this is doing
> invalidation which is hard to test and we need to be super careful with.)
> 
> If you really must introduce the assumption that the Merkle tree block size is
> at least 1024, this needs to be documented in the comment in
> fsverity_init_merkle_tree_params() that explains the reasoning behind the
> current restrictions on the Merkle tree block size.

Also, the verity descriptor can be >= 1024 bytes if there is a large builtin
signature attached to it.

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-05  0:52   ` Eric Biggers
@ 2024-03-06 16:30     ` Darrick J. Wong
  2024-03-07 22:02       ` Eric Biggers
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-06 16:30 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Mon, Mar 04, 2024 at 04:52:42PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:29PM +0100, Andrey Albershteyn wrote:
> > XFS will need to know tree_blocksize to remove the tree in case of an
> > error. The size is needed to calculate offsets of particular Merkle
> > tree blocks.
> > 
> > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > ---
> >  fs/btrfs/verity.c        | 4 +++-
> >  fs/ext4/verity.c         | 3 ++-
> >  fs/f2fs/verity.c         | 3 ++-
> >  fs/verity/enable.c       | 6 ++++--
> >  include/linux/fsverity.h | 4 +++-
> >  5 files changed, 14 insertions(+), 6 deletions(-)
> 
> How will XFS handle dropping a file's incomplete tree if the system crashes
> while it's being built?

AFAICT it simply leaves the half-constructed tree in the xattrs data.

> I think this is why none of the other filesystems have needed the tree_blocksize
> in ->end_enable_verity() yet.  They need to be able to drop the tree from just
> the information the filesystem has on-disk anyway.  ext4 and f2fs just truncate
> past EOF, while btrfs has some code that finds all the verity metadata items and
> deletes them (see btrfs_drop_verity_items() in fs/btrfs/verity.c).
> 
> Technically you don't *have* to drop incomplete trees, since it shouldn't cause
> a behavior difference.  But it seems like something that should be cleaned up.

If it's required that a failed FS_IOC_ENABLE_VERITY clean up the
unfinished merkle tree, then you'd have to introduce some kind of log
intent item to roll back blocks from an unfinished tree.  That log
item has to be committed as the first item in a chain of transactions,
each of which adds a merkle tree block and relogs the rollback item.
When we finish the tree, we log a done item to whiteout the intent.

Log recovery, upon finding an intent without the done item, will replay
the intent, which (in this case) will remove all the blocks.  In theory
a failed merkle tree commit also should do this, but most likely that
will cause an fs shutdown anyway.

(That's a fair amount of work.)

Or you could leave the unfinished tree as-is; that will waste space, but
if userspace tries again, the xattr code will replace the old merkle
tree block contents with the new ones.  This assumes that we're not
using XATTR_CREATE during FS_IOC_ENABLE_VERITY.

--D

> 
> - Eric
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 05/24] fs: add FS_XFLAG_VERITY for verity files
  2024-03-04 22:35   ` Eric Biggers
@ 2024-03-07 21:39     ` Darrick J. Wong
  2024-03-07 22:06       ` Eric Biggers
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 21:39 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Mon, Mar 04, 2024 at 02:35:48PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:28PM +0100, Andrey Albershteyn wrote:
> > @@ -641,6 +645,13 @@ static int fileattr_set_prepare(struct inode *inode,
> >  	    !(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
> >  		return -EINVAL;
> >  
> > +	/*
> > +	 * Verity cannot be set through FS_IOC_FSSETXATTR/FS_IOC_SETFLAGS.
> > +	 * See FS_IOC_ENABLE_VERITY
> > +	 */
> > +	if (fa->fsx_xflags & FS_XFLAG_VERITY)
> > +		return -EINVAL;
> 
> This makes FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR start failing on files that
> already have verity enabled.
> 
> An error should only be returned when the new flags contain verity and the old
> flags don't.

What if the old flags have it and the new ones don't?  Is that supposed
to disable fsverity?  Is removal of the verity information not supported?

I'm guessing that removal isn't supposed to happen, in which case the
above check ought to be:

	if (!!IS_VERITY(inode) != !!(fa->fsx_xflags & FS_XFLAG_VERITY))
		return -EINVAL;

Right?

--D

> - Eric
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 07/24] fsverity: support block-based Merkle tree caching
  2024-03-06  3:56   ` Eric Biggers
@ 2024-03-07 21:54     ` Darrick J. Wong
  2024-03-07 22:49       ` Eric Biggers
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 21:54 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > index b3506f56e180..dad33e6ff0d6 100644
> > --- a/fs/verity/fsverity_private.h
> > +++ b/fs/verity/fsverity_private.h
> > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> >  
> >  void __init fsverity_init_workqueue(void);
> >  
> > +/*
> > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > + * block->context.
> > + */
> > +void fsverity_drop_block(struct inode *inode,
> > +			 struct fsverity_blockbuf *block);
> > +
> >  #endif /* _FSVERITY_PRIVATE_H */
> 
> This should be paired with a helper function that reads a Merkle tree block by
> calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> being consistent with having a helper function for drop, this would prevent code
> duplication between verify_data_block() and fsverity_read_merkle_tree().
> 
> I recommend that it look like this:
> 
> int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> 				    unsigned long ra_bytes,
> 				    struct fsverity_blockbuf *block);
> 
> 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> would be the number of bytes for the filesystem to (optionally) readahead if the
> block is not yet cached.  I think that things work out simpler if these values
> are measured in bytes, not blocks.  'block' would be at the end because it's an
> output, and it can be confusing to interleave inputs and outputs in parameters.

FWIW I don't really like 'pos' here because that's usually short for
"file position", which is a byte, and this looks a lot more like a
merkle tree block number.

u64 blkno?

Or better yet use a typedef ("merkle_blkno_t") to make it really clear
when we're dealing with a tree block number.  Ignore checkpatch
complaining about typedefs. :)
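
i.e. something minimal like (sketch only; whether the unit stays bytes
or becomes blocks is the open question here):

	/* Merkle tree block number, to distinguish it from byte offsets */
	typedef u64 merkle_blkno_t;

	int fsverity_read_merkle_tree_block(struct inode *inode,
					    merkle_blkno_t blkno,
					    unsigned long ra_bytes,
					    struct fsverity_blockbuf *block);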

> Also, the drop function should be named consistently:
> 
> void fsverity_drop_merkle_tree_block(struct inode *inode,
>                                      struct fsverity_blockbuf *block);
> 
> It would be a bit long, but I think it's helpful for consistency and also
> because there are other types of "blocks" that it could be confused with.
> (Data blocks, filesystem blocks, internal blocks of the hash function.)

I agree.

> > diff --git a/fs/verity/open.c b/fs/verity/open.c
> > index fdeb95eca3af..6e6922b4b014 100644
> > --- a/fs/verity/open.c
> > +++ b/fs/verity/open.c
> > @@ -213,7 +213,13 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
> >  	if (err)
> >  		goto fail;
> >  
> > -	if (vi->tree_params.block_size != PAGE_SIZE) {
> > +	/*
> > +	 * If fs passes Merkle tree blocks to fs-verity (e.g. XFS), then
> > +	 * fs-verity should use hash_block_verified bitmap as there's no page
> > +	 * to mark it with PG_checked.
> > +	 */
> > +	if (vi->tree_params.block_size != PAGE_SIZE ||
> > +			inode->i_sb->s_vop->read_merkle_tree_block) {
> 
> Technically, all filesystems "pass Merkle tree blocks to fs-verity".  Maybe
> stick to calling it "block-based Merkle tree caching", as you have in the commit
> title, as that seems clearer.
> 
> > diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> > index f58432772d9e..5da40b5a81af 100644
> > --- a/fs/verity/read_metadata.c
> > +++ b/fs/verity/read_metadata.c
> > @@ -18,50 +18,68 @@ static int fsverity_read_merkle_tree(struct inode *inode,
> >  {
> >  	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> >  	u64 end_offset;
> > -	unsigned int offs_in_page;
> > +	unsigned int offs_in_block;
> >  	pgoff_t index, last_index;
> >  	int retval = 0;
> >  	int err = 0;
> > +	const unsigned int block_size = vi->tree_params.block_size;
> > +	const u8 log_blocksize = vi->tree_params.log_blocksize;
> >  
> >  	end_offset = min(offset + length, vi->tree_params.tree_size);
> >  	if (offset >= end_offset)
> >  		return 0;
> > -	offs_in_page = offset_in_page(offset);
> > -	last_index = (end_offset - 1) >> PAGE_SHIFT;
> > +	offs_in_block = offset & (block_size - 1);
> > +	last_index = (end_offset - 1) >> log_blocksize;
> >  
> >  	/*
> > -	 * Iterate through each Merkle tree page in the requested range and copy
> > -	 * the requested portion to userspace.  Note that the Merkle tree block
> > -	 * size isn't important here, as we are returning a byte stream; i.e.,
> > -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> > +	 * Iterate through each Merkle tree block in the requested range and
> > +	 * copy the requested portion to userspace. Note that we are returning
> > +	 * a byte stream.
> >  	 */
> > -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> > +	for (index = offset >> log_blocksize; index <= last_index; index++) {
> >  		unsigned long num_ra_pages =
> >  			min_t(unsigned long, last_index - index + 1,
> >  			      inode->i_sb->s_bdi->io_pages);
> >  		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> > -						   PAGE_SIZE - offs_in_page);
> > -		struct page *page;
> > -		const void *virt;
> > +						   block_size - offs_in_block);
> > +		struct fsverity_blockbuf block = {
> > +			.size = block_size,
> > +		};
> >  
> > -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> > -		if (IS_ERR(page)) {
> > -			err = PTR_ERR(page);
> > +		if (!vops->read_merkle_tree_block) {
> > +			unsigned int blocks_per_page =
> > +				vi->tree_params.blocks_per_page;
> > +			unsigned long page_idx =
> > +				round_down(index, blocks_per_page);
> > +			struct page *page = vops->read_merkle_tree_page(inode,
> > +					page_idx, num_ra_pages);
> > +
> > +			if (IS_ERR(page)) {
> > +				err = PTR_ERR(page);
> > +			} else {
> > +				block.kaddr = kmap_local_page(page) +
> > +					((index - page_idx) << log_blocksize);
> > +				block.context = page;
> > +			}
> > +		} else {
> > +			err = vops->read_merkle_tree_block(inode,
> > +					index << log_blocksize,
> > +					&block, log_blocksize, num_ra_pages);
> > +		}
> > +
> > +		if (err) {
> >  			fsverity_err(inode,
> > -				     "Error %d reading Merkle tree page %lu",
> > -				     err, index);
> > +				     "Error %d reading Merkle tree block %lu",
> > +				     err, index << log_blocksize);
> >  			break;
> >  		}
> >  
> > -		virt = kmap_local_page(page);
> > -		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
> > -			kunmap_local(virt);
> > -			put_page(page);
> > +		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
> > +			fsverity_drop_block(inode, &block);
> >  			err = -EFAULT;
> >  			break;
> >  		}
> > -		kunmap_local(virt);
> > -		put_page(page);
> > +		fsverity_drop_block(inode, &block);
> >  
> >  		retval += bytes_to_copy;
> >  		buf += bytes_to_copy;
> > @@ -72,7 +90,7 @@ static int fsverity_read_merkle_tree(struct inode *inode,
> >  			break;
> >  		}
> >  		cond_resched();
> > -		offs_in_page = 0;
> > +		offs_in_block = 0;
> >  	}
> >  	return retval ? retval : err;
> >  }
> 
> This is one of the two places that would be simplified quite a bit by adding an
> fsverity_read_merkle_tree_block() helper function, especially if the loop
> variable was changed to measure bytes instead of blocks so that the Merkle tree
> block position and readahead amount can more easily be computed in bytes.  E.g.:

Yes!  I really agree here -- I downloaded the git repo, tried to read
this function, and felt that it would be a lot easier to understand if
the two access paths were separate functions instead of a lot of
switched logic within one function.

> static int fsverity_read_merkle_tree(struct inode *inode,
> 				     const struct fsverity_info *vi,
> 				     void __user *buf, u64 pos, int length)
> {
> 	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> 	[...]
> 
> 	while (pos < end_pos) {
> 		// bytes_to_copy = remaining length in current block
> 		[...]
> 		err = fsverity_read_merkle_tree_block(inode,
> 						      pos - offs_in_block,
> 						      ra_bytes,
> 						      &block);
> 		[...]
> 		pos += bytes_to_copy;
> 		[...]
> 	}
> 
> > diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> > index 4fcad0825a12..de71911d400c 100644
> > --- a/fs/verity/verify.c
> > +++ b/fs/verity/verify.c
> > @@ -13,14 +13,17 @@
> >  static struct workqueue_struct *fsverity_read_workqueue;
> >  
> >  /*
> > - * Returns true if the hash block with index @hblock_idx in the tree, located in
> > - * @hpage, has already been verified.
> > + * Returns true if the hash block with index @hblock_idx in the tree has
> > + * already been verified.
> >   */
> > -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> > +static bool is_hash_block_verified(struct inode *inode,
> > +				   struct fsverity_blockbuf *block,
> >  				   unsigned long hblock_idx)
> >  {
> >  	unsigned int blocks_per_page;
> >  	unsigned int i;
> > +	struct fsverity_info *vi = inode->i_verity_info;
> > +	struct page *hpage = (struct page *)block->context;
> >  
> >  	/*
> >  	 * When the Merkle tree block size and page size are the same, then the
> > @@ -34,6 +37,12 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> >  	if (!vi->hash_block_verified)
> >  		return PageChecked(hpage);
> >  
> > +	/*
> > +	 * Filesystems which use block based caching (e.g. XFS) always use
> > +	 * bitmap.
> > +	 */
> > +	if (inode->i_sb->s_vop->read_merkle_tree_block)
> > +		return test_bit(hblock_idx, vi->hash_block_verified);
> >  	/*
> >  	 * When the Merkle tree block size and page size differ, we use a bitmap
> >  	 * to indicate whether each hash block has been verified.
> 
> This is hard to follow because the code goes from handling page-based caching to
> handling block-based caching to handling page-based caching again.  Also, the
> comment says that block-based caching uses the bitmap, but it doesn't give any
> hint at why the bitmap is being handled differently in that case from the
> page-based caching case that also uses the bitmap.
> 
> How about doing something like the following where the block-based caching case
> is handled first, before either page-based caching case:
> 
> 	/*
> 	 * If the filesystem uses block-based caching, then
> 	 * ->hash_block_verified is always used and the filesystem pushes
> 	 * invalidations to it as needed.
> 	 */
> 	if (inode->i_sb->s_vop->read_merkle_tree_block)
> 		return test_bit(hblock_idx, vi->hash_block_verified);
> 
> 	/* Otherwise, the filesystem uses page-based caching. */
> 	hpage = (struct page *)block->context;
> 
> Note, to avoid confusion, the 'hpage' variable shouldn't be set until it's been
> verified that the filesystem is using page-based caching.

(Yep.)

> > @@ -95,15 +104,15 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> > verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >		  const void *data, u64 data_pos, unsigned long max_ra_pages)
> 
> max_ra_pages should become max_ra_bytes.
> 
> > +	int num_ra_pages;
> 
> unsigned long ra_bytes.  Also, this should be declared in the scope later on
> where it's actually needed.

Agreed.

> > @@ -144,10 +153,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >  		unsigned long next_hidx;
> >  		unsigned long hblock_idx;
> >  		pgoff_t hpage_idx;
> > +		u64 hblock_pos;
> >  		unsigned int hblock_offset_in_page;
> 
> hpage_idx and hblock_offset_in_page should both go away.
> fsverity_read_merkle_tree_block() should do the conversion to/from pages.
> 
> >  		unsigned int hoffset;
> >  		struct page *hpage;
> > -		const void *haddr;
> > +		struct fsverity_blockbuf *block = &hblocks[level].block;
> >  
> >  		/*
> >  		 * The index of the block in the current level; also the index
> > @@ -165,29 +175,49 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >		/* Index of the hash block in the tree overall */
> >		hblock_idx = params->level_start[level] + next_hidx;
> >
> >		/* Byte offset of the hash block within the page */
> >  		hblock_offset_in_page =
> >  			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
> >  
> > +		/* Offset of the Merkle tree block into the tree */
> > +		hblock_pos = hblock_idx << params->log_blocksize;
> 
> The comment for hblock_pos should say:
> 
> 	/* Byte offset of the hash block in the tree overall */
> 
> ... so that it's consistent with the one for hblock_idx which says:
> 
> 	/* Index of the hash block in the tree overall */
> 
> > +		num_ra_pages = level == 0 ?
> > +			min(max_ra_pages, params->tree_pages - hpage_idx) : 0;
> > +
> > +		if (inode->i_sb->s_vop->read_merkle_tree_block) {
> > +			err = inode->i_sb->s_vop->read_merkle_tree_block(
> > +				inode, hblock_pos, block, params->log_blocksize,
> > +				num_ra_pages);
> > +		} else {
> > +			unsigned int blocks_per_page =
> > +				vi->tree_params.blocks_per_page;
> > +			hblock_idx = round_down(hblock_idx, blocks_per_page);
> > +			hpage = inode->i_sb->s_vop->read_merkle_tree_page(
> > +				inode, hpage_idx, (num_ra_pages << PAGE_SHIFT));
> > +
> > +			if (IS_ERR(hpage)) {
> > +				err = PTR_ERR(hpage);
> > +			} else {
> > +				block->kaddr = kmap_local_page(hpage) +
> > +					hblock_offset_in_page;
> > +				block->context = hpage;
> > +			}
> > +		}
> 
> This has the weirdness with the readahead amount again, where it's still being
> measured in pages.  I think it would make more sense to measure it in bytes.  It
> would stay bytes when passed to ->read_merkle_tree_block, or be converted to
> pages at the last minute for ->read_merkle_tree_page which works in pages.
> 
> > +		if (err) {
> >  			fsverity_err(inode,
> > -				     "Error %ld reading Merkle tree page %lu",
> > -				     PTR_ERR(hpage), hpage_idx);
> > +				     "Error %d reading Merkle tree block %lu",
> > +				     err, hblock_idx);
> >  			goto error;
> >  		}
> 
> This error message should go in fsverity_read_merkle_tree_block().
> 
> Doing that would also eliminate the need to add an 'err' variable to
> verify_data_block(), which could be confusing because it returns a bool.
> 
> >  	/* Descend the tree verifying hash blocks. */
> >  	for (; level > 0; level--) {
> > -		struct page *hpage = hblocks[level - 1].page;
> > -		const void *haddr = hblocks[level - 1].addr;
> > +		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
> > +		const void *haddr = block->kaddr;
> >  		unsigned long hblock_idx = hblocks[level - 1].index;
> >  		unsigned int hoffset = hblocks[level - 1].hoffset;
> > +		struct page *hpage = (struct page *)block->context;
> >  
> >  		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
> >  			goto error;
> > @@ -217,8 +248,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >		if (vi->hash_block_verified)
> >			set_bit(hblock_idx, vi->hash_block_verified);
> >		else
> >  			SetPageChecked(hpage);
> 
> Since the page is only valid when vi->hash_block_verified is NULL, remove the
> hpage variable and do 'SetPageChecked((struct page *)block->context);'.
> Yes, there's no issue until the pointer is actually dereferenced, but it's
> confusing to have incorrectly-typed pointer variables sitting around.
> 
> > +/**
> > + * fsverity_invalidate_block() - invalidate Merkle tree block
> > + * @inode: inode to which this Merkle tree blocks belong
> > + * @block: block to be invalidated
> > + *
> > + * This function invalidates/clears "verified" state of Merkle tree block
> > + * in the fs-verity bitmap. The block needs to have ->offset set.
> > + */
> > +void fsverity_invalidate_block(struct inode *inode,
> > +		struct fsverity_blockbuf *block)
> > +{
> > +	struct fsverity_info *vi = inode->i_verity_info;
> > +	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
> > +
> > +	if (block->offset > vi->tree_params.tree_size) {
> > +		fsverity_err(inode,
> > +"Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
> > +			     vi->tree_params.tree_size, block->offset);
> > +		return;
> > +	}
> > +
> > +	clear_bit(block->offset >> log_blocksize, vi->hash_block_verified);
> > +}
> > +EXPORT_SYMBOL_GPL(fsverity_invalidate_block);
> 
> I don't think this function should be using struct fsverity_blockbuf.  It only
> uses the 'offset' field, which is orthogonal to what ->read_merkle_tree_block
> and ->drop_merkle_tree_block use.
> 
> The function name fsverity_invalidate_block() also has the issue where it's not
> clear what type of block it's talking about.
> 
> How about changing the prototype to:
> 
> void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
>
> Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> can it happen in cases like filesystem corruption?  If it can only happen due to
> kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.

I think XFS only passes to _invalidate_* the same pos that was passed to
->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
problem.

Perhaps this function ought to note that @pos is supposed to be the same
value that was given to ->read_merkle_tree_block?

Or: make the implementations return 1 for "reloaded from disk", 0 for
"still in cache", or a negative error code.  Then fsverity can call
the invalidation routine itself and XFS doesn't have to worry about this
part.

(I think?  I have questions about the xfs_invalidate_blocks function.)

> Please note that there's also an off-by-one error.  pos > tree_size should be
> pos >= tree_size.
> 
> > +void fsverity_drop_block(struct inode *inode,
> > +		struct fsverity_blockbuf *block)
> 
> fsverity_drop_merkle_tree_block
> 
> > +{
> > +	if (inode->i_sb->s_vop->drop_block)
> > +		inode->i_sb->s_vop->drop_block(block);
> > +	else {
> 
> Add braces above.
> 
> (See line 213 of Documentation/process/coding-style.rst)
> 
> > +		struct page *page = (struct page *)block->context;
> > +
> > +		kunmap_local(block->kaddr);
> > +		put_page(page);
> 
> put_page((struct page *)block->context), and remove the page variable
> 
> > +	}
> > +	block->kaddr = NULL;
> 
> Also set block->context = NULL
> 
> > +/**
> > + * struct fsverity_blockbuf - Merkle Tree block buffer
> > + * @kaddr: virtual address of the block's data
> > + * @offset: block's offset into Merkle tree
> > + * @size: the Merkle tree block size
> > + * @context: filesystem private context
> > + *
> > + * Buffer containing single Merkle Tree block. These buffers are passed
> > + *  - to filesystem, when fs-verity is building merkel tree,
> > + *  - from filesystem, when fs-verity is reading merkle tree from a disk.
> > + * Filesystems sets kaddr together with size to point to a memory which contains
> > + * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
> > + * written down to disk.
> > + *
> > + * While reading the tree, fs-verity calls ->read_merkle_tree_block followed by
> > + * ->drop_block to let filesystem know that memory can be freed.
> > + *
> > + * The context is optional. This field can be used by filesystem to passthrough
> > + * state from ->read_merkle_tree_block to ->drop_block.
> > + */
> > +struct fsverity_blockbuf {
> > +	void *kaddr;
> > +	u64 offset;
> > +	unsigned int size;
> > +	void *context;
> > +};
> 
> If this is only used by ->read_merkle_tree_block and ->drop_merkle_tree_block,
> then it can be simplified.  The only fields needed are kaddr and context.
> 
> The comment also still incorrectly claims that the struct is used for writes.
> 
> > +	/**
> > +	 * Read a Merkle tree block of the given inode.
> > +	 * @inode: the inode
> > +	 * @pos: byte offset of the block within the Merkle tree
> > +	 * @block: block buffer for filesystem to point it to the block
> > +	 * @log_blocksize: log2 of the size of the expected block
> > +	 * @ra_bytes: The number of bytes that should be
> > +	 *		prefetched starting at @pos if the page at @pos
> > +	 *		isn't already cached.  Implementations may ignore this
> > +	 *		argument; it's only a performance optimization.
> > +	 *
> > +	 * This can be called at any time on an open verity file.  It may be
> > +	 * called by multiple processes concurrently.
> > +	 *
> > +	 * In case that block was evicted from the memory filesystem has to use
> > +	 * fsverity_invalidate_block() to let fsverity know that block's
> > +	 * verification state is not valid anymore.

Ah, ok, that's why fsverity_invalidate_block only does one block at a
time.

> > +	 *
> > +	 * Return: 0 on success, -errno on failure
> > +	 */
> > +	int (*read_merkle_tree_block)(struct inode *inode,
> > +				      u64 pos,
> > +				      struct fsverity_blockbuf *block,
> > +				      unsigned int log_blocksize,
> > +				      u64 ra_bytes);
> 
> How about:
> 	
> 	int (*read_merkle_tree_block)(struct inode *inode,
> 				      u64 pos,
> 				      unsigned int size,
> 				      unsigned long ra_bytes,
> 				      struct fsverity_blockbuf *block);
> 
> - size instead of log_blocksize, for consistency with other functions and
>   because it's what XFS actually wants
> 
> - ra_bytes as unsigned long, since u64 readahead amounts don't really make
>   sense.  (This isn't too important though; it could be u64 if you really want.)
> 
> - block at the end since it's an output parameter
> 
> It also needs to be clearly documented that filesystems must implement either
> ->read_merkle_tree_page, *or* both ->read_merkle_tree_block and
> ->drop_merkle_tree_block.
> 
> Also, the comment about invalidating the block is still unclear regarding the
> timing of when the filesystem must do it.
> 
> > +
> > +	/**
> > +	 * Release the reference to a Merkle tree block
> > +	 *
> > +	 * @block: the block to release
> > +	 *
> > +	 * This is called when fs-verity is done with a block obtained with
> > +	 * ->read_merkle_tree_block().
> > +	 */
> > +	void (*drop_block)(struct fsverity_blockbuf *block);
> 
> drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block

Yep.  I noticed that xfs_verity.c doesn't put them together, which made
me wonder if the write_merkle_tree_block path made use of that.  It
doesn't, AFAICT.

And I think the reason is that when we're setting up the merkle tree,
we want to stream the contents straight to disk instead of ending up
with a huge cache that might not all be necessary?

--D

> - Eric
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing
  2024-03-05  1:08   ` Eric Biggers
@ 2024-03-07 21:58     ` Darrick J. Wong
  2024-03-07 22:26       ` Eric Biggers
  2024-03-07 22:55       ` Dave Chinner
  0 siblings, 2 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 21:58 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Mon, Mar 04, 2024 at 05:08:05PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:31PM +0100, Andrey Albershteyn wrote:
> > For XFS, fsverity's global workqueue is not really suitable due to:
> > 
> > 1. High priority workqueues are used within XFS to ensure that data
> >    IO completion cannot stall processing of journal IO completions.
> >    Hence using a WQ_HIGHPRI workqueue directly in the user data IO
> >    path is a potential filesystem livelock/deadlock vector.
> > 
> > 2. The fsverity workqueue is global - it creates a cross-filesystem
> >    contention point.
> > 
> > This patch adds per-filesystem, per-cpu workqueue for fsverity
> > work. This allows iomap to add verification work in the read path on
> > BIO completion.
> > 
> > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> 
> Should ext4 and f2fs switch over to this by converting
> fsverity_enqueue_verify_work() to use it?  I'd prefer not to have to maintain
> two separate workqueue strategies as part of the fs/verity/ infrastructure.

(Agreed.)

> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 1fbc72c5f112..5863519ffd51 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1223,6 +1223,8 @@ struct super_block {
> >  #endif
> >  #ifdef CONFIG_FS_VERITY
> >  	const struct fsverity_operations *s_vop;
> > +	/* Completion queue for post read verification */
> > +	struct workqueue_struct *s_read_done_wq;
> >  #endif
> 
> Maybe s_verity_wq?  Or do you anticipate this being used for other purposes too,
> such as fscrypt?  Note that it's behind CONFIG_FS_VERITY and is allocated by an
> fsverity_* function, so at least at the moment it doesn't feel very generic.

Doesn't fscrypt already create its own workqueues?

> > diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
> > index 0973b521ac5a..45b7c613148a 100644
> > --- a/include/linux/fsverity.h
> > +++ b/include/linux/fsverity.h
> > @@ -241,6 +241,22 @@ void fsverity_enqueue_verify_work(struct work_struct *work);
> >  void fsverity_invalidate_block(struct inode *inode,
> >  		struct fsverity_blockbuf *block);
> >  
> > +static inline int fsverity_set_ops(struct super_block *sb,
> > +				   const struct fsverity_operations *ops)
> 
> This doesn't just set the ops, but also allocates a workqueue too.  A better
> name for this function might be fsverity_init_sb.
> 
> Also this shouldn't really be an inline function.

Yeah.

> > +{
> > +	sb->s_vop = ops;
> > +
> > +	/* Create per-sb workqueue for post read bio verification */
> > +	struct workqueue_struct *wq = alloc_workqueue(
> > +		"pread/%s", (WQ_FREEZABLE | WQ_MEM_RECLAIM), 0, sb->s_id);
> 
> "pread" is short for "post read", I guess?  Should it really be this generic?

I think it shouldn't use a term that already means "positioned read" to
userspace.

> > +static inline int fsverity_set_ops(struct super_block *sb,
> > +				   const struct fsverity_operations *ops)
> > +{
> > +	return -EOPNOTSUPP;
> > +}
> 
> I think it would be better to just not have a !CONFIG_FS_VERITY stub for this.
> 
> You *could* make it work like fscrypt_set_ops(), which the ubifs folks added,
> where it can be called unconditionally if the filesystem has a declaration for
> the operations (but not necessarily a definition).  In that case it would need
> to return 0, rather than an error.  But I think I prefer just omitting the stub
> and having filesystems guard the call to this by CONFIG_FS_VERITY, as you've
> already done in XFS.

Aha, I was going to ask why XFS had its own #ifdef guards when this
already exists. :)

--D

> - Eric
> 


* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-06 16:30     ` Darrick J. Wong
@ 2024-03-07 22:02       ` Eric Biggers
  2024-03-08  3:46         ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-07 22:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Wed, Mar 06, 2024 at 08:30:00AM -0800, Darrick J. Wong wrote:
> Or you could leave the unfinished tree as-is; that will waste space, but
> if userspace tries again, the xattr code will replace the old merkle
> tree block contents with the new ones.  This assumes that we're not
> using XATTR_CREATE during FS_IOC_ENABLE_VERITY.

This should work, though if the file was shrunk between the FS_IOC_ENABLE_VERITY
that was interrupted and the one that completed, there may be extra Merkle tree
blocks left over.

BTW, is xfs_repair planned to do anything about any such extra blocks?

- Eric


* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-04 23:39   ` Eric Biggers
@ 2024-03-07 22:06     ` Darrick J. Wong
  2024-03-07 22:19       ` Eric Biggers
  2024-03-07 23:38     ` Dave Chinner
  1 sibling, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 22:06 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, Christoph Hellwig

On Mon, Mar 04, 2024 at 03:39:27PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:33PM +0100, Andrey Albershteyn wrote:
> > +#ifdef CONFIG_FS_VERITY
> > +struct iomap_fsverity_bio {
> > +	struct work_struct	work;
> > +	struct bio		bio;
> > +};
> 
> Maybe leave a comment above that mentions that bio must be the last field.
> 
> > @@ -471,6 +529,7 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
> >   * iomap_readahead - Attempt to read pages from a file.
> >   * @rac: Describes the pages to be read.
> >   * @ops: The operations vector for the filesystem.
> > + * @wq: Workqueue for post-I/O processing (only need for fsverity)
> 
> This should not be here.
> 
> > +#define IOMAP_POOL_SIZE		(4 * (PAGE_SIZE / SECTOR_SIZE))
> > +
> >  static int __init iomap_init(void)
> >  {
> > -	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> > -			   offsetof(struct iomap_ioend, io_inline_bio),
> > -			   BIOSET_NEED_BVECS);
> > +	int error;
> > +
> > +	error = bioset_init(&iomap_ioend_bioset, IOMAP_POOL_SIZE,
> > +			    offsetof(struct iomap_ioend, io_inline_bio),
> > +			    BIOSET_NEED_BVECS);
> > +#ifdef CONFIG_FS_VERITY
> > +	if (error)
> > +		return error;
> > +
> > +	error = bioset_init(&iomap_fsverity_bioset, IOMAP_POOL_SIZE,
> > +			    offsetof(struct iomap_fsverity_bio, bio),
> > +			    BIOSET_NEED_BVECS);
> > +	if (error)
> > +		bioset_exit(&iomap_ioend_bioset);
> > +#endif
> > +	return error;
> >  }
> >  fs_initcall(iomap_init);
> 
> This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> for these bios, regardless of whether they end up being used or not.  When
> PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.
> 
> How about allocating the pool when it's known it's actually going to be used,
> similar to what fs/crypto/ does for fscrypt_bounce_page_pool?  For example,
> there could be a flag in struct fsverity_operations that says whether filesystem
> wants the iomap fsverity bioset, and when fs/verity/ sets up the fsverity_info
> for any file for the first time since boot, it could call into fs/iomap/ to
> initialize the iomap fsverity bioset if needed.

I was thinking the same thing.

Though I wondered if you want to have a CONFIG_IOMAP_FS_VERITY as well?
But that might be tinykernel tetrapyloctomy.

--D

> BTW, errors from builtin initcalls such as iomap_init() get ignored.  So the
> error handling logic above does not really work as may have been intended.
> 
> - Eric
> 


* Re: [PATCH v5 05/24] fs: add FS_XFLAG_VERITY for verity files
  2024-03-07 21:39     ` Darrick J. Wong
@ 2024-03-07 22:06       ` Eric Biggers
  0 siblings, 0 replies; 94+ messages in thread
From: Eric Biggers @ 2024-03-07 22:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 01:39:11PM -0800, Darrick J. Wong wrote:
> On Mon, Mar 04, 2024 at 02:35:48PM -0800, Eric Biggers wrote:
> > On Mon, Mar 04, 2024 at 08:10:28PM +0100, Andrey Albershteyn wrote:
> > > @@ -641,6 +645,13 @@ static int fileattr_set_prepare(struct inode *inode,
> > >  	    !(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
> > >  		return -EINVAL;
> > >  
> > > +	/*
> > > +	 * Verity cannot be set through FS_IOC_FSSETXATTR/FS_IOC_SETFLAGS.
> > > +	 * See FS_IOC_ENABLE_VERITY
> > > +	 */
> > > +	if (fa->fsx_xflags & FS_XFLAG_VERITY)
> > > +		return -EINVAL;
> > 
> > This makes FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR start failing on files that
> > already have verity enabled.
> > 
> > An error should only be returned when the new flags contain verity and the old
> > flags don't.
> 
> What if the old flags have it and the new ones don't?  Is that supposed
> to disable fsverity?  Is removal of the verity information not supported?
> 
> I'm guessing that removal isn't supposed to happen, in which case the
> above check ought to be:
> 
> 	if (!!IS_VERITY(inode) != !!(fa->fsx_xflags & FS_XFLAG_VERITY))
> 		return -EINVAL;
> 
> Right?

Yeah, good catch.  We need to prevent disabling the flag too.  How about:

	if ((fa->flags ^ old_ma->flags) & FS_VERITY_FL)
		return -EINVAL;

That would be consistent with how changes to other flags such as FS_APPEND_FL
and FS_IMMUTABLE_FL are detected earlier in the function.

- Eric


* Re: [PATCH v5 17/24] xfs: add inode on-disk VERITY flag
  2024-03-04 19:10 ` [PATCH v5 17/24] xfs: add inode on-disk VERITY flag Andrey Albershteyn
@ 2024-03-07 22:06   ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 22:06 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:40PM +0100, Andrey Albershteyn wrote:
> Add flag to mark inodes which have fs-verity enabled on them (i.e.
> descriptor exists and tree is built).
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_format.h | 4 +++-
>  fs/xfs/xfs_inode.c         | 2 ++
>  fs/xfs/xfs_iops.c          | 2 ++
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 93d280eb8451..3ce2902101bc 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -1085,16 +1085,18 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
>  #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
>  #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
> +#define XFS_DIFLAG2_VERITY_BIT	5	/* inode sealed by fsverity */
>  
>  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
>  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
>  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
>  #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
>  #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
> +#define XFS_DIFLAG2_VERITY	(1 << XFS_DIFLAG2_VERITY_BIT)
>  
>  #define XFS_DIFLAG2_ANY \
>  	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> -	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
> +	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_VERITY)
>  
>  static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
>  {
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ea48774f6b76..59446e9e1719 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -607,6 +607,8 @@ xfs_ip2xflags(
>  			flags |= FS_XFLAG_DAX;
>  		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
>  			flags |= FS_XFLAG_COWEXTSIZE;
> +		if (ip->i_diflags2 & XFS_DIFLAG2_VERITY)
> +			flags |= FS_XFLAG_VERITY;
>  	}
>  
>  	if (xfs_inode_has_attr_fork(ip))
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 66f8c47642e8..0e5cdb82b231 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1241,6 +1241,8 @@ xfs_diflags_to_iflags(
>  		flags |= S_NOATIME;
>  	if (init && xfs_inode_should_enable_dax(ip))
>  		flags |= S_DAX;
> +	if (xflags & FS_XFLAG_VERITY)
> +		flags |= S_VERITY;
>  
>  	/*
>  	 * S_DAX can only be set during inode initialization and is never set by
> -- 
> 2.42.0
> 
> 


* Re: [PATCH v5 18/24] xfs: initialize fs-verity on file open and cleanup on inode destruction
  2024-03-04 19:10 ` [PATCH v5 18/24] xfs: initialize fs-verity on file open and cleanup on inode destruction Andrey Albershteyn
@ 2024-03-07 22:09   ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 22:09 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:41PM +0100, Andrey Albershteyn wrote:
> fs-verity will read and attach metadata (not the Merkle tree itself)
> from disk for those inodes which already have fs-verity enabled.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/xfs/xfs_file.c  | 8 ++++++++
>  fs/xfs/xfs_super.c | 2 ++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 632653e00906..17404c2e7e31 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -31,6 +31,7 @@
>  #include <linux/mman.h>
>  #include <linux/fadvise.h>
>  #include <linux/mount.h>
> +#include <linux/fsverity.h>
>  
>  static const struct vm_operations_struct xfs_file_vm_ops;
>  
> @@ -1228,10 +1229,17 @@ xfs_file_open(
>  	struct inode	*inode,
>  	struct file	*file)
>  {
> +	int		error = 0;

Not sure why error needs an initializer here?

Otherwise this patch looks good to me.

--D

> +
>  	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
>  		return -EIO;
>  	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
>  			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
> +
> +	error = fsverity_file_open(inode, file);
> +	if (error)
> +		return error;
> +
>  	return generic_file_open(inode, file);
>  }
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index afa32bd5e282..9f9c35cff9bf 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -49,6 +49,7 @@
>  #include <linux/magic.h>
>  #include <linux/fs_context.h>
>  #include <linux/fs_parser.h>
> +#include <linux/fsverity.h>
>  
>  static const struct super_operations xfs_super_operations;
>  
> @@ -663,6 +664,7 @@ xfs_fs_destroy_inode(
>  	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
>  	XFS_STATS_INC(ip->i_mount, vn_rele);
>  	XFS_STATS_INC(ip->i_mount, vn_remove);
> +	fsverity_cleanup_inode(inode);
>  	xfs_inode_mark_reclaimable(ip);
>  }
>  
> -- 
> 2.42.0
> 
> 


* Re: [PATCH v5 19/24] xfs: don't allow to enable DAX on fs-verity sealed inode
  2024-03-04 19:10 ` [PATCH v5 19/24] xfs: don't allow to enable DAX on fs-verity sealed inode Andrey Albershteyn
@ 2024-03-07 22:09   ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 22:09 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:42PM +0100, Andrey Albershteyn wrote:
> fs-verity doesn't support DAX. Forbid the filesystem from enabling DAX
> on inodes which already have fs-verity enabled. The opposite case is
> checked when fs-verity is being enabled: it won't be enabled if DAX is.

What's the problem with allowing fsdax and fsverity at the same time?
Was that merely not in scope, or is there a fundamental incompatibility
between the two?

--D

> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/xfs/xfs_iops.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 0e5cdb82b231..6f97d777f702 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1213,6 +1213,8 @@ xfs_inode_should_enable_dax(
>  		return false;
>  	if (!xfs_inode_supports_dax(ip))
>  		return false;
> +	if (ip->i_diflags2 & XFS_DIFLAG2_VERITY)
> +		return false;
>  	if (xfs_has_dax_always(ip->i_mount))
>  		return true;
>  	if (ip->i_diflags2 & XFS_DIFLAG2_DAX)
> -- 
> 2.42.0
> 
> 


* Re: [PATCH v5 20/24] xfs: disable direct read path for fs-verity files
  2024-03-04 19:10 ` [PATCH v5 20/24] xfs: disable direct read path for fs-verity files Andrey Albershteyn
@ 2024-03-07 22:11   ` Darrick J. Wong
  2024-03-12 12:02     ` Andrey Albershteyn
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 22:11 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:43PM +0100, Andrey Albershteyn wrote:
> The direct I/O path is not supported on verity files. Attempts to use
> the direct I/O path on such files should fall back to the buffered I/O
> path.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/xfs/xfs_file.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 17404c2e7e31..af3201075066 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -281,7 +281,8 @@ xfs_file_dax_read(
>  	struct kiocb		*iocb,
>  	struct iov_iter		*to)
>  {
> -	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
> +	struct inode		*inode = iocb->ki_filp->f_mapping->host;
> +	struct xfs_inode	*ip = XFS_I(inode);
>  	ssize_t			ret = 0;
>  
>  	trace_xfs_file_dax_read(iocb, to);
> @@ -334,10 +335,18 @@ xfs_file_read_iter(
>  
>  	if (IS_DAX(inode))
>  		ret = xfs_file_dax_read(iocb, to);
> -	else if (iocb->ki_flags & IOCB_DIRECT)
> +	else if (iocb->ki_flags & IOCB_DIRECT && !fsverity_active(inode))
>  		ret = xfs_file_dio_read(iocb, to);
> -	else
> +	else {

I think the earlier cases need curly braces {} too.

> +		/*
> +		 * When fs-verity is enabled, we also fall back to the
> +		 * buffered read path from the direct read path. Therefore,
> +		 * IOCB_DIRECT is set and needs to be cleared (see
> +		 * generic_file_read_iter()).
> +		 */
> +		iocb->ki_flags &= ~IOCB_DIRECT;

I'm curious that you added this flag here; how have we gotten along
this far without clearing it?

--D

>  		ret = xfs_file_buffered_read(iocb, to);
> +	}
>  
>  	if (ret > 0)
>  		XFS_STATS_ADD(mp, xs_read_bytes, ret);
> -- 
> 2.42.0
> 
> 


* Re: [PATCH v5 23/24] xfs: add fs-verity ioctls
  2024-03-04 19:10 ` [PATCH v5 23/24] xfs: add fs-verity ioctls Andrey Albershteyn
@ 2024-03-07 22:14   ` Darrick J. Wong
  2024-03-12 12:42     ` Andrey Albershteyn
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 22:14 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:46PM +0100, Andrey Albershteyn wrote:
> Add fs-verity ioctls to enable verity, dump metadata (descriptor and
> Merkle tree pages), and obtain a file's digest.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/xfs/xfs_ioctl.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index ab61d7d552fb..4763d20c05ff 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -43,6 +43,7 @@
>  #include <linux/mount.h>
>  #include <linux/namei.h>
>  #include <linux/fileattr.h>
> +#include <linux/fsverity.h>
>  
>  /*
>   * xfs_find_handle maps from userspace xfs_fsop_handlereq structure to
> @@ -2174,6 +2175,22 @@ xfs_file_ioctl(
>  		return error;
>  	}
>  
> +	case FS_IOC_ENABLE_VERITY:
> +		if (!xfs_has_verity(mp))
> +			return -EOPNOTSUPP;
> +		return fsverity_ioctl_enable(filp, (const void __user *)arg);

Isn't @arg already declared as a (void __user *) ?

--D

> +
> +	case FS_IOC_MEASURE_VERITY:
> +		if (!xfs_has_verity(mp))
> +			return -EOPNOTSUPP;
> +		return fsverity_ioctl_measure(filp, (void __user *)arg);
> +
> +	case FS_IOC_READ_VERITY_METADATA:
> +		if (!xfs_has_verity(mp))
> +			return -EOPNOTSUPP;
> +		return fsverity_ioctl_read_metadata(filp,
> +						    (const void __user *)arg);
> +
>  	default:
>  		return -ENOTTY;
>  	}
> -- 
> 2.42.0
> 
> 


* Re: [PATCH v5 24/24] xfs: enable ro-compat fs-verity flag
  2024-03-04 19:10 ` [PATCH v5 24/24] xfs: enable ro-compat fs-verity flag Andrey Albershteyn
@ 2024-03-07 22:16   ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 22:16 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:47PM +0100, Andrey Albershteyn wrote:
> Finalize the fs-verity integration in XFS by making the kernel aware
> of fs-verity via the ro-compat feature flag.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_format.h | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 3ce2902101bc..be66f0ab20cb 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -355,10 +355,11 @@ xfs_sb_has_compat_feature(
>  #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
>  #define XFS_SB_FEAT_RO_COMPAT_VERITY   (1 << 4)		/* fs-verity */
>  #define XFS_SB_FEAT_RO_COMPAT_ALL \
> -		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> -		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> -		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
> -		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
> +		(XFS_SB_FEAT_RO_COMPAT_FINOBT  | \
> +		 XFS_SB_FEAT_RO_COMPAT_RMAPBT  | \
> +		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
> +		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT| \

You might as well add a few more spaces here  ^^ for future expansion
if you're going to reflow the whole macro definition.

(or just leave it alone which is what I would've done)

Up to you; the rest is
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +		 XFS_SB_FEAT_RO_COMPAT_VERITY)
>  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
>  static inline bool
>  xfs_sb_has_ro_compat_feature(
> -- 
> 2.42.0
> 
> 


* Re: [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag
  2024-03-04 19:10 ` [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag Andrey Albershteyn
@ 2024-03-07 22:18   ` Darrick J. Wong
  2024-03-12 12:10     ` Andrey Albershteyn
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 22:18 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:45PM +0100, Andrey Albershteyn wrote:
> fs-verity adds a new inode flag which causes scrub to fail, as the
> flag is not yet known to scrub.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/scrub/attr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
> index 9a1f59f7b5a4..ae4227cb55ec 100644
> --- a/fs/xfs/scrub/attr.c
> +++ b/fs/xfs/scrub/attr.c
> @@ -494,7 +494,7 @@ xchk_xattr_rec(
>  	/* Retrieve the entry and check it. */
>  	hash = be32_to_cpu(ent->hashval);
>  	badflags = ~(XFS_ATTR_LOCAL | XFS_ATTR_ROOT | XFS_ATTR_SECURE |
> -			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT);
> +			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT | XFS_ATTR_VERITY);

Now that online repair can modify/discard/salvage broken xattr trees and
is pretty close to merging, how can I make it invalidate all the incore
merkle tree data after a repair?

--D

>  	if ((ent->flags & badflags) != 0)
>  		xchk_da_set_corrupt(ds, level);
>  	if (ent->flags & XFS_ATTR_LOCAL) {
> -- 
> 2.42.0
> 
> 


* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-07 22:06     ` Darrick J. Wong
@ 2024-03-07 22:19       ` Eric Biggers
  0 siblings, 0 replies; 94+ messages in thread
From: Eric Biggers @ 2024-03-07 22:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, Christoph Hellwig

On Thu, Mar 07, 2024 at 02:06:08PM -0800, Darrick J. Wong wrote:
> On Mon, Mar 04, 2024 at 03:39:27PM -0800, Eric Biggers wrote:
> > On Mon, Mar 04, 2024 at 08:10:33PM +0100, Andrey Albershteyn wrote:
> > > +#ifdef CONFIG_FS_VERITY
> > > +struct iomap_fsverity_bio {
> > > +	struct work_struct	work;
> > > +	struct bio		bio;
> > > +};
> > 
> > Maybe leave a comment above that mentions that bio must be the last field.
> > 
> > > @@ -471,6 +529,7 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
> > >   * iomap_readahead - Attempt to read pages from a file.
> > >   * @rac: Describes the pages to be read.
> > >   * @ops: The operations vector for the filesystem.
> > > + * @wq: Workqueue for post-I/O processing (only need for fsverity)
> > 
> > This should not be here.
> > 
> > > +#define IOMAP_POOL_SIZE		(4 * (PAGE_SIZE / SECTOR_SIZE))
> > > +
> > >  static int __init iomap_init(void)
> > >  {
> > > -	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> > > -			   offsetof(struct iomap_ioend, io_inline_bio),
> > > -			   BIOSET_NEED_BVECS);
> > > +	int error;
> > > +
> > > +	error = bioset_init(&iomap_ioend_bioset, IOMAP_POOL_SIZE,
> > > +			    offsetof(struct iomap_ioend, io_inline_bio),
> > > +			    BIOSET_NEED_BVECS);
> > > +#ifdef CONFIG_FS_VERITY
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	error = bioset_init(&iomap_fsverity_bioset, IOMAP_POOL_SIZE,
> > > +			    offsetof(struct iomap_fsverity_bio, bio),
> > > +			    BIOSET_NEED_BVECS);
> > > +	if (error)
> > > +		bioset_exit(&iomap_ioend_bioset);
> > > +#endif
> > > +	return error;
> > >  }
> > >  fs_initcall(iomap_init);
> > 
> > This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> > for these bios, regardless of whether they end up being used or not.  When
> > PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> > over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> > scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.
> > 
> > How about allocating the pool when it's known it's actually going to be used,
> > similar to what fs/crypto/ does for fscrypt_bounce_page_pool?  For example,
> > there could be a flag in struct fsverity_operations that says whether filesystem
> > wants the iomap fsverity bioset, and when fs/verity/ sets up the fsverity_info
> > for any file for the first time since boot, it could call into fs/iomap/ to
> > initialize the iomap fsverity bioset if needed.
> 
> I was thinking the same thing.
> 
> Though I wondered if you want to have a CONFIG_IOMAP_FS_VERITY as well?
> But that might be tinykernel tetrapyloctomy.
> 

I'd prefer just doing the alloc-on-first-use optimization, and not do
CONFIG_IOMAP_FS_VERITY.  CONFIG_IOMAP_FS_VERITY would serve a similar purpose,
but only if XFS is not built into the kernel, and only before the other
filesystems that support fsverity convert their buffered reads to use iomap.

For tiny kernels there's always the option of CONFIG_FS_VERITY=n.

- Eric


* Re: [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing
  2024-03-07 21:58     ` Darrick J. Wong
@ 2024-03-07 22:26       ` Eric Biggers
  2024-03-08  3:53         ` Darrick J. Wong
  2024-03-07 22:55       ` Dave Chinner
  1 sibling, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-07 22:26 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 01:58:57PM -0800, Darrick J. Wong wrote:
> > Maybe s_verity_wq?  Or do you anticipate this being used for other purposes too,
> > such as fscrypt?  Note that it's behind CONFIG_FS_VERITY and is allocated by an
> > fsverity_* function, so at least at the moment it doesn't feel very generic.
> 
> Doesn't fscrypt already create its own workqueues?

There's a global workqueue in fs/crypto/ too.  IIRC, it has to be separate from
the fs/verity/ workqueue to avoid deadlocks when a file has both encryption and
verity enabled.  This is because the verity step can involve reading and
decrypting Merkle tree blocks.

The main thing I was wondering was whether there was some plan to use this new
per-sb workqueue as a generic post-read processing queue, as seemed to be
implied by the chosen naming.  If there's no clear plan, it should instead just
be named after what it's actually used for, fsverity.

- Eric


* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-04 19:10 ` [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag Andrey Albershteyn
@ 2024-03-07 22:46   ` Darrick J. Wong
  2024-03-08  1:59     ` Dave Chinner
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 22:46 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> One of the essential ideas of fs-verity is that pages which are
> already verified won't need to be re-verified if they are still in the
> page cache.
> 
> XFS will store Merkle tree blocks in extended file attributes. When an
> attribute is read, the extended attribute data is put into an xfs_buf.
> 
> fs-verity uses the PG_checked flag to track the status of the blocks
> in a page. This flag can have two meanings: the page was
> re-instantiated, or the only block in the page is verified.
> 
> However, in XFS, the data in the buffer is not aligned with xfs_buf
> pages and we don't have a reference to these pages. Moreover, these
> pages are released when the value is copied out in the xfs_attr code.
> In other words, we cannot directly mark the underlying xfs_buf's pages
> as verified as is done by fs-verity for other filesystems.
> 
> One way to track that these pages were processed by fs-verity is to
> mark the buffer as verified instead. If the buffer is evicted, the
> incore XBF_VERITY_SEEN flag is lost. When the xattr is read again,
> xfs_attr_get() returns a new buffer without the flag. The xfs_buf's
> flag is then used to tell fs-verity whether this buffer was cached or
> not.
> 
> The second state indicated by PG_checked is whether the only block in
> the page is verified. This is not the case for XFS, as there can be
> multiple blocks in a single buffer (e.g. 64k page size with 4k block
> size). This is handled by the fs-verity bitmap; fs-verity always uses
> the bitmap for XFS regardless of the Merkle tree block size.
> 
> The meaning of the flag is that the value of the extended attribute in
> the buffer has been processed by fs-verity.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/xfs/xfs_buf.h | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 73249abca968..2a73918193ba 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -24,14 +24,15 @@ struct xfs_buf;
>  
>  #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
>  
> -#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> -#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> -#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> -#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> -#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> -#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> -#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> +#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
> +#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
> +#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
> +#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> +#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
> +#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
> +#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
> +#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
> +#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */

Yuck.  I still dislike this entire approach.

XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that
gets loaded on behalf of a merkle tree request, then uses the extra
space to shadow the contents of the ondisk block.  AFAICT the shadow
doesn't get updated even if the cached data does, which sounds like a
landmine for coherency issues.

XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of
exposing the low level buffering details of the xattr code to
xfs_attr_get callers.

XBF_VERITY_SEEN is a layering violation because now the overall buffer
cache can track file metadata state.  I think the reason why you need
this state flag is because the datadev buffer target indexes based on
physical xfs_daddr_t, whereas merkle tree blocks have their own internal
block numbers.  You can't directly go from the merkle block number to an
xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block
fell out of memory.

ISTR asking for a separation of these indices when I reviewed some
previous version of this patchset.  At the time it seems to me that a
much more efficient way to cache the merkle tree blocks would be to set
up a direct (merkle tree block number) -> (blob of data) lookup table.
That I don't see here.

In the spirit of the recent collaboration style that I've taken with
Christoph, I pulled your branch and started appending patches to it to
see if the design that I'm asking for is reasonable.  As it so happens,
I was working on a simplified version of the xfs buffer cache ("fsbuf")
that could be used by simple filesystems to get them off of buffer
heads.

(Ab)using the fsbuf code did indeed work (and passed all the fstests -g
verity tests), so now I know the idea is reasonable.  Patches 11, 12,
14, and 15 become unnecessary.  However, this solution is itself grossly
overengineered, since all we want are the following operations:

peek(key): returns an fsbuf if there's any data cached for key

get(key): returns an fsbuf for key, regardless of state

store(fsbuf, p): attach a memory buffer p to fsbuf

Then the xfs ->read_merkle_tree_block function becomes:

	bp = peek(key)
	if (bp)
		/* return bp data up to verity */

	p = xfs_attr_get(key)
	if (!p)
		/* error */

	bp = get(key)
	store(bp, p)

	/* return bp data up to verity */

*Fortunately* you've also rebased against the 6.9 for-next branch, which
means I think there's an elegant solution at hand.  The online repair
patches that are in that branch make it possible to create a standalone
xfs_buftarg.  We can attach one of these to an xfs_inode to cache merkle
tree blocks.  xfs_bufs have well known usage semantics, they're already
hooked up to shrinkers so that we can reclaim the data if memory is
tight, and they can clean up a lot of code.

Could you take a look at my branch here:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-07

And tell me what you think?

--D

>  
>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES	 (1u << 16)/* inode buffer */
> @@ -65,6 +66,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_DONE,		"DONE" }, \
>  	{ XBF_STALE,		"STALE" }, \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> +	{ XBF_VERITY_SEEN,	"VERITY_SEEN" }, \
>  	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_DQUOTS,		"DQUOTS" }, \
>  	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
> -- 
> 2.42.0
> 
> 


* Re: [PATCH v5 07/24] fsverity: support block-based Merkle tree caching
  2024-03-07 21:54     ` Darrick J. Wong
@ 2024-03-07 22:49       ` Eric Biggers
  2024-03-08  3:50         ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-07 22:49 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 01:54:01PM -0800, Darrick J. Wong wrote:
> On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> > On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > > index b3506f56e180..dad33e6ff0d6 100644
> > > --- a/fs/verity/fsverity_private.h
> > > +++ b/fs/verity/fsverity_private.h
> > > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> > >  
> > >  void __init fsverity_init_workqueue(void);
> > >  
> > > +/*
> > > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > > + * block->context.
> > > + */
> > > +void fsverity_drop_block(struct inode *inode,
> > > +			 struct fsverity_blockbuf *block);
> > > +
> > >  #endif /* _FSVERITY_PRIVATE_H */
> > 
> > This should be paired with a helper function that reads a Merkle tree block by
> > calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> > being consistent with having a helper function for drop, this would prevent code
> > duplication between verify_data_block() and fsverity_read_merkle_tree().
> > 
> > I recommend that it look like this:
> > 
> > int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> > 				    unsigned long ra_bytes,
> > 				    struct fsverity_blockbuf *block);
> > 
> > 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> > would be the number of bytes for the filesystem to (optionally) readahead if the
> > block is not yet cached.  I think that things work out simpler if these values
> > are measured in bytes, not blocks.  'block' would be at the end because it's an
> > output, and it can be confusing to interleave inputs and outputs in parameters.
> 
> FWIW I don't really like 'pos' here because that's usually short for
> "file position", which is a byte, and this looks a lot more like a
> merkle tree block number.
> 
> u64 blkno?
> 
> Or better yet use a typedef ("merkle_blkno_t") to make it really clear
> when we're dealing with a tree block number.  Ignore checkpatch
> complaining about typedefs. :)

My suggestion is for 'pos' to be a byte position, in alignment with the
pagecache naming convention as well as ->write_merkle_tree_block which currently
uses a 'u64 pos' byte position too.

It would also work for it to be a block index, in which case it should be named
'index' or 'blkno' and have type unsigned long.  I *think* that things work out
a bit cleaner if it's a byte position, but I could be convinced that it's
actually better for it to be a block index.

> > How about changing the prototype to:
> > 
> > void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
> >
> > Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> > can it happen in cases like filesystem corruption?  If it can only happen due to
> > kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.
> 
> I think XFS only passes to _invalidate_* the same pos that was passed to
> ->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
> problem.
> 
> Perhaps this function ought to note that @pos is supposed to be the same
> value that was given to ->read_merkle_tree_block?
> 
> Or: make the implementations return 1 for "reloaded from disk", 0 for
> "still in cache", or a negative error code.  Then fsverity can call
> the invalidation routine itself and XFS doesn't have to worry about this
> part.
> 
> (I think?  I have questions about the xfs_invalidate_blocks function.)

It looks like XFS can invalidate blocks other than the one being read by
->read_merkle_tree_block.

If it really was only a matter of the single block being read, then it would
indeed be simpler to just make it a piece of information returned from
->read_merkle_tree_block.

If the generic invalidation function is needed, it needs to be clearly
documented when filesystems are expected to invalidate blocks.

> > > +
> > > +	/**
> > > +	 * Release the reference to a Merkle tree block
> > > +	 *
> > > +	 * @block: the block to release
> > > +	 *
> > > +	 * This is called when fs-verity is done with a block obtained with
> > > +	 * ->read_merkle_tree_block().
> > > +	 */
> > > +	void (*drop_block)(struct fsverity_blockbuf *block);
> > 
> > drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block
> 
> Yep.  I noticed that xfs_verity.c doesn't put them together, which made
> me wonder if the write_merkle_tree_block path made use of that.  It
> doesn't, AFAICT.
> 
> And I think the reason is that when we're setting up the merkle tree,
> we want to stream the contents straight to disk instead of ending up
> with a huge cache that might not all be necessary?

In the current patchset, fsverity_blockbuf isn't used for writes (despite the
comment saying it is).  I think that's fine and that it keeps things simpler.
Filesystems can still cache the blocks that are passed to
->write_merkle_tree_block if they want to.  That already happens on ext4 and
f2fs; FS_IOC_ENABLE_VERITY results in the Merkle tree being in the pagecache.

- Eric


* Re: [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing
  2024-03-07 21:58     ` Darrick J. Wong
  2024-03-07 22:26       ` Eric Biggers
@ 2024-03-07 22:55       ` Dave Chinner
  1 sibling, 0 replies; 94+ messages in thread
From: Dave Chinner @ 2024-03-07 22:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Eric Biggers, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 01:58:57PM -0800, Darrick J. Wong wrote:
> On Mon, Mar 04, 2024 at 05:08:05PM -0800, Eric Biggers wrote:
> > On Mon, Mar 04, 2024 at 08:10:31PM +0100, Andrey Albershteyn wrote:
> > > For XFS, fsverity's global workqueue is not really suitable due to:
> > > 
> > > 1. High priority workqueues are used within XFS to ensure that data
> > >    IO completion cannot stall processing of journal IO completions.
> > >    Hence using a WQ_HIGHPRI workqueue directly in the user data IO
> > >    path is a potential filesystem livelock/deadlock vector.
> > > 
> > > 2. The fsverity workqueue is global - it creates a cross-filesystem
> > >    contention point.
> > > 
> > > This patch adds per-filesystem, per-cpu workqueue for fsverity
> > > work. This allows iomap to add verification work in the read path on
> > > BIO completion.
> > > 
> > > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > 
> > Should ext4 and f2fs switch over to this by converting
> > fsverity_enqueue_verify_work() to use it?  I'd prefer not to have to maintain
> > two separate workqueue strategies as part of the fs/verity/ infrastructure.
> 
> (Agreed.)
> 
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 1fbc72c5f112..5863519ffd51 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -1223,6 +1223,8 @@ struct super_block {
> > >  #endif
> > >  #ifdef CONFIG_FS_VERITY
> > >  	const struct fsverity_operations *s_vop;
> > > +	/* Completion queue for post read verification */
> > > +	struct workqueue_struct *s_read_done_wq;
> > >  #endif
> > 
> > Maybe s_verity_wq?  Or do you anticipate this being used for other purposes too,
> > such as fscrypt?  Note that it's behind CONFIG_FS_VERITY and is allocated by an
> > fsverity_* function, so at least at the moment it doesn't feel very generic.
> 
> Doesn't fscrypt already create its own workqueues?

Yes, but iomap really needs it to be a generic sb workqueue like the
sb->s_dio_done_wq.

i.e. the completion processing in task context is not isolated to
fsverity - it will likely also be needed for fscrypt completion and
decompression on read completion, when they get supported by the
generic iomap IO infrastruture. i.e. Any read IO that needs a task
context for completion side data processing will need this.

The reality is that the major filesytsems are already using either
generic or internal per-fs workqueues for DIO completion and
buffered write completions. Some already use buried workqueues on
read IO completion, too, so moving this to being supported by the
generic IO infrastructure rather than reimplementing it just a little
bit differently in every filesystem makes a lot of sense.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 21/24] xfs: add fs-verity support
  2024-03-04 19:10 ` [PATCH v5 21/24] xfs: add fs-verity support Andrey Albershteyn
  2024-03-06  4:55   ` Eric Biggers
@ 2024-03-07 23:10   ` Darrick J. Wong
  1 sibling, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 23:10 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:44PM +0100, Andrey Albershteyn wrote:
> Add integration with fs-verity. XFS stores fs-verity metadata in
> extended file attributes. The metadata consists of the verity
> descriptor and the Merkle tree blocks.
> 
> The descriptor is stored under "vdesc" extended attribute. The
> Merkle tree blocks are stored under binary indexes which are offsets
> into the Merkle tree.
> 
> When fs-verity is enabled on an inode, the XFS_IVERITY_CONSTRUCTION
> flag is set, meaning that the Merkle tree is being built. The
> initialization ends with storing the verity descriptor and setting
> the inode's on-disk flag (XFS_DIFLAG2_VERITY).
> 
> Verification on read is done in iomap's read path.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/xfs/Makefile                 |   1 +
>  fs/xfs/libxfs/xfs_attr.c        |  13 ++
>  fs/xfs/libxfs/xfs_attr_leaf.c   |  17 +-
>  fs/xfs/libxfs/xfs_attr_remote.c |   8 +-
>  fs/xfs/libxfs/xfs_da_format.h   |  27 +++
>  fs/xfs/libxfs/xfs_ondisk.h      |   4 +
>  fs/xfs/xfs_inode.h              |   3 +-
>  fs/xfs/xfs_super.c              |  10 +
>  fs/xfs/xfs_verity.c             | 355 ++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_verity.h             |  33 +++
>  10 files changed, 464 insertions(+), 7 deletions(-)
>  create mode 100644 fs/xfs/xfs_verity.c
>  create mode 100644 fs/xfs/xfs_verity.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index f8845e65cac7..8396a633b541 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -130,6 +130,7 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
>  xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
>  xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
>  xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
> +xfs-$(CONFIG_FS_VERITY)		+= xfs_verity.o
>  
>  # notify failure
>  ifeq ($(CONFIG_MEMORY_FAILURE),y)
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index ca515e8bd2ed..cde5352db9aa 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -27,6 +27,7 @@
>  #include "xfs_attr_item.h"
>  #include "xfs_xattr.h"
>  #include "xfs_parent.h"
> +#include "xfs_verity.h"
>  
>  struct kmem_cache		*xfs_attr_intent_cache;
>  
> @@ -1527,6 +1528,18 @@ xfs_attr_namecheck(
>  	if (flags & XFS_ATTR_PARENT)
>  		return xfs_parent_namecheck(mp, name, length, flags);
>  
> +	if (flags & XFS_ATTR_VERITY) {
> +		/* Merkle tree pages are stored under u64 indexes */
> +		if (length == sizeof(struct xfs_fsverity_merkle_key))
> +			return true;
> +
> +		/* Verity descriptor blocks are held in a named attribute. */
> +		if (length == XFS_VERITY_DESCRIPTOR_NAME_LEN)
> +			return true;
> +
> +		return false;
> +	}
> +
>  	/*
>  	 * MAXNAMELEN includes the trailing null, but (name/length) leave it
>  	 * out, so use >= for the length check.
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> index b51f439e4aed..f1f6aefc0420 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> @@ -30,6 +30,7 @@
>  #include "xfs_ag.h"
>  #include "xfs_errortag.h"
>  #include "xfs_health.h"
> +#include "xfs_verity.h"
>  
>  
>  /*
> @@ -519,7 +520,12 @@ xfs_attr_copy_value(
>  		return -ERANGE;
>  	}
>  
> -	if (!args->value) {
> +	/*
> +	 * We don't want to allocate memory for fs-verity Merkle tree blocks
> +	 * (fs-verity descriptor is fine though). They will be stored in
> +	 * underlying xfs_buf
> +	 */
> +	if (!args->value && !xfs_verity_merkle_block(args)) {
>  		args->value = kvmalloc(valuelen, GFP_KERNEL | __GFP_NOLOCKDEP);
>  		if (!args->value)
>  			return -ENOMEM;
> @@ -538,7 +544,14 @@ xfs_attr_copy_value(
>  	 */
>  	if (!value)
>  		return -EINVAL;
> -	memcpy(args->value, value, valuelen);
> +	/*
> +	 * We won't copy Merkle tree block to the args->value as we want it be
> +	 * in the xfs_buf. And we didn't allocate any memory in args->value.
> +	 */
> +	if (xfs_verity_merkle_block(args))
> +		args->value = value;
> +	else
> +		memcpy(args->value, value, valuelen);
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
> index f1b7842da809..a631ddff8068 100644
> --- a/fs/xfs/libxfs/xfs_attr_remote.c
> +++ b/fs/xfs/libxfs/xfs_attr_remote.c
> @@ -23,6 +23,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_error.h"
>  #include "xfs_health.h"
> +#include "xfs_verity.h"
>  
>  #define ATTR_RMTVALUE_MAPSIZE	1	/* # of map entries at once */
>  
> @@ -404,11 +405,10 @@ xfs_attr_rmtval_get(
>  	ASSERT(args->rmtvaluelen == args->valuelen);
>  
>  	/*
> -	 * We also check for _OP_BUFFER as we want to trigger on
> -	 * verity blocks only, not on verity_descriptor
> +	 * For fs-verity we want additional space in the xfs_buf. This space is
> +	 * used to copy xattr value without leaf headers (crc header).
>  	 */
> -	if (args->attr_filter & XFS_ATTR_VERITY &&
> -			args->op_flags & XFS_DA_OP_BUFFER)
> +	if (xfs_verity_merkle_block(args))
>  		flags = XBF_DOUBLE_ALLOC;
>  
>  	valuelen = args->rmtvaluelen;
> diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h
> index 28d4ac6fa156..c30c3c253191 100644
> --- a/fs/xfs/libxfs/xfs_da_format.h
> +++ b/fs/xfs/libxfs/xfs_da_format.h
> @@ -914,4 +914,31 @@ struct xfs_parent_name_rec {
>   */
>  #define XFS_PARENT_DIRENT_NAME_MAX_SIZE		(MAXNAMELEN - 1)
>  
> +/*
> + * fs-verity attribute name format
> + *
> + * Merkle tree blocks are stored under extended attributes of the inode. The
> + * name of the attributes are offsets into merkle tree.
> + */
> +struct xfs_fsverity_merkle_key {
> +	__be64 merkleoff;

Nit: tab here ^ for consistency.

> +};
> +
> +static inline void
> +xfs_fsverity_merkle_key_to_disk(struct xfs_fsverity_merkle_key *key, loff_t pos)
> +{
> +	key->merkleoff = cpu_to_be64(pos);
> +}
> +
> +static inline loff_t
> +xfs_fsverity_name_to_block_offset(unsigned char *name)
> +{
> +	struct xfs_fsverity_merkle_key key = {
> +		.merkleoff = *(__be64 *)name
> +	};
> +	loff_t offset = be64_to_cpu(key.merkleoff);
> +
> +	return offset;
> +}

These helpers should go in xfs_verity.c or wherever they're actually
used.  They don't need to be in this header file.
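
(An aside on the encoding itself: serializing the offset as a __be64
gives the binary xattr name a fixed on-disk representation independent
of host byte order. A self-contained userspace sketch of the
round-trip, with hypothetical stand-ins for the kernel helpers and
types:)

```c
#include <stdint.h>

/* Userspace stand-in for struct xfs_fsverity_merkle_key: the Merkle
 * tree offset serialized big-endian, byte by byte, so the resulting
 * xattr name is identical on little- and big-endian hosts. */
struct merkle_key { unsigned char be[8]; };

/* Analogous to xfs_fsverity_merkle_key_to_disk(). */
static void merkle_key_to_disk(struct merkle_key *key, uint64_t pos)
{
	for (int i = 7; i >= 0; i--) {
		key->be[i] = (unsigned char)(pos & 0xff);
		pos >>= 8;
	}
}

/* Analogous to xfs_fsverity_name_to_block_offset(): decode the
 * big-endian name bytes back into a native offset. */
static uint64_t merkle_name_to_offset(const unsigned char *name)
{
	uint64_t pos = 0;

	for (int i = 0; i < 8; i++)
		pos = (pos << 8) | name[i];
	return pos;
}
```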

>  #endif /* __XFS_DA_FORMAT_H__ */
> diff --git a/fs/xfs/libxfs/xfs_ondisk.h b/fs/xfs/libxfs/xfs_ondisk.h
> index 81885a6a028e..39209943c474 100644
> --- a/fs/xfs/libxfs/xfs_ondisk.h
> +++ b/fs/xfs/libxfs/xfs_ondisk.h
> @@ -194,6 +194,10 @@ xfs_check_ondisk_structs(void)
>  	XFS_CHECK_VALUE(XFS_DQ_BIGTIME_EXPIRY_MIN << XFS_DQ_BIGTIME_SHIFT, 4);
>  	XFS_CHECK_VALUE(XFS_DQ_BIGTIME_EXPIRY_MAX << XFS_DQ_BIGTIME_SHIFT,
>  			16299260424LL);
> +
> +	/* fs-verity descriptor xattr name */
> +	XFS_CHECK_VALUE(strlen(XFS_VERITY_DESCRIPTOR_NAME),
> +			XFS_VERITY_DESCRIPTOR_NAME_LEN);

strlen works within a macro?

I would've thought this would be:

	XFS_CHECK_VALUE(sizeof(XFS_VERITY_DESCRIPTOR_NAME) ==
			XFS_VERITY_DESCRIPTOR_NAME_LEN + 1);
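
(For reference: `sizeof` on a string literal is an integer constant
expression that counts the trailing NUL, whereas `strlen()` is only
compile-time when the compiler folds the builtin. A minimal userspace
sketch of the suggested check, using the names from the patch:)

```c
#include <string.h>

#define XFS_VERITY_DESCRIPTOR_NAME	"vdesc"
#define XFS_VERITY_DESCRIPTOR_NAME_LEN	5

/* sizeof counts the trailing NUL, so it is always length + 1.  This
 * is a guaranteed constant expression, portable to any C11 compiler. */
_Static_assert(sizeof(XFS_VERITY_DESCRIPTOR_NAME) ==
	       XFS_VERITY_DESCRIPTOR_NAME_LEN + 1,
	       "descriptor name length mismatch");
```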

>  }
>  
>  #endif /* __XFS_ONDISK_H */
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index ab46ffb3ac19..d6664e9afa22 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -348,7 +348,8 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
>   * inactivation completes, both flags will be cleared and the inode is a
>   * plain old IRECLAIMABLE inode.
>   */
> -#define XFS_INACTIVATING	(1 << 13)
> +#define XFS_INACTIVATING		(1 << 13)
> +#define XFS_IVERITY_CONSTRUCTION	(1 << 14) /* merkle tree construction */

This (still) collides with...

>  /* Quotacheck is running but inode has not been added to quota counts. */
>  #define XFS_IQUOTAUNCHECKED	(1 << 14)

...this value here.  These are bitmasks; you MUST create a
non-conflicting value here.

You'll also have to update the xfs_iflags_* functions to take an
unsigned long.
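
(To make the collision concrete: with both flags defined as `1 << 14`,
setting or testing one is indistinguishable from the other. A small
userspace demo, with a hypothetical non-conflicting value added purely
for illustration:)

```c
/* Flag values as posted: the last two collide on bit 14. */
enum {
	XFS_INACTIVATING         = 1 << 13,
	XFS_IVERITY_CONSTRUCTION = 1 << 14,	/* from the patch */
	XFS_IQUOTAUNCHECKED      = 1 << 14,	/* pre-existing: same bit! */
};

/* A hypothetical non-conflicting assignment for the new flag. */
enum { FIXED_IVERITY_CONSTRUCTION = 1 << 15 };

/* With the posted values, an inode that is only under Merkle tree
 * construction also appears to be quota-unchecked. */
static int quota_unchecked(unsigned long iflags)
{
	return (iflags & XFS_IQUOTAUNCHECKED) != 0;
}
```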

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 9f9c35cff9bf..996e6ea91fe1 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -30,6 +30,7 @@
>  #include "xfs_filestream.h"
>  #include "xfs_quota.h"
>  #include "xfs_sysfs.h"
> +#include "xfs_verity.h"
>  #include "xfs_ondisk.h"
>  #include "xfs_rmap_item.h"
>  #include "xfs_refcount_item.h"
> @@ -1520,6 +1521,11 @@ xfs_fs_fill_super(
>  	sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ;
>  #endif
>  	sb->s_op = &xfs_super_operations;
> +#ifdef CONFIG_FS_VERITY
> +	error = fsverity_set_ops(sb, &xfs_verity_ops);
> +	if (error)
> +		return error;
> +#endif

Isn't fsverity_set_ops a nop if !CONFIG_FS_VERITY?

>  	/*
>  	 * Delay mount work if the debug hook is set. This is debug
> @@ -1729,6 +1735,10 @@ xfs_fs_fill_super(
>  		goto out_filestream_unmount;
>  	}
>  
> +	if (xfs_has_verity(mp))
> +		xfs_alert(mp,
> +	"EXPERIMENTAL fs-verity feature in use. Use at your own risk!");
> +
>  	error = xfs_mountfs(mp);
>  	if (error)
>  		goto out_filestream_unmount;
> diff --git a/fs/xfs/xfs_verity.c b/fs/xfs/xfs_verity.c
> new file mode 100644
> index 000000000000..ea31b5bf6214
> --- /dev/null
> +++ b/fs/xfs/xfs_verity.c
> @@ -0,0 +1,355 @@
> +/* SPDX-License-Identifier: GPL-2.0 */

Nit: The SPDX header for C files needs to use C++ comment style, not C.

(This is a really stupid rule decided on by the spdx bureaucrats.)

> +/*
> + * Copyright (C) 2023 Red Hat, Inc.
> + */
> +#include "xfs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_da_format.h"
> +#include "xfs_da_btree.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_inode.h"
> +#include "xfs_log_format.h"
> +#include "xfs_attr.h"
> +#include "xfs_verity.h"
> +#include "xfs_bmap_util.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans.h"
> +#include "xfs_attr_leaf.h"
> +
> +/*
> + * Make fs-verity invalidate verified status of Merkle tree block
> + */
> +static void
> +xfs_verity_put_listent(
> +	struct xfs_attr_list_context	*context,
> +	int				flags,
> +	unsigned char			*name,
> +	int				namelen,
> +	int				valuelen)
> +{
> +	struct fsverity_blockbuf	block = {
> +		.offset = xfs_fsverity_name_to_block_offset(name),
> +		.size = valuelen,
> +	};
> +	/*
> +	 * Verity descriptor is smaller than 1024; verity block min size is
> +	 * 1024. Exclude verity descriptor
> +	 */
> +	if (valuelen < 1024)
> +		return;

As Eric implied elsewhere, shouldn't you be checking here if @name is
/not/ XFS_VERITY_DESCRIPTOR_NAME?

> +
> +	fsverity_invalidate_block(VFS_I(context->dp), &block);
> +}
> +
> +/*
> + * Iterate over extended attributes in the bp to invalidate Merkle tree blocks
> + */
> +static int
> +xfs_invalidate_blocks(
> +	struct xfs_inode	*ip,
> +	struct xfs_buf		*bp)
> +{
> +	struct xfs_attr_list_context context;
> +
> +	context.dp = ip;
> +	context.resynch = 0;
> +	context.buffer = NULL;
> +	context.bufsize = 0;
> +	context.firstu = 0;
> +	context.attr_filter = XFS_ATTR_VERITY;
> +	context.put_listent = xfs_verity_put_listent;
> +
> +	return xfs_attr3_leaf_list_int(bp, &context);
> +}

Why is it necessary to invalidate every block in the merkle tree?

OH.  This is what happens any time that xfs_read_merkle_tree_block
doesn't find a DOUBLE_ALLOC buffer.  That's unfortunate, because the
merkle tree could be large enough that a double-alloc'd xattr buffer
gets purged, a subsequent xattr read from the same block nets us a
!double buffer, and then fsverity wants to read the block again.

Does my xblobc rework eliminate the need for all of this?

> +
> +static int
> +xfs_get_verity_descriptor(
> +	struct inode		*inode,
> +	void			*buf,
> +	size_t			buf_size)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	int			error = 0;
> +	struct xfs_da_args	args = {
> +		.dp		= ip,
> +		.attr_filter	= XFS_ATTR_VERITY,
> +		.name		= (const uint8_t *)XFS_VERITY_DESCRIPTOR_NAME,
> +		.namelen	= XFS_VERITY_DESCRIPTOR_NAME_LEN,
> +		.value		= buf,
> +		.valuelen	= buf_size,
> +	};
> +
> +	/*
> +	 * The fact that (returned attribute size) == (provided buf_size) is
> +	 * checked by xfs_attr_copy_value() (returns -ERANGE)

Can verity handle ERANGE?  Or is that more of an EFSCORRUPTED?

> +	 */
> +	error = xfs_attr_get(&args);
> +	if (error)
> +		return error;
> +
> +	return args.valuelen;
> +}
> +
> +static int
> +xfs_begin_enable_verity(
> +	struct file	    *filp)
> +{
> +	struct inode	    *inode = file_inode(filp);
> +	struct xfs_inode    *ip = XFS_I(inode);
> +	int		    error = 0;
> +
> +	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);

Do we need to lock out pagecache invalidations too?

(Can't remember if I already asked that question.)

> +
> +	if (IS_DAX(inode))
> +		return -EINVAL;
> +
> +	if (xfs_iflags_test_and_set(ip, XFS_IVERITY_CONSTRUCTION))
> +		return -EBUSY;
> +
> +	return error;
> +}
> +
> +static int
> +xfs_drop_merkle_tree(
> +	struct xfs_inode		*ip,
> +	u64				merkle_tree_size,
> +	unsigned int			tree_blocksize)
> +{
> +	struct xfs_fsverity_merkle_key	name;
> +	int				error = 0;
> +	u64				offset = 0;
> +	struct xfs_da_args		args = {
> +		.dp			= ip,
> +		.whichfork		= XFS_ATTR_FORK,
> +		.attr_filter		= XFS_ATTR_VERITY,
> +		.op_flags		= XFS_DA_OP_REMOVE,
> +		.namelen		= sizeof(struct xfs_fsverity_merkle_key),
> +		/* NULL value make xfs_attr_set remove the attr */
> +		.value			= NULL,
> +	};
> +
> +	if (!merkle_tree_size)
> +		return 0;
> +
> +	args.name = (const uint8_t *)&name.merkleoff;

Can you assign this in the @args initializer?

And shouldn't that be taking the address of the entire struct:

		.name			= (const uint8_t *)&name,

not just some field within the struct?

> +	for (offset = 0; offset < merkle_tree_size; offset += tree_blocksize) {
> +		xfs_fsverity_merkle_key_to_disk(&name, offset);
> +		error = xfs_attr_set(&args);
> +		if (error)
> +			return error;
> +	}
> +
> +	args.name = (const uint8_t *)XFS_VERITY_DESCRIPTOR_NAME;
> +	args.namelen = XFS_VERITY_DESCRIPTOR_NAME_LEN;
> +	error = xfs_attr_set(&args);
> +
> +	return error;
> +}
> +
> +static int
> +xfs_end_enable_verity(
> +	struct file		*filp,
> +	const void		*desc,
> +	size_t			desc_size,
> +	u64			merkle_tree_size,
> +	unsigned int		tree_blocksize)
> +{
> +	struct inode		*inode = file_inode(filp);
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_trans	*tp;
> +	struct xfs_da_args	args = {
> +		.dp		= ip,
> +		.whichfork	= XFS_ATTR_FORK,
> +		.attr_filter	= XFS_ATTR_VERITY,
> +		.attr_flags	= XATTR_CREATE,

Hmm.  How do you recover from the weird situation where DIFLAG2_VERITY
is not set on the file but there /is/ a "vdesc" in the xattrs and you go
to re-enable verity?

Or: Why is it important to fail the ->end_enable operation when we have
all the information we need to reset the verity descriptor?

> +		.name		= (const uint8_t *)XFS_VERITY_DESCRIPTOR_NAME,
> +		.namelen	= XFS_VERITY_DESCRIPTOR_NAME_LEN,
> +		.value		= (void *)desc,
> +		.valuelen	= desc_size,
> +	};
> +	int			error = 0;
> +
> +	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
> +
> +	/* fs-verity failed, just cleanup */
> +	if (desc == NULL)
> +		goto out;
> +
> +	error = xfs_attr_set(&args);
> +	if (error)
> +		goto out;

Case in point: if the system persists this transaction to the log and
crashes at this point, a subsequent re-run of 'fsverity enable' will
fail.

Should the attr_set of the verity descriptor and the setting of
DIFLAG2_VERITY be in the same transaction?

> +	/* Set fsverity inode flag */
> +	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_ichange,
> +			0, 0, false, &tp);
> +	if (error)
> +		goto out;
> +
> +	/*
> +	 * Ensure that we've persisted the verity information before we enable
> +	 * it on the inode and tell the caller we have sealed the inode.
> +	 */
> +	ip->i_diflags2 |= XFS_DIFLAG2_VERITY;
> +
> +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +	xfs_trans_set_sync(tp);

Why is it necessary to commit this transaction synchronously?  Is this
required by the FS_IOC_ENABLE_VERITY API?

> +
> +	error = xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +	if (!error)
> +		inode->i_flags |= S_VERITY;
> +
> +out:
> +	if (error)
> +		WARN_ON_ONCE(xfs_drop_merkle_tree(ip, merkle_tree_size,
> +						  tree_blocksize));
> +
> +	xfs_iflags_clear(ip, XFS_IVERITY_CONSTRUCTION);
> +	return error;
> +}
> +
> +static int
> +xfs_read_merkle_tree_block(
> +	struct inode			*inode,
> +	u64				pos,
> +	struct fsverity_blockbuf	*block,
> +	unsigned int			log_blocksize,
> +	u64				ra_bytes)
> +{
> +	struct xfs_inode		*ip = XFS_I(inode);
> +	struct xfs_fsverity_merkle_key	name;
> +	int				error = 0;
> +	struct xfs_da_args		args = {
> +		.dp			= ip,
> +		.attr_filter		= XFS_ATTR_VERITY,
> +		.op_flags		= XFS_DA_OP_BUFFER,
> +		.namelen		= sizeof(struct xfs_fsverity_merkle_key),
> +		.valuelen		= (1 << log_blocksize),
> +	};
> +	xfs_fsverity_merkle_key_to_disk(&name, pos);
> +	args.name = (const uint8_t *)&name.merkleoff;

		.name			= (const uint8_t *)&name,

same as above.

> +
> +	error = xfs_attr_get(&args);
> +	if (error)
> +		goto out;
> +
> +	if (!args.valuelen)
> +		return -ENODATA;
> +
> +	block->kaddr = args.value;
> +	block->offset = pos;
> +	block->size = args.valuelen;
> +	block->context = args.bp;
> +
> +	/*
> +	 * Memory barriers are used to force operation ordering of clearing
> +	 * bitmap in fsverity_invalidate_block() and setting XBF_VERITY_SEEN
> +	 * flag.
> +	 *
> +	 * Multiple threads may execute this code concurrently on the same block.
> +	 * This is safe because we use memory barriers to ensure that if a
> +	 * thread sees XBF_VERITY_SEEN, then fsverity bitmap is already up to
> +	 * date.
> +	 *
> +	 * Invalidating block in a bitmap again at worst causes a hash block to
> +	 * be verified redundantly. That event should be very rare, so it's not
> +	 * worth using a lock to avoid.
> +	 */
> +	if (!(args.bp->b_flags & XBF_VERITY_SEEN)) {
> +		/*
> +		 * A read memory barrier is needed here to give ACQUIRE
> +		 * semantics to the above check.
> +		 */
> +		smp_rmb();
> +		/*
> +		 * fs-verity is not aware if buffer was evicted from the memory.
> +		 * Make fs-verity invalidate verfied status of all blocks in the
> +		 * buffer.
> +		 *
> +		 * Single extended attribute can contain multiple Merkle tree
> +		 * blocks:
> +		 * - leaf with inline data -> invalidate all blocks in the leaf
> +		 * - remote value -> invalidate single block
> +		 *
> +		 * For example, leaf on 64k system with 4k/1k filesystem will
> +		 * contain multiple Merkle tree blocks.
> +		 *
> +		 * Only remote value buffers would have XBF_DOUBLE_ALLOC flag
> +		 */
> +		if (args.bp->b_flags & XBF_DOUBLE_ALLOC)
> +			fsverity_invalidate_block(inode, block);
> +		else {
> +			error = xfs_invalidate_blocks(ip, args.bp);
> +			if (error)
> +				goto out;
> +		}
> +	}
> +
> +	/*
> +	 * A write memory barrier is needed here to give RELEASE
> +	 * semantics to the below flag.
> +	 */
> +	smp_wmb();
> +	args.bp->b_flags |= XBF_VERITY_SEEN;

Eww.  See my version of this in:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-07

> +	return error;
> +
> +out:
> +	kvfree(args.value);
> +	if (args.bp)
> +		xfs_buf_rele(args.bp);
> +	return error;
> +}
> +
> +static int
> +xfs_write_merkle_tree_block(
> +	struct inode		*inode,
> +	const void		*buf,
> +	u64			pos,
> +	unsigned int		size)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_fsverity_merkle_key	name;
> +	struct xfs_da_args	args = {
> +		.dp		= ip,
> +		.whichfork	= XFS_ATTR_FORK,
> +		.attr_filter	= XFS_ATTR_VERITY,
> +		.attr_flags	= XATTR_CREATE,
> +		.namelen	= sizeof(struct xfs_fsverity_merkle_key),
> +		.value		= (void *)buf,
> +		.valuelen	= size,
> +	};
> +
> +	xfs_fsverity_merkle_key_to_disk(&name, pos);
> +	args.name = (const uint8_t *)&name.merkleoff;

		.name			= (const uint8_t *)&name,

same as above.

> +
> +	return xfs_attr_set(&args);
> +}
> +
> +static void
> +xfs_drop_block(
> +	struct fsverity_blockbuf	*block)
> +{
> +	struct xfs_buf			*bp;
> +
> +	ASSERT(block != NULL);
> +	bp = (struct xfs_buf *)block->context;
> +	ASSERT(bp->b_flags & XBF_VERITY_SEEN);
> +
> +	xfs_buf_rele(bp);
> +
> +	kunmap_local(block->kaddr);
> +}
> +
> +const struct fsverity_operations xfs_verity_ops = {
> +	.begin_enable_verity		= &xfs_begin_enable_verity,
> +	.end_enable_verity		= &xfs_end_enable_verity,
> +	.get_verity_descriptor		= &xfs_get_verity_descriptor,
> +	.read_merkle_tree_block		= &xfs_read_merkle_tree_block,
> +	.write_merkle_tree_block	= &xfs_write_merkle_tree_block,

Not sure why all these function names don't start with "xfs_verity_"?

--D

> +	.drop_block			= &xfs_drop_block,
> +};
> diff --git a/fs/xfs/xfs_verity.h b/fs/xfs/xfs_verity.h
> new file mode 100644
> index 000000000000..deea15cd3cc5
> --- /dev/null
> +++ b/fs/xfs/xfs_verity.h
> @@ -0,0 +1,33 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2022 Red Hat, Inc.
> + */
> +#ifndef __XFS_VERITY_H__
> +#define __XFS_VERITY_H__
> +
> +#include "xfs.h"
> +#include "xfs_da_format.h"
> +#include "xfs_da_btree.h"
> +#include <linux/fsverity.h>
> +
> +#define XFS_VERITY_DESCRIPTOR_NAME "vdesc"
> +#define XFS_VERITY_DESCRIPTOR_NAME_LEN 5
> +
> +static inline bool
> +xfs_verity_merkle_block(
> +		struct xfs_da_args *args)
> +{
> +	if (!(args->attr_filter & XFS_ATTR_VERITY))
> +		return false;
> +
> +	if (!(args->op_flags & XFS_DA_OP_BUFFER))
> +		return false;
> +
> +	return true;
> +}
> +
> +#ifdef CONFIG_FS_VERITY
> +extern const struct fsverity_operations xfs_verity_ops;
> +#endif	/* CONFIG_FS_VERITY */
> +
> +#endif	/* __XFS_VERITY_H__ */
> -- 
> 2.42.0
> 
> 


* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-04 23:39   ` Eric Biggers
  2024-03-07 22:06     ` Darrick J. Wong
@ 2024-03-07 23:38     ` Dave Chinner
  2024-03-07 23:45       ` Darrick J. Wong
  2024-03-07 23:59       ` Eric Biggers
  1 sibling, 2 replies; 94+ messages in thread
From: Dave Chinner @ 2024-03-07 23:38 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, djwong, Christoph Hellwig

On Mon, Mar 04, 2024 at 03:39:27PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:33PM +0100, Andrey Albershteyn wrote:
> > +#ifdef CONFIG_FS_VERITY
> > +struct iomap_fsverity_bio {
> > +	struct work_struct	work;
> > +	struct bio		bio;
> > +};
> 
> Maybe leave a comment above that mentions that bio must be the last field.
> 
> > @@ -471,6 +529,7 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
> >   * iomap_readahead - Attempt to read pages from a file.
> >   * @rac: Describes the pages to be read.
> >   * @ops: The operations vector for the filesystem.
> > + * @wq: Workqueue for post-I/O processing (only need for fsverity)
> 
> This should not be here.
> 
> > +#define IOMAP_POOL_SIZE		(4 * (PAGE_SIZE / SECTOR_SIZE))
> > +
> >  static int __init iomap_init(void)
> >  {
> > -	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> > -			   offsetof(struct iomap_ioend, io_inline_bio),
> > -			   BIOSET_NEED_BVECS);
> > +	int error;
> > +
> > +	error = bioset_init(&iomap_ioend_bioset, IOMAP_POOL_SIZE,
> > +			    offsetof(struct iomap_ioend, io_inline_bio),
> > +			    BIOSET_NEED_BVECS);
> > +#ifdef CONFIG_FS_VERITY
> > +	if (error)
> > +		return error;
> > +
> > +	error = bioset_init(&iomap_fsverity_bioset, IOMAP_POOL_SIZE,
> > +			    offsetof(struct iomap_fsverity_bio, bio),
> > +			    BIOSET_NEED_BVECS);
> > +	if (error)
> > +		bioset_exit(&iomap_ioend_bioset);
> > +#endif
> > +	return error;
> >  }
> >  fs_initcall(iomap_init);
> 
> This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> for these bios, regardless of whether they end up being used or not.  When
> PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.

Honestly: I don't think we care about this.

Indeed, if a system is configured with iomap and does not use XFS,
GFS2 or zonefs, it's not going to be using the iomap_ioend_bioset at
all, either. So by your definition that's just wasted memory, too, on
systems that don't use any of these three filesystems. But we
aren't going to make that one conditional, because the complexity
and overhead of checks that never trigger after the first IO doesn't
actually provide any return for the cost of ongoing maintenance.

Similarly, once XFS has fsverity enabled, it's going to get used all
over the place in the container and VM world. So we are *always*
going to want this bioset to be initialised on these production
systems, so it falls into the same category as the
iomap_ioend_bioset. That is, if you don't want that overhead, turn
the functionality off via CONFIG file options.

> How about allocating the pool when it's known it's actually going to be used,
> similar to what fs/crypto/ does for fscrypt_bounce_page_pool?  For example,
> there could be a flag in struct fsverity_operations that says whether filesystem
> wants the iomap fsverity bioset, and when fs/verity/ sets up the fsverity_info
> for any file for the first time since boot, it could call into fs/iomap/ to
> initialize the iomap fsverity bioset if needed.
> 
> BTW, errors from builtin initcalls such as iomap_init() get ignored.  So the
> error handling logic above does not really work as may have been intended.

That's not an iomap problem - lots of fs_initcall() functions return
errors because they failed things like memory allocation. If this is
actually problem, then fix the core init infrastructure to handle
errors properly, eh?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-07 23:38     ` Dave Chinner
@ 2024-03-07 23:45       ` Darrick J. Wong
  2024-03-08  0:47         ` Dave Chinner
  2024-03-07 23:59       ` Eric Biggers
  1 sibling, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-07 23:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Biggers, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, Christoph Hellwig

On Fri, Mar 08, 2024 at 10:38:06AM +1100, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 03:39:27PM -0800, Eric Biggers wrote:
> > On Mon, Mar 04, 2024 at 08:10:33PM +0100, Andrey Albershteyn wrote:
> > > +#ifdef CONFIG_FS_VERITY
> > > +struct iomap_fsverity_bio {
> > > +	struct work_struct	work;
> > > +	struct bio		bio;
> > > +};
> > 
> > Maybe leave a comment above that mentions that bio must be the last field.
> > 
> > > @@ -471,6 +529,7 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
> > >   * iomap_readahead - Attempt to read pages from a file.
> > >   * @rac: Describes the pages to be read.
> > >   * @ops: The operations vector for the filesystem.
> > > + * @wq: Workqueue for post-I/O processing (only need for fsverity)
> > 
> > This should not be here.
> > 
> > > +#define IOMAP_POOL_SIZE		(4 * (PAGE_SIZE / SECTOR_SIZE))
> > > +
> > >  static int __init iomap_init(void)
> > >  {
> > > -	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> > > -			   offsetof(struct iomap_ioend, io_inline_bio),
> > > -			   BIOSET_NEED_BVECS);
> > > +	int error;
> > > +
> > > +	error = bioset_init(&iomap_ioend_bioset, IOMAP_POOL_SIZE,
> > > +			    offsetof(struct iomap_ioend, io_inline_bio),
> > > +			    BIOSET_NEED_BVECS);
> > > +#ifdef CONFIG_FS_VERITY
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	error = bioset_init(&iomap_fsverity_bioset, IOMAP_POOL_SIZE,
> > > +			    offsetof(struct iomap_fsverity_bio, bio),
> > > +			    BIOSET_NEED_BVECS);
> > > +	if (error)
> > > +		bioset_exit(&iomap_ioend_bioset);
> > > +#endif
> > > +	return error;
> > >  }
> > >  fs_initcall(iomap_init);
> > 
> > This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> > for these bios, regardless of whether they end up being used or not.  When
> > PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> > over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> > scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.
> 
> Honestly: I don't think we care about this.
> 
> Indeed, if a system is configured with iomap and does not use XFS,
> GFS2 or zonefs, it's not going to be using the iomap_ioend_bioset at
> all, either. So by your definition that's just wasted memory, too, on
> systems that don't use any of these three filesystems. But we
> aren't going to make that one conditional, because the complexity
> and overhead of checks that never trigger after the first IO doesn't
> actually provide any return for the cost of ongoing maintenance.

I've occasionally wondered if I shouldn't try harder to make iomap a
module so that nobody pays the cost of having it until they load an fs
driver that uses it.

> Similarly, once XFS has fsverity enabled, it's going to get used all
> over the place in the container and VM world. So we are *always*
> going to want this bioset to be initialised on these production
> systems, so it falls into the same category as the
> iomap_ioend_bioset. That is, if you don't want that overhead, turn
> the functionality off via CONFIG file options.

<nod> If someone really cares I guess we could employ that weird
cmpxchg trick that the dio workqueue setup function uses.  But... let's
let someone complain first, eh? ;)
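
(The "cmpxchg trick" referred to here is the once-only lazy init
pattern from sb_init_dio_done_wq(): every racing caller allocates, but
only the first to publish wins and the losers free their copy. A
userspace sketch of the pattern, with C11 atomics and a hypothetical
resource type standing in for the kernel's cmpxchg() and workqueue:)

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the resource being lazily set up. */
struct wq { int dummy; };

static _Atomic(struct wq *) s_read_done_wq;

/* Safe to call from many threads: exactly one allocation is ever
 * published; losing callers free their own copy and use the winner's. */
static struct wq *get_read_done_wq(void)
{
	struct wq *wq = atomic_load(&s_read_done_wq);
	struct wq *expected = NULL;

	if (wq)
		return wq;
	wq = calloc(1, sizeof(*wq));
	if (!wq)
		return NULL;
	if (!atomic_compare_exchange_strong(&s_read_done_wq, &expected, wq)) {
		free(wq);		/* another caller won the race... */
		wq = expected;		/* ...so use their instance */
	}
	return wq;
}
```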

> > How about allocating the pool when it's known it's actually going to be used,
> > similar to what fs/crypto/ does for fscrypt_bounce_page_pool?  For example,
> > there could be a flag in struct fsverity_operations that says whether filesystem
> > wants the iomap fsverity bioset, and when fs/verity/ sets up the fsverity_info
> > for any file for the first time since boot, it could call into fs/iomap/ to
> > initialize the iomap fsverity bioset if needed.
> > 
> > BTW, errors from builtin initcalls such as iomap_init() get ignored.  So the
> > error handling logic above does not really work as may have been intended.
> 
> That's not an iomap problem - lots of fs_initcall() functions return
> errors because they failed things like memory allocation. If this is
> actually problem, then fix the core init infrastructure to handle
> errors properly, eh?

Soooo ... the kernel crashes if iomap cannot allocate its bioset?

I guess that's handling it, FSVO handle. :P

(Is that why I can't boot with mem=60M anymore?)

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-07 23:38     ` Dave Chinner
  2024-03-07 23:45       ` Darrick J. Wong
@ 2024-03-07 23:59       ` Eric Biggers
  2024-03-08  1:20         ` Dave Chinner
  1 sibling, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-07 23:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, djwong, Christoph Hellwig

On Fri, Mar 08, 2024 at 10:38:06AM +1100, Dave Chinner wrote:
> > This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> > for these bios, regardless of whether they end up being used or not.  When
> > PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> > over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> > scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.
> 
> Honestly: I don't think we care about this.
> 
> Indeed, if a system is configured with iomap and does not use XFS,
> GFS2 or zonefs, it's not going to be using the iomap_ioend_bioset at
> all, either. So by your definition that's just wasted memory, too, on
> systems that don't use any of these three filesystems. But we
> aren't going to make that one conditional, because the complexity
> and overhead of checks that never trigger after the first IO doesn't
> actually provide any return for the cost of ongoing maintenance.
> 
> Similarly, once XFS has fsverity enabled, it's going to get used all
> over the place in the container and VM world. So we are *always*
> going to want this bioset to be initialised on these production
> systems, so it falls into the same category as the
> iomap_ioend_bioset. That is, if you don't want that overhead, turn
> the functionality off via CONFIG file options.

"We're already wasting memory, therefore it's fine to waste more" isn't a great
argument.

iomap_ioend_bioset is indeed also problematic, though it's also a bit different
in that it's needed for basic filesystem functionality, not an optional feature.
Yes, ext4 and f2fs don't actually use iomap for buffered reads, but IIUC at
least there's a plan to do that.  Meanwhile there's no plan for fsverity to be
used on every system that may be using a filesystem that supports it.

We should take care not to over-estimate how many users use optional features.
Linux distros turn on most kconfig options just in case, but often the common
case is that the option is on but the feature is not used.

> > How about allocating the pool when it's known it's actually going to be used,
> > similar to what fs/crypto/ does for fscrypt_bounce_page_pool?  For example,
> > there could be a flag in struct fsverity_operations that says whether filesystem
> > wants the iomap fsverity bioset, and when fs/verity/ sets up the fsverity_info
> > for any file for the first time since boot, it could call into fs/iomap/ to
> > initialize the iomap fsverity bioset if needed.
> > 
> > BTW, errors from builtin initcalls such as iomap_init() get ignored.  So the
> > error handling logic above does not really work as may have been intended.
> 
> That's not an iomap problem - lots of fs_initcall() functions return
> errors because they failed things like memory allocation. If this is
> > actually a problem, then fix the core init infrastructure to handle
> errors properly, eh?

What does "properly" mean?  Even if init/main.c did something with the returned
error code, individual subsystems still need to know what behavior they want and
act accordingly.  Do they want to panic the kernel, or do they want to fall back
gracefully with degraded functionality.  If subsystems go with the panic (like
fsverity_init() does these days), then there is no need for doing any cleanup on
errors like this patchset does.

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-07 23:45       ` Darrick J. Wong
@ 2024-03-08  0:47         ` Dave Chinner
  0 siblings, 0 replies; 94+ messages in thread
From: Dave Chinner @ 2024-03-08  0:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Eric Biggers, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, Christoph Hellwig

On Thu, Mar 07, 2024 at 03:45:22PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 08, 2024 at 10:38:06AM +1100, Dave Chinner wrote:
> > On Mon, Mar 04, 2024 at 03:39:27PM -0800, Eric Biggers wrote:
> > > On Mon, Mar 04, 2024 at 08:10:33PM +0100, Andrey Albershteyn wrote:
> > > > +#ifdef CONFIG_FS_VERITY
> > > > +struct iomap_fsverity_bio {
> > > > +	struct work_struct	work;
> > > > +	struct bio		bio;
> > > > +};
> > > 
> > > Maybe leave a comment above that mentions that bio must be the last field.
> > > 
> > > > @@ -471,6 +529,7 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
> > > >   * iomap_readahead - Attempt to read pages from a file.
> > > >   * @rac: Describes the pages to be read.
> > > >   * @ops: The operations vector for the filesystem.
> > > > + * @wq: Workqueue for post-I/O processing (only need for fsverity)
> > > 
> > > This should not be here.
> > > 
> > > > +#define IOMAP_POOL_SIZE		(4 * (PAGE_SIZE / SECTOR_SIZE))
> > > > +
> > > >  static int __init iomap_init(void)
> > > >  {
> > > > -	return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
> > > > -			   offsetof(struct iomap_ioend, io_inline_bio),
> > > > -			   BIOSET_NEED_BVECS);
> > > > +	int error;
> > > > +
> > > > +	error = bioset_init(&iomap_ioend_bioset, IOMAP_POOL_SIZE,
> > > > +			    offsetof(struct iomap_ioend, io_inline_bio),
> > > > +			    BIOSET_NEED_BVECS);
> > > > +#ifdef CONFIG_FS_VERITY
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	error = bioset_init(&iomap_fsverity_bioset, IOMAP_POOL_SIZE,
> > > > +			    offsetof(struct iomap_fsverity_bio, bio),
> > > > +			    BIOSET_NEED_BVECS);
> > > > +	if (error)
> > > > +		bioset_exit(&iomap_ioend_bioset);
> > > > +#endif
> > > > +	return error;
> > > >  }
> > > >  fs_initcall(iomap_init);
> > > 
> > > This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> > > for these bios, regardless of whether they end up being used or not.  When
> > > PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> > > over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> > > scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.
> > 
> > Honestly: I don't think we care about this.
> > 
> > Indeed, if a system is configured with iomap and does not use XFS,
> > GFS2 or zonefs, it's not going to be using the iomap_ioend_bioset at
> > all, either. So by your definition that's just wasted memory, too, on
> > systems that don't use any of these three filesystems. But we
> > aren't going to make that one conditional, because the complexity
> > and overhead of checks that never trigger after the first IO doesn't
> > actually provide any return for the cost of ongoing maintenance.
> 
> I've occasionally wondered if I shouldn't try harder to make iomap a
> module so that nobody pays the cost of having it until they load an fs
> driver that uses it.

Not sure it is worth that effort. e.g. the block device code is
already using iomap (when bufferheads are turned off) so the future
reality is going to be that iomap will always be required when
CONFIG_BLOCK=y.

> > > How about allocating the pool when it's known it's actually going to be used,
> > > similar to what fs/crypto/ does for fscrypt_bounce_page_pool?  For example,
> > > there could be a flag in struct fsverity_operations that says whether filesystem
> > > wants the iomap fsverity bioset, and when fs/verity/ sets up the fsverity_info
> > > for any file for the first time since boot, it could call into fs/iomap/ to
> > > initialize the iomap fsverity bioset if needed.
> > > 
> > > BTW, errors from builtin initcalls such as iomap_init() get ignored.  So the
> > > error handling logic above does not really work as may have been intended.
> > 
> > That's not an iomap problem - lots of fs_initcall() functions return
> > errors because they failed things like memory allocation. If this is
> > actually a problem, then fix the core init infrastructure to handle
> > errors properly, eh?
> 
> Soooo ... the kernel crashes if iomap cannot allocate its bioset?
>
> I guess that's handling it, FSVO handle. :P

Yup, the kernel will crash if any fs_initcall() fails to allocate
the memory it needs and that subsystem is then used. So the
workaround for that is that a bunch of these initcalls simply use
panic() calls for error handling. :(

> (Is that why I can't boot with mem=60M anymore?)

Something else is probably panic()ing on ENOMEM... :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-07 23:59       ` Eric Biggers
@ 2024-03-08  1:20         ` Dave Chinner
  2024-03-08  3:16           ` Eric Biggers
  2024-03-08  3:22           ` Darrick J. Wong
  0 siblings, 2 replies; 94+ messages in thread
From: Dave Chinner @ 2024-03-08  1:20 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, djwong, Christoph Hellwig

On Thu, Mar 07, 2024 at 03:59:07PM -0800, Eric Biggers wrote:
> On Fri, Mar 08, 2024 at 10:38:06AM +1100, Dave Chinner wrote:
> > > This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> > > for these bios, regardless of whether they end up being used or not.  When
> > > PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> > > over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> > > scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.
> > 
> > Honestly: I don't think we care about this.
> > 
> > Indeed, if a system is configured with iomap and does not use XFS,
> > GFS2 or zonefs, it's not going to be using the iomap_ioend_bioset at
> > all, either. So by your definition that's just wasted memory, too, on
> > systems that don't use any of these three filesystems. But we
> > aren't going to make that one conditional, because the complexity
> > and overhead of checks that never trigger after the first IO doesn't
> > actually provide any return for the cost of ongoing maintenance.
> > 
> > Similarly, once XFS has fsverity enabled, it's going to get used all
> > over the place in the container and VM world. So we are *always*
> > going to want this bioset to be initialised on these production
> > systems, so it falls into the same category as the
> > iomap_ioend_bioset. That is, if you don't want that overhead, turn
> > the functionality off via CONFIG file options.
> 
> "We're already wasting memory, therefore it's fine to waste more" isn't a great
> argument.

Adding complexity just because -you- think the memory is wasted
isn't any better argument. I don't think the memory is wasted, and I
expect that fsverity will end up in -very- wide use across
enterprise production systems as container/vm image build
infrastructure moves towards composefs-like algorithms that have a
hard dependency on fsverity functionality being present in the host
filesystems.

> iomap_ioend_bioset is indeed also problematic, though it's also a bit different
> in that it's needed for basic filesystem functionality, not an optional feature.
> Yes, ext4 and f2fs don't actually use iomap for buffered reads, but IIUC at
> least there's a plan to do that.  Meanwhile there's no plan for fsverity to be
> used on every system that may be using a filesystem that supports it.

That's not the case I see coming for XFS - by the time we get this
merged and out of experimental support, the distro and application
support will already be there for endemic use of fsverity on XFS.
That's all being prototyped on ext4+fsverity right now, and you
probably don't see any of that happening.

> We should take care not to over-estimate how many users use optional features.
> Linux distros turn on most kconfig options just in case, but often the common
> case is that the option is on but the feature is not used.

I don't think that fsverity is going to be an optional feature in
future distros - we're already building core infrastructure based on
the data integrity guarantees that fsverity provides us with. IOWs,
I think fsverity support will soon become required core OS
functionality by more than one distro, and so we just don't care
about this overhead because the extra read IO bioset will always be
necessary....

> > > How about allocating the pool when it's known it's actually going to be used,
> > > similar to what fs/crypto/ does for fscrypt_bounce_page_pool?  For example,
> > > there could be a flag in struct fsverity_operations that says whether filesystem
> > > wants the iomap fsverity bioset, and when fs/verity/ sets up the fsverity_info
> > > for any file for the first time since boot, it could call into fs/iomap/ to
> > > initialize the iomap fsverity bioset if needed.
> > > 
> > > BTW, errors from builtin initcalls such as iomap_init() get ignored.  So the
> > > error handling logic above does not really work as may have been intended.
> > 
> > That's not an iomap problem - lots of fs_initcall() functions return
> > errors because they failed things like memory allocation. If this is
> > actually a problem, then fix the core init infrastructure to handle
> > errors properly, eh?
> 
> What does "properly" mean?

Having documented requirements and behaviour and then enforcing
them. There is no documentation for initcalls - they return an int,
so by convention it is expected that errors should be returned to
the caller.

There's *nothing* that says initcalls should panic() instead of
returning errors if they have a fatal error. There's nothing that
says "errors are ignored" - having them be declared as void would be
a good hint that errors can't be returned or handled.

Expecting people to cargo-cult other implementations and magically
get it right is almost guaranteed to ensure that nobody actually
gets it right the first time...

> Even if init/main.c did something with the returned
> error code, individual subsystems still need to know what behavior they want and
> act accordingly.

"return an error only on fatal initialisation failure" and then the
core code can decide if the initcall level is high enough to warrant
panic()ing the machine.

> Do they want to panic the kernel, or do they want to fall back
> > gracefully with degraded functionality?

If they can gracefully fall back to some other mechanism, then they
*haven't failed* and there's no need to return an error or panic.

> If subsystems go with the panic (like
> fsverity_init() does these days), then there is no need for doing any cleanup on
> errors like this patchset does.

s/panic/BUG()/ and read that back.

Because that's essentially what it is - error handling via BUG()
calls. We get told off for using BUG() instead of correct error
handling, yet here we are with code that does correct error handling
and we're being told to replace it with BUG() because the errors
aren't handled by the infrastructure calling our code. Gross, nasty
and really, really needs to be documented.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-07 22:46   ` Darrick J. Wong
@ 2024-03-08  1:59     ` Dave Chinner
  2024-03-08  3:31       ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Dave Chinner @ 2024-03-08  1:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, ebiggers

On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> > One of the essential ideas of fs-verity is that pages which are already
> > verified won't need to be re-verified if they are still in the page cache.
> > 
> > XFS will store Merkle tree blocks in extended file attributes. When
> > read, the extended attribute data is put into an xfs_buf.
> > 
> > fs-verity uses the PG_checked flag to track the status of the blocks in
> > the page. This flag can have two meanings - the page was re-instantiated,
> > or the only block in the page is verified.
> > 
> > However, in XFS, the data in the buffer is not aligned with xfs_buf
> > pages and we don't have a reference to these pages. Moreover, these
> > pages are released when the value is copied out in xfs_attr code. In
> > other words, we cannot directly mark the underlying xfs_buf's pages as
> > verified, as is done by fs-verity for other filesystems.
> > 
> > One way to track that these pages were processed by fs-verity is to
> > mark the buffer as verified instead. If the buffer is evicted, the
> > incore XBF_VERITY_SEEN flag is lost. When the xattr is read again,
> > xfs_attr_get() returns a new buffer without the flag. The xfs_buf's
> > flag is then used to tell fs-verity whether this buffer was cached.
> > 
> > The second state indicated by PG_checked is whether the only block in
> > the PAGE is verified. This is not the case for XFS, as there could be
> > multiple blocks in a single buffer (e.g. page size 64k, block size 4k).
> > This is handled by the fs-verity bitmap; fs-verity always uses the
> > bitmap for XFS regardless of the Merkle tree block size.
> > 
> > The meaning of the flag is that the value of the extended attribute in
> > the buffer has been processed by fs-verity.
> > 
> > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.h | 18 ++++++++++--------
> >  1 file changed, 10 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > index 73249abca968..2a73918193ba 100644
> > --- a/fs/xfs/xfs_buf.h
> > +++ b/fs/xfs/xfs_buf.h
> > @@ -24,14 +24,15 @@ struct xfs_buf;
> >  
> >  #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
> >  
> > -#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> > -#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> > -#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> > -#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > -#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> > -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> > -#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> > -#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> > +#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
> > +#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
> > +#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
> > +#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > +#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
> > +#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
> > +#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
> > +#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
> > +#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */
> 
> Yuck.  I still dislike this entire approach.
> 
> XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that
> gets loaded on behalf of a merkle tree request, then uses the extra
> space to shadow the contents of the ondisk block.  AFAICT the shadow
> doesn't get updated even if the cached data does, which sounds like a
> landmine for coherency issues.
>
> XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of
> exposing the low level buffering details of the xattr code to
> xfs_attr_get callers.
> 
> XBF_VERITY_SEEN is a layering violation because now the overall buffer
> cache can track file metadata state.  I think the reason why you need
> this state flag is because the datadev buffer target indexes based on
> physical xfs_daddr_t, whereas merkle tree blocks have their own internal
> block numbers.  You can't directly go from the merkle block number to an
> xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block
> fell out of memory.
>
> ISTR asking for a separation of these indices when I reviewed some
> previous version of this patchset.  At the time it seems to me that a
> much more efficient way to cache the merkle tree blocks would be to set
> up a direct (merkle tree block number) -> (blob of data) lookup table.
> That I don't see here.
>
> In the spirit of the recent collaboration style that I've taken with
> Christoph, I pulled your branch and started appending patches to it to
> see if the design that I'm asking for is reasonable.  As it so happens,
> I was working on a simplified version of the xfs buffer cache ("fsbuf")
> that could be used by simple filesystems to get them off of buffer
> heads.
> 
> (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> 14, and 15 become unnecessary.  However, this solution is itself grossly
> overengineered, since all we want are the following operations:
> 
> peek(key): returns an fsbuf if there's any data cached for key
> 
> get(key): returns an fsbuf for key, regardless of state
> 
> store(fsbuf, p): attach a memory buffer p to fsbuf
> 
> Then the xfs ->read_merkle_tree_block function becomes:
> 
> 	bp = peek(key)
> 	if (bp)
> 		/* return bp data up to verity */
> 
> 	p = xfs_attr_get(key)
> 	if (!p)
> 		/* error */
> 
> 	bp = get(key)
> 	store(bp, p)

Ok, that looks good - it definitely gets rid of a lot of the
nastiness, but I have to ask: why does it need to be based on
xfs_bufs?  That's just wasting 300 bytes of memory on a handle to
store a key and an opaque blob in a rhashtable.

IIUC, the key here is a sequential index, so an xarray would be a
much better choice as it doesn't require internal storage of the
key.

i.e.

	p = xa_load(key);
	if (p)
		return p;

	xfs_attr_get(key);
	if (!args->value)
		/* error */

	/*
	 * store the current value, freeing any old value that we
	 * replaced at this key. Don't care about failure to store,
	 * this is optimistic caching.
	 */
	p = xa_store(key, args->value, GFP_NOFS);
	if (p)
		kvfree(p);
	return args->value;

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-08  1:20         ` Dave Chinner
@ 2024-03-08  3:16           ` Eric Biggers
  2024-03-08  3:57             ` Darrick J. Wong
  2024-03-08  3:22           ` Darrick J. Wong
  1 sibling, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-08  3:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, djwong, Christoph Hellwig

On Fri, Mar 08, 2024 at 12:20:31PM +1100, Dave Chinner wrote:
> On Thu, Mar 07, 2024 at 03:59:07PM -0800, Eric Biggers wrote:
> > On Fri, Mar 08, 2024 at 10:38:06AM +1100, Dave Chinner wrote:
> > > > This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> > > > for these bios, regardless of whether they end up being used or not.  When
> > > > PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> > > > over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> > > > scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.
> > > 
> > > Honestly: I don't think we care about this.
> > > 
> > > Indeed, if a system is configured with iomap and does not use XFS,
> > > GFS2 or zonefs, it's not going to be using the iomap_ioend_bioset at
> > > all, either. So by you definition that's just wasted memory, too, on
> > > systems that don't use any of these three filesystems. But we
> > > aren't going to make that one conditional, because the complexity
> > > and overhead of checks that never trigger after the first IO doesn't
> > > actually provide any return for the cost of ongoing maintenance.
> > > 
> > > Similarly, once XFS has fsverity enabled, it's going to get used all
> > > over the place in the container and VM world. So we are *always*
> > > going to want this bioset to be initialised on these production
> > > systems, so it falls into the same category as the
> > > iomap_ioend_bioset. That is, if you don't want that overhead, turn
> > > the functionality off via CONFIG file options.
> > 
> > "We're already wasting memory, therefore it's fine to waste more" isn't a great
> > argument.
> 
> Adding complexity just because -you- think the memory is wasted
> isn't any better argument. I don't think the memory is wasted, and I
> expect that fsverity will end up in -very- wide use across
> enterprise production systems as container/vm image build
> infrastructure moves towards composefs-like algorithms that have a
> hard dependency on fsverity functionality being present in the host
> > filesystems.
> 
> > iomap_ioend_bioset is indeed also problematic, though it's also a bit different
> > in that it's needed for basic filesystem functionality, not an optional feature.
> > Yes, ext4 and f2fs don't actually use iomap for buffered reads, but IIUC at
> > least there's a plan to do that.  Meanwhile there's no plan for fsverity to be
> > used on every system that may be using a filesystem that supports it.
> 
> That's not the case I see coming for XFS - by the time we get this
> merged and out of experimental support, the distro and application
> support will already be there for endemic use of fsverity on XFS.
> That's all being prototyped on ext4+fsverity right now, and you
> probably don't see any of that happening.
> 
> > We should take care not to over-estimate how many users use optional features.
> > Linux distros turn on most kconfig options just in case, but often the common
> > case is that the option is on but the feature is not used.
> 
> I don't think that fsverity is going to be an optional feature in
> future distros - we're already building core infrastructure based on
> the data integrity guarantees that fsverity provides us with. IOWs,
> I think fsverity support will soon become required core OS
> functionality by more than one distro, and so we just don't care
> about this overhead because the extra read IO bioset will always be
> necessary....

That's great that you're finding fsverity to be useful!

At the same time, please do keep in mind that there's a huge variety of Linux
systems besides the "enterprise" systems using XFS that you may have in mind.

The memory allocated for iomap_fsverity_bioset, as proposed, is indeed wasted on
any system that isn't using both XFS *and* fsverity (as long as
CONFIG_FS_VERITY=y, and either CONFIG_EXT4_FS, CONFIG_F2FS_FS, or
CONFIG_BTRFS_FS is enabled too).  The amount is also quadrupled on systems that
use a 16K page size, which will become increasingly common in the future.

It may be that part of your intention for pushing back against this optimization
is to encourage filesystems to convert their buffered reads to iomap.  I'm not
sure it will actually have that effect.  First, it's not clear that it will be
technically feasible for the filesystems that support compression, btrfs and
f2fs, to convert their buffered reads to iomap.  Second, this sort of thing
reinforces an impression that iomap is targeted at enterprise systems using XFS,
and that changes that would be helpful on other types of systems are rejected.

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-08  1:20         ` Dave Chinner
  2024-03-08  3:16           ` Eric Biggers
@ 2024-03-08  3:22           ` Darrick J. Wong
  1 sibling, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-08  3:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Biggers, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, Christoph Hellwig

On Fri, Mar 08, 2024 at 12:20:31PM +1100, Dave Chinner wrote:
> On Thu, Mar 07, 2024 at 03:59:07PM -0800, Eric Biggers wrote:
> > On Fri, Mar 08, 2024 at 10:38:06AM +1100, Dave Chinner wrote:
> > > > This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> > > > for these bios, regardless of whether they end up being used or not.  When
> > > > PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> > > > over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> > > > scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.
> > > 
> > > Honestly: I don't think we care about this.
> > > 
> > > Indeed, if a system is configured with iomap and does not use XFS,
> > > GFS2 or zonefs, it's not going to be using the iomap_ioend_bioset at
> > > all, either. So by your definition that's just wasted memory, too, on
> > > systems that don't use any of these three filesystems. But we
> > > aren't going to make that one conditional, because the complexity
> > > and overhead of checks that never trigger after the first IO doesn't
> > > actually provide any return for the cost of ongoing maintenance.
> > > 
> > > Similarly, once XFS has fsverity enabled, it's going to get used all
> > > over the place in the container and VM world. So we are *always*
> > > going to want this bioset to be initialised on these production
> > > systems, so it falls into the same category as the
> > > iomap_ioend_bioset. That is, if you don't want that overhead, turn
> > > the functionality off via CONFIG file options.
> > 
> > "We're already wasting memory, therefore it's fine to waste more" isn't a great
> > argument.
> 
> Adding complexity just because -you- think the memory is wasted
> isn't any better argument. I don't think the memory is wasted, and I
> expect that fsverity will end up in -very- wide use across
> enterprise production systems as container/vm image build
> infrastructure moves towards composefs-like algorithms that have a
> hard dependency on fsverity functionality being present in the host
> filesystems.

FWIW I would like to evaluate verity for certain things, such as bitrot
detection on backup disks and the like.  My employers are probably much
more interested in the same sealed rootfs blahdyblah that everyone else
has spilt much ink over. :)

> > iomap_ioend_bioset is indeed also problematic, though it's also a bit different
> > in that it's needed for basic filesystem functionality, not an optional feature.
> > Yes, ext4 and f2fs don't actually use iomap for buffered reads, but IIUC at
> > least there's a plan to do that.  Meanwhile there's no plan for fsverity to be
> > used on every system that may be using a filesystem that supports it.
> 
> That's not the case I see coming for XFS - by the time we get this
> merged and out of experimental support, the distro and application
> support will already be there for endemic use of fsverity on XFS.
> That's all being prototyped on ext4+fsverity right now, and you
> probably don't see any of that happening.
> 
> > We should take care not to over-estimate how many users use optional features.
> > Linux distros turn on most kconfig options just in case, but often the common
> > case is that the option is on but the feature is not used.
> 
> I don't think that fsverity is going to be an optional feature in
> future distros - we're already building core infrastructure based on
> the data integrity guarantees that fsverity provides us with. IOWs,
> I think fsverity support will soon become required core OS
> functionality by more than one distro, and so we just don't care
> about this overhead because the extra read IO bioset will always be
> necessary....

Same here.

> > > > How about allocating the pool when it's known it's actually going to be used,
> > > > similar to what fs/crypto/ does for fscrypt_bounce_page_pool?  For example,
> > > > there could be a flag in struct fsverity_operations that says whether filesystem
> > > > wants the iomap fsverity bioset, and when fs/verity/ sets up the fsverity_info
> > > > for any file for the first time since boot, it could call into fs/iomap/ to
> > > > initialize the iomap fsverity bioset if needed.
> > > > 
> > > > BTW, errors from builtin initcalls such as iomap_init() get ignored.  So the
> > > > error handling logic above does not really work as may have been intended.
> > > 
> > > That's not an iomap problem - lots of fs_initcall() functions return
> > > errors because they failed things like memory allocation. If this is
> > > actually problem, then fix the core init infrastructure to handle
> > > errors properly, eh?
> > 
> > What does "properly" mean?
> 
> Having documented requirements and behaviour and then enforcing
> them. There is no documentation for initcalls - they return an int,
> so by convention it is expected that errors should be returned to
> the caller.

Agree.  I hated that weird thing that sync_filesystem did where it
mostly ignored the return values.

> There's *nothing* that says initcalls should panic() instead of
> returning errors if they have a fatal error. There's nothing that
> says "errors are ignored" - having them be declared as void would be
> a good hint that errors can't be returned or handled.
> 
> Expecting people to cargo-cult other implementations and magically
> get it right is almost guaranteed to ensure that nobody actually
> gets it right the first time...

Hrmm.  Should we have an iomap_init_succeeded() function that init_xfs_fs
and the like can call to find out if the initcall failed and refuse to
load?

Or just make it a module and then the insmod(iomap) can fail.

> > Even if init/main.c did something with the returned
> > error code, individual subsystems still need to know what behavior they want and
> > act accordingly.
> 
> "return an error only on fatal initialisation failure" and then the
> core code can decide if the initcall level is high enough to warrant
> panic()ing the machine.
> 
> > Do they want to panic the kernel, or do they want to fall back
> > gracefully with degraded functionality.
> 
> If they can gracefully fall back to some other mechanism, then they
> *haven't failed* and there's no need to return an error or panic.

I'm not sure if insmod(xfs) bailing out is all that graceful, but I
suppose it's less rude than a kernel panic :P

> > If subsystems go with the panic (like
> > fsverity_init() does these days), then there is no need for doing any cleanup on
> > errors like this patchset does.
> 
> s/panic/BUG()/ and read that back.
> 
> Because that's essentially what it is - error handling via BUG()
> calls. We get told off for using BUG() instead of correct error
> handling, yet here we are with code that does correct error handling
> and we're being told to replace it with BUG() because the errors
> aren't handled by the infrastructure calling our code. Gross, nasty
> and really, really needs to be documented.

Log an error and invoke the autorepair to see if you can really let the
smoke out? ;)

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-08  1:59     ` Dave Chinner
@ 2024-03-08  3:31       ` Darrick J. Wong
  2024-03-09 16:28         ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-08  3:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, ebiggers

On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> > On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> > > One of the essential ideas of fs-verity is that pages which are
> > > already verified won't need to be re-verified if they are still in
> > > the page cache.
> > > 
> > > XFS will store Merkle tree blocks in extended file attributes. When
> > > read, the extended attribute data is put into an xfs_buf.
> > > 
> > > fs-verity uses the PG_checked flag to track the status of the blocks
> > > in the page. This flag can have two meanings - the page was
> > > re-instantiated, and the only block in the page is verified.
> > > 
> > > However, in XFS, the data in the buffer is not aligned with the
> > > xfs_buf pages and we don't have a reference to these pages. Moreover,
> > > these pages are released when the value is copied out in the xfs_attr
> > > code. In other words, we can not directly mark the underlying
> > > xfs_buf's pages as verified as is done by fs-verity for other
> > > filesystems.
> > > 
> > > One way to track that these pages were processed by fs-verity is to
> > > mark the buffer as verified instead. If the buffer is evicted, the
> > > incore XBF_VERITY_SEEN flag is lost. When the xattr is read again,
> > > xfs_attr_get() returns a new buffer without the flag. The xfs_buf's
> > > flag is then used to tell fs-verity whether this buffer was cached or
> > > not.
> > > 
> > > The second state indicated by PG_checked is whether the only block in
> > > the PAGE is verified. This is not the case for XFS as there could be
> > > multiple blocks in a single buffer (e.g. 64k page size with a 4k
> > > block size). This is handled by the fs-verity bitmap. fs-verity
> > > always uses the bitmap for XFS regardless of the Merkle tree block
> > > size.
> > > 
> > > The meaning of the flag is that the value of the extended attribute
> > > in the buffer has been processed by fs-verity.
> > > 
> > > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > > ---
> > >  fs/xfs/xfs_buf.h | 18 ++++++++++--------
> > >  1 file changed, 10 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > index 73249abca968..2a73918193ba 100644
> > > --- a/fs/xfs/xfs_buf.h
> > > +++ b/fs/xfs/xfs_buf.h
> > > @@ -24,14 +24,15 @@ struct xfs_buf;
> > >  
> > >  #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
> > >  
> > > -#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> > > -#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> > > -#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> > > -#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > -#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> > > -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> > > -#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> > > -#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> > > +#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
> > > +#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
> > > +#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
> > > +#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > +#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
> > > +#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
> > > +#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
> > > +#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
> > > +#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */
> > 
> > Yuck.  I still dislike this entire approach.
> > 
> > XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that
> > gets loaded on behalf of a merkle tree request, then uses the extra
> > space to shadow the contents of the ondisk block.  AFAICT the shadow
> > doesn't get updated even if the cached data does, which sounds like a
> > landmine for coherency issues.
> >
> > XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of
> > exposing the low level buffering details of the xattr code to
> > xfs_attr_get callers.
> > 
> > XBF_VERITY_SEEN is a layering violation because now the overall buffer
> > cache can track file metadata state.  I think the reason why you need
> > this state flag is because the datadev buffer target indexes based on
> > physical xfs_daddr_t, whereas merkle tree blocks have their own internal
> > block numbers.  You can't directly go from the merkle block number to an
> > xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block
> > fell out of memory.
> >
> > ISTR asking for a separation of these indices when I reviewed some
> > previous version of this patchset.  At the time it seems to me that a
> > much more efficient way to cache the merkle tree blocks would be to set
> > up a direct (merkle tree block number) -> (blob of data) lookup table.
> > That I don't see here.
> >
> > In the spirit of the recent collaboration style that I've taken with
> > Christoph, I pulled your branch and started appending patches to it to
> > see if the design that I'm asking for is reasonable.  As it so happens,
> > I was working on a simplified version of the xfs buffer cache ("fsbuf")
> > that could be used by simple filesystems to get them off of buffer
> > heads.
> > 
> > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> > verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> > 14, and 15 become unnecessary.  However, this solution is itself grossly
> > overengineered, since all we want are the following operations:
> > 
> > peek(key): returns an fsbuf if there's any data cached for key
> > 
> > get(key): returns an fsbuf for key, regardless of state
> > 
> > store(fsbuf, p): attach a memory buffer p to fsbuf
> > 
> > Then the xfs ->read_merkle_tree_block function becomes:
> > 
> > 	bp = peek(key)
> > 	if (bp)
> > 		/* return bp data up to verity */
> > 
> > 	p = xfs_attr_get(key)
> > 	if (!p)
> > 		/* error */
> > 
> > 	bp = get(key)
> > 	store(bp, p)
> 
> Ok, that looks good - it definitely gets rid of a lot of the
> nastiness, but I have to ask: why does it need to be based on
> xfs_bufs?

(copying from IRC) It was still warm in my brain L2 after all the xfile
buftarg cleaning and merging that just got done a few weeks ago.   So I
went with the simplest thing I could rig up to test my ideas, and now
we're at the madly iterate until exhaustion stage. ;)

>            That's just wasting 300 bytes of memory on a handle to
> store a key and a opaque blob in a rhashtable.

Yep.  The fsbufs implementation was a lot more slender, but a bunch more
code.  I agree that I ought to go look at xarrays or something that's
more of a direct mapping as a next step.  However, I wanted to get
Andrey's feedback on this general approach first.

> IIUC, the key here is a sequential index, so an xarray would be a
> much better choice as it doesn't require internal storage of the
> key.

I wonder, what are the access patterns for merkle blobs?  Is it actually
sequential, or is more like 0 -> N -> N*N as we walk towards leaves?

Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
that done because merkle trees may some day have more than 2^32 blocks
in them?  That won't play well with things like xarrays on 32-bit
machines.

(Granted we've been talking about deprecating XFS on 32-bit for a while
now but we're not the whole world)

> i.e.
> 
> 	p = xa_load(key);
> 	if (p)
> 		return p;
> 
> 	xfs_attr_get(key);
> 	if (!args->value)
> 		/* error */
> 
> 	/*
> 	 * store the current value, freeing any old value that we
> 	 * replaced at this key. Don't care about failure to store,
> 	 * this is optimistic caching.
> 	 */
> 	p = xa_store(key, args->value, GFP_NOFS);
> 	if (p)
> 		kvfree(p);
> 	return args->value;

Attractive.  Will have to take a look at that tomorrow.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-07 22:02       ` Eric Biggers
@ 2024-03-08  3:46         ` Darrick J. Wong
  2024-03-08  4:40           ` Eric Biggers
  2024-03-08 21:34           ` Dave Chinner
  0 siblings, 2 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-08  3:46 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 02:02:24PM -0800, Eric Biggers wrote:
> On Wed, Mar 06, 2024 at 08:30:00AM -0800, Darrick J. Wong wrote:
> > Or you could leave the unfinished tree as-is; that will waste space, but
> > if userspace tries again, the xattr code will replace the old merkle
> > tree block contents with the new ones.  This assumes that we're not
> > using XATTR_CREATE during FS_IOC_ENABLE_VERITY.
> 
> This should work, though if the file was shrunk between the FS_IOC_ENABLE_VERITY
> that was interrupted and the one that completed, there may be extra Merkle tree
> blocks left over.

What if ->enable_begin walked the xattrs and trimmed out any verity
xattrs that were already there?  Though I think ->enable_end actually
could do this since one of the args is the tree size, right?

(Hmm that still wouldn't be any guarantee of success since those xattr
remove calls are each separate transactions...)

> BTW, is xfs_repair planned to do anything about any such extra blocks?

Sorry to answer your question with a question, but how much checking is
$filesystem expected to do for merkle trees?

In theory xfs_repair could learn how to interpret the verity descriptor,
walk the merkle tree blocks, and even read the file data to confirm
intactness.  If the descriptor specifies the highest block address then
we could certainly trim off excess blocks.  But I don't know how much of
libfsverity actually lets you do that; I haven't looked into that
deeply. :/

For xfs_scrub I guess the job is theoretically simpler, since we only
need to stream reads of the verity files through the page cache and let
verity tell us if the file data are consistent.

For both tools, if something finds errors in the merkle tree structure
itself, do we turn off verity?  Or do we do something nasty like
truncate the file?

Is there an ioctl or something that allows userspace to validate an
entire file's contents?  Sort of like what BLKVERIFY would have done for
block devices, except that we might believe its answers?

Also -- inconsistencies between the file data and the merkle tree aren't
something that xfs can self-heal, right?

--D

> - Eric
> 


* Re: [PATCH v5 07/24] fsverity: support block-based Merkle tree caching
  2024-03-07 22:49       ` Eric Biggers
@ 2024-03-08  3:50         ` Darrick J. Wong
  2024-03-09 16:24           ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-08  3:50 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 02:49:03PM -0800, Eric Biggers wrote:
> On Thu, Mar 07, 2024 at 01:54:01PM -0800, Darrick J. Wong wrote:
> > On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> > > On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > > > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > > > index b3506f56e180..dad33e6ff0d6 100644
> > > > --- a/fs/verity/fsverity_private.h
> > > > +++ b/fs/verity/fsverity_private.h
> > > > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> > > >  
> > > >  void __init fsverity_init_workqueue(void);
> > > >  
> > > > +/*
> > > > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > > > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > > > + * block->context.
> > > > + */
> > > > +void fsverity_drop_block(struct inode *inode,
> > > > +			 struct fsverity_blockbuf *block);
> > > > +
> > > >  #endif /* _FSVERITY_PRIVATE_H */
> > > 
> > > This should be paired with a helper function that reads a Merkle tree block by
> > > calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> > > being consistent with having a helper function for drop, this would prevent code
> > > duplication between verify_data_block() and fsverity_read_merkle_tree().
> > > 
> > > I recommend that it look like this:
> > > 
> > > int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> > > 				    unsigned long ra_bytes,
> > > 				    struct fsverity_blockbuf *block);
> > > 
> > > 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> > > would be the number of bytes for the filesystem to (optionally) readahead if the
> > > block is not yet cached.  I think that things work out simpler if these values
> > > are measured in bytes, not blocks.  'block' would be at the end because it's an
> > > output, and it can be confusing to interleave inputs and outputs in parameters.
> > 
> > FWIW I don't really like 'pos' here because that's usually short for
> > "file position", which is a byte, and this looks a lot more like a
> > merkle tree block number.
> > 
> > u64 blkno?
> > 
> > Or better yet use a typedef ("merkle_blkno_t") to make it really clear
> > when we're dealing with a tree block number.  Ignore checkpatch
> > complaining about typedefs. :)
> 
> My suggestion is for 'pos' to be a byte position, in alignment with the
> pagecache naming convention as well as ->write_merkle_tree_block which currently
> uses a 'u64 pos' byte position too.
> 
> It would also work for it to be a block index, in which case it should be named
> 'index' or 'blkno' and have type unsigned long.  I *think* that things work out
> a bit cleaner if it's a byte position, but I could be convinced that it's
> actually better for it to be a block index.

It's probably cleaner (or at least willy says so) to measure everything
in bytes.  The only place that gets messy is if we want to cache merkle
tree blocks in an xarray or something, in which case we'd want to >> by
the block_shift to avoid leaving gaps.

> > > How about changing the prototype to:
> > > 
> > > void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
> > >
> > > Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> > > can it happen in cases like filesystem corruption?  If it can only happen due to
> > > kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.
> > 
> > I think XFS only passes to _invalidate_* the same pos that was passed to
> > ->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
> > problem.
> > 
> > Perhaps this function ought to note that @pos is supposed to be the same
> > value that was given to ->read_merkle_tree_block?
> > 
> > Or: make the implementations return 1 for "reloaded from disk", 0 for
> > "still in cache", or a negative error code.  Then fsverity can call
> > the invalidation routine itself and XFS doesn't have to worry about this
> > part.
> > 
> > (I think?  I have questions about the xfs_invalidate_blocks function.)
> 
> It looks like XFS can invalidate blocks other than the one being read by
> ->read_merkle_tree_block.
> 
> If it really was only a matter of the single block being read, then it would
> indeed be simpler to just make it a piece of information returned from
> ->read_merkle_tree_block.
> 
> If the generic invalidation function is needed, it needs to be clearly
> documented when filesystems are expected to invalidate blocks.

I /think/ the generic invalidation is only necessary with the weird
DOUBLE_ALLOC thing.  If the other two implementations (ext4/f2fs)
haven't needed a "nuke from orbit" function then xfs shouldn't require
one too.

> > > > +
> > > > +	/**
> > > > +	 * Release the reference to a Merkle tree block
> > > > +	 *
> > > > +	 * @block: the block to release
> > > > +	 *
> > > > +	 * This is called when fs-verity is done with a block obtained with
> > > > +	 * ->read_merkle_tree_block().
> > > > +	 */
> > > > +	void (*drop_block)(struct fsverity_blockbuf *block);
> > > 
> > > drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block
> > 
> > Yep.  I noticed that xfs_verity.c doesn't put them together, which made
> > me wonder if the write_merkle_tree_block path made use of that.  It
> > doesn't, AFAICT.
> > 
> > And I think the reason is that when we're setting up the merkle tree,
> > we want to stream the contents straight to disk instead of ending up
> > with a huge cache that might not all be necessary?
> 
> In the current patchset, fsverity_blockbuf isn't used for writes (despite the
> comment saying it is).  I think that's fine and that it keeps things simpler.
> Filesystems can still cache the blocks that are passed to
> ->write_merkle_tree_block if they want to.  That already happens on ext4 and
> f2fs; FS_IOC_ENABLE_VERITY results in the Merkle tree being in the pagecache.

Oh!  Maybe it should do that then.  At least the root block?

--D

> - Eric
> 


* Re: [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing
  2024-03-07 22:26       ` Eric Biggers
@ 2024-03-08  3:53         ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-08  3:53 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 02:26:22PM -0800, Eric Biggers wrote:
> On Thu, Mar 07, 2024 at 01:58:57PM -0800, Darrick J. Wong wrote:
> > > Maybe s_verity_wq?  Or do you anticipate this being used for other purposes too,
> > > such as fscrypt?  Note that it's behind CONFIG_FS_VERITY and is allocated by an
> > > fsverity_* function, so at least at the moment it doesn't feel very generic.
> > 
> > Doesn't fscrypt already create its own workqueues?
> 
> There's a global workqueue in fs/crypto/ too.  IIRC, it has to be separate from
> the fs/verity/ workqueue to avoid deadlocks when a file has both encryption and
> verity enabled.  This is because the verity step can involve reading and
> decrypting Merkle tree blocks.
> 
> The main thing I was wondering was whether there was some plan to use this new
> per-sb workqueue as a generic post-read processing queue, as seemed to be
> implied by the chosen naming.  If there's no clear plan, it should instead just
> be named after what it's actually used for, fsverity.

My guess is that there's a lot less complexity if each subsystem (crypt,
compression, verity, etc) gets its own workqueues instead of sharing so
that there aren't starvation issues.  It's too bad that's sort of
wasteful, when what we really want is to submit a chain of dependent
work_structs and have the workqueue(s) run that chain on the same cpu if
possible.

--D

> - Eric
> 


* Re: [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path
  2024-03-08  3:16           ` Eric Biggers
@ 2024-03-08  3:57             ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-08  3:57 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Dave Chinner, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, Christoph Hellwig

On Thu, Mar 07, 2024 at 07:16:29PM -0800, Eric Biggers wrote:
> On Fri, Mar 08, 2024 at 12:20:31PM +1100, Dave Chinner wrote:
> > On Thu, Mar 07, 2024 at 03:59:07PM -0800, Eric Biggers wrote:
> > > On Fri, Mar 08, 2024 at 10:38:06AM +1100, Dave Chinner wrote:
> > > > > This makes all kernels with CONFIG_FS_VERITY enabled start preallocating memory
> > > > > for these bios, regardless of whether they end up being used or not.  When
> > > > > PAGE_SIZE==4096 it comes out to about 134 KiB of memory total (32 bios at just
> > > > > over 4 KiB per bio, most of which is used for the BIO_MAX_VECS bvecs), and it
> > > > > scales up with PAGE_SIZE such that with PAGE_SIZE==65536 it's about 2144 KiB.
> > > > 
> > > > Honestly: I don't think we care about this.
> > > > 
> > > > Indeed, if a system is configured with iomap and does not use XFS,
> > > > GFS2 or zonefs, it's not going to be using the iomap_ioend_bioset at
> > > > all, either. So by you definition that's just wasted memory, too, on
> > > > systems that don't use any of these three filesystems. But we
> > > > aren't going to make that one conditional, because the complexity
> > > > and overhead of checks that never trigger after the first IO doesn't
> > > > actually provide any return for the cost of ongoing maintenance.
> > > > 
> > > > Similarly, once XFS has fsverity enabled, it's going to get used all
> > > > over the place in the container and VM world. So we are *always*
> > > > going to want this bioset to be initialised on these production
> > > > systems, so it falls into the same category as the
> > > > iomap_ioend_bioset. That is, if you don't want that overhead, turn
> > > > the functionality off via CONFIG file options.
> > > 
> > > "We're already wasting memory, therefore it's fine to waste more" isn't a great
> > > argument.
> > 
> > Adding complexity just because -you- think the memory is wasted
> > isn't any better argument. I don't think the memory is wasted, and I
> > expect that fsverity will end up in -very- wide use across
> > enterprise production systems as container/vm image build
> > infrastructure moves towards composefs-like algorithms that have a
> > hard dependency on fsverity functionality being present in the host
> > filesystems.
> > 
> > > iomap_ioend_bioset is indeed also problematic, though it's also a bit different
> > > in that it's needed for basic filesystem functionality, not an optional feature.
> > > Yes, ext4 and f2fs don't actually use iomap for buffered reads, but IIUC at
> > > least there's a plan to do that.  Meanwhile there's no plan for fsverity to be
> > > used on every system that may be using a filesystem that supports it.
> > 
> > That's not the case I see coming for XFS - by the time we get this
> > merged and out of experimental support, the distro and application
> > support will already be there for endemic use of fsverity on XFS.
> > That's all being prototyped on ext4+fsverity right now, and you
> > probably don't see any of that happening.
> > 
> > > We should take care not to over-estimate how many users use optional features.
> > > Linux distros turn on most kconfig options just in case, but often the common
> > > case is that the option is on but the feature is not used.
> > 
> > I don't think that fsverity is going to be an optional feature in
> > future distros - we're already building core infrastructure based on
> > the data integrity guarantees that fsverity provides us with. IOWs,
> > I think fsverity support will soon become required core OS
> > functionality by more than one distro, and so we just don't care
> > about this overhead because the extra read IO bioset will always be
> > necessary....
> 
> That's great that you're finding fsverity to be useful!
> 
> At the same time, please do keep in mind that there's a huge variety of Linux
> systems besides the "enterprise" systems using XFS that you may have in mind.
> 
> The memory allocated for iomap_fsverity_bioset, as proposed, is indeed wasted on
> any system that isn't using both XFS *and* fsverity (as long as
> CONFIG_FS_VERITY=y, and either CONFIG_EXT4_FS, CONFIG_F2FS_FS, or
> CONFIG_BTRFS_FS is enabled too).  The amount is also quadrupled on systems that
> use a 16K page size, which will become increasingly common in the future.
> 
> It may be that part of your intention for pushing back against this optimization
> is to encourage filesystems to convert their buffered reads to iomap.  I'm not
> sure it will actually have that effect.  First, it's not clear that it will be
> technically feasible for the filesystems that support compression, btrfs and
> f2fs, to convert their buffered reads to iomap.

I do hope more of that happens some day, even if xfs is too obsolete to
gain those features itself.

>                                                  Second, this sort of thing
> reinforces an impression that iomap is targeted at enterprise systems using XFS,
> and that changes that would be helpful on other types of systems are rejected.

I wouldn't be opposed to iomap_prepare_{filemap,verity}() functions that
client modules could call from their init methods to ensure that
static(ish) resources like the biosets have been activated.  You'd still
waste space in the bss section, but that would be a bit more svelte.

(Says me, who <cough> might some day end up with 64k page kernels
again.)

--D

> - Eric
> 


* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-08  3:46         ` Darrick J. Wong
@ 2024-03-08  4:40           ` Eric Biggers
  2024-03-11 22:38             ` Darrick J. Wong
  2024-03-08 21:34           ` Dave Chinner
  1 sibling, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-08  4:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > BTW, is xfs_repair planned to do anything about any such extra blocks?
> 
> Sorry to answer your question with a question, but how much checking is
> $filesystem expected to do for merkle trees?
> 
> In theory xfs_repair could learn how to interpret the verity descriptor,
> walk the merkle tree blocks, and even read the file data to confirm
> intactness.  If the descriptor specifies the highest block address then
> we could certainly trim off excess blocks.  But I don't know how much of
> libfsverity actually lets you do that; I haven't looked into that
> deeply. :/
> 
> For xfs_scrub I guess the job is theoretically simpler, since we only
> need to stream reads of the verity files through the page cache and let
> verity tell us if the file data are consistent.
> 
> For both tools, if something finds errors in the merkle tree structure
> itself, do we turn off verity?  Or do we do something nasty like
> truncate the file?

As far as I know (I haven't been following btrfs-progs, but I'm familiar with
e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
validating the data of verity inodes against their Merkle trees.

e2fsck does delete the verity metadata of inodes that don't have the verity flag
enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.

I suppose that ideally, if an inode's verity metadata is invalid, then fsck
should delete that inode's verity metadata and remove the verity flag from the
inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
fairly straightforward, but it probably wouldn't catch much compared to actually
validating the data against the Merkle tree.  And actually validating the data
against the Merkle tree would be complex and expensive.  Note, none of this
would work on files that are encrypted.

Re: libfsverity, I think it would be possible to validate a Merkle tree using
libfsverity_compute_digest() and the callbacks that it supports.  But that's not
quite what it was designed for.

> Is there an ioctl or something that allows userspace to validate an
> entire file's contents?  Sort of like what BLKVERIFY would have done for
> block devices, except that we might believe its answers?

Just reading the whole file and seeing whether you get an error would do it.

Though if you want to make sure it's really re-reading the on-disk data, it's
necessary to drop the file's pagecache first.

> Also -- inconsistencies between the file data and the merkle tree aren't
> something that xfs can self-heal, right?

Similar to file data itself, only way to self-heal would be via mechanisms that
provide redundancy.  There's been some interest in adding support for forward
correction (FEC) to fsverity similar to what dm-verity has, but this would be
complex, and it's not something that anyone has gotten around to yet.

- Eric


* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-08  3:46         ` Darrick J. Wong
  2024-03-08  4:40           ` Eric Biggers
@ 2024-03-08 21:34           ` Dave Chinner
  2024-03-09 16:19             ` Darrick J. Wong
  1 sibling, 1 reply; 94+ messages in thread
From: Dave Chinner @ 2024-03-08 21:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Eric Biggers, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> On Thu, Mar 07, 2024 at 02:02:24PM -0800, Eric Biggers wrote:
> > On Wed, Mar 06, 2024 at 08:30:00AM -0800, Darrick J. Wong wrote:
> > > Or you could leave the unfinished tree as-is; that will waste space, but
> > > if userspace tries again, the xattr code will replace the old merkle
> > > tree block contents with the new ones.  This assumes that we're not
> > > using XATTR_CREATE during FS_IOC_ENABLE_VERITY.
> > 
> > This should work, though if the file was shrunk between the FS_IOC_ENABLE_VERITY
> > that was interrupted and the one that completed, there may be extra Merkle tree
> > blocks left over.
> 
> What if ->enable_begin walked the xattrs and trimmed out any verity
> xattrs that were already there?  Though I think ->enable_end actually
> could do this since one of the args is the tree size, right?

If we are overwriting xattrs, it's effectively a remove then a new
create operation, so we may as well just add a XFS_ATTR_VERITY
namespace invalidation filter that removes any xattr in that
namespace in ->enable_begin...

> > BTW, is xfs_repair planned to do anything about any such extra blocks?
> 
> Sorry to answer your question with a question, but how much checking is
> $filesystem expected to do for merkle trees?
> 
> In theory xfs_repair could learn how to interpret the verity descriptor,
> walk the merkle tree blocks, and even read the file data to confirm
> intactness.  If the descriptor specifies the highest block address then
> we could certainly trim off excess blocks.  But I don't know how much of
> libfsverity actually lets you do that; I haven't looked into that
> deeply. :/

Perhaps a generic fsverity userspace checking library we can link
into fs utilities like e2fsck and xfs_repair is the way to go here.
That way any filesystem that supports fsverity can, if desired, do
offline validation of the merkle tree after checking that the
metadata is OK.

> For xfs_scrub I guess the job is theoretically simpler, since we only
> need to stream reads of the verity files through the page cache and let
> verity tell us if the file data are consistent.

*nod*

> For both tools, if something finds errors in the merkle tree structure
> itself, do we turn off verity?  Or do we do something nasty like
> truncate the file?

Mark it as "data corrupt" in terms of generic XFS health status, and
leave it up to the user to repair the data and/or recalc the merkle
tree, depending on what they find when they look at the corrupt file
status.

> Is there an ioctl or something that allows userspace to validate an
> entire file's contents?  Sort of like what BLKVERIFY would have done for
> block devices, except that we might believe its answers?
> 
> Also -- inconsistencies between the file data and the merkle tree aren't
> something that xfs can self-heal, right?

Not that I know of - the file data has to be validated before we can
tell if the error is in the data or the merkle tree, and only the
user can validate the data is correct.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-08 21:34           ` Dave Chinner
@ 2024-03-09 16:19             ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-09 16:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Eric Biggers, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu

On Sat, Mar 09, 2024 at 08:34:51AM +1100, Dave Chinner wrote:
> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > On Thu, Mar 07, 2024 at 02:02:24PM -0800, Eric Biggers wrote:
> > > On Wed, Mar 06, 2024 at 08:30:00AM -0800, Darrick J. Wong wrote:
> > > > Or you could leave the unfinished tree as-is; that will waste space, but
> > > > if userspace tries again, the xattr code will replace the old merkle
> > > > tree block contents with the new ones.  This assumes that we're not
> > > > using XATTR_CREATE during FS_IOC_ENABLE_VERITY.
> > > 
> > > This should work, though if the file was shrunk between the FS_IOC_ENABLE_VERITY
> > > that was interrupted and the one that completed, there may be extra Merkle tree
> > > blocks left over.
> > 
> > What if ->enable_begin walked the xattrs and trimmed out any verity
> > xattrs that were already there?  Though I think ->enable_end actually
> > could do this since one of the args is the tree size, right?
> 
> If we are overwriting xattrs, it's effectively a remove then a new
> create operation, so we may as well just add a XFS_ATTR_VERITY
> namespace invalidation filter that removes any xattr in that
> namespace in ->enable_begin...

Yeah, that sounds like a good idea.  One nice aspect of the generic
listxattr code (aka not the simplified one that scrub uses) is that the
cursor tracking means that we could actually iterate-and-zap old merkle
tree blocks.

If we know the size of the merkle tree ahead of time (say it's N blocks)
then we just start zapping N, then N+1, etc. until we don't find any
more.  That wouldn't be exhaustive, but it's good enough to catch most
cases.
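The iterate-and-zap walk can be modelled outside the kernel. In this toy simulation (all names here are hypothetical; a plain array stands in for the xattr namespace), zapping starts at the first block index past the new tree and stops at the first lookup miss, which is exactly why a gap in the leftovers ends the cleanup early:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BLOCKS 64

/* Toy stand-in for the per-file merkle-block xattr namespace. */
static bool stored[MAX_BLOCKS];

/* Remove one block; false means the lookup missed. */
static bool zap_block(unsigned long blkno)
{
	if (blkno >= MAX_BLOCKS || !stored[blkno])
		return false;
	stored[blkno] = false;
	return true;
}

/*
 * Zap leftovers from an older, larger tree: try tree_size,
 * tree_size + 1, ... until the first miss.  Not exhaustive, as
 * noted above, but catches the common shrink case.
 */
static unsigned long zap_stale_tail(unsigned long tree_size)
{
	unsigned long blkno = tree_size, zapped = 0;

	while (zap_block(blkno++))
		zapped++;
	return zapped;
}
```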

Online fsck should, however, have a way to call ensure_verity_info() so
that it can scan the xattrs looking for merkle tree blocks beyond
tree_size, missing merkle tree blocks within tree_size, missing
descriptors, etc.  It looks like the merkle tree block contents are
entirely hashes (no sibling/child/parent pointers, block headers, etc.)
so there's not a lot to check in the tree structure.  It looks pretty
similar to flattening a heap into a linear array.
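The "flattened heap" observation can be made concrete: in an implicit-array tree, parent/child relationships are pure index arithmetic, so there are no stored pointers for a checker to validate. This is a generic illustration with an arity of hashes-per-block, not fsverity's exact on-disk level layout (which is derived from its internal tree parameters):

```c
#include <assert.h>

/*
 * Implicit ("flattened") tree arithmetic: with `arity` hashes per
 * block, block i's children and parent follow from the index alone.
 * No sibling/child/parent pointers exist on disk, which is why the
 * only structural checks possible are size and coverage checks.
 */
static unsigned long first_child(unsigned long i, unsigned long arity)
{
	return i * arity + 1;
}

static unsigned long parent_of(unsigned long i, unsigned long arity)
{
	return (i - 1) / arity;
}
```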

> > > BTW, is xfs_repair planned to do anything about any such extra blocks?
> > 
> > Sorry to answer your question with a question, but how much checking is
> > $filesystem expected to do for merkle trees?
> > 
> > In theory xfs_repair could learn how to interpret the verity descriptor,
> > walk the merkle tree blocks, and even read the file data to confirm
> > intactness.  If the descriptor specifies the highest block address then
> > we could certainly trim off excess blocks.  But I don't know how much of
> > libfsverity actually lets you do that; I haven't looked into that
> > deeply. :/
> 
> Perhaps a generic fsverity userspace checking library we can link
> into fs utilities like e2fsck and xfs_repair is the way to go here.
> That way any filesystem that supports fsverity can, if desired, do
> offline validation of the merkle tree after checking that the
> metadata is OK.

That'd be nice.  Does the above checking sound reasonable? :)

> > For xfs_scrub I guess the job is theoretically simpler, since we only
> > need to stream reads of the verity files through the page cache and let
> > verity tell us if the file data are consistent.
> 
> *nod*

I had another thought overnight -- regular read()s incur the cost of
copying pagecache contents to userspace.  Do we really care about that,
though?  In theory we could mmap verity file contents and then use
MADV_POPULATE_READ to pull in the page cache and return error codes.  No
copying, and fewer syscalls.
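The MADV_POPULATE_READ idea looks roughly like this from userspace. A hedged sketch, not scrub code: the advice exists since Linux 5.14, faults the whole mapping into the pagecache in one call, and on a verity file a Merkle tree mismatch would surface as an madvise() failure rather than as copied-out bad data.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22	/* Linux 5.14+; define for older headers */
#endif

/*
 * Pull a file's contents through the pagecache without copying them
 * to a user buffer.  Returns 0 on success, 1 if the running kernel
 * doesn't recognize the advice, -1 on other errors.
 */
static int populate_file(const char *path)
{
	struct stat st;
	void *p;
	int ret = -1;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || st.st_size == 0)
		goto out_close;

	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto out_close;

	if (madvise(p, st.st_size, MADV_POPULATE_READ) == 0)
		ret = 0;		/* all pages faulted in, no copies */
	else if (errno == EINVAL)
		ret = 1;		/* pre-5.14 kernel: advice unknown */

	munmap(p, st.st_size);
out_close:
	close(fd);
	return ret;
}
```

One madvise() call replaces a whole read() loop, which is the "fewer syscalls" part; the copy elimination comes from never moving pagecache contents into a user buffer.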

> > For both tools, if something finds errors in the merkle tree structure
> > itself, do we turn off verity?  Or do we do something nasty like
> > truncate the file?
> 
> Mark it as "data corrupt" in terms of generic XFS health status, and
> leave it up to the user to repair the data and/or recalc the merkle
> tree, depending on what they find when they look at the corrupt file
> status.

Is there a way to forcibly read the file contents even if it fails
verity validation?  I was assuming the only recourse in that case is to
delete the file and restore from backup/package manager/etc.

> > Is there an ioctl or something that allows userspace to validate an
> > entire file's contents?  Sort of like what BLKVERIFY would have done for
> > block devices, except that we might believe its answers?
> > 
> > Also -- inconsistencies between the file data and the merkle tree aren't
> > something that xfs can self-heal, right?
> 
> Not that I know of - the file data has to be validated before we can
> tell if the error is in the data or the merkle tree, and only the
> user can validate the data is correct.

<nod>

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


* Re: [PATCH v5 07/24] fsverity: support block-based Merkle tree caching
  2024-03-08  3:50         ` Darrick J. Wong
@ 2024-03-09 16:24           ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-09 16:24 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel, chandan.babu

On Thu, Mar 07, 2024 at 07:50:36PM -0800, Darrick J. Wong wrote:
> On Thu, Mar 07, 2024 at 02:49:03PM -0800, Eric Biggers wrote:
> > On Thu, Mar 07, 2024 at 01:54:01PM -0800, Darrick J. Wong wrote:
> > > On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> > > > On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > > > > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > > > > index b3506f56e180..dad33e6ff0d6 100644
> > > > > --- a/fs/verity/fsverity_private.h
> > > > > +++ b/fs/verity/fsverity_private.h
> > > > > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> > > > >  
> > > > >  void __init fsverity_init_workqueue(void);
> > > > >  
> > > > > +/*
> > > > > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > > > > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > > > > + * block->context.
> > > > > + */
> > > > > +void fsverity_drop_block(struct inode *inode,
> > > > > +			 struct fsverity_blockbuf *block);
> > > > > +
> > > > >  #endif /* _FSVERITY_PRIVATE_H */
> > > > 
> > > > This should be paired with a helper function that reads a Merkle tree block by
> > > > calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> > > > being consistent with having a helper function for drop, this would prevent code
> > > > duplication between verify_data_block() and fsverity_read_merkle_tree().
> > > > 
> > > > I recommend that it look like this:
> > > > 
> > > > int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> > > > 				    unsigned long ra_bytes,
> > > > 				    struct fsverity_blockbuf *block);
> > > > 
> > > > 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> > > > would be the number of bytes for the filesystem to (optionally) readahead if the
> > > > block is not yet cached.  I think that things work out simpler if these values
> > > > are measured in bytes, not blocks.  'block' would be at the end because it's an
> > > > output, and it can be confusing to interleave inputs and outputs in parameters.
> > > 
> > > FWIW I don't really like 'pos' here because that's usually short for
> > > "file position", which is a byte, and this looks a lot more like a
> > > merkle tree block number.
> > > 
> > > u64 blkno?
> > > 
> > > Or better yet use a typedef ("merkle_blkno_t") to make it really clear
> > > when we're dealing with a tree block number.  Ignore checkpatch
> > > complaining about typedefs. :)
> > 
> > My suggestion is for 'pos' to be a byte position, in alignment with the
> > pagecache naming convention as well as ->write_merkle_tree_block which currently
> > uses a 'u64 pos' byte position too.
> > 
> > It would also work for it to be a block index, in which case it should be named
> > 'index' or 'blkno' and have type unsigned long.  I *think* that things work out
> > a bit cleaner if it's a byte position, but I could be convinced that it's
> > actually better for it to be a block index.
> 
> It's probably cleaner (or at least willy says so) to measure everything
> in bytes.  The only place that gets messy is if we want to cache merkle
> tree blocks in an xarray or something, in which case we'd want to >> by
> the block_shift to avoid leaving gaps.

...and now having just done that, yes, I would like to retain passing
log_blocksize to the ->read_merkle_tree_block function.  I cleaned it up
a bit for my own purposes:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fsverity-cleanups-6.9&id=4d36c90be4ac143f3e31989317bde7cd6c109c6e

Though I saw you had other feedback to Andrey about that. :)

--D

> > > > How about changing the prototype to:
> > > > 
> > > > void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
> > > >
> > > > Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> > > > can it happen in cases like filesystem corruption?  If it can only happen due to
> > > > kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.
> > > 
> > > I think XFS only passes to _invalidate_* the same pos that was passed to
> > > ->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
> > > problem.
> > > 
> > > Perhaps this function ought to note that @pos is supposed to be the same
> > > value that was given to ->read_merkle_tree_block?
> > > 
> > > Or: make the implementations return 1 for "reloaded from disk", 0 for
> > > "still in cache", or a negative error code.  Then fsverity can call
> > > the invalidation routine itself and XFS doesn't have to worry about this
> > > part.
> > > 
> > > (I think?  I have questions about the xfs_invalidate_blocks function.)
> > 
> > It looks like XFS can invalidate blocks other than the one being read by
> > ->read_merkle_tree_block.
> > 
> > If it really was only a matter of the single block being read, then it would
> > indeed be simpler to just make it a piece of information returned from
> > ->read_merkle_tree_block.
> > 
> > If the generic invalidation function is needed, it needs to be clearly
> > documented when filesystems are expected to invalidate blocks.
> 
> I /think/ the generic invalidation is only necessary with the weird
> DOUBLE_ALLOC thing.  If the other two implementations (ext4/f2fs)
> haven't needed a "nuke from orbit" function then xfs shouldn't require
> one too.
> 
> > > > > +
> > > > > +	/**
> > > > > +	 * Release the reference to a Merkle tree block
> > > > > +	 *
> > > > > +	 * @block: the block to release
> > > > > +	 *
> > > > > +	 * This is called when fs-verity is done with a block obtained with
> > > > > +	 * ->read_merkle_tree_block().
> > > > > +	 */
> > > > > +	void (*drop_block)(struct fsverity_blockbuf *block);
> > > > 
> > > > drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block
> > > 
> > > Yep.  I noticed that xfs_verity.c doesn't put them together, which made
> > > me wonder if the write_merkle_tree_block path made use of that.  It
> > > doesn't, AFAICT.
> > > 
> > > And I think the reason is that when we're setting up the merkle tree,
> > > we want to stream the contents straight to disk instead of ending up
> > > with a huge cache that might not all be necessary?
> > 
> > In the current patchset, fsverity_blockbuf isn't used for writes (despite the
> > comment saying it is).  I think that's fine and that it keeps things simpler.
> > Filesystems can still cache the blocks that are passed to
> > ->write_merkle_tree_block if they want to.  That already happens on ext4 and
> > f2fs; FS_IOC_ENABLE_VERITY results in the Merkle tree being in the pagecache.
> 
> Oh!  Maybe it should do that then.  At least the root block?
> 
> --D
> 
> > - Eric
> > 
> 


* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-08  3:31       ` Darrick J. Wong
@ 2024-03-09 16:28         ` Darrick J. Wong
  2024-03-11  0:26           ` Dave Chinner
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-09 16:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, ebiggers

On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> > On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> > > On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> > > > One of the essential ideas of fs-verity is that pages which are
> > > > already verified won't need to be re-verified if they are still in
> > > > the page cache.
> > > > 
> > > > XFS will store Merkle tree blocks in extended file attributes. When
> > > > read, the extended attribute data is put into an xfs_buf.
> > > > 
> > > > fs-verity uses the PG_checked flag to track the status of the blocks
> > > > in a page. This flag can have two meanings - the page was
> > > > re-instantiated, and the only block in the page is verified.
> > > > 
> > > > However, in XFS, the data in the buffer is not aligned with the
> > > > xfs_buf pages and we don't have a reference to these pages. Moreover,
> > > > these pages are released when the value is copied out in the xfs_attr
> > > > code. In other words, we cannot directly mark the underlying
> > > > xfs_buf's pages as verified as is done by fs-verity for other
> > > > filesystems.
> > > > 
> > > > One way to track that these pages were processed by fs-verity is to
> > > > mark the buffer as verified instead. If the buffer is evicted, the
> > > > incore XBF_VERITY_SEEN flag is lost. When the xattr is read again,
> > > > xfs_attr_get() returns a new buffer without the flag. The xfs_buf's
> > > > flag is then used to tell fs-verity whether this buffer was cached.
> > > > 
> > > > The second state indicated by PG_checked is whether the only block in
> > > > the PAGE is verified. This is not the case for XFS, as there can be
> > > > multiple blocks in a single buffer (page size 64k, block size 4k).
> > > > This is handled by the fs-verity bitmap: fs-verity always uses the
> > > > bitmap for XFS regardless of the Merkle tree block size.
> > > > 
> > > > The meaning of the flag is that the value of the extended attribute
> > > > in the buffer has been processed by fs-verity.
> > > > 
> > > > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_buf.h | 18 ++++++++++--------
> > > >  1 file changed, 10 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > > index 73249abca968..2a73918193ba 100644
> > > > --- a/fs/xfs/xfs_buf.h
> > > > +++ b/fs/xfs/xfs_buf.h
> > > > @@ -24,14 +24,15 @@ struct xfs_buf;
> > > >  
> > > >  #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
> > > >  
> > > > -#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> > > > -#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> > > > -#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> > > > -#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > > -#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> > > > -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> > > > -#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> > > > -#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> > > > +#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
> > > > +#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
> > > > +#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
> > > > +#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > > +#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
> > > > +#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
> > > > +#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
> > > > +#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
> > > > +#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */
> > > 
> > > Yuck.  I still dislike this entire approach.
> > > 
> > > XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that
> > > gets loaded on behalf of a merkle tree request, then uses the extra
> > > space to shadow the contents of the ondisk block.  AFAICT the shadow
> > > doesn't get updated even if the cached data does, which sounds like a
> > > landmine for coherency issues.
> > >
> > > XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of
> > > exposing the low level buffering details of the xattr code to
> > > xfs_attr_get callers.
> > > 
> > > XBF_VERITY_SEEN is a layering violation because now the overall buffer
> > > cache can track file metadata state.  I think the reason why you need
> > > this state flag is because the datadev buffer target indexes based on
> > > physical xfs_daddr_t, whereas merkle tree blocks have their own internal
> > > block numbers.  You can't directly go from the merkle block number to an
> > > xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block
> > > fell out of memory.
> > >
> > > ISTR asking for a separation of these indices when I reviewed some
> > > previous version of this patchset.  At the time it seems to me that a
> > > much more efficient way to cache the merkle tree blocks would be to set
> > > up a direct (merkle tree block number) -> (blob of data) lookup table.
> > > That I don't see here.
> > >
> > > In the spirit of the recent collaboration style that I've taken with
> > > Christoph, I pulled your branch and started appending patches to it to
> > > see if the design that I'm asking for is reasonable.  As it so happens,
> > > I was working on a simplified version of the xfs buffer cache ("fsbuf")
> > > that could be used by simple filesystems to get them off of buffer
> > > heads.
> > > 
> > > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> > > verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> > > 14, and 15 become unnecessary.  However, this solution is itself grossly
> > > overengineered, since all we want are the following operations:
> > > 
> > > peek(key): returns an fsbuf if there's any data cached for key
> > > 
> > > get(key): returns an fsbuf for key, regardless of state
> > > 
> > > store(fsbuf, p): attach a memory buffer p to fsbuf
> > > 
> > > Then the xfs ->read_merkle_tree_block function becomes:
> > > 
> > > 	bp = peek(key)
> > > 	if (bp)
> > > 		/* return bp data up to verity */
> > > 
> > > 	p = xfs_attr_get(key)
> > > 	if (!p)
> > > 		/* error */
> > > 
> > > 	bp = get(key)
> > > 	store(bp, p)
> > 
> > Ok, that looks good - it definitely gets rid of a lot of the
> > nastiness, but I have to ask: why does it need to be based on
> > xfs_bufs?
> 
> (copying from IRC) It was still warm in my brain L2 after all the xfile
> buftarg cleaning and merging that just got done a few weeks ago.   So I
> went with the simplest thing I could rig up to test my ideas, and now
> we're at the madly iterate until exhaustion stage. ;)
> 
> >            That's just wasting 300 bytes of memory on a handle to
> > store a key and a opaque blob in a rhashtable.
> 
> Yep.  The fsbufs implementation was a lot more slender, but a bunch more
> code.  I agree that I ought to go look at xarrays or something that's
> > more of a direct mapping as a next step.  However, I wanted to get
> Andrey's feedback on this general approach first.
> 
> > IIUC, the key here is a sequential index, so an xarray would be a
> > much better choice as it doesn't require internal storage of the
> > key.
> 
> I wonder, what are the access patterns for merkle blobs?  Is it actually
> sequential, or is it more like 0 -> N -> N*N as we walk towards leaves?
> 
> Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
> that done because merkle trees may some day have more than 2^32 blocks
> in them?  That won't play well with things like xarrays on 32-bit
> machines.
> 
> (Granted we've been talking about deprecating XFS on 32-bit for a while
> now but we're not the whole world)
> 
> > i.e.
> > 
> > 	p = xa_load(key);
> > 	if (p)
> > 		return p;
> > 
> > 	xfs_attr_get(key);
> > 	if (!args->value)
> > 		/* error */
> > 
> > 	/*
> > 	 * store the current value, freeing any old value that we
> > 	 * replaced at this key. Don't care about failure to store,
> > 	 * this is optimistic caching.
> > 	 */
> > 	p = xa_store(key, args->value, GFP_NOFS);
> > 	if (p)
> > 		kvfree(p);
> > 	return args->value;
> 
> Attractive.  Will have to take a look at that tomorrow.

Done.  I think.  Not sure that I actually got all the interactions
between the shrinker and the xarray correct though.  KASAN and lockdep
don't have any complaints running fstests, so that's a start.

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09

--D

> 
> --D
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 


* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-09 16:28         ` Darrick J. Wong
@ 2024-03-11  0:26           ` Dave Chinner
  2024-03-11 15:25             ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Dave Chinner @ 2024-03-11  0:26 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, ebiggers

On Sat, Mar 09, 2024 at 08:28:28AM -0800, Darrick J. Wong wrote:
> On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote:
> > On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> > > > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> > > > verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> > > > 14, and 15 become unnecessary.  However, this solution is itself grossly
> > > > overengineered, since all we want are the following operations:
> > > > 
> > > > peek(key): returns an fsbuf if there's any data cached for key
> > > > 
> > > > get(key): returns an fsbuf for key, regardless of state
> > > > 
> > > > store(fsbuf, p): attach a memory buffer p to fsbuf
> > > > 
> > > > Then the xfs ->read_merkle_tree_block function becomes:
> > > > 
> > > > 	bp = peek(key)
> > > > 	if (bp)
> > > > 		/* return bp data up to verity */
> > > > 
> > > > 	p = xfs_attr_get(key)
> > > > 	if (!p)
> > > > 		/* error */
> > > > 
> > > > 	bp = get(key)
> > > > 	store(bp, p)
> > > 
> > > Ok, that looks good - it definitely gets rid of a lot of the
> > > nastiness, but I have to ask: why does it need to be based on
> > > xfs_bufs?
> > 
> > (copying from IRC) It was still warm in my brain L2 after all the xfile
> > buftarg cleaning and merging that just got done a few weeks ago.   So I
> > went with the simplest thing I could rig up to test my ideas, and now
> > we're at the madly iterate until exhaustion stage. ;)
> > 
> > >            That's just wasting 300 bytes of memory on a handle to
> > > store a key and a opaque blob in a rhashtable.
> > 
> > Yep.  The fsbufs implementation was a lot more slender, but a bunch more
> > code.  I agree that I ought to go look at xarrays or something that's
> > more of a direct mapping as a next step.  However, I wanted to get
> > Andrey's feedback on this general approach first.
> > 
> > > IIUC, the key here is a sequential index, so an xarray would be a
> > > much better choice as it doesn't require internal storage of the
> > > key.
> > 
> > I wonder, what are the access patterns for merkle blobs?  Is it actually
> > sequential, or is it more like 0 -> N -> N*N as we walk towards leaves?

I think the leaf level (i.e. individual record) access patterns
largely match data access patterns, so I'd just treat it as if
it's a normal file being accessed....

> > Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
> > that done because merkle trees may some day have more than 2^32 blocks
> > in them?  That won't play well with things like xarrays on 32-bit
> > machines.
> > 
> > (Granted we've been talking about deprecating XFS on 32-bit for a while
> > now but we're not the whole world)
> > 
> > > i.e.
> > > 
> > > 	p = xa_load(key);
> > > 	if (p)
> > > 		return p;
> > > 
> > > 	xfs_attr_get(key);
> > > 	if (!args->value)
> > > 		/* error */
> > > 
> > > 	/*
> > > 	 * store the current value, freeing any old value that we
> > > 	 * replaced at this key. Don't care about failure to store,
> > > 	 * this is optimistic caching.
> > > 	 */
> > > 	p = xa_store(key, args->value, GFP_NOFS);
> > > 	if (p)
> > > 		kvfree(p);
> > > 	return args->value;
> > 
> > Attractive.  Will have to take a look at that tomorrow.
> 
> Done.  I think.  Not sure that I actually got all the interactions
> between the shrinker and the xarray correct though.  KASAN and lockdep
> don't have any complaints running fstests, so that's a start.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09

My initial impression is "over-engineered".

I personally would have just allocated the xattr value buffer with a
little extra size and added all the external cache information (a
reference counter is all we need as these are fixed-size blocks) to
the tail of the blob we actually pass to fsverity. If we tag the
inode in the radix tree as having verity blobs that can be freed, we
can then just extend the existing fs superblock shrinker callout to
also walk all the verity inodes with cached data to try to reclaim
some objects...
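The oversized-buffer idea - hide the cache bookkeeping in a tail behind the fixed-size block that fsverity actually sees - can be sketched like this. A hypothetical userspace layout sketch only, not the XFS implementation; every name here is made up:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bookkeeping tucked behind the merkle block data. */
struct blob_tail {
	unsigned int refcount;
};

/*
 * Allocate blocksize + tail in one shot.  The leading `blocksize`
 * bytes are what fsverity would be handed; the tail is invisible to
 * it and reachable only via pointer arithmetic from the blob start.
 */
static void *blob_alloc(size_t blocksize)
{
	void *p = calloc(1, blocksize + sizeof(struct blob_tail));

	if (p)
		((struct blob_tail *)((char *)p + blocksize))->refcount = 1;
	return p;
}

static struct blob_tail *blob_to_tail(void *blob, size_t blocksize)
{
	return (struct blob_tail *)((char *)blob + blocksize);
}
```

The appeal of this layout is that the consumer never learns the blob is oversized, so no extra handle structure (the "300 bytes" objection above) is needed per cached block.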

But, if a generic blob cache is what it takes to move this forwards,
so be it.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-11  0:26           ` Dave Chinner
@ 2024-03-11 15:25             ` Darrick J. Wong
  2024-03-12  2:43               ` Eric Biggers
  2024-03-12  2:45               ` Darrick J. Wong
  0 siblings, 2 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-11 15:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, ebiggers

On Mon, Mar 11, 2024 at 11:26:24AM +1100, Dave Chinner wrote:
> On Sat, Mar 09, 2024 at 08:28:28AM -0800, Darrick J. Wong wrote:
> > On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote:
> > > On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> > > > > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> > > > > verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> > > > > 14, and 15 become unnecessary.  However, this solution is itself grossly
> > > > > overengineered, since all we want are the following operations:
> > > > > 
> > > > > peek(key): returns an fsbuf if there's any data cached for key
> > > > > 
> > > > > get(key): returns an fsbuf for key, regardless of state
> > > > > 
> > > > > store(fsbuf, p): attach a memory buffer p to fsbuf
> > > > > 
> > > > > Then the xfs ->read_merkle_tree_block function becomes:
> > > > > 
> > > > > 	bp = peek(key)
> > > > > 	if (bp)
> > > > > 		/* return bp data up to verity */
> > > > > 
> > > > > 	p = xfs_attr_get(key)
> > > > > 	if (!p)
> > > > > 		/* error */
> > > > > 
> > > > > 	bp = get(key)
> > > > > 	store(bp, p)
> > > > 
> > > > Ok, that looks good - it definitely gets rid of a lot of the
> > > > nastiness, but I have to ask: why does it need to be based on
> > > > xfs_bufs?
> > > 
> > > (copying from IRC) It was still warm in my brain L2 after all the xfile
> > > buftarg cleaning and merging that just got done a few weeks ago.   So I
> > > went with the simplest thing I could rig up to test my ideas, and now
> > > we're at the madly iterate until exhaustion stage. ;)
> > > 
> > > >            That's just wasting 300 bytes of memory on a handle to
> > > > store a key and a opaque blob in a rhashtable.
> > > 
> > > Yep.  The fsbufs implementation was a lot more slender, but a bunch more
> > > code.  I agree that I ought to go look at xarrays or something that's
> > > more of a direct mapping as a next step.  However, I wanted to get
> > > Andrey's feedback on this general approach first.
> > > 
> > > > IIUC, the key here is a sequential index, so an xarray would be a
> > > > much better choice as it doesn't require internal storage of the
> > > > key.
> > > 
> > > I wonder, what are the access patterns for merkle blobs?  Is it actually
> > > sequential, or is it more like 0 -> N -> N*N as we walk towards leaves?
> 
> I think the leaf level (i.e. individual record) access patterns
> largely match data access patterns, so I'd just treat it as if
> it's a normal file being accessed....

<nod> The latest version of this tries to avoid letting reclaim take the
top of the tree.  Logically this makes sense to me to reduce read verify
latency, but I was hoping Eric or Andrey or someone with more
familiarity with fsverity would chime in on whether or not that made
sense.

> > > Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
> > > that done because merkle trees may some day have more than 2^32 blocks
> > > in them?  That won't play well with things like xarrays on 32-bit
> > > machines.
> > > 
> > > (Granted we've been talking about deprecating XFS on 32-bit for a while
> > > now but we're not the whole world)
> > > 
> > > > i.e.
> > > > 
> > > > 	p = xa_load(key);
> > > > 	if (p)
> > > > 		return p;
> > > > 
> > > > 	xfs_attr_get(key);
> > > > 	if (!args->value)
> > > > 		/* error */
> > > > 
> > > > 	/*
> > > > 	 * store the current value, freeing any old value that we
> > > > 	 * replaced at this key. Don't care about failure to store,
> > > > 	 * this is optimistic caching.
> > > > 	 */
> > > > 	p = xa_store(key, args->value, GFP_NOFS);
> > > > 	if (p)
> > > > 		kvfree(p);
> > > > 	return args->value;
> > > 
> > > Attractive.  Will have to take a look at that tomorrow.
> > 
> > Done.  I think.  Not sure that I actually got all the interactions
> > between the shrinker and the xarray correct though.  KASAN and lockdep
> > don't have any complaints running fstests, so that's a start.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09
> 
> My initial impression is "over-engineered".

Overly focused on unfamiliar data structures -- I've never built
anything with an xarray before.

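(Dave's snippet above is essentially a lookup-fill-publish cache.  For
anyone following along, here's the same pattern as self-contained
userspace C, with a trivial slot array standing in for the xarray --
all names are illustrative, none of this is kernel API:)

```c
#include <stdlib.h>
#include <string.h>

#define NSLOTS 64	/* toy index; the real thing would be an xarray */

static void *slots[NSLOTS];

/* Stand-ins for xa_load()/xa_store(); key collisions ignored in this toy. */
static void *cache_load(unsigned long key)
{
	return slots[key % NSLOTS];
}

static void *cache_store(unsigned long key, void *val)
{
	void *old = slots[key % NSLOTS];

	slots[key % NSLOTS] = val;
	return old;		/* caller frees any displaced value */
}

/*
 * Optimistic caching: return the cached blob if present; otherwise
 * "read" it (malloc stands in for xfs_attr_get()) and publish it,
 * freeing whatever a concurrent reader may have stored before us.
 */
static void *get_merkle_blob(unsigned long key, size_t size)
{
	void *p = cache_load(key);
	void *old;

	if (p)
		return p;

	p = malloc(size);
	if (!p)
		return NULL;
	memset(p, 0, size);	/* pretend this is the xattr value */

	old = cache_store(key, p);
	free(old);		/* free(NULL) is a no-op */
	return p;
}
```

The point of returning the displaced value from the store is exactly
Dave's "don't care about failure to store" semantics: losing the race
just costs one free.
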
> I personally would have just allocated the xattr value buffer with a
> little extra size and added all the external cache information (a
> reference counter is all we need as these are fixed sized blocks) to
> the tail of the blob we actually pass to fsverity.

Oho, that's a good idea.  I didn't like the separate allocation anyway.
Friday was mostly chatting with willy and trying to make sure I got the
xarray access patterns correct.

>                                                    If we tag the
> inode in the radix tree as having verity blobs that can be freed, we
> can then just extend the existing fs superblock shrinker callout to
> also walk all the verity inodes with cached data to try to reclaim
> some objects...

This too is a wonderful suggestion -- use the third radix tree tag to
mark inodes with extra incore caches that can be reclaimed, then teach
xfs_reclaim_inodes_{nr,count} to scan them.  Allocating a per-inode
shrinker was likely to cause problems with flooding debugfs with too
many knobs anyway.

> But, if a generic blob cache is what it takes to move this forwards,
> so be it.

Not necessarily. ;)

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-08  4:40           ` Eric Biggers
@ 2024-03-11 22:38             ` Darrick J. Wong
  2024-03-12 15:13               ` David Hildenbrand
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-11 22:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, akpm, linux-mm, Eric Biggers

[add willy and linux-mm]

On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > > BTW, is xfs_repair planned to do anything about any such extra blocks?
> > 
> > Sorry to answer your question with a question, but how much checking is
> > $filesystem expected to do for merkle trees?
> > 
> > In theory xfs_repair could learn how to interpret the verity descriptor,
> > walk the merkle tree blocks, and even read the file data to confirm
> > intactness.  If the descriptor specifies the highest block address then
> > we could certainly trim off excess blocks.  But I don't know how much of
> > libfsverity actually lets you do that; I haven't looked into that
> > deeply. :/
> > 
> > For xfs_scrub I guess the job is theoretically simpler, since we only
> > need to stream reads of the verity files through the page cache and let
> > verity tell us if the file data are consistent.
> > 
> > For both tools, if something finds errors in the merkle tree structure
> > itself, do we turn off verity?  Or do we do something nasty like
> > truncate the file?
> 
> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
> validating the data of verity inodes against their Merkle trees.
> 
> e2fsck does delete the verity metadata of inodes that don't have the verity flag
> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
> 
> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
> should delete that inode's verity metadata and remove the verity flag from the
> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
> fairly straightforward, but it probably wouldn't catch much compared to actually
> validating the data against the Merkle tree.  And actually validating the data
> against the Merkle tree would be complex and expensive.  Note, none of this
> would work on files that are encrypted.
> 
> Re: libfsverity, I think it would be possible to validate a Merkle tree using
> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
> quite what it was designed for.
> 
> > Is there an ioctl or something that allows userspace to validate an
> > entire file's contents?  Sort of like what BLKVERIFY would have done for
> > block devices, except that we might believe its answers?
> 
> Just reading the whole file and seeing whether you get an error would do it.
> 
> Though if you want to make sure it's really re-reading the on-disk data, it's
> necessary to drop the file's pagecache first.

I tried a straight pagecache read and it worked like a charm!

But then I thought to myself, do I really want to waste memory bandwidth
copying a bunch of data?  No.  I don't even want to incur system call
overhead from reading a single byte every $pagesize bytes.

So I created 2M mmap areas and read a byte every $pagesize bytes.  That
worked too, but taking and handling SIGBUS signals like that is
annoying.

Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
like what I wanted -- it prefaults in the pages, and "If populating
fails, a SIGBUS signal is not generated; instead, an error is returned."

But then I tried rigging up a test to see if I could catch an EIO, and
instead I had to SIGKILL the process!  It looks filemap_fault returns
VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
__do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
handle_mm_fault -> faultin_page -> __get_user_pages.  In faultin_page,
the VM_FAULT_RETRY is translated to -EBUSY.

__get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
that to madvise_populate.  Unfortunately, madvise_populate increments
its loop counter by the return value (still 0) so it runs in an
infinite loop.  The only way out is SIGKILL.

So I don't know what the correct behavior is here, other than the
infinite loop seems pretty suspect.  Is it the correct behavior that
madvise_populate returns EIO if __get_user_pages ever returns zero?
That doesn't quite sound right if it's the case that a zero return could
also happen if memory is tight.

I suppose filemap_fault could return VM_FAULT_SIGBUS in this one
scenario so userspace would get an -EFAULT.  That would solve this one
case of weird behavior.  But I think that doesn't happen in the
page_not_uptodate case because fpin is non-null?

As for xfs_scrub validating data files, I suppose it's not /so/
terrible to read one byte every $fsblocksize so that we can report
exactly where fsverity and the file data became inconsistent.  The
POPULATE_READ interface doesn't tell you how many pages it /did/ manage
to load, so perhaps MADV_POPULATE_READ isn't workable anyway.

(and now I'm just handwaving wildly about pagecache behaviors ;))

--D

> > Also -- inconsistencies between the file data and the merkle tree aren't
> > something that xfs can self-heal, right?
> 
> Similar to file data itself, only way to self-heal would be via mechanisms that
> provide redundancy.  There's been some interest in adding support forward error
> correction (FEC) to fsverity similar to what dm-verity has, but this would be
> complex, and it's not something that anyone has gotten around to yet.
> 
> - Eric
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 07/24] fsverity: support block-based Merkle tree caching
  2024-03-04 19:10 ` [PATCH v5 07/24] fsverity: support block-based Merkle tree caching Andrey Albershteyn
  2024-03-06  3:56   ` Eric Biggers
@ 2024-03-11 23:22   ` Darrick J. Wong
  1 sibling, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-11 23:22 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> In the current implementation fs-verity expects the filesystem to
> provide PAGEs filled with Merkle tree blocks. Then, when fs-verity
> is done processing the blocks, the reference to the PAGE is freed.
> This doesn't fit well with the way XFS manages its memory.
> 
> To allow XFS to integrate fs-verity, this patch adds the ability for
> the fs-verity verification code to take Merkle tree blocks instead
> of a PAGE reference. This way ext4, f2fs, and btrfs are still able
> to pass PAGE references and XFS can pass references to Merkle tree
> blocks stored in XFS's buffer infrastructure.
> 
> Another addition is an invalidation function which tells fs-verity
> to mark part of the Merkle tree as not verified. This function is
> used by the filesystem to tell fs-verity to invalidate a block which
> was evicted from memory.
> 
> Depending on the Merkle tree block size, fs-verity uses either a
> bitmap or the PG_checked flag to track the "verified" status of the
> blocks. With Merkle tree block caching (XFS) there is no PAGE to
> flag as verified, so fs-verity always uses the bitmap to track
> verified blocks for filesystems which use block caching.
> 
> Further, this patch allows the filesystem to do additional
> processing on verified pages via fsverity_drop_block() instead of
> just dropping a reference. This will be used by XFS for internal
> buffer cache manipulation in later patches. btrfs, ext4, and f2fs
> just drop the reference.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/verity/fsverity_private.h |   8 +++
>  fs/verity/open.c             |   8 ++-
>  fs/verity/read_metadata.c    |  64 +++++++++++-------
>  fs/verity/verify.c           | 125 +++++++++++++++++++++++++++--------
>  include/linux/fsverity.h     |  65 ++++++++++++++++++
>  5 files changed, 217 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> index b3506f56e180..dad33e6ff0d6 100644
> --- a/fs/verity/fsverity_private.h
> +++ b/fs/verity/fsverity_private.h
> @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
>  
>  void __init fsverity_init_workqueue(void);
>  
> +/*
> + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls back into
> + * the filesystem if ->drop_block() is set; otherwise, drops the reference
> + * in block->context.
> + */
> +void fsverity_drop_block(struct inode *inode,
> +			 struct fsverity_blockbuf *block);
> +
>  #endif /* _FSVERITY_PRIVATE_H */
> diff --git a/fs/verity/open.c b/fs/verity/open.c
> index fdeb95eca3af..6e6922b4b014 100644
> --- a/fs/verity/open.c
> +++ b/fs/verity/open.c
> @@ -213,7 +213,13 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
>  	if (err)
>  		goto fail;
>  
> -	if (vi->tree_params.block_size != PAGE_SIZE) {
> +	/*
> +	 * If fs passes Merkle tree blocks to fs-verity (e.g. XFS), then
> +	 * fs-verity should use hash_block_verified bitmap as there's no page
> +	 * to mark it with PG_checked.
> +	 */
> +	if (vi->tree_params.block_size != PAGE_SIZE ||
> +			inode->i_sb->s_vop->read_merkle_tree_block) {
>  		/*
>  		 * When the Merkle tree block size and page size differ, we use
>  		 * a bitmap to keep track of which hash blocks have been
> diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> index f58432772d9e..5da40b5a81af 100644
> --- a/fs/verity/read_metadata.c
> +++ b/fs/verity/read_metadata.c
> @@ -18,50 +18,68 @@ static int fsverity_read_merkle_tree(struct inode *inode,
>  {
>  	const struct fsverity_operations *vops = inode->i_sb->s_vop;
>  	u64 end_offset;
> -	unsigned int offs_in_page;
> +	unsigned int offs_in_block;
>  	pgoff_t index, last_index;
>  	int retval = 0;
>  	int err = 0;
> +	const unsigned int block_size = vi->tree_params.block_size;
> +	const u8 log_blocksize = vi->tree_params.log_blocksize;
>  
>  	end_offset = min(offset + length, vi->tree_params.tree_size);
>  	if (offset >= end_offset)
>  		return 0;
> -	offs_in_page = offset_in_page(offset);
> -	last_index = (end_offset - 1) >> PAGE_SHIFT;
> +	offs_in_block = offset & (block_size - 1);
> +	last_index = (end_offset - 1) >> log_blocksize;
>  
>  	/*
> -	 * Iterate through each Merkle tree page in the requested range and copy
> -	 * the requested portion to userspace.  Note that the Merkle tree block
> -	 * size isn't important here, as we are returning a byte stream; i.e.,
> -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> +	 * Iterate through each Merkle tree block in the requested range and
> +	 * copy the requested portion to userspace. Note that we are returning
> +	 * a byte stream.
>  	 */
> -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> +	for (index = offset >> log_blocksize; index <= last_index; index++) {
>  		unsigned long num_ra_pages =
>  			min_t(unsigned long, last_index - index + 1,
>  			      inode->i_sb->s_bdi->io_pages);
>  		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> -						   PAGE_SIZE - offs_in_page);
> -		struct page *page;
> -		const void *virt;
> +						   block_size - offs_in_block);
> +		struct fsverity_blockbuf block = {
> +			.size = block_size,
> +		};
>  
> -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> -		if (IS_ERR(page)) {
> -			err = PTR_ERR(page);
> +		if (!vops->read_merkle_tree_block) {
> +			unsigned int blocks_per_page =
> +				vi->tree_params.blocks_per_page;
> +			unsigned long page_idx =
> +				round_down(index, blocks_per_page);
> +			struct page *page = vops->read_merkle_tree_page(inode,
> +					page_idx, num_ra_pages);
> +
> +			if (IS_ERR(page)) {
> +				err = PTR_ERR(page);
> +			} else {
> +				block.kaddr = kmap_local_page(page) +
> +					((index - page_idx) << log_blocksize);
> +				block.context = page;
> +			}
> +		} else {
> +			err = vops->read_merkle_tree_block(inode,
> +					index << log_blocksize,
> +					&block, log_blocksize, num_ra_pages);
> +		}
> +
> +		if (err) {
>  			fsverity_err(inode,
> -				     "Error %d reading Merkle tree page %lu",
> -				     err, index);
> +				     "Error %d reading Merkle tree block %lu",
> +				     err, index << log_blocksize);
>  			break;
>  		}
>  
> -		virt = kmap_local_page(page);
> -		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
> -			kunmap_local(virt);
> -			put_page(page);
> +		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
> +			fsverity_drop_block(inode, &block);
>  			err = -EFAULT;
>  			break;
>  		}
> -		kunmap_local(virt);
> -		put_page(page);
> +		fsverity_drop_block(inode, &block);
>  
>  		retval += bytes_to_copy;
>  		buf += bytes_to_copy;
> @@ -72,7 +90,7 @@ static int fsverity_read_merkle_tree(struct inode *inode,
>  			break;
>  		}
>  		cond_resched();
> -		offs_in_page = 0;
> +		offs_in_block = 0;
>  	}
>  	return retval ? retval : err;
>  }
> diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> index 4fcad0825a12..de71911d400c 100644
> --- a/fs/verity/verify.c
> +++ b/fs/verity/verify.c
> @@ -13,14 +13,17 @@
>  static struct workqueue_struct *fsverity_read_workqueue;
>  
>  /*
> - * Returns true if the hash block with index @hblock_idx in the tree, located in
> - * @hpage, has already been verified.
> + * Returns true if the hash block with index @hblock_idx in the tree has
> + * already been verified.
>   */
> -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> +static bool is_hash_block_verified(struct inode *inode,
> +				   struct fsverity_blockbuf *block,
>  				   unsigned long hblock_idx)
>  {
>  	unsigned int blocks_per_page;
>  	unsigned int i;
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	struct page *hpage = (struct page *)block->context;
>  
>  	/*
>  	 * When the Merkle tree block size and page size are the same, then the
> @@ -34,6 +37,12 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
>  	if (!vi->hash_block_verified)
>  		return PageChecked(hpage);
>  
> +	/*
> +	 * Filesystems which use block based caching (e.g. XFS) always use
> +	 * bitmap.
> +	 */
> +	if (inode->i_sb->s_vop->read_merkle_tree_block)
> +		return test_bit(hblock_idx, vi->hash_block_verified);
>  	/*
>  	 * When the Merkle tree block size and page size differ, we use a bitmap
>  	 * to indicate whether each hash block has been verified.
> @@ -95,15 +104,15 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  	const struct merkle_tree_params *params = &vi->tree_params;
>  	const unsigned int hsize = params->digest_size;
>  	int level;
> +	int err;

FWIW, Dan Carpenter complained that err can be used uninitialized
here...

> +	int num_ra_pages;
>  	u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE];
>  	const u8 *want_hash;
>  	u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE];
>  	/* The hash blocks that are traversed, indexed by level */
>  	struct {
> -		/* Page containing the hash block */
> -		struct page *page;
> -		/* Mapped address of the hash block (will be within @page) */
> -		const void *addr;
> +		/* Buffer containing the hash block */
> +		struct fsverity_blockbuf block;
>  		/* Index of the hash block in the tree overall */
>  		unsigned long index;
>  		/* Byte offset of the wanted hash relative to @addr */
> @@ -144,10 +153,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		unsigned long next_hidx;
>  		unsigned long hblock_idx;
>  		pgoff_t hpage_idx;
> +		u64 hblock_pos;
>  		unsigned int hblock_offset_in_page;
>  		unsigned int hoffset;
>  		struct page *hpage;
> -		const void *haddr;
> +		struct fsverity_blockbuf *block = &hblocks[level].block;
>  
>  		/*
>  		 * The index of the block in the current level; also the index
> @@ -165,29 +175,49 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		hblock_offset_in_page =
>  			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
>  
> +		/* Offset of the Merkle tree block into the tree */
> +		hblock_pos = hblock_idx << params->log_blocksize;
> +
>  		/* Byte offset of the hash within the block */
>  		hoffset = (hidx << params->log_digestsize) &
>  			  (params->block_size - 1);
>  
> -		hpage = inode->i_sb->s_vop->read_merkle_tree_page(inode,
> -				hpage_idx, level == 0 ? min(max_ra_pages,
> -					params->tree_pages - hpage_idx) : 0);
> -		if (IS_ERR(hpage)) {
> +		num_ra_pages = level == 0 ?
> +			min(max_ra_pages, params->tree_pages - hpage_idx) : 0;
> +
> +		if (inode->i_sb->s_vop->read_merkle_tree_block) {
> +			err = inode->i_sb->s_vop->read_merkle_tree_block(
> +				inode, hblock_pos, block, params->log_blocksize,
> +				num_ra_pages);
> +		} else {
> +			unsigned int blocks_per_page =
> +				vi->tree_params.blocks_per_page;
> +			hblock_idx = round_down(hblock_idx, blocks_per_page);
> +			hpage = inode->i_sb->s_vop->read_merkle_tree_page(
> +				inode, hpage_idx, (num_ra_pages << PAGE_SHIFT));
> +
> +			if (IS_ERR(hpage)) {
> +				err = PTR_ERR(hpage);
> +			} else {
> +				block->kaddr = kmap_local_page(hpage) +
> +					hblock_offset_in_page;
> +				block->context = hpage;

...in the case where hpage is not an error.  I'll fix that in my tree.

https://lore.kernel.org/oe-kbuild-all/f9e79efc-a16f-495d-8e53-d5a9eef5936e@moroto.mountain/T/#u

--D

> +			}
> +		}
> +
> +		if (err) {
>  			fsverity_err(inode,
> -				     "Error %ld reading Merkle tree page %lu",
> -				     PTR_ERR(hpage), hpage_idx);
> +				     "Error %d reading Merkle tree block %lu",
> +				     err, hblock_idx);
>  			goto error;
>  		}
> -		haddr = kmap_local_page(hpage) + hblock_offset_in_page;
> -		if (is_hash_block_verified(vi, hpage, hblock_idx)) {
> -			memcpy(_want_hash, haddr + hoffset, hsize);
> +
> +		if (is_hash_block_verified(inode, block, hblock_idx)) {
> +			memcpy(_want_hash, block->kaddr + hoffset, hsize);
>  			want_hash = _want_hash;
> -			kunmap_local(haddr);
> -			put_page(hpage);
> +			fsverity_drop_block(inode, block);
>  			goto descend;
>  		}
> -		hblocks[level].page = hpage;
> -		hblocks[level].addr = haddr;
>  		hblocks[level].index = hblock_idx;
>  		hblocks[level].hoffset = hoffset;
>  		hidx = next_hidx;
> @@ -197,10 +227,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  descend:
>  	/* Descend the tree verifying hash blocks. */
>  	for (; level > 0; level--) {
> -		struct page *hpage = hblocks[level - 1].page;
> -		const void *haddr = hblocks[level - 1].addr;
> +		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
> +		const void *haddr = block->kaddr;
>  		unsigned long hblock_idx = hblocks[level - 1].index;
>  		unsigned int hoffset = hblocks[level - 1].hoffset;
> +		struct page *hpage = (struct page *)block->context;
>  
>  		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
>  			goto error;
> @@ -217,8 +248,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  			SetPageChecked(hpage);
>  		memcpy(_want_hash, haddr + hoffset, hsize);
>  		want_hash = _want_hash;
> -		kunmap_local(haddr);
> -		put_page(hpage);
> +		fsverity_drop_block(inode, block);
>  	}
>  
>  	/* Finally, verify the data block. */
> @@ -235,10 +265,8 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		     params->hash_alg->name, hsize, want_hash,
>  		     params->hash_alg->name, hsize, real_hash);
>  error:
> -	for (; level > 0; level--) {
> -		kunmap_local(hblocks[level - 1].addr);
> -		put_page(hblocks[level - 1].page);
> -	}
> +	for (; level > 0; level--)
> +		fsverity_drop_block(inode, &hblocks[level - 1].block);
>  	return false;
>  }
>  
> @@ -362,3 +390,42 @@ void __init fsverity_init_workqueue(void)
>  	if (!fsverity_read_workqueue)
>  		panic("failed to allocate fsverity_read_queue");
>  }
> +
> +/**
> + * fsverity_invalidate_block() - invalidate Merkle tree block
> + * @inode: inode to which this Merkle tree block belongs
> + * @block: block to be invalidated
> + *
> + * This function invalidates/clears "verified" state of Merkle tree block
> + * in the fs-verity bitmap. The block needs to have ->offset set.
> + */
> +void fsverity_invalidate_block(struct inode *inode,
> +		struct fsverity_blockbuf *block)
> +{
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
> +
> +	if (block->offset > vi->tree_params.tree_size) {
> +		fsverity_err(inode,
> +"Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
> +			     vi->tree_params.tree_size, block->offset);
> +		return;
> +	}
> +
> +	clear_bit(block->offset >> log_blocksize, vi->hash_block_verified);
> +}
> +EXPORT_SYMBOL_GPL(fsverity_invalidate_block);
> +
> +void fsverity_drop_block(struct inode *inode,
> +		struct fsverity_blockbuf *block)
> +{
> +	if (inode->i_sb->s_vop->drop_block)
> +		inode->i_sb->s_vop->drop_block(block);
> +	else {
> +		struct page *page = (struct page *)block->context;
> +
> +		kunmap_local(block->kaddr);
> +		put_page(page);
> +	}
> +	block->kaddr = NULL;
> +}
> diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
> index ac58b19f23d3..0973b521ac5a 100644
> --- a/include/linux/fsverity.h
> +++ b/include/linux/fsverity.h
> @@ -26,6 +26,33 @@
>  /* Arbitrary limit to bound the kmalloc() size.  Can be changed. */
>  #define FS_VERITY_MAX_DESCRIPTOR_SIZE	16384
>  
> +/**
> + * struct fsverity_blockbuf - Merkle Tree block buffer
> + * @kaddr: virtual address of the block's data
> + * @offset: block's offset into Merkle tree
> + * @size: the Merkle tree block size
> + * @context: filesystem private context
> + *
> + * Buffer containing a single Merkle tree block. These buffers are passed
> + *  - to the filesystem, when fs-verity is building the Merkle tree,
> + *  - from the filesystem, when fs-verity is reading the Merkle tree from disk.
> + * The filesystem sets kaddr together with size to point to memory containing
> + * the Merkle tree block. fs-verity does the same when a Merkle tree block
> + * needs to be written to disk.
> + *
> + * While reading the tree, fs-verity calls ->read_merkle_tree_block followed by
> + * ->drop_block to let filesystem know that memory can be freed.
> + *
> + * The context is optional. This field can be used by the filesystem to
> + * pass state through from ->read_merkle_tree_block to ->drop_block.
> + */
> +struct fsverity_blockbuf {
> +	void *kaddr;
> +	u64 offset;
> +	unsigned int size;
> +	void *context;
> +};
> +
>  /* Verity operations for filesystems */
>  struct fsverity_operations {
>  
> @@ -107,6 +134,32 @@ struct fsverity_operations {
>  					      pgoff_t index,
>  					      unsigned long num_ra_pages);
>  
> +	/**
> +	 * Read a Merkle tree block of the given inode.
> +	 * @inode: the inode
> +	 * @pos: byte offset of the block within the Merkle tree
> +	 * @block: block buffer for filesystem to point it to the block
> +	 * @log_blocksize: log2 of the size of the expected block
> +	 * @ra_bytes: The number of bytes that should be
> +	 *		prefetched starting at @pos if the block at @pos
> +	 *		isn't already cached.  Implementations may ignore this
> +	 *		argument; it's only a performance optimization.
> +	 *
> +	 * This can be called at any time on an open verity file.  It may be
> +	 * called by multiple processes concurrently.
> +	 *
> +	 * If the block was evicted from memory, the filesystem has to use
> +	 * fsverity_invalidate_block() to let fsverity know that the block's
> +	 * verification state is no longer valid.
> +	 *
> +	 * Return: 0 on success, -errno on failure
> +	 */
> +	int (*read_merkle_tree_block)(struct inode *inode,
> +				      u64 pos,
> +				      struct fsverity_blockbuf *block,
> +				      unsigned int log_blocksize,
> +				      u64 ra_bytes);
> +
>  	/**
>  	 * Write a Merkle tree block to the given inode.
>  	 *
> @@ -122,6 +175,16 @@ struct fsverity_operations {
>  	 */
>  	int (*write_merkle_tree_block)(struct inode *inode, const void *buf,
>  				       u64 pos, unsigned int size);
> +
> +	/**
> +	 * Release the reference to a Merkle tree block
> +	 *
> +	 * @block: the block to release
> +	 *
> +	 * This is called when fs-verity is done with a block obtained with
> +	 * ->read_merkle_tree_block().
> +	 */
> +	void (*drop_block)(struct fsverity_blockbuf *block);
>  };
>  
>  #ifdef CONFIG_FS_VERITY
> @@ -175,6 +238,8 @@ int fsverity_ioctl_read_metadata(struct file *filp, const void __user *uarg);
>  bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset);
>  void fsverity_verify_bio(struct bio *bio);
>  void fsverity_enqueue_verify_work(struct work_struct *work);
> +void fsverity_invalidate_block(struct inode *inode,
> +		struct fsverity_blockbuf *block);
>  
>  #else /* !CONFIG_FS_VERITY */
>  
> -- 
> 2.42.0
> 
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-11 15:25             ` Darrick J. Wong
@ 2024-03-12  2:43               ` Eric Biggers
  2024-03-12  4:15                 ` Darrick J. Wong
  2024-03-12  2:45               ` Darrick J. Wong
  1 sibling, 1 reply; 94+ messages in thread
From: Eric Biggers @ 2024-03-12  2:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu

On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > > I wonder, what are the access patterns for merkle blobs?  Is it actually
> > > > sequential, or is it more like 0 -> N -> N*N as we walk towards leaves?
> > 
> > I think the leaf level (i.e. individual record) access patterns
> > largely match data access patterns, so I'd just treat it as if
> > it's a normal file being accessed....
> 
> <nod> The latest version of this tries to avoid letting reclaim take the
> top of the tree.  Logically this makes sense to me to reduce read verify
> latency, but I was hoping Eric or Andrey or someone with more
> familiarity with fsverity would chime in on whether or not that made
> sense.

The levels are stored ordered from root to leaf; they aren't interleaved.  So
technically the overall access pattern isn't sequential, but it is sequential
within each level, and the leaf level is over 99% of the accesses anyway
(assuming the default parameters where each block has 128 children).

It makes sense to try to keep the top levels, i.e. the blocks with the lowest
indices, cached.  (The other filesystems that support fsverity currently aren't
doing anything special to treat them differently from other pagecache pages.)

- Eric

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-11 15:25             ` Darrick J. Wong
  2024-03-12  2:43               ` Eric Biggers
@ 2024-03-12  2:45               ` Darrick J. Wong
  2024-03-12  7:01                 ` Dave Chinner
  1 sibling, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-12  2:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, ebiggers

On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:

<snip>

> <nod> The latest version of this tries to avoid letting reclaim take the
> top of the tree.  Logically this makes sense to me to reduce read verify
> latency, but I was hoping Eric or Andrey or someone with more
> familiarity with fsverity would chime in on whether or not that made
> sense.
> 
> > > > Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
> > > > that done because merkle trees may some day have more than 2^32 blocks
> > > > in them?  That won't play well with things like xarrays on 32-bit
> > > > machines.
> > > > 
> > > > (Granted we've been talking about deprecating XFS on 32-bit for a while
> > > > now but we're not the whole world)
> > > > 
> > > > > i.e.
> > > > > 
> > > > > 	p = xa_load(key);
> > > > > 	if (p)
> > > > > 		return p;
> > > > > 
> > > > > 	xfs_attr_get(key);
> > > > > 	if (!args->value)
> > > > > 		/* error */
> > > > > 
> > > > > 	/*
> > > > > 	 * store the current value, freeing any old value that we
> > > > > 	 * replaced at this key. Don't care about failure to store,
> > > > > 	 * this is optimistic caching.
> > > > > 	 */
> > > > > 	p = xa_store(key, args->value, GFP_NOFS);
> > > > > 	if (p)
> > > > > 		kvfree(p);
> > > > > 	return args->value;
> > > > 
> > > > Attractive.  Will have to take a look at that tomorrow.
> > > 
> > > Done.  I think.  Not sure that I actually got all the interactions
> > > between the shrinker and the xarray correct though.  KASAN and lockdep
> > > don't have any complaints running fstests, so that's a start.
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09
> > 
> > My initial impression is "over-engineered".
> 
> Overly focused on unfamiliar data structures -- I've never built
> anything with an xarray before.
> 
> > I personally would have just allocated the xattr value buffer with a
> > little extra size and added all the external cache information (a
> > reference counter is all we need as these are fixed sized blocks) to
> > the tail of the blob we actually pass to fsverity.
> 
> Oho, that's a good idea.  I didn't like the separate allocation anyway.
> Friday was mostly chatting with willy and trying to make sure I got the
> xarray access patterns correct.
> 
> >                                                    If we tag the
> > inode in the radix tree as having verity blobs that can be freed, we
> > can then just extend the existing fs superblock shrinker callout to
> > also walk all the verity inodes with cached data to try to reclaim
> > some objects...
> 
> This too is a wonderful suggestion -- use the third radix tree tag to
> mark inodes with extra incore caches that can be reclaimed, then teach
> xfs_reclaim_inodes_{nr,count} to scan them.  Allocating a per-inode
> shrinker was likely to cause problems with flooding debugfs with too
> many knobs anyway.
> 
> > But, if a generic blob cache is what it takes to move this forwards,
> > so be it.
> 
> Not necessarily. ;)

And here's today's branch, with xfs_blobcache.[ch] removed and a few
more cleanups:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11

--D

> 
> --D
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 

* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-12  2:43               ` Eric Biggers
@ 2024-03-12  4:15                 ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-12  4:15 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Dave Chinner, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu

On Mon, Mar 11, 2024 at 07:43:20PM -0700, Eric Biggers wrote:
> On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > > > I wonder, what are the access patterns for merkle blobs?  Is it actually
> > > > > sequential, or is more like 0 -> N -> N*N as we walk towards leaves?
> > > 
> > > I think the leaf level (i.e. individual record) access patterns
> > > largely match data access patterns, so I'd just treat it like as if
> > > it's a normal file being accessed....
> > 
> > <nod> The latest version of this tries to avoid letting reclaim take the
> > top of the tree.  Logically this makes sense to me to reduce read verify
> > latency, but I was hoping Eric or Andrey or someone with more
> > familiarity with fsverity would chime in on whether or not that made
> > sense.
> 
> The levels are stored ordered from root to leaf; they aren't interleaved.  So
> technically the overall access pattern isn't sequential, but it is sequential
> within each level, and the leaf level is over 99% of the accesses anyway
> (assuming the default parameters where each block has 128 children).

<nod>

> It makes sense to try to keep the top levels, i.e. the blocks with the lowest
> indices, cached.  (The other filesystems that support fsverity currently aren't
> doing anything special to treat them differently from other pagecache pages.)

<nod>

--D

> - Eric
> 

* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-12  2:45               ` Darrick J. Wong
@ 2024-03-12  7:01                 ` Dave Chinner
  2024-03-12 20:04                   ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Dave Chinner @ 2024-03-12  7:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, ebiggers

On Mon, Mar 11, 2024 at 07:45:07PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > But, if a generic blob cache is what it takes to move this forwards,
> > > so be it.
> > 
> > Not necessarily. ;)
> 
> And here's today's branch, with xfs_blobcache.[ch] removed and a few
> more cleanups:
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11

Walking all the inodes counting all the verity blobs in the shrinker
is going to be -expensive-. Shrinkers are run very frequently and
with high concurrency under memory pressure by direct reclaim, and
every single shrinker execution is going to run that traversal even
if it is decided there is nothing that can be shrunk.

IMO, it would be better to keep a count of reclaimable objects
either on the inode itself (protected by the xa_lock when
adding/removing) to avoid needing to walk the xarray to count the
blocks on the inode. Even better would be a counter in the perag or
a global percpu counter in the mount of all caches objects. Both of
those pretty much remove all the shrinker side counting overhead.

Couple of other small things.

- verity shrinker belongs in xfs_verity.c, not xfs_icache.c. It
  really has nothing to do with the icache other than calling
  xfs_icwalk(). That gets rid of some of the config ifdefs.

- SHRINK_STOP is what should be returned by the scan when
  xfs_verity_shrinker_scan() wants the shrinker to immediately stop,
  not LONG_MAX.

- In xfs_verity_cache_shrink_scan(), nr_to_scan is a count of how
  many objects to try to free, not how many we must free. i.e. even
  if we can't free objects, they are still objects that got scanned
  and so should decrement nr_to_scan...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v5 20/24] xfs: disable direct read path for fs-verity files
  2024-03-07 22:11   ` Darrick J. Wong
@ 2024-03-12 12:02     ` Andrey Albershteyn
  2024-03-12 16:36       ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-12 12:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On 2024-03-07 14:11:08, Darrick J. Wong wrote:
> On Mon, Mar 04, 2024 at 08:10:43PM +0100, Andrey Albershteyn wrote:
> > The direct path is not supported on verity files. Attempts to use direct
> > I/O path on such files should fall back to buffered I/O path.
> > 
> > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > ---
> >  fs/xfs/xfs_file.c | 15 ++++++++++++---
> >  1 file changed, 12 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 17404c2e7e31..af3201075066 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -281,7 +281,8 @@ xfs_file_dax_read(
> >  	struct kiocb		*iocb,
> >  	struct iov_iter		*to)
> >  {
> > -	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
> > +	struct inode		*inode = iocb->ki_filp->f_mapping->host;
> > +	struct xfs_inode	*ip = XFS_I(inode);
> >  	ssize_t			ret = 0;
> >  
> >  	trace_xfs_file_dax_read(iocb, to);
> > @@ -334,10 +335,18 @@ xfs_file_read_iter(
> >  
> >  	if (IS_DAX(inode))
> >  		ret = xfs_file_dax_read(iocb, to);
> > -	else if (iocb->ki_flags & IOCB_DIRECT)
> > +	else if (iocb->ki_flags & IOCB_DIRECT && !fsverity_active(inode))
> >  		ret = xfs_file_dio_read(iocb, to);
> > -	else
> > +	else {
> 
> I think the earlier cases need curly braces {} too.
> 
> > +		/*
> > +		 * In case fs-verity is enabled, we also fallback to the
> > +		 * buffered read from the direct read path. Therefore,
> > +		 * IOCB_DIRECT is set and need to be cleared (see
> > +		 * generic_file_read_iter())
> > +		 */
> > +		iocb->ki_flags &= ~IOCB_DIRECT;
> 
> I'm curious that you added this flag here; how have we gotten along
> this far without clearing it?
> 
> --D

Do you know of a better place? I'm not sure it should happen any
earlier. I've made it the same as ext4 does it.

-- 
- Andrey


* Re: [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag
  2024-03-07 22:18   ` Darrick J. Wong
@ 2024-03-12 12:10     ` Andrey Albershteyn
  2024-03-12 16:38       ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-12 12:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On 2024-03-07 14:18:09, Darrick J. Wong wrote:
> On Mon, Mar 04, 2024 at 08:10:45PM +0100, Andrey Albershteyn wrote:
> > fs-verity adds a new inode flag which causes scrub to fail as it
> > is not yet known.
> > 
> > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/scrub/attr.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
> > index 9a1f59f7b5a4..ae4227cb55ec 100644
> > --- a/fs/xfs/scrub/attr.c
> > +++ b/fs/xfs/scrub/attr.c
> > @@ -494,7 +494,7 @@ xchk_xattr_rec(
> >  	/* Retrieve the entry and check it. */
> >  	hash = be32_to_cpu(ent->hashval);
> >  	badflags = ~(XFS_ATTR_LOCAL | XFS_ATTR_ROOT | XFS_ATTR_SECURE |
> > -			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT);
> > +			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT | XFS_ATTR_VERITY);
> 
> Now that online repair can modify/discard/salvage broken xattr trees and
> is pretty close to merging, how can I make it invalidate all the incore
> merkle tree data after a repair?
> 
> --D
> 

I suppose dropping all the xattr XFS_ATTR_VERITY buffers associated
with an inode should do the job.

-- 
- Andrey


* Re: [PATCH v5 23/24] xfs: add fs-verity ioctls
  2024-03-07 22:14   ` Darrick J. Wong
@ 2024-03-12 12:42     ` Andrey Albershteyn
  0 siblings, 0 replies; 94+ messages in thread
From: Andrey Albershteyn @ 2024-03-12 12:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On 2024-03-07 14:14:45, Darrick J. Wong wrote:
> On Mon, Mar 04, 2024 at 08:10:46PM +0100, Andrey Albershteyn wrote:
> > Add fs-verity ioctls to enable, dump metadata (descriptor and Merkle
> > tree pages) and obtain file's digest.
> > 
> > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > ---
> >  fs/xfs/xfs_ioctl.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index ab61d7d552fb..4763d20c05ff 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -43,6 +43,7 @@
> >  #include <linux/mount.h>
> >  #include <linux/namei.h>
> >  #include <linux/fileattr.h>
> > +#include <linux/fsverity.h>
> >  
> >  /*
> >   * xfs_find_handle maps from userspace xfs_fsop_handlereq structure to
> > @@ -2174,6 +2175,22 @@ xfs_file_ioctl(
> >  		return error;
> >  	}
> >  
> > +	case FS_IOC_ENABLE_VERITY:
> > +		if (!xfs_has_verity(mp))
> > +			return -EOPNOTSUPP;
> > +		return fsverity_ioctl_enable(filp, (const void __user *)arg);
> 
> Isn't @arg already declared as a (void __user *) ?
> 
> --D
> 

Right, will remove that.

-- 
- Andrey


* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-11 22:38             ` Darrick J. Wong
@ 2024-03-12 15:13               ` David Hildenbrand
  2024-03-12 15:33                 ` David Hildenbrand
  0 siblings, 1 reply; 94+ messages in thread
From: David Hildenbrand @ 2024-03-12 15:13 UTC (permalink / raw)
  To: Darrick J. Wong, Matthew Wilcox
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, akpm, linux-mm, Eric Biggers

On 11.03.24 23:38, Darrick J. Wong wrote:
> [add willy and linux-mm]
> 
> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>
>>> Sorry to answer your question with a question, but how much checking is
>>> $filesystem expected to do for merkle trees?
>>>
>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>> walk the merkle tree blocks, and even read the file data to confirm
>>> intactness.  If the descriptor specifies the highest block address then
>>> we could certainly trim off excess blocks.  But I don't know how much of
>>> libfsverity actually lets you do that; I haven't looked into that
>>> deeply. :/
>>>
>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>> need to stream reads of the verity files through the page cache and let
>>> verity tell us if the file data are consistent.
>>>
>>> For both tools, if something finds errors in the merkle tree structure
>>> itself, do we turn off verity?  Or do we do something nasty like
>>> truncate the file?
>>
>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>> validating the data of verity inodes against their Merkle trees.
>>
>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>
>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>> should delete that inode's verity metadata and remove the verity flag from the
>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>> fairly straightforward, but it probably wouldn't catch much compared to actually
>> validating the data against the Merkle tree.  And actually validating the data
>> against the Merkle tree would be complex and expensive.  Note, none of this
>> would work on files that are encrypted.
>>
>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>> quite what it was designed for.
>>
>>> Is there an ioctl or something that allows userspace to validate an
>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>> block devices, except that we might believe its answers?
>>
>> Just reading the whole file and seeing whether you get an error would do it.
>>
>> Though if you want to make sure it's really re-reading the on-disk data, it's
>> necessary to drop the file's pagecache first.
> 
> I tried a straight pagecache read and it worked like a charm!
> 
> But then I thought to myself, do I really want to waste memory bandwidth
> copying a bunch of data?  No.  I don't even want to incur system call
> overhead from reading a single byte every $pagesize bytes.
> 
> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
> worked too, insofar as SIGBUSes are annoying to handle.  But it's
> annoying to take signals like that.
> 
> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
> like what I wanted -- it prefaults in the pages, and "If populating
> fails, a SIGBUS signal is not generated; instead, an error is returned."
> 

Yes, these were the expected semantics :)

> But then I tried rigging up a test to see if I could catch an EIO, and
> instead I had to SIGKILL the process!  It looks filemap_fault returns
> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_page,
> the VM_FAULT_RETRY is translated to -EBUSY.
> 
> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
> that to madvise_populate.  Unfortunately, madvise_populate increments
> its loop counter by the return value (still 0) so it runs in an
> infinite loop.  The only way out is SIGKILL.

That's certainly unexpected. One user I know is QEMU, which primarily 
uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is 
primarily used with shmem/hugetlb, where I haven't heard of any such 
endless loops.

> 
> So I don't know what the correct behavior is here, other than the
> infinite loop seems pretty suspect.  Is it the correct behavior that
> madvise_populate returns EIO if __get_user_pages ever returns zero?
> That doesn't quite sound right if it's the case that a zero return could
> also happen if memory is tight.

madvise_populate() ends up calling faultin_vma_page_range() in a loop. 
That one calls __get_user_pages().

__get_user_pages() documents: "0 return value is possible when the fault 
would need to be retried."

So that's what the caller does. IIRC, there are cases where we really 
have to retry (at least once) and will make progress, so treating "0" as 
an error would be wrong.

Staring at other __get_user_pages() users, __get_user_pages_locked() 
documents: "Please note that this function, unlike __get_user_pages(), 
will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".

But there is some elaborate retry logic in there, whereby the retry will 
set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second 
retry attempt (there are cases where we retry more often, but that's 
related to something else I believe).

So maybe we need a similar retry logic in faultin_vma_page_range()? Or 
make it use __get_user_pages_locked(), but I recall when I introduced 
MADV_POPULATE_READ, there was a catch to it.

-- 
Cheers,

David / dhildenb


* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-12 15:13               ` David Hildenbrand
@ 2024-03-12 15:33                 ` David Hildenbrand
  2024-03-12 16:44                   ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: David Hildenbrand @ 2024-03-12 15:33 UTC (permalink / raw)
  To: Darrick J. Wong, Matthew Wilcox
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, akpm, linux-mm, Eric Biggers

On 12.03.24 16:13, David Hildenbrand wrote:
> On 11.03.24 23:38, Darrick J. Wong wrote:
>> [add willy and linux-mm]
>>
>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>
>>>> Sorry to answer your question with a question, but how much checking is
>>>> $filesystem expected to do for merkle trees?
>>>>
>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>> intactness.  If the descriptor specifies the highest block address then
>>>> we could certainly trim off excess blocks.  But I don't know how much of
>>>> libfsverity actually lets you do that; I haven't looked into that
>>>> deeply. :/
>>>>
>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>> need to stream reads of the verity files through the page cache and let
>>>> verity tell us if the file data are consistent.
>>>>
>>>> For both tools, if something finds errors in the merkle tree structure
>>>> itself, do we turn off verity?  Or do we do something nasty like
>>>> truncate the file?
>>>
>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>> validating the data of verity inodes against their Merkle trees.
>>>
>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>
>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>> should delete that inode's verity metadata and remove the verity flag from the
>>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>> validating the data against the Merkle tree.  And actually validating the data
>>> against the Merkle tree would be complex and expensive.  Note, none of this
>>> would work on files that are encrypted.
>>>
>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>>> quite what it was designed for.
>>>
>>>> Is there an ioctl or something that allows userspace to validate an
>>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>>> block devices, except that we might believe its answers?
>>>
>>> Just reading the whole file and seeing whether you get an error would do it.
>>>
>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>> necessary to drop the file's pagecache first.
>>
>> I tried a straight pagecache read and it worked like a charm!
>>
>> But then I thought to myself, do I really want to waste memory bandwidth
>> copying a bunch of data?  No.  I don't even want to incur system call
>> overhead from reading a single byte every $pagesize bytes.
>>
>> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
>> worked too, insofar as SIGBUSes are annoying to handle.  But it's
>> annoying to take signals like that.
>>
>> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
>> like what I wanted -- it prefaults in the pages, and "If populating
>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>
> 
> Yes, these were the expected semantics :)
> 
>> But then I tried rigging up a test to see if I could catch an EIO, and
>> instead I had to SIGKILL the process!  It looks filemap_fault returns
>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_page,
>> the VM_FAULT_RETRY is translated to -EBUSY.
>>
>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>> that to madvise_populate.  Unfortunately, madvise_populate increments
>> its loop counter by the return value (still 0) so it runs in an
>> infinite loop.  The only way out is SIGKILL.
> 
> That's certainly unexpected. One user I know is QEMU, which primarily
> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
> primarily used with shmem/hugetlb, where I haven't heard of any such
> endless loops.
> 
>>
>> So I don't know what the correct behavior is here, other than the
>> infinite loop seems pretty suspect.  Is it the correct behavior that
>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>> That doesn't quite sound right if it's the case that a zero return could
>> also happen if memory is tight.
> 
> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
> That one calls __get_user_pages().
> 
> __get_user_pages() documents: "0 return value is possible when the fault
> would need to be retried."
> 
> So that's what the caller does. IIRC, there are cases where we really
> have to retry (at least once) and will make progress, so treating "0" as
> an error would be wrong.
> 
> Staring at other __get_user_pages() users, __get_user_pages_locked()
> documents: "Please note that this function, unlike __get_user_pages(),
> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
> 
> But there is some elaborate retry logic in there, whereby the retry will
> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
> retry attempt (there are cases where we retry more often, but that's
> related to something else I believe).
> 
> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
> make it use __get_user_pages_locked(), but I recall when I introduced
> MADV_POPULATE_READ, there was a catch to it.

I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the 
mmap()+access case you describe above.

Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately 
see how we would transition from a VM_FAULT_RETRY loop to 
VM_FAULT_SIGBUS. Because VM_FAULT_SIGBUS would be required for that 
function to call do_sigbus().

-- 
Cheers,

David / dhildenb


* Re: [PATCH v5 20/24] xfs: disable direct read path for fs-verity files
  2024-03-12 12:02     ` Andrey Albershteyn
@ 2024-03-12 16:36       ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-12 16:36 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Tue, Mar 12, 2024 at 01:02:52PM +0100, Andrey Albershteyn wrote:
> On 2024-03-07 14:11:08, Darrick J. Wong wrote:
> > On Mon, Mar 04, 2024 at 08:10:43PM +0100, Andrey Albershteyn wrote:
> > > The direct path is not supported on verity files. Attempts to use direct
> > > I/O path on such files should fall back to buffered I/O path.
> > > 
> > > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > > ---
> > >  fs/xfs/xfs_file.c | 15 ++++++++++++---
> > >  1 file changed, 12 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 17404c2e7e31..af3201075066 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -281,7 +281,8 @@ xfs_file_dax_read(
> > >  	struct kiocb		*iocb,
> > >  	struct iov_iter		*to)
> > >  {
> > > -	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
> > > +	struct inode		*inode = iocb->ki_filp->f_mapping->host;
> > > +	struct xfs_inode	*ip = XFS_I(inode);
> > >  	ssize_t			ret = 0;
> > >  
> > >  	trace_xfs_file_dax_read(iocb, to);
> > > @@ -334,10 +335,18 @@ xfs_file_read_iter(
> > >  
> > >  	if (IS_DAX(inode))
> > >  		ret = xfs_file_dax_read(iocb, to);
> > > -	else if (iocb->ki_flags & IOCB_DIRECT)
> > > +	else if (iocb->ki_flags & IOCB_DIRECT && !fsverity_active(inode))
> > >  		ret = xfs_file_dio_read(iocb, to);
> > > -	else
> > > +	else {
> > 
> > I think the earlier cases need curly braces {} too.
> > 
> > > +		/*
> > > +		 * In case fs-verity is enabled, we also fallback to the
> > > +		 * buffered read from the direct read path. Therefore,
> > > +		 * IOCB_DIRECT is set and need to be cleared (see
> > > +		 * generic_file_read_iter())
> > > +		 */
> > > +		iocb->ki_flags &= ~IOCB_DIRECT;
> > 
> > I'm curious that you added this flag here; how have we gotten along
> > this far without clearing it?
> > 
> > --D
> 
> Do you know any better place? Not sure if that should be somewhere
> before. I've made it same as ext4 does it.

It's not the placement that I'm wondering about, it's /only/ the
clearing of IOCB_DIRECT.

Notice how directio writes (xfs_file_write_iter) fall back to buffered
without clearing that flag:


	if (iocb->ki_flags & IOCB_DIRECT) {
		/*
		 * Allow a directio write to fall back to a buffered
		 * write *only* in the case that we're doing a reflink
		 * CoW.  In all other directio scenarios we do not
		 * allow an operation to fall back to buffered mode.
		 */
		ret = xfs_file_dio_write(iocb, from);
		if (ret != -ENOTBLK)
			return ret;
	}

	return xfs_file_buffered_write(iocb, from);

Or is the problem here that generic_file_read_iter will trip over it?
Oh, haha, yes it will.

Maybe we should call filemap_read directly then?  But I guess this
(clearing IOCB_DIRECT) is fine the way it is.

--D

> -- 
> - Andrey
> 
> 

* Re: [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag
  2024-03-12 12:10     ` Andrey Albershteyn
@ 2024-03-12 16:38       ` Darrick J. Wong
  2024-03-13  1:35         ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-12 16:38 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Tue, Mar 12, 2024 at 01:10:06PM +0100, Andrey Albershteyn wrote:
> On 2024-03-07 14:18:09, Darrick J. Wong wrote:
> > On Mon, Mar 04, 2024 at 08:10:45PM +0100, Andrey Albershteyn wrote:
> > > fs-verity adds a new inode flag which causes scrub to fail as it
> > > is not yet known.
> > > 
> > > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/scrub/attr.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
> > > index 9a1f59f7b5a4..ae4227cb55ec 100644
> > > --- a/fs/xfs/scrub/attr.c
> > > +++ b/fs/xfs/scrub/attr.c
> > > @@ -494,7 +494,7 @@ xchk_xattr_rec(
> > >  	/* Retrieve the entry and check it. */
> > >  	hash = be32_to_cpu(ent->hashval);
> > >  	badflags = ~(XFS_ATTR_LOCAL | XFS_ATTR_ROOT | XFS_ATTR_SECURE |
> > > -			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT);
> > > +			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT | XFS_ATTR_VERITY);
> > 
> > Now that online repair can modify/discard/salvage broken xattr trees and
> > is pretty close to merging, how can I make it invalidate all the incore
> > merkle tree data after a repair?
> > 
> > --D
> > 
> 
> I suppose dropping all the xattr XFS_ATTR_VERITY buffers associated
> with an inode should do the job.

Oh!  Yes, xfs_verity_cache_drop would handle that nicely.

--D

> -- 
> - Andrey
> 
> 

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-12 15:33                 ` David Hildenbrand
@ 2024-03-12 16:44                   ` Darrick J. Wong
  2024-03-13 12:29                     ` David Hildenbrand
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-12 16:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
> On 12.03.24 16:13, David Hildenbrand wrote:
> > On 11.03.24 23:38, Darrick J. Wong wrote:
> > > [add willy and linux-mm]
> > > 
> > > On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
> > > > On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > > > > > BTW, is xfs_repair planned to do anything about any such extra blocks?
> > > > > 
> > > > > Sorry to answer your question with a question, but how much checking is
> > > > > $filesystem expected to do for merkle trees?
> > > > > 
> > > > > In theory xfs_repair could learn how to interpret the verity descriptor,
> > > > > walk the merkle tree blocks, and even read the file data to confirm
> > > > > intactness.  If the descriptor specifies the highest block address then
> > > > > we could certainly trim off excess blocks.  But I don't know how much of
> > > > > libfsverity actually lets you do that; I haven't looked into that
> > > > > deeply. :/
> > > > > 
> > > > > For xfs_scrub I guess the job is theoretically simpler, since we only
> > > > > need to stream reads of the verity files through the page cache and let
> > > > > verity tell us if the file data are consistent.
> > > > > 
> > > > > For both tools, if something finds errors in the merkle tree structure
> > > > > itself, do we turn off verity?  Or do we do something nasty like
> > > > > truncate the file?
> > > > 
> > > > As far as I know (I haven't been following btrfs-progs, but I'm familiar with
> > > > e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
> > > > validating the data of verity inodes against their Merkle trees.
> > > > 
> > > > e2fsck does delete the verity metadata of inodes that don't have the verity flag
> > > > enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
> > > > 
> > > > I suppose that ideally, if an inode's verity metadata is invalid, then fsck
> > > > should delete that inode's verity metadata and remove the verity flag from the
> > > > inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
> > > > fairly straightforward, but it probably wouldn't catch much compared to actually
> > > > validating the data against the Merkle tree.  And actually validating the data
> > > > against the Merkle tree would be complex and expensive.  Note, none of this
> > > > would work on files that are encrypted.
> > > > 
> > > > Re: libfsverity, I think it would be possible to validate a Merkle tree using
> > > > libfsverity_compute_digest() and the callbacks that it supports.  But that's not
> > > > quite what it was designed for.
> > > > 
> > > > > Is there an ioctl or something that allows userspace to validate an
> > > > > entire file's contents?  Sort of like what BLKVERIFY would have done for
> > > > > block devices, except that we might believe its answers?
> > > > 
> > > > Just reading the whole file and seeing whether you get an error would do it.
> > > > 
> > > > Though if you want to make sure it's really re-reading the on-disk data, it's
> > > > necessary to drop the file's pagecache first.
> > > 
> > > I tried a straight pagecache read and it worked like a charm!
> > > 
> > > But then I thought to myself, do I really want to waste memory bandwidth
> > > copying a bunch of data?  No.  I don't even want to incur system call
> > > overhead from reading a single byte every $pagesize bytes.
> > > 
> > > So I created 2M mmap areas and read a byte every $pagesize bytes.  That
> > > worked too, insofar as SIGBUSes are annoying to handle.  But it's
> > > annoying to take signals like that.
> > > 
> > > Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
> > > like what I wanted -- it prefaults in the pages, and "If populating
> > > fails, a SIGBUS signal is not generated; instead, an error is returned."
> > > 
> > 
> > Yes, these were the expected semantics :)
> > 
> > > But then I tried rigging up a test to see if I could catch an EIO, and
> > > instead I had to SIGKILL the process!  It looks like filemap_fault returns
> > > VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
> > > __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
> > > handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
> > > the VM_FAULT_RETRY is translated to -EBUSY.
> > > 
> > > __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
> > > that to madvise_populate.  Unfortunately, madvise_populate increments
> > > its loop counter by the return value (still 0) so it runs in an
> > > infinite loop.  The only way out is SIGKILL.
> > 
> > That's certainly unexpected. One user I know is QEMU, which primarily
> > uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
> > primarily used with shmem/hugetlb, where I haven't heard of any such
> > endless loops.
> > 
> > > 
> > > So I don't know what the correct behavior is here, other than the
> > > infinite loop seems pretty suspect.  Is it the correct behavior that
> > > madvise_populate returns EIO if __get_user_pages ever returns zero?
> > > That doesn't quite sound right if it's the case that a zero return could
> > > also happen if memory is tight.
> > 
> > madvise_populate() ends up calling faultin_vma_page_range() in a loop.
> > That one calls __get_user_pages().
> > 
> > __get_user_pages() documents: "0 return value is possible when the fault
> > would need to be retried."
> > 
> > So that's what the caller does. IIRC, there are cases where we really
> > have to retry (at least once) and will make progress, so treating "0" as
> > an error would be wrong.
> > 
> > Staring at other __get_user_pages() users, __get_user_pages_locked()
> > documents: "Please note that this function, unlike __get_user_pages(),
> > will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
> > 
> > But there is some elaborate retry logic in there, whereby the retry will
> > set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
> > retry attempt (there are cases where we retry more often, but that's
> > related to something else I believe).
> > 
> > So maybe we need a similar retry logic in faultin_vma_page_range()? Or
> > make it use __get_user_pages_locked(), but I recall when I introduced
> > MADV_POPULATE_READ, there was a catch to it.
> 
> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
> mmap()+access case you describe above.
> 
> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
> Because VM_FAULT_SIGBUS would be required for that function to call
> do_sigbus().

The code I was looking at yesterday in filemap_fault was:

page_not_uptodate:
	/*
	 * Umm, take care of errors if the page isn't up-to-date.
	 * Try to re-read it _once_. We do this synchronously,
	 * because there really aren't any performance issues here
	 * and we need to check for errors.
	 */
	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
	if (fpin)
		goto out_retry;
	folio_put(folio);

	if (!error || error == AOP_TRUNCATED_PAGE)
		goto retry_find;
	filemap_invalidate_unlock_shared(mapping);

	return VM_FAULT_SIGBUS;

Wherein I /think/ fpin is non-null in this case, so if
filemap_read_folio returns an error, we'll do this instead:

out_retry:
	/*
	 * We dropped the mmap_lock, we need to return to the fault handler to
	 * re-find the vma and come back and find our hopefully still populated
	 * page.
	 */
	if (!IS_ERR(folio))
		folio_put(folio);
	if (mapping_locked)
		filemap_invalidate_unlock_shared(mapping);
	if (fpin)
		fput(fpin);
	return ret | VM_FAULT_RETRY;

and since ret was 0 before the goto, the only return code is
VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:

	if (fpin) {
		if (error)
			ret |= VM_FAULT_SIGBUS;
		goto out_retry;
	}

But I think the hard part here is that there doesn't seem to be any
distinction between transient read errors (e.g. disk cable fell out) vs.
semi-permanent errors (e.g. verity says the hash doesn't match).
AFAICT, either the read(ahead) sets uptodate and callers read the page,
or it doesn't set it and callers treat that as an error-retry
opportunity.

For the transient error case VM_FAULT_RETRY makes perfect sense; for the
second case I imagine we'd want something closer to _SIGBUS.

</handwave>

--D

> -- 
> Cheers,
> 
> David / dhildenb
> 
> 


* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-12  7:01                 ` Dave Chinner
@ 2024-03-12 20:04                   ` Darrick J. Wong
  2024-03-12 21:45                     ` Dave Chinner
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-12 20:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, ebiggers

On Tue, Mar 12, 2024 at 06:01:01PM +1100, Dave Chinner wrote:
> On Mon, Mar 11, 2024 at 07:45:07PM -0700, Darrick J. Wong wrote:
> > On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > > But, if a generic blob cache is what it takes to move this forwards,
> > > > so be it.
> > > 
> > > Not necessarily. ;)
> > 
> > And here's today's branch, with xfs_blobcache.[ch] removed and a few
> > more cleanups:
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11
> 
> Walking all the inodes counting all the verity blobs in the shrinker
> is going to be -expensive-. Shrinkers are run very frequently and
> with high concurrency under memory pressure by direct reclaim, and
> every single shrinker execution is going to run that traversal even
> if it is decided there is nothing that can be shrunk.
> 
> IMO, it would be better to keep a count of reclaimable objects
> either on the inode itself (protected by the xa_lock when
> adding/removing) to avoid needing to walk the xarray to count the
> blocks on the inode. Even better would be a counter in the perag or
> a global percpu counter in the mount counting all cache objects. Both of
> those pretty much remove all the shrinker side counting overhead.

I went with a global percpu counter, let's see if lockdep/kasan have
anything to say about my new design. :P

> Couple of other small things.
> 
> - verity shrinker belongs in xfs_verity.c, not xfs_icache.c. It
>   really has nothing to do with the icache other than calling
>   xfs_icwalk(). That gets rid of some of the config ifdefs.

Done.

> - SHRINK_STOP is what should be returned by the scan when
>   xfs_verity_shrinker_scan() wants the shrinker to immediately stop,
>   not LONG_MAX.

Aha.  Ok, thanks for the tipoff. ;)

> - In xfs_verity_cache_shrink_scan(), the nr_to_scan is a count of
>   how many objects to try to free, not how many we must free. i.e.
>   even if we can't free objects, they are still objects that got
>   scanned and so should decrement nr_to_scan...

<nod>

Is there any way for ->scan_objects to tell its caller how much memory
it actually freed?  Or does it only know about objects?  I suppose
"number of bytes freed" wouldn't be that helpful since someone else
could allocate all the freed memory immediately anyway.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


* Re: [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag
  2024-03-12 20:04                   ` Darrick J. Wong
@ 2024-03-12 21:45                     ` Dave Chinner
  0 siblings, 0 replies; 94+ messages in thread
From: Dave Chinner @ 2024-03-12 21:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrey Albershteyn, fsverity, linux-xfs, linux-fsdevel,
	chandan.babu, ebiggers

On Tue, Mar 12, 2024 at 01:04:24PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 06:01:01PM +1100, Dave Chinner wrote:
> > On Mon, Mar 11, 2024 at 07:45:07PM -0700, Darrick J. Wong wrote:
> > > On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > > > But, if a generic blob cache is what it takes to move this forwards,
> > > > > so be it.
> > > > 
> > > > Not necessarily. ;)
> > > 
> > > And here's today's branch, with xfs_blobcache.[ch] removed and a few
> > > more cleanups:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11
> > 
> > Walking all the inodes counting all the verity blobs in the shrinker
> > is going to be -expensive-. Shrinkers are run very frequently and
> > with high concurrency under memory pressure by direct reclaim, and
> > every single shrinker execution is going to run that traversal even
> > if it is decided there is nothing that can be shrunk.
> > 
> > IMO, it would be better to keep a count of reclaimable objects
> > either on the inode itself (protected by the xa_lock when
> > adding/removing) to avoid needing to walk the xarray to count the
> > blocks on the inode. Even better would be a counter in the perag or
> > a global percpu counter in the mount counting all cache objects. Both of
> > those pretty much remove all the shrinker side counting overhead.
> 
> I went with a global percpu counter, let's see if lockdep/kasan have
> anything to say about my new design. :P
> 
> > Couple of other small things.
> > 
> > - verity shrinker belongs in xfs_verity.c, not xfs_icache.c. It
> >   really has nothing to do with the icache other than calling
> >   xfs_icwalk(). That gets rid of some of the config ifdefs.
> 
> Done.
> 
> > - SHRINK_STOP is what should be returned by the scan when
> >   xfs_verity_shrinker_scan() wants the shrinker to immediately stop,
> >   not LONG_MAX.
> 
> Aha.  Ok, thanks for the tipoff. ;)
> 
> > - In xfs_verity_cache_shrink_scan(), the nr_to_scan is a count of
> >   how many objects to try to free, not how many we must free. i.e.
> >   even if we can't free objects, they are still objects that got
> >   scanned and so should decrement nr_to_scan...
> 
> <nod>
> 
> Is there any way for ->scan_objects to tell its caller how much memory
> it actually freed? Or does it only know about objects? 

The shrinker infrastructure itself only concerns itself with object
counts. It's almost impossible to balance slab caches based on memory
used, because the objects are of different sizes.

For example, it's easy to achieve a 1:1 balance for dentry objects
and inode objects when shrinking is done by object count. Now try
getting that balance when individual shrinkers and the shrinker
infrastructure don't know the relative sizes of the objects in the
caches they are trying to balance against. For shmem and fuse inodes
there is a 4:1 size difference between a dentry and the inode, for
XFS inodes it's 5:1, for ext4 it's 6:1, etc. Yet they all want the
same 1:1 object count balance with the dentry cache, and the same
shrinker implementation has to manage that...

IOWs, there are two key scan values - a control value to tell the
shrinker scan how many objects to scan, and a return value that
tells the shrinker how many objects that specific scan freed.

> I suppose
> "number of bytes freed" wouldn't be that helpful since someone else
> could allocate all the freed memory immediately anyway.

The problem is that there isn't a direct relationship between
"object freed" and "memory available for allocation" with slab cache
shrinkers. If you look at the back end of SLUB freeing, when it
frees a backing page for a cache because it is now empty (all
objects freed) it calls mm_account_reclaimed_pages() to account for
the page being freed. We do the same in xfs_buf_free_pages() to
account for the fact the object being freed also freed a bunch of
extra pages not directly accounted to the shrinker.

This is the feedback to the memory reclaim process to determine how
much progress was made by the entire shrinker scan. Memory reclaim
doesn't actually care how many objects were freed; it uses hidden
accounting to track how many pages actually got freed by the
overall 'every shrinker scan' call.

IOWs, object-count-based shrinking makes for relatively simple
cross-cache balance tuning, whilst actual memory freed accounting
by scans is hidden in the background...

In the case of this cache, it might be worthwhile adding something
like this for vmalloc()d objects:

	if (is_vmalloc_addr(ptr))
		mm_account_reclaimed_pages(how_many_pages(ptr));
	kvfree(ptr);

as I think that anything allocated from the SLUB heap by kvmalloc()
is already accounted to reclaim by the SLUB freeing code.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag
  2024-03-12 16:38       ` Darrick J. Wong
@ 2024-03-13  1:35         ` Darrick J. Wong
  0 siblings, 0 replies; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-13  1:35 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: fsverity, linux-xfs, linux-fsdevel, chandan.babu, ebiggers

On Tue, Mar 12, 2024 at 09:38:09AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 01:10:06PM +0100, Andrey Albershteyn wrote:
> > On 2024-03-07 14:18:09, Darrick J. Wong wrote:
> > > On Mon, Mar 04, 2024 at 08:10:45PM +0100, Andrey Albershteyn wrote:
> > > > fs-verity adds new inode flag which causes scrub to fail as it is
> > > > not yet known.
> > > > 
> > > > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  fs/xfs/scrub/attr.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
> > > > index 9a1f59f7b5a4..ae4227cb55ec 100644
> > > > --- a/fs/xfs/scrub/attr.c
> > > > +++ b/fs/xfs/scrub/attr.c
> > > > @@ -494,7 +494,7 @@ xchk_xattr_rec(
> > > >  	/* Retrieve the entry and check it. */
> > > >  	hash = be32_to_cpu(ent->hashval);
> > > >  	badflags = ~(XFS_ATTR_LOCAL | XFS_ATTR_ROOT | XFS_ATTR_SECURE |
> > > > -			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT);
> > > > +			XFS_ATTR_INCOMPLETE | XFS_ATTR_PARENT | XFS_ATTR_VERITY);
> > > 
> > > Now that online repair can modify/discard/salvage broken xattr trees and
> > > is pretty close to merging, how can I make it invalidate all the incore
> > > merkle tree data after a repair?
> > > 
> > > --D
> > > 
> > 
> > I suppose dropping all the xattr XFS_ATTR_VERITY buffers associated
> > with an inode should do the job.
> 
> Oh!  Yes, xfs_verity_cache_drop would handle that nicely.

Here's today's branch, in which I implemented Eric's suggestions for the
"support block-based Merkle tree caching" patch, except for the ones
that conflict with the different direction I went in for
->read_merkle_tree_block; and implemented Dave's suggestions for
shrinker improvements:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-12

--D

> --D
> 
> > -- 
> > - Andrey
> > 
> > 
> 


* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-12 16:44                   ` Darrick J. Wong
@ 2024-03-13 12:29                     ` David Hildenbrand
  2024-03-13 17:19                       ` Darrick J. Wong
  0 siblings, 1 reply; 94+ messages in thread
From: David Hildenbrand @ 2024-03-13 12:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On 12.03.24 17:44, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
>> On 12.03.24 16:13, David Hildenbrand wrote:
>>> On 11.03.24 23:38, Darrick J. Wong wrote:
>>>> [add willy and linux-mm]
>>>>
>>>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>>>
>>>>>> Sorry to answer your question with a question, but how much checking is
>>>>>> $filesystem expected to do for merkle trees?
>>>>>>
>>>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>>>> intactness.  If the descriptor specifies the highest block address then
>>>>>> we could certainly trim off excess blocks.  But I don't know how much of
>>>>>> libfsverity actually lets you do that; I haven't looked into that
>>>>>> deeply. :/
>>>>>>
>>>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>>>> need to stream reads of the verity files through the page cache and let
>>>>>> verity tell us if the file data are consistent.
>>>>>>
>>>>>> For both tools, if something finds errors in the merkle tree structure
>>>>>> itself, do we turn off verity?  Or do we do something nasty like
>>>>>> truncate the file?
>>>>>
>>>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>>>> validating the data of verity inodes against their Merkle trees.
>>>>>
>>>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>>>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>>>
>>>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>>>> should delete that inode's verity metadata and remove the verity flag from the
>>>>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>>>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>>>> validating the data against the Merkle tree.  And actually validating the data
>>>>> against the Merkle tree would be complex and expensive.  Note, none of this
>>>>> would work on files that are encrypted.
>>>>>
>>>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>>>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>>>>> quite what it was designed for.
>>>>>
>>>>>> Is there an ioctl or something that allows userspace to validate an
>>>>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>>>>> block devices, except that we might believe its answers?
>>>>>
>>>>> Just reading the whole file and seeing whether you get an error would do it.
>>>>>
>>>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>>>> necessary to drop the file's pagecache first.
>>>>
>>>> I tried a straight pagecache read and it worked like a charm!
>>>>
>>>> But then I thought to myself, do I really want to waste memory bandwidth
>>>> copying a bunch of data?  No.  I don't even want to incur system call
>>>> overhead from reading a single byte every $pagesize bytes.
>>>>
>>>> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
>>>> worked too, insofar as SIGBUSes are annoying to handle.  But it's
>>>> annoying to take signals like that.
>>>>
>>>> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
>>>> like what I wanted -- it prefaults in the pages, and "If populating
>>>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>>>
>>>
>>> Yes, these were the expected semantics :)
>>>
>>>> But then I tried rigging up a test to see if I could catch an EIO, and
>>>> instead I had to SIGKILL the process!  It looks like filemap_fault returns
>>>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>>>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>>>> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
>>>> the VM_FAULT_RETRY is translated to -EBUSY.
>>>>
>>>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>>>> that to madvise_populate.  Unfortunately, madvise_populate increments
>>>> its loop counter by the return value (still 0) so it runs in an
>>>> infinite loop.  The only way out is SIGKILL.
>>>
>>> That's certainly unexpected. One user I know is QEMU, which primarily
>>> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
>>> primarily used with shmem/hugetlb, where I haven't heard of any such
>>> endless loops.
>>>
>>>>
>>>> So I don't know what the correct behavior is here, other than the
>>>> infinite loop seems pretty suspect.  Is it the correct behavior that
>>>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>>>> That doesn't quite sound right if it's the case that a zero return could
>>>> also happen if memory is tight.
>>>
>>> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
>>> That one calls __get_user_pages().
>>>
>>> __get_user_pages() documents: "0 return value is possible when the fault
>>> would need to be retried."
>>>
>>> So that's what the caller does. IIRC, there are cases where we really
>>> have to retry (at least once) and will make progress, so treating "0" as
>>> an error would be wrong.
>>>
>>> Staring at other __get_user_pages() users, __get_user_pages_locked()
>>> documents: "Please note that this function, unlike __get_user_pages(),
>>> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
>>>
>>> But there is some elaborate retry logic in there, whereby the retry will
>>> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
>>> retry attempt (there are cases where we retry more often, but that's
>>> related to something else I believe).
>>>
>>> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
>>> make it use __get_user_pages_locked(), but I recall when I introduced
>>> MADV_POPULATE_READ, there was a catch to it.
>>
>> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
>> mmap()+access case you describe above.
>>
>> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
>> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
>> Because VM_FAULT_SIGBUS would be required for that function to call
>> do_sigbus().
> 
> The code I was looking at yesterday in filemap_fault was:
> 
> page_not_uptodate:
> 	/*
> 	 * Umm, take care of errors if the page isn't up-to-date.
> 	 * Try to re-read it _once_. We do this synchronously,
> 	 * because there really aren't any performance issues here
> 	 * and we need to check for errors.
> 	 */
> 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
> 	if (fpin)
> 		goto out_retry;
> 	folio_put(folio);
> 
> 	if (!error || error == AOP_TRUNCATED_PAGE)
> 		goto retry_find;
> 	filemap_invalidate_unlock_shared(mapping);
> 
> 	return VM_FAULT_SIGBUS;
> 
> Wherein I /think/ fpin is non-null in this case, so if
> filemap_read_folio returns an error, we'll do this instead:
> 
> out_retry:
> 	/*
> 	 * We dropped the mmap_lock, we need to return to the fault handler to
> 	 * re-find the vma and come back and find our hopefully still populated
> 	 * page.
> 	 */
> 	if (!IS_ERR(folio))
> 		folio_put(folio);
> 	if (mapping_locked)
> 		filemap_invalidate_unlock_shared(mapping);
> 	if (fpin)
> 		fput(fpin);
> 	return ret | VM_FAULT_RETRY;
> 
> and since ret was 0 before the goto, the only return code is
> VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:
> 
> 	if (fpin) {
> 		if (error)
> 			ret |= VM_FAULT_SIGBUS;
> 		goto out_retry;
> 	}
> 
> But I think the hard part here is that there doesn't seem to be any
> distinction between transient read errors (e.g. disk cable fell out) vs.
> semi-permanent errors (e.g. verity says the hash doesn't match).
> AFAICT, either the read(ahead) sets uptodate and callers read the page,
> or it doesn't set it and callers treat that as an error-retry
> opportunity.
> 
> For the transient error case VM_FAULT_RETRY makes perfect sense; for the
> second case I imagine we'd want something closer to _SIGBUS.


Agreed, it's really hard to judge when it's the right time to give up 
retrying. At least with MADV_POPULATE_READ we should try achieving the 
same behavior as with mmap()+read access. So if the latter manages to 
trigger SIGBUS, MADV_POPULATE_READ should return an error.

Is there an easy way for me to reproduce this scenario?

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-13 12:29                     ` David Hildenbrand
@ 2024-03-13 17:19                       ` Darrick J. Wong
  2024-03-13 19:10                         ` David Hildenbrand
  0 siblings, 1 reply; 94+ messages in thread
From: Darrick J. Wong @ 2024-03-13 17:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
> On 12.03.24 17:44, Darrick J. Wong wrote:
> > On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
> > > On 12.03.24 16:13, David Hildenbrand wrote:
> > > > On 11.03.24 23:38, Darrick J. Wong wrote:
> > > > > [add willy and linux-mm]
> > > > > 
> > > > > On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
> > > > > > On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > > > > > > > BTW, is xfs_repair planned to do anything about any such extra blocks?
> > > > > > > 
> > > > > > > Sorry to answer your question with a question, but how much checking is
> > > > > > > $filesystem expected to do for merkle trees?
> > > > > > > 
> > > > > > > In theory xfs_repair could learn how to interpret the verity descriptor,
> > > > > > > walk the merkle tree blocks, and even read the file data to confirm
> > > > > > > intactness.  If the descriptor specifies the highest block address then
> > > > > > > we could certainly trim off excess blocks.  But I don't know how much of
> > > > > > > libfsverity actually lets you do that; I haven't looked into that
> > > > > > > deeply. :/
> > > > > > > 
> > > > > > > For xfs_scrub I guess the job is theoretically simpler, since we only
> > > > > > > need to stream reads of the verity files through the page cache and let
> > > > > > > verity tell us if the file data are consistent.
> > > > > > > 
> > > > > > > For both tools, if something finds errors in the merkle tree structure
> > > > > > > itself, do we turn off verity?  Or do we do something nasty like
> > > > > > > truncate the file?
> > > > > > 
> > > > > > As far as I know (I haven't been following btrfs-progs, but I'm familiar with
> > > > > > e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
> > > > > > validating the data of verity inodes against their Merkle trees.
> > > > > > 
> > > > > > e2fsck does delete the verity metadata of inodes that don't have the verity flag
> > > > > > enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
> > > > > > 
> > > > > > I suppose that ideally, if an inode's verity metadata is invalid, then fsck
> > > > > > should delete that inode's verity metadata and remove the verity flag from the
> > > > > > inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
> > > > > > fairly straightforward, but it probably wouldn't catch much compared to actually
> > > > > > validating the data against the Merkle tree.  And actually validating the data
> > > > > > against the Merkle tree would be complex and expensive.  Note, none of this
> > > > > > would work on files that are encrypted.
> > > > > > 
> > > > > > Re: libfsverity, I think it would be possible to validate a Merkle tree using
> > > > > > libfsverity_compute_digest() and the callbacks that it supports.  But that's not
> > > > > > quite what it was designed for.
> > > > > > 
> > > > > > > Is there an ioctl or something that allows userspace to validate an
> > > > > > > entire file's contents?  Sort of like what BLKVERIFY would have done for
> > > > > > > block devices, except that we might believe its answers?
> > > > > > 
> > > > > > Just reading the whole file and seeing whether you get an error would do it.
> > > > > > 
> > > > > > Though if you want to make sure it's really re-reading the on-disk data, it's
> > > > > > necessary to drop the file's pagecache first.
> > > > > 
> > > > > I tried a straight pagecache read and it worked like a charm!
> > > > > 
> > > > > But then I thought to myself, do I really want to waste memory bandwidth
> > > > > copying a bunch of data?  No.  I don't even want to incur system call
> > > > > overhead from reading a single byte every $pagesize bytes.
> > > > > 
> > > > > So I created 2M mmap areas and read a byte every $pagesize bytes.  That
> > > > > worked too, insofar as SIGBUSes are annoying to handle.  But it's
> > > > > annoying to take signals like that.
> > > > > 
> > > > > Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
> > > > > like what I wanted -- it prefaults in the pages, and "If populating
> > > > > fails, a SIGBUS signal is not generated; instead, an error is returned."
> > > > > 
> > > > 
> > > > Yes, these were the expected semantics :)
> > > > 
> > > > > But then I tried rigging up a test to see if I could catch an EIO, and
> > > > > instead I had to SIGKILL the process!  It looks like filemap_fault returns
> > > > > VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
> > > > > __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
> > > > > handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_page,
> > > > > the VM_FAULT_RETRY is translated to -EBUSY.
> > > > > 
> > > > > __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
> > > > > that to madvise_populate.  Unfortunately, madvise_populate increments
> > > > > its loop counter by the return value (still 0) so it runs in an
> > > > > infinite loop.  The only way out is SIGKILL.
> > > > 
> > > > That's certainly unexpected. One user I know is QEMU, which primarily
> > > > uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
> > > > primarily used with shmem/hugetlb, where I haven't heard of any such
> > > > endless loops.
> > > > 
> > > > > 
> > > > > So I don't know what the correct behavior is here, other than the
> > > > > infinite loop seems pretty suspect.  Is it the correct behavior that
> > > > > madvise_populate returns EIO if __get_user_pages ever returns zero?
> > > > > That doesn't quite sound right if it's the case that a zero return could
> > > > > also happen if memory is tight.
> > > > 
> > > > madvise_populate() ends up calling faultin_vma_page_range() in a loop.
> > > > That one calls __get_user_pages().
> > > > 
> > > > __get_user_pages() documents: "0 return value is possible when the fault
> > > > would need to be retried."
> > > > 
> > > > So that's what the caller does. IIRC, there are cases where we really
> > > > have to retry (at least once) and will make progress, so treating "0" as
> > > > an error would be wrong.
> > > > 
> > > > Staring at other __get_user_pages() users, __get_user_pages_locked()
> > > > documents: "Please note that this function, unlike __get_user_pages(),
> > > > will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
> > > > 
> > > > But there is some elaborate retry logic in there, whereby the retry will
> > > > set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
> > > > retry attempt (there are cases where we retry more often, but that's
> > > > related to something else I believe).
> > > > 
> > > > So maybe we need a similar retry logic in faultin_vma_page_range()? Or
> > > > make it use __get_user_pages_locked(), but I recall when I introduced
> > > > MADV_POPULATE_READ, there was a catch to it.
> > > 
> > > I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
> > > mmap()+access case you describe above.
> > > 
> > > Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
> > > how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
> > > Because VM_FAULT_SIGBUS would be required for that function to call
> > > do_sigbus().
> > 
> > The code I was looking at yesterday in filemap_fault was:
> > 
> > page_not_uptodate:
> > 	/*
> > 	 * Umm, take care of errors if the page isn't up-to-date.
> > 	 * Try to re-read it _once_. We do this synchronously,
> > 	 * because there really aren't any performance issues here
> > 	 * and we need to check for errors.
> > 	 */
> > 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> > 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
> > 	if (fpin)
> > 		goto out_retry;
> > 	folio_put(folio);
> > 
> > 	if (!error || error == AOP_TRUNCATED_PAGE)
> > 		goto retry_find;
> > 	filemap_invalidate_unlock_shared(mapping);
> > 
> > 	return VM_FAULT_SIGBUS;
> > 
> > Wherein I /think/ fpin is non-null in this case, so if
> > filemap_read_folio returns an error, we'll do this instead:
> > 
> > out_retry:
> > 	/*
> > 	 * We dropped the mmap_lock, we need to return to the fault handler to
> > 	 * re-find the vma and come back and find our hopefully still populated
> > 	 * page.
> > 	 */
> > 	if (!IS_ERR(folio))
> > 		folio_put(folio);
> > 	if (mapping_locked)
> > 		filemap_invalidate_unlock_shared(mapping);
> > 	if (fpin)
> > 		fput(fpin);
> > 	return ret | VM_FAULT_RETRY;
> > 
> > and since ret was 0 before the goto, the only return code is
> > VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:
> > 
> > 	if (fpin) {
> > 		if (error)
> > 			ret |= VM_FAULT_SIGBUS;
> > 		goto out_retry;
> > 	}
> > 
> > But I think the hard part here is that there doesn't seem to be any
> > distinction between transient read errors (e.g. disk cable fell out) vs.
> > semi-permanent errors (e.g. verity says the hash doesn't match).
> > AFAICT, either the read(ahead) sets uptodate and callers read the page,
> > or it doesn't set it and callers treat that as an error-retry
> > opportunity.
> > 
> > For the transient error case VM_FAULT_RETRY makes perfect sense; for the
> > second case I imagine we'd want something closer to _SIGBUS.
> 
> 
> Agreed, it's really hard to judge when it's the right time to give up
> retrying. At least with MADV_POPULATE_READ we should try achieving the same
> behavior as with mmap()+read access. So if the latter manages to trigger
> SIGBUS, MADV_POPULATE_READ should return an error.
> 
> Is there an easy way for me to reproduce this scenario?

Yes.  Take this Makefile:

CFLAGS=-Wall -Werror -O2 -g -Wno-unused-variable

all: mpr

and this C program mpr.c:

/* test MADV_POPULATE_READ on a file */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define min(a, b)	((a) < (b) ? (a) : (b))
#define BUFSIZE		(2097152)

int main(int argc, char *argv[])
{
	struct stat sb;
	long pagesize = sysconf(_SC_PAGESIZE);
	off_t read_sz, pos;
	void *addr;
	char c;
	int fd, ret;

	if (argc != 2) {
		printf("Usage: %s fname\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}

	ret = fstat(fd, &sb);
	if (ret) {
		perror("fstat");
		return 1;
	}

	/* Validate the file contents with regular reads */
	for (pos = 0; pos < sb.st_size; pos += sb.st_blksize) {
		ret = pread(fd, &c, 1, pos);
		if (ret < 0) {
			if (errno != EIO) {
				perror("pread");
				return 1;
			}

			printf("%s: at offset %llu: %s\n", argv[1],
					(unsigned long long)pos,
					strerror(errno));
			break;
		}
	}

	ret = pread(fd, &c, 1, sb.st_size);
	if (ret < 0) {
		if (errno != EIO) {
			perror("pread");
			return 1;
		}

		printf("%s: at offset %llu: %s\n", argv[1],
				(unsigned long long)sb.st_size,
				strerror(errno));
	}

	/* Validate the file contents with MADV_POPULATE_READ */
	read_sz = ((sb.st_size + (pagesize - 1)) / pagesize) * pagesize;
	printf("%s: read bytes %llu\n", argv[1], (unsigned long long)read_sz);

	for (pos = 0; pos < read_sz; pos += BUFSIZE) {
		unsigned int mappos;
		size_t maplen = min(read_sz - pos, BUFSIZE);

		addr = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, pos);
		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		ret = madvise(addr, maplen, MADV_POPULATE_READ);
		if (ret) {
			perror("madvise");
			return 1;
		}

		ret = munmap(addr, maplen);
		if (ret) {
			perror("munmap");
			return 1;
		}
	}

	ret = close(fd);
	if (ret) {
		perror("close");
		return 1;
	}

	return 0;
}

and this shell script mpr.sh:

#!/bin/bash -x

# Try to trigger infinite loop with regular IO errors and MADV_POPULATE_READ

scriptdir="$(dirname "$0")"

commands=(dmsetup mkfs.xfs xfs_io timeout strace "$scriptdir/mpr")
for cmd in "${commands[@]}"; do
	if ! command -v "$cmd" &>/dev/null; then
		echo "$cmd: Command required for this program."
		exit 1
	fi
done

dev="${1:-/dev/sda}"
mnt="${2:-/mnt}"
dmtarget="dumbtarget"

# Clean up any old mounts
umount "$dev" "$mnt"
dmsetup remove "$dmtarget"
rmmod xfs

# Create dm linear mapping to block device and format filesystem
sectors="$(blockdev --getsz "$dev")"
tgt="/dev/mapper/$dmtarget"
echo "0 $sectors linear $dev 0" | dmsetup create "$dmtarget"
mkfs.xfs -f "$tgt"

# Create a file that we'll read, then cycle mount to zap pagecache
mount "$tgt" "$mnt"
xfs_io -f -c "pwrite -S 0x58 0 1m" "$mnt/a"
umount "$mnt"
mount "$tgt" "$mnt"

# Load file metadata
stat "$mnt/a"

# Induce EIO errors on read
dmsetup suspend --noflush --nolockfs "$dmtarget"
echo "0 $sectors error" | dmsetup load "$dmtarget"
dmsetup resume "$dmtarget"

# Try to provoke the kernel; kill the process after 10s so we can clean up
timeout -s KILL 10s strace -s99 -e madvise "$scriptdir/mpr" "$mnt/a"

# Stop EIO errors so we can unmount
dmsetup suspend --noflush --nolockfs "$dmtarget"
echo "0 $sectors linear $dev 0" | dmsetup load "$dmtarget"
dmsetup resume "$dmtarget"

# Unmount and clean up after ourselves
umount "$mnt"
dmsetup remove "$dmtarget"
<EOF>

make the C program, then run ./mpr.sh <device> <mountpoint>.  It should
stall in the madvise call until timeout sends sigkill to the program;
you can crank the 10s timeout up if you want.

<insert usual disclaimer that I run all these things in scratch VMs>

--D

> -- 
> Cheers,
> 
> David / dhildenb
> 
> 

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-13 17:19                       ` Darrick J. Wong
@ 2024-03-13 19:10                         ` David Hildenbrand
  2024-03-13 21:03                           ` David Hildenbrand
  0 siblings, 1 reply; 94+ messages in thread
From: David Hildenbrand @ 2024-03-13 19:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On 13.03.24 18:19, Darrick J. Wong wrote:
> On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
>> On 12.03.24 17:44, Darrick J. Wong wrote:
>>> On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
>>>> On 12.03.24 16:13, David Hildenbrand wrote:
>>>>> On 11.03.24 23:38, Darrick J. Wong wrote:
>>>>>> [add willy and linux-mm]
>>>>>>
>>>>>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>>>>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>>>>>
>>>>>>>> Sorry to answer your question with a question, but how much checking is
>>>>>>>> $filesystem expected to do for merkle trees?
>>>>>>>>
>>>>>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>>>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>>>>>> intactness.  If the descriptor specifies the highest block address then
>>>>>>>> we could certainly trim off excess blocks.  But I don't know how much of
>>>>>>>> libfsverity actually lets you do that; I haven't looked into that
>>>>>>>> deeply. :/
>>>>>>>>
>>>>>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>>>>>> need to stream reads of the verity files through the page cache and let
>>>>>>>> verity tell us if the file data are consistent.
>>>>>>>>
>>>>>>>> For both tools, if something finds errors in the merkle tree structure
>>>>>>>> itself, do we turn off verity?  Or do we do something nasty like
>>>>>>>> truncate the file?
>>>>>>>
>>>>>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>>>>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>>>>>> validating the data of verity inodes against their Merkle trees.
>>>>>>>
>>>>>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>>>>>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>>>>>
>>>>>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>>>>>> should delete that inode's verity metadata and remove the verity flag from the
>>>>>>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>>>>>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>>>>>> validating the data against the Merkle tree.  And actually validating the data
>>>>>>> against the Merkle tree would be complex and expensive.  Note, none of this
>>>>>>> would work on files that are encrypted.
>>>>>>>
>>>>>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>>>>>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>>>>>>> quite what it was designed for.
>>>>>>>
>>>>>>>> Is there an ioctl or something that allows userspace to validate an
>>>>>>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>>>>>>> block devices, except that we might believe its answers?
>>>>>>>
>>>>>>> Just reading the whole file and seeing whether you get an error would do it.
>>>>>>>
>>>>>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>>>>>> necessary to drop the file's pagecache first.
>>>>>>
>>>>>> I tried a straight pagecache read and it worked like a charm!
>>>>>>
>>>>>> But then I thought to myself, do I really want to waste memory bandwidth
>>>>>> copying a bunch of data?  No.  I don't even want to incur system call
>>>>>> overhead from reading a single byte every $pagesize bytes.
>>>>>>
>>>>>> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
>>>>>> worked too, insofar as SIGBUSes are annoying to handle.  But it's
>>>>>> annoying to take signals like that.
>>>>>>
>>>>>> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
>>>>>> like what I wanted -- it prefaults in the pages, and "If populating
>>>>>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>>>>>
>>>>>
>>>>> Yes, these were the expected semantics :)
>>>>>
>>>>>> But then I tried rigging up a test to see if I could catch an EIO, and
>>>>>> instead I had to SIGKILL the process!  It looks like filemap_fault returns
>>>>>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>>>>>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>>>>>> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_page,
>>>>>> the VM_FAULT_RETRY is translated to -EBUSY.
>>>>>>
>>>>>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>>>>>> that to madvise_populate.  Unfortunately, madvise_populate increments
>>>>>> its loop counter by the return value (still 0) so it runs in an
>>>>>> infinite loop.  The only way out is SIGKILL.
>>>>>
>>>>> That's certainly unexpected. One user I know is QEMU, which primarily
>>>>> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
>>>>> primarily used with shmem/hugetlb, where I haven't heard of any such
>>>>> endless loops.
>>>>>
>>>>>>
>>>>>> So I don't know what the correct behavior is here, other than the
>>>>>> infinite loop seems pretty suspect.  Is it the correct behavior that
>>>>>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>>>>>> That doesn't quite sound right if it's the case that a zero return could
>>>>>> also happen if memory is tight.
>>>>>
>>>>> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
>>>>> That one calls __get_user_pages().
>>>>>
>>>>> __get_user_pages() documents: "0 return value is possible when the fault
>>>>> would need to be retried."
>>>>>
>>>>> So that's what the caller does. IIRC, there are cases where we really
>>>>> have to retry (at least once) and will make progress, so treating "0" as
>>>>> an error would be wrong.
>>>>>
>>>>> Staring at other __get_user_pages() users, __get_user_pages_locked()
>>>>> documents: "Please note that this function, unlike __get_user_pages(),
>>>>> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
>>>>>
>>>>> But there is some elaborate retry logic in there, whereby the retry will
>>>>> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
>>>>> retry attempt (there are cases where we retry more often, but that's
>>>>> related to something else I believe).
>>>>>
>>>>> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
>>>>> make it use __get_user_pages_locked(), but I recall when I introduced
>>>>> MADV_POPULATE_READ, there was a catch to it.
>>>>
>>>> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
>>>> mmap()+access case you describe above.
>>>>
>>>> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
>>>> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
>>>> Because VM_FAULT_SIGBUS would be required for that function to call
>>>> do_sigbus().
>>>
>>> The code I was looking at yesterday in filemap_fault was:
>>>
>>> page_not_uptodate:
>>> 	/*
>>> 	 * Umm, take care of errors if the page isn't up-to-date.
>>> 	 * Try to re-read it _once_. We do this synchronously,
>>> 	 * because there really aren't any performance issues here
>>> 	 * and we need to check for errors.
>>> 	 */
>>> 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>> 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
>>> 	if (fpin)
>>> 		goto out_retry;
>>> 	folio_put(folio);
>>>
>>> 	if (!error || error == AOP_TRUNCATED_PAGE)
>>> 		goto retry_find;
>>> 	filemap_invalidate_unlock_shared(mapping);
>>>
>>> 	return VM_FAULT_SIGBUS;
>>>
>>> Wherein I /think/ fpin is non-null in this case, so if
>>> filemap_read_folio returns an error, we'll do this instead:
>>>
>>> out_retry:
>>> 	/*
>>> 	 * We dropped the mmap_lock, we need to return to the fault handler to
>>> 	 * re-find the vma and come back and find our hopefully still populated
>>> 	 * page.
>>> 	 */
>>> 	if (!IS_ERR(folio))
>>> 		folio_put(folio);
>>> 	if (mapping_locked)
>>> 		filemap_invalidate_unlock_shared(mapping);
>>> 	if (fpin)
>>> 		fput(fpin);
>>> 	return ret | VM_FAULT_RETRY;
>>>
>>> and since ret was 0 before the goto, the only return code is
>>> VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:
>>>
>>> 	if (fpin) {
>>> 		if (error)
>>> 			ret |= VM_FAULT_SIGBUS;
>>> 		goto out_retry;
>>> 	}
>>>
>>> But I think the hard part here is that there doesn't seem to be any
>>> distinction between transient read errors (e.g. disk cable fell out) vs.
>>> semi-permanent errors (e.g. verity says the hash doesn't match).
>>> AFAICT, either the read(ahead) sets uptodate and callers read the page,
>>> or it doesn't set it and callers treat that as an error-retry
>>> opportunity.
>>>
>>> For the transient error case VM_FAULT_RETRY makes perfect sense; for the
>>> second case I imagine we'd want something closer to _SIGBUS.
>>
>>
>> Agreed, it's really hard to judge when it's the right time to give up
>> retrying. At least with MADV_POPULATE_READ we should try achieving the same
>> behavior as with mmap()+read access. So if the latter manages to trigger
>> SIGBUS, MADV_POPULATE_READ should return an error.
>>
>> Is there an easy way for me to reproduce this scenario?
> 
> Yes.  Take this Makefile:
> 
> CFLAGS=-Wall -Werror -O2 -g -Wno-unused-variable
> 
> all: mpr
> 
> and this C program mpr.c:
> 
> /* test MADV_POPULATE_READ on a file */
> #include <stdio.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/stat.h>
> #include <sys/mman.h>
> 
> #define min(a, b)	((a) < (b) ? (a) : (b))
> #define BUFSIZE		(2097152)
> 
> int main(int argc, char *argv[])
> {
> 	struct stat sb;
> 	long pagesize = sysconf(_SC_PAGESIZE);
> 	off_t read_sz, pos;
> 	void *addr;
> 	char c;
> 	int fd, ret;
> 
> 	if (argc != 2) {
> 		printf("Usage: %s fname\n", argv[0]);
> 		return 1;
> 	}
> 
> 	fd = open(argv[1], O_RDONLY);
> 	if (fd < 0) {
> 		perror(argv[1]);
> 		return 1;
> 	}
> 
> 	ret = fstat(fd, &sb);
> 	if (ret) {
> 		perror("fstat");
> 		return 1;
> 	}
> 
> 	/* Validate the file contents with regular reads */
> 	for (pos = 0; pos < sb.st_size; pos += sb.st_blksize) {
> 		ret = pread(fd, &c, 1, pos);
> 		if (ret < 0) {
> 			if (errno != EIO) {
> 				perror("pread");
> 				return 1;
> 			}
> 
> 			printf("%s: at offset %llu: %s\n", argv[1],
> 					(unsigned long long)pos,
> 					strerror(errno));
> 			break;
> 		}
> 	}
> 
> 	ret = pread(fd, &c, 1, sb.st_size);
> 	if (ret < 0) {
> 		if (errno != EIO) {
> 			perror("pread");
> 			return 1;
> 		}
> 
> 		printf("%s: at offset %llu: %s\n", argv[1],
> 				(unsigned long long)sb.st_size,
> 				strerror(errno));
> 	}
> 
> 	/* Validate the file contents with MADV_POPULATE_READ */
> 	read_sz = ((sb.st_size + (pagesize - 1)) / pagesize) * pagesize;
> 	printf("%s: read bytes %llu\n", argv[1], (unsigned long long)read_sz);
> 
> 	for (pos = 0; pos < read_sz; pos += BUFSIZE) {
> 		unsigned int mappos;
> 		size_t maplen = min(read_sz - pos, BUFSIZE);
> 
> 		addr = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, pos);
> 		if (addr == MAP_FAILED) {
> 			perror("mmap");
> 			return 1;
> 		}
> 
> 		ret = madvise(addr, maplen, MADV_POPULATE_READ);
> 		if (ret) {
> 			perror("madvise");
> 			return 1;
> 		}
> 
> 		ret = munmap(addr, maplen);
> 		if (ret) {
> 			perror("munmap");
> 			return 1;
> 		}
> 	}
> 
> 	ret = close(fd);
> 	if (ret) {
> 		perror("close");
> 		return 1;
> 	}
> 
> 	return 0;
> }
> 
> and this shell script mpr.sh:
> 
> #!/bin/bash -x
> 
> # Try to trigger infinite loop with regular IO errors and MADV_POPULATE_READ
> 
> scriptdir="$(dirname "$0")"
> 
> commands=(dmsetup mkfs.xfs xfs_io timeout strace "$scriptdir/mpr")
> for cmd in "${commands[@]}"; do
> 	if ! command -v "$cmd" &>/dev/null; then
> 		echo "$cmd: Command required for this program."
> 		exit 1
> 	fi
> done
> 
> dev="${1:-/dev/sda}"
> mnt="${2:-/mnt}"
> dmtarget="dumbtarget"
> 
> # Clean up any old mounts
> umount "$dev" "$mnt"
> dmsetup remove "$dmtarget"
> rmmod xfs
> 
> # Create dm linear mapping to block device and format filesystem
> sectors="$(blockdev --getsz "$dev")"
> tgt="/dev/mapper/$dmtarget"
> echo "0 $sectors linear $dev 0" | dmsetup create "$dmtarget"
> mkfs.xfs -f "$tgt"
> 
> # Create a file that we'll read, then cycle mount to zap pagecache
> mount "$tgt" "$mnt"
> xfs_io -f -c "pwrite -S 0x58 0 1m" "$mnt/a"
> umount "$mnt"
> mount "$tgt" "$mnt"
> 
> # Load file metadata
> stat "$mnt/a"
> 
> # Induce EIO errors on read
> dmsetup suspend --noflush --nolockfs "$dmtarget"
> echo "0 $sectors error" | dmsetup load "$dmtarget"
> dmsetup resume "$dmtarget"
> 
> # Try to provoke the kernel; kill the process after 10s so we can clean up
> timeout -s KILL 10s strace -s99 -e madvise "$scriptdir/mpr" "$mnt/a"
> 
> # Stop EIO errors so we can unmount
> dmsetup suspend --noflush --nolockfs "$dmtarget"
> echo "0 $sectors linear $dev 0" | dmsetup load "$dmtarget"
> dmsetup resume "$dmtarget"
> 
> # Unmount and clean up after ourselves
> umount "$mnt"
> dmsetup remove "$dmtarget"
> <EOF>
> 
> make the C program, then run ./mpr.sh <device> <mountpoint>.  It should
> stall in the madvise call until timeout sends sigkill to the program;
> you can crank the 10s timeout up if you want.
> 
> <insert usual disclaimer that I run all these things in scratch VMs>

Yes, seems to work, nice!


[  452.455636] buffer_io_error: 6 callbacks suppressed
[  452.455638] Buffer I/O error on dev dm-0, logical block 16, async page read
[  452.456169] Buffer I/O error on dev dm-0, logical block 17, async page read
[  452.456456] Buffer I/O error on dev dm-0, logical block 18, async page read
[  452.456754] Buffer I/O error on dev dm-0, logical block 19, async page read
[  452.457061] Buffer I/O error on dev dm-0, logical block 20, async page read
[  452.457350] Buffer I/O error on dev dm-0, logical block 21, async page read
[  452.457639] Buffer I/O error on dev dm-0, logical block 22, async page read
[  452.457942] Buffer I/O error on dev dm-0, logical block 23, async page read
[  452.458242] Buffer I/O error on dev dm-0, logical block 16, async page read
[  452.458552] Buffer I/O error on dev dm-0, logical block 17, async page read
+ timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
/mnt/tmp//a: at offset 0: Input/output error
/mnt/tmp//a: read bytes 1048576
madvise(0x7f9393624000, 1048576, MADV_POPULATE_READ./mpr.sh: line 45:  2070 Killed                  tim"


And once I switch over to reading instead of MADV_POPULATE_READ:

[  753.940230] buffer_io_error: 6 callbacks suppressed
[  753.940233] Buffer I/O error on dev dm-0, logical block 8192, async page read
[  753.941402] Buffer I/O error on dev dm-0, logical block 8193, async page read
[  753.942084] Buffer I/O error on dev dm-0, logical block 8194, async page read
[  753.942738] Buffer I/O error on dev dm-0, logical block 8195, async page read
[  753.943412] Buffer I/O error on dev dm-0, logical block 8196, async page read
[  753.944088] Buffer I/O error on dev dm-0, logical block 8197, async page read
[  753.944741] Buffer I/O error on dev dm-0, logical block 8198, async page read
[  753.945415] Buffer I/O error on dev dm-0, logical block 8199, async page read
[  753.946105] Buffer I/O error on dev dm-0, logical block 8192, async page read
[  753.946661] Buffer I/O error on dev dm-0, logical block 8193, async page read
+ timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
/mnt/tmp//a: at offset 0: Input/output error
/mnt/tmp//a: read bytes 1048576
--- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x7f34f82d8000} ---
+++ killed by SIGBUS (core dumped) +++
timeout: the monitored command dumped core
./mpr.sh: line 45:  2388 Bus error               timeout -s KILL 10s strace -s99 -e madvise "$scriptdir"


Let me dig how the fault handler is able to conclude SIGBUS here!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
  2024-03-13 19:10                         ` David Hildenbrand
@ 2024-03-13 21:03                           ` David Hildenbrand
  0 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2024-03-13 21:03 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrey Albershteyn, fsverity, linux-xfs,
	linux-fsdevel, chandan.babu, akpm, linux-mm, Eric Biggers

On 13.03.24 20:10, David Hildenbrand wrote:
> On 13.03.24 18:19, Darrick J. Wong wrote:
>> On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
>>> On 12.03.24 17:44, Darrick J. Wong wrote:
>>>> On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
>>>>> On 12.03.24 16:13, David Hildenbrand wrote:
>>>>>> On 11.03.24 23:38, Darrick J. Wong wrote:
>>>>>>> [add willy and linux-mm]
>>>>>>>
>>>>>>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>>>>>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>>>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>>>>>>
>>>>>>>>> Sorry to answer your question with a question, but how much checking is
>>>>>>>>> $filesystem expected to do for merkle trees?
>>>>>>>>>
>>>>>>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>>>>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>>>>>>> intactness.  If the descriptor specifies the highest block address then
>>>>>>>>> we could certainly trim off excess blocks.  But I don't know how much of
>>>>>>>>> libfsverity actually lets you do that; I haven't looked into that
>>>>>>>>> deeply. :/
>>>>>>>>>
>>>>>>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>>>>>>> need to stream reads of the verity files through the page cache and let
>>>>>>>>> verity tell us if the file data are consistent.
>>>>>>>>>
>>>>>>>>> For both tools, if something finds errors in the merkle tree structure
>>>>>>>>> itself, do we turn off verity?  Or do we do something nasty like
>>>>>>>>> truncate the file?
>>>>>>>>
>>>>>>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>>>>>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>>>>>>> validating the data of verity inodes against their Merkle trees.
>>>>>>>>
>>>>>>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>>>>>>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>>>>>>
>>>>>>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>>>>>>> should delete that inode's verity metadata and remove the verity flag from the
>>>>>>>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>>>>>>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>>>>>>> validating the data against the Merkle tree.  And actually validating the data
>>>>>>>> against the Merkle tree would be complex and expensive.  Note, none of this
>>>>>>>> would work on files that are encrypted.
>>>>>>>>
>>>>>>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>>>>>>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>>>>>>>> quite what it was designed for.
>>>>>>>>
>>>>>>>>> Is there an ioctl or something that allows userspace to validate an
>>>>>>>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>>>>>>>> block devices, except that we might believe its answers?
>>>>>>>>
>>>>>>>> Just reading the whole file and seeing whether you get an error would do it.
>>>>>>>>
>>>>>>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>>>>>>> necessary to drop the file's pagecache first.
>>>>>>>
>>>>>>> I tried a straight pagecache read and it worked like a charm!
>>>>>>>
>>>>>>> But then I thought to myself, do I really want to waste memory bandwidth
>>>>>>> copying a bunch of data?  No.  I don't even want to incur system call
>>>>>>> overhead from reading a single byte every $pagesize bytes.
>>>>>>>
>>>>>>> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
>>>>>>> worked too, insofar as SIGBUSes are annoying to handle.  But it's
>>>>>>> annoying to take signals like that.
>>>>>>>
>>>>>>> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
>>>>>>> like what I wanted -- it prefaults in the pages, and "If populating
>>>>>>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>>>>>>
>>>>>>
>>>>>> Yes, these were the expected semantics :)
>>>>>>
>>>>>>> But then I tried rigging up a test to see if I could catch an EIO, and
>>>>>>> instead I had to SIGKILL the process!  It looks like filemap_fault returns
>>>>>>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>>>>>>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>>>>>>> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_page,
>>>>>>> the VM_FAULT_RETRY is translated to -EBUSY.
>>>>>>>
>>>>>>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>>>>>>> that to madvise_populate.  Unfortunately, madvise_populate increments
>>>>>>> its loop counter by the return value (still 0) so it runs in an
>>>>>>> infinite loop.  The only way out is SIGKILL.
>>>>>>
>>>>>> That's certainly unexpected. One user I know is QEMU, which primarily
>>>>>> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
>>>>>> primarily used with shmem/hugetlb, where I haven't heard of any such
>>>>>> endless loops.
>>>>>>
>>>>>>>
>>>>>>> So I don't know what the correct behavior is here, other than the
>>>>>>> infinite loop seems pretty suspect.  Is it the correct behavior that
>>>>>>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>>>>>>> That doesn't quite sound right if it's the case that a zero return could
>>>>>>> also happen if memory is tight.
>>>>>>
>>>>>> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
>>>>>> That one calls __get_user_pages().
>>>>>>
>>>>>> __get_user_pages() documents: "0 return value is possible when the fault
>>>>>> would need to be retried."
>>>>>>
>>>>>> So that's what the caller does. IIRC, there are cases where we really
>>>>>> have to retry (at least once) and will make progress, so treating "0" as
>>>>>> an error would be wrong.
>>>>>>
>>>>>> Staring at other __get_user_pages() users, __get_user_pages_locked()
>>>>>> documents: "Please note that this function, unlike __get_user_pages(),
>>>>>> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
>>>>>>
>>>>>> But there is some elaborate retry logic in there, whereby the retry will
>>>>>> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
>>>>>> retry attempt (there are cases where we retry more often, but that's
>>>>>> related to something else I believe).
>>>>>>
>>>>>> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
>>>>>> make it use __get_user_pages_locked(), but I recall when I introduced
>>>>>> MADV_POPULATE_READ, there was a catch to it.
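The retry-once idea can be sketched in isolation. Everything below is hypothetical (gup_once() stands in for a single __get_user_pages() pass, and MODEL_FOLL_TRIED only mirrors the FOLL_TRIED/FAULT_FLAG_TRIED pattern), but it shows how one more attempt with a "tried" flag turns an endless retry into a hard error:

```c
#include <assert.h>
#include <errno.h>

#define MODEL_FOLL_TRIED	0x1	/* mirrors FOLL_TRIED; illustrative */

/* Stand-in for one __get_user_pages() pass over a permanently bad page:
 * returns pages pinned, 0 for "please retry", or -EFAULT once the fault
 * handler sees the tried flag and gives up (SIGBUS-class failure). */
static long gup_once(long nr_pages, unsigned int flags)
{
	(void)nr_pages;
	if (flags & MODEL_FOLL_TRIED)
		return -EFAULT;	/* second attempt fails hard */
	return 0;		/* first attempt asks the caller to retry */
}

/* Retry-once wrapper in the spirit of __get_user_pages_locked(): a 0
 * return triggers exactly one more pass with the tried flag set, so the
 * caller sees either progress or a definite error, never an endless 0. */
static long faultin_with_retry(long nr_pages, unsigned int flags)
{
	long ret = gup_once(nr_pages, flags);

	if (ret == 0)
		ret = gup_once(nr_pages, flags | MODEL_FOLL_TRIED);
	return ret;
}
```

A caller like madvise_populate() can then propagate the negative return instead of looping on 0.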
>>>>>
>>>>> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
>>>>> mmap()+access case you describe above.
>>>>>
>>>>> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
>>>>> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
>>>>> Because VM_FAULT_SIGBUS would be required for that function to call
>>>>> do_sigbus().
>>>>
>>>> The code I was looking at yesterday in filemap_fault was:
>>>>
>>>> page_not_uptodate:
>>>> 	/*
>>>> 	 * Umm, take care of errors if the page isn't up-to-date.
>>>> 	 * Try to re-read it _once_. We do this synchronously,
>>>> 	 * because there really aren't any performance issues here
>>>> 	 * and we need to check for errors.
>>>> 	 */
>>>> 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>>> 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
>>>> 	if (fpin)
>>>> 		goto out_retry;
>>>> 	folio_put(folio);
>>>>
>>>> 	if (!error || error == AOP_TRUNCATED_PAGE)
>>>> 		goto retry_find;
>>>> 	filemap_invalidate_unlock_shared(mapping);
>>>>
>>>> 	return VM_FAULT_SIGBUS;
>>>>
>>>> Wherein I /think/ fpin is non-null in this case, so if
>>>> filemap_read_folio returns an error, we'll do this instead:
>>>>
>>>> out_retry:
>>>> 	/*
>>>> 	 * We dropped the mmap_lock, we need to return to the fault handler to
>>>> 	 * re-find the vma and come back and find our hopefully still populated
>>>> 	 * page.
>>>> 	 */
>>>> 	if (!IS_ERR(folio))
>>>> 		folio_put(folio);
>>>> 	if (mapping_locked)
>>>> 		filemap_invalidate_unlock_shared(mapping);
>>>> 	if (fpin)
>>>> 		fput(fpin);
>>>> 	return ret | VM_FAULT_RETRY;
>>>>
>>>> and since ret was 0 before the goto, the only return code is
>>>> VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:
>>>>
>>>> 	if (fpin) {
>>>> 		if (error)
>>>> 			ret |= VM_FAULT_SIGBUS;
>>>> 		goto out_retry;
>>>> 	}
>>>>
>>>> But I think the hard part here is that there doesn't seem to be any
>>>> distinction between transient read errors (e.g. disk cable fell out) vs.
>>>> semi-permanent errors (e.g. verity says the hash doesn't match).
>>>> AFAICT, either the read(ahead) sets uptodate and callers read the page,
>>>> or it doesn't set it and callers treat that as an error-retry
>>>> opportunity.
>>>>
>>>> For the transient error case VM_FAULT_RETRY makes perfect sense; for the
>>>> second case I imagine we'd want something closer to _SIGBUS.
>>>
>>>
>>> Agreed, it's really hard to judge when it's the right time to give up
>>> retrying. At least with MADV_POPULATE_READ we should aim for the same
>>> behavior as mmap()+read access. So if the latter manages to trigger
>>> SIGBUS, MADV_POPULATE_READ should return an error.
>>>
>>> Is there an easy way for me to reproduce this scenario?
>>
>> Yes.  Take this Makefile:
>>
>> CFLAGS=-Wall -Werror -O2 -g -Wno-unused-variable
>>
>> all: mpr
>>
>> and this C program mpr.c:
>>
>> /* test MAP_POPULATE_READ on a file */
>> #include <stdio.h>
>> #include <errno.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <string.h>
>> #include <sys/stat.h>
>> #include <sys/mman.h>
>>
>> #define min(a, b)	((a) < (b) ? (a) : (b))
>> #define BUFSIZE		(2097152)
>>
>> int main(int argc, char *argv[])
>> {
>> 	struct stat sb;
>> 	long pagesize = sysconf(_SC_PAGESIZE);
>> 	off_t read_sz, pos;
>> 	void *addr;
>> 	char c;
>> 	int fd, ret;
>>
>> 	if (argc != 2) {
>> 		printf("Usage: %s fname\n", argv[0]);
>> 		return 1;
>> 	}
>>
>> 	fd = open(argv[1], O_RDONLY);
>> 	if (fd < 0) {
>> 		perror(argv[1]);
>> 		return 1;
>> 	}
>>
>> 	ret = fstat(fd, &sb);
>> 	if (ret) {
>> 		perror("fstat");
>> 		return 1;
>> 	}
>>
>> 	/* Validate the file contents with regular reads */
>> 	for (pos = 0; pos < sb.st_size; pos += sb.st_blksize) {
>> 		ret = pread(fd, &c, 1, pos);
>> 		if (ret < 0) {
>> 			if (errno != EIO) {
>> 				perror("pread");
>> 				return 1;
>> 			}
>>
>> 			printf("%s: at offset %llu: %s\n", argv[1],
>> 					(unsigned long long)pos,
>> 					strerror(errno));
>> 			break;
>> 		}
>> 	}
>>
>> 	ret = pread(fd, &c, 1, sb.st_size);
>> 	if (ret < 0) {
>> 		if (errno != EIO) {
>> 			perror("pread");
>> 			return 1;
>> 		}
>>
>> 		printf("%s: at offset %llu: %s\n", argv[1],
>> 				(unsigned long long)sb.st_size,
>> 				strerror(errno));
>> 	}
>>
>> 	/* Validate the file contents with MADV_POPULATE_READ */
>> 	read_sz = ((sb.st_size + (pagesize - 1)) / pagesize) * pagesize;
>> 	printf("%s: read bytes %llu\n", argv[1], (unsigned long long)read_sz);
>>
>> 	for (pos = 0; pos < read_sz; pos += BUFSIZE) {
>> 		size_t maplen = min(read_sz - pos, BUFSIZE);
>>
>> 		addr = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, pos);
>> 		if (addr == MAP_FAILED) {
>> 			perror("mmap");
>> 			return 1;
>> 		}
>>
>> 		ret = madvise(addr, maplen, MADV_POPULATE_READ);
>> 		if (ret) {
>> 			perror("madvise");
>> 			return 1;
>> 		}
>>
>> 		ret = munmap(addr, maplen);
>> 		if (ret) {
>> 			perror("munmap");
>> 			return 1;
>> 		}
>> 	}
>>
>> 	ret = close(fd);
>> 	if (ret) {
>> 		perror("close");
>> 		return 1;
>> 	}
>>
>> 	return 0;
>> }
>>
>> and this shell script mpr.sh:
>>
>> #!/bin/bash -x
>>
>> # Try to trigger infinite loop with regular IO errors and MADV_POPULATE_READ
>>
>> scriptdir="$(dirname "$0")"
>>
>> commands=(dmsetup mkfs.xfs xfs_io timeout strace "$scriptdir/mpr")
>> for cmd in "${commands[@]}"; do
>> 	if ! command -v "$cmd" &>/dev/null; then
>> 		echo "$cmd: Command required for this program."
>> 		exit 1
>> 	fi
>> done
>>
>> dev="${1:-/dev/sda}"
>> mnt="${2:-/mnt}"
>> dmtarget="dumbtarget"
>>
>> # Clean up any old mounts
>> umount "$dev" "$mnt"
>> dmsetup remove "$dmtarget"
>> rmmod xfs
>>
>> # Create dm linear mapping to block device and format filesystem
>> sectors="$(blockdev --getsz "$dev")"
>> tgt="/dev/mapper/$dmtarget"
>> echo "0 $sectors linear $dev 0" | dmsetup create "$dmtarget"
>> mkfs.xfs -f "$tgt"
>>
>> # Create a file that we'll read, then cycle mount to zap pagecache
>> mount "$tgt" "$mnt"
>> xfs_io -f -c "pwrite -S 0x58 0 1m" "$mnt/a"
>> umount "$mnt"
>> mount "$tgt" "$mnt"
>>
>> # Load file metadata
>> stat "$mnt/a"
>>
>> # Induce EIO errors on read
>> dmsetup suspend --noflush --nolockfs "$dmtarget"
>> echo "0 $sectors error" | dmsetup load "$dmtarget"
>> dmsetup resume "$dmtarget"
>>
>> # Try to provoke the kernel; kill the process after 10s so we can clean up
>> timeout -s KILL 10s strace -s99 -e madvise "$scriptdir/mpr" "$mnt/a"
>>
>> # Stop EIO errors so we can unmount
>> dmsetup suspend --noflush --nolockfs "$dmtarget"
>> echo "0 $sectors linear $dev 0" | dmsetup load "$dmtarget"
>> dmsetup resume "$dmtarget"
>>
>> # Unmount and clean up after ourselves
>> umount "$mnt"
>> dmsetup remove "$dmtarget"
>> <EOF>
>>
>> make the C program, then run ./mpr.sh <device> <mountpoint>.  It should
>> stall in the madvise call until timeout sends sigkill to the program;
>> you can crank the 10s timeout up if you want.
>>
>> <insert usual disclaimer that I run all these things in scratch VMs>
> 
> Yes, seems to work, nice!
> 
> 
> [  452.455636] buffer_io_error: 6 callbacks suppressed
> [  452.455638] Buffer I/O error on dev dm-0, logical block 16, async page read
> [  452.456169] Buffer I/O error on dev dm-0, logical block 17, async page read
> [  452.456456] Buffer I/O error on dev dm-0, logical block 18, async page read
> [  452.456754] Buffer I/O error on dev dm-0, logical block 19, async page read
> [  452.457061] Buffer I/O error on dev dm-0, logical block 20, async page read
> [  452.457350] Buffer I/O error on dev dm-0, logical block 21, async page read
> [  452.457639] Buffer I/O error on dev dm-0, logical block 22, async page read
> [  452.457942] Buffer I/O error on dev dm-0, logical block 23, async page read
> [  452.458242] Buffer I/O error on dev dm-0, logical block 16, async page read
> [  452.458552] Buffer I/O error on dev dm-0, logical block 17, async page read
> + timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
> /mnt/tmp//a: at offset 0: Input/output error
> /mnt/tmp//a: read bytes 1048576
> madvise(0x7f9393624000, 1048576, MADV_POPULATE_READ./mpr.sh: line 45:  2070 Killed                  tim"
> 
> 
> And once I switch over to reading instead of MADV_POPULATE_READ:
> 
> [  753.940230] buffer_io_error: 6 callbacks suppressed
> [  753.940233] Buffer I/O error on dev dm-0, logical block 8192, async page read
> [  753.941402] Buffer I/O error on dev dm-0, logical block 8193, async page read
> [  753.942084] Buffer I/O error on dev dm-0, logical block 8194, async page read
> [  753.942738] Buffer I/O error on dev dm-0, logical block 8195, async page read
> [  753.943412] Buffer I/O error on dev dm-0, logical block 8196, async page read
> [  753.944088] Buffer I/O error on dev dm-0, logical block 8197, async page read
> [  753.944741] Buffer I/O error on dev dm-0, logical block 8198, async page read
> [  753.945415] Buffer I/O error on dev dm-0, logical block 8199, async page read
> [  753.946105] Buffer I/O error on dev dm-0, logical block 8192, async page read
> [  753.946661] Buffer I/O error on dev dm-0, logical block 8193, async page read
> + timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
> /mnt/tmp//a: at offset 0: Input/output error
> /mnt/tmp//a: read bytes 1048576
> --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x7f34f82d8000} ---
> +++ killed by SIGBUS (core dumped) +++
> timeout: the monitored command dumped core
> ./mpr.sh: line 45:  2388 Bus error               timeout -s KILL 10s strace -s99 -e madvise "$scriptdir"
> 
> 
> Let me dig how the fault handler is able to conclude SIGBUS here!

Might be as simple as this:

diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d5..62df1548b7779 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1734,8 +1734,8 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
         if (check_vma_flags(vma, gup_flags))
                 return -EINVAL;
  
-       ret = __get_user_pages(mm, start, nr_pages, gup_flags,
-                              NULL, locked);
+       ret = __get_user_pages_locked(mm, start, nr_pages, NULL, locked,
+                                     gup_flags);
         lru_add_drain();
         return ret;
  }


[   57.154542] Buffer I/O error on dev dm-0, logical block 16, async page read
[   57.155276] Buffer I/O error on dev dm-0, logical block 17, async page read
[   57.155911] Buffer I/O error on dev dm-0, logical block 18, async page read
[   57.156568] Buffer I/O error on dev dm-0, logical block 19, async page read
[   57.157245] Buffer I/O error on dev dm-0, logical block 20, async page read
[   57.157880] Buffer I/O error on dev dm-0, logical block 21, async page read
[   57.158539] Buffer I/O error on dev dm-0, logical block 22, async page read
[   57.159197] Buffer I/O error on dev dm-0, logical block 23, async page read
[   57.159838] Buffer I/O error on dev dm-0, logical block 16, async page read
[   57.160492] Buffer I/O error on dev dm-0, logical block 17, async page read
+ timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
[   57.190472] Retrying
/mnt/tmp//a: at offset 0: Input/output error
/mnt/tmp//a: read bytes 1048576
madvise(0x7f0fa0384000, 1048576, MADV_POPULATE_READ) = -1 EFAULT (Bad address)
madvise: Bad address


And -EFAULT is what MADV_POPULATE_READ is documented to return for SIGBUS.

(there are a couple of ways we could speed up MADV_POPULATE_READ, maybe
at some point using VMA locks)

-- 
Cheers,

David / dhildenb



end of thread, other threads:[~2024-03-13 21:03 UTC | newest]

Thread overview: 94+ messages
-- links below jump to the message on this page --
2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 01/24] fsverity: remove hash page spin lock Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 02/24] xfs: add parent pointer support to attribute code Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 03/24] xfs: define parent pointer ondisk extended attribute format Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 04/24] xfs: add parent pointer validator functions Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 05/24] fs: add FS_XFLAG_VERITY for verity files Andrey Albershteyn
2024-03-04 22:35   ` Eric Biggers
2024-03-07 21:39     ` Darrick J. Wong
2024-03-07 22:06       ` Eric Biggers
2024-03-04 19:10 ` [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity() Andrey Albershteyn
2024-03-05  0:52   ` Eric Biggers
2024-03-06 16:30     ` Darrick J. Wong
2024-03-07 22:02       ` Eric Biggers
2024-03-08  3:46         ` Darrick J. Wong
2024-03-08  4:40           ` Eric Biggers
2024-03-11 22:38             ` Darrick J. Wong
2024-03-12 15:13               ` David Hildenbrand
2024-03-12 15:33                 ` David Hildenbrand
2024-03-12 16:44                   ` Darrick J. Wong
2024-03-13 12:29                     ` David Hildenbrand
2024-03-13 17:19                       ` Darrick J. Wong
2024-03-13 19:10                         ` David Hildenbrand
2024-03-13 21:03                           ` David Hildenbrand
2024-03-08 21:34           ` Dave Chinner
2024-03-09 16:19             ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 07/24] fsverity: support block-based Merkle tree caching Andrey Albershteyn
2024-03-06  3:56   ` Eric Biggers
2024-03-07 21:54     ` Darrick J. Wong
2024-03-07 22:49       ` Eric Biggers
2024-03-08  3:50         ` Darrick J. Wong
2024-03-09 16:24           ` Darrick J. Wong
2024-03-11 23:22   ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing Andrey Albershteyn
2024-03-05  1:08   ` Eric Biggers
2024-03-07 21:58     ` Darrick J. Wong
2024-03-07 22:26       ` Eric Biggers
2024-03-08  3:53         ` Darrick J. Wong
2024-03-07 22:55       ` Dave Chinner
2024-03-04 19:10 ` [PATCH v5 09/24] fsverity: add tracepoints Andrey Albershteyn
2024-03-05  0:33   ` Eric Biggers
2024-03-04 19:10 ` [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path Andrey Albershteyn
2024-03-04 23:39   ` Eric Biggers
2024-03-07 22:06     ` Darrick J. Wong
2024-03-07 22:19       ` Eric Biggers
2024-03-07 23:38     ` Dave Chinner
2024-03-07 23:45       ` Darrick J. Wong
2024-03-08  0:47         ` Dave Chinner
2024-03-07 23:59       ` Eric Biggers
2024-03-08  1:20         ` Dave Chinner
2024-03-08  3:16           ` Eric Biggers
2024-03-08  3:57             ` Darrick J. Wong
2024-03-08  3:22           ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag Andrey Albershteyn
2024-03-07 22:46   ` Darrick J. Wong
2024-03-08  1:59     ` Dave Chinner
2024-03-08  3:31       ` Darrick J. Wong
2024-03-09 16:28         ` Darrick J. Wong
2024-03-11  0:26           ` Dave Chinner
2024-03-11 15:25             ` Darrick J. Wong
2024-03-12  2:43               ` Eric Biggers
2024-03-12  4:15                 ` Darrick J. Wong
2024-03-12  2:45               ` Darrick J. Wong
2024-03-12  7:01                 ` Dave Chinner
2024-03-12 20:04                   ` Darrick J. Wong
2024-03-12 21:45                     ` Dave Chinner
2024-03-04 19:10 ` [PATCH v5 12/24] xfs: add XFS_DA_OP_BUFFER to make xfs_attr_get() return buffer Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 13/24] xfs: add attribute type for fs-verity Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 14/24] xfs: make xfs_buf_get() to take XBF_* flags Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 15/24] xfs: add XBF_DOUBLE_ALLOC to increase size of the buffer Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 16/24] xfs: add fs-verity ro-compat flag Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 17/24] xfs: add inode on-disk VERITY flag Andrey Albershteyn
2024-03-07 22:06   ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 18/24] xfs: initialize fs-verity on file open and cleanup on inode destruction Andrey Albershteyn
2024-03-07 22:09   ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 19/24] xfs: don't allow to enable DAX on fs-verity sealsed inode Andrey Albershteyn
2024-03-07 22:09   ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 20/24] xfs: disable direct read path for fs-verity files Andrey Albershteyn
2024-03-07 22:11   ` Darrick J. Wong
2024-03-12 12:02     ` Andrey Albershteyn
2024-03-12 16:36       ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 21/24] xfs: add fs-verity support Andrey Albershteyn
2024-03-06  4:55   ` Eric Biggers
2024-03-06  5:01     ` Eric Biggers
2024-03-07 23:10   ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag Andrey Albershteyn
2024-03-07 22:18   ` Darrick J. Wong
2024-03-12 12:10     ` Andrey Albershteyn
2024-03-12 16:38       ` Darrick J. Wong
2024-03-13  1:35         ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 23/24] xfs: add fs-verity ioctls Andrey Albershteyn
2024-03-07 22:14   ` Darrick J. Wong
2024-03-12 12:42     ` Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 24/24] xfs: enable ro-compat fs-verity flag Andrey Albershteyn
2024-03-07 22:16   ` Darrick J. Wong
