From: "Darrick J. Wong" <djwong@kernel.org>
To: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>,
Andrey Albershteyn <aalbersh@redhat.com>,
fsverity@lists.linux.dev, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, chandan.babu@oracle.com,
akpm@linux-foundation.org, linux-mm@kvack.org,
Eric Biggers <ebiggers@kernel.org>
Subject: Re: [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity()
Date: Wed, 13 Mar 2024 10:19:36 -0700 [thread overview]
Message-ID: <20240313171936.GN1927156@frogsfrogsfrogs> (raw)
In-Reply-To: <420b6d5f-adef-4415-b8cb-16c234dcec38@redhat.com>
On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
> On 12.03.24 17:44, Darrick J. Wong wrote:
> > On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
> > > On 12.03.24 16:13, David Hildenbrand wrote:
> > > > On 11.03.24 23:38, Darrick J. Wong wrote:
> > > > > [add willy and linux-mm]
> > > > >
> > > > > On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
> > > > > > On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > > > > > > > BTW, is xfs_repair planned to do anything about any such extra blocks?
> > > > > > >
> > > > > > > Sorry to answer your question with a question, but how much checking is
> > > > > > > $filesystem expected to do for merkle trees?
> > > > > > >
> > > > > > > In theory xfs_repair could learn how to interpret the verity descriptor,
> > > > > > > walk the merkle tree blocks, and even read the file data to confirm
> > > > > > > intactness. If the descriptor specifies the highest block address then
> > > > > > > we could certainly trim off excess blocks. But I don't know how much of
> > > > > > > libfsverity actually lets you do that; I haven't looked into that
> > > > > > > deeply. :/
> > > > > > >
> > > > > > > For xfs_scrub I guess the job is theoretically simpler, since we only
> > > > > > > need to stream reads of the verity files through the page cache and let
> > > > > > > verity tell us if the file data are consistent.
> > > > > > >
> > > > > > > For both tools, if something finds errors in the merkle tree structure
> > > > > > > itself, do we turn off verity? Or do we do something nasty like
> > > > > > > truncate the file?
> > > > > >
> > > > > > As far as I know (I haven't been following btrfs-progs, but I'm familiar with
> > > > > > e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
> > > > > > validating the data of verity inodes against their Merkle trees.
> > > > > >
> > > > > > e2fsck does delete the verity metadata of inodes that don't have the verity flag
> > > > > > enabled. That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
> > > > > >
> > > > > > I suppose that ideally, if an inode's verity metadata is invalid, then fsck
> > > > > > should delete that inode's verity metadata and remove the verity flag from the
> > > > > > inode. Checking for a missing or obviously corrupt fsverity_descriptor would be
> > > > > > fairly straightforward, but it probably wouldn't catch much compared to actually
> > > > > > validating the data against the Merkle tree. And actually validating the data
> > > > > > against the Merkle tree would be complex and expensive. Note, none of this
> > > > > > would work on files that are encrypted.
> > > > > >
> > > > > > Re: libfsverity, I think it would be possible to validate a Merkle tree using
> > > > > > libfsverity_compute_digest() and the callbacks that it supports. But that's not
> > > > > > quite what it was designed for.
> > > > > >
> > > > > > > Is there an ioctl or something that allows userspace to validate an
> > > > > > > entire file's contents? Sort of like what BLKVERIFY would have done for
> > > > > > > block devices, except that we might believe its answers?
> > > > > >
> > > > > > Just reading the whole file and seeing whether you get an error would do it.
> > > > > >
> > > > > > Though if you want to make sure it's really re-reading the on-disk data, it's
> > > > > > necessary to drop the file's pagecache first.
> > > > >
> > > > > I tried a straight pagecache read and it worked like a charm!
> > > > >
> > > > > But then I thought to myself, do I really want to waste memory bandwidth
> > > > > copying a bunch of data? No. I don't even want to incur system call
> > > > > overhead from reading a single byte every $pagesize bytes.
> > > > >
> > > > > So I created 2M mmap areas and read a byte every $pagesize bytes. That
> > > > > worked too, insofar as SIGBUSes are annoying to handle. But it's
> > > > > annoying to take signals like that.
> > > > >
> > > > > Then I started looking at madvise. MADV_POPULATE_READ looked exactly
> > > > > like what I wanted -- it prefaults in the pages, and "If populating
> > > > > fails, a SIGBUS signal is not generated; instead, an error is returned."
> > > > >
> > > >
> > > > Yes, these were the expected semantics :)
> > > >
> > > > > But then I tried rigging up a test to see if I could catch an EIO, and
> > > > > instead I had to SIGKILL the process! It looks filemap_fault returns
> > > > > VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
> > > > > __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
> > > > > handle_mm_fault -> faultin_page -> __get_user_pages. At faultin_pages,
> > > > > the VM_FAULT_RETRY is translated to -EBUSY.
> > > > >
> > > > > __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
> > > > > that to madvise_populate. Unfortunately, madvise_populate increments
> > > > > its loop counter by the return value (still 0) so it runs in an
> > > > > infinite loop. The only way out is SIGKILL.
> > > >
> > > > That's certainly unexpected. One user I know is QEMU, which primarily
> > > > uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
> > > > primarily used with shmem/hugetlb, where I haven't heard of any such
> > > > endless loops.
> > > >
> > > > >
> > > > > So I don't know what the correct behavior is here, other than the
> > > > > infinite loop seems pretty suspect. Is it the correct behavior that
> > > > > madvise_populate returns EIO if __get_user_pages ever returns zero?
> > > > > That doesn't quite sound right if it's the case that a zero return could
> > > > > also happen if memory is tight.
> > > >
> > > > madvise_populate() ends up calling faultin_vma_page_range() in a loop.
> > > > That one calls __get_user_pages().
> > > >
> > > > __get_user_pages() documents: "0 return value is possible when the fault
> > > > would need to be retried."
> > > >
> > > > So that's what the caller does. IIRC, there are cases where we really
> > > > have to retry (at least once) and will make progress, so treating "0" as
> > > > an error would be wrong.
> > > >
> > > > Staring at other __get_user_pages() users, __get_user_pages_locked()
> > > > documents: "Please note that this function, unlike __get_user_pages(),
> > > > will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
> > > >
> > > > But there is some elaborate retry logic in there, whereby the retry will
> > > > set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
> > > > retry attempt (there are cases where we retry more often, but that's
> > > > related to something else I believe).
> > > >
> > > > So maybe we need a similar retry logic in faultin_vma_page_range()? Or
> > > > make it use __get_user_pages_locked(), but I recall when I introduced
> > > > MADV_POPULATE_READ, there was a catch to it.
> > >
> > > I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
> > > mmap()+access case you describe above.
> > >
> > > Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
> > > how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
> > > Because VM_FAULT_SIGBUS would be required for that function to call
> > > do_sigbus().
> >
> > The code I was looking at yesterday in filemap_fault was:
> >
> > page_not_uptodate:
> > /*
> > * Umm, take care of errors if the page isn't up-to-date.
> > * Try to re-read it _once_. We do this synchronously,
> > * because there really aren't any performance issues here
> > * and we need to check for errors.
> > */
> > fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> > error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
> > if (fpin)
> > goto out_retry;
> > folio_put(folio);
> >
> > if (!error || error == AOP_TRUNCATED_PAGE)
> > goto retry_find;
> > filemap_invalidate_unlock_shared(mapping);
> >
> > return VM_FAULT_SIGBUS;
> >
> > Wherein I /think/ fpin is non-null in this case, so if
> > filemap_read_folio returns an error, we'll do this instead:
> >
> > out_retry:
> > /*
> > * We dropped the mmap_lock, we need to return to the fault handler to
> > * re-find the vma and come back and find our hopefully still populated
> > * page.
> > */
> > if (!IS_ERR(folio))
> > folio_put(folio);
> > if (mapping_locked)
> > filemap_invalidate_unlock_shared(mapping);
> > if (fpin)
> > fput(fpin);
> > return ret | VM_FAULT_RETRY;
> >
> > and since ret was 0 before the goto, the only return code is
> > VM_FAULT_RETRY. I had speculated that perhaps we could instead do:
> >
> > if (fpin) {
> > if (error)
> > ret |= VM_FAULT_SIGBUS;
> > goto out_retry;
> > }
> >
> > But I think the hard part here is that there doesn't seem to be any
> > distinction between transient read errors (e.g. disk cable fell out) vs.
> > semi-permanent errors (e.g. verity says the hash doesn't match).
> > AFAICT, either the read(ahead) sets uptodate and callers read the page,
> > or it doesn't set it and callers treat that as an error-retry
> > opportunity.
> >
> > For the transient error case VM_FAULT_RETRY makes perfect sense; for the
> > second case I imagine we'd want something closer to _SIGBUS.
>
>
> Agreed, it's really hard to judge when it's the right time to give up
> retrying. At least with MADV_POPULATE_READ we should try achieving the same
> behavior as with mmap()+read access. So if the latter manages to trigger
> SIGBUS, MADV_POPULATE_READ should return an error.
>
> Is there an easy way to for me to reproduce this scenario?
Yes. Take this Makefile:
CFLAGS=-Wall -Werror -O2 -g -Wno-unused-variable
all: mpr
and this C program mpr.c:
/* test MAP_POPULATE_READ on a file */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/mman.h>
#define min(a, b) ((a) < (b) ? (a) : (b))
#define BUFSIZE (2097152)
int main(int argc, char *argv[])
{
struct stat sb;
long pagesize = sysconf(_SC_PAGESIZE);
off_t read_sz, pos;
void *addr;
char c;
int fd, ret;
if (argc != 2) {
printf("Usage: %s fname\n", argv[0]);
return 1;
}
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror(argv[1]);
return 1;
}
ret = fstat(fd, &sb);
if (ret) {
perror("fstat");
return 1;
}
/* Validate the file contents with regular reads */
for (pos = 0; pos < sb.st_size; pos += sb.st_blksize) {
ret = pread(fd, &c, 1, pos);
if (ret < 0) {
if (errno != EIO) {
perror("pread");
return 1;
}
printf("%s: at offset %llu: %s\n", argv[1],
(unsigned long long)pos,
strerror(errno));
break;
}
}
ret = pread(fd, &c, 1, sb.st_size);
if (ret < 0) {
if (errno != EIO) {
perror("pread");
return 1;
}
printf("%s: at offset %llu: %s\n", argv[1],
(unsigned long long)sb.st_size,
strerror(errno));
}
/* Validate the file contents with MADV_POPULATE_READ */
read_sz = ((sb.st_size + (pagesize - 1)) / pagesize) * pagesize;
printf("%s: read bytes %llu\n", argv[1], (unsigned long long)read_sz);
for (pos = 0; pos < read_sz; pos += BUFSIZE) {
unsigned int mappos;
size_t maplen = min(read_sz - pos, BUFSIZE);
addr = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, pos);
if (addr == MAP_FAILED) {
perror("mmap");
return 1;
}
ret = madvise(addr, maplen, MADV_POPULATE_READ);
if (ret) {
perror("madvise");
return 1;
}
ret = munmap(addr, maplen);
if (ret) {
perror("munmap");
return 1;
}
}
ret = close(fd);
if (ret) {
perror("close");
return 1;
}
return 0;
}
and this shell script mpr.sh:
#!/bin/bash -x
# Try to trigger infinite loop with regular IO errors and MADV_POPULATE_READ
scriptdir="$(dirname "$0")"
commands=(dmsetup mkfs.xfs xfs_io timeout strace "$scriptdir/mpr")
for cmd in "${commands[@]}"; do
if ! command -v "$cmd" &>/dev/null; then
echo "$cmd: Command required for this program."
exit 1
fi
done
dev="${1:-/dev/sda}"
mnt="${2:-/mnt}"
dmtarget="dumbtarget"
# Clean up any old mounts
umount "$dev" "$mnt"
dmsetup remove "$dmtarget"
rmmod xfs
# Create dm linear mapping to block device and format filesystem
sectors="$(blockdev --getsz "$dev")"
tgt="/dev/mapper/$dmtarget"
echo "0 $sectors linear $dev 0" | dmsetup create "$dmtarget"
mkfs.xfs -f "$tgt"
# Create a file that we'll read, then cycle mount to zap pagecache
mount "$tgt" "$mnt"
xfs_io -f -c "pwrite -S 0x58 0 1m" "$mnt/a"
umount "$mnt"
mount "$tgt" "$mnt"
# Load file metadata
stat "$mnt/a"
# Induce EIO errors on read
dmsetup suspend --noflush --nolockfs "$dmtarget"
echo "0 $sectors error" | dmsetup load "$dmtarget"
dmsetup resume "$dmtarget"
# Try to provoke the kernel; kill the process after 10s so we can clean up
timeout -s KILL 10s strace -s99 -e madvise "$scriptdir/mpr" "$mnt/a"
# Stop EIO errors so we can unmount
dmsetup suspend --noflush --nolockfs "$dmtarget"
echo "0 $sectors linear $dev 0" | dmsetup load "$dmtarget"
dmsetup resume "$dmtarget"
# Unmount and clean up after ourselves
umount "$mnt"
dmsetup remove "$dmtarget"
<EOF>
make the C program, then run ./mpr.sh <device> <mountpoint>. It should
stall in the madvise call until timeout sends sigkill to the program;
you can crank the 10s timeout up if you want.
<insert usual disclaimer that I run all these things in scratch VMs>
--D
> --
> Cheers,
>
> David / dhildenb
>
>
next prev parent reply other threads:[~2024-03-13 17:19 UTC|newest]
Thread overview: 94+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-04 19:10 [PATCH v5 00/24] fs-verity support for XFS Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 01/24] fsverity: remove hash page spin lock Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 02/24] xfs: add parent pointer support to attribute code Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 03/24] xfs: define parent pointer ondisk extended attribute format Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 04/24] xfs: add parent pointer validator functions Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 05/24] fs: add FS_XFLAG_VERITY for verity files Andrey Albershteyn
2024-03-04 22:35 ` Eric Biggers
2024-03-07 21:39 ` Darrick J. Wong
2024-03-07 22:06 ` Eric Biggers
2024-03-04 19:10 ` [PATCH v5 06/24] fsverity: pass tree_blocksize to end_enable_verity() Andrey Albershteyn
2024-03-05 0:52 ` Eric Biggers
2024-03-06 16:30 ` Darrick J. Wong
2024-03-07 22:02 ` Eric Biggers
2024-03-08 3:46 ` Darrick J. Wong
2024-03-08 4:40 ` Eric Biggers
2024-03-11 22:38 ` Darrick J. Wong
2024-03-12 15:13 ` David Hildenbrand
2024-03-12 15:33 ` David Hildenbrand
2024-03-12 16:44 ` Darrick J. Wong
2024-03-13 12:29 ` David Hildenbrand
2024-03-13 17:19 ` Darrick J. Wong [this message]
2024-03-13 19:10 ` David Hildenbrand
2024-03-13 21:03 ` David Hildenbrand
2024-03-08 21:34 ` Dave Chinner
2024-03-09 16:19 ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 07/24] fsverity: support block-based Merkle tree caching Andrey Albershteyn
2024-03-06 3:56 ` Eric Biggers
2024-03-07 21:54 ` Darrick J. Wong
2024-03-07 22:49 ` Eric Biggers
2024-03-08 3:50 ` Darrick J. Wong
2024-03-09 16:24 ` Darrick J. Wong
2024-03-11 23:22 ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 08/24] fsverity: add per-sb workqueue for post read processing Andrey Albershteyn
2024-03-05 1:08 ` Eric Biggers
2024-03-07 21:58 ` Darrick J. Wong
2024-03-07 22:26 ` Eric Biggers
2024-03-08 3:53 ` Darrick J. Wong
2024-03-07 22:55 ` Dave Chinner
2024-03-04 19:10 ` [PATCH v5 09/24] fsverity: add tracepoints Andrey Albershteyn
2024-03-05 0:33 ` Eric Biggers
2024-03-04 19:10 ` [PATCH v5 10/24] iomap: integrate fs-verity verification into iomap's read path Andrey Albershteyn
2024-03-04 23:39 ` Eric Biggers
2024-03-07 22:06 ` Darrick J. Wong
2024-03-07 22:19 ` Eric Biggers
2024-03-07 23:38 ` Dave Chinner
2024-03-07 23:45 ` Darrick J. Wong
2024-03-08 0:47 ` Dave Chinner
2024-03-07 23:59 ` Eric Biggers
2024-03-08 1:20 ` Dave Chinner
2024-03-08 3:16 ` Eric Biggers
2024-03-08 3:57 ` Darrick J. Wong
2024-03-08 3:22 ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag Andrey Albershteyn
2024-03-07 22:46 ` Darrick J. Wong
2024-03-08 1:59 ` Dave Chinner
2024-03-08 3:31 ` Darrick J. Wong
2024-03-09 16:28 ` Darrick J. Wong
2024-03-11 0:26 ` Dave Chinner
2024-03-11 15:25 ` Darrick J. Wong
2024-03-12 2:43 ` Eric Biggers
2024-03-12 4:15 ` Darrick J. Wong
2024-03-12 2:45 ` Darrick J. Wong
2024-03-12 7:01 ` Dave Chinner
2024-03-12 20:04 ` Darrick J. Wong
2024-03-12 21:45 ` Dave Chinner
2024-03-04 19:10 ` [PATCH v5 12/24] xfs: add XFS_DA_OP_BUFFER to make xfs_attr_get() return buffer Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 13/24] xfs: add attribute type for fs-verity Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 14/24] xfs: make xfs_buf_get() to take XBF_* flags Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 15/24] xfs: add XBF_DOUBLE_ALLOC to increase size of the buffer Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 16/24] xfs: add fs-verity ro-compat flag Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 17/24] xfs: add inode on-disk VERITY flag Andrey Albershteyn
2024-03-07 22:06 ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 18/24] xfs: initialize fs-verity on file open and cleanup on inode destruction Andrey Albershteyn
2024-03-07 22:09 ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 19/24] xfs: don't allow to enable DAX on fs-verity sealsed inode Andrey Albershteyn
2024-03-07 22:09 ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 20/24] xfs: disable direct read path for fs-verity files Andrey Albershteyn
2024-03-07 22:11 ` Darrick J. Wong
2024-03-12 12:02 ` Andrey Albershteyn
2024-03-12 16:36 ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 21/24] xfs: add fs-verity support Andrey Albershteyn
2024-03-06 4:55 ` Eric Biggers
2024-03-06 5:01 ` Eric Biggers
2024-03-07 23:10 ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 22/24] xfs: make scrub aware of verity dinode flag Andrey Albershteyn
2024-03-07 22:18 ` Darrick J. Wong
2024-03-12 12:10 ` Andrey Albershteyn
2024-03-12 16:38 ` Darrick J. Wong
2024-03-13 1:35 ` Darrick J. Wong
2024-03-04 19:10 ` [PATCH v5 23/24] xfs: add fs-verity ioctls Andrey Albershteyn
2024-03-07 22:14 ` Darrick J. Wong
2024-03-12 12:42 ` Andrey Albershteyn
2024-03-04 19:10 ` [PATCH v5 24/24] xfs: enable ro-compat fs-verity flag Andrey Albershteyn
2024-03-07 22:16 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240313171936.GN1927156@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=aalbersh@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=chandan.babu@oracle.com \
--cc=david@redhat.com \
--cc=ebiggers@kernel.org \
--cc=fsverity@lists.linux.dev \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).