linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v2 9/9] selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs
From: Andrii Nakryiko @ 2024-05-24  4:10 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
	Andrii Nakryiko

Implement a simple tool/benchmark for comparing address "resolution"
logic based on the textual /proc/<pid>/maps interface and the new
binary ioctl-based PROCMAP_QUERY command.

The tool expects a file with a list of hex addresses and the relevant
PID, and lets the user choose whether VMAs are processed through the
textual interface or the binary ioctl.

The overall logic implements an efficient way to do batched processing
of a given set of (unsorted) addresses. We first sort them in increasing
order (remembering their original positions so the original order can be
restored, if necessary), and then process all VMAs from
/proc/<pid>/maps, matching addresses to VMAs and calculating file
offsets for each match. The ioctl-based approach follows the same idea,
but is implemented even more efficiently, requesting only VMAs that
cover the given addresses and skipping all irrelevant VMAs altogether.

To be able to compare the efficiency of both APIs, the tool has a
"benchmark" mode. The user provides the number of processing runs to
execute in a tight loop, timing only the /proc/<pid>/maps parsing and
processing parts of the logic. Address sorting and re-sorting are
excluded. This gives a more direct way to compare the ioctl- vs
text-based APIs.

We used a medium-sized production application for a representative
benchmark. A bunch of stack traces were captured, resulting in 3244
user space addresses (464 unique ones, but we didn't deduplicate them).
The application itself had 655 VMAs reported in /proc/<pid>/maps.

Averaging the time taken to process all addresses over 10000 runs showed that:
  - text-based approach took 333 microseconds *per one batch run*;
  - ioctl-based approach took 8 microseconds *per (identical) batch run*.

This gives a ~40x speedup for doing exactly the same amount of work
(build IDs were not fetched in the ioctl-based benchmark; fetching build
IDs resulted in a 2x slowdown compared to the no-build-ID case). The
ratio will vary depending on the exact set of addresses and how many
VMAs they are mapped to, so 40x isn't something to take for granted,
but it does show the improvements that are achievable.

I also did an strace run of both cases. In the text-based one, the tool
did 27 read() syscalls, fetching up to 4KB of data in one go (a
seq_file limitation; bumping the user buffer size has no effect, as the
data is always capped at 4KB per read). In comparison, the ioctl-based
implementation had to do only 5 ioctl() calls to fetch all relevant
VMAs.

It is projected that processing big production applications would only
widen the gap in favor of the binary ioctl-based querying API, as
bigger applications tend to have even more non-executable VMA mappings
relative to executable ones. E.g., one of the larger production
applications in the server fleet has upwards of 20000 VMAs, which would
make the benchmark even more unfavorable to processing the
/proc/<pid>/maps file.

This tool implements one of the usage patterns, referred to as the
"on-demand profiling" use case in the main patch implementing the
ioctl() API. perf is an example of the pre-processing pattern, in which
all (or all executable) VMAs are loaded and stored for further
querying. We implemented an experimental change to perf to benchmark
the text-based and ioctl-based APIs, and in perf benchmarks the
ioctl-based interface was no worse than the optimized text-based
parsing. Filtering to only executable VMAs made the ioctl-based
benchmarks faster still, as perf then queries only about 1/3 of all
VMAs, compared to having to read and parse all of them.

E.g., running `perf bench internals synthesize --mt -M 8`, we get:

TEXT-BASED
==========
  # ./perf-parse bench internals synthesize --mt -M 8
  # Running 'internals/synthesize' benchmark:
  Computing performance of multi threaded perf event synthesis by
  synthesizing events on CPU 0:
    Number of synthesis threads: 1
      Average synthesis took: 10238.600 usec (+- 309.656 usec)
      Average num. events: 3744.000 (+- 0.000)
      Average time per event 2.735 usec
    ...
    Number of synthesis threads: 8
      Average synthesis took: 6814.600 usec (+- 149.418 usec)
      Average num. events: 3744.000 (+- 0.000)
      Average time per event 1.820 usec

IOCTL-BASED, FETCHING ALL VMAS
==============================
  # ./perf-ioctl-all bench internals synthesize --mt -M 8
  # Running 'internals/synthesize' benchmark:
  Computing performance of multi threaded perf event synthesis by
  synthesizing events on CPU 0:
    Number of synthesis threads: 1
      Average synthesis took: 9944.800 usec (+- 381.794 usec)
      Average num. events: 3593.000 (+- 0.000)
      Average time per event 2.768 usec
    ...
    Number of synthesis threads: 8
      Average synthesis took: 6598.600 usec (+- 137.503 usec)
      Average num. events: 3595.000 (+- 0.000)
      Average time per event 1.835 usec

IOCTL-BASED, FETCHING EXECUTABLE VMAS
=====================================
  # ./perf-ioctl-exec bench internals synthesize --mt -M 8
  # Running 'internals/synthesize' benchmark:
  Computing performance of multi threaded perf event synthesis by
  synthesizing events on CPU 0:
    Number of synthesis threads: 1
      Average synthesis took: 8539.600 usec (+- 364.875 usec)
      Average num. events: 3569.000 (+- 0.000)
      Average time per event 2.393 usec
    ...
    Number of synthesis threads: 8
      Average synthesis took: 5657.600 usec (+- 107.219 usec)
      Average num. events: 3571.000 (+- 0.000)
      Average time per event 1.584 usec

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/testing/selftests/bpf/.gitignore     |   1 +
 tools/testing/selftests/bpf/Makefile       |   2 +-
 tools/testing/selftests/bpf/procfs_query.c | 386 +++++++++++++++++++++
 3 files changed, 388 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/procfs_query.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 5025401323af..903b14931bfe 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -44,6 +44,7 @@ test_cpp
 /veristat
 /sign-file
 /uprobe_multi
+/procfs_query
 *.ko
 *.tmp
 xskxceiver
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e0b3887b3d2d..0afa667a54e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -144,7 +144,7 @@ TEST_GEN_PROGS_EXTENDED = test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
 	test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
 	xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
-	xdp_features bpf_test_no_cfi.ko
+	xdp_features bpf_test_no_cfi.ko procfs_query
 
 TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
 
diff --git a/tools/testing/selftests/bpf/procfs_query.c b/tools/testing/selftests/bpf/procfs_query.c
new file mode 100644
index 000000000000..63e06568f1ff
--- /dev/null
+++ b/tools/testing/selftests/bpf/procfs_query.c
@@ -0,0 +1,386 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+#include <argp.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <time.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <errno.h>
+
+static bool verbose;
+static bool quiet;
+static bool use_ioctl;
+static bool request_build_id;
+static char *addrs_path;
+static int pid;
+static int bench_runs;
+
+const char *argp_program_version = "procfs_query 0.0";
+const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
+
+static inline uint64_t get_time_ns(void)
+{
+	struct timespec t;
+
+	clock_gettime(CLOCK_MONOTONIC, &t);
+
+	return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
+}
+
+static const struct argp_option opts[] = {
+	{ "verbose", 'v', NULL, 0, "Verbose mode" },
+	{ "quiet", 'q', NULL, 0, "Quiet mode (no output)" },
+	{ "pid", 'p', "PID", 0, "PID of the process" },
+	{ "addrs-path", 'f', "PATH", 0, "File with addresses to resolve" },
+	{ "benchmark", 'B', "RUNS", 0, "Benchmark mode" },
+	{ "query", 'Q', NULL, 0, "Use ioctl()-based point query API (by default text parsing is done)" },
+	{ "build-id", 'b', NULL, 0, "Fetch build ID, if available (only for ioctl mode)" },
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case 'v':
+		verbose = true;
+		break;
+	case 'q':
+		quiet = true;
+		break;
+	case 'Q':
+		use_ioctl = true;
+		break;
+	case 'b':
+		request_build_id = true;
+		break;
+	case 'p':
+		pid = strtol(arg, NULL, 10);
+		break;
+	case 'f':
+		addrs_path = strdup(arg);
+		break;
+	case 'B':
+		bench_runs = strtol(arg, NULL, 10);
+		if (bench_runs <= 0) {
+			fprintf(stderr, "Invalid benchmark run count: %s\n", arg);
+			return -EINVAL;
+		}
+		break;
+	case ARGP_KEY_ARG:
+		argp_usage(state);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static const struct argp argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+struct addr {
+	unsigned long long addr;
+	int idx;
+};
+
+static struct addr *addrs;
+static size_t addr_cnt, addr_cap;
+
+struct resolved_addr {
+	unsigned long long file_off;
+	const char *vma_name;
+	int build_id_sz;
+	char build_id[20];
+};
+
+static struct resolved_addr *resolved;
+
+static int resolve_addrs_ioctl(void)
+{
+	char buf[32], build_id_buf[20], vma_name[PATH_MAX];
+	struct procmap_query q;
+	int fd, err, i;
+	struct addr *a = &addrs[0];
+	struct resolved_addr *r;
+
+	snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+	fd = open(buf, O_RDONLY);
+	if (fd < 0) {
+		err = -errno;
+		fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+		return err;
+	}
+
+	memset(&q, 0, sizeof(q));
+	q.size = sizeof(q);
+	q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
+	q.vma_name_addr = (__u64)vma_name;
+	if (request_build_id)
+		q.build_id_addr = (__u64)build_id_buf;
+
+	for (i = 0; i < addr_cnt; ) {
+		char *name = NULL;
+
+		q.query_addr = (__u64)a->addr;
+		q.vma_name_size = sizeof(vma_name);
+		if (request_build_id)
+			q.build_id_size = sizeof(build_id_buf);
+
+		err = ioctl(fd, PROCMAP_QUERY, &q);
+		if (err < 0 && errno == ENOTTY) {
+			close(fd);
+			fprintf(stderr, "PROCMAP_QUERY ioctl() command is not supported on this kernel!\n");
+			return -EOPNOTSUPP; /* ioctl() not implemented yet */
+		}
+		if (err < 0 && errno == ENOENT) {
+			fprintf(stderr, "ENOENT addr %lx\n", (long)q.query_addr);
+			i++;
+			a++;
+			continue; /* unresolved address */
+		}
+		if (err < 0) {
+			err = -errno;
+			close(fd);
+			fprintf(stderr, "PROCMAP_QUERY ioctl() returned error: %d\n", err);
+			return err;
+		}
+
+		if (verbose) {
+			printf("VMA FOUND (addr %08lx): %08lx-%08lx %c%c%c%c %08lx %02x:%02x %ld %s (build ID: %s, %d bytes)\n",
+			       (long)q.query_addr, (long)q.vma_start, (long)q.vma_end,
+			       (q.vma_flags & PROCMAP_QUERY_VMA_READABLE) ? 'r' : '-',
+			       (q.vma_flags & PROCMAP_QUERY_VMA_WRITABLE) ? 'w' : '-',
+			       (q.vma_flags & PROCMAP_QUERY_VMA_EXECUTABLE) ? 'x' : '-',
+			       (q.vma_flags & PROCMAP_QUERY_VMA_SHARED) ? 's' : 'p',
+			       (long)q.vma_offset, q.dev_major, q.dev_minor, (long)q.inode,
+			       q.vma_name_size ? vma_name : "",
+			       q.build_id_size ? "YES" : "NO",
+			       q.build_id_size);
+		}
+
+		/* skip addrs falling before current VMA */
+		for (; i < addr_cnt && a->addr < q.vma_start; i++, a++) {
+		}
+		/* process addrs covered by current VMA */
+		for (; i < addr_cnt && a->addr < q.vma_end; i++, a++) {
+			r = &resolved[a->idx];
+			r->file_off = a->addr - q.vma_start + q.vma_offset;
+
+			/* reuse name, if it was already strdup()'ed */
+			if (q.vma_name_size)
+				name = name ?: strdup(vma_name);
+			r->vma_name = name;
+
+			if (q.build_id_size) {
+				r->build_id_sz = q.build_id_size;
+				memcpy(r->build_id, build_id_buf, q.build_id_size);
+			}
+		}
+	}
+
+	close(fd);
+	return 0;
+}
+
+static int resolve_addrs_parse(void)
+{
+	size_t vma_start, vma_end, vma_offset, ino;
+	uint32_t dev_major, dev_minor;
+	char perms[4], buf[32], vma_name[PATH_MAX], fbuf[4096];
+	FILE *f;
+	int err, idx = 0;
+	struct addr *a = &addrs[idx];
+	struct resolved_addr *r;
+
+	snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+	f = fopen(buf, "r");
+	if (!f) {
+		err = -errno;
+		fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+		return err;
+	}
+
+	err = setvbuf(f, fbuf, _IOFBF, sizeof(fbuf));
+	if (err) {
+		err = -errno;
+		fprintf(stderr, "Failed to set custom file buffer size: %d\n", err);
+		fclose(f);
+		return err;
+	}
+
+	while ((err = fscanf(f, "%zx-%zx %c%c%c%c %zx %x:%x %zu %[^\n]\n",
+			     &vma_start, &vma_end,
+			     &perms[0], &perms[1], &perms[2], &perms[3],
+			     &vma_offset, &dev_major, &dev_minor, &ino, vma_name)) >= 10) {
+		const char *name = NULL;
+
+		/* skip addrs before current vma, they stay unresolved */
+		for (; idx < addr_cnt && a->addr < vma_start; idx++, a++) {
+		}
+
+		/* resolve all addrs within current vma now */
+		for (; idx < addr_cnt && a->addr < vma_end; idx++, a++) {
+			r = &resolved[a->idx];
+			r->file_off = a->addr - vma_start + vma_offset;
+
+			/* reuse name, if it was already strdup()'ed */
+			if (err > 10)
+				name = name ?: strdup(vma_name);
+			else
+				name = NULL;
+			r->vma_name = name;
+		}
+
+		/* ran out of addrs to resolve, stop early */
+		if (idx >= addr_cnt)
+			break;
+	}
+
+	fclose(f);
+	return 0;
+}
+
+static int cmp_by_addr(const void *a, const void *b)
+{
+	const struct addr *x = a, *y = b;
+
+	if (x->addr != y->addr)
+		return x->addr < y->addr ? -1 : 1;
+	return x->idx < y->idx ? -1 : 1;
+}
+
+static int cmp_by_idx(const void *a, const void *b)
+{
+	const struct addr *x = a, *y = b;
+
+	return x->idx < y->idx ? -1 : 1;
+}
+
+int main(int argc, char **argv)
+{
+	FILE *f;
+	int err, i;
+	unsigned long long addr;
+	uint64_t start_ns;
+	double total_ns;
+
+	/* Parse command line arguments */
+	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+	if (err)
+		return err;
+
+	if (pid <= 0 || !addrs_path) {
+		fprintf(stderr, "Please provide PID and file with addresses to process!\n");
+		exit(1);
+	}
+
+	if (verbose) {
+		fprintf(stderr, "PID: %d\n", pid);
+		fprintf(stderr, "PATH: %s\n", addrs_path);
+	}
+
+	f = fopen(addrs_path, "r");
+	if (!f) {
+		err = -errno;
+		fprintf(stderr, "Failed to open '%s': %d\n", addrs_path, err);
+		goto out;
+	}
+
+	while ((err = fscanf(f, "%llx\n", &addr)) == 1) {
+		if (addr_cnt == addr_cap) {
+			addr_cap = addr_cap == 0 ? 16 : (addr_cap * 3 / 2);
+			addrs = realloc(addrs, sizeof(*addrs) * addr_cap);
+			memset(addrs + addr_cnt, 0, (addr_cap - addr_cnt) * sizeof(*addrs));
+		}
+
+		addrs[addr_cnt].addr = addr;
+		addrs[addr_cnt].idx = addr_cnt;
+
+		addr_cnt++;
+	}
+	if (verbose)
+		fprintf(stderr, "READ %zu addrs!\n", addr_cnt);
+	if (!feof(f)) {
+		fprintf(stderr, "Failure parsing full list of addresses at '%s'!\n", addrs_path);
+		err = -EINVAL;
+		fclose(f);
+		goto out;
+	}
+	fclose(f);
+	if (addr_cnt == 0) {
+		fprintf(stderr, "No addresses provided, bailing out!\n");
+		err = -ENOENT;
+		goto out;
+	}
+
+	resolved = calloc(addr_cnt, sizeof(*resolved));
+
+	qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_addr);
+	if (verbose) {
+		fprintf(stderr, "SORTED ADDRS (%zu):\n", addr_cnt);
+		for (i = 0; i < addr_cnt; i++) {
+			fprintf(stderr, "ADDR #%d: %#llx\n", addrs[i].idx, addrs[i].addr);
+		}
+	}
+
+	start_ns = get_time_ns();
+	for (i = bench_runs ?: 1; i > 0; i--) {
+		if (use_ioctl) {
+			err = resolve_addrs_ioctl();
+		} else {
+			err = resolve_addrs_parse();
+		}
+		if (err) {
+			fprintf(stderr, "Failed to resolve addrs: %d!\n", err);
+			goto out;
+		}
+	}
+	total_ns = get_time_ns() - start_ns;
+
+	if (bench_runs) {
+		fprintf(stderr, "BENCHMARK MODE. RUNS: %d TOTAL TIME (ms): %.3lf TIME/RUN (ms): %.3lf TIME/ADDR (us): %.3lf\n",
+			bench_runs, total_ns / 1000000.0, total_ns / bench_runs / 1000000.0,
+			total_ns / bench_runs / addr_cnt / 1000.0);
+	}
+
+	/* sort them back into the original order */
+	qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_idx);
+
+	if (!quiet) {
+		printf("RESOLVED ADDRS (%zu):\n", addr_cnt);
+		for (i = 0; i < addr_cnt; i++) {
+			const struct addr *a = &addrs[i];
+			const struct resolved_addr *r = &resolved[a->idx];
+
+			if (r->file_off) {
+				printf("RESOLVED   #%d: %#llx -> OFF %#llx",
+					a->idx, a->addr, r->file_off);
+				if (r->vma_name)
+					printf(" NAME %s", r->vma_name);
+				if (r->build_id_sz) {
+					char build_id_str[41];
+					int j;
+
+					for (j = 0; j < r->build_id_sz; j++)
+						sprintf(&build_id_str[j * 2], "%02hhx", r->build_id[j]);
+					printf(" BUILDID %s", build_id_str);
+				}
+				printf("\n");
+			} else {
+				printf("UNRESOLVED #%d: %#llx\n", a->idx, a->addr);
+			}
+		}
+	}
+out:
+	free(addrs);
+	free(addrs_path);
+	free(resolved);
+
+	return err < 0 ? -err : 0;
+}
-- 
2.43.0



* [PATCH v2 7/9] tools: sync uapi/linux/fs.h header into tools subdir
From: Andrii Nakryiko @ 2024-05-24  4:10 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
	Andrii Nakryiko

We need this UAPI header in the tools/include subdirectory to use it
from BPF selftests.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/include/uapi/linux/fs.h | 550 ++++++++++++++++++++++++++++++++++
 1 file changed, 550 insertions(+)
 create mode 100644 tools/include/uapi/linux/fs.h

diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
new file mode 100644
index 000000000000..7306022780d3
--- /dev/null
+++ b/tools/include/uapi/linux/fs.h
@@ -0,0 +1,550 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_FS_H
+#define _UAPI_LINUX_FS_H
+
+/*
+ * This file has definitions for some important file table structures
+ * and constants and structures used by various generic file system
+ * ioctl's.  Please do not make any changes in this file before
+ * sending patches for review to linux-fsdevel@vger.kernel.org and
+ * linux-api@vger.kernel.org.
+ */
+
+#include <linux/limits.h>
+#include <linux/ioctl.h>
+#include <linux/types.h>
+#ifndef __KERNEL__
+#include <linux/fscrypt.h>
+#endif
+
+/* Use of MS_* flags within the kernel is restricted to core mount(2) code. */
+#if !defined(__KERNEL__)
+#include <linux/mount.h>
+#endif
+
+/*
+ * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
+ * the file limit at runtime and only root can increase the per-process
+ * nr_file rlimit, so it's safe to set up a ridiculously high absolute
+ * upper limit on files-per-process.
+ *
+ * Some programs (notably those using select()) may have to be 
+ * recompiled to take full advantage of the new limits..  
+ */
+
+/* Fixed constants first: */
+#undef NR_OPEN
+#define INR_OPEN_CUR 1024	/* Initial setting for nfile rlimits */
+#define INR_OPEN_MAX 4096	/* Hard limit for nfile rlimits */
+
+#define BLOCK_SIZE_BITS 10
+#define BLOCK_SIZE (1<<BLOCK_SIZE_BITS)
+
+#define SEEK_SET	0	/* seek relative to beginning of file */
+#define SEEK_CUR	1	/* seek relative to current file position */
+#define SEEK_END	2	/* seek relative to end of file */
+#define SEEK_DATA	3	/* seek to the next data */
+#define SEEK_HOLE	4	/* seek to the next hole */
+#define SEEK_MAX	SEEK_HOLE
+
+#define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
+#define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
+#define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+
+struct file_clone_range {
+	__s64 src_fd;
+	__u64 src_offset;
+	__u64 src_length;
+	__u64 dest_offset;
+};
+
+struct fstrim_range {
+	__u64 start;
+	__u64 len;
+	__u64 minlen;
+};
+
+/*
+ * We include a length field because some filesystems (vfat) have an identifier
+ * that we do want to expose as a UUID, but doesn't have the standard length.
+ *
+ * We use a fixed size buffer because this interface will, by fiat, never
+ * support "UUIDs" longer than 16 bytes; we don't want to force all downstream
+ * users to have to deal with that.
+ */
+struct fsuuid2 {
+	__u8	len;
+	__u8	uuid[16];
+};
+
+struct fs_sysfs_path {
+	__u8			len;
+	__u8			name[128];
+};
+
+/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
+#define FILE_DEDUPE_RANGE_SAME		0
+#define FILE_DEDUPE_RANGE_DIFFERS	1
+
+/* from struct btrfs_ioctl_file_extent_same_info */
+struct file_dedupe_range_info {
+	__s64 dest_fd;		/* in - destination file */
+	__u64 dest_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file. */
+	/* status of this dedupe operation:
+	 * < 0 for error
+	 * == FILE_DEDUPE_RANGE_SAME if dedupe succeeds
+	 * == FILE_DEDUPE_RANGE_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;		/* must be zero */
+};
+
+/* from struct btrfs_ioctl_file_extent_same_args */
+struct file_dedupe_range {
+	__u64 src_offset;	/* in - start of extent in source */
+	__u64 src_length;	/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;	/* must be zero */
+	__u32 reserved2;	/* must be zero */
+	struct file_dedupe_range_info info[];
+};
+
+/* And dynamically-tunable limits and defaults: */
+struct files_stat_struct {
+	unsigned long nr_files;		/* read only */
+	unsigned long nr_free_files;	/* read only */
+	unsigned long max_files;		/* tunable */
+};
+
+struct inodes_stat_t {
+	long nr_inodes;
+	long nr_unused;
+	long dummy[5];		/* padding for sysctl ABI compatibility */
+};
+
+
+#define NR_FILE  8192	/* this can well be larger on a larger system */
+
+/*
+ * Structure for FS_IOC_FSGETXATTR[A] and FS_IOC_FSSETXATTR.
+ */
+struct fsxattr {
+	__u32		fsx_xflags;	/* xflags field value (get/set) */
+	__u32		fsx_extsize;	/* extsize field value (get/set)*/
+	__u32		fsx_nextents;	/* nextents field value (get)	*/
+	__u32		fsx_projid;	/* project identifier (get/set) */
+	__u32		fsx_cowextsize;	/* CoW extsize field value (get/set)*/
+	unsigned char	fsx_pad[8];
+};
+
+/*
+ * Flags for the fsx_xflags field
+ */
+#define FS_XFLAG_REALTIME	0x00000001	/* data in realtime volume */
+#define FS_XFLAG_PREALLOC	0x00000002	/* preallocated file extents */
+#define FS_XFLAG_IMMUTABLE	0x00000008	/* file cannot be modified */
+#define FS_XFLAG_APPEND		0x00000010	/* all writes append */
+#define FS_XFLAG_SYNC		0x00000020	/* all writes synchronous */
+#define FS_XFLAG_NOATIME	0x00000040	/* do not update access time */
+#define FS_XFLAG_NODUMP		0x00000080	/* do not include in backups */
+#define FS_XFLAG_RTINHERIT	0x00000100	/* create with rt bit set */
+#define FS_XFLAG_PROJINHERIT	0x00000200	/* create with parents projid */
+#define FS_XFLAG_NOSYMLINKS	0x00000400	/* disallow symlink creation */
+#define FS_XFLAG_EXTSIZE	0x00000800	/* extent size allocator hint */
+#define FS_XFLAG_EXTSZINHERIT	0x00001000	/* inherit inode extent size */
+#define FS_XFLAG_NODEFRAG	0x00002000	/* do not defragment */
+#define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
+#define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
+#define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+#define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
+
+/* the read-only stuff doesn't really belong here, but any other place is
+   probably as bad and I don't want to create yet another include file. */
+
+#define BLKROSET   _IO(0x12,93)	/* set device read-only (0 = read-write) */
+#define BLKROGET   _IO(0x12,94)	/* get read-only status (0 = read_write) */
+#define BLKRRPART  _IO(0x12,95)	/* re-read partition table */
+#define BLKGETSIZE _IO(0x12,96)	/* return device size /512 (long *arg) */
+#define BLKFLSBUF  _IO(0x12,97)	/* flush buffer cache */
+#define BLKRASET   _IO(0x12,98)	/* set read ahead for block device */
+#define BLKRAGET   _IO(0x12,99)	/* get current read ahead setting */
+#define BLKFRASET  _IO(0x12,100)/* set filesystem (mm/filemap.c) read-ahead */
+#define BLKFRAGET  _IO(0x12,101)/* get filesystem (mm/filemap.c) read-ahead */
+#define BLKSECTSET _IO(0x12,102)/* set max sectors per request (ll_rw_blk.c) */
+#define BLKSECTGET _IO(0x12,103)/* get max sectors per request (ll_rw_blk.c) */
+#define BLKSSZGET  _IO(0x12,104)/* get block device sector size */
+#if 0
+#define BLKPG      _IO(0x12,105)/* See blkpg.h */
+
+/* Some people are morons.  Do not use sizeof! */
+
+#define BLKELVGET  _IOR(0x12,106,size_t)/* elevator get */
+#define BLKELVSET  _IOW(0x12,107,size_t)/* elevator set */
+/* This was here just to show that the number is taken -
+   probably all these _IO(0x12,*) ioctls should be moved to blkpg.h. */
+#endif
+/* A jump here: 108-111 have been used for various private purposes. */
+#define BLKBSZGET  _IOR(0x12,112,size_t)
+#define BLKBSZSET  _IOW(0x12,113,size_t)
+#define BLKGETSIZE64 _IOR(0x12,114,size_t)	/* return device size in bytes (u64 *arg) */
+#define BLKTRACESETUP _IOWR(0x12,115,struct blk_user_trace_setup)
+#define BLKTRACESTART _IO(0x12,116)
+#define BLKTRACESTOP _IO(0x12,117)
+#define BLKTRACETEARDOWN _IO(0x12,118)
+#define BLKDISCARD _IO(0x12,119)
+#define BLKIOMIN _IO(0x12,120)
+#define BLKIOOPT _IO(0x12,121)
+#define BLKALIGNOFF _IO(0x12,122)
+#define BLKPBSZGET _IO(0x12,123)
+#define BLKDISCARDZEROES _IO(0x12,124)
+#define BLKSECDISCARD _IO(0x12,125)
+#define BLKROTATIONAL _IO(0x12,126)
+#define BLKZEROOUT _IO(0x12,127)
+#define BLKGETDISKSEQ _IOR(0x12,128,__u64)
+/*
+ * A jump here: 130-136 are reserved for zoned block devices
+ * (see uapi/linux/blkzoned.h)
+ */
+
+#define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
+#define FIBMAP	   _IO(0x00,1)	/* bmap access */
+#define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
+#define FIFREEZE	_IOWR('X', 119, int)	/* Freeze */
+#define FITHAW		_IOWR('X', 120, int)	/* Thaw */
+#define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
+#define FICLONE		_IOW(0x94, 9, int)
+#define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
+#define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)
+
+#define FSLABEL_MAX 256	/* Max chars for the interface; each fs may differ */
+
+#define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
+#define	FS_IOC_SETFLAGS			_IOW('f', 2, long)
+#define	FS_IOC_GETVERSION		_IOR('v', 1, long)
+#define	FS_IOC_SETVERSION		_IOW('v', 2, long)
+#define FS_IOC_FIEMAP			_IOWR('f', 11, struct fiemap)
+#define FS_IOC32_GETFLAGS		_IOR('f', 1, int)
+#define FS_IOC32_SETFLAGS		_IOW('f', 2, int)
+#define FS_IOC32_GETVERSION		_IOR('v', 1, int)
+#define FS_IOC32_SETVERSION		_IOW('v', 2, int)
+#define FS_IOC_FSGETXATTR		_IOR('X', 31, struct fsxattr)
+#define FS_IOC_FSSETXATTR		_IOW('X', 32, struct fsxattr)
+#define FS_IOC_GETFSLABEL		_IOR(0x94, 49, char[FSLABEL_MAX])
+#define FS_IOC_SETFSLABEL		_IOW(0x94, 50, char[FSLABEL_MAX])
+/* Returns the external filesystem UUID, the same one blkid returns */
+#define FS_IOC_GETFSUUID		_IOR(0x15, 0, struct fsuuid2)
+/*
+ * Returns the path component under /sys/fs/ that refers to this filesystem;
+ * also /sys/kernel/debug/ for filesystems with debugfs exports
+ */
+#define FS_IOC_GETFSSYSFSPATH		_IOR(0x15, 1, struct fs_sysfs_path)
+
+/*
+ * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS)
+ *
+ * Note: for historical reasons, these flags were originally used and
+ * defined for use by ext2/ext3, and then other file systems started
+ * using these flags so they wouldn't need to write their own version
+ * of chattr/lsattr (which was shipped as part of e2fsprogs).  You
+ * should think twice before trying to use these flags in new
+ * contexts, or trying to assign these flags, since they are used both
+ * as the UAPI and the on-disk encoding for ext2/3/4.  Also, we are
+ * almost out of 32-bit flags.  :-)
+ *
+ * We have recently hoisted FS_IOC_FSGETXATTR / FS_IOC_FSSETXATTR from
+ * XFS to the generic FS level interface.  This uses a structure that
+ * has padding and hence has more room to grow, so it may be more
+ * appropriate for many new use cases.
+ *
+ * Please do not change these flags or interfaces before checking with
+ * linux-fsdevel@vger.kernel.org and linux-api@vger.kernel.org.
+ */
+#define	FS_SECRM_FL			0x00000001 /* Secure deletion */
+#define	FS_UNRM_FL			0x00000002 /* Undelete */
+#define	FS_COMPR_FL			0x00000004 /* Compress file */
+#define FS_SYNC_FL			0x00000008 /* Synchronous updates */
+#define FS_IMMUTABLE_FL			0x00000010 /* Immutable file */
+#define FS_APPEND_FL			0x00000020 /* writes to file may only append */
+#define FS_NODUMP_FL			0x00000040 /* do not dump file */
+#define FS_NOATIME_FL			0x00000080 /* do not update atime */
+/* Reserved for compression usage... */
+#define FS_DIRTY_FL			0x00000100
+#define FS_COMPRBLK_FL			0x00000200 /* One or more compressed clusters */
+#define FS_NOCOMP_FL			0x00000400 /* Don't compress */
+/* End compression flags --- maybe not all used */
+#define FS_ENCRYPT_FL			0x00000800 /* Encrypted file */
+#define FS_BTREE_FL			0x00001000 /* btree format dir */
+#define FS_INDEX_FL			0x00001000 /* hash-indexed directory */
+#define FS_IMAGIC_FL			0x00002000 /* AFS directory */
+#define FS_JOURNAL_DATA_FL		0x00004000 /* Reserved for ext3 */
+#define FS_NOTAIL_FL			0x00008000 /* file tail should not be merged */
+#define FS_DIRSYNC_FL			0x00010000 /* dirsync behaviour (directories only) */
+#define FS_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
+#define FS_HUGE_FILE_FL			0x00040000 /* Reserved for ext4 */
+#define FS_EXTENT_FL			0x00080000 /* Extents */
+#define FS_VERITY_FL			0x00100000 /* Verity protected inode */
+#define FS_EA_INODE_FL			0x00200000 /* Inode used for large EA */
+#define FS_EOFBLOCKS_FL			0x00400000 /* Reserved for ext4 */
+#define FS_NOCOW_FL			0x00800000 /* Do not cow file */
+#define FS_DAX_FL			0x02000000 /* Inode is DAX */
+#define FS_INLINE_DATA_FL		0x10000000 /* Reserved for ext4 */
+#define FS_PROJINHERIT_FL		0x20000000 /* Create with parents projid */
+#define FS_CASEFOLD_FL			0x40000000 /* Folder is case insensitive */
+#define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */
+
+#define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
+#define FS_FL_USER_MODIFIABLE		0x000380FF /* User modifiable flags */
+
+
+#define SYNC_FILE_RANGE_WAIT_BEFORE	1
+#define SYNC_FILE_RANGE_WRITE		2
+#define SYNC_FILE_RANGE_WAIT_AFTER	4
+#define SYNC_FILE_RANGE_WRITE_AND_WAIT	(SYNC_FILE_RANGE_WRITE | \
+					 SYNC_FILE_RANGE_WAIT_BEFORE | \
+					 SYNC_FILE_RANGE_WAIT_AFTER)
+
+/*
+ * Flags for preadv2/pwritev2:
+ */
+
+typedef int __bitwise __kernel_rwf_t;
+
+/* high priority request, poll if possible */
+#define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
+
+/* per-IO O_DSYNC */
+#define RWF_DSYNC	((__force __kernel_rwf_t)0x00000002)
+
+/* per-IO O_SYNC */
+#define RWF_SYNC	((__force __kernel_rwf_t)0x00000004)
+
+/* per-IO, return -EAGAIN if operation would block */
+#define RWF_NOWAIT	((__force __kernel_rwf_t)0x00000008)
+
+/* per-IO O_APPEND */
+#define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
+
+/* per-IO negation of O_APPEND */
+#define RWF_NOAPPEND	((__force __kernel_rwf_t)0x00000020)
+
+/* mask of flags supported by the kernel */
+#define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
+			 RWF_APPEND | RWF_NOAPPEND)
+
+#define PROCFS_IOCTL_MAGIC 'f'
+
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
+
+/* Bitmasks provided in pm_scan_args masks and reported in page_region.categories. */
+#define PAGE_IS_WPALLOWED	(1 << 0)
+#define PAGE_IS_WRITTEN		(1 << 1)
+#define PAGE_IS_FILE		(1 << 2)
+#define PAGE_IS_PRESENT		(1 << 3)
+#define PAGE_IS_SWAPPED		(1 << 4)
+#define PAGE_IS_PFNZERO		(1 << 5)
+#define PAGE_IS_HUGE		(1 << 6)
+#define PAGE_IS_SOFT_DIRTY	(1 << 7)
+
+/*
+ * struct page_region - Page region with flags
+ * @start:	Start of the region
+ * @end:	End of the region (exclusive)
+ * @categories:	PAGE_IS_* category bitmask for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 end;
+	__u64 categories;
+};
+
+/* Flags for PAGEMAP_SCAN ioctl */
+#define PM_SCAN_WP_MATCHING	(1 << 0)	/* Write protect the pages matched. */
+#define PM_SCAN_CHECK_WPASYNC	(1 << 1)	/* Abort the scan when a non-WP-enabled page is found. */
+
+/*
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:		Size of the structure
+ * @flags:		Flags for the IOCTL
+ * @start:		Starting address of the region
+ * @end:		Ending address of the region
+ * @walk_end:		Address where the scan stopped (written by kernel).
+ *			walk_end == end (address tags cleared) informs that the scan completed on entire range.
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional limit for number of returned pages (0 = disabled)
+ * @category_inverted:	PAGE_IS_* categories which values match if 0 instead of 1
+ * @category_mask:	Skip pages for which any category doesn't match
+ * @category_anyof_mask: Skip pages for which no category matches
+ * @return_mask:	PAGE_IS_* categories that are to be reported in `page_region`s returned
+ */
+struct pm_scan_arg {
+	__u64 size;
+	__u64 flags;
+	__u64 start;
+	__u64 end;
+	__u64 walk_end;
+	__u64 vec;
+	__u64 vec_len;
+	__u64 max_pages;
+	__u64 category_inverted;
+	__u64 category_mask;
+	__u64 category_anyof_mask;
+	__u64 return_mask;
+};
+
+/* /proc/<pid>/maps ioctl */
+#define PROCMAP_QUERY	_IOWR(PROCFS_IOCTL_MAGIC, 17, struct procmap_query)
+
+enum procmap_query_flags {
+	/*
+	 * VMA permission flags.
+	 *
+	 * Can be used as part of procmap_query.query_flags field to look up
+	 * only VMAs satisfying specified subset of permissions. E.g., specifying
+	 * PROCMAP_QUERY_VMA_READABLE only will return both readable and read/write VMAs,
+	 * while having PROCMAP_QUERY_VMA_READABLE | PROCMAP_QUERY_VMA_WRITABLE will only
+	 * return read/write VMAs, though both executable/non-executable and
+	 * private/shared will be ignored.
+	 *
+	 * PROCMAP_QUERY_VMA_* flags are also returned in procmap_query.vma_flags
+	 * field to specify actual VMA permissions.
+	 */
+	PROCMAP_QUERY_VMA_READABLE		= 0x01,
+	PROCMAP_QUERY_VMA_WRITABLE		= 0x02,
+	PROCMAP_QUERY_VMA_EXECUTABLE		= 0x04,
+	PROCMAP_QUERY_VMA_SHARED		= 0x08,
+	/*
+	 * Query modifier flags.
+	 *
+	 * By default VMA that covers provided address is returned, or -ENOENT
+	 * is returned. With PROCMAP_QUERY_COVERING_OR_NEXT_VMA flag set, closest
+	 * VMA with vma_start > addr will be returned if no covering VMA is
+	 * found.
+	 *
+	 * PROCMAP_QUERY_FILE_BACKED_VMA instructs query to consider only VMAs that
+	 * have file backing. Can be combined with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
+	 * to iterate all VMAs with file backing.
+	 */
+	PROCMAP_QUERY_COVERING_OR_NEXT_VMA	= 0x10,
+	PROCMAP_QUERY_FILE_BACKED_VMA		= 0x20,
+};
+
+/*
+ * Input/output argument structure passed into ioctl() call. It can be used
+ * to query a set of VMAs (Virtual Memory Areas) of a process.
+ *
+ * Each field can be one of three kinds, marked in a short comment to the
+ * right of the field:
+ *   - "in", input argument, user has to provide this value, kernel doesn't modify it;
+ *   - "out", output argument, kernel sets this field with VMA data;
+ *   - "in/out", input and output argument; user provides initial value (used
+ *     to specify maximum allowable buffer size), and kernel sets it to actual
+ *     amount of data written (or zero, if there is no data).
+ *
+ * If a matching VMA is found (according to criteria specified by
+ * query_addr/query_flags), all the out fields are filled out, and ioctl()
+ * returns 0. If there is no matching VMA, -ENOENT will be returned.
+ * In case of any other error, negative error code other than -ENOENT is
+ * returned.
+ *
+ * Most of the data is similar to the one returned as text in /proc/<pid>/maps
+ * file, but procmap_query provides more querying flexibility. There are no
+ * consistency guarantees between subsequent ioctl() calls, but data returned
+ * for matched VMA is self-consistent.
+ */
+struct procmap_query {
+	/* Query struct size, for backwards/forward compatibility */
+	__u64 size;
+	/*
+	 * Query flags, a combination of enum procmap_query_flags values.
+	 * Defines query filtering and behavior, see enum procmap_query_flags.
+	 *
+	 * Input argument, provided by user. Kernel doesn't modify it.
+	 */
+	__u64 query_flags;		/* in */
+	/*
+	 * Query address. By default, VMA that covers this address will
+	 * be looked up. PROCMAP_QUERY_* flags above modify this default
+	 * behavior further.
+	 *
+	 * Input argument, provided by user. Kernel doesn't modify it.
+	 */
+	__u64 query_addr;		/* in */
+	/* VMA starting (inclusive) and ending (exclusive) address, if VMA is found. */
+	__u64 vma_start;		/* out */
+	__u64 vma_end;			/* out */
+	/* VMA permissions flags. A combination of PROCMAP_QUERY_VMA_* flags. */
+	__u64 vma_flags;		/* out */
+	/*
+	 * VMA file offset. If VMA has file backing, this specifies offset
+	 * within the file that VMA's start address corresponds to.
+	 * Is set to zero if VMA has no backing file.
+	 */
+	__u64 vma_offset;		/* out */
+	/* Backing file's inode number, or zero, if VMA has no backing file. */
+	__u64 inode;			/* out */
+	/* Backing file's device major/minor number, or zero, if VMA has no backing file. */
+	__u32 dev_major;		/* out */
+	__u32 dev_minor;		/* out */
+	/*
+	 * If set to non-zero value, signals the request to return VMA name
+	 * (i.e., VMA's backing file's absolute path, with " (deleted)" suffix
+	 * appended, if file was unlinked from FS) for matched VMA. VMA name
+	 * can also be some special name (e.g., "[heap]", "[stack]") or could
+	 * be even user-supplied with prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME).
+	 *
+	 * Kernel will set this field to zero, if VMA has no associated name.
+	 * Otherwise kernel will return actual amount of bytes filled in
+	 * user-supplied buffer (see vma_name_addr field below), including the
+	 * terminating zero.
+	 *
+	 * If VMA name is longer than the user-supplied maximum buffer size,
+	 * -E2BIG error is returned.
+	 *
+	 * If this field is set to non-zero value, vma_name_addr should point
+	 * to valid user space memory buffer of at least vma_name_size bytes.
+	 * If set to zero, vma_name_addr should be set to zero as well.
+	 */
+	__u32 vma_name_size;		/* in/out */
+	/*
+	 * If set to non-zero value, signals the request to extract and return
+	 * VMA's backing file's build ID, if the backing file is an ELF file
+	 * and it contains embedded build ID.
+	 *
+	 * Kernel will set this field to zero, if VMA has no backing file,
+	 * backing file is not an ELF file, or ELF file has no build ID
+	 * embedded.
+	 *
+	 * Build ID is a binary value (not a string). Kernel will set
+	 * build_id_size field to exact number of bytes used for build ID.
+	 * If build ID is requested and present, but needs more bytes than
+	 * user-supplied maximum buffer size (see build_id_addr field below),
+	 * -E2BIG error will be returned.
+	 *
+	 * If this field is set to non-zero value, build_id_addr should point
+	 * to valid user space memory buffer of at least build_id_size bytes.
+	 * If set to zero, build_id_addr should be set to zero as well.
+	 */
+	__u32 build_id_size;		/* in/out */
+	/*
+	 * User-supplied address of a buffer of at least vma_name_size bytes
+	 * for kernel to fill with matched VMA's name (see vma_name_size field
+	 * description above for details).
+	 *
+	 * Should be set to zero if VMA name should not be returned.
+	 */
+	__u64 vma_name_addr;		/* in */
+	/*
+	 * User-supplied address of a buffer of at least build_id_size bytes
+	 * for kernel to fill with matched VMA's ELF build ID, if available
+	 * (see build_id_size field description above for details).
+	 *
+	 * Should be set to zero if build ID should not be returned.
+	 */
+	__u64 build_id_addr;		/* in */
+};
+
+#endif /* _UAPI_LINUX_FS_H */
-- 
2.43.0


^ permalink raw reply related	[relevance 8%]

* [PATCH v2 6/9] docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence
  2024-05-24  4:10 12% [PATCH v2 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-05-24  4:10  4% ` [PATCH v2 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API Andrii Nakryiko
@ 2024-05-24  4:10  8% ` Andrii Nakryiko
  2024-05-24  4:10  8% ` [PATCH v2 7/9] tools: sync uapi/linux/fs.h header into tools subdir Andrii Nakryiko
  2024-05-24  4:10  9% ` [PATCH v2 9/9] selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko
  5 siblings, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-24  4:10 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
	Andrii Nakryiko

Call out the existence of the PROCMAP_QUERY ioctl() in the section
describing the /proc/PID/maps file in the documentation. We refer the user
to the UAPI header for low-level details of this programmatic interface.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 Documentation/filesystems/proc.rst | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 7c3a565ffbef..f2bbd1e86204 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -443,6 +443,14 @@ is not associated with a file:
 
  or if empty, the mapping is anonymous.
 
+Starting with the 6.11 kernel, /proc/PID/maps provides an alternative
+ioctl()-based API that gives the ability to flexibly and efficiently query
+and filter individual VMAs.  This interface is binary and is meant for more
+efficient programmatic use. `struct procmap_query`, defined in the linux/fs.h
+UAPI header, serves as an input/output argument to the `PROCMAP_QUERY` ioctl()
+command. See comments in the linux/fs.h UAPI header for details on query
+semantics, supported flags, data returned, and general API usage information.
+
 The /proc/PID/smaps is an extension based on maps, showing the memory
 consumption for each of the process's mappings. For each mapping (aka Virtual
 Memory Area, or VMA) there is a series of lines such as the following::
-- 
2.43.0


^ permalink raw reply related	[relevance 8%]

* [PATCH v2 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
  2024-05-24  4:10 12% [PATCH v2 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
  2024-05-24  4:10  4% ` [PATCH v2 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock Andrii Nakryiko
  2024-05-24  4:10 17% ` [PATCH v2 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
@ 2024-05-24  4:10  4% ` Andrii Nakryiko
  2024-05-24  4:10  8% ` [PATCH v2 6/9] docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence Andrii Nakryiko
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-24  4:10 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
	Andrii Nakryiko

Attempt to use the RCU-protected per-VMA lock when looking up the requested
VMA as much as possible, only falling back to mmap_lock if the per-VMA lock
fails. This is done so that querying of VMAs doesn't interfere with
other critical tasks, like page fault handling.

This has been suggested by mm folks, and we make use of a newly added
internal API that works like find_vma(), but tries to use per-VMA lock.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 fs/proc/task_mmu.c | 42 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8ad547efd38d..2b14d06d1def 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -389,12 +389,30 @@ static int pid_maps_open(struct inode *inode, struct file *file)
 )
 
 static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
-						 unsigned long addr, u32 flags)
+						 unsigned long addr, u32 flags,
+						 bool *mm_locked)
 {
 	struct vm_area_struct *vma;
+	bool mmap_locked;
+
+	*mm_locked = mmap_locked = false;
 
 next_vma:
-	vma = find_vma(mm, addr);
+	if (!mmap_locked) {
+		/* if we haven't yet acquired mmap_lock, try to use less disruptive per-VMA */
+		vma = find_and_lock_vma_rcu(mm, addr);
+		if (IS_ERR(vma)) {
+			/* failed to take per-VMA lock, fallback to mmap_lock */
+			if (mmap_read_lock_killable(mm))
+				return ERR_PTR(-EINTR);
+
+			*mm_locked = mmap_locked = true;
+			vma = find_vma(mm, addr);
+		}
+	} else {
+		/* if we have mmap_lock, get through the search as fast as possible */
+		vma = find_vma(mm, addr);
+	}
 
 	/* no VMA found */
 	if (!vma)
@@ -428,18 +446,25 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
 skip_vma:
 	/*
 	 * If the user needs closest matching VMA, keep iterating.
+	 * But before we proceed we might need to unlock current VMA.
 	 */
 	addr = vma->vm_end;
+	if (!mmap_locked)
+		vma_end_read(vma);
 	if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
 		goto next_vma;
 no_vma:
-	mmap_read_unlock(mm);
+	if (mmap_locked)
+		mmap_read_unlock(mm);
 	return ERR_PTR(-ENOENT);
 }
 
-static void unlock_vma(struct vm_area_struct *vma)
+static void unlock_vma(struct vm_area_struct *vma, bool mm_locked)
 {
-	mmap_read_unlock(vma->vm_mm);
+	if (mm_locked)
+		mmap_read_unlock(vma->vm_mm);
+	else
+		vma_end_read(vma);
 }
 
 static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
@@ -447,6 +472,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	struct procmap_query karg;
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
+	bool mm_locked;
 	const char *name = NULL;
 	char *name_buf = NULL;
 	__u64 usize;
@@ -475,7 +501,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	if (!mm || !mmget_not_zero(mm))
 		return -ESRCH;
 
-	vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
+	vma = query_matching_vma(mm, karg.query_addr, karg.query_flags, &mm_locked);
 	if (IS_ERR(vma)) {
 		mmput(mm);
 		return PTR_ERR(vma);
@@ -542,7 +568,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	}
 
 	/* unlock vma/mm_struct and put mm_struct before copying data to user */
-	unlock_vma(vma);
+	unlock_vma(vma, mm_locked);
 	mmput(mm);
 
 	if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
@@ -558,7 +584,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	return 0;
 
 out:
-	unlock_vma(vma);
+	unlock_vma(vma, mm_locked);
 	mmput(mm);
 	kfree(name_buf);
 	return err;
-- 
2.43.0


^ permalink raw reply related	[relevance 4%]

* [PATCH v2 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-24  4:10 12% [PATCH v2 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
  2024-05-24  4:10  4% ` [PATCH v2 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock Andrii Nakryiko
@ 2024-05-24  4:10 17% ` Andrii Nakryiko
  2024-05-24  4:10  4% ` [PATCH v2 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API Andrii Nakryiko
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-24  4:10 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
	Andrii Nakryiko

/proc/<pid>/maps file is extremely useful in practice for various tasks
involving figuring out process memory layout, what files are backing any
given memory range, etc. One important class of applications that
absolutely rely on this are profilers/stack symbolizers (perf tool being one
of them). Patterns of use differ, but they generally would fall into two
categories.

In the on-demand pattern, a profiler/symbolizer would normally capture a
stack trace containing absolute memory addresses of some functions, and
would then use the /proc/<pid>/maps file to find the corresponding backing
ELF files (normally, only executable VMAs are of interest), file offsets
within them, and then continue from there to get yet more information (ELF
symbols, DWARF information) to get human-readable symbolic information.
This pattern is used by Meta's fleet-wide profiler, as one example.

In the preprocessing pattern, the application doesn't know the set of
addresses of interest, so it has to fetch all relevant VMAs (again, probably
only executable ones), store or cache them, then proceed with profiling and
stack trace capture. Once done, it would do symbolization based on the
stored VMA information. This can happen at a much later point in time.
This pattern is used by the perf tool, as an example.

In either case, there are both performance and correctness requirements
involved. This address-to-VMA information translation has to be done as
efficiently as possible, but also not miss any VMA (especially in the
case of loading/unloading shared libraries). In practice, correctness
can't be guaranteed (due to process dying before VMA data can be
captured, or shared library being unloaded, etc), but any effort to
maximize the chance of finding the VMA is appreciated.

Unfortunately, for all the /proc/<pid>/maps file universality and
usefulness, it doesn't fit the above use cases 100%.

First, its main purpose is to emit all VMAs sequentially, but in
practice captured addresses would fall only into a smaller subset of all
the process' VMAs, mainly containing executable text. Yet, a library would
need to parse most or all of the contents to find the needed VMAs, as there
is no way to skip VMAs that are of no use. An efficient library can do a
single linear pass, which is still relatively cheap, but it's definitely
overhead that could be avoided, given a way to do more targeted querying
of the relevant VMA information.

Second, it's a text-based interface, which makes its programmatic use from
applications and libraries more cumbersome and inefficient due to the need
to handle text parsing to get the necessary pieces of information. The
overhead is actually paid twice: by the kernel, formatting originally binary
VMA data into text, and then by the user space application, parsing it back
into binary data for further use.

For the on-demand pattern of usage, described above, another problem
when writing a generic stack trace symbolization library is an unfortunate
performance-vs-correctness tradeoff that needs to be made. The library has
to decide whether to cache parsed contents of /proc/<pid>/maps (after
initial processing) to service future requests (if the application asks to
symbolize another set of addresses for the same process, captured at some
later time, which is typical for periodic/continuous profiling cases), or
to re-open and re-parse the file for every request. In the former case,
more memory is used for the cache and there is a risk of getting stale
data if the application loads or unloads shared libraries, or otherwise
changes its set of VMAs somehow, e.g., through additional mmap() calls.
In the latter case, it's the performance hit that comes from re-opening
the file and re-parsing its contents all over again.

This patch aims to solve this problem by providing a new API built on
top of /proc/<pid>/maps. It's meant to address both the non-selectiveness
and the text nature of /proc/<pid>/maps by giving the user more control
over what sort of VMA(s) needs to be queried, while its binary interface
eliminates the overhead of text formatting (on the kernel side) and
parsing (on the user space side).

It's also designed to be extensible and forward/backward compatible by
including a required struct size field, which the user has to provide. We
use the established copy_struct_from_user() approach to handle extensibility.

The user has a choice to either get the VMA that covers the provided
address, or -ENOENT if none is found (the exact, least surprising case). Or,
with an extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA), they can
get either the VMA that covers the address (if there is one), or the
closest next VMA (i.e., the VMA with the smallest vm_start > addr). The
latter allows more efficient use but, given it could be surprising
behavior, requires an explicit opt-in.

There is another query flag that is useful for some use cases.
PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return
file-backed VMAs. Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
makes it possible to efficiently iterate only file-backed VMAs of the
process, which is what profilers/symbolizers are normally interested in.

All the above querying flags can be combined with an (also optional) set of
desired VMA permission flags. This allows, for example, iterating only
the executable subset of VMAs, which is what the preprocessing pattern used
by the perf tool would benefit from, as the assumption is that captured
stack traces would contain addresses of executable code. This saves time
by efficiently skipping non-executable VMAs altogether.

All these querying flags (modifiers) are orthogonal and can be combined
in a semantically meaningful and natural way.

Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes
sense given it's querying the same set of VMA data. It's also beneficial
because permission checks for /proc/<pid>/maps are performed once at open
time, and the actual reads of the text contents of /proc/<pid>/maps are
done without further permission checks. We piggyback on this pattern
with the ioctl()-based API as well, as that's a desired property, both
for performance reasons and for security and flexibility reasons.

Allowing an application to open an FD for /proc/self/maps without any extra
capabilities, and then passing it to some sort of profiling agent
through a Unix-domain socket, would allow such a profiling agent to not
require some of the capabilities that are otherwise expected when
opening the /proc/<pid>/maps file for *another* process. This is a
desirable property for some more restricted setups.

This new ioctl-based implementation doesn't interfere with
seq_file-based implementation of /proc/<pid>/maps textual interface, and
so could be used together or independently without paying any price for
that.

Note also, that fetching VMA name (e.g., backing file path, or special
hard-coded or user-provided names) is optional just like build ID. If
user sets vma_name_size to zero, kernel code won't attempt to retrieve
it, saving resources.

To simplify reviewing, per-VMA locking is not yet added in this patch,
but the overall code structure is ready for it and will be adjusted in
the next patch to take per-VMA locking into account.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 fs/proc/task_mmu.c      | 204 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h | 128 ++++++++++++++++++++++++-
 2 files changed, 331 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8e503a1635b7..8ad547efd38d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -375,11 +375,215 @@ static int pid_maps_open(struct inode *inode, struct file *file)
 	return do_maps_open(inode, file, &proc_pid_maps_op);
 }
 
+#define PROCMAP_QUERY_VMA_FLAGS (				\
+		PROCMAP_QUERY_VMA_READABLE |			\
+		PROCMAP_QUERY_VMA_WRITABLE |			\
+		PROCMAP_QUERY_VMA_EXECUTABLE |			\
+		PROCMAP_QUERY_VMA_SHARED			\
+)
+
+#define PROCMAP_QUERY_VALID_FLAGS_MASK (			\
+		PROCMAP_QUERY_COVERING_OR_NEXT_VMA |		\
+		PROCMAP_QUERY_FILE_BACKED_VMA |			\
+		PROCMAP_QUERY_VMA_FLAGS				\
+)
+
+static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
+						 unsigned long addr, u32 flags)
+{
+	struct vm_area_struct *vma;
+
+next_vma:
+	vma = find_vma(mm, addr);
+
+	/* no VMA found */
+	if (!vma)
+		goto no_vma;
+
+	/* user requested only file-backed VMA, keep iterating */
+	if ((flags & PROCMAP_QUERY_FILE_BACKED_VMA) && !vma->vm_file)
+		goto skip_vma;
+
+	/* VMA permissions should satisfy query flags */
+	if (flags & PROCMAP_QUERY_VMA_FLAGS) {
+		u32 perm = 0;
+
+		if (flags & PROCMAP_QUERY_VMA_READABLE)
+			perm |= VM_READ;
+		if (flags & PROCMAP_QUERY_VMA_WRITABLE)
+			perm |= VM_WRITE;
+		if (flags & PROCMAP_QUERY_VMA_EXECUTABLE)
+			perm |= VM_EXEC;
+		if (flags & PROCMAP_QUERY_VMA_SHARED)
+			perm |= VM_MAYSHARE;
+
+		if ((vma->vm_flags & perm) != perm)
+			goto skip_vma;
+	}
+
+	/* found covering VMA or user is OK with the matching next VMA */
+	if ((flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) || vma->vm_start <= addr)
+		return vma;
+
+skip_vma:
+	/*
+	 * If the user needs closest matching VMA, keep iterating.
+	 */
+	addr = vma->vm_end;
+	if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
+		goto next_vma;
+no_vma:
+	mmap_read_unlock(mm);
+	return ERR_PTR(-ENOENT);
+}
+
+static void unlock_vma(struct vm_area_struct *vma)
+{
+	mmap_read_unlock(vma->vm_mm);
+}
+
+static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
+{
+	struct procmap_query karg;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	const char *name = NULL;
+	char *name_buf = NULL;
+	__u64 usize;
+	int err;
+
+	if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
+		return -EFAULT;
+	/* argument struct can never be that large, reject abuse */
+	if (usize > PAGE_SIZE)
+		return -E2BIG;
+	/* argument struct should have at least query_flags and query_addr fields */
+	if (usize < offsetofend(struct procmap_query, query_addr))
+		return -EINVAL;
+	err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
+	if (err)
+		return err;
+
+	/* reject unknown flags */
+	if (karg.query_flags & ~PROCMAP_QUERY_VALID_FLAGS_MASK)
+		return -EINVAL;
+	/* either both buffer address and size are set, or both should be zero */
+	if (!!karg.vma_name_size != !!karg.vma_name_addr)
+		return -EINVAL;
+
+	mm = priv->mm;
+	if (!mm || !mmget_not_zero(mm))
+		return -ESRCH;
+
+	vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
+	if (IS_ERR(vma)) {
+		mmput(mm);
+		return PTR_ERR(vma);
+	}
+
+	karg.vma_start = vma->vm_start;
+	karg.vma_end = vma->vm_end;
+
+	if (vma->vm_file) {
+		const struct inode *inode = file_user_inode(vma->vm_file);
+
+		karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
+		karg.dev_major = MAJOR(inode->i_sb->s_dev);
+		karg.dev_minor = MINOR(inode->i_sb->s_dev);
+		karg.inode = inode->i_ino;
+	} else {
+		karg.vma_offset = 0;
+		karg.dev_major = 0;
+		karg.dev_minor = 0;
+		karg.inode = 0;
+	}
+
+	karg.vma_flags = 0;
+	if (vma->vm_flags & VM_READ)
+		karg.vma_flags |= PROCMAP_QUERY_VMA_READABLE;
+	if (vma->vm_flags & VM_WRITE)
+		karg.vma_flags |= PROCMAP_QUERY_VMA_WRITABLE;
+	if (vma->vm_flags & VM_EXEC)
+		karg.vma_flags |= PROCMAP_QUERY_VMA_EXECUTABLE;
+	if (vma->vm_flags & VM_MAYSHARE)
+		karg.vma_flags |= PROCMAP_QUERY_VMA_SHARED;
+
+	if (karg.vma_name_size) {
+		size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
+		const struct path *path;
+		const char *name_fmt;
+		size_t name_sz = 0;
+
+		get_vma_name(vma, &path, &name, &name_fmt);
+
+		if (path || name_fmt || name) {
+			name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
+			if (!name_buf) {
+				err = -ENOMEM;
+				goto out;
+			}
+		}
+		if (path) {
+			name = d_path(path, name_buf, name_buf_sz);
+			if (IS_ERR(name)) {
+				err = PTR_ERR(name);
+				goto out;
+			}
+			name_sz = name_buf + name_buf_sz - name;
+		} else if (name || name_fmt) {
+			name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
+			name = name_buf;
+		}
+		if (name_sz > name_buf_sz) {
+			err = -ENAMETOOLONG;
+			goto out;
+		}
+		karg.vma_name_size = name_sz;
+	}
+
+	/* unlock vma/mm_struct and put mm_struct before copying data to user */
+	unlock_vma(vma);
+	mmput(mm);
+
+	if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
+					       name, karg.vma_name_size)) {
+		kfree(name_buf);
+		return -EFAULT;
+	}
+	kfree(name_buf);
+
+	if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
+		return -EFAULT;
+
+	return 0;
+
+out:
+	unlock_vma(vma);
+	mmput(mm);
+	kfree(name_buf);
+	return err;
+}
+
+static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct seq_file *seq = file->private_data;
+	struct proc_maps_private *priv = seq->private;
+
+	switch (cmd) {
+	case PROCMAP_QUERY:
+		return do_procmap_query(priv, (void __user *)arg);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
 const struct file_operations proc_pid_maps_operations = {
 	.open		= pid_maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= proc_map_release,
+	.unlocked_ioctl = procfs_procmap_ioctl,
+	.compat_ioctl	= procfs_procmap_ioctl,
 };
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 45e4e64fd664..f25e7004972d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -333,8 +333,10 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND | RWF_NOAPPEND)
 
+#define PROCFS_IOCTL_MAGIC 'f'
+
 /* Pagemap ioctl */
-#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
+#define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
 
 /* Bitmasks provided in pm_scan_args masks and reported in page_region.categories. */
 #define PAGE_IS_WPALLOWED	(1 << 0)
@@ -393,4 +395,128 @@ struct pm_scan_arg {
 	__u64 return_mask;
 };
 
+/* /proc/<pid>/maps ioctl */
+#define PROCMAP_QUERY	_IOWR(PROCFS_IOCTL_MAGIC, 17, struct procmap_query)
+
+enum procmap_query_flags {
+	/*
+	 * VMA permission flags.
+	 *
+	 * Can be used as part of procmap_query.query_flags field to look up
+	 * only VMAs satisfying specified subset of permissions. E.g., specifying
+	 * PROCMAP_QUERY_VMA_READABLE only will return both readable and read/write VMAs,
+	 * while having PROCMAP_QUERY_VMA_READABLE | PROCMAP_QUERY_VMA_WRITABLE will only
+	 * return read/write VMAs, though both executable/non-executable and
+	 * private/shared will be ignored.
+	 *
+	 * PROCMAP_QUERY_VMA_* flags are also returned in procmap_query.vma_flags
+	 * field to specify actual VMA permissions.
+	 */
+	PROCMAP_QUERY_VMA_READABLE		= 0x01,
+	PROCMAP_QUERY_VMA_WRITABLE		= 0x02,
+	PROCMAP_QUERY_VMA_EXECUTABLE		= 0x04,
+	PROCMAP_QUERY_VMA_SHARED		= 0x08,
+	/*
+	 * Query modifier flags.
+	 *
+	 * By default, the VMA covering the provided address is returned, or
+	 * -ENOENT if there is none. With the PROCMAP_QUERY_COVERING_OR_NEXT_VMA
+	 * flag set, the closest VMA with vma_start > addr will be returned if
+	 * no covering VMA is found.
+	 *
+	 * PROCMAP_QUERY_FILE_BACKED_VMA instructs the query to consider only VMAs
+	 * that have file backing. It can be combined with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
+	 * to iterate over all file-backed VMAs.
+	 */
+	PROCMAP_QUERY_COVERING_OR_NEXT_VMA	= 0x10,
+	PROCMAP_QUERY_FILE_BACKED_VMA		= 0x20,
+};
+
+/*
+ * Input/output argument structure passed into the ioctl() call. It can be used
+ * to query a set of VMAs (Virtual Memory Areas) of a process.
+ *
+ * Each field can be one of three kinds, marked in a short comment to the
+ * right of the field:
+ *   - "in", input argument, user has to provide this value, kernel doesn't modify it;
+ *   - "out", output argument, kernel sets this field with VMA data;
+ *   - "in/out", input and output argument; user provides initial value (used
+ *     to specify maximum allowable buffer size), and kernel sets it to actual
+ *     amount of data written (or zero, if there is no data).
+ *
+ * If a matching VMA is found (according to the criteria specified by
+ * query_addr/query_flags), all the out fields are filled out, and ioctl()
+ * returns 0. If there is no matching VMA, -ENOENT will be returned.
+ * In case of any other error, negative error code other than -ENOENT is
+ * returned.
+ *
+ * Most of the data is similar to what is returned as text in the /proc/<pid>/maps
+ * file, but procmap_query provides more querying flexibility. There are no
+ * consistency guarantees between subsequent ioctl() calls, but data returned
+ * for matched VMA is self-consistent.
+ */
+struct procmap_query {
+	/* Query struct size, for backwards/forward compatibility */
+	__u64 size;
+	/*
+	 * Query flags, a combination of enum procmap_query_flags values.
+	 * Defines query filtering and behavior, see enum procmap_query_flags.
+	 *
+	 * Input argument, provided by user. Kernel doesn't modify it.
+	 */
+	__u64 query_flags;		/* in */
+	/*
+	 * Query address. By default, VMA that covers this address will
+	 * be looked up. PROCMAP_QUERY_* flags above modify this default
+	 * behavior further.
+	 *
+	 * Input argument, provided by user. Kernel doesn't modify it.
+	 */
+	__u64 query_addr;		/* in */
+	/* VMA starting (inclusive) and ending (exclusive) address, if VMA is found. */
+	__u64 vma_start;		/* out */
+	__u64 vma_end;			/* out */
+	/* VMA permissions flags. A combination of PROCMAP_QUERY_VMA_* flags. */
+	__u64 vma_flags;		/* out */
+	/*
+	 * VMA file offset. If VMA has file backing, this specifies offset
+	 * within the file that VMA's start address corresponds to.
+	 * It is set to zero if the VMA has no backing file.
+	 */
+	__u64 vma_offset;		/* out */
+	/* Backing file's inode number, or zero, if VMA has no backing file. */
+	__u64 inode;			/* out */
+	/* Backing file's device major/minor number, or zero, if VMA has no backing file. */
+	__u32 dev_major;		/* out */
+	__u32 dev_minor;		/* out */
+	/*
+	 * If set to a non-zero value, signals the request to return the VMA
+	 * name (i.e., the VMA's backing file's absolute path, with " (deleted)"
+	 * suffix appended, if the file was unlinked from the FS) for the
+	 * matched VMA. The VMA name can also be some special name (e.g.,
+	 * "[heap]", "[stack]") or could even be user-supplied with
+	 * prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME).
+	 *
+	 * The kernel will set this field to zero if the VMA has no associated
+	 * name. Otherwise, the kernel will return the actual number of bytes
+	 * written into the user-supplied buffer (see the vma_name_addr field
+	 * below), including the terminating zero.
+	 *
+	 * If the VMA name is longer than the user-supplied maximum buffer size,
+	 * -E2BIG error is returned.
+	 *
+	 * If this field is set to a non-zero value, vma_name_addr should point
+	 * to a valid user space memory buffer of at least vma_name_size bytes.
+	 * If set to zero, vma_name_addr should be set to zero as well.
+	__u32 vma_name_size;		/* in/out */
+	/*
+	 * User-supplied address of a buffer of at least vma_name_size bytes
+	 * for kernel to fill with matched VMA's name (see vma_name_size field
+	 * description above for details).
+	 *
+	 * Should be set to zero if VMA name should not be returned.
+	 */
+	__u64 vma_name_addr;		/* in */
+};
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.43.0


^ permalink raw reply related	[relevance 17%]

* [PATCH v2 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
  2024-05-24  4:10 12% [PATCH v2 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
@ 2024-05-24  4:10  4% ` Andrii Nakryiko
  2024-05-24  4:10 17% ` [PATCH v2 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-24  4:10 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
	Andrii Nakryiko

Existing lock_vma_under_rcu() API assumes exact VMA match, so it's not
a 100% equivalent of find_vma(). There are use cases that do want
find_vma() semantics of finding an exact VMA or the next one.

Also, it's important for such an API to let the user distinguish between
failing to take the per-VMA lock and not having any VMA at or after the
provided address.

As such, this patch adds a new find_vma()-like API,
find_and_lock_vma_rcu(), which finds exact or next VMA, attempts to take
per-VMA lock, and if that fails, returns ERR_PTR(-EBUSY). It still
returns NULL if there is no VMA at or after the address. On success it
returns a valid, non-isolated VMA with the per-VMA lock taken.

This API will be used in subsequent patch in this patch set to implement
a new user-facing API for querying process VMAs.

Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/mm.h |  8 ++++++
 mm/memory.c        | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..a6846401da77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -776,6 +776,8 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
 		mmap_assert_locked(vmf->vma->vm_mm);
 }
 
+struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+					  unsigned long address);
 struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 					  unsigned long address);
 
@@ -790,6 +792,12 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 				     bool detached) {}
 
+static inline struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+							    unsigned long address)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		unsigned long address)
 {
diff --git a/mm/memory.c b/mm/memory.c
index b5453b86ec4b..9d0413e98d8b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5824,6 +5824,68 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_PER_VMA_LOCK
+/**
+ * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
+ * next VMA. Search is done under RCU protection, without taking or assuming
+ * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
+ *
+ * @mm: The mm_struct to check
+ * @addr: The address
+ *
+ * Returns: The VMA associated with addr, or the next VMA.
+ * May return %NULL in the case of no VMA at addr or above.
+ * If the VMA is being modified and can't be locked, -EBUSY is returned.
+ */
+struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+					     unsigned long address)
+{
+	MA_STATE(mas, &mm->mm_mt, address, address);
+	struct vm_area_struct *vma;
+	int err;
+
+	rcu_read_lock();
+retry:
+	vma = mas_find(&mas, ULONG_MAX);
+	if (!vma) {
+		err = 0; /* no VMA, return NULL */
+		goto inval;
+	}
+
+	if (!vma_start_read(vma)) {
+		err = -EBUSY;
+		goto inval;
+	}
+
+	/*
+	 * Check since vm_start/vm_end might change before we lock the VMA.
+	 * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
+	 * address or the next one, so we only make sure VMA wasn't updated to
+	 * end before the address.
+	 */
+	if (unlikely(vma->vm_end <= address)) {
+		err = -EBUSY;
+		goto inval_end_read;
+	}
+
+	/* Check if the VMA got isolated after we found it */
+	if (vma->detached) {
+		vma_end_read(vma);
+		count_vm_vma_lock_event(VMA_LOCK_MISS);
+		/* The area was replaced with another one */
+		goto retry;
+	}
+
+	rcu_read_unlock();
+	return vma;
+
+inval_end_read:
+	vma_end_read(vma);
+inval:
+	rcu_read_unlock();
+	count_vm_vma_lock_event(VMA_LOCK_ABORT);
+	return ERR_PTR(err);
+}
+
 /*
  * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
  * stable and not isolated. If the VMA is not found or is being modified the
-- 
2.43.0


^ permalink raw reply related	[relevance 4%]

* [PATCH v2 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps
@ 2024-05-24  4:10 12% Andrii Nakryiko
  2024-05-24  4:10  4% ` [PATCH v2 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock Andrii Nakryiko
                   ` (5 more replies)
  0 siblings, 6 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-24  4:10 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, liam.howlett, surenb, rppt,
	Andrii Nakryiko

Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
applications to query VMA information more efficiently than reading *all* VMAs
nonselectively through text-based interface of /proc/<pid>/maps file.

Patch #3 goes into a lot of details and background on some common patterns of
using /proc/<pid>/maps in the area of performance profiling and subsequent
symbolization of captured stack traces. As mentioned in that patch, patterns
of VMA querying can differ depending on specific use case, but can generally
be grouped into two main categories: the need to query a small subset of VMAs
covering a given batch of addresses, or reading/storing/caching all
(typically, executable) VMAs upfront for later processing.

The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by the
former pattern of usage. Patch #9 adds a tool that faithfully reproduces an
efficient VMA matching pass of a symbolizer, collecting a subset of covering
VMAs for a given set of addresses as efficiently as possible. This tool is
serving both as a testing ground, as well as a benchmarking tool.
It implements everything both for currently existing text-based
/proc/<pid>/maps interface, as well as for newly-added PROCMAP_QUERY ioctl().

Based on discussion of the previous revision of this patch set, it turned out
that this ioctl() API is competitive with the highly-optimized text-based
pre-processing pattern that the perf tool uses. Based on that discussion, this
revision adds more flexibility in specifying a subset of VMAs that are of
interest. Now it's possible to specify desired permissions of VMAs (e.g.,
request only executable ones) and/or restrict to only a subset of VMAs that
have file backing. This further improves the efficiency when using this new
API thanks to more selective (executable VMAs only) querying.

In addition to a custom benchmarking tool from patch #9, and experimental perf
integration (available at [0]), Daniel Mueller has since also implemented an
experimental integration into blazesym (see [1]), a library used for stack
trace symbolization by our server fleet-wide profiler and another on-device
profiler agent that runs on weaker ARM devices. The latter ARM-based device
profiler is especially sensitive to performance, and so we benchmarked and
compared text-based /proc/<pid>/maps solution to the equivalent one using
PROCMAP_QUERY ioctl().

Results are very encouraging, giving us 5x improvement for end-to-end
so-called "address normalization" pass, which is the part of the symbolization
process that happens locally on ARM device, before being sent out for further
heavier-weight processing on more powerful remote server. Note that this is
not an artificial microbenchmark. It's a full end-to-end API call being
measured with real-world data on real-world device.

  TEXT-BASED
  ==========
  Benchmarking main/normalize_process_no_build_ids_uncached_maps
  main/normalize_process_no_build_ids_uncached_maps
	  time:   [49.777 µs 49.982 µs 50.250 µs]

  IOCTL-BASED
  ===========
  Benchmarking main/normalize_process_no_build_ids_uncached_maps
  main/normalize_process_no_build_ids_uncached_maps
	  time:   [10.328 µs 10.391 µs 10.457 µs]
	  change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02)
	  Performance has improved.

You can see above that latency drops from 50µs down to 10µs for exactly
the same amount of work, with the same data and target process.

Results for more synthetic benchmarks that hammer /proc/<pid>/maps processing
specifically can be found in patch #9. In short, we see about ~40x improvement
with our custom benchmark tool (it varies depending on captured set of
addresses, previous revision used a different set of captured addresses,
giving about ~35x improvement). And even for perf-based benchmark it's on par
or slightly ahead when using permission-based filtering (fetching only
executable VMAs).

Another big change since v1 is the use of RCU-protected per-VMA lock during
querying, which is what has been requested by mm folks in favor of current
mmap_lock-based protection used by /proc/<pid>/maps text-based implementation.
For that, we added a new internal API that is equivalent to find_vma(), see
patch #1.

One thing that did not change was basing this new API as an ioctl() command
on /proc/<pid>/maps file. An ioctl-based API on top of pidfd was considered,
but has its own downsides. Implementing ioctl() directly on pidfd will cause
access permission checks on every single ioctl(), which leads to performance
concerns and potential spam of capable() audit messages. It also prevents
a nice pattern, possible with /proc/<pid>/maps, in which an application opens
its /proc/self/maps FD (requiring no additional capabilities) and passes this
FD to a profiling agent for querying. To achieve a similar pattern, a new file
would have to be created from the pidfd just for VMA querying, which is
considered inferior to just querying the /proc/<pid>/maps FD as proposed in
the current approach.
These aspects were discussed in the hallway track at recent LSF/MM/BPF 2024
and sticking to procfs ioctl() was the final agreement we arrived at.

This patch set is based on top of next-20240522 tag in linux-next tree.

  [0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
  [1] https://github.com/libbpf/blazesym/pull/675

v1->v2:
  - per-VMA lock is used, if possible (Liam, Suren);
  - added file-backed VMA querying (perf folks);
  - added permission-based VMA querying (perf folks);
  - split out build ID into separate patch (Suren);
  - better documented API, added mention of ioctl() into procfs docs (Greg).

Andrii Nakryiko (9):
  mm: add find_vma()-like API but RCU protected and taking VMA lock
  fs/procfs: extract logic for getting VMA name constituents
  fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
  fs/procfs: add build ID fetching to PROCMAP_QUERY API
  docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence
  tools: sync uapi/linux/fs.h header into tools subdir
  selftests/bpf: make use of PROCMAP_QUERY ioctl if available
  selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs

 Documentation/filesystems/proc.rst          |   8 +
 fs/proc/task_mmu.c                          | 378 ++++++++++++--
 include/linux/mm.h                          |   8 +
 include/uapi/linux/fs.h                     | 156 +++++-
 mm/memory.c                                 |  62 +++
 tools/include/uapi/linux/fs.h               | 550 ++++++++++++++++++++
 tools/testing/selftests/bpf/.gitignore      |   1 +
 tools/testing/selftests/bpf/Makefile        |   2 +-
 tools/testing/selftests/bpf/procfs_query.c  | 386 ++++++++++++++
 tools/testing/selftests/bpf/test_progs.c    |   3 +
 tools/testing/selftests/bpf/test_progs.h    |   2 +
 tools/testing/selftests/bpf/trace_helpers.c | 104 +++-
 12 files changed, 1589 insertions(+), 71 deletions(-)
 create mode 100644 tools/include/uapi/linux/fs.h
 create mode 100644 tools/testing/selftests/bpf/procfs_query.c

-- 
2.43.0


^ permalink raw reply	[relevance 12%]

* Re: [PATCH 0/3] Introduce user namespace capabilities
  @ 2024-05-20  0:54  5%                 ` Jonathan Calmels
  0 siblings, 0 replies; 200+ results
From: Jonathan Calmels @ 2024-05-20  0:54 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Serge Hallyn, Jarkko Sakkinen, brauner, ebiederm,
	Luis Chamberlain, Kees Cook, Joel Granados, Paul Moore,
	James Morris, David Howells, containers, linux-kernel,
	linux-fsdevel, linux-security-module, keyrings

On Sun, May 19, 2024 at 10:03:29AM GMT, Casey Schaufler wrote:
> I do understand that. My objection is not to the intent, but to the approach.
> Adding a capability set to the general mechanism in support of a limited, specific
> use case seems wrong to me. I would rather see a mechanism in userns to limit
> the capabilities in a user namespace than a mechanism in capabilities that is
> specific to user namespaces.

> An option to clone() then, to limit the capabilities available?
> I honestly can't recall if that has been suggested elsewhere, and
> apologize if it's already been dismissed as a stoopid idea.

No, and you're right, this would also make sense. This was considered,
as were things like ioctl_ns() (basically introducing the concept of
capabilities in the user_namespace struct). I also considered reusing
the existing sets with various schemes, to no avail.

The main issue with this approach is that you have to consider how this is
going to be used. This ties into the other thread we've had with John
and Eric.
Basically, we're coming from a model where things are wide open and
we're trying to tighten things down.

Quoting John here:

> We are starting from a different posture here. Where applications have
> assumed that user namespaces where safe and no measures were needed.
> Tools like unshare and bwrap if set to allow user namespaces in their
> fcaps will allow exploits a trivial by-pass.

We can't really expect userspace to patch every single userns callsite
and opt-in this new security mechanism.
You said it well yourself:

> Capabilities are already more complicated than modern developers
> want to deal with.

Moreover, policies are not necessarily enforced at said callsites. Take
for example a service like systemd-machined, or a PAM session. Those
need to be able to place restrictions on any processes spawned under
them.

If we do this in clone() (or similar), we'll also need to come up with
inheritance rules, being able to query capabilities, etc.
At this point we're just reinventing capability sets.

Finally the nice thing about having it as a capability set, is that we
can easily define rules between them. Patch 2 is a good example of this.
It constrains the userns set to the bounding set of a task, thus
requiring minimal or no change to userspace and helping with adoption.

> Yes, I understand. I would rather see a change to userns in support of a userns
> specific need than a change to capabilities for a userns specific need.

Valid point, but at the end of the day, those are really just tasks'
capabilities. The unshare() just happens to trigger specific rules when it
comes to the tasks' creds. This isn't so different from the other sets
and their specific rules for execve() or UID 0.

This could also be reframed as:

Why would setting capabilities on tasks in a userns be so different from
setting them on tasks outside of it?

^ permalink raw reply	[relevance 5%]

* Re: [PATCH v5 01/10] ext4: factor out a common helper to query extent map
  2024-05-17 12:39 11% ` [PATCH v5 01/10] ext4: factor out a common helper to query extent map Zhang Yi
@ 2024-05-17 16:19  7%   ` Markus Elfring
  0 siblings, 0 replies; 200+ results
From: Markus Elfring @ 2024-05-17 16:19 UTC (permalink / raw)
  To: Zhang Yi, linux-ext4, linux-fsdevel, kernel-janitors
  Cc: LKML, Andreas Dilger, Jan Kara, Ritesh Harjani,
	Theodore Ts'o, Yu Kuai, Zhang Yi, Zhihao Cheng

…
> ext4_da_map_blocks(), it query and return the extent map status on the
> inode's extent path, no logic changes.

Please improve this change description a bit further.

Regards,
Markus

^ permalink raw reply	[relevance 7%]

* [PATCH v5 01/10] ext4: factor out a common helper to query extent map
  2024-05-17 12:39  4% [PATCH v5 00/10] ext4: support adding multi-delalloc blocks Zhang Yi
@ 2024-05-17 12:39 11% ` Zhang Yi
  2024-05-17 16:19  7%   ` Markus Elfring
  0 siblings, 1 reply; 200+ results
From: Zhang Yi @ 2024-05-17 12:39 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	ritesh.list, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Factor out a new common helper, ext4_map_query_blocks(), from
ext4_da_map_blocks(); it queries and returns the extent map status on the
inode's extent path. No logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/inode.c | 57 +++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 537803250ca9..6a41172c06e1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -453,6 +453,35 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
 }
 #endif /* ES_AGGRESSIVE_TEST */
 
+static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
+				 struct ext4_map_blocks *map)
+{
+	unsigned int status;
+	int retval;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		retval = ext4_ext_map_blocks(handle, inode, map, 0);
+	else
+		retval = ext4_ind_map_blocks(handle, inode, map, 0);
+
+	if (retval <= 0)
+		return retval;
+
+	if (unlikely(retval != map->m_len)) {
+		ext4_warning(inode->i_sb,
+			     "ES len assertion failed for inode "
+			     "%lu: retval %d != map->m_len %d",
+			     inode->i_ino, retval, map->m_len);
+		WARN_ON(1);
+	}
+
+	status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+			EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+	ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+			      map->m_pblk, status);
+	return retval;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -1744,33 +1773,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	down_read(&EXT4_I(inode)->i_data_sem);
 	if (ext4_has_inline_data(inode))
 		retval = 0;
-	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
 	else
-		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
-	if (retval < 0) {
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
-	if (retval > 0) {
-		unsigned int status;
-
-		if (unlikely(retval != map->m_len)) {
-			ext4_warning(inode->i_sb,
-				     "ES len assertion failed for inode "
-				     "%lu: retval %d != map->m_len %d",
-				     inode->i_ino, retval, map->m_len);
-			WARN_ON(1);
-		}
-
-		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
-				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status);
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
+		retval = ext4_map_query_blocks(NULL, inode, map);
 	up_read(&EXT4_I(inode)->i_data_sem);
+	if (retval)
+		return retval;
 
 add_delayed:
 	down_write(&EXT4_I(inode)->i_data_sem);
-- 
2.39.2


^ permalink raw reply related	[relevance 11%]

* [PATCH v5 00/10] ext4: support adding multi-delalloc blocks
@ 2024-05-17 12:39  4% Zhang Yi
  2024-05-17 12:39 11% ` [PATCH v5 01/10] ext4: factor out a common helper to query extent map Zhang Yi
  0 siblings, 1 reply; 200+ results
From: Zhang Yi @ 2024-05-17 12:39 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	ritesh.list, yi.zhang, yi.zhang, chengzhihao1, yukuai3

Changes since v4:
 - In patch 3, switch to check EXT4_ERROR_FS instead of
   ext4_forced_shutdown() to prevent warning on errors=continue mode as
   Jan suggested.
 - In patch 8, rename ext4_da_check_clu_allocated() to
   ext4_clu_alloc_state() and change the return value according to the
   cluster allocation state as Jan suggested.
 - In patch 9, adjust the logic accordingly since ext4_clu_alloc_state()
   has been changed in patch 8; I dropped Jan's Reviewed-by tag, so
   please take another look.

Changes since v3:
 - Fix two commit message grammatical issues in patch 2 and 4.

Changes since v2:
 - Improve the commit message in patch 2,4,6 as Ritesh and Jan
   suggested, makes the changes more clear.
 - Add patch 3, add a warning if the delalloc counters are still not
   zero on inactive.
 - In patch 6, add a WARN in ext4_es_insert_delayed_extent(), strictly
   requires the end_allocated parameter to be set to false if the
   inserting extent belongs to one cluster.
 - In patch 9, modify the reserve blocks math formula as Jan suggested,
   prevent the count going to be negative.
 - In patch 10, update the stale ext4_da_map_blocks() function comments.

Hello!

This patch series contains the part 2 preparatory changes for the
buffered IO iomap conversion. I picked them out of my buffered IO iomap
conversion RFC series v3[1], added a fix for an issue found in the
current ext4 code, and also added bigalloc feature support. Please see
the following patches for details.

The first 3 patches fix an incorrect delalloc reserved blocks count
issue and add a warning to make it easy to detect; the next 6 patches
make the ext4_insert_delayed_block() call path support inserting
multiple delalloc blocks at a time; and the last patch makes
ext4_da_map_blocks() buffer_head unaware, in preparation for iomap.

This patch set has passed the 'kvm-xfstests -g auto' tests; I hope it
can be reviewed and merged first.

[1] https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/

Thanks,
Yi.

---
v2: https://lore.kernel.org/linux-ext4/20240410034203.2188357-1-yi.zhang@huaweicloud.com/
v3: https://lore.kernel.org/linux-ext4/20240508061220.967970-1-yi.zhang@huaweicloud.com/

Zhang Yi (10):
  ext4: factor out a common helper to query extent map
  ext4: check the extent status again before inserting delalloc block
  ext4: warn if delalloc counters are not zero on inactive
  ext4: trim delalloc extent
  ext4: drop iblock parameter
  ext4: make ext4_es_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_reserve_space() reserve multi-clusters
  ext4: factor out a helper to check the cluster allocation state
  ext4: make ext4_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_map_blocks() buffer_head unaware

 fs/ext4/extents_status.c    |  70 +++++++---
 fs/ext4/extents_status.h    |   5 +-
 fs/ext4/inode.c             | 250 +++++++++++++++++++++++-------------
 fs/ext4/super.c             |   6 +-
 include/trace/events/ext4.h |  26 ++--
 5 files changed, 234 insertions(+), 123 deletions(-)

-- 
2.39.2


^ permalink raw reply	[relevance 4%]

* [PATCH v4 01/10] ext4: factor out a common helper to query extent map
  2024-05-11 11:26  4% [PATCH v4 00/10] ext4: support adding multi-delalloc blocks Zhang Yi
@ 2024-05-11 11:26 11% ` Zhang Yi
  0 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-05-11 11:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	ritesh.list, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Factor out a new common helper, ext4_map_query_blocks(), from
ext4_da_map_blocks(); it queries and returns the extent map status on the
inode's extent path. No logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/inode.c | 57 +++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 537803250ca9..6a41172c06e1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -453,6 +453,35 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
 }
 #endif /* ES_AGGRESSIVE_TEST */
 
+static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
+				 struct ext4_map_blocks *map)
+{
+	unsigned int status;
+	int retval;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		retval = ext4_ext_map_blocks(handle, inode, map, 0);
+	else
+		retval = ext4_ind_map_blocks(handle, inode, map, 0);
+
+	if (retval <= 0)
+		return retval;
+
+	if (unlikely(retval != map->m_len)) {
+		ext4_warning(inode->i_sb,
+			     "ES len assertion failed for inode "
+			     "%lu: retval %d != map->m_len %d",
+			     inode->i_ino, retval, map->m_len);
+		WARN_ON(1);
+	}
+
+	status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+			EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+	ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+			      map->m_pblk, status);
+	return retval;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -1744,33 +1773,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	down_read(&EXT4_I(inode)->i_data_sem);
 	if (ext4_has_inline_data(inode))
 		retval = 0;
-	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
 	else
-		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
-	if (retval < 0) {
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
-	if (retval > 0) {
-		unsigned int status;
-
-		if (unlikely(retval != map->m_len)) {
-			ext4_warning(inode->i_sb,
-				     "ES len assertion failed for inode "
-				     "%lu: retval %d != map->m_len %d",
-				     inode->i_ino, retval, map->m_len);
-			WARN_ON(1);
-		}
-
-		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
-				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status);
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
+		retval = ext4_map_query_blocks(NULL, inode, map);
 	up_read(&EXT4_I(inode)->i_data_sem);
+	if (retval)
+		return retval;
 
 add_delayed:
 	down_write(&EXT4_I(inode)->i_data_sem);
-- 
2.39.2


^ permalink raw reply related	[relevance 11%]

* [PATCH v4 00/10] ext4: support adding multi-delalloc blocks
@ 2024-05-11 11:26  4% Zhang Yi
  2024-05-11 11:26 11% ` [PATCH v4 01/10] ext4: factor out a common helper to query extent map Zhang Yi
  0 siblings, 1 reply; 200+ results
From: Zhang Yi @ 2024-05-11 11:26 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	ritesh.list, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Changes since v3:
 - Fix two commit message grammatical issues in patch 2 and 4.

Changes since v2:
 - Improve the commit message in patch 2,4,6 as Ritesh and Jan
   suggested, makes the changes more clear.
 - Add patch 3, add a warning if the delalloc counters are still not
   zero on inactive.
 - In patch 6, add a WARN in ext4_es_insert_delayed_extent(), strictly
   requires the end_allocated parameter to be set to false if the
   inserting extent belongs to one cluster.
 - In patch 9, modify the reserve blocks math formula as Jan suggested,
   prevent the count going to be negative.
 - In patch 10, update the stale ext4_da_map_blocks() function comments.

Hello!

This patch series is part 2 of the preparatory changes for the buffered
IO iomap conversion. I picked them out from my buffered IO iomap
conversion RFC series v3[1], added a fix for an issue found in the
current ext4 code, and also added bigalloc feature support. Please look
at the following patches for details.

The first 3 patches fix an incorrect delalloc reserved blocks count
issue and add a warning to make it easy to detect, the next 6 patches
make the ext4_insert_delayed_block() call path support inserting
multiple delalloc blocks at a time, and the last patch makes
ext4_da_map_blocks() buffer_head unaware, prepared for iomap.

This patch set has passed the 'kvm-xfstests -g auto' tests; I hope it
could be reviewed and merged first.

[1] https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/

Thanks,
Yi.

---
v2: https://lore.kernel.org/linux-ext4/20240410034203.2188357-1-yi.zhang@huaweicloud.com/
v3: https://lore.kernel.org/linux-ext4/20240508061220.967970-1-yi.zhang@huaweicloud.com/

Zhang Yi (10):
  ext4: factor out a common helper to query extent map
  ext4: check the extent status again before inserting delalloc block
  ext4: warn if delalloc counters are not zero on inactive
  ext4: trim delalloc extent
  ext4: drop iblock parameter
  ext4: make ext4_es_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_reserve_space() reserve multi-clusters
  ext4: factor out check for whether a cluster is allocated
  ext4: make ext4_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_map_blocks() buffer_head unaware

 fs/ext4/extents_status.c    |  70 +++++++---
 fs/ext4/extents_status.h    |   5 +-
 fs/ext4/inode.c             | 248 +++++++++++++++++++++++-------------
 fs/ext4/super.c             |   6 +-
 include/trace/events/ext4.h |  26 ++--
 5 files changed, 231 insertions(+), 124 deletions(-)

-- 
2.39.2


^ permalink raw reply	[relevance 4%]

* Re: [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks()
  2024-05-10  9:41  0%           ` Luis Henriques
@ 2024-05-10 11:40  0%             ` Zhang Yi
  0 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-05-10 11:40 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Theodore Ts'o, linux-ext4, linux-fsdevel, linux-mm,
	linux-kernel, adilger.kernel, jack, ritesh.list, hch, djwong,
	willy, zokeefe, yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

On 2024/5/10 17:41, Luis Henriques wrote:
> On Fri 10 May 2024 11:39:48 AM +08, Zhang Yi wrote;
> 
>> On 2024/5/10 1:23, Luis Henriques wrote:
>>> On Thu 09 May 2024 12:39:53 PM -04, Theodore Ts'o wrote;
>>>
>>>> On Thu, May 09, 2024 at 04:16:34PM +0100, Luis Henriques wrote:
>>>>>
>>>>> It looks like it's easy to trigger an infinite loop here using fstest
>>>>> generic/039.  If I understand it correctly (which doesn't happen as often
>>>>> as I'd like), this is due to an integer overflow in the 'if' condition,
>>>>> and should be fixed with the patch below.
>>>>
>>>> Thanks for the report.  However, I can't reproduce the failure, and
>>>> looking at generic/039, I don't see how it could be relevant to the
>>>> code path in question.  Generic/039 creates a test symlink with two
>>>> hard links in the same directory, syncs the file system, and then
>>>> removes one of the hard links, and then drops access to the block
>>>> device using dmflakey.  So I don't see how the extent code would be
>>>> involved at all.  Are you sure that you have the correct test listed?
>>>
>>> Yep, I just retested and it's definitely generic/039.  I'm using a simple
>>> test environment, with virtme-ng.
>>>
>>>> Looking at the code in question in fs/ext4/extents.c:
>>>>
>>>> again:
>>>> 	ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
>>>> 				  hole_start + len - 1, &es);
>>>> 	if (!es.es_len)
>>>> 		goto insert_hole;
>>>>
>>>>   	 * There's a delalloc extent in the hole, handle it if the delalloc
>>>>   	 * extent is in front of, behind and straddle the queried range.
>>>>   	 */
>>>>  -	if (lblk >= es.es_lblk + es.es_len) {
>>>>  +	if (lblk >= ((__u64) es.es_lblk) + es.es_len) {
>>>>   		/*
>>>>   		 * The delalloc extent is in front of the queried range,
>>>>   		 * find again from the queried start block.
>>>> 		len -= lblk - hole_start;
>>>> 		hole_start = lblk;
>>>> 		goto again;
>>>>
>>>> lblk and es.es_lblk are both __u32.  So the infinite loop is
>>>> presumably because es.es_lblk + es.es_len has overflowed.  This should
>>>> never happen(tm), and in fact we have a test for this case which
>>>
>>> If I instrument the code, I can see that es.es_len is definitely set to
>>> EXT_MAX_BLOCKS, which will overflow.
>>>
>>
>> Thanks for the report. After looking at the code, I think the root
>> cause of this issue is the variable es was not initialized on replaying
>> fast commit. ext4_es_find_extent_range() will return directly when
>> EXT4_FC_REPLAY flag is set, and then the es.len becomes stale.
>>
>> I can always reproduce this issue on generic/039 with
>> MKFS_OPTIONS="-O fast_commit".
>>
>> This uninitialization problem originally existed in the old
>> ext4_ext_put_gap_in_cache(), but it didn't trigger any real problem
>> since we never check and use extent cache when replaying fast commit.
>> So I suppose the correct fix would be to unconditionally initialize
>> the es variable.
> 
> Oh, you're absolutely right -- the extent_status 'es' struct isn't being
> initialized in that case.  I totally failed to see that.  And yes, I also
> failed to mention I had 'fast_commit' feature enabled, sorry!
> 
> Thanks a lot for figuring this out, Yi.  I'm looking at this code and
> trying to understand if it would be safe to call __es_find_extent_range()
> when EXT4_FC_REPLAY is in progress.  Probably not, and probably better to
> simply do:
> 
> 	es->es_lblk = es->es_len = es->es_pblk = 0;
> 
> in that case.  I'll send out a patch later today.
> 

Yeah, I'm glad it could help.

Thanks,
Yi.



* Re: [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks()
  2024-05-10  3:39  0%         ` Zhang Yi
@ 2024-05-10  9:41  0%           ` Luis Henriques
  2024-05-10 11:40  0%             ` Zhang Yi
  0 siblings, 1 reply; 200+ results
From: Luis Henriques @ 2024-05-10  9:41 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Theodore Ts'o, linux-ext4, linux-fsdevel, linux-mm,
	linux-kernel, adilger.kernel, jack, ritesh.list, hch, djwong,
	willy, zokeefe, yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

On Fri 10 May 2024 11:39:48 AM +08, Zhang Yi wrote;

> On 2024/5/10 1:23, Luis Henriques wrote:
>> On Thu 09 May 2024 12:39:53 PM -04, Theodore Ts'o wrote;
>> 
>>> On Thu, May 09, 2024 at 04:16:34PM +0100, Luis Henriques wrote:
>>>>
>>>> It looks like it's easy to trigger an infinite loop here using fstest
>>>> generic/039.  If I understand it correctly (which doesn't happen as often
>>>> as I'd like), this is due to an integer overflow in the 'if' condition,
>>>> and should be fixed with the patch below.
>>>
>>> Thanks for the report.  However, I can't reproduce the failure, and
>>> looking at generic/039, I don't see how it could be relevant to the
>>> code path in question.  Generic/039 creates a test symlink with two
>>> hard links in the same directory, syncs the file system, and then
>>> removes one of the hard links, and then drops access to the block
>>> device using dmflakey.  So I don't see how the extent code would be
>>> involved at all.  Are you sure that you have the correct test listed?
>> 
>> Yep, I just retested and it's definitely generic/039.  I'm using a simple
>> test environment, with virtme-ng.
>> 
>>> Looking at the code in question in fs/ext4/extents.c:
>>>
>>> again:
>>> 	ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
>>> 				  hole_start + len - 1, &es);
>>> 	if (!es.es_len)
>>> 		goto insert_hole;
>>>
>>>   	 * There's a delalloc extent in the hole, handle it if the delalloc
>>>   	 * extent is in front of, behind and straddle the queried range.
>>>   	 */
>>>  -	if (lblk >= es.es_lblk + es.es_len) {
>>>  +	if (lblk >= ((__u64) es.es_lblk) + es.es_len) {
>>>   		/*
>>>   		 * The delalloc extent is in front of the queried range,
>>>   		 * find again from the queried start block.
>>> 		len -= lblk - hole_start;
>>> 		hole_start = lblk;
>>> 		goto again;
>>>
>>> lblk and es.es_lblk are both __u32.  So the infinite loop is
>>> presumably because es.es_lblk + es.es_len has overflowed.  This should
>>> never happen(tm), and in fact we have a test for this case which
>> 
>> If I instrument the code, I can see that es.es_len is definitely set to
>> EXT_MAX_BLOCKS, which will overflow.
>> 
>
> Thanks for the report. After looking at the code, I think the root
> cause of this issue is the variable es was not initialized on replaying
> fast commit. ext4_es_find_extent_range() will return directly when
> EXT4_FC_REPLAY flag is set, and then the es.len becomes stale.
>
> I can always reproduce this issue on generic/039 with
> MKFS_OPTIONS="-O fast_commit".
>
> This uninitialization problem originally existed in the old
> ext4_ext_put_gap_in_cache(), but it didn't trigger any real problem
> since we never check and use extent cache when replaying fast commit.
> So I suppose the correct fix would be to unconditionally initialize
> the es variable.

Oh, you're absolutely right -- the extent_status 'es' struct isn't being
initialized in that case.  I totally failed to see that.  And yes, I also
failed to mention I had 'fast_commit' feature enabled, sorry!

Thanks a lot for figuring this out, Yi.  I'm looking at this code and
trying to understand if it would be safe to call __es_find_extent_range()
when EXT4_FC_REPLAY is in progress.  Probably not, and probably better to
simply do:

	es->es_lblk = es->es_len = es->es_pblk = 0;

in that case.  I'll send out a patch later today.

Cheers,
-- 
Luis


* Re: [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks()
  2024-05-09 17:23  0%       ` Luis Henriques
@ 2024-05-10  3:39  0%         ` Zhang Yi
  2024-05-10  9:41  0%           ` Luis Henriques
  0 siblings, 1 reply; 200+ results
From: Zhang Yi @ 2024-05-10  3:39 UTC (permalink / raw)
  To: Luis Henriques, Theodore Ts'o
  Cc: linux-ext4, linux-fsdevel, linux-mm, linux-kernel,
	adilger.kernel, jack, ritesh.list, hch, djwong, willy, zokeefe,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

On 2024/5/10 1:23, Luis Henriques wrote:
> On Thu 09 May 2024 12:39:53 PM -04, Theodore Ts'o wrote;
> 
>> On Thu, May 09, 2024 at 04:16:34PM +0100, Luis Henriques wrote:
>>>
>>> It looks like it's easy to trigger an infinite loop here using fstest
>>> generic/039.  If I understand it correctly (which doesn't happen as often
>>> as I'd like), this is due to an integer overflow in the 'if' condition,
>>> and should be fixed with the patch below.
>>
>> Thanks for the report.  However, I can't reproduce the failure, and
>> looking at generic/039, I don't see how it could be relevant to the
>> code path in question.  Generic/039 creates a test symlink with two
>> hard links in the same directory, syncs the file system, and then
>> removes one of the hard links, and then drops access to the block
>> device using dmflakey.  So I don't see how the extent code would be
>> involved at all.  Are you sure that you have the correct test listed?
> 
> Yep, I just retested and it's definitely generic/039.  I'm using a simple
> test environment, with virtme-ng.
> 
>> Looking at the code in question in fs/ext4/extents.c:
>>
>> again:
>> 	ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
>> 				  hole_start + len - 1, &es);
>> 	if (!es.es_len)
>> 		goto insert_hole;
>>
>>   	 * There's a delalloc extent in the hole, handle it if the delalloc
>>   	 * extent is in front of, behind and straddle the queried range.
>>   	 */
>>  -	if (lblk >= es.es_lblk + es.es_len) {
>>  +	if (lblk >= ((__u64) es.es_lblk) + es.es_len) {
>>   		/*
>>   		 * The delalloc extent is in front of the queried range,
>>   		 * find again from the queried start block.
>> 		len -= lblk - hole_start;
>> 		hole_start = lblk;
>> 		goto again;
>>
>> lblk and es.es_lblk are both __u32.  So the infinite loop is
>> presumably because es.es_lblk + es.es_len has overflowed.  This should
>> never happen(tm), and in fact we have a test for this case which
> 
> If I instrument the code, I can see that es.es_len is definitely set to
> EXT_MAX_BLOCKS, which will overflow.
> 

Thanks for the report. After looking at the code, I think the root
cause of this issue is the variable es was not initialized on replaying
fast commit. ext4_es_find_extent_range() will return directly when
EXT4_FC_REPLAY flag is set, and then the es.len becomes stale.

I can always reproduce this issue on generic/039 with
MKFS_OPTIONS="-O fast_commit".

This uninitialization problem originally existed in the old
ext4_ext_put_gap_in_cache(), but it didn't trigger any real problem
since we never check and use extent cache when replaying fast commit.
So I suppose the correct fix would be to unconditionally initialize
the es variable.

Thanks,
Yi.

>> *should* have gotten tripped when ext4_es_find_extent_range() calls
>> __es_tree_search() in fs/ext4/extents_status.c:
>>
>> static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
>> {
>> 	BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
>> 	return es->es_lblk + es->es_len - 1;
>> }
>>
>> So the patch is harmless, and I can see how it might fix what you were
>> seeing --- but I'm a bit nervous that I can't reproduce it and the
>> commit description claims that it reproduces easily; and we should
>> have never allowed the entry to have gotten introduced into the
>> extents status tree in the first place, and if it had been introduced,
>> it should have been caught before it was returned by
>> ext4_es_find_extent_range().
>>
>> Can you give more details about the reproducer; can you double check
>> the test id, and how easily you can trigger the failure, and what is
>> the hardware you used to run the test?
> 
> So, here are a few more details that may clarify, and that I should have added
> to the commit description:
> 
> When the test hangs, the test is blocked mounting the flakey device:
> 
>    mount -t ext4 -o acl,user_xattr /dev/mapper/flakey-test /mnt/scratch
> 
> which will eventually call into ext4_ext_map_blocks(), triggering the bug.
> 
> Also, some more code instrumentation shows that after the call to
> ext4_ext_find_hole(), the 'hole_start' will be set to '1' and 'len' to
> '0xfffffffe'.  This '0xfffffffe' value is a bit odd, but it comes from the
> fact that, in ext4_ext_find_hole(), the call to
> ext4_ext_next_allocated_block() will return EXT_MAX_BLOCKS and 'len' will
> thus be set to 'EXT_MAX_BLOCKS - 1'.
> 
> Does this make sense?
> 
> Cheers,
> 



* Re: [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks()
  2024-05-09 16:39  7%     ` Theodore Ts'o
@ 2024-05-09 17:23  0%       ` Luis Henriques
  2024-05-10  3:39  0%         ` Zhang Yi
  0 siblings, 1 reply; 200+ results
From: Luis Henriques @ 2024-05-09 17:23 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Luis Henriques, Zhang Yi, linux-ext4, linux-fsdevel, linux-mm,
	linux-kernel, adilger.kernel, jack, ritesh.list, hch, djwong,
	willy, zokeefe, yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

On Thu 09 May 2024 12:39:53 PM -04, Theodore Ts'o wrote;

> On Thu, May 09, 2024 at 04:16:34PM +0100, Luis Henriques wrote:
>> 
>> It looks like it's easy to trigger an infinite loop here using fstest
>> generic/039.  If I understand it correctly (which doesn't happen as often
>> as I'd like), this is due to an integer overflow in the 'if' condition,
>> and should be fixed with the patch below.
>
> Thanks for the report.  However, I can't reproduce the failure, and
> looking at generic/039, I don't see how it could be relevant to the
> code path in question.  Generic/039 creates a test symlink with two
> hard links in the same directory, syncs the file system, and then
> removes one of the hard links, and then drops access to the block
> device using dmflakey.  So I don't see how the extent code would be
> involved at all.  Are you sure that you have the correct test listed?

Yep, I just retested and it's definitely generic/039.  I'm using a simple
test environment, with virtme-ng.

> Looking at the code in question in fs/ext4/extents.c:
>
> again:
> 	ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
> 				  hole_start + len - 1, &es);
> 	if (!es.es_len)
> 		goto insert_hole;
>
>   	 * There's a delalloc extent in the hole, handle it if the delalloc
>   	 * extent is in front of, behind and straddle the queried range.
>   	 */
>  -	if (lblk >= es.es_lblk + es.es_len) {
>  +	if (lblk >= ((__u64) es.es_lblk) + es.es_len) {
>   		/*
>   		 * The delalloc extent is in front of the queried range,
>   		 * find again from the queried start block.
> 		len -= lblk - hole_start;
> 		hole_start = lblk;
> 		goto again;
>
> lblk and es.es_lblk are both __u32.  So the infinite loop is
> presumably because es.es_lblk + es.es_len has overflowed.  This should
> never happen(tm), and in fact we have a test for this case which

If I instrument the code, I can see that es.es_len is definitely set to
EXT_MAX_BLOCKS, which will overflow.

> *should* have gotten tripped when ext4_es_find_extent_range() calls
> __es_tree_search() in fs/ext4/extents_status.c:
>
> static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
> {
> 	BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
> 	return es->es_lblk + es->es_len - 1;
> }
>
> So the patch is harmless, and I can see how it might fix what you were
> seeing --- but I'm a bit nervous that I can't reproduce it and the
> commit description claims that it reproduces easily; and we should
> have never allowed the entry to have gotten introduced into the
> extents status tree in the first place, and if it had been introduced,
> it should have been caught before it was returned by
> ext4_es_find_extent_range().
>
> Can you give more details about the reproducer; can you double check
> the test id, and how easily you can trigger the failure, and what is
> the hardware you used to run the test?

So, here are a few more details that may clarify, and that I should have added
to the commit description:

When the test hangs, the test is blocked mounting the flakey device:

   mount -t ext4 -o acl,user_xattr /dev/mapper/flakey-test /mnt/scratch

which will eventually call into ext4_ext_map_blocks(), triggering the bug.

Also, some more code instrumentation shows that after the call to
ext4_ext_find_hole(), the 'hole_start' will be set to '1' and 'len' to
'0xfffffffe'.  This '0xfffffffe' value is a bit odd, but it comes from the
fact that, in ext4_ext_find_hole(), the call to
ext4_ext_next_allocated_block() will return EXT_MAX_BLOCKS and 'len' will
thus be set to 'EXT_MAX_BLOCKS - 1'.

Does this make sense?

Cheers,
-- 
Luis


* Re: [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks()
  2024-05-09 15:16  9%   ` Luis Henriques
@ 2024-05-09 16:39  7%     ` Theodore Ts'o
  2024-05-09 17:23  0%       ` Luis Henriques
  0 siblings, 1 reply; 200+ results
From: Theodore Ts'o @ 2024-05-09 16:39 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-mm, linux-kernel,
	adilger.kernel, jack, ritesh.list, hch, djwong, willy, zokeefe,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

On Thu, May 09, 2024 at 04:16:34PM +0100, Luis Henriques wrote:
> 
> It looks like it's easy to trigger an infinite loop here using fstest
> generic/039.  If I understand it correctly (which doesn't happen as often
> as I'd like), this is due to an integer overflow in the 'if' condition,
> and should be fixed with the patch below.

Thanks for the report.  However, I can't reproduce the failure, and
looking at generic/039, I don't see how it could be relevant to the
code path in question.  Generic/039 creates a test symlink with two
hard links in the same directory, syncs the file system, and then
removes one of the hard links, and then drops access to the block
device using dmflakey.  So I don't see how the extent code would be
involved at all.  Are you sure that you have the correct test listed?

Looking at the code in question in fs/ext4/extents.c:

again:
	ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
				  hole_start + len - 1, &es);
	if (!es.es_len)
		goto insert_hole;

  	 * There's a delalloc extent in the hole, handle it if the delalloc
  	 * extent is in front of, behind and straddle the queried range.
  	 */
 -	if (lblk >= es.es_lblk + es.es_len) {
 +	if (lblk >= ((__u64) es.es_lblk) + es.es_len) {
  		/*
  		 * The delalloc extent is in front of the queried range,
  		 * find again from the queried start block.
		len -= lblk - hole_start;
		hole_start = lblk;
		goto again;

lblk and es.es_lblk are both __u32.  So the infinite loop is
presumably because es.es_lblk + es.es_len has overflowed.  This should
never happen(tm), and in fact we have a test for this case which
*should* have gotten tripped when ext4_es_find_extent_range() calls
__es_tree_search() in fs/ext4/extents_status.c:

static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
{
	BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
	return es->es_lblk + es->es_len - 1;
}

So the patch is harmless, and I can see how it might fix what you were
seeing --- but I'm a bit nervous that I can't reproduce it and the
commit description claims that it reproduces easily; and we should
have never allowed the entry to have gotten introduced into the
extents status tree in the first place, and if it had been introduced,
it should have been caught before it was returned by
ext4_es_find_extent_range().

Can you give more details about the reproducer; can you double check
the test id, and how easily you can trigger the failure, and what is
the hardware you used to run the test?

Many thanks,

					- Ted


* Re: [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks()
  @ 2024-05-09 15:16  9%   ` Luis Henriques
  2024-05-09 16:39  7%     ` Theodore Ts'o
  0 siblings, 1 reply; 200+ results
From: Luis Henriques @ 2024-05-09 15:16 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-mm, linux-kernel, tytso,
	adilger.kernel, jack, ritesh.list, hch, djwong, willy, zokeefe,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

On Sat 27 Jan 2024 09:58:02 AM +08, Zhang Yi wrote;
<...>
> +static ext4_lblk_t ext4_ext_determine_insert_hole(struct inode *inode,
> +						  struct ext4_ext_path *path,
> +						  ext4_lblk_t lblk)
> +{
> +	ext4_lblk_t hole_start, len;
> +	struct extent_status es;
> +
> +	hole_start = lblk;
> +	len = ext4_ext_find_hole(inode, path, &hole_start);
> +again:
> +	ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
> +				  hole_start + len - 1, &es);
> +	if (!es.es_len)
> +		goto insert_hole;
> +
> +	/*
> +	 * There's a delalloc extent in the hole, handle it if the delalloc
> +	 * extent is in front of, behind and straddle the queried range.
> +	 */
> +	if (lblk >= es.es_lblk + es.es_len) {
> +		/*
> +		 * The delalloc extent is in front of the queried range,
> +		 * find again from the queried start block.
> +		 */
> +		len -= lblk - hole_start;
> +		hole_start = lblk;
> +		goto again;

It looks like it's easy to trigger an infinite loop here using fstest
generic/039.  If I understand it correctly (which doesn't happen as often
as I'd like), this is due to an integer overflow in the 'if' condition,
and should be fixed with the patch below.

From 3117af2f8dacad37a2722850421f31075ae9e88d Mon Sep 17 00:00:00 2001
From: "Luis Henriques (SUSE)" <luis.henriques@linux.dev>
Date: Thu, 9 May 2024 15:53:01 +0100
Subject: [PATCH] ext4: fix infinite loop caused by integer overflow

An integer overflow will happen if the extent_status len is set to
EXT_MAX_BLOCKS (0xffffffff).  This may cause an infinite loop in function
ext4_ext_determine_insert_hole(), easily reproducible using fstest
generic/039.

Fixes: 6430dea07e85 ("ext4: correct the hole length returned by ext4_map_blocks()")
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
---
 fs/ext4/extents.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e57054bdc5fd..193121b394f9 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4064,7 +4064,7 @@ static ext4_lblk_t ext4_ext_determine_insert_hole(struct inode *inode,
 	 * There's a delalloc extent in the hole, handle it if the delalloc
 	 * extent is in front of, behind and straddle the queried range.
 	 */
-	if (lblk >= es.es_lblk + es.es_len) {
+	if (lblk >= ((__u64) es.es_lblk) + es.es_len) {
 		/*
 		 * The delalloc extent is in front of the queried range,
 		 * find again from the queried start block.


* Re: [PATCH v15 00/11] Landlock: IOCTL support
  2024-04-19 16:11  2% [PATCH v15 00/11] Landlock: IOCTL support Günther Noack
  2024-04-19 16:11  6% ` [PATCH v15 01/11] landlock: Add IOCTL access right for character and block devices Günther Noack
@ 2024-05-08 10:40  0% ` Mickaël Salaün
  1 sibling, 0 replies; 200+ results
From: Mickaël Salaün @ 2024-05-08 10:40 UTC (permalink / raw)
  To: Günther Noack
  Cc: linux-security-module, Jeff Xu, Arnd Bergmann,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel

This patch series has been in -next for some time now.  I just added
some tiny cosmetic fixes and a missing (on some distros) C header file.
I plan to send it for v6.10 but I'll probably rebase it again because of
kselftest changes.

It is noteworthy that test coverage dropped by 1.5%: from 92.4% to
90.9%.  This is due to the tests not covering the IOCTL compat code.  It
would be good to find a way to cover this case, probably by building
32-bit test binary stubs (to avoid depending on 32-bit libraries).

Thanks again!

 Mickaël


On Fri, Apr 19, 2024 at 04:11:11PM +0000, Günther Noack wrote:
> Hello!
> 
> These patches add simple ioctl(2) support to Landlock.
> 
> Objective
> ~~~~~~~~~
> 
> Make ioctl(2) requests for device files restrictable with Landlock,
> in a way that is useful for real-world applications.
> 
> Proposed approach
> ~~~~~~~~~~~~~~~~~
> 
> Introduce the LANDLOCK_ACCESS_FS_IOCTL_DEV right, which restricts the
> use of ioctl(2) on block and character devices.
> 
> We attach this access right to opened file descriptors, as we
> already do for LANDLOCK_ACCESS_FS_TRUNCATE.
> 
> If LANDLOCK_ACCESS_FS_IOCTL_DEV is handled (restricted in the
> ruleset), the LANDLOCK_ACCESS_FS_IOCTL_DEV right governs the use of
> all device-specific IOCTL commands.  We make exceptions for common and
> known-harmless IOCTL commands such as FIOCLEX, FIONCLEX, FIONBIO and
> FIOASYNC, as well as other IOCTL commands which are implemented in
> fs/ioctl.c.  The full list of these IOCTL commands is given in the
> documentation.
> 
> I believe that this approach works for the majority of use cases, and
> offers a good trade-off between the complexity of the Landlock API and
> implementation, and the flexibility when the feature is used.
> 
> Current limitations
> ~~~~~~~~~~~~~~~~~~~
> 
> With this patch set, ioctl(2) requests can *not* be filtered based on
> file type, device number (dev_t) or on the ioctl(2) request number.
> 
> On the initial RFC patch set [1], we have reached consensus to start
> with this simpler coarse-grained approach, and build additional IOCTL
> restriction capabilities on top in subsequent steps.
> 
> [1] https://lore.kernel.org/linux-security-module/d4f1395c-d2d4-1860-3a02-2a0c023dd761@digikod.net/
> 
> Notable implications of this approach
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> * A processes' existing open file descriptors stay unaffected
>   when a process enables Landlock.
> 
>   This means that in common scenarios, where the terminal file
>   descriptor is inherited from the parent process, the terminal's
>   IOCTLs (ioctl_tty(2)) continue to work.
> 
> * ioctl(2) continues to be available for file descriptors for
>   non-device files.  Example: Network sockets, memfd_create(2),
>   regular files and directories.
> 
> Examples
> ~~~~~~~~
> 
> Starting a sandboxed shell from $HOME with samples/landlock/sandboxer:
> 
>   LL_FS_RO=/ LL_FS_RW=. ./sandboxer /bin/bash
> 
> The LANDLOCK_ACCESS_FS_IOCTL_DEV right is part of the "read-write"
> rights here, so we expect that newly opened device files outside of
> $HOME don't work with most IOCTL commands.
> 
>   * "stty" works: It probes terminal properties
> 
>   * "stty </dev/tty" fails: /dev/tty can be reopened, but the IOCTL is
>     denied.
> 
>   * "eject" fails: ioctls to use CD-ROM drive are denied.
> 
>   * "ls /dev" works: It uses ioctl to get the terminal size for
>     columnar layout
> 
>   * The text editors "vim" and "mg" work.  (GNU Emacs fails because it
>     attempts to reopen /dev/tty.)
> 
> Unaffected IOCTL commands
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> To decide which IOCTL commands should be blanket-permitted, we went
> through the list of IOCTL commands which are handled directly in
> fs/ioctl.c and looked at them individually to understand what they are
> about.
> 
> The following commands are permitted by Landlock unconditionally:
> 
>  * FIOCLEX, FIONCLEX - these work on the file descriptor and
>    manipulate the close-on-exec flag (also available through
>    fcntl(2) with F_SETFD)
>  * FIONBIO, FIOASYNC - these work on the struct file and enable
>    nonblocking-IO and async flags (also available through
>    fcntl(2) with F_SETFL)
> 
> The following commands are also unconditionally permitted by Landlock, because
> they are really operating on the file system's superblock, rather than on the
> file itself (the same functionality is also available from any other file on the
> same file system):
> 
>  * FIFREEZE, FITHAW - work on superblock(!) to freeze/thaw the file
>    system. Requires CAP_SYS_ADMIN.
>  * FIGETBSZ - get file system blocksize
>  * FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH - getting file system properties
> 
> Notably, the command FIONREAD is *not* blanket-permitted,
> because its implementation is device-specific.
> 
> Detailed reasoning about each IOCTL command from fs/ioctl.c is in
> get_required_ioctl_dev_access() in security/landlock/fs.c.
> 
> 
> Related Work
> ~~~~~~~~~~~~
> 
> OpenBSD's pledge(2) [2] restricts ioctl(2) independent of the file
> descriptor which is used.  The implementers maintain multiple
> allow-lists of predefined ioctl(2) operations required for different
> application domains such as "audio", "bpf", "tty" and "inet".
> 
> OpenBSD does not guarantee backwards compatibility to the same extent
> as Linux does, so it's easier for them to update these lists in later
> versions.  It might not be a feasible approach for Linux though.
> 
> [2] https://man.openbsd.org/OpenBSD-7.4/pledge.2
> 
> 
> Implementation Rationale
> ~~~~~~~~~~~~~~~~~~~~~~~~
> 
> A main constraint of this implementation is that the blanket-permitted
> IOCTL commands for device files should never dispatch to the
> device-specific implementations in f_ops->unlocked_ioctl() and
> f_ops->compat_ioctl().
> 
> There are many implementations of these f_ops operations and they are
> too scattered across the kernel to give strong guarantees about them.
> Additionally, some existing implementations perform work before even
> checking whether they support the cmd number which was passed to them.
> 
> 
> In this implementation, we are listing the blanket-permitted IOCTL
> commands in the Landlock implementation, mirroring a subset of the
> IOCTL commands which are directly implemented in do_vfs_ioctl() in
> fs/ioctl.c.  The trade-off is that the Landlock LSM needs to track
> future developments in fs/ioctl.c to keep up to date with that, in
> particular when new IOCTL commands are introduced there, or when they
> are moved there from the f_ops implementations.
> 
> We mitigate this risk in this patch set by adding fs/ioctl.c to the
> paths that are relevant to Landlock in the MAINTAINERS file.
> 
> The trade-off is discussed in more detail in [3].
> 
> 
> Previous versions of this patch set have used different implementation
> approaches to guarantee the main constraint above, which we have
> dismissed due to the following reasons:
> 
> * V10: Introduced a new LSM hook file_vfs_ioctl, which gets invoked
>   just before the call to f_ops->unlocked_ioctl().
> 
>   Not done, because it would have created an avoidable overlap between
>   the file_ioctl and file_vfs_ioctl LSM hooks [4].
> 
> * V11: Introduced an indirection layer in fs/ioctl.c, so that Landlock
>   could figure out the list of IOCTL commands which are handled by
>   do_vfs_ioctl().
> 
>   Not done due to additional indirection and possible performance
>   impact in fs/ioctl.c [5].
> 
> * V12: Introduced a special error code to be returned from the
>   file_ioctl hook, and matching logic that would disallow the call to
>   f_ops->unlocked_ioctl() in case that this error code is returned.
> 
>   Not done because this approach would conflict with Landlock's
>   planned audit logging [6] and because LSM hooks with special error
>   codes are generally discouraged and have led to problems in the
>   past [7].
> 
> Thanks to Arnd Bergmann, Christian Brauner, Kent Overstreet, Mickaël Salaün and
> Paul Moore for guiding this implementation on the right track!
> 
> [3] https://lore.kernel.org/all/ZgLJG0aN0psur5Z7@google.com/
> [4] https://lore.kernel.org/all/CAHC9VhRojXNSU9zi2BrP8z6JmOmT3DAqGNtinvvz=tL1XhVdyg@mail.gmail.com/
> [5] https://lore.kernel.org/all/32b1164e-9d5f-40c0-9a4e-001b2c9b822f@app.fastmail.com
> [6] https://lore.kernel.org/all/20240326.ahyaaPa0ohs6@digikod.net
> [7] https://lore.kernel.org/all/CAHC9VhQJFWYeheR-EqqdfCq0YpvcQX5Scjfgcz1q+jrWg8YsdA@mail.gmail.com/
> 
> 
> Changes
> ~~~~~~~
> 
> V15:
>  * Drop the commit about FS_IOC_GETFSUUID / FS_IOC_GETFSSYSFSPATH --
>    it is already assumed as a prerequisite now.
>  * security/landlock/fs.c:
>    * Add copyright notice for my contributions (also for the truncate
>      patch set)
>  * Tests:
>    * In commit "Test IOCTL support":
>      * Test with /dev/zero instead of /dev/tty
>      * Check only FIONREAD instead of both FIONREAD and TCGETS
>      * Remove a now-unused SKIP()
>    * In test for Named UNIX Domain Sockets:
>      * Do not inline variable assignments in ASSERT() usages
>    * In commit "Exhaustive test for the IOCTL allow-list":
>      * Make IOCTL results deterministic:
>        * Zero the input buffer
>        * Close FD 0 for the ioctl() call, to avoid accidentally using it
>  * Cosmetic changes and cleanups
>    * Remove a leftover mention of "synthetic" access rights
>    * Fix docstring format for is_masked_device_ioctl()
>    * Newline and comment ordering cleanups as discussed in v14 review
> 
> V14:
>  * Revise which IOCTLs are permitted.
>    It is almost the same as the vfs_masked_device_ioctl() hooks from
>    https://lore.kernel.org/all/20240219183539.2926165-1-mic@digikod.net/,
>    with the following differences:
>    * Added cases for FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH
>    * Do not blanket-permit FS_IOC_{GET,SET}{FLAGS,XATTR}.
>      They fall back to the device implementation.
>  * fs/ioctl:
>    * Small prerequisite change so that FS_IOC_GETFSUUID and
>      FS_IOC_GETFSSYSFSPATH do not fall back to the device implementation.
>    * Slightly rephrase wording in the warning above do_vfs_ioctl().
>  * Implement compat handler
>  * Improve UAPI header documentation
>  * Code structure
>    * Change helper function style to return a boolean
>    * Reorder structure of the IOCTL hooks (much cleaner now -- thanks for the
>      hint, Mickaël!)
>    * Extract is_device() helper
> 
> V13:
>  * Using the existing file_ioctl hook and a hardcoded list of IOCTL commands.
>    (See the section on implementation rationale above.)
>  * Add support for FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH.
>    
> V12:
>  * Rebased on Arnd's proposal:
>    https://lore.kernel.org/all/32b1164e-9d5f-40c0-9a4e-001b2c9b822f@app.fastmail.com/
>    This means that:
>    * the IOCTL security hooks can return a special value ENOFILEOPS,
>      which is treated specially in fs/ioctl.c to permit the IOCTL,
>      but only as long as it does not call f_ops->unlocked_ioctl or
>      f_ops->compat_ioctl.
>  * The only change compared to V11 is commit 1, as well as a small
>    adaptation in commit 2 (the Landlock implementation needs to
>    return the new special value).  The tests and documentation commits
>    are exactly the same as before.
> 
> V11:
>  * Rebased on Mickaël's proposal to refactor fs/ioctl.c:
>    https://lore.kernel.org/all/20240315145848.1844554-1-mic@digikod.net/
>    This means that:
>    * we do not add the file_vfs_ioctl() hook as in V10
>    * we add vfs_get_ioctl_handler() instead, so that Landlock
>      can query which of the IOCTL commands are handled in do_vfs_ioctl()
> 
>    That proposal is used here unmodified (except for minor typos in the commit
>    description).
>  * Use the hook_ioctl_compat LSM hook as well.
> 
> V10:
>  * Major change: only restrict IOCTL invocations on device files
>    * Rename access right to LANDLOCK_ACCESS_FS_IOCTL_DEV
>    * Remove the notion of synthetic access rights and IOCTL right groups
>  * Introduce a new LSM hook file_vfs_ioctl, which gets invoked just
>    before the call to f_ops->unlocked_ioctl()
>  * Documentation
>    * Various complications were removed or simplified:
>      * Suggestion to mount file systems as nodev is not needed any more,
>        as Landlock already lets users distinguish device files.
>      * Remarks about fscrypt were removed.  The fscrypt-related IOCTLs only
>        applied to regular files and directories, so this patch does not affect
>        them any more.
>      * Various documentation of the IOCTL grouping approach was removed,
>        as it's not needed any more.
> 
> V9:
>  * in “landlock: Add IOCTL access right”:
>    * Change IOCTL group names and grouping as discussed with Mickaël.
>      This makes the grouping coarser, and we occasionally rely on the
>      underlying implementation to perform the appropriate read/write
>      checks.
>      * Group IOCTL_RW (one of READ_FILE, WRITE_FILE or READ_DIR):
>        FIONREAD, FIOQSIZE, FIGETBSZ
>      * Group IOCTL_RWF (one of READ_FILE or WRITE_FILE):
>        FS_IOC_FIEMAP, FIBMAP, FIDEDUPERANGE, FICLONE, FICLONERANGE,
>        FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64,
>        FS_IOC_ZERO_RANGE
>    * Exempt pipe file descriptors from IOCTL restrictions,
>      even for named pipes which are opened from the file system.
>      This is to be consistent with anonymous pipes created with pipe(2).
>      As discussed in https://lore.kernel.org/r/ZP7lxmXklksadvz+@google.com
>    * Document rationale for the IOCTL grouping in the code
>    * Use __attribute_const__
>    * Rename required_ioctl_access() to get_required_ioctl_access()
>  * Selftests
>    * Simplify IOCTL test fixtures as a result of simpler grouping.
>    * Test that IOCTLs are permitted on named pipe FDs.
>    * Test that IOCTLs are permitted on named Unix Domain Socket FDs.
>    * Work around compilation issue with old GCC / glibc.
>      https://sourceware.org/glibc/wiki/Synchronizing_Headers
>      Thanks to Huyadi <hu.yadi@h3c.com>, who pointed this out in
>      https://lore.kernel.org/all/f25be6663bcc4608adf630509f045a76@h3c.com/
>      and Mickaël, who fixed it through #include reordering.
>  * Documentation changes
>    * Reword "IOCTL commands" section a bit
>    * s/permit/allow/
>    * s/access right/right/, if preceded by LANDLOCK_ACCESS_FS_*
>    * s/IOCTL/FS_IOCTL/ in ASCII table
>    * Update IOCTL grouping documentation in header file
>  * Removed a few of the earlier commits in this patch set,
>    which have already been merged.
> 
> V8:
>  * Documentation changes
>    * userspace-api/landlock.rst:
>      * Add an extra paragraph about how the IOCTL right combines
>        when used with other access rights.
>      * Explain better the circumstances under which passing of
>        file descriptors between different Landlock domains can happen
>    * limits.h: Add comment to explain public vs internal FS access rights
>    * Add a paragraph in the commit to explain better why the IOCTL
>      right works as it does
> 
> V7:
>  * in “landlock: Add IOCTL access right”:
>    * Make IOCTL_GROUPS a #define so that static_assert works even on
>      old compilers (bug reported by Intel about PowerPC GCC9 config)
>    * Adapt indentation of IOCTL_GROUPS definition
>    * Add missing dots in kernel-doc comments.
>  * in “landlock: Remove remaining "inline" modifiers in .c files”:
>    * explain reasoning in commit message
> 
> V6:
>  * Implementation:
>    * Check that only publicly visible access rights can be used when adding a
>      rule (rather than the synthetic ones).  Thanks Mickaël for spotting that!
>    * Move all functionality related to IOCTL groups and synthetic access rights
>      into the same place at the top of fs.c
>    * Move kernel doc to the .c file in one instance
>    * Smaller code style issues (upcase IOCTL, vardecl at block start)
>    * Remove inline modifier from functions in .c files
>  * Tests:
>    * use SKIP
>    * Rename 'fd' to dir_fd and file_fd where appropriate
>    * Remove duplicate "ioctl" mentions from test names
>    * Rename "permitted" to "allowed", in ioctl and ftruncate tests
>    * Do not add rules if access is 0, in test helper
> 
> V5:
>  * Implementation:
>    * move IOCTL group expansion logic into fs.c (implementation suggested by
>      mic)
>    * rename IOCTL_CMD_G* constants to LANDLOCK_ACCESS_FS_IOCTL_GROUP*
>    * fs.c: create ioctl_groups constant
>    * add "const" to some variables
>  * Formatting and docstring fixes (including wrong kernel-doc format)
>  * samples/landlock: fix ABI version and fallback attribute (mic)
>  * Documentation
>    * move header documentation changes into the implementation commit
>    * spell out how FIFREEZE, FITHAW and attribute-manipulation ioctls from
>      fs/ioctl.c are handled
>    * change ABI 4 to ABI 5 in some missing places
> 
> V4:
>  * use "synthetic" IOCTL access rights, as previously discussed
>  * testing changes
>    * use a large fixture-based test, for more exhaustive coverage,
>      and replace some of the earlier tests with it
>  * rebased on mic-next
> 
> V3:
>  * always permit the IOCTL commands FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC and
>    FIONREAD, independent of LANDLOCK_ACCESS_FS_IOCTL
>  * increment ABI version in the same commit where the feature is introduced
>  * testing changes
>    * use FIOQSIZE instead of TTY IOCTL commands
>      (FIOQSIZE works with regular files, directories and memfds)
>    * run the memfd test with both Landlock enabled and disabled
>    * add a test for the always-permitted IOCTL commands
> 
> V2:
>  * rebased on mic-next
>  * added documentation
>  * exercise ioctl(2) in the memfd test
>  * test: Use layout0 for the test
> 
> ---
> 
> V1: https://lore.kernel.org/all/20230502171755.9788-1-gnoack3000@gmail.com/
> V2: https://lore.kernel.org/all/20230623144329.136541-1-gnoack@google.com/
> V3: https://lore.kernel.org/all/20230814172816.3907299-1-gnoack@google.com/
> V4: https://lore.kernel.org/all/20231103155717.78042-1-gnoack@google.com/
> V5: https://lore.kernel.org/all/20231117154920.1706371-1-gnoack@google.com/
> V6: https://lore.kernel.org/all/20231124173026.3257122-1-gnoack@google.com/
> V7: https://lore.kernel.org/all/20231201143042.3276833-1-gnoack@google.com/
> V8: https://lore.kernel.org/all/20231208155121.1943775-1-gnoack@google.com/
> V9: https://lore.kernel.org/all/20240209170612.1638517-1-gnoack@google.com/
> V10: https://lore.kernel.org/all/20240309075320.160128-1-gnoack@google.com/
> V11: https://lore.kernel.org/all/20240322151002.3653639-1-gnoack@google.com/
> V12: https://lore.kernel.org/all/20240325134004.4074874-1-gnoack@google.com/
> V13: https://lore.kernel.org/all/20240327131040.158777-1-gnoack@google.com/
> V14: https://lore.kernel.org/all/20240405214040.101396-1-gnoack@google.com/
> 
> Günther Noack (11):
>   landlock: Add IOCTL access right for character and block devices
>   selftests/landlock: Test IOCTL support
>   selftests/landlock: Test IOCTL with memfds
>   selftests/landlock: Test ioctl(2) and ftruncate(2) with open(O_PATH)
>   selftests/landlock: Test IOCTLs on named pipes
>   selftests/landlock: Check IOCTL restrictions for named UNIX domain
>     sockets
>   selftests/landlock: Exhaustive test for the IOCTL allow-list
>   samples/landlock: Add support for LANDLOCK_ACCESS_FS_IOCTL_DEV
>   landlock: Document IOCTL support
>   MAINTAINERS: Notify Landlock maintainers about changes to fs/ioctl.c
>   fs/ioctl: Add a comment to keep the logic in sync with LSM policies
> 
>  Documentation/userspace-api/landlock.rst     |  76 ++-
>  MAINTAINERS                                  |   1 +
>  fs/ioctl.c                                   |   3 +
>  include/uapi/linux/landlock.h                |  38 +-
>  samples/landlock/sandboxer.c                 |  13 +-
>  security/landlock/fs.c                       | 225 ++++++++-
>  security/landlock/limits.h                   |   2 +-
>  security/landlock/syscalls.c                 |   2 +-
>  tools/testing/selftests/landlock/base_test.c |   2 +-
>  tools/testing/selftests/landlock/fs_test.c   | 486 ++++++++++++++++++-
>  10 files changed, 805 insertions(+), 43 deletions(-)
> 
> 
> base-commit: fe611b72031cc211a96cf0b3b58838953950cb13
> -- 
> 2.44.0.769.g3c40516874-goog
> 
> 

^ permalink raw reply	[relevance 0%]

* [PATCH v3 01/10] ext4: factor out a common helper to query extent map
  2024-05-08  6:12  5% [PATCH v3 00/10] ext4: support adding multi-delalloc blocks Zhang Yi
@ 2024-05-08  6:12 11% ` Zhang Yi
  0 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-05-08  6:12 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	ritesh.list, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Factor out a new common helper, ext4_map_query_blocks(), from
ext4_da_map_blocks(); it queries and returns the extent map status along
the inode's extent path. No logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/inode.c | 57 +++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 537803250ca9..6a41172c06e1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -453,6 +453,35 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
 }
 #endif /* ES_AGGRESSIVE_TEST */
 
+static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
+				 struct ext4_map_blocks *map)
+{
+	unsigned int status;
+	int retval;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		retval = ext4_ext_map_blocks(handle, inode, map, 0);
+	else
+		retval = ext4_ind_map_blocks(handle, inode, map, 0);
+
+	if (retval <= 0)
+		return retval;
+
+	if (unlikely(retval != map->m_len)) {
+		ext4_warning(inode->i_sb,
+			     "ES len assertion failed for inode "
+			     "%lu: retval %d != map->m_len %d",
+			     inode->i_ino, retval, map->m_len);
+		WARN_ON(1);
+	}
+
+	status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+			EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+	ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+			      map->m_pblk, status);
+	return retval;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -1744,33 +1773,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	down_read(&EXT4_I(inode)->i_data_sem);
 	if (ext4_has_inline_data(inode))
 		retval = 0;
-	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
 	else
-		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
-	if (retval < 0) {
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
-	if (retval > 0) {
-		unsigned int status;
-
-		if (unlikely(retval != map->m_len)) {
-			ext4_warning(inode->i_sb,
-				     "ES len assertion failed for inode "
-				     "%lu: retval %d != map->m_len %d",
-				     inode->i_ino, retval, map->m_len);
-			WARN_ON(1);
-		}
-
-		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
-				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status);
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
+		retval = ext4_map_query_blocks(NULL, inode, map);
 	up_read(&EXT4_I(inode)->i_data_sem);
+	if (retval)
+		return retval;
 
 add_delayed:
 	down_write(&EXT4_I(inode)->i_data_sem);
-- 
2.39.2



* [PATCH v3 00/10] ext4: support adding multi-delalloc blocks
@ 2024-05-08  6:12  5% Zhang Yi
  2024-05-08  6:12 11% ` [PATCH v3 01/10] ext4: factor out a common helper to query extent map Zhang Yi
  0 siblings, 1 reply; 200+ results
From: Zhang Yi @ 2024-05-08  6:12 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack,
	ritesh.list, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Changes since v2:
 - Improve the commit messages in patches 2, 4 and 6 as Ritesh and Jan
   suggested, to make the changes clearer.
 - Add patch 3, adding a warning if the delalloc counters are still not
   zero on inactive.
 - In patch 6, add a WARN in ext4_es_insert_delayed_extent(), strictly
   requiring the end_allocated parameter to be set to false if the
   inserted extent belongs to one cluster.
 - In patch 9, modify the reserve blocks math formula as Jan suggested,
   to prevent the count from going negative.
 - In patch 10, update the stale ext4_da_map_blocks() function comments.

Hello!

This patch series is part 2 of the preparatory changes for the buffered
IO iomap conversion. I picked these out of my buffered IO iomap
conversion RFC series v3 [1], added a fix for an issue found in the
current ext4 code, and also added bigalloc feature support. Please see
the following patches for details.

The first 3 patches fix an incorrect delalloc reserved blocks count and
add a warning to make such issues easy to detect, the next 6 patches
make the ext4_insert_delayed_block() call path support inserting
multiple delalloc blocks at a time, and the last patch makes
ext4_da_map_blocks() unaware of buffer_heads, in preparation for iomap.

This patch set has passed the 'kvm-xfstests -g auto' tests; I hope it
can be reviewed and merged first.

[1] https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/

Thanks,
Yi.

---
v2: https://lore.kernel.org/linux-ext4/20240410034203.2188357-1-yi.zhang@huaweicloud.com/

Zhang Yi (10):
  ext4: factor out a common helper to query extent map
  ext4: check the extent status again before inserting delalloc block
  ext4: warn if delalloc counters are not zero on inactive
  ext4: trim delalloc extent
  ext4: drop iblock parameter
  ext4: make ext4_es_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_reserve_space() reserve multi-clusters
  ext4: factor out check for whether a cluster is allocated
  ext4: make ext4_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_map_blocks() buffer_head unaware

 fs/ext4/extents_status.c    |  70 +++++++---
 fs/ext4/extents_status.h    |   5 +-
 fs/ext4/inode.c             | 248 +++++++++++++++++++++++-------------
 fs/ext4/super.c             |   6 +-
 include/trace/events/ext4.h |  26 ++--
 5 files changed, 231 insertions(+), 124 deletions(-)

-- 
2.39.2



* Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
  2024-05-07 19:00  6%             ` Andrii Nakryiko
@ 2024-05-08  1:20  0%               ` Liam R. Howlett
  0 siblings, 0 replies; 200+ results
From: Liam R. Howlett @ 2024-05-08  1:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Greg KH, Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm,
	linux-kernel, bpf, linux-mm, Suren Baghdasaryan, Matthew Wilcox

* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240507 15:01]:
> On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
...
> > >
> > > As for the mmap_read_lock_killable() (is that what we are talking
> > > about?), I'm happy to use anything else available, please give me a
> > > pointer. But I suspect given how fast and small this new API is,
> > > mmap_read_lock_killable() in it is not comparable to holding it for
> > > producing /proc/<pid>/maps contents.
> >
> > Yes, mmap_read_lock_killable() is the mmap lock (formally known as the
> > mmap sem).
> >
> > You can see examples of avoiding the mmap lock by use of rcu in
> > mm/memory.c lock_vma_under_rcu() which is used in the fault path.
> > userfaultfd has an example as well. But again, remember that not all
> > archs have this functionality, so you'd need to fall back to full mmap
> > locking.
> 
> Thanks for the pointer (didn't see email when replying on the other thread).
> 
> I looked at lock_vma_under_rcu() quickly, and seems like it's designed
> to find VMA that covers given address, but not the next closest one.
> So it's a bit problematic for the API I'm adding, as
> PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to
> COVERING_OR_NEXT_VMA, if necessary), is quite important for the use
> cases we have. But maybe some variation of lock_vma_under_rcu() can be
> added that would fit this case?

Yes, as long as we have the rcu read lock, we can use the same
vma_next() calls you use today.  We will have to be careful not to use
the vma while it's being altered, but per-vma locking should provide
that functionality for you.

> 
> >
> > Certainly a single lookup and copy will be faster than a 4k buffer
> > filling copy, but you will be walking the tree O(n) times, where n is
> > the vma count.  This isn't as efficient as multiple lookups in a row as
> > we will re-walk from the top of the tree. You will also need to contend
> > with the fact that the chance of the vmas changing between calls is much
> > higher here too - if that's an issue. Neither of these issues go away
> > with use of the rcu locking instead of the mmap lock, but we can be
> > quite certain that we won't cause locking contention.
> 
> You are right about O(n) times, but note that for symbolization cases
> I'm describing, this n will be, generally, *much* smaller than a total
> number of VMAs within the process. It's a huge speed up in practice.
> This is because we pre-sort addresses in user-space, and then we query
> VMA for the first address, but then we quickly skip all the other
> addresses that are already covered by this VMA, and so the next
> request will query a new VMA that covers another subset of addresses.
> This way we'll get the minimal number of VMAs that cover captured
> addresses (which in the case of stack traces would be a few VMAs
> belonging to executable sections of process' binary plus a bunch of
> shared libraries).

This also implies you won't have to worry about shifting addresses?  I'd
think that the reference to the mm means none of these are going to be
changing at the point of the calls (not exiting).

Given your use case, I'm surprised you're looking for the next vma at
all.

Thanks,
Liam


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-06 19:16  7%             ` Arnaldo Carvalho de Melo
@ 2024-05-07 21:55  7%               ` Namhyung Kim
  0 siblings, 0 replies; 200+ results
From: Namhyung Kim @ 2024-05-07 21:55 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Andrii Nakryiko, Jiri Olsa, Ian Rogers, Greg KH, Andrii Nakryiko,
	linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Daniel Müller, linux-perf-use.

On Mon, May 6, 2024 at 12:16 PM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Mon, May 06, 2024 at 03:53:40PM -0300, Arnaldo Carvalho de Melo wrote:
> > On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote:
> > > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > > > it, saving resources.
> >
> > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >
> > > > > > Where is the userspace code that uses this new api you have created?
> >
> > > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > > > ioctl() API to solve a common problem (as described above) in patch
> > > > > #5. The plan is to put it in mentioned blazesym library at the very
> > > > > least.
> > > > >
> > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > > > linux-perf-user), as they need to do stack symbolization as well.
> >
> > > I think the general use case in perf is different.  This ioctl API is great
> > > for live tracing of a single (or a small number of) process(es).  And
> > > yes, perf tools have those tracing use cases too.  But I think the
> > > major use case of perf tools is system-wide profiling.
> >
> > > For system-wide profiling, you need to process samples of many
> > > different processes at a high frequency.  Now perf record doesn't
> > > process them and just save it for offline processing (well, it does
> > > at the end to find out build-ID but it can be omitted).
> >
> > Since:
> >
> >   Author: Jiri Olsa <jolsa@kernel.org>
> >   Date:   Mon Dec 14 11:54:49 2020 +0100
> >   1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events")
> >
> > We don't need to to process the events to find the build ids. I haven't
> > checked if we still do it to find out which DSOs had hits, but we
> > shouldn't need to do it for build-ids (unless they were not in memory
> > when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I
> > haven't checked but IIRC is a possibility if that ELF part isn't in
> > memory at the time we want to copy it).
>
> > If we're still traversing it like that I guess we can have a knob and
> > make it the default to not do that and instead create the perf.data
> > build ID header table with all the build-ids we got from
> > PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event
> > processing at the end of a 'perf record' session.
>
> But then we don't process the PERF_RECORD_MMAP2 in 'perf record', it
> just goes on directly to the perf.data file :-\

Yep, we don't process build-IDs at the end if --buildid-mmap
option is given.  It won't have build-ID header table but it's
not needed anymore and perf report can know build-ID from
MMAP2 directly.

Thanks,
Namhyung


* Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
  @ 2024-05-07 19:00  6%             ` Andrii Nakryiko
  2024-05-08  1:20  0%               ` Liam R. Howlett
  0 siblings, 1 reply; 200+ results
From: Andrii Nakryiko @ 2024-05-07 19:00 UTC (permalink / raw)
  To: Liam R. Howlett, Andrii Nakryiko, Greg KH, Andrii Nakryiko,
	linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Suren Baghdasaryan, Matthew Wilcox

On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240507 12:28]:
> > On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > .. Adding Suren & Willy to the Cc
> > >
> > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]:
> > > > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > I also did an strace run of both cases. In text-based one the tool did
> > > > > > 68 read() syscalls, fetching up to 4KB of data in one go.
> > > > >
> > > > > Why not fetch more at once?
> > > > >
> > > >
> > > > I didn't expect to be interrogated so much on the performance of the
> > > > text parsing front, sorry. :) You can probably tune this, but where is
> > > > the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> > > > production numbers.
> > >
> > > The reason the file reads are limited to 4KB is because this file is
> > > used for monitoring processes.  We have a significant number of
> > > organisations polling this file so frequently that the mmap lock
> > > contention becomes an issue. (reading a file is free, right?)  People
> > > also tend to try to figure out why a process is slow by reading this
> > > file - which amplifies the lock contention.
> > >
> > > What happens today is that the lock is yielded after 4KB to allow time
> > > for mmap writes to happen.  This also means your data may be
> > > inconsistent from one 4KB block to the next (the write may be around
> > > this boundary).
> > >
> > > This new interface also takes the lock in do_procmap_query() and does
> > > the 4kb blocks as well.  Extending this size means more time spent
> > > blocking mmap writes, but a more consistent view of the world (less
> > > "tearing" of the addresses).
> >
> > Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
> > does a single VMA look up (presumably O(logN) operation) using a
> > single vma_iter_init(addr) + vma_next() call on vma_iterator.
>
> Sorry, I read this:
>
> +       if (usize > PAGE_SIZE)
> +               return -E2BIG;
>
> And thought you were going to return many vmas in that buffer.  I see
> now that you are doing one copy at a time.
>
> >
> > As for the mmap_read_lock_killable() (is that what we are talking
> > about?), I'm happy to use anything else available, please give me a
> > pointer. But I suspect given how fast and small this new API is,
> > mmap_read_lock_killable() in it is not comparable to holding it for
> > producing /proc/<pid>/maps contents.
>
> Yes, mmap_read_lock_killable() is the mmap lock (formally known as the
> mmap sem).
>
> You can see examples of avoiding the mmap lock by use of rcu in
> mm/memory.c lock_vma_under_rcu() which is used in the fault path.
> userfaultfd has an example as well. But again, remember that not all
> archs have this functionality, so you'd need to fall back to full mmap
> locking.

Thanks for the pointer (didn't see email when replying on the other thread).

I looked at lock_vma_under_rcu() quickly, and seems like it's designed
to find VMA that covers given address, but not the next closest one.
So it's a bit problematic for the API I'm adding, as
PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to
COVERING_OR_NEXT_VMA, if necessary), is quite important for the use
cases we have. But maybe some variation of lock_vma_under_rcu() can be
added that would fit this case?
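For reference, the kind of fallback being discussed might look roughly like this kernel-style sketch. lock_vma_under_rcu(), vma_end_read(), vma_lookup() and the mmap-lock helpers are real kernel APIs, but the surrounding flow here is an assumption for illustration, not taken from any posted patch:

```c
/* sketch: try the per-VMA rcu path first, fall back to the full mmap lock */
vma = lock_vma_under_rcu(mm, addr);
if (vma) {
	/* ... use vma ... */
	vma_end_read(vma);	/* drop the per-VMA lock */
} else {
	if (mmap_read_lock_killable(mm))
		return -EINTR;
	vma = vma_lookup(mm, addr);	/* covering VMA only, no "next" */
	/* ... use vma ... */
	mmap_read_unlock(mm);
}
```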

>
> Certainly a single lookup and copy will be faster than a 4k buffer
> filling copy, but you will be walking the tree O(n) times, where n is
> the vma count.  This isn't as efficient as multiple lookups in a row as
> we will re-walk from the top of the tree. You will also need to contend
> with the fact that the chance of the vmas changing between calls is much
> higher here too - if that's an issue. Neither of these issues go away
> with use of the rcu locking instead of the mmap lock, but we can be
> quite certain that we won't cause locking contention.

You are right about O(n) times, but note that for the symbolization
cases I'm describing, this n will generally be *much* smaller than the
total number of VMAs within the process. It's a huge speedup in
practice. This is because we pre-sort addresses in user space, query
the VMA for the first address, and then quickly skip all the other
addresses that are already covered by this VMA, so the next
request will query a new VMA that covers another subset of addresses.
This way we'll get the minimal number of VMAs that cover captured
addresses (which in the case of stack traces would be a few VMAs
belonging to executable sections of process' binary plus a bunch of
shared libraries).

>
> Thanks,
> Liam
>


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-07 18:10  7%   ` Liam R. Howlett
@ 2024-05-07 18:52  6%     ` Andrii Nakryiko
  0 siblings, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-07 18:52 UTC (permalink / raw)
  To: Liam R. Howlett, Andrii Nakryiko, linux-fsdevel, brauner, viro,
	akpm, linux-kernel, bpf, gregkh, linux-mm

On Tue, May 7, 2024 at 11:10 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii@kernel.org> [240503 20:30]:
> > /proc/<pid>/maps file is extremely useful in practice for various tasks
> > involving figuring out process memory layout, what files are backing any
> > given memory range, etc. One important class of applications that
> > absolutely rely on this are profilers/stack symbolizers. They would
> > normally capture a stack trace containing absolute memory addresses of
> > some functions, and would then use the /proc/<pid>/maps file to find
> > corresponding backing ELF files, file offsets within them, and then
> > continue from there to get yet more information (ELF symbols, DWARF
> > information) to get human-readable symbolic information.
> >
> > As such, there are both performance and correctness requirements
> > involved. This address to VMA information translation has to be done as
> > efficiently as possible, but also not miss any VMA (especially in the
> > case of loading/unloading shared libraries).
> >
> > Unfortunately, for all the /proc/<pid>/maps file universality and
> > usefulness, it doesn't fit the above 100%.
> >
> > First, it's text based, which makes its programmatic use from
> > applications and libraries unnecessarily cumbersome and slow due to the
> > need to do text parsing to get necessary pieces of information.
> >
> > Second, its main purpose is to emit all VMAs sequentially, but in
> > practice captured addresses would fall only into a small subset of all
> > process' VMAs, mainly containing executable text. Yet, a library would
> > need to parse most or all of the contents to find the needed VMAs, as
> > there is no way to skip VMAs that are of no use. An efficient library
> > can do the linear pass relatively cheaply, but it's definitely an
> > querying of the relevant VMA information.
> >
> > Another problem when writing generic stack trace symbolization library
> > is an unfortunate performance-vs-correctness tradeoff that needs to be
> > made. Library has to make a decision to either cache parsed contents of
> > /proc/<pid>/maps to serve future requests (if application requests to
> > symbolize another set of addresses, captured at some later time, which
> > is typical for periodic/continuous profiling cases), avoiding the higher
> > cost of re-parsing this file, or to re-read and re-parse the file on
> > each request. In the former case, more memory is used for
> > the cache and there is a risk of getting stale data if application
> > loaded/unloaded shared libraries, or otherwise changed its set of VMAs
> > through additional mmap() calls (and other means of altering memory
> > address space). In the latter case, it's the performance hit that comes
> > from re-opening the file and re-reading/re-parsing its contents all over
> > again.
> >
> > This patch aims to solve this problem by providing a new API built on
> > top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
> > interface, avoiding the cost and awkwardness of textual representation
> > for programmatic use. It's designed to be extensible and
> > forward/backward compatible by including user-specified field size and
> > using the copy_struct_from_user() approach. But, most importantly, it
> > allows point queries for a specific single address, specified by the
> > user. And this is done efficiently using the VMA iterator.
> >
> > User has a choice to pick either getting VMA that covers provided
> > address or -ENOENT if none is found (exact, least surprising, case). Or,
> > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can
> > get either VMA that covers the address (if there is one), or the closest
> > next VMA (i.e., VMA with the smallest vm_start > addr). The latter allows
> > more efficient use, but, given it could be a surprising behavior,
> > requires an explicit opt-in.
> >
> > Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes
> > sense given it's querying the same set of VMA data. All the permissions
> > checks performed on /proc/<pid>/maps opening fit here as well.
> > ioctl-based implementation is fetching remembered mm_struct reference,
> > but otherwise doesn't interfere with seq_file-based implementation of
> > /proc/<pid>/maps textual interface, and so could be used together or
> > independently without paying any price for that.
> >
> > There is one extra thing that /proc/<pid>/maps doesn't currently
> > provide, and that's an ability to fetch ELF build ID, if present. User
> > has control over whether this piece of information is requested or not
> > by either setting build_id_size field to zero or non-zero maximum buffer
> > size they provided through build_id_addr field (which encodes user
> > pointer as __u64 field).
> >
> > The need to get ELF build ID reliably is an important aspect when
> > dealing with profiling and stack trace symbolization, and
> > /proc/<pid>/maps textual representation doesn't help with this,
> > requiring applications to open underlying ELF binary through
> > /proc/<pid>/map_files/<start>-<end> symlink, which adds extra
> > permissions implications due to giving full access to the binary from
> > (potentially) another process, while all the application is interested
> > in is the build ID. Giving an ability to request just the build ID
> > doesn't introduce
> > any additional security concerns, on top of what /proc/<pid>/maps is
> > already concerned with, simplifying the overall logic.
> >
> > Kernel already implements build ID fetching, which is used from BPF
> > subsystem. We are reusing this code here, but plan follow-up changes
> > to make it work better under the more relaxed assumption (compared to what
> > existing code assumes) of being called from user process context, in
> > which page faults are allowed. BPF-specific implementation currently
> > bails out if necessary part of ELF file is not paged in, all due to
> > extra BPF-specific restrictions (like the need to fetch build ID in
> > restrictive contexts such as NMI handler).
> >
> > Note also, that fetching VMA name (e.g., backing file path, or special
> > hard-coded or user-provided names) is optional just like build ID. If
> > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > it, saving resources.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/fs.h |  32 ++++++++
> >  2 files changed, 197 insertions(+)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 8e503a1635b7..cb7b1ff1a144 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/pkeys.h>
> >  #include <linux/minmax.h>
> >  #include <linux/overflow.h>
> > +#include <linux/buildid.h>
> >
> >  #include <asm/elf.h>
> >  #include <asm/tlb.h>
> > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> >       return do_maps_open(inode, file, &proc_pid_maps_op);
> >  }
> >
> > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > +{
> > +     struct procfs_procmap_query karg;
> > +     struct vma_iterator iter;
> > +     struct vm_area_struct *vma;
> > +     struct mm_struct *mm;
> > +     const char *name = NULL;
> > +     char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > +     __u64 usize;
> > +     int err;
> > +
> > +     if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > +             return -EFAULT;
> > +     if (usize > PAGE_SIZE)
> > +             return -E2BIG;
> > +     if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > +             return -EINVAL;
> > +     err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> > +     if (err)
> > +             return err;
> > +
> > +     if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > +             return -EINVAL;
> > +     if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > +             return -EINVAL;
> > +     if (!!karg.build_id_size != !!karg.build_id_addr)
> > +             return -EINVAL;
> > +
> > +     mm = priv->mm;
> > +     if (!mm || !mmget_not_zero(mm))
> > +             return -ESRCH;
> > +     if (mmap_read_lock_killable(mm)) {
> > +             mmput(mm);
> > +             return -EINTR;
> > +     }
>
> Using the rcu lookup here will allow for more success rate with less
> lock contention.
>

If you have any code pointers, I'd appreciate it. If not, I'll try to
find it myself, no worries.

> > +
> > +     vma_iter_init(&iter, mm, karg.query_addr);
> > +     vma = vma_next(&iter);
> > +     if (!vma) {
> > +             err = -ENOENT;
> > +             goto out;
> > +     }
> > +     /* user wants covering VMA, not the closest next one */
> > +     if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > +         vma->vm_start > karg.query_addr) {
> > +             err = -ENOENT;
> > +             goto out;
> > +     }
>
> The interface you are using is a start address to search from to the end
> of the address space, so this won't work as you intended with the
> PROCFS_PROCMAP_EXACT_OR_NEXT_VMA flag.  I do not think the vma iterator

Maybe the name isn't the best; by "EXACT" here I meant "VMA that
exactly covers the provided address", so maybe "COVERING_OR_NEXT_VMA"
would be better wording.

With that out of the way, I think this API works exactly how I expect
it to work:

# cat /proc/3406/maps | grep -C1 7f42099fe000
7f42099fa000-7f42099fc000 rw-p 00000000 00:00 0
7f42099fc000-7f42099fe000 r--p 00000000 00:21 109331
  /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8
7f42099fe000-7f4209a0e000 r-xp 00002000 00:21 109331
  /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8
7f4209a0e000-7f4209a14000 r--p 00012000 00:21 109331
  /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8

# cat addrs.txt
0x7f42099fe010

# ./procfs_query -f addrs.txt -p 3406 -v -Q
PID: 3406
PATH: addrs.txt
READ 1 addrs!
SORTED ADDRS (1):
ADDR #0: 0x7f42099fe010
VMA FOUND (addr 7f42099fe010): 7f42099fe000-7f4209a0e000 r-xp 00002000
00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8
(build ID: NO, 0 bytes)
RESOLVED ADDRS (1):
RESOLVED   #0: 0x7f42099fe010 -> OFF 0x2010 NAME
/usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8

You can see above that for the requested 0x7f42099fe010 address we got
a VMA that starts before this address: 7f42099fe000-7f4209a0e000,
which is what we want.

Before submitting I ran the tool with /proc/<pid>/maps and ioctl to
"resolve" the exact same set of addresses and I compared results. They
were identical.


Note, there is a small bug in the tool I added in patch #5. I changed
the `-i` argument to `-Q` at the very last moment and didn't update the
code in one place. But other than that I didn't change anything. For
the above output, I added "VMA FOUND" verbose logging to see all the
details of the VMA, not just the resolved offset. I'll add that in v2.

> has the desired interface you want as the single address lookup doesn't
> use the vma iterator.  I'd just run the vma_next() and check the limits.
> See find_exact_vma() for the limit checks.
>
> > +
> > +     karg.vma_start = vma->vm_start;
> > +     karg.vma_end = vma->vm_end;
> > +

[...]


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-04  0:30  9% ` [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
  2024-05-04 15:28  6%   ` Greg KH
  2024-05-04 23:36  9%   ` kernel test robot
@ 2024-05-07 18:10  7%   ` Liam R. Howlett
  2024-05-07 18:52  6%     ` Andrii Nakryiko
  2 siblings, 1 reply; 200+ results
From: Liam R. Howlett @ 2024-05-07 18:10 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, gregkh, linux-mm

* Andrii Nakryiko <andrii@kernel.org> [240503 20:30]:
> /proc/<pid>/maps file is extremely useful in practice for various tasks
> involving figuring out process memory layout, what files are backing any
> given memory range, etc. One important class of applications that
> absolutely rely on this are profilers/stack symbolizers. They would
> normally capture a stack trace containing absolute memory addresses of
> some functions, and would then use the /proc/<pid>/maps file to find
> corresponding backing ELF files, file offsets within them, and then
> continue from there to get yet more information (ELF symbols, DWARF
> information) to get human-readable symbolic information.
> 
> As such, there are both performance and correctness requirements
> involved. This address to VMA information translation has to be done as
> efficiently as possible, but also not miss any VMA (especially in the
> case of loading/unloading shared libraries).
> 
> Unfortunately, for all the /proc/<pid>/maps file universality and
> usefulness, it doesn't fit the above 100%.
> 
> First, it's text based, which makes its programmatic use from
> applications and libraries unnecessarily cumbersome and slow due to the
> need to do text parsing to get necessary pieces of information.
> 
> Second, its main purpose is to emit all VMAs sequentially, but in
> practice captured addresses would fall only into a small subset of all
> process' VMAs, mainly containing executable text. Yet, a library would
> need to parse most or all of the contents to find the needed VMAs, as
> there is no way to skip VMAs that are of no use. An efficient library
> can do the linear pass relatively cheaply, but it's definitely an
> overhead that can be avoided, if there was a way to do more targeted
> querying of the relevant VMA information.
> 
> Another problem when writing generic stack trace symbolization library
> is an unfortunate performance-vs-correctness tradeoff that needs to be
> made. Library has to make a decision to either cache parsed contents of
> /proc/<pid>/maps to serve future requests (if application requests to
> symbolize another set of addresses, captured at some later time, which
> is typical for periodic/continuous profiling cases), avoiding the higher
> cost of re-parsing this file, or to re-read and re-parse the file on
> each request. In the former case, more memory is used for
> the cache and there is a risk of getting stale data if application
> loaded/unloaded shared libraries, or otherwise changed its set of VMAs
> through additional mmap() calls (and other means of altering memory
> address space). In the latter case, it's the performance hit that comes
> from re-opening the file and re-reading/re-parsing its contents all over
> again.
> 
> This patch aims to solve this problem by providing a new API built on
> top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
> interface, avoiding the cost and awkwardness of textual representation
> for programmatic use. It's designed to be extensible and
> forward/backward compatible by including user-specified field size and
> using the copy_struct_from_user() approach. But, most importantly, it
> allows point queries for a specific single address, specified by the
> user. And this is done efficiently using the VMA iterator.
> 
> User has a choice to pick either getting VMA that covers provided
> address or -ENOENT if none is found (exact, least surprising, case). Or,
> with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can
> get either VMA that covers the address (if there is one), or the closest
> next VMA (i.e., VMA with the smallest vm_start > addr). The latter allows
> more efficient use, but, given it could be a surprising behavior,
> requires an explicit opt-in.
> 
> Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes
> sense given it's querying the same set of VMA data. All the permissions
> checks performed on /proc/<pid>/maps opening fit here as well.
> ioctl-based implementation is fetching remembered mm_struct reference,
> but otherwise doesn't interfere with seq_file-based implementation of
> /proc/<pid>/maps textual interface, and so could be used together or
> independently without paying any price for that.
> 
> There is one extra thing that /proc/<pid>/maps doesn't currently
> provide, and that's an ability to fetch ELF build ID, if present. User
> has control over whether this piece of information is requested or not
> by either setting build_id_size field to zero or non-zero maximum buffer
> size they provided through build_id_addr field (which encodes user
> pointer as __u64 field).
> 
> The need to get ELF build ID reliably is an important aspect when
> dealing with profiling and stack trace symbolization, and
> /proc/<pid>/maps textual representation doesn't help with this,
> requiring applications to open underlying ELF binary through
> /proc/<pid>/map_files/<start>-<end> symlink, which adds extra
> permissions implications due to giving full access to the binary from
> (potentially) another process, while all the application is interested
> in is the build ID. Giving an ability to request just the build ID
> doesn't introduce
> any additional security concerns, on top of what /proc/<pid>/maps is
> already concerned with, simplifying the overall logic.
> 
> Kernel already implements build ID fetching, which is used from BPF
> subsystem. We are reusing this code here, but plan follow-up changes
> to make it work better under the more relaxed assumption (compared to what
> existing code assumes) of being called from user process context, in
> which page faults are allowed. BPF-specific implementation currently
> bails out if necessary part of ELF file is not paged in, all due to
> extra BPF-specific restrictions (like the need to fetch build ID in
> restrictive contexts such as NMI handler).
> 
> Note also, that fetching VMA name (e.g., backing file path, or special
> hard-coded or user-provided names) is optional just like build ID. If
> user sets vma_name_size to zero, kernel code won't attempt to retrieve
> it, saving resources.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fs.h |  32 ++++++++
>  2 files changed, 197 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 8e503a1635b7..cb7b1ff1a144 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -22,6 +22,7 @@
>  #include <linux/pkeys.h>
>  #include <linux/minmax.h>
>  #include <linux/overflow.h>
> +#include <linux/buildid.h>
>  
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
>  	return do_maps_open(inode, file, &proc_pid_maps_op);
>  }
>  
> +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> +{
> +	struct procfs_procmap_query karg;
> +	struct vma_iterator iter;
> +	struct vm_area_struct *vma;
> +	struct mm_struct *mm;
> +	const char *name = NULL;
> +	char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> +	__u64 usize;
> +	int err;
> +
> +	if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> +		return -EFAULT;
> +	if (usize > PAGE_SIZE)
> +		return -E2BIG;
> +	if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> +		return -EINVAL;
> +	err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> +	if (err)
> +		return err;
> +
> +	if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> +		return -EINVAL;
> +	if (!!karg.vma_name_size != !!karg.vma_name_addr)
> +		return -EINVAL;
> +	if (!!karg.build_id_size != !!karg.build_id_addr)
> +		return -EINVAL;
> +
> +	mm = priv->mm;
> +	if (!mm || !mmget_not_zero(mm))
> +		return -ESRCH;
> +	if (mmap_read_lock_killable(mm)) {
> +		mmput(mm);
> +		return -EINTR;
> +	}

Using the rcu lookup here will allow for more success rate with less
lock contention.

> +
> +	vma_iter_init(&iter, mm, karg.query_addr);
> +	vma = vma_next(&iter);
> +	if (!vma) {
> +		err = -ENOENT;
> +		goto out;
> +	}
> +	/* user wants covering VMA, not the closest next one */
> +	if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> +	    vma->vm_start > karg.query_addr) {
> +		err = -ENOENT;
> +		goto out;
> +	}

The interface you are using is a start address to search from to the end
of the address space, so this won't work as you intended with the
PROCFS_PROCMAP_EXACT_OR_NEXT_VMA flag.  I do not think the vma iterator
has the desired interface you want as the single address lookup doesn't
use the vma iterator.  I'd just run the vma_next() and check the limits.
See find_exact_vma() for the limit checks.

> +
> +	karg.vma_start = vma->vm_start;
> +	karg.vma_end = vma->vm_end;
> +
> +	if (vma->vm_file) {
> +		const struct inode *inode = file_user_inode(vma->vm_file);
> +
> +		karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> +		karg.dev_major = MAJOR(inode->i_sb->s_dev);
> +		karg.dev_minor = MINOR(inode->i_sb->s_dev);
> +		karg.inode = inode->i_ino;
> +	} else {
> +		karg.vma_offset = 0;
> +		karg.dev_major = 0;
> +		karg.dev_minor = 0;
> +		karg.inode = 0;
> +	}
> +
> +	karg.vma_flags = 0;
> +	if (vma->vm_flags & VM_READ)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> +	if (vma->vm_flags & VM_WRITE)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> +	if (vma->vm_flags & VM_EXEC)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> +	if (vma->vm_flags & VM_MAYSHARE)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> +
> +	if (karg.build_id_size) {
> +		__u32 build_id_sz = BUILD_ID_SIZE_MAX;
> +
> +		err = build_id_parse(vma, build_id_buf, &build_id_sz);
> +		if (!err) {
> +			if (karg.build_id_size < build_id_sz) {
> +				err = -ENAMETOOLONG;
> +				goto out;
> +			}
> +			karg.build_id_size = build_id_sz;
> +		}
> +	}
> +
> +	if (karg.vma_name_size) {
> +		size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
> +		const struct path *path;
> +		const char *name_fmt;
> +		size_t name_sz = 0;
> +
> +		get_vma_name(vma, &path, &name, &name_fmt);
> +
> +		if (path || name_fmt || name) {
> +			name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
> +			if (!name_buf) {
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +		}
> +		if (path) {
> +			name = d_path(path, name_buf, name_buf_sz);
> +			if (IS_ERR(name)) {
> +				err = PTR_ERR(name);
> +				goto out;
> +			}
> +			name_sz = name_buf + name_buf_sz - name;
> +		} else if (name || name_fmt) {
> +			name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
> +			name = name_buf;
> +		}
> +		if (name_sz > name_buf_sz) {
> +			err = -ENAMETOOLONG;
> +			goto out;
> +		}
> +		karg.vma_name_size = name_sz;
> +	}
> +
> +	/* unlock and put mm_struct before copying data to user */
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +
> +	if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
> +					       name, karg.vma_name_size)) {
> +		kfree(name_buf);
> +		return -EFAULT;
> +	}
> +	kfree(name_buf);
> +
> +	if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
> +					       build_id_buf, karg.build_id_size))
> +		return -EFAULT;
> +
> +	if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
> +		return -EFAULT;
> +
> +	return 0;
> +
> +out:
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +	kfree(name_buf);
> +	return err;
> +}
> +
> +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	struct seq_file *seq = file->private_data;
> +	struct proc_maps_private *priv = seq->private;
> +
> +	switch (cmd) {
> +	case PROCFS_PROCMAP_QUERY:
> +		return do_procmap_query(priv, (void __user *)arg);
> +	default:
> +		return -ENOIOCTLCMD;
> +	}
> +}
> +
>  const struct file_operations proc_pid_maps_operations = {
>  	.open		= pid_maps_open,
>  	.read		= seq_read,
>  	.llseek		= seq_lseek,
>  	.release	= proc_map_release,
> +	.unlocked_ioctl = procfs_procmap_ioctl,
> +	.compat_ioctl	= procfs_procmap_ioctl,
>  };
>  
>  /*
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 45e4e64fd664..fe8924a8d916 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -393,4 +393,36 @@ struct pm_scan_arg {
>  	__u64 return_mask;
>  };
>  
> +/* /proc/<pid>/maps ioctl */
> +#define PROCFS_IOCTL_MAGIC 0x9f
> +#define PROCFS_PROCMAP_QUERY	_IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> +
> +enum procmap_query_flags {
> +	PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> +};
> +
> +enum procmap_vma_flags {
> +	PROCFS_PROCMAP_VMA_READABLE = 0x01,
> +	PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> +	PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> +	PROCFS_PROCMAP_VMA_SHARED = 0x08,
> +};
> +
> +struct procfs_procmap_query {
> +	__u64 size;
> +	__u64 query_flags;		/* in */
> +	__u64 query_addr;		/* in */
> +	__u64 vma_start;		/* out */
> +	__u64 vma_end;			/* out */
> +	__u64 vma_flags;		/* out */
> +	__u64 vma_offset;		/* out */
> +	__u64 inode;			/* out */
> +	__u32 dev_major;		/* out */
> +	__u32 dev_minor;		/* out */
> +	__u32 vma_name_size;		/* in/out */
> +	__u32 build_id_size;		/* in/out */
> +	__u64 vma_name_addr;		/* in */
> +	__u64 build_id_addr;		/* in */
> +};
> +
>  #endif /* _UAPI_LINUX_FS_H */
> -- 
> 2.43.0
> 
> 


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-06 20:35  6%           ` Arnaldo Carvalho de Melo
@ 2024-05-07 16:36 11%             ` Andrii Nakryiko
  0 siblings, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-07 16:36 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Ian Rogers, Greg KH, Andrii Nakryiko, linux-fsdevel,
	brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Daniel Müller, linux-perf-use.

On Mon, May 6, 2024 at 1:35 PM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Mon, May 06, 2024 at 11:41:43AM -0700, Andrii Nakryiko wrote:
> > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > >
> > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > > it, saving resources.
> > >
> > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > >
> > > > > Where is the userspace code that uses this new api you have created?
> > >
> > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > > ioctl() API to solve a common problem (as described above) in patch
> > > > #5. The plan is to put it in mentioned blazesym library at the very
> > > > least.
> > > >
> > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > > linux-perf-user), as they need to do stack symbolization as well.
> > >
> > > At some point, when BPF iterators became a thing we thought about, IIRC
> > > Jiri did some experimentation, but I lost track, of using BPF to
> > > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> > > as in uapi/linux/perf_event.h:
> > >
> > >         /*
> > >          * The MMAP2 records are an augmented version of MMAP, they add
> > >          * maj, min, ino numbers to be used to uniquely identify each mapping
> > >          *
> > >          * struct {
> > >          *      struct perf_event_header        header;
> > >          *
> > >          *      u32                             pid, tid;
> > >          *      u64                             addr;
> > >          *      u64                             len;
> > >          *      u64                             pgoff;
> > >          *      union {
> > >          *              struct {
> > >          *                      u32             maj;
> > >          *                      u32             min;
> > >          *                      u64             ino;
> > >          *                      u64             ino_generation;
> > >          *              };
> > >          *              struct {
> > >          *                      u8              build_id_size;
> > >          *                      u8              __reserved_1;
> > >          *                      u16             __reserved_2;
> > >          *                      u8              build_id[20];
> > >          *              };
> > >          *      };
> > >          *      u32                             prot, flags;
> > >          *      char                            filename[];
> > >          *      struct sample_id                sample_id;
> > >          * };
> > >          */
> > >         PERF_RECORD_MMAP2                       = 10,
> > >
> > >  *   PERF_RECORD_MISC_MMAP_BUILD_ID      - PERF_RECORD_MMAP2 event
> > >
> > > As perf.data files can be used for many purposes we want them all, so we
> >
> > ok, so because you want them all and you don't know which VMAs will be
> > useful or not, it's a different problem. BPF iterators will be faster
> > purely due to avoiding binary -> text -> binary conversion path, but
> > other than that you'll still retrieve all VMAs.
>
> But not using tons of syscalls to parse text data from /proc.

In terms of syscall *count* you win with 4KB text reads; there are
fewer syscalls because of this 4KB-based batching. But the per-syscall
cost plus the amount of user-space processing is a different matter. My
benchmark in perf (see the patch #5 discussion) suggests that even with
more ioctl() syscalls, perf would win here.

But I also realized that what you really need (I think, correct me if
I'm wrong) is only file-backed VMAs, because all the other ones are
not that useful for symbolization. So I'm adding a minimal change to
my code to allow the user to specify another query flag to only return
file-backed VMAs. I'm going to try it with perf code and see how that
helps. I'll post results in patch #5 thread, once I have them.

>
> > You can still do the same full VMA iteration with this new API, of
> > course, but advantages are probably smaller as you'll be retrieving a
> > full set of VMAs regardless (though it would be interesting to compare
> > anyways).
>
> sure, I can't see how it would be faster, but yeah, interesting to see
> what is the difference.

see patch #5 thread, seems like it's still a bit faster

>
> > > setup a meta data perf file descriptor to go on receiving the new mmaps
> > > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> > > it in parallel, etc:
> > >
> > > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
> > >
> > >  Usage: perf record [<options>] [<command>]
> > >     or: perf record [<options>] -- <command> [<options>]
> > >
> > >         --num-thread-synthesize <n>
> > >                           number of threads to run for event synthesis
> > >         --synth <no|all|task|mmap|cgroup>
> > >                           Fine-tune event synthesis: default=all
> > >
> > > ⬢[acme@toolbox perf-tools-next]$
> > >
> > > For this specific initial synthesis of everything the plan, as mentioned
> > > about Jiri's experiments, was to use a BPF iterator to just feed the
> > > perf ring buffer with those events, that way userspace would just
> > > receive the usual records it gets when a new mmap is put in place, the
> > > BPF iterator would just feed the preexisting mmaps, as instructed via
> > > the perf_event_attr for the perf_event_open syscall.
> > >
> > > For people not wanting BPF, i.e. disabling it altogether in perf or
> > > disabling just BPF skels, then we would fallback to the current method,
> > > or to the one being discussed here when it becomes available.
> > >
> > > One thing to have in mind is for this iterator not to generate duplicate
> > > records for non-pre-existing mmaps, i.e. we would need some generation
> > > number that would be bumped when asking for such pre-existing maps
> > > PERF_RECORD_MMAP2 dumps.
> >
> > Looking briefly at struct vm_area_struct, it doesn't seem like the
> > kernel maintains any sort of generation (at least not at
> > vm_area_struct level), so this would be nice to have, I'm sure, but
>
> Yeah, this would be something specific to the "retrieve me the list of
> VMAs" bulky thing, i.e. the kernel perf code (or the BPF that would
> generate the PERF_RECORD_MMAP2 records by using a BPF vma iterator)
> would bump the generation number and store it to the VMA in
> perf_event_mmap() so that the iterator doesn't consider it, as it is a
> new mmap that is being just sent to whoever is listening, and the perf
> tool that put in place the BPF program to iterate is listening.

Ok, we went on *so many* tangents in emails on this patch set :) Seems
like there are a bunch of perf-specific improvements possible which
are completely irrelevant to the API I'm proposing. Let's please keep
them separate (and you, perf folks, should propose them upstream),
it's getting hard to see what this patch set is actually about with
all the tangential emails.

>
> > isn't really related to adding this API. Once the kernel does have
>
> Well, perf wants to enumerate pre-existing mmaps _and_ after that
> finishes to know about new mmaps, so we need to know a way to avoid
> having the BPF program enumerating pre-existing maps sending
> PERF_RECORD_MMAP2 for maps perf already knows about via a regular
> PERF_RECORD_MMAP2 sent when a new mmap is put in place.
>
> So there is an overlap where perf (or any other tool wanting to
> enumerate all pre-existing maps and new ones) can receive info for the
> same map from the enumerator and from the existing mechanism generating
> PERF_RECORD_MMAP2 records.
>
> - Arnaldo
>
> > this "VMA generation" counter, it can be trivially added to this
> > binary interface (which can't be said about /proc/<pid>/maps,
> > unfortunately).
> >
> > >
> > > > It will be up to other similar projects to adopt this, but we'll
> > > > definitely get this into blazesym as it is actually a problem for the
> > >
> > > At some point looking at plugging blazesym somehow with perf may be
> > > something to consider, indeed.
> >
> > In the above I meant direct use of this new API in perf code itself,
> > but yes, blazesym is a generic library for symbolization that handles
> > ELF/DWARF/GSYM (and I believe more formats), so it indeed might make
> > sense to use it.
> >
> > >
> > > - Arnaldo
> > >
> > > > abovementioned Oculus use case. We already had to make a tradeoff (see
> > > > [2], this wasn't done just because we could, but it was requested by
> > > > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > > > the risk of missing some shared libraries that can be loaded later. It
> > > > would be great to not have to do this tradeoff, which this new API
> > > > would enable.
> > > >
> > > >   [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> > > >
> >
> > [...]

^ permalink raw reply	[relevance 11%]

* Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
  2024-05-07 15:48  0%       ` Liam R. Howlett
@ 2024-05-07 16:27  0%         ` Andrii Nakryiko
    0 siblings, 1 reply; 200+ results
From: Andrii Nakryiko @ 2024-05-07 16:27 UTC (permalink / raw)
  To: Liam R. Howlett, Andrii Nakryiko, Greg KH, Andrii Nakryiko,
	linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Suren Baghdasaryan, Matthew Wilcox

On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> .. Adding Suren & Willy to the Cc
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]:
> > On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > I also did an strace run of both cases. In text-based one the tool did
> > > > 68 read() syscalls, fetching up to 4KB of data in one go.
> > >
> > > Why not fetch more at once?
> > >
> >
> > I didn't expect to be interrogated so much on the performance of the
> > text parsing front, sorry. :) You can probably tune this, but where is
> > the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> > production numbers.
>
> The reason the file reads are limited to 4KB is because this file is
> used for monitoring processes.  We have a significant number of
> organisations polling this file so frequently that the mmap lock
> contention becomes an issue. (reading a file is free, right?)  People
> also tend to try to figure out why a process is slow by reading this
> file - which amplifies the lock contention.
>
> What happens today is that the lock is yielded after 4KB to allow time
> for mmap writes to happen.  This also means your data may be
> inconsistent from one 4KB block to the next (the write may be around
> this boundary).
>
> This new interface also takes the lock in do_procmap_query() and does
> the 4kb blocks as well.  Extending this size means more time spent
> blocking mmap writes, but a more consistent view of the world (less
> "tearing" of the addresses).

Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
does a single VMA look up (presumably O(logN) operation) using a
single vma_iter_init(addr) + vma_next() call on vma_iterator.

As for the mmap_read_lock_killable() (is that what we are talking
about?), I'm happy to use anything else available, please give me a
pointer. But I suspect given how fast and small this new API is,
mmap_read_lock_killable() in it is not comparable to holding it for
producing /proc/<pid>/maps contents.

>
> We are working to reduce these issues by switching the /proc/<pid>/maps
> file to use rcu lookup.  I would recommend we do not proceed with this
> interface using the old method and instead, implement it using rcu from
> the start - if it fits your use case (or we can make it fit your use
> case).
>
> At least, for most page faults, we can work around the lock contention
> (since v6.6), but not all and not on all archs.
>
> ...
>
> >
> > > > In comparison,
> > > > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > > > relevant VMAs.
> > > >
> > > > It is projected that savings from processing big production applications
> > > > would only widen the gap in favor of binary-based querying ioctl API, as
> > > > bigger applications will tend to have even more non-executable VMA
> > > > mappings relative to executable ones.
> > >
> > > Define "bigger applications" please.  Is this some "large database
> > > company workload" type of thing, or something else?
> >
> > I don't have a definition. But I had in mind, as one example, an
> > ads-serving service we use internally (it's a pretty large application
> > by pretty much any metric you can come up with). I just randomly
> > picked one of the production hosts, found one instance of that
> > service, and looked at its /proc/<pid>/maps file. Hopefully it will
> > satisfy your need for specifics.
> >
> > # cat /proc/1126243/maps | wc -c
> > 1570178
> > # cat /proc/1126243/maps | wc -l
> > 28875
> > # cat /proc/1126243/maps | grep ' ..x. ' | wc -l
> > 7347
>
> We have distributions increasing the map_count to an insane number to
> allow games to work [1].  It is, unfortunately, only a matter of time until
> this is regularly an issue as it is being normalised and allowed by an
> increased number of distributions (fedora, arch, ubuntu).  So, despite
> my email address, I am not talking about large database companies here.
>
> Also, note that applications that use guard VMAs double the number for
> the guards.  Fun stuff.
>
> We are really doing a lot in the VMA area to reduce the mmap locking
> contention and it seems you have a use case for a new interface that can
> leverage these changes.
>
> We have at least two talks around this area at LSF if you are attending.

I am attending LSFMM, yes, I'll try to not miss them.

>
> Thanks,
> Liam
>
> [1] https://lore.kernel.org/linux-mm/8f6e2d69-b4df-45f3-aed4-5190966e2dea@valvesoftware.com/
>

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
  2024-05-04 22:13  0%     ` Andrii Nakryiko
@ 2024-05-07 15:48  0%       ` Liam R. Howlett
  2024-05-07 16:27  0%         ` Andrii Nakryiko
  0 siblings, 1 reply; 200+ results
From: Liam R. Howlett @ 2024-05-07 15:48 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Greg KH, Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm,
	linux-kernel, bpf, linux-mm, Suren Baghdasaryan, Matthew Wilcox

.. Adding Suren & Willy to the Cc

* Andrii Nakryiko <andrii.nakryiko@gmail.com> [240504 18:14]:
> On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > I also did an strace run of both cases. In text-based one the tool did
> > > 68 read() syscalls, fetching up to 4KB of data in one go.
> >
> > Why not fetch more at once?
> >
> 
> I didn't expect to be interrogated so much on the performance of the
> text parsing front, sorry. :) You can probably tune this, but where is
> the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> production numbers.

The reason the file reads are limited to 4KB is because this file is
used for monitoring processes.  We have a significant number of
organisations polling this file so frequently that the mmap lock
contention becomes an issue. (reading a file is free, right?)  People
also tend to try to figure out why a process is slow by reading this
file - which amplifies the lock contention.

What happens today is that the lock is yielded after 4KB to allow time
for mmap writes to happen.  This also means your data may be
inconsistent from one 4KB block to the next (the write may be around
this boundary).

This new interface also takes the lock in do_procmap_query() and does
the 4kb blocks as well.  Extending this size means more time spent
blocking mmap writes, but a more consistent view of the world (less
"tearing" of the addresses).

We are working to reduce these issues by switching the /proc/<pid>/maps
file to use rcu lookup.  I would recommend we do not proceed with this
interface using the old method and instead, implement it using rcu from
the start - if it fits your use case (or we can make it fit your use
case).

At least, for most page faults, we can work around the lock contention
(since v6.6), but not all and not on all archs.

...

> 
> > > In comparison,
> > > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > > relevant VMAs.
> > >
> > > It is projected that savings from processing big production applications
> > > would only widen the gap in favor of binary-based querying ioctl API, as
> > > bigger applications will tend to have even more non-executable VMA
> > > mappings relative to executable ones.
> >
> > Define "bigger applications" please.  Is this some "large database
> > company workload" type of thing, or something else?
> 
> I don't have a definition. But I had in mind, as one example, an
> ads-serving service we use internally (it's a pretty large application
> by pretty much any metric you can come up with). I just randomly
> picked one of the production hosts, found one instance of that
> service, and looked at its /proc/<pid>/maps file. Hopefully it will
> satisfy your need for specifics.
> 
> # cat /proc/1126243/maps | wc -c
> 1570178
> # cat /proc/1126243/maps | wc -l
> 28875
> # cat /proc/1126243/maps | grep ' ..x. ' | wc -l
> 7347

We have distributions increasing the map_count to an insane number to
allow games to work [1].  It is, unfortunately, only a matter of time until
this is regularly an issue as it is being normalised and allowed by an
increased number of distributions (fedora, arch, ubuntu).  So, despite
my email address, I am not talking about large database companies here.

Also, note that applications that use guard VMAs double the number for
the guards.  Fun stuff.

We are really doing a lot in the VMA area to reduce the mmap locking
contention and it seems you have a use case for a new interface that can
leverage these changes.

We have at least two talks around this area at LSF if you are attending.

Thanks,
Liam

[1] https://lore.kernel.org/linux-mm/8f6e2d69-b4df-45f3-aed4-5190966e2dea@valvesoftware.com/


^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-06 18:41  6%         ` Andrii Nakryiko
@ 2024-05-06 20:35  6%           ` Arnaldo Carvalho de Melo
  2024-05-07 16:36 11%             ` Andrii Nakryiko
  0 siblings, 1 reply; 200+ results
From: Arnaldo Carvalho de Melo @ 2024-05-06 20:35 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Ian Rogers, Greg KH, Andrii Nakryiko, linux-fsdevel,
	brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Daniel Müller, linux-perf-use.

On Mon, May 06, 2024 at 11:41:43AM -0700, Andrii Nakryiko wrote:
> On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> >
> > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > it, saving resources.
> >
> > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >
> > > > Where is the userspace code that uses this new api you have created?
> >
> > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > ioctl() API to solve a common problem (as described above) in patch
> > > #5. The plan is to put it in mentioned blazesym library at the very
> > > least.
> > >
> > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > linux-perf-user), as they need to do stack symbolization as well.
> >
> > At some point, when BPF iterators became a thing we thought about, IIRC
> > Jiri did some experimentation, but I lost track, of using BPF to
> > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> > as in uapi/linux/perf_event.h:
> >
> >         /*
> >          * The MMAP2 records are an augmented version of MMAP, they add
> >          * maj, min, ino numbers to be used to uniquely identify each mapping
> >          *
> >          * struct {
> >          *      struct perf_event_header        header;
> >          *
> >          *      u32                             pid, tid;
> >          *      u64                             addr;
> >          *      u64                             len;
> >          *      u64                             pgoff;
> >          *      union {
> >          *              struct {
> >          *                      u32             maj;
> >          *                      u32             min;
> >          *                      u64             ino;
> >          *                      u64             ino_generation;
> >          *              };
> >          *              struct {
> >          *                      u8              build_id_size;
> >          *                      u8              __reserved_1;
> >          *                      u16             __reserved_2;
> >          *                      u8              build_id[20];
> >          *              };
> >          *      };
> >          *      u32                             prot, flags;
> >          *      char                            filename[];
> >          *      struct sample_id                sample_id;
> >          * };
> >          */
> >         PERF_RECORD_MMAP2                       = 10,
> >
> >  *   PERF_RECORD_MISC_MMAP_BUILD_ID      - PERF_RECORD_MMAP2 event
> >
> > As perf.data files can be used for many purposes we want them all, so we
> 
> ok, so because you want them all and you don't know which VMAs will be
> useful or not, it's a different problem. BPF iterators will be faster
> purely due to avoiding binary -> text -> binary conversion path, but
> other than that you'll still retrieve all VMAs.

But not using tons of syscalls to parse text data from /proc.
 
> You can still do the same full VMA iteration with this new API, of
> course, but advantages are probably smaller as you'll be retrieving a
> full set of VMAs regardless (though it would be interesting to compare
> anyways).

sure, I can't see how it would be faster, but yeah, interesting to see
what is the difference.
 
> > setup a meta data perf file descriptor to go on receiving the new mmaps
> > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> > it in parallel, etc:
> >
> > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
> >
> >  Usage: perf record [<options>] [<command>]
> >     or: perf record [<options>] -- <command> [<options>]
> >
> >         --num-thread-synthesize <n>
> >                           number of threads to run for event synthesis
> >         --synth <no|all|task|mmap|cgroup>
> >                           Fine-tune event synthesis: default=all
> >
> > ⬢[acme@toolbox perf-tools-next]$
> >
> > For this specific initial synthesis of everything the plan, as mentioned
> > about Jiri's experiments, was to use a BPF iterator to just feed the
> > perf ring buffer with those events, that way userspace would just
> > receive the usual records it gets when a new mmap is put in place, the
> > BPF iterator would just feed the preexisting mmaps, as instructed via
> > the perf_event_attr for the perf_event_open syscall.
> >
> > For people not wanting BPF, i.e. disabling it altogether in perf or
> > disabling just BPF skels, then we would fallback to the current method,
> > or to the one being discussed here when it becomes available.
> >
> > One thing to have in mind is for this iterator not to generate duplicate
> > records for non-pre-existing mmaps, i.e. we would need some generation
> > number that would be bumped when asking for such pre-existing maps
> > PERF_RECORD_MMAP2 dumps.
> 
> Looking briefly at struct vm_area_struct, it doesn't seem like the
> kernel maintains any sort of generation (at least not at
> vm_area_struct level), so this would be nice to have, I'm sure, but

Yeah, this would be something specific to the "retrieve me the list of
VMAs" bulky thing, i.e. the kernel perf code (or the BPF that would
generate the PERF_RECORD_MMAP2 records by using a BPF vma iterator)
would bump the generation number and store it to the VMA in
perf_event_mmap() so that the iterator doesn't consider it, as it is a
new mmap that is being just sent to whoever is listening, and the perf
tool that put in place the BPF program to iterate is listening.

> isn't really related to adding this API. Once the kernel does have

Well, perf wants to enumerate pre-existing mmaps _and_ after that
finishes to know about new mmaps, so we need to know a way to avoid
having the BPF program enumerating pre-existing maps sending
PERF_RECORD_MMAP2 for maps perf already knows about via a regular
PERF_RECORD_MMAP2 sent when a new mmap is put in place.

So there is an overlap where perf (or any other tool wanting to
enumerate all pre-existing maps and new ones) can receive info for the
same map from the enumerator and from the existing mechanism generating
PERF_RECORD_MMAP2 records.

- Arnaldo

> this "VMA generation" counter, it can be trivially added to this
> binary interface (which can't be said about /proc/<pid>/maps,
> unfortunately).
> 
> >
> > > It will be up to other similar projects to adopt this, but we'll
> > > definitely get this into blazesym as it is actually a problem for the
> >
> > At some point looking at plugging blazesym somehow with perf may be
> > something to consider, indeed.
> 
> In the above I meant direct use of this new API in perf code itself,
> but yes, blazesym is a generic library for symbolization that handles
> ELF/DWARF/GSYM (and I believe more formats), so it indeed might make
> sense to use it.
> 
> >
> > - Arnaldo
> >
> > > abovementioned Oculus use case. We already had to make a tradeoff (see
> > > [2], this wasn't done just because we could, but it was requested by
> > > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > > the risk of missing some shared libraries that can be loaded later. It
> > > would be great to not have to do this tradeoff, which this new API
> > > would enable.
> > >
> > >   [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> > >
> 
> [...]

^ permalink raw reply	[relevance 6%]

* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-06 18:53  6%           ` Arnaldo Carvalho de Melo
@ 2024-05-06 19:16  7%             ` Arnaldo Carvalho de Melo
  2024-05-07 21:55  7%               ` Namhyung Kim
  0 siblings, 1 reply; 200+ results
From: Arnaldo Carvalho de Melo @ 2024-05-06 19:16 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Andrii Nakryiko, Jiri Olsa, Ian Rogers, Greg KH, Andrii Nakryiko,
	linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Daniel Müller, linux-perf-use.

On Mon, May 06, 2024 at 03:53:40PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote:
> > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > > it, saving resources.
> 
> > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> 
> > > > > Where is the userspace code that uses this new api you have created?
> 
> > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > > ioctl() API to solve a common problem (as described above) in patch
> > > > #5. The plan is to put it in mentioned blazesym library at the very
> > > > least.
> > > >
> > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > > linux-perf-user), as they need to do stack symbolization as well.
>  
> > I think the general use case in perf is different.  This ioctl API is great
> > for live tracing of a single (or a small number of) process(es).  And
> > yes, perf tools have those tracing use cases too.  But I think the
> > major use case of perf tools is system-wide profiling.
>  
> > For system-wide profiling, you need to process samples of many
> > different processes at a high frequency.  Now perf record doesn't
> > process them and just save it for offline processing (well, it does
> > at the end to find out build-ID but it can be omitted).
> 
> Since:
> 
>   Author: Jiri Olsa <jolsa@kernel.org>
>   Date:   Mon Dec 14 11:54:49 2020 +0100
>   1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events")
> 
> We don't need to process the events to find the build ids. I haven't
> checked if we still do it to find out which DSOs had hits, but we
> shouldn't need to do it for build-ids (unless they were not in memory
> when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I
> haven't checked but IIRC is a possibility if that ELF part isn't in
> memory at the time we want to copy it).

> If we're still traversing it like that I guess we can have a knob and
> make it the default to not do that and instead create the perf.data
> build ID header table with all the build-ids we got from
> PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event
> processing at the end of a 'perf record' session.

But then we don't process the PERF_RECORD_MMAP2 in 'perf record', it
just goes on directly to the perf.data file :-\

Humm, perhaps the sideband thread...

- Arnaldo

^ permalink raw reply	[relevance 7%]

* Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps
  2024-05-05  5:26  6% ` Ian Rogers
@ 2024-05-06 18:58 11%   ` Andrii Nakryiko
  0 siblings, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-06 18:58 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm,
	linux-kernel, bpf, gregkh, linux-mm

On Sat, May 4, 2024 at 10:26 PM Ian Rogers <irogers@google.com> wrote:
>
> On Fri, May 3, 2024 at 5:30 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> >
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
> >
> > This patch set was based on top of next-20240503 tag in linux-next tree.
> > Not sure what should be the target tree for this, I'd appreciate any guidance,
> > thank you!
> >
> > Andrii Nakryiko (5):
> >   fs/procfs: extract logic for getting VMA name constituents
> >   fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
> >   tools: sync uapi/linux/fs.h header into tools subdir
> >   selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
> >   selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
>
> I'd love to see improvements like this for the Linux perf command.
> Some thoughts:
>
>  - Could we do something scalability wise better than a file
> descriptor per pid? If a profiler is running in a container the cost
> of many file descriptors can be significant, and something that
> increases as machines get larger. Could we have a /proc/maps for all
> processes?

It's probably not a question to me, as it seems like an entirely
different set of APIs. But it also seems a bit convoluted to mix
together information about many address spaces.

As for the cost of FDs, I haven't run into this limitation, and it
seems like the trend in Linux in general is towards "everything is a
file". Just look at pidfd, for example.

Also, having an fd that can be queried has an extra nice property. For
example, opening /proc/self/maps (i.e., process' own maps file)
doesn't require any extra permissions, and then it can be transferred
to another trusted process that would do address
resolution/symbolization. In practice right now it's unavoidable to
add extra caps/root permissions to the profiling process even if the
only thing that it needs is contents of /proc/<pid>/maps (and the use
case is as benign as symbol resolution). Not having an FD for this API
would make this use case unworkable.

>
>  - Something that is broken in perf currently is that we can race
> between reading /proc and opening events on the pids it contains. For
> example, perf top supports a uid option that first scans to find all
> processes owned by a user then tries to open an event on each process.
> This fails if the process terminates between the scan and the open
> leading to a frequent:
> ```
> $ sudo perf top -u `id -u`
> The sys_perf_event_open() syscall returned with 3 (No such process)
> for event (cycles:P).
> ```
> It would be nice for the API to consider cgroups, uids and the like as
> ways to get a subset of things to scan.

This seems like putting too much into an API, tbh. It feels like
mapping cgroups/uids to their processes is its own problem, and if we
don't have efficient APIs to do this, we should add it. But conflating
it into "get VMAs from this process" seems wrong to me.

>
>  - Somewhat related, the mmap perf events give data after the mmap
> call has happened. As VMAs get merged this can lead to mmap perf
> events looking like the memory overlaps (for jits using anonymous
> memory) and we lack munmap/mremap events.

Is this related to "VMA generation" that Arnaldo mentioned? I'd
happily add it to the new API, as it's easily extensible, if the
kernel already maintains it. If not, then it should be a separate work
to discuss whether kernel *should* track this information.

>
> Jiri Olsa has looked at improvements in this area in the past.
>
> Thanks,
> Ian
>
> >  fs/proc/task_mmu.c                            | 290 +++++++++++---
> >  include/uapi/linux/fs.h                       |  32 ++
> >  .../perf/trace/beauty/include/uapi/linux/fs.h |  32 ++
> >  tools/testing/selftests/bpf/.gitignore        |   1 +
> >  tools/testing/selftests/bpf/Makefile          |   2 +-
> >  tools/testing/selftests/bpf/procfs_query.c    | 366 ++++++++++++++++++
> >  tools/testing/selftests/bpf/test_progs.c      |   3 +
> >  tools/testing/selftests/bpf/test_progs.h      |   2 +
> >  tools/testing/selftests/bpf/trace_helpers.c   | 105 ++++-
> >  9 files changed, 763 insertions(+), 70 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/procfs_query.c
> >
> > --
> > 2.43.0
> >
> >

^ permalink raw reply	[relevance 11%]

* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-06 18:05  6%         ` Namhyung Kim
  2024-05-06 18:51  6%           ` Andrii Nakryiko
@ 2024-05-06 18:53  6%           ` Arnaldo Carvalho de Melo
  2024-05-06 19:16  7%             ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 200+ results
From: Arnaldo Carvalho de Melo @ 2024-05-06 18:53 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Andrii Nakryiko, Jiri Olsa, Ian Rogers, Greg KH, Andrii Nakryiko,
	linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Daniel Müller, linux-perf-use.

On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote:
> On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > it, saving resources.

> > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

> > > > Where is the userspace code that uses this new api you have created?

> > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > ioctl() API to solve a common problem (as described above) in patch
> > > #5. The plan is to put it in mentioned blazesym library at the very
> > > least.
> > >
> > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > linux-perf-user), as they need to do stack symbolization as well.
 
> I think the general use case in perf is different.  This ioctl API is great
> for live tracing of a single (or a small number of) process(es).  And
> yes, perf tools have those tracing use cases too.  But I think the
> major use case of perf tools is system-wide profiling.
 
> For system-wide profiling, you need to process samples of many
> different processes at a high frequency.  Now perf record doesn't
> process them and just save it for offline processing (well, it does
> at the end to find out build-ID but it can be omitted).

Since:

  Author: Jiri Olsa <jolsa@kernel.org>
  Date:   Mon Dec 14 11:54:49 2020 +0100
  1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events")

We don't need to process the events to find the build ids. I haven't
checked if we still do it to find out which DSOs had hits, but we
shouldn't need to do it for build-ids (unless they were not in memory
when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I
haven't checked but IIRC is a possibility if that ELF part isn't in
memory at the time we want to copy it).

If we're still traversing it like that, I guess we can add a knob (and
make it the default) to skip that pass and instead create the perf.data
build ID header table from all the build-ids we got via
PERF_RECORD_MMAP2: a (slightly) bigger perf.data file, but no event
processing at the end of a 'perf record' session.

> Doing it online is possible (like perf top) but it would add more
> overhead during the profiling.  And we cannot move processing

It comes in the PERF_RECORD_MMAP2, filled by the kernel.

> or symbolization to the end of profiling because some (short-
> lived) tasks can go away.

right
 
> Also it should support perf report (offline) on data from a
> different kernel or even a different machine.

right
 
> So it saves the memory map of processes and symbolizes
> the stack trace with it later.  Of course it needs to be updated
> as the memory map changes and that's why it tracks mmap
> or similar syscalls with PERF_RECORD_MMAP[2] records.
 
> A problem with this approach is to get the initial state of all
> (or a target for non-system-wide mode) existing processes.
> We call it synthesizing, and read /proc/PID/maps to generate
> the mmap records.
 
> I think the below comment from Arnaldo talked about how
> we can improve the synthesizing (which is sequential access
> to proc maps) using BPF.

Yes, I wonder how far Jiri went, Jiri?

- Arnaldo
 
> Thanks,
> Namhyung
> 
> 
> >
> > At some point, when BPF iterators became a thing we thought about, IIRC
> > Jiri did some experimentation, but I lost track, of using BPF to
> > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> > as in uapi/linux/perf_event.h:
> >
> >         /*
> >          * The MMAP2 records are an augmented version of MMAP, they add
> >          * maj, min, ino numbers to be used to uniquely identify each mapping
> >          *
> >          * struct {
> >          *      struct perf_event_header        header;
> >          *
> >          *      u32                             pid, tid;
> >          *      u64                             addr;
> >          *      u64                             len;
> >          *      u64                             pgoff;
> >          *      union {
> >          *              struct {
> >          *                      u32             maj;
> >          *                      u32             min;
> >          *                      u64             ino;
> >          *                      u64             ino_generation;
> >          *              };
> >          *              struct {
> >          *                      u8              build_id_size;
> >          *                      u8              __reserved_1;
> >          *                      u16             __reserved_2;
> >          *                      u8              build_id[20];
> >          *              };
> >          *      };
> >          *      u32                             prot, flags;
> >          *      char                            filename[];
> >          *      struct sample_id                sample_id;
> >          * };
> >          */
> >         PERF_RECORD_MMAP2                       = 10,
> >
> >  *   PERF_RECORD_MISC_MMAP_BUILD_ID      - PERF_RECORD_MMAP2 event
> >
> > As perf.data files can be used for many purposes we want them all, so we
> > setup a meta data perf file descriptor to go on receiving the new mmaps
> > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> > it in parallel, etc:
> >
> > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
> >
> >  Usage: perf record [<options>] [<command>]
> >     or: perf record [<options>] -- <command> [<options>]
> >
> >         --num-thread-synthesize <n>
> >                           number of threads to run for event synthesis
> >         --synth <no|all|task|mmap|cgroup>
> >                           Fine-tune event synthesis: default=all
> >
> > ⬢[acme@toolbox perf-tools-next]$
> >
> > For this specific initial synthesis of everything the plan, as mentioned
> > about Jiri's experiments, was to use a BPF iterator to just feed the
> > perf ring buffer with those events, that way userspace would just
> > receive the usual records it gets when a new mmap is put in place, the
> > BPF iterator would just feed the preexisting mmaps, as instructed via
> > the perf_event_attr for the perf_event_open syscall.
> >
> > For people not wanting BPF, i.e. disabling it altogether in perf or
> > disabling just BPF skels, then we would fallback to the current method,
> > or to the one being discussed here when it becomes available.
> >
> > One thing to have in mind is for this iterator not to generate duplicate
> > records for non-pre-existing mmaps, i.e. we would need some generation
> > number that would be bumped when asking for such pre-existing maps
> > PERF_RECORD_MMAP2 dumps.
> >
> > > It will be up to other similar projects to adopt this, but we'll
> > > definitely get this into blazesym as it is actually a problem for the
> >
> > At some point looking at plugging blazesym somehow with perf may be
> > something to consider, indeed.
> >
> > - Arnaldo
> >
> > > abovementioned Oculus use case. We already had to make a tradeoff (see
> > > [2], this wasn't done just because we could, but it was requested by
> > > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > > the risk of missing some shared libraries that can be loaded later. It
> > > would be great to not have to do this tradeoff, which this new API
> > > would enable.
> > >
> > >   [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> > >
> > > >
> > > > > ---
> > > > >  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
> > > > >  include/uapi/linux/fs.h |  32 ++++++++
> > > > >  2 files changed, 197 insertions(+)
> > > > >
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index 8e503a1635b7..cb7b1ff1a144 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -22,6 +22,7 @@
> > > > >  #include <linux/pkeys.h>
> > > > >  #include <linux/minmax.h>
> > > > >  #include <linux/overflow.h>
> > > > > +#include <linux/buildid.h>
> > > > >
> > > > >  #include <asm/elf.h>
> > > > >  #include <asm/tlb.h>
> > > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > >       return do_maps_open(inode, file, &proc_pid_maps_op);
> > > > >  }
> > > > >
> > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > > > > +{
> > > > > +     struct procfs_procmap_query karg;
> > > > > +     struct vma_iterator iter;
> > > > > +     struct vm_area_struct *vma;
> > > > > +     struct mm_struct *mm;
> > > > > +     const char *name = NULL;
> > > > > +     char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > > > > +     __u64 usize;
> > > > > +     int err;
> > > > > +
> > > > > +     if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > > > > +             return -EFAULT;
> > > > > +     if (usize > PAGE_SIZE)
> > > >
> > > > Nice, where did you document that?  And how is that portable given that
> > > > PAGE_SIZE can be different on different systems?
> > >
> > > I'm happy to document everything, can you please help by pointing
> > > where this documentation has to live?
> > >
> > > This is mostly fool-proofing, though, because the user has to pass
> > > sizeof(struct procfs_procmap_query), which I don't see ever getting
> > > close to even 4KB (not even saying about 64KB). This is just to
> > > prevent copy_struct_from_user() below to do too much zero-checking.
> > >
> > > >
> > > > and why aren't you checking the actual structure size instead?  You can
> > > > easily run off the end here without knowing it.
> > >
> > > See copy_struct_from_user(), it does more checks. This is a helper
> > > designed specifically to deal with use cases like this where kernel
> > > struct size can change and user space might be newer or older.
> > > copy_struct_from_user() has a nice documentation describing all these
> > > nuances.
> > >
> > > >
> > > > > +             return -E2BIG;
> > > > > +     if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > > > > +             return -EINVAL;
> > > >
> > > > Ok, so you have two checks?  How can the first one ever fail?
> > >
> > > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE
> > > won't fail, but this one will fail.
> > >
> > > The point of this check is that user has to specify at least first
> > > three fields of procfs_procmap_query (size, query_flags, and
> > > query_addr), because without those the query is meaningless.
> > > >
> > > >
> > > > > +     err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> > >
> > > and this helper does more checks validating that the user either has a
> > > shorter struct (and then zero-fills the rest of kernel-side struct) or
> > > has longer (and then the longer part has to be zero filled). Do check
> > > copy_struct_from_user() documentation, it's great.
> > >
> > > > > +     if (err)
> > > > > +             return err;
> > > > > +
> > > > > +     if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > > > > +             return -EINVAL;
> > > > > +     if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > > > > +             return -EINVAL;
> > > > > +     if (!!karg.build_id_size != !!karg.build_id_addr)
> > > > > +             return -EINVAL;
> > > >
> > > > So you want values to be set, right?
> > >
> > > Either both should be set, or neither. It's ok for both size/addr
> > > fields to be zero, in which case it indicates that the user doesn't
> > > want this part of information (which is usually a bit more expensive
> > > to get and might not be necessary for all the cases).
> > >
> > > >
> > > > > +
> > > > > +     mm = priv->mm;
> > > > > +     if (!mm || !mmget_not_zero(mm))
> > > > > +             return -ESRCH;
> > > >
> > > > What is this error for?  Where is this documentned?
> > >
> > > I copied it from existing /proc/<pid>/maps checks. I presume it's
> > > guarding the case when mm might be already put. So if the process is
> > > gone, but we have /proc/<pid>/maps file open?
> > >
> > > >
> > > > > +     if (mmap_read_lock_killable(mm)) {
> > > > > +             mmput(mm);
> > > > > +             return -EINTR;
> > > > > +     }
> > > > > +
> > > > > +     vma_iter_init(&iter, mm, karg.query_addr);
> > > > > +     vma = vma_next(&iter);
> > > > > +     if (!vma) {
> > > > > +             err = -ENOENT;
> > > > > +             goto out;
> > > > > +     }
> > > > > +     /* user wants covering VMA, not the closest next one */
> > > > > +     if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > > > > +         vma->vm_start > karg.query_addr) {
> > > > > +             err = -ENOENT;
> > > > > +             goto out;
> > > > > +     }
> > > > > +
> > > > > +     karg.vma_start = vma->vm_start;
> > > > > +     karg.vma_end = vma->vm_end;
> > > > > +
> > > > > +     if (vma->vm_file) {
> > > > > +             const struct inode *inode = file_user_inode(vma->vm_file);
> > > > > +
> > > > > +             karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > > > > +             karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > > > > +             karg.dev_minor = MINOR(inode->i_sb->s_dev);
> > > >
> > > > So the major/minor is that of the file superblock?  Why?
> > >
> > > Because inode number is unique only within given super block (and even
> > > then it's more complicated, e.g., btrfs subvolumes add more headaches,
> > > I believe). inode + dev maj/min is sometimes used for cache/reuse of
> > > per-binary information (e.g., pre-processed DWARF information, which
> > > is *very* expensive, so anything that allows to avoid doing this is
> > > helpful).
> > >
> > > >
> > > > > +             karg.inode = inode->i_ino;
> > > >
> > > > What is userspace going to do with this?
> > > >
> > >
> > > See above.
> > >
> > > > > +     } else {
> > > > > +             karg.vma_offset = 0;
> > > > > +             karg.dev_major = 0;
> > > > > +             karg.dev_minor = 0;
> > > > > +             karg.inode = 0;
> > > >
> > > > Why not set everything to 0 up above at the beginning so you never miss
> > > > anything, and you don't miss any holes accidentally in the future.
> > > >
> > >
> > > Stylistic preference, I find this more explicit, but I don't care much
> > > one way or another.
> > >
> > > > > +     }
> > > > > +
> > > > > +     karg.vma_flags = 0;
> > > > > +     if (vma->vm_flags & VM_READ)
> > > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> > > > > +     if (vma->vm_flags & VM_WRITE)
> > > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> > > > > +     if (vma->vm_flags & VM_EXEC)
> > > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> > > > > +     if (vma->vm_flags & VM_MAYSHARE)
> > > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> > > > > +
> > >
> > > [...]
> > >
> > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > > index 45e4e64fd664..fe8924a8d916 100644
> > > > > --- a/include/uapi/linux/fs.h
> > > > > +++ b/include/uapi/linux/fs.h
> > > > > @@ -393,4 +393,36 @@ struct pm_scan_arg {
> > > > >       __u64 return_mask;
> > > > >  };
> > > > >
> > > > > +/* /proc/<pid>/maps ioctl */
> > > > > +#define PROCFS_IOCTL_MAGIC 0x9f
> > > >
> > > > Don't you need to document this in the proper place?
> > >
> > > I probably do, but I'm asking for help in knowing where. procfs is not
> > > a typical area of kernel I'm working with, so any pointers are highly
> > > appreciated.
> > >
> > > >
> > > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> > > > > +
> > > > > +enum procmap_query_flags {
> > > > > +     PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> > > > > +};
> > > > > +
> > > > > +enum procmap_vma_flags {
> > > > > +     PROCFS_PROCMAP_VMA_READABLE = 0x01,
> > > > > +     PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> > > > > +     PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> > > > > +     PROCFS_PROCMAP_VMA_SHARED = 0x08,
> > > >
> > > > Are these bits?  If so, please use the bit macro for it to make it
> > > > obvious.
> > > >
> > >
> > > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to
> > > add any extra #includes to this UAPI header, but I can figure out the
> > > necessary dependency and do BIT(), I just didn't feel like BIT() adds
> > > much here, tbh.
> > >
> > > > > +};
> > > > > +
> > > > > +struct procfs_procmap_query {
> > > > > +     __u64 size;
> > > > > +     __u64 query_flags;              /* in */
> > > >
> > > > Does this map to the procmap_vma_flags enum?  if so, please say so.
> > >
> > > no, procmap_query_flags, and yes, I will
> > >
> > > >
> > > > > +     __u64 query_addr;               /* in */
> > > > > +     __u64 vma_start;                /* out */
> > > > > +     __u64 vma_end;                  /* out */
> > > > > +     __u64 vma_flags;                /* out */
> > > > > +     __u64 vma_offset;               /* out */
> > > > > +     __u64 inode;                    /* out */
> > > >
> > > > What is the inode for, you have an inode for the file already, why give
> > > > it another one?
> > >
> > > This is inode of vma's backing file, same as /proc/<pid>/maps' file
> > > column. What inode of file do I already have here? You mean of
> > > /proc/<pid>/maps itself? It's useless for the intended purposes.
> > >
> > > >
> > > > > +     __u32 dev_major;                /* out */
> > > > > +     __u32 dev_minor;                /* out */
> > > >
> > > > What is major/minor for?
> > >
> > > This is the same information as emitted by /proc/<pid>/maps,
> > > identifies superblock of vma's backing file. As I mentioned above, it
> > > can be used for caching per-file (i.e., per-ELF binary) information
> > > (for example).
> > >
> > > >
> > > > > +     __u32 vma_name_size;            /* in/out */
> > > > > +     __u32 build_id_size;            /* in/out */
> > > > > +     __u64 vma_name_addr;            /* in */
> > > > > +     __u64 build_id_addr;            /* in */
> > > >
> > > > Why not document this all using kerneldoc above the structure?
> > >
> > > Yes, sorry, I slacked a bit on adding this upfront. I knew we'll be
> > > figuring out the best place and approach, and so wanted to avoid
> > > documentation churn.
> > >
> > > Would something like what we have for pm_scan_arg and pagemap APIs
> > > work? I see it added a few simple descriptions for pm_scan_arg struct,
> > > and there is Documentation/admin-guide/mm/pagemap.rst. Should I add
> > > Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off,
> > > though)? Anyways, I'm hoping for pointers where all this should be
> > > documented. Thank you!
> > >
> > > >
> > > > anyway, I don't like ioctls, but there is a place for them, you just
> > > > have to actually justify the use for them and not say "not efficient
> > > > enough" as that normally isn't an issue overall.
> > >
> > > I've written a demo tool in patch #5 which performs real-world task:
> > > mapping addresses to their VMAs (specifically calculating file offset,
> > > finding vma_start + vma_end range to further access files from
> > > /proc/<pid>/map_files/<start>-<end>). I did the implementation
> > > faithfully, doing it in the most optimal way for both APIs. I showed
> > > that for "typical" (it's hard to specify what typical is, of course,
> > > too many variables) scenario (it was data collected on a real server
> > > running real service, 30 seconds of process-specific stack traces were
> > > captured, if I remember correctly). I showed that doing exactly the
> > > same amount of work is ~35x times slower with /proc/<pid>/maps.
> > >
> > > Take another process, another set of addresses, another anything, and
> > > the numbers will be different, but I think it gives the right idea.
> > >
> > > But I think we are overpivoting on text vs binary distinction here.
> > > It's the more targeted querying of VMAs that's beneficial here. This
> > > allows applications to not cache anything and just re-query when doing
> > > periodic or continuous profiling (where addresses are coming in not as
> > > one batch, as a sequence of batches extended in time).
> > >
> > > /proc/<pid>/maps, for all its usefulness, just can't provide this sort
> > > of ability, as it wasn't designed to do that and is targeting
> > > different use cases.
> > >
> > > And then, a new ability to request reliable (it's not 100% reliable
> > > today, I'm going to address that as a follow up) build ID is *crucial*
> > > for some scenarios. The mentioned Oculus use case, the need to fully
> > > access underlying ELF binary just to get build ID is frowned upon. And
> > > for a good reason. Profiler only needs build ID, which is no secret
> > > and not sensitive information. This new (and binary, yes) API allows
> > > to add this into an API without breaking any backwards compatibility.
> > >
> > > >
> > > > thanks,
> > > >
> > > > greg k-h
> >


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-06 18:05  6%         ` Namhyung Kim
@ 2024-05-06 18:51  6%           ` Andrii Nakryiko
  2024-05-06 18:53  6%           ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-06 18:51 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Ian Rogers, Greg KH,
	Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm,
	linux-kernel, bpf, linux-mm, Daniel Müller, linux-perf-use.

On Mon, May 6, 2024 at 11:05 AM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> >
> > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > it, saving resources.
> >
> > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >
> > > > Where is the userspace code that uses this new api you have created?
> >
> > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > ioctl() API to solve a common problem (as described above) in patch
> > > #5. The plan is to put it in mentioned blazesym library at the very
> > > least.
> > >
> > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > linux-perf-user), as they need to do stack symbolization as well.
>
> I think the general use case in perf is different.  This ioctl API is great
> for live tracing of a single (or a small number of) process(es).  And
> yes, perf tools have those tracing use cases too.  But I think the
> major use case of perf tools is system-wide profiling.

The intended use case is also system-wide profiling, but I haven't
heard that opening a file per process is a big bottleneck or
limitation, tbh.

>
> For system-wide profiling, you need to process samples of many
> different processes at a high frequency.  Now perf record doesn't
> process them and just save it for offline processing (well, it does
> at the end to find out build-ID but it can be omitted).
>
> Doing it online is possible (like perf top) but it would add more
> overhead during the profiling.  And we cannot move processing
> or symbolization to the end of profiling because some (short-
> lived) tasks can go away.

We do have some setups where we install a BPF program that monitors
process exit and mmap() events and emits (proactively) VMA
information. It's not applicable everywhere, and in some setups (like
the Oculus case) we just accept that short-lived processes will be missed
at the expense of less interruption, simpler and less privileged
"agents" doing profiling and address resolution logic.

So the problem space, as can be seen, is pretty vast and varied, and
there is no single API that would serve all the needs perfectly.

>
> Also it should support perf report (offline) on data from a
> different kernel or even a different machine.

We fetch build ID (and resolve file offset) and offload actual
symbolization to a dedicated fleet of servers, whenever possible. We
don't yet do it for kernel stack traces, but we are moving in this
direction (and /proc/kallsyms has problems of its own, being
text-based, listing everything, and pretty big in itself; but
that's a separate topic).
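
The "resolve file offset" step mentioned above is plain arithmetic over
fields that both the text interface and the proposed ioctl report
(vma_start, vma_end, vma_offset). A minimal sketch; the helper name is
ours, not something from the patch set:

```c
#include <stdint.h>

/* Translate a virtual address into an offset inside the VMA's backing
 * file, given the fields both /proc/<pid>/maps and the proposed
 * PROCMAP_QUERY ioctl report for a matched VMA. */
int addr_to_file_off(uint64_t addr, uint64_t vma_start, uint64_t vma_end,
		     uint64_t vma_offset, uint64_t *file_off)
{
	if (addr < vma_start || addr >= vma_end)
		return -1; /* address is not covered by this VMA */
	*file_off = addr - vma_start + vma_offset;
	return 0;
}
```

The resulting (build ID, file offset) pair is what gets shipped to the
symbolization servers, which then never need access to the target's
address space.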

>
> So it saves the memory map of processes and symbolizes
> the stack trace with it later.  Of course it needs to be updated
> as the memory map changes and that's why it tracks mmap
> or similar syscalls with PERF_RECORD_MMAP[2] records.
>
> A problem with this approach is to get the initial state of all
> (or a target for non-system-wide mode) existing processes.
> We call it synthesizing, and read /proc/PID/maps to generate
> the mmap records.
>
> I think the below comment from Arnaldo talked about how
> we can improve the synthesizing (which is sequential access
> to proc maps) using BPF.

Yep. We can also benchmark using this new ioctl() to fetch a full set
of VMAs, it might still be good enough.

>
> Thanks,
> Namhyung
>

[...]


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-06 13:58  6%       ` Arnaldo Carvalho de Melo
  2024-05-06 18:05  6%         ` Namhyung Kim
@ 2024-05-06 18:41  6%         ` Andrii Nakryiko
  2024-05-06 20:35  6%           ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 200+ results
From: Andrii Nakryiko @ 2024-05-06 18:41 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Ian Rogers, Greg KH, Andrii Nakryiko, linux-fsdevel,
	brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Daniel Müller, linux-perf-use.

On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > it, saving resources.
>
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>
> > > Where is the userspace code that uses this new api you have created?
>
> > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > ioctl() API to solve a common problem (as described above) in patch
> > #5. The plan is to put it in mentioned blazesym library at the very
> > least.
> >
> > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > linux-perf-user), as they need to do stack symbolization as well.
>
> At some point, when BPF iterators became a thing we thought about, IIRC
> Jiri did some experimentation, but I lost track, of using BPF to
> synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> as in uapi/linux/perf_event.h:
>
>         /*
>          * The MMAP2 records are an augmented version of MMAP, they add
>          * maj, min, ino numbers to be used to uniquely identify each mapping
>          *
>          * struct {
>          *      struct perf_event_header        header;
>          *
>          *      u32                             pid, tid;
>          *      u64                             addr;
>          *      u64                             len;
>          *      u64                             pgoff;
>          *      union {
>          *              struct {
>          *                      u32             maj;
>          *                      u32             min;
>          *                      u64             ino;
>          *                      u64             ino_generation;
>          *              };
>          *              struct {
>          *                      u8              build_id_size;
>          *                      u8              __reserved_1;
>          *                      u16             __reserved_2;
>          *                      u8              build_id[20];
>          *              };
>          *      };
>          *      u32                             prot, flags;
>          *      char                            filename[];
>          *      struct sample_id                sample_id;
>          * };
>          */
>         PERF_RECORD_MMAP2                       = 10,
>
>  *   PERF_RECORD_MISC_MMAP_BUILD_ID      - PERF_RECORD_MMAP2 event
>
> As perf.data files can be used for many purposes we want them all, so we

ok, so because you want them all and you don't know which VMAs will be
useful or not, it's a different problem. BPF iterators will be faster
purely due to avoiding the binary -> text -> binary conversion path, but
other than that you'll still retrieve all VMAs.

You can still do the same full VMA iteration with this new API, of
course, but advantages are probably smaller as you'll be retrieving a
full set of VMAs regardless (though it would be interesting to compare
anyways).
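
Such a full scan with the new ioctl() could look like the sketch below.
The struct and constants are copied from the uapi/linux/fs.h hunk quoted
earlier in the thread, redefined locally here since the header is not
merged anywhere yet; error handling is intentionally minimal:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* From the proposed uapi/linux/fs.h hunk (not merged yet). */
#define PROCFS_IOCTL_MAGIC 0x9f
#define PROCFS_PROCMAP_EXACT_OR_NEXT_VMA 0x01

struct procfs_procmap_query {
	uint64_t size;
	uint64_t query_flags;   /* in */
	uint64_t query_addr;    /* in */
	uint64_t vma_start;     /* out */
	uint64_t vma_end;       /* out */
	uint64_t vma_flags;     /* out */
	uint64_t vma_offset;    /* out */
	uint64_t inode;         /* out */
	uint32_t dev_major;     /* out */
	uint32_t dev_minor;     /* out */
	uint32_t vma_name_size; /* in/out */
	uint32_t build_id_size; /* in/out */
	uint64_t vma_name_addr; /* in */
	uint64_t build_id_addr; /* in */
};

#define PROCFS_PROCMAP_QUERY \
	_IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)

/* Walk all VMAs of the process whose /proc/<pid>/maps is open as
 * maps_fd: ask for the VMA covering query_addr or the next one after
 * it, then restart the query just past the returned range. */
int dump_vmas(int maps_fd)
{
	struct procfs_procmap_query q;
	uint64_t addr = 0;
	int n = 0;

	for (;;) {
		memset(&q, 0, sizeof(q));
		q.size = sizeof(q);
		q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
		q.query_addr = addr;
		if (ioctl(maps_fd, PROCFS_PROCMAP_QUERY, &q) < 0)
			break; /* errno == ENOENT: no more VMAs
				* (or a kernel without the patch) */
		printf("%016llx-%016llx off %llx\n",
		       (unsigned long long)q.vma_start,
		       (unsigned long long)q.vma_end,
		       (unsigned long long)q.vma_offset);
		addr = q.vma_end; /* continue after this VMA */
		n++;
	}
	return n;
}
```

One ioctl() per VMA, and the name/build-id buffers are left unset
(size 0), so the kernel skips retrieving them, per the patch
description.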

> setup a meta data perf file descriptor to go on receiving the new mmaps
> while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> it in parallel, etc:
>
> ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
>
>  Usage: perf record [<options>] [<command>]
>     or: perf record [<options>] -- <command> [<options>]
>
>         --num-thread-synthesize <n>
>                           number of threads to run for event synthesis
>         --synth <no|all|task|mmap|cgroup>
>                           Fine-tune event synthesis: default=all
>
> ⬢[acme@toolbox perf-tools-next]$
>
> For this specific initial synthesis of everything the plan, as mentioned
> about Jiri's experiments, was to use a BPF iterator to just feed the
> perf ring buffer with those events, that way userspace would just
> receive the usual records it gets when a new mmap is put in place, the
> BPF iterator would just feed the preexisting mmaps, as instructed via
> the perf_event_attr for the perf_event_open syscall.
>
> For people not wanting BPF, i.e. disabling it altogether in perf or
> disabling just BPF skels, then we would fallback to the current method,
> or to the one being discussed here when it becomes available.
>
> One thing to have in mind is for this iterator not to generate duplicate
> records for non-pre-existing mmaps, i.e. we would need some generation
> number that would be bumped when asking for such pre-existing maps
> PERF_RECORD_MMAP2 dumps.

Looking briefly at struct vm_area_struct, it doesn't seem like the
kernel maintains any sort of generation (at least not at
vm_area_struct level), so this would be nice to have, I'm sure, but
isn't really related to adding this API. Once the kernel does have
this "VMA generation" counter, it can be trivially added to this
binary interface (which can't be said about /proc/<pid>/maps,
unfortunately).
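
The extensibility claimed here rests on the leading size field plus the
copy_struct_from_user() semantics discussed earlier in the thread. A
userspace-side model of those rules (a sketch of the documented
behavior, not the kernel code; the error constant is defined locally):

```c
#include <stddef.h>
#include <string.h>

#define E2BIG_ERR 7 /* stand-in for the kernel's E2BIG */

/* Model of copy_struct_from_user(): a shorter user struct gets its
 * tail zero-filled on the kernel side (old userspace sees new fields
 * as 0), a longer one is accepted only if the bytes the kernel does
 * not know about are all zero (new userspace must not set fields an
 * old kernel would silently ignore). */
int model_copy_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t size = ksize < usize ? ksize : usize;

	if (usize > ksize) {
		const unsigned char *rest = (const unsigned char *)src + ksize;
		size_t i;

		for (i = 0; i < usize - ksize; i++)
			if (rest[i] != 0)
				return -E2BIG_ERR; /* unknown non-zero fields */
	}
	memcpy(dst, src, size);
	if (size < ksize)
		memset((unsigned char *)dst + size, 0, ksize - size);
	return 0;
}
```

This is why a new field like a generation counter can be appended to
struct procfs_procmap_query without breaking either older or newer
userspace.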

>
> > It will be up to other similar projects to adopt this, but we'll
> > definitely get this into blazesym as it is actually a problem for the
>
> At some point looking at plugging blazesym somehow with perf may be
> something to consider, indeed.

In the above I meant direct use of this new API in perf code itself,
but yes, blazesym is a generic library for symbolization that handles
ELF/DWARF/GSYM (and I believe more formats), so it indeed might make
sense to use it.

>
> - Arnaldo
>
> > abovementioned Oculus use case. We already had to make a tradeoff (see
> > [2], this wasn't done just because we could, but it was requested by
> > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > the risk of missing some shared libraries that can be loaded later. It
> > would be great to not have to do this tradeoff, which this new API
> > would enable.
> >
> >   [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> >

[...]
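[Editor's note: since the whole thread is about the shape of this binary
interface, here is a userspace-side sketch of the queried struct as it
appears in the quoted v2 patch. The PROCFS_* names come from this
patchset only; they are not in any released uapi header, and later
revisions may rename them, so treat this as illustrative:]

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Mirror of the struct from the quoted patch (v2 naming); the
 * authoritative definition lives in include/uapi/linux/fs.h of the
 * patched kernel tree. */
struct procfs_procmap_query {
	uint64_t size;           /* in: sizeof(struct procfs_procmap_query) */
	uint64_t query_flags;    /* in: enum procmap_query_flags */
	uint64_t query_addr;     /* in: address to look up */
	uint64_t vma_start;      /* out */
	uint64_t vma_end;        /* out */
	uint64_t vma_flags;      /* out: enum procmap_vma_flags bits */
	uint64_t vma_offset;     /* out: file offset of vma_start */
	uint64_t inode;          /* out: inode of the backing file */
	uint32_t dev_major;      /* out: dev of the backing file's superblock */
	uint32_t dev_minor;      /* out */
	uint32_t vma_name_size;  /* in/out: 0 means "don't fetch the name" */
	uint32_t build_id_size;  /* in/out: 0 means "don't fetch build ID" */
	uint64_t vma_name_addr;  /* in: user buffer, or 0 */
	uint64_t build_id_addr;  /* in: user buffer, or 0 */
};

#define PROCFS_IOCTL_MAGIC 0x9f
#define PROCFS_PROCMAP_QUERY \
	_IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)

/* The minimum size the kernel accepts: through the end of query_addr,
 * i.e. offsetofend(struct procfs_procmap_query, query_addr). */
static inline size_t procmap_query_min_size(void)
{
	return offsetof(struct procfs_procmap_query, query_addr) +
	       sizeof(uint64_t);
}
```

[A caller would fill size/query_flags/query_addr, point vma_name_addr at
a buffer (or leave both name fields 0), and issue
ioctl(maps_fd, PROCFS_PROCMAP_QUERY, &q) on an open /proc/<pid>/maps fd.]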

^ permalink raw reply	[relevance 6%]

* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-06 13:58  6%       ` Arnaldo Carvalho de Melo
@ 2024-05-06 18:05  6%         ` Namhyung Kim
  2024-05-06 18:51  6%           ` Andrii Nakryiko
  2024-05-06 18:53  6%           ` Arnaldo Carvalho de Melo
  2024-05-06 18:41  6%         ` Andrii Nakryiko
  1 sibling, 2 replies; 200+ results
From: Namhyung Kim @ 2024-05-06 18:05 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Andrii Nakryiko, Jiri Olsa, Ian Rogers, Greg KH, Andrii Nakryiko,
	linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Daniel Müller, linux-perf-use.

Hello,

On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > it, saving resources.
>
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>
> > > Where is the userspace code that uses this new api you have created?
>
> > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > ioctl() API to solve a common problem (as described above) in patch
> > #5. The plan is to put it in mentioned blazesym library at the very
> > least.
> >
> > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > linux-perf-user), as they need to do stack symbolization as well.

I think the general use case in perf is different.  This ioctl API is great
for live tracing of a single (or a small number of) process(es).  And
yes, perf tools have those tracing use cases too.  But I think the
major use case of perf tools is system-wide profiling.

For system-wide profiling, you need to process samples of many
different processes at a high frequency.  Now perf record doesn't
process them and just saves them for offline processing (well, it
does process them at the end to find out build-IDs, but that can be
omitted).

Doing it online is possible (like perf top) but it would add more
overhead during the profiling.  And we cannot move processing
or symbolization to the end of profiling because some (short-
lived) tasks can go away.

Also it should support perf report (offline) on data from a
different kernel or even a different machine.

So it saves the memory map of processes and symbolizes
the stack trace with it later.  Of course it needs to be updated
as the memory map changes and that's why it tracks mmap
or similar syscalls with PERF_RECORD_MMAP[2] records.

A problem with this approach is getting the initial state of all
existing processes (or of the target processes in non-system-wide
mode).  We call this synthesizing: perf reads /proc/PID/maps to
generate the mmap records.
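[Editor's note: for contrast with the binary API, the per-line work that
text-based synthesizing implies can be sketched like this. This is an
illustrative parser, not perf's actual code:]

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct maps_entry {
	uint64_t start, end, offset, ino;
	uint32_t dev_major, dev_minor;
	char perms[5];
	char path[4096];
};

/* Parse one /proc/<pid>/maps line, e.g.
 *   7f1234560000-7f1234570000 r-xp 00010000 08:01 131078  /usr/lib/libc.so.6
 * Returns 0 on success, -1 on malformed input. */
static int parse_maps_line(const char *line, struct maps_entry *e)
{
	unsigned long long start, end, offset, ino;
	unsigned int maj, min;
	int path_off = 0;

	if (sscanf(line, "%llx-%llx %4s %llx %x:%x %llu %n",
		   &start, &end, e->perms, &offset, &maj, &min, &ino,
		   &path_off) < 7)
		return -1;
	e->start = start;
	e->end = end;
	e->offset = offset;
	e->ino = ino;
	e->dev_major = maj;
	e->dev_minor = min;
	/* The rest of the line (possibly empty for anonymous maps) is the
	 * backing file path or a special name like [stack]. */
	snprintf(e->path, sizeof(e->path), "%s", line + path_off);
	e->path[strcspn(e->path, "\n")] = '\0';
	return 0;
}
```

[Every consumer of the text interface pays this parsing cost once per
VMA per snapshot, which is part of what the benchmark numbers in this
thread are measuring.]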

I think the below comment from Arnaldo talked about how
we can improve the synthesizing (which is sequential access
to proc maps) using BPF.

Thanks,
Namhyung


>
> At some point, when BPF iterators became a thing we thought about, IIRC
> Jiri did some experimentation, but I lost track, of using BPF to
> synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> as in uapi/linux/perf_event.h:
>
>         /*
>          * The MMAP2 records are an augmented version of MMAP, they add
>          * maj, min, ino numbers to be used to uniquely identify each mapping
>          *
>          * struct {
>          *      struct perf_event_header        header;
>          *
>          *      u32                             pid, tid;
>          *      u64                             addr;
>          *      u64                             len;
>          *      u64                             pgoff;
>          *      union {
>          *              struct {
>          *                      u32             maj;
>          *                      u32             min;
>          *                      u64             ino;
>          *                      u64             ino_generation;
>          *              };
>          *              struct {
>          *                      u8              build_id_size;
>          *                      u8              __reserved_1;
>          *                      u16             __reserved_2;
>          *                      u8              build_id[20];
>          *              };
>          *      };
>          *      u32                             prot, flags;
>          *      char                            filename[];
>          *      struct sample_id                sample_id;
>          * };
>          */
>         PERF_RECORD_MMAP2                       = 10,
>
>  *   PERF_RECORD_MISC_MMAP_BUILD_ID      - PERF_RECORD_MMAP2 event
>
> As perf.data files can be used for many purposes we want them all, so we
> setup a meta data perf file descriptor to go on receiving the new mmaps
> while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> it in parallel, etc:
>
> ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
>
>  Usage: perf record [<options>] [<command>]
>     or: perf record [<options>] -- <command> [<options>]
>
>         --num-thread-synthesize <n>
>                           number of threads to run for event synthesis
>         --synth <no|all|task|mmap|cgroup>
>                           Fine-tune event synthesis: default=all
>
> ⬢[acme@toolbox perf-tools-next]$
>
> For this specific initial synthesis of everything the plan, as mentioned
> about Jiri's experiments, was to use a BPF iterator to just feed the
> perf ring buffer with those events, that way userspace would just
> receive the usual records it gets when a new mmap is put in place, the
> BPF iterator would just feed the preexisting mmaps, as instructed via
> the perf_event_attr for the perf_event_open syscall.
>
> For people not wanting BPF, i.e. disabling it altogether in perf or
> disabling just BPF skels, then we would fallback to the current method,
> or to the one being discussed here when it becomes available.
>
> One thing to have in mind is for this iterator not to generate duplicate
> records for non-pre-existing mmaps, i.e. we would need some generation
> number that would be bumped when asking for such pre-existing maps
> PERF_RECORD_MMAP2 dumps.
>
> > It will be up to other similar projects to adopt this, but we'll
> > definitely get this into blazesym as it is actually a problem for the
>
> At some point looking at plugging blazesym somehow with perf may be
> something to consider, indeed.
>
> - Arnaldo
>
> > abovementioned Oculus use case. We already had to make a tradeoff (see
> > [2], this wasn't done just because we could, but it was requested by
> > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > the risk of missing some shared libraries that can be loaded later. It
> > would be great to not have to do this tradeoff, which this new API
> > would enable.
> >
> >   [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> >
> > >
> > > > ---
> > > >  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/fs.h |  32 ++++++++
> > > >  2 files changed, 197 insertions(+)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 8e503a1635b7..cb7b1ff1a144 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -22,6 +22,7 @@
> > > >  #include <linux/pkeys.h>
> > > >  #include <linux/minmax.h>
> > > >  #include <linux/overflow.h>
> > > > +#include <linux/buildid.h>
> > > >
> > > >  #include <asm/elf.h>
> > > >  #include <asm/tlb.h>
> > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > >       return do_maps_open(inode, file, &proc_pid_maps_op);
> > > >  }
> > > >
> > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > > > +{
> > > > +     struct procfs_procmap_query karg;
> > > > +     struct vma_iterator iter;
> > > > +     struct vm_area_struct *vma;
> > > > +     struct mm_struct *mm;
> > > > +     const char *name = NULL;
> > > > +     char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > > > +     __u64 usize;
> > > > +     int err;
> > > > +
> > > > +     if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > > > +             return -EFAULT;
> > > > +     if (usize > PAGE_SIZE)
> > >
> > > Nice, where did you document that?  And how is that portable given that
> > > PAGE_SIZE can be different on different systems?
> >
> > I'm happy to document everything, can you please help by pointing
> > where this documentation has to live?
> >
> > This is mostly fool-proofing, though, because the user has to pass
> > sizeof(struct procfs_procmap_query), which I don't see ever getting
> > close to even 4KB (not even saying about 64KB). This is just to
> > prevent copy_struct_from_user() below to do too much zero-checking.
> >
> > >
> > > and why aren't you checking the actual structure size instead?  You can
> > > easily run off the end here without knowing it.
> >
> > See copy_struct_from_user(), it does more checks. This is a helper
> > designed specifically to deal with use cases like this where kernel
> > struct size can change and user space might be newer or older.
> > copy_struct_from_user() has a nice documentation describing all these
> > nuances.
> >
> > >
> > > > +             return -E2BIG;
> > > > +     if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > > > +             return -EINVAL;
> > >
> > > Ok, so you have two checks?  How can the first one ever fail?
> >
> > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE
> > won't fail, but this one will fail.
> >
> > The point of this check is that user has to specify at least first
> > three fields of procfs_procmap_query (size, query_flags, and
> > query_addr), because without those the query is meaningless.
> > >
> > >
> > > > +     err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> >
> > and this helper does more checks validating that the user either has a
> > shorter struct (and then zero-fills the rest of kernel-side struct) or
> > has longer (and then the longer part has to be zero filled). Do check
> > copy_struct_from_user() documentation, it's great.
> >
> > > > +     if (err)
> > > > +             return err;
> > > > +
> > > > +     if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > > > +             return -EINVAL;
> > > > +     if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > > > +             return -EINVAL;
> > > > +     if (!!karg.build_id_size != !!karg.build_id_addr)
> > > > +             return -EINVAL;
> > >
> > > So you want values to be set, right?
> >
> > Either both should be set, or neither. It's ok for both size/addr
> > fields to be zero, in which case it indicates that the user doesn't
> > want this part of information (which is usually a bit more expensive
> > to get and might not be necessary for all the cases).
> >
> > >
> > > > +
> > > > +     mm = priv->mm;
> > > > +     if (!mm || !mmget_not_zero(mm))
> > > > +             return -ESRCH;
> > >
> > > What is this error for?  Where is this documentned?
> >
> > I copied it from existing /proc/<pid>/maps checks. I presume it's
> > guarding the case when mm might be already put. So if the process is
> > gone, but we have /proc/<pid>/maps file open?
> >
> > >
> > > > +     if (mmap_read_lock_killable(mm)) {
> > > > +             mmput(mm);
> > > > +             return -EINTR;
> > > > +     }
> > > > +
> > > > +     vma_iter_init(&iter, mm, karg.query_addr);
> > > > +     vma = vma_next(&iter);
> > > > +     if (!vma) {
> > > > +             err = -ENOENT;
> > > > +             goto out;
> > > > +     }
> > > > +     /* user wants covering VMA, not the closest next one */
> > > > +     if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > > > +         vma->vm_start > karg.query_addr) {
> > > > +             err = -ENOENT;
> > > > +             goto out;
> > > > +     }
> > > > +
> > > > +     karg.vma_start = vma->vm_start;
> > > > +     karg.vma_end = vma->vm_end;
> > > > +
> > > > +     if (vma->vm_file) {
> > > > +             const struct inode *inode = file_user_inode(vma->vm_file);
> > > > +
> > > > +             karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > > > +             karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > > > +             karg.dev_minor = MINOR(inode->i_sb->s_dev);
> > >
> > > So the major/minor is that of the file superblock?  Why?
> >
> > Because inode number is unique only within given super block (and even
> > then it's more complicated, e.g., btrfs subvolumes add more headaches,
> > I believe). inode + dev maj/min is sometimes used for cache/reuse of
> > per-binary information (e.g., pre-processed DWARF information, which
> > is *very* expensive, so anything that allows to avoid doing this is
> > helpful).
> >
> > >
> > > > +             karg.inode = inode->i_ino;
> > >
> > > What is userspace going to do with this?
> > >
> >
> > See above.
> >
> > > > +     } else {
> > > > +             karg.vma_offset = 0;
> > > > +             karg.dev_major = 0;
> > > > +             karg.dev_minor = 0;
> > > > +             karg.inode = 0;
> > >
> > > Why not set everything to 0 up above at the beginning so you never miss
> > > anything, and you don't miss any holes accidentally in the future.
> > >
> >
> > Stylistic preference, I find this more explicit, but I don't care much
> > one way or another.
> >
> > > > +     }
> > > > +
> > > > +     karg.vma_flags = 0;
> > > > +     if (vma->vm_flags & VM_READ)
> > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> > > > +     if (vma->vm_flags & VM_WRITE)
> > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> > > > +     if (vma->vm_flags & VM_EXEC)
> > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> > > > +     if (vma->vm_flags & VM_MAYSHARE)
> > > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> > > > +
> >
> > [...]
> >
> > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > index 45e4e64fd664..fe8924a8d916 100644
> > > > --- a/include/uapi/linux/fs.h
> > > > +++ b/include/uapi/linux/fs.h
> > > > @@ -393,4 +393,36 @@ struct pm_scan_arg {
> > > >       __u64 return_mask;
> > > >  };
> > > >
> > > > +/* /proc/<pid>/maps ioctl */
> > > > +#define PROCFS_IOCTL_MAGIC 0x9f
> > >
> > > Don't you need to document this in the proper place?
> >
> > I probably do, but I'm asking for help in knowing where. procfs is not
> > a typical area of kernel I'm working with, so any pointers are highly
> > appreciated.
> >
> > >
> > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> > > > +
> > > > +enum procmap_query_flags {
> > > > +     PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> > > > +};
> > > > +
> > > > +enum procmap_vma_flags {
> > > > +     PROCFS_PROCMAP_VMA_READABLE = 0x01,
> > > > +     PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> > > > +     PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> > > > +     PROCFS_PROCMAP_VMA_SHARED = 0x08,
> > >
> > > Are these bits?  If so, please use the bit macro for it to make it
> > > obvious.
> > >
> >
> > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to
> > add any extra #includes to this UAPI header, but I can figure out the
> > necessary dependency and do BIT(), I just didn't feel like BIT() adds
> > much here, tbh.
> >
> > > > +};
> > > > +
> > > > +struct procfs_procmap_query {
> > > > +     __u64 size;
> > > > +     __u64 query_flags;              /* in */
> > >
> > > Does this map to the procmap_vma_flags enum?  if so, please say so.
> >
> > no, procmap_query_flags, and yes, I will
> >
> > >
> > > > +     __u64 query_addr;               /* in */
> > > > +     __u64 vma_start;                /* out */
> > > > +     __u64 vma_end;                  /* out */
> > > > +     __u64 vma_flags;                /* out */
> > > > +     __u64 vma_offset;               /* out */
> > > > +     __u64 inode;                    /* out */
> > >
> > > What is the inode for, you have an inode for the file already, why give
> > > it another one?
> >
> > This is inode of vma's backing file, same as /proc/<pid>/maps' file
> > column. What inode of file do I already have here? You mean of
> > /proc/<pid>/maps itself? It's useless for the intended purposes.
> >
> > >
> > > > +     __u32 dev_major;                /* out */
> > > > +     __u32 dev_minor;                /* out */
> > >
> > > What is major/minor for?
> >
> > This is the same information as emitted by /proc/<pid>/maps,
> > identifies superblock of vma's backing file. As I mentioned above, it
> > can be used for caching per-file (i.e., per-ELF binary) information
> > (for example).
> >
> > >
> > > > +     __u32 vma_name_size;            /* in/out */
> > > > +     __u32 build_id_size;            /* in/out */
> > > > +     __u64 vma_name_addr;            /* in */
> > > > +     __u64 build_id_addr;            /* in */
> > >
> > > Why not document this all using kerneldoc above the structure?
> >
> > Yes, sorry, I slacked a bit on adding this upfront. I knew we'll be
> > figuring out the best place and approach, and so wanted to avoid
> > documentation churn.
> >
> > Would something like what we have for pm_scan_arg and pagemap APIs
> > work? I see it added a few simple descriptions for pm_scan_arg struct,
> > and there is Documentation/admin-guide/mm/pagemap.rst. Should I add
> > Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off,
> > though)? Anyways, I'm hoping for pointers where all this should be
> > documented. Thank you!
> >
> > >
> > > anyway, I don't like ioctls, but there is a place for them, you just
> > > have to actually justify the use for them and not say "not efficient
> > > enough" as that normally isn't an issue overall.
> >
> > I've written a demo tool in patch #5 which performs real-world task:
> > mapping addresses to their VMAs (specifically calculating file offset,
> > finding vma_start + vma_end range to further access files from
> > /proc/<pid>/map_files/<start>-<end>). I did the implementation
> > faithfully, doing it in the most optimal way for both APIs. I showed
> > that for "typical" (it's hard to specify what typical is, of course,
> > too many variables) scenario (it was data collected on a real server
> > running real service, 30 seconds of process-specific stack traces were
> > captured, if I remember correctly). I showed that doing exactly the
> > same amount of work is ~35x times slower with /proc/<pid>/maps.
> >
> > Take another process, another set of addresses, another anything, and
> > the numbers will be different, but I think it gives the right idea.
> >
> > But I think we are overpivoting on text vs binary distinction here.
> > It's the more targeted querying of VMAs that's beneficial here. This
> > allows applications to not cache anything and just re-query when doing
> > periodic or continuous profiling (where addresses are coming in not as
> > one batch, as a sequence of batches extended in time).
> >
> > /proc/<pid>/maps, for all its usefulness, just can't provide this sort
> > of ability, as it wasn't designed to do that and is targeting
> > different use cases.
> >
> > And then, a new ability to request reliable (it's not 100% reliable
> > today, I'm going to address that as a follow up) build ID is *crucial*
> > for some scenarios. The mentioned Oculus use case, the need to fully
> > access underlying ELF binary just to get build ID is frowned upon. And
> > for a good reason. Profiler only needs build ID, which is no secret
> > and not sensitive information. This new (and binary, yes) API allows
> > to add this into an API without breaking any backwards compatibility.
> >
> > >
> > > thanks,
> > >
> > > greg k-h
>

^ permalink raw reply	[relevance 6%]

* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-04 21:50  9%     ` Andrii Nakryiko
@ 2024-05-06 13:58  6%       ` Arnaldo Carvalho de Melo
  2024-05-06 18:05  6%         ` Namhyung Kim
  2024-05-06 18:41  6%         ` Andrii Nakryiko
  0 siblings, 2 replies; 200+ results
From: Arnaldo Carvalho de Melo @ 2024-05-06 13:58 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Jiri Olsa, Ian Rogers, Greg KH, Andrii Nakryiko, linux-fsdevel,
	brauner, viro, akpm, linux-kernel, bpf, linux-mm,
	Daniel Müller, linux-perf-use.

On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > hard-coded or user-provided names) is optional just like build ID. If
> > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > it, saving resources.

> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

> > Where is the userspace code that uses this new api you have created?
 
> So I added a faithful comparison of existing /proc/<pid>/maps vs new
> ioctl() API to solve a common problem (as described above) in patch
> #5. The plan is to put it in mentioned blazesym library at the very
> least.
> 
> I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> linux-perf-user), as they need to do stack symbolization as well.

At some point, when BPF iterators became a thing we thought about, IIRC
Jiri did some experimentation, but I lost track, of using BPF to
synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
as in uapi/linux/perf_event.h:

        /*
         * The MMAP2 records are an augmented version of MMAP, they add
         * maj, min, ino numbers to be used to uniquely identify each mapping
         *
         * struct {
         *      struct perf_event_header        header;
         *
         *      u32                             pid, tid;
         *      u64                             addr;
         *      u64                             len;
         *      u64                             pgoff;
         *      union {
         *              struct {
         *                      u32             maj;
         *                      u32             min;
         *                      u64             ino;
         *                      u64             ino_generation;
         *              };
         *              struct {
         *                      u8              build_id_size;
         *                      u8              __reserved_1;
         *                      u16             __reserved_2;
         *                      u8              build_id[20];
         *              };
         *      };
         *      u32                             prot, flags;
         *      char                            filename[];
         *      struct sample_id                sample_id;
         * };
         */
        PERF_RECORD_MMAP2                       = 10,

 *   PERF_RECORD_MISC_MMAP_BUILD_ID      - PERF_RECORD_MMAP2 event
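[Editor's note: the two arms of the union quoted above deliberately
occupy the same 24 bytes, which is what lets the
PERF_RECORD_MISC_MMAP_BUILD_ID misc bit flip the meaning of those bytes
without changing the record size. A minimal userspace mirror of just
that union, with field names copied from the quoted uapi comment:]

```c
#include <stddef.h>
#include <stdint.h>

/* Just the id portion of PERF_RECORD_MMAP2: either maj/min/ino
 * identification or an inline build ID, selected by the misc bit. */
union mmap2_id {
	struct {
		uint32_t maj;
		uint32_t min;
		uint64_t ino;
		uint64_t ino_generation;
	};
	struct {
		uint8_t  build_id_size;
		uint8_t  __reserved_1;
		uint16_t __reserved_2;
		uint8_t  build_id[20];
	};
};

_Static_assert(sizeof(union mmap2_id) == 24,
	       "both variants must overlay the same 24 bytes");
```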

As perf.data files can be used for many purposes we want them all, so we
set up a metadata perf file descriptor to go on receiving the new mmaps
while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
it in parallel, etc:

⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'

 Usage: perf record [<options>] [<command>]
    or: perf record [<options>] -- <command> [<options>]

        --num-thread-synthesize <n>
                          number of threads to run for event synthesis
        --synth <no|all|task|mmap|cgroup>
                          Fine-tune event synthesis: default=all

⬢[acme@toolbox perf-tools-next]$

For this specific initial synthesis of everything, the plan, as
mentioned regarding Jiri's experiments, was to use a BPF iterator to
feed the perf ring buffer with those events.  That way userspace would
just receive the usual records it gets when a new mmap is put in place,
while the BPF iterator would feed the preexisting mmaps, as instructed
via the perf_event_attr for the perf_event_open syscall.

For people not wanting BPF, i.e. disabling it altogether in perf or
disabling just BPF skels, we would fall back to the current method, or
to the one being discussed here when it becomes available.

One thing to keep in mind is that this iterator must not generate
duplicate records for non-pre-existing mmaps, i.e. we would need some
generation number that would be bumped when asking for such
pre-existing-maps PERF_RECORD_MMAP2 dumps.
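[Editor's note: purely as a hypothetical sketch of the generation idea
above (none of these names or mechanisms exist in the perf uapi today):
tag each record with a map generation, bump it when a pre-existing-maps
dump is requested, and have the consumer drop synthesized records for
addresses it already saw at a newer generation:]

```c
#include <stdint.h>
#include <string.h>

/* Tiny consumer-side filter: remember the newest generation seen per
 * mapping start address; synthesized dump records carrying an older
 * generation are duplicates and get dropped. Fixed-size table purely
 * for illustration. */
struct seen {
	uint64_t start;
	uint32_t gen;
};

struct dedup {
	struct seen seen[64];
	int n;
};

/* Returns 1 if the record should be kept, 0 if it duplicates a record
 * already received at the same or a newer generation. */
static int dedup_keep(struct dedup *d, uint64_t start, uint32_t gen)
{
	for (int i = 0; i < d->n; i++) {
		if (d->seen[i].start == start) {
			if (gen <= d->seen[i].gen)
				return 0;        /* stale duplicate */
			d->seen[i].gen = gen;    /* newer info wins */
			return 1;
		}
	}
	if (d->n < 64) {
		d->seen[d->n].start = start;
		d->seen[d->n].gen = gen;
		d->n++;
	}
	return 1;
}
```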
 
> It will be up to other similar projects to adopt this, but we'll
> definitely get this into blazesym as it is actually a problem for the

At some point looking at plugging blazesym somehow with perf may be
something to consider, indeed.

- Arnaldo

> abovementioned Oculus use case. We already had to make a tradeoff (see
> [2], this wasn't done just because we could, but it was requested by
> Oculus customers) to cache the contents of /proc/<pid>/maps and run
> the risk of missing some shared libraries that can be loaded later. It
> would be great to not have to do this tradeoff, which this new API
> would enable.
> 
>   [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> 
> >
> > > ---
> > >  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/fs.h |  32 ++++++++
> > >  2 files changed, 197 insertions(+)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 8e503a1635b7..cb7b1ff1a144 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -22,6 +22,7 @@
> > >  #include <linux/pkeys.h>
> > >  #include <linux/minmax.h>
> > >  #include <linux/overflow.h>
> > > +#include <linux/buildid.h>
> > >
> > >  #include <asm/elf.h>
> > >  #include <asm/tlb.h>
> > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > >       return do_maps_open(inode, file, &proc_pid_maps_op);
> > >  }
> > >
> > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > > +{
> > > +     struct procfs_procmap_query karg;
> > > +     struct vma_iterator iter;
> > > +     struct vm_area_struct *vma;
> > > +     struct mm_struct *mm;
> > > +     const char *name = NULL;
> > > +     char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > > +     __u64 usize;
> > > +     int err;
> > > +
> > > +     if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > > +             return -EFAULT;
> > > +     if (usize > PAGE_SIZE)
> >
> > Nice, where did you document that?  And how is that portable given that
> > PAGE_SIZE can be different on different systems?
> 
> I'm happy to document everything, can you please help by pointing
> where this documentation has to live?
> 
> This is mostly fool-proofing, though, because the user has to pass
> sizeof(struct procfs_procmap_query), which I don't see ever getting
> close to even 4KB (not even saying about 64KB). This is just to
> prevent copy_struct_from_user() below to do too much zero-checking.
> 
> >
> > and why aren't you checking the actual structure size instead?  You can
> > easily run off the end here without knowing it.
> 
> See copy_struct_from_user(), it does more checks. This is a helper
> designed specifically to deal with use cases like this where kernel
> struct size can change and user space might be newer or older.
> copy_struct_from_user() has a nice documentation describing all these
> nuances.
> 
> >
> > > +             return -E2BIG;
> > > +     if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > > +             return -EINVAL;
> >
> > Ok, so you have two checks?  How can the first one ever fail?
> 
> Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE
> won't fail, but this one will fail.
> 
> The point of this check is that user has to specify at least first
> three fields of procfs_procmap_query (size, query_flags, and
> query_addr), because without those the query is meaningless.
> >
> >
> > > +     err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> 
> and this helper does more checks validating that the user either has a
> shorter struct (and then zero-fills the rest of kernel-side struct) or
> has longer (and then the longer part has to be zero filled). Do check
> copy_struct_from_user() documentation, it's great.
> 
> > > +     if (err)
> > > +             return err;
> > > +
> > > +     if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > > +             return -EINVAL;
> > > +     if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > > +             return -EINVAL;
> > > +     if (!!karg.build_id_size != !!karg.build_id_addr)
> > > +             return -EINVAL;
> >
> > So you want values to be set, right?
> 
> Either both should be set, or neither. It's ok for both size/addr
> fields to be zero, in which case it indicates that the user doesn't
> want this part of information (which is usually a bit more expensive
> to get and might not be necessary for all the cases).
> 
> >
> > > +
> > > +     mm = priv->mm;
> > > +     if (!mm || !mmget_not_zero(mm))
> > > +             return -ESRCH;
> >
> > What is this error for?  Where is this documentned?
> 
> I copied it from existing /proc/<pid>/maps checks. I presume it's
> guarding the case when mm might be already put. So if the process is
> gone, but we have /proc/<pid>/maps file open?
> 
> >
> > > +     if (mmap_read_lock_killable(mm)) {
> > > +             mmput(mm);
> > > +             return -EINTR;
> > > +     }
> > > +
> > > +     vma_iter_init(&iter, mm, karg.query_addr);
> > > +     vma = vma_next(&iter);
> > > +     if (!vma) {
> > > +             err = -ENOENT;
> > > +             goto out;
> > > +     }
> > > +     /* user wants covering VMA, not the closest next one */
> > > +     if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > > +         vma->vm_start > karg.query_addr) {
> > > +             err = -ENOENT;
> > > +             goto out;
> > > +     }
> > > +
> > > +     karg.vma_start = vma->vm_start;
> > > +     karg.vma_end = vma->vm_end;
> > > +
> > > +     if (vma->vm_file) {
> > > +             const struct inode *inode = file_user_inode(vma->vm_file);
> > > +
> > > +             karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > > +             karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > > +             karg.dev_minor = MINOR(inode->i_sb->s_dev);
> >
> > So the major/minor is that of the file superblock?  Why?
> 
> Because an inode number is unique only within a given super block (and
> even then it's more complicated, e.g., btrfs subvolumes add more
> headaches, I believe). inode + dev maj/min is sometimes used as a key
> for caching/reusing per-binary information (e.g., pre-processed DWARF
> data, which is *very* expensive to produce, so anything that allows
> avoiding that work is helpful).
> 
> >
> > > +             karg.inode = inode->i_ino;
> >
> > What is userspace going to do with this?
> >
> 
> See above.
> 
> > > +     } else {
> > > +             karg.vma_offset = 0;
> > > +             karg.dev_major = 0;
> > > +             karg.dev_minor = 0;
> > > +             karg.inode = 0;
> >
> > Why not set everything to 0 up above at the beginning so you never miss
> > anything, and you don't miss any holes accidentally in the future.
> >
> 
> Stylistic preference, I find this more explicit, but I don't care much
> one way or another.
> 
> > > +     }
> > > +
> > > +     karg.vma_flags = 0;
> > > +     if (vma->vm_flags & VM_READ)
> > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> > > +     if (vma->vm_flags & VM_WRITE)
> > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> > > +     if (vma->vm_flags & VM_EXEC)
> > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> > > +     if (vma->vm_flags & VM_MAYSHARE)
> > > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> > > +
> 
> [...]
> 
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index 45e4e64fd664..fe8924a8d916 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -393,4 +393,36 @@ struct pm_scan_arg {
> > >       __u64 return_mask;
> > >  };
> > >
> > > +/* /proc/<pid>/maps ioctl */
> > > +#define PROCFS_IOCTL_MAGIC 0x9f
> >
> > Don't you need to document this in the proper place?
> 
> I probably do, but I'm asking for help in figuring out where. procfs
> is not an area of the kernel I typically work with, so any pointers
> are highly appreciated.
> 
> >
> > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> > > +
> > > +enum procmap_query_flags {
> > > +     PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> > > +};
> > > +
> > > +enum procmap_vma_flags {
> > > +     PROCFS_PROCMAP_VMA_READABLE = 0x01,
> > > +     PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> > > +     PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> > > +     PROCFS_PROCMAP_VMA_SHARED = 0x08,
> >
> > Are these bits?  If so, please use the bit macro for it to make it
> > obvious.
> >
> 
> Yes, they are. When I tried BIT(1), it didn't compile. I chose not to
> add any extra #includes to this UAPI header, but I can figure out the
> necessary dependency and do BIT(), I just didn't feel like BIT() adds
> much here, tbh.
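Whether or not BIT() ends up being used, these are plain power-of-two bits. As a quick illustration (flag values copied from the quoted patch; the decoding helper itself is made up), here is how userspace could turn them back into the familiar /proc/<pid>/maps permission string:

```c
/* Sketch: decode the procmap_vma_flags bits from the quoted patch into
 * the "rwxp"/"rwxs" string /proc/<pid>/maps prints. The helper name is
 * invented for illustration. */
#include <assert.h>
#include <string.h>

#define PROCFS_PROCMAP_VMA_READABLE	0x01
#define PROCFS_PROCMAP_VMA_WRITABLE	0x02
#define PROCFS_PROCMAP_VMA_EXECUTABLE	0x04
#define PROCFS_PROCMAP_VMA_SHARED	0x08

static void vma_flags_str(unsigned long long flags, char out[5])
{
	out[0] = flags & PROCFS_PROCMAP_VMA_READABLE ? 'r' : '-';
	out[1] = flags & PROCFS_PROCMAP_VMA_WRITABLE ? 'w' : '-';
	out[2] = flags & PROCFS_PROCMAP_VMA_EXECUTABLE ? 'x' : '-';
	out[3] = flags & PROCFS_PROCMAP_VMA_SHARED ? 's' : 'p';	/* shared vs private */
	out[4] = '\0';
}
```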
> 
> > > +};
> > > +
> > > +struct procfs_procmap_query {
> > > +     __u64 size;
> > > +     __u64 query_flags;              /* in */
> >
> > Does this map to the procmap_vma_flags enum?  if so, please say so.
> 
> no, procmap_query_flags, and yes, I will
> 
> >
> > > +     __u64 query_addr;               /* in */
> > > +     __u64 vma_start;                /* out */
> > > +     __u64 vma_end;                  /* out */
> > > +     __u64 vma_flags;                /* out */
> > > +     __u64 vma_offset;               /* out */
> > > +     __u64 inode;                    /* out */
> >
> > What is the inode for, you have an inode for the file already, why give
> > it another one?
> 
> This is the inode of the VMA's backing file, the same as the inode
> column in /proc/<pid>/maps. What file inode do I already have here?
> Do you mean that of /proc/<pid>/maps itself? It's useless for the
> intended purposes.
> 
> >
> > > +     __u32 dev_major;                /* out */
> > > +     __u32 dev_minor;                /* out */
> >
> > What is major/minor for?
> 
> This is the same information as emitted by /proc/<pid>/maps; it
> identifies the superblock of the VMA's backing file. As I mentioned
> above, it can be used, for example, for caching per-file (i.e.,
> per-ELF-binary) information.
> 
> >
> > > +     __u32 vma_name_size;            /* in/out */
> > > +     __u32 build_id_size;            /* in/out */
> > > +     __u64 vma_name_addr;            /* in */
> > > +     __u64 build_id_addr;            /* in */
> >
> > Why not document this all using kerneldoc above the structure?
> 
> Yes, sorry, I slacked a bit on adding this upfront. I knew we'd be
> figuring out the best place and approach, and so wanted to avoid
> documentation churn.
> 
> Would something like what we have for pm_scan_arg and pagemap APIs
> work? I see it added a few simple descriptions for pm_scan_arg struct,
> and there is Documentation/admin-guide/mm/pagemap.rst. Should I add
> Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off,
> though)? Anyways, I'm hoping for pointers where all this should be
> documented. Thank you!
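For reference, here is a rough sketch of the caller-side setup the struct quoted above implies (struct layout reproduced from the patch; the `prep_query()` helper is invented, and the actual ioctl() call is elided since the API is still under review):

```c
/* Sketch of userspace setup for the extensible query struct from the
 * quoted patch: zero it, set .size = sizeof(...) so the kernel's
 * copy_struct_from_user() can handle older/newer callers, and set the
 * name buffer pointer+size together (both or neither, per the API). */
#include <assert.h>
#include <string.h>

typedef unsigned long long __u64;
typedef unsigned int __u32;

struct procfs_procmap_query {
	__u64 size;
	__u64 query_flags;		/* in */
	__u64 query_addr;		/* in */
	__u64 vma_start;		/* out */
	__u64 vma_end;			/* out */
	__u64 vma_flags;		/* out */
	__u64 vma_offset;		/* out */
	__u64 inode;			/* out */
	__u32 dev_major;		/* out */
	__u32 dev_minor;		/* out */
	__u32 vma_name_size;		/* in/out */
	__u32 build_id_size;		/* in/out */
	__u64 vma_name_addr;		/* in */
	__u64 build_id_addr;		/* in */
};

static void prep_query(struct procfs_procmap_query *q, __u64 addr,
		       char *name_buf, __u32 name_sz)
{
	memset(q, 0, sizeof(*q));
	q->size = sizeof(*q);		/* lets the kernel size-check the struct */
	q->query_addr = addr;
	q->vma_name_addr = (__u64)(unsigned long)name_buf;
	q->vma_name_size = name_sz;	/* both addr and size set, or neither */
}
```

Note that all fields are naturally aligned 64/32-bit integers, so the struct has no implicit padding.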
> 
> >
> > anyway, I don't like ioctls, but there is a place for them, you just
> > have to actually justify the use of them and not say "not efficient
> > enough", as that normally isn't an issue overall.
> 
> I've written a demo tool in patch #5 which performs a real-world task:
> mapping addresses to their VMAs (specifically, calculating file
> offsets and finding the vma_start-vma_end range to further access
> files from /proc/<pid>/map_files/<start>-<end>). I did the implementation
faithfully, doing it in the most optimal way for both APIs. For a
> "typical" scenario (it's hard to specify what typical is, of course,
> too many variables; it was data collected on a real server running a
> real service, 30 seconds of process-specific stack traces captured, if
> I remember correctly), I showed that doing exactly the same amount of
> work is ~35x slower with /proc/<pid>/maps.
> 
> Take another process, another set of addresses, another anything, and
> the numbers will be different, but I think it gives the right idea.
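To make the batched scheme concrete, here is a rough sketch (not the actual tool from patch #5; all names and VMA data are invented) of the sort-once-then-single-forward-pass resolution it implements:

```c
/* Sketch of batched address resolution: sort the captured addresses
 * once, then resolve all of them in a single forward pass over the
 * (already sorted) VMA list, computing file offsets for matches.
 * VMA data here is synthetic. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct vma { uint64_t start, end, file_off; };

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return x < y ? -1 : x > y;
}

/* For each address (processed in sorted order), store the file offset
 * it maps to, or -1 if no VMA covers it. */
static void resolve_batch(const struct vma *vmas, size_t nvma,
			  uint64_t *addrs, int64_t *offs, size_t naddr)
{
	size_t i, j = 0;

	qsort(addrs, naddr, sizeof(*addrs), cmp_u64);
	for (i = 0; i < naddr; i++) {
		offs[i] = -1;
		/* VMAs ending before this addr can't match any later addr */
		while (j < nvma && vmas[j].end <= addrs[i])
			j++;
		if (j < nvma && vmas[j].start <= addrs[i])
			offs[i] = vmas[j].file_off + (addrs[i] - vmas[j].start);
	}
}
```

The ioctl-based variant follows the same shape, except the inner scan is replaced by one PROCMAP_QUERY call per address (with the next-VMA flag), skipping irrelevant VMAs in the kernel.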
> 
> But I think we are over-focusing on the text-vs-binary distinction
> here. It's the more targeted querying of VMAs that's beneficial. It
> allows applications to not cache anything and just re-query when doing
> periodic or continuous profiling (where addresses arrive not as one
> batch, but as a sequence of batches spread out in time).
> 
> /proc/<pid>/maps, for all its usefulness, just can't provide this sort
> of ability, as it wasn't designed to do that and is targeting
> different use cases.
> 
> And then, the new ability to request a reliable build ID (it's not
> 100% reliable today; I'm going to address that as a follow-up) is
> *crucial* for some scenarios. In the mentioned Oculus use case,
> needing full access to the underlying ELF binary just to get the
> build ID is frowned upon, and for a good reason: the profiler only
> needs the build ID, which is neither secret nor sensitive. This new
> (and binary, yes) API allows adding that capability without breaking
> any backwards compatibility.
> 
> >
> > thanks,
> >
> > greg k-h


* Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps
  2024-05-04  0:30 13% [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-05-04 11:24  7% ` [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Christian Brauner
@ 2024-05-05  5:26  6% ` Ian Rogers
  2024-05-06 18:58 11%   ` Andrii Nakryiko
  3 siblings, 1 reply; 200+ results
From: Ian Rogers @ 2024-05-05  5:26 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, gregkh, linux-mm

On Fri, May 3, 2024 at 5:30 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> applications to query VMA information more efficiently than through textual
> processing of /proc/<pid>/maps contents. See patch #2 for the context,
> justification, and nuances of the API design.
>
> Patch #1 is a refactoring to keep VMA name logic determination in one place.
> Patch #2 is the meat of kernel-side API.
> Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> optionally use this new ioctl()-based API, if supported.
> Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> both textual and binary interfaces) and allows benchmarking them. Patch itself
> also has performance numbers of a test based on one of the medium-sized
> internal applications taken from production.
>
> This patch set was based on top of next-20240503 tag in linux-next tree.
> Not sure what should be the target tree for this, I'd appreciate any guidance,
> thank you!
>
> Andrii Nakryiko (5):
>   fs/procfs: extract logic for getting VMA name constituents
>   fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
>   tools: sync uapi/linux/fs.h header into tools subdir
>   selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
>   selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

I'd love to see improvements like this for the Linux perf command.
Some thoughts:

 - Could we do something scalability-wise better than a file
descriptor per pid? If a profiler is running in a container, the cost
of many file descriptors can be significant, and it increases as
machines get larger. Could we have a /proc/maps for all
processes?

 - Something that is broken in perf currently is that we can race
between reading /proc and opening events on the pids it contains. For
example, perf top supports a uid option that first scans to find all
processes owned by a user then tries to open an event on each process.
This fails if the process terminates between the scan and the open
leading to a frequent:
```
$ sudo perf top -u `id -u`
The sys_perf_event_open() syscall returned with 3 (No such process)
for event (cycles:P).
```
It would be nice for the API to consider cgroups, uids and the like as
ways to get a subset of things to scan.

 - Somewhat related: the mmap perf events give data after the mmap
call has happened. As VMAs get merged, this can lead to mmap perf
events looking like the memory overlaps (for JITs using anonymous
memory), and we lack munmap/mremap events.

Jiri Olsa has looked at improvements in this area in the past.

Thanks,
Ian

>  fs/proc/task_mmu.c                            | 290 +++++++++++---
>  include/uapi/linux/fs.h                       |  32 ++
>  .../perf/trace/beauty/include/uapi/linux/fs.h |  32 ++
>  tools/testing/selftests/bpf/.gitignore        |   1 +
>  tools/testing/selftests/bpf/Makefile          |   2 +-
>  tools/testing/selftests/bpf/procfs_query.c    | 366 ++++++++++++++++++
>  tools/testing/selftests/bpf/test_progs.c      |   3 +
>  tools/testing/selftests/bpf/test_progs.h      |   2 +
>  tools/testing/selftests/bpf/trace_helpers.c   | 105 ++++-
>  9 files changed, 763 insertions(+), 70 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/procfs_query.c
>
> --
> 2.43.0
>
>


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-04  0:30  9% ` [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
  2024-05-04 15:28  6%   ` Greg KH
@ 2024-05-04 23:36  9%   ` kernel test robot
  2024-05-07 18:10  7%   ` Liam R. Howlett
  2 siblings, 0 replies; 200+ results
From: kernel test robot @ 2024-05-04 23:36 UTC (permalink / raw)
  To: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm
  Cc: oe-kbuild-all, linux-kernel, bpf, gregkh, linux-mm, Andrii Nakryiko

Hi Andrii,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20240503]
[also build test WARNING on v6.9-rc6]
[cannot apply to bpf-next/master bpf/master perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools brauner-vfs/vfs.all linus/master acme/perf/core v6.9-rc6 v6.9-rc5 v6.9-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/fs-procfs-extract-logic-for-getting-VMA-name-constituents/20240504-083146
base:   next-20240503
patch link:    https://lore.kernel.org/r/20240504003006.3303334-3-andrii%40kernel.org
patch subject: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240505/202405050750.5oyajnPF-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240505/202405050750.5oyajnPF-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405050750.5oyajnPF-lkp@intel.com/

All warnings (new ones prefixed by >>):

   fs/proc/task_mmu.c: In function 'do_procmap_query':
>> fs/proc/task_mmu.c:505:48: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     505 |         if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
         |                                                ^
   fs/proc/task_mmu.c:512:48: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     512 |         if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
         |                                                ^


vim +505 fs/proc/task_mmu.c

   378	
   379	static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
   380	{
   381		struct procfs_procmap_query karg;
   382		struct vma_iterator iter;
   383		struct vm_area_struct *vma;
   384		struct mm_struct *mm;
   385		const char *name = NULL;
   386		char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
   387		__u64 usize;
   388		int err;
   389	
   390		if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
   391			return -EFAULT;
   392		if (usize > PAGE_SIZE)
   393			return -E2BIG;
   394		if (usize < offsetofend(struct procfs_procmap_query, query_addr))
   395			return -EINVAL;
   396		err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
   397		if (err)
   398			return err;
   399	
   400		if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
   401			return -EINVAL;
   402		if (!!karg.vma_name_size != !!karg.vma_name_addr)
   403			return -EINVAL;
   404		if (!!karg.build_id_size != !!karg.build_id_addr)
   405			return -EINVAL;
   406	
   407		mm = priv->mm;
   408		if (!mm || !mmget_not_zero(mm))
   409			return -ESRCH;
   410		if (mmap_read_lock_killable(mm)) {
   411			mmput(mm);
   412			return -EINTR;
   413		}
   414	
   415		vma_iter_init(&iter, mm, karg.query_addr);
   416		vma = vma_next(&iter);
   417		if (!vma) {
   418			err = -ENOENT;
   419			goto out;
   420		}
   421		/* user wants covering VMA, not the closest next one */
   422		if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
   423		    vma->vm_start > karg.query_addr) {
   424			err = -ENOENT;
   425			goto out;
   426		}
   427	
   428		karg.vma_start = vma->vm_start;
   429		karg.vma_end = vma->vm_end;
   430	
   431		if (vma->vm_file) {
   432			const struct inode *inode = file_user_inode(vma->vm_file);
   433	
   434			karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
   435			karg.dev_major = MAJOR(inode->i_sb->s_dev);
   436			karg.dev_minor = MINOR(inode->i_sb->s_dev);
   437			karg.inode = inode->i_ino;
   438		} else {
   439			karg.vma_offset = 0;
   440			karg.dev_major = 0;
   441			karg.dev_minor = 0;
   442			karg.inode = 0;
   443		}
   444	
   445		karg.vma_flags = 0;
   446		if (vma->vm_flags & VM_READ)
   447			karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
   448		if (vma->vm_flags & VM_WRITE)
   449			karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
   450		if (vma->vm_flags & VM_EXEC)
   451			karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
   452		if (vma->vm_flags & VM_MAYSHARE)
   453			karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
   454	
   455		if (karg.build_id_size) {
   456			__u32 build_id_sz = BUILD_ID_SIZE_MAX;
   457	
   458			err = build_id_parse(vma, build_id_buf, &build_id_sz);
   459			if (!err) {
   460				if (karg.build_id_size < build_id_sz) {
   461					err = -ENAMETOOLONG;
   462					goto out;
   463				}
   464				karg.build_id_size = build_id_sz;
   465			}
   466		}
   467	
   468		if (karg.vma_name_size) {
   469			size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
   470			const struct path *path;
   471			const char *name_fmt;
   472			size_t name_sz = 0;
   473	
   474			get_vma_name(vma, &path, &name, &name_fmt);
   475	
   476			if (path || name_fmt || name) {
   477				name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
   478				if (!name_buf) {
   479					err = -ENOMEM;
   480					goto out;
   481				}
   482			}
   483			if (path) {
   484				name = d_path(path, name_buf, name_buf_sz);
   485				if (IS_ERR(name)) {
   486					err = PTR_ERR(name);
   487					goto out;
   488				}
   489				name_sz = name_buf + name_buf_sz - name;
   490			} else if (name || name_fmt) {
   491				name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
   492				name = name_buf;
   493			}
   494			if (name_sz > name_buf_sz) {
   495				err = -ENAMETOOLONG;
   496				goto out;
   497			}
   498			karg.vma_name_size = name_sz;
   499		}
   500	
   501		/* unlock and put mm_struct before copying data to user */
   502		mmap_read_unlock(mm);
   503		mmput(mm);
   504	
 > 505		if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
   506						       name, karg.vma_name_size)) {
   507			kfree(name_buf);
   508			return -EFAULT;
   509		}
   510		kfree(name_buf);
   511	
   512		if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
   513						       build_id_buf, karg.build_id_size))
   514			return -EFAULT;
   515	
   516		if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
   517			return -EFAULT;
   518	
   519		return 0;
   520	
   521	out:
   522		mmap_read_unlock(mm);
   523		mmput(mm);
   524		kfree(name_buf);
   525		return err;
   526	}
   527	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
  2024-05-04 15:32  0%   ` Greg KH
@ 2024-05-04 22:13  0%     ` Andrii Nakryiko
  2024-05-07 15:48  0%       ` Liam R. Howlett
  0 siblings, 1 reply; 200+ results
From: Andrii Nakryiko @ 2024-05-04 22:13 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm,
	linux-kernel, bpf, linux-mm

On Sat, May 4, 2024 at 8:32 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > I also did an strace run of both cases. In text-based one the tool did
> > 68 read() syscalls, fetching up to 4KB of data in one go.
>
> Why not fetch more at once?
>

I didn't expect to be interrogated so much on the performance of the
text parsing front, sorry. :) You can probably tune this, but where is
the reasonable limit? 64KB? 256KB? 1MB? See below for some more
production numbers.

> And I have a fun 'readfile()' syscall implementation around here that
> needs justification to get merged (I try so every other year or so) that
> can do the open/read/close loop in one call, with the buffer size set by
> userspace if you really are saying this is a "hot path" that needs that
> kind of speedup.  But in the end, io_uring usually is the proper api for
> that instead, why not use that here instead of slow open/read/close if
> you care about speed?
>

I'm not sure what I need to say here. I'm sure it will be useful, but
as I already explained, it's not about text vs. binary per se; it's
about having to read too much information that's completely
irrelevant. Again, see below for another data point.

> > In comparison,
> > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > relevant VMAs.
> >
> > It is projected that savings from processing big production applications
> > would only widen the gap in favor of binary-based querying ioctl API, as
> > bigger applications will tend to have even more non-executable VMA
> > mappings relative to executable ones.
>
> Define "bigger applications" please.  Is this some "large database
> company workload" type of thing, or something else?

I don't have a definition. But I had in mind, as one example, an
ads-serving service we use internally (it's a pretty large application
by pretty much any metric you can come up with). I just randomly
picked one of the production hosts, found one instance of that
service, and looked at its /proc/<pid>/maps file. Hopefully it will
satisfy your need for specifics.

# cat /proc/1126243/maps | wc -c
1570178
# cat /proc/1126243/maps | wc -l
28875
# cat /proc/1126243/maps | grep ' ..x. ' | wc -l
7347

You can see that the maps file itself is about 1.5MB of text (which
makes single-shot reading of its entire contents a bit unrealistic,
though, sure, why not). The process contains 28875 VMAs, out of which
only 7347 are executable.

This means that if we were to profile this process (and normally we
profile the entire system, so it's almost never a single
/proc/<pid>/maps file that needs to be opened and processed), we'd need
*at most* (absolute worst case!) 7347/28875 = 25.5% of the entries. In
reality, most code will be concentrated in a much smaller number of
executable VMAs, of course. But no, I don't have specific numbers at
hand, sorry.

It matters less whether it's text or binary (though binary will
undoubtedly be faster; it's strange to even argue about this): it's the
ability to fetch only relevant VMAs that is the point here.

>
> thanks,
>
> greg k-h


* Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps
  2024-05-04 11:24  7% ` [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Christian Brauner
  2024-05-04 15:33  7%   ` Greg KH
@ 2024-05-04 21:50 14%   ` Andrii Nakryiko
  1 sibling, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-04 21:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, linux-fsdevel, viro, akpm, linux-kernel, bpf,
	gregkh, linux-mm

On Sat, May 4, 2024 at 4:24 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> >
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
>
> I don't have anything against adding a binary interface for this. But
> it's somewhat odd to do ioctls based on /proc files. I wonder if there
> isn't a more suitable place for this. prctl()? New vmstat() system call
> using a pidfd/pid as reference? ioctl() on fs/pidfs.c?

I did ioctl() on /proc/<pid>/maps because that's the file that's used
for the same use cases and it can be opened from other processes for
any target PID. I'm open to any suggestions that make more sense, this
v1 is mostly to start the conversation.

prctl() probably doesn't make sense, as according to man page:

       prctl() manipulates various aspects of the behavior of the
       calling thread or process.

And this facility is most often used from another (profiler or
symbolizer) process.

A new syscall feels like overkill, but if that's the only way, so be it.

I do like the idea of ioctl() on top of pidfd (I assume that's what
you mean by "fs/pidfs.c", right)? This seems most promising. One
question/nuance. If I understand correctly, pidfd won't hold
task_struct (and its mm_struct) reference, right? So if the process
exits, even if I have pidfd, that task is gone and so we won't be able
to query it. Is that right?

If yes, then it's still workable in a lot of situations, but it would
be nice to have an ability to query VMAs (at least for binary's own
text segments) even if the process exits. This is the case for
short-lived processes that profilers capture some stack traces from,
but by the time these stack traces are processed they are gone.

This might be a stupid idea and question, but what if an ioctl() on the
pidfd itself created another FD representing the mm_struct of that
process, and then we had an ioctl() on *that* sort-of-mm-struct-fd to
query VMAs. Would that work at all? This approach would allow a
long-running profiler application to open the pidfd and this other "mm
fd" once, cache them, and then just query. Meanwhile, we can epoll() the
pidfd itself to know when the process exits, so that these mm_structs
are not referenced for longer than necessary.

Is this pushing too far, or do you think that would work and be acceptable?
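The "watch for process exit via the pidfd" part already works today, for what it's worth. A minimal sketch (assuming Linux 5.3+ for pidfd_open(2); the helper name is invented): the pidfd becomes readable once the target process exits.

```c
/* Sketch: obtain a pidfd with pidfd_open(2) and poll() it; the fd
 * reports POLLIN when the process terminates. Linux-only (5.3+). */
#include <assert.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int wait_for_exit_via_pidfd(pid_t pid)
{
	struct pollfd pfd = { .events = POLLIN };
	int ret;

	pfd.fd = syscall(SYS_pidfd_open, pid, 0);
	if (pfd.fd < 0)
		return -1;
	ret = poll(&pfd, 1, 5000 /* ms timeout */);
	close(pfd.fd);
	return (ret == 1 && (pfd.revents & POLLIN)) ? 0 : -1;
}
```

A profiler would register the pidfd with its epoll loop instead of blocking, and drop the cached "mm fd" when the exit notification arrives.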

But in any case, I think ioctl() on top of pidfd makes total sense for
this, thanks.


* Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps
  2024-05-04 15:33  7%   ` Greg KH
@ 2024-05-04 21:50  7%     ` Andrii Nakryiko
  0 siblings, 0 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-04 21:50 UTC (permalink / raw)
  To: Greg KH
  Cc: Christian Brauner, Andrii Nakryiko, linux-fsdevel, viro, akpm,
	linux-kernel, bpf, linux-mm

On Sat, May 4, 2024 at 8:34 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Sat, May 04, 2024 at 01:24:23PM +0200, Christian Brauner wrote:
> > On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > > applications to query VMA information more efficiently than through textual
> > > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > > justification, and nuances of the API design.
> > >
> > > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > > Patch #2 is the meat of kernel-side API.
> > > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > > optionally use this new ioctl()-based API, if supported.
> > > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > > also has performance numbers of a test based on one of the medium-sized
> > > internal applications taken from production.
> >
> > I don't have anything against adding a binary interface for this. But
> > it's somewhat odd to do ioctls based on /proc files. I wonder if there
> > isn't a more suitable place for this. prctl()? New vmstat() system call
> > using a pidfd/pid as reference? ioctl() on fs/pidfs.c?
>
> See my objection to the ioctl api in the patch review itself.

Will address them there.


>
> Also, as this is a new user/kernel api, it needs loads of documentation
> (there was none), and probably also cc: linux-api, right?

Will cc linux-api. And yes, I didn't want to invest too much time in
documentation upfront, as I knew the API itself would be tweaked,
tuned, and possibly moved to some other place (see Christian's pidfd
suggestion). But I'm happy to write it; I'd appreciate pointers on
where exactly it should live. Thanks!

>
> thanks,
>
> greg k-h


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-04 15:28  6%   ` Greg KH
@ 2024-05-04 21:50  9%     ` Andrii Nakryiko
  2024-05-06 13:58  6%       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 200+ results
From: Andrii Nakryiko @ 2024-05-04 21:50 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrii Nakryiko, linux-fsdevel, brauner, viro, akpm,
	linux-kernel, bpf, linux-mm, Daniel Müller, linux-perf-use.,
	Arnaldo Carvalho de Melo

On Sat, May 4, 2024 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > /proc/<pid>/maps file is extremely useful in practice for various tasks
> > involving figuring out process memory layout, what files are backing any
> > given memory range, etc. One important class of applications that
> > absolutely rely on this are profilers/stack symbolizers. They would
> > normally capture a stack trace containing absolute memory addresses of
> > some functions, and would then use the /proc/<pid>/maps file to find
> > corresponding backing ELF files, file offsets within them, and then
> > continue from there to get yet more information (ELF symbols, DWARF
> > information) to get human-readable symbolic information.
> >
> > As such, there are both performance and correctness requirements
> > involved. This address-to-VMA information translation has to be done as
> > efficiently as possible, but also not miss any VMA (especially in the
> > case of loading/unloading shared libraries).
> >
> > Unfortunately, for all the /proc/<pid>/maps file universality and
> > usefulness, it doesn't fit the above 100%.
>
> Is this a new change or has it always been this way?
>

It has probably always been this way. My first exposure to profiling and
stack symbolization was about 7 years ago, and already then
/proc/<pid>/maps was the only way to do this, and not a 100% fit even
then.

> > First, it's text based, which makes its programmatic use from
> > applications and libraries unnecessarily cumbersome and slow due to the
> > need to do text parsing to get necessary pieces of information.
>
> slow in what way?  How has it never been noticed before as a problem?

It's just inherently slower to parse text to fish out a bunch of
integers (vma_start address, offset, inode+dev and file paths are
typical pieces needed to "normalize" captured stack trace addresses).
It's not too bad in terms of programming and performance for
scanf-like APIs, but without scanf, you are dealing with splitting by
whitespaces and tons of unnecessary string allocations.
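To illustrate what the scanf-like path looks like in practice (a minimal sketch; struct and helper names are invented, and the sample line below is made up), parsing one /proc/<pid>/maps line:

```c
/* Sketch: pull the usual fields (start, end, perms, file offset,
 * dev maj:min, inode, optional path) out of one /proc/<pid>/maps
 * line with sscanf(). */
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

struct maps_entry {
	uint64_t start, end, file_off, inode;
	unsigned int dev_major, dev_minor;
	char perms[5];
	char path[256];
};

static int parse_maps_line(const char *line, struct maps_entry *e)
{
	int n = sscanf(line,
		       "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64 " %x:%x %" SCNu64 " %255s",
		       &e->start, &e->end, e->perms, &e->file_off,
		       &e->dev_major, &e->dev_minor, &e->inode, e->path);

	if (n < 7)
		return -1;
	if (n == 7)
		e->path[0] = '\0';	/* anonymous mapping, no path */
	return 0;
}
```

Without sscanf (e.g., in languages lacking it), the same work turns into manual whitespace splitting and per-field string handling, which is where the allocations mentioned above come from.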

It was noticed; I think people using this for profiling/symbolization
are not necessarily well versed in kernel development, and they just
get by with what the kernel provides.

>
> And exact numbers are appreciated please, yes open/read/close seems
> slower than open/ioctl/close, but is it really overall an issue in the
> real world for anything?
>
> Text apis are good as everyone can handle them, ioctls are harder for
> obvious reasons.

Yes, and acknowledged the usefulness of text-based interface. But it's
my (and other people I've talked with that had to deal with these
textual interfaces) opinion that using binary interfaces are far
superior when it comes to *programmatic* usage (i.e., from
C/C++/Rust/whatever languages directly). Textual is great for bash
scripts and human debugging, of course.

>
> > Second, its main purpose is to emit all VMAs sequentially, but in
> > practice captured addresses would fall into only a small subset of
> > all the process' VMAs, mainly those containing executable text. Yet,
> > a library would need to parse most or all of the contents to find the
> > needed VMAs, as there is no way to skip VMAs that are of no use. An
> > efficient library can do a linear pass and still be relatively fast,
> > but it's definitely overhead that could be avoided if there were a
> > way to do more targeted querying of the relevant VMA information.
>
> I don't understand, is this a bug in the current files?  If so, why not
> just fix that up?
>

It's not a bug, I think /proc/<pid>/maps was targeted to describe
*entire* address space, but for profiling and symbolization needs we
need to find only a small subset of relevant VMAs. There is nothing
wrong with existing implementation, it's just not a 100% fit for the
more specialized "let's find relevant VMAs for this set of addresses"
problem.

> And again "efficient" need to be quantified.

You probably saw patch #5 where I solve exactly the same problem in
two different ways. The problem is typical for symbolization: you are
given a bunch of addresses within some process, and need to find the
files they belong to and the file offsets they are mapped to. This is
then used to, for example, match them to ELF symbols representing
functions.
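The batched pass over /proc/<pid>/maps boils down to roughly this
(illustrative model, not the actual patch #5 code): sort the addresses,
then advance through the VMAs once, computing file offsets for matches:

```python
def resolve_batch(addrs, vmas):
    # vmas: list of (vma_start, vma_end, file_offset, path), sorted by
    # vma_start, as produced by a single pass over /proc/<pid>/maps
    order = sorted(range(len(addrs)), key=lambda i: addrs[i])
    out = [None] * len(addrs)
    vi = 0
    for i in order:                       # original positions restored via i
        a = addrs[i]
        while vi < len(vmas) and vmas[vi][1] <= a:
            vi += 1                       # skip VMAs entirely below addr
        if vi < len(vmas) and vmas[vi][0] <= a:
            start, _end, off, path = vmas[vi]
            out[i] = (path, a - start + off)   # file offset of the address
    return out
```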

>
> > Another problem when writing generic stack trace symbolization library
> > is an unfortunate performance-vs-correctness tradeoff that needs to be
> > made.
>
> What requirement has caused a "generic stack trace symbolization
> library" to be needed at all?  What is the problem you are trying to
> solve that is not already solved by existing tools?

Capturing stack traces is a very common need, especially for BPF-based
tools and applications. E.g., bpftrace allows one to capture stack
traces for some "interesting events" (whatever that is: some kernel
function call, user function call, perf event; there is tons of
flexibility). Stack traces answer "how did we get here?", but a trace
is just an array of addresses, which need to be translated into
something humans can make sense of.

That's what the symbolization library helps with. This process is
multi-step, quite involved, and hard to get right with a good balance
of efficiency, correctness, and completeness of information (there is
always a choice between doing simplistic symbolization using just ELF
symbols, or much more expensive but fuller symbolization using DWARF
information, which also gives file name + line number information, can
symbolize inlined functions, etc.).

One such library is blazesym ([0], cc'ed Daniel, who's working on it),
which is developed by Meta both for internal use in our fleet-wide
profiler and for integration into bpftrace (to improve bpftrace's
current somewhat limited symbolization approach based on BCC). There is
also a non-Meta project (Datadog, I believe) that is using it for its
own needs.

Symbolization is quite a common task that's highly non-trivial.

  [0] https://github.com/libbpf/blazesym

>
> > Library has to make a decision to either cache parsed contents of
> > /proc/<pid>/maps to service future requests (if application requests to
> > symbolize another set of addresses, captured at some later time, which
> > is typical for periodic/continuous profiling cases) to avoid the higher
> > cost of needing to re-parse this file, or to re-read and re-parse its
> > contents on every request. In the former case, more memory is used for
> > the cache and there is a risk of getting stale data if application
> > loaded/unloaded shared libraries, or otherwise changed its set of VMAs
> > through additional mmap() calls (and other means of altering memory
> > address space). In the latter case, it's the performance hit that comes
> > from re-opening the file and re-reading/re-parsing its contents all over
> > again.
>
> Again, "performance hit" needs to be justified, it shouldn't be much
> overall.

I'm not sure how to answer whether it's much or not. Can you be a bit
more specific on what you'd like to see?

But I want to say that sensitivity to any overhead differs a lot
depending on specifics. As a general rule, we try to minimize any
resource usage of the profiler/symbolizer itself on the host being
profiled, to minimize disruption of the production workload. So
anything that can be done to optimize any part of the overall profiling
process is a benefit.

But while for big servers tolerance might be higher in terms of
re-opening and re-parsing a bunch of text files, we also have use
cases on much less powerful and very performance-sensitive Oculus VR
devices, for example. There, any extra piece of work is scrutinized,
so having to parse text on those relatively weak devices does add up,
enough that we spent effort optimizing text parsing in blazesym's Rust
code (see [1] for recent improvements).

  [1] https://github.com/libbpf/blazesym/pull/643/commits/b89b91b42b994b135a0079bf04b2319c0054f745

>
> > This patch aims to solve this problem by providing a new API built on
> > top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
> > interface, avoiding the cost and awkwardness of textual representation
> > for programmatic use.
>
> Some people find text easier to handle for programmatic use :)

I don't disagree, but pretty much everyone I've discussed these
text-based kernel APIs with is pretty uniformly in favor of binary
interfaces, where available.

But note, I'm not proposing to deprecate or remove text-based
/proc/<pid>/maps. And the main point of this work is not so much
binary vs text, as more selecting "point-based" querying capability as
opposed to the "iterate everything" approach of /proc/<pid>/maps.

>
> > It's designed to be extensible and
> > forward/backward compatible by including user-specified field size and
> > using the copy_struct_from_user() approach. But, most importantly, it
> > allows doing point queries for a specific single address, specified by
> > the user. And this is done efficiently using the VMA iterator.
>
> Ok, maybe this is the main issue, you only want one at a time?

Yes. More or less, I need "a few" that cover a captured set of addresses.
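To illustrate what "a few" means in practice, here's a rough model of the
intended query loop (the query_next() callback stands in for the proposed
ioctl with the EXACT_OR_NEXT_VMA flag; this is a sketch, not the actual
tool from patch #5). One query is issued per *relevant* VMA rather than
per address, and irrelevant VMAs are never touched:

```python
def resolve_with_point_queries(addrs, query_next):
    # query_next(addr) models PROCMAP_QUERY with EXACT_OR_NEXT_VMA:
    # returns (vma_start, vma_end) for the VMA covering addr, or the
    # closest VMA above it, or None if there is no such VMA
    order = sorted(range(len(addrs)), key=lambda i: addrs[i])
    out = [None] * len(addrs)
    vma = None
    for i in order:
        a = addrs[i]
        if vma is None or a >= vma[1]:
            vma = query_next(a)           # one query per relevant VMA
        if vma and vma[0] <= a < vma[1]:
            out[i] = vma
    return out

# mock kernel side: two VMAs, with a gap in between
vmas = [(0x1000, 0x2000), (0x4000, 0x5000)]
calls = []
def fake_query(a):
    calls.append(a)
    return next(((s, e) for s, e in vmas if a < e), None)

results = resolve_with_point_queries([0x4100, 0x1500, 0x3000], fake_query)
```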

>
> > User has a choice to pick either getting VMA that covers provided
> > address or -ENOENT if none is found (exact, least surprising, case). Or,
> > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can
> > get either VMA that covers the address (if there is one), or the closest
> > next VMA (i.e., VMA with the smallest vm_start > addr). The latter allows
> > more efficient use, but, given it could be a surprising behavior,
> > requires an explicit opt-in.
> >
> > Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes
> > sense given it's querying the same set of VMA data. All the permissions
> > checks performed on /proc/<pid>/maps opening fit here as well.
> > ioctl-based implementation is fetching remembered mm_struct reference,
> > but otherwise doesn't interfere with seq_file-based implementation of
> > /proc/<pid>/maps textual interface, and so could be used together or
> > independently without paying any price for that.
> >
> > There is one extra thing that /proc/<pid>/maps doesn't currently
> > provide, and that's an ability to fetch ELF build ID, if present. User
> > has control over whether this piece of information is requested or not
> > by either setting build_id_size field to zero or non-zero maximum buffer
> > size they provided through build_id_addr field (which encodes user
> > pointer as __u64 field).
> >
> > The need to get ELF build ID reliably is an important aspect when
> > dealing with profiling and stack trace symbolization, and
> > /proc/<pid>/maps textual representation doesn't help with this,
> > requiring applications to open underlying ELF binary through
> > /proc/<pid>/map_files/<start>-<end> symlink, which adds extra
> > permission implications due to giving full access to the binary from
> > (potentially) another process, while all the application is interested
> > in is the build ID. Giving an ability to request just the build ID
> > doesn't introduce any additional security concerns, on top of what
> > /proc/<pid>/maps is already concerned with, simplifying the overall
> > logic.
> >
> > Kernel already implements build ID fetching, which is used from BPF
> > subsystem. We are reusing this code here, but plan follow-up changes
> > to make it work better under more relaxed assumption (compared to what
> > existing code assumes) of being called from user process context, in
> > which page faults are allowed. BPF-specific implementation currently
> > bails out if necessary part of ELF file is not paged in, all due to
> > extra BPF-specific restrictions (like the need to fetch build ID in
> > restrictive contexts such as NMI handler).
> >
> > Note also, that fetching VMA name (e.g., backing file path, or special
> > hard-coded or user-provided names) is optional just like build ID. If
> > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > it, saving resources.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>
> Where is the userspace code that uses this new api you have created?

So I added a faithful comparison of the existing /proc/<pid>/maps
approach vs the new ioctl() API solving a common problem (as described
above) in patch #5. The plan is to put it into the mentioned blazesym
library at the very least.

I'm sure perf would benefit from this as well (cc'ed Arnaldo and
linux-perf-user), as they need to do stack symbolization as well.

It will be up to other similar projects to adopt this, but we'll
definitely get this into blazesym, as it is an actual problem for the
abovementioned Oculus use case. We already had to make a tradeoff (see
[2]; this wasn't done just because we could, it was requested by Oculus
customers) to cache the contents of /proc/<pid>/maps and run the risk
of missing some shared libraries that can be loaded later. It would be
great not to have to make this tradeoff, which this new API would
enable.

  [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf

>
> > ---
> >  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/fs.h |  32 ++++++++
> >  2 files changed, 197 insertions(+)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 8e503a1635b7..cb7b1ff1a144 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/pkeys.h>
> >  #include <linux/minmax.h>
> >  #include <linux/overflow.h>
> > +#include <linux/buildid.h>
> >
> >  #include <asm/elf.h>
> >  #include <asm/tlb.h>
> > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> >       return do_maps_open(inode, file, &proc_pid_maps_op);
> >  }
> >
> > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > +{
> > +     struct procfs_procmap_query karg;
> > +     struct vma_iterator iter;
> > +     struct vm_area_struct *vma;
> > +     struct mm_struct *mm;
> > +     const char *name = NULL;
> > +     char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > +     __u64 usize;
> > +     int err;
> > +
> > +     if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > +             return -EFAULT;
> > +     if (usize > PAGE_SIZE)
>
> Nice, where did you document that?  And how is that portable given that
> PAGE_SIZE can be different on different systems?

I'm happy to document everything; can you please help by pointing out
where this documentation should live?

This is mostly fool-proofing, though, because the user has to pass
sizeof(struct procfs_procmap_query), which I don't see ever getting
close to even 4KB (let alone 64KB). This is just to prevent
copy_struct_from_user() below from doing too much zero-checking.

>
> and why aren't you checking the actual structure size instead?  You can
> easily run off the end here without knowing it.

See copy_struct_from_user(); it does more checks. It's a helper
designed specifically to deal with use cases like this, where the
kernel struct size can change and user space might be newer or older.
copy_struct_from_user() has nice documentation describing all these
nuances.

>
> > +             return -E2BIG;
> > +     if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > +             return -EINVAL;
>
> Ok, so you have two checks?  How can the first one ever fail?

Hmm, if usize == 8, copy_from_user() won't fail and the usize >
PAGE_SIZE check won't trigger, but this one will fail.

The point of this check is that the user has to specify at least the
first three fields of procfs_procmap_query (size, query_flags, and
query_addr), because without those the query is meaningless.
>
>
> > +     err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);

and this helper does more checks, validating that the user passed
either a shorter struct (in which case the rest of the kernel-side
struct is zero-filled) or a longer one (in which case the extra tail
has to be all zeros). Do check copy_struct_from_user()'s documentation,
it's great.
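For illustration, here's a userspace model of the semantics that
documentation describes (a sketch of the behavior, not the kernel code):

```python
def copy_struct_from_user(ksize, ubuf):
    # Model of the kernel helper: a shorter user struct gets its kernel-side
    # tail zero-filled; a longer user struct is accepted only if the bytes
    # the kernel doesn't know about are all zero (otherwise -E2BIG)
    usize = len(ubuf)
    if usize <= ksize:
        return ubuf + b"\x00" * (ksize - usize)
    if any(ubuf[ksize:]):
        raise OverflowError("E2BIG: user set fields this kernel doesn't know")
    return ubuf[:ksize]
```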

> > +     if (err)
> > +             return err;
> > +
> > +     if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > +             return -EINVAL;
> > +     if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > +             return -EINVAL;
> > +     if (!!karg.build_id_size != !!karg.build_id_addr)
> > +             return -EINVAL;
>
> So you want values to be set, right?

Either both should be set, or neither. It's OK for both the size and
addr fields to be zero, in which case it indicates that the user
doesn't want this piece of information (which is usually a bit more
expensive to get and might not be necessary in all cases).

>
> > +
> > +     mm = priv->mm;
> > +     if (!mm || !mmget_not_zero(mm))
> > +             return -ESRCH;
>
> What is this error for?  Where is this documentned?

I copied it from existing /proc/<pid>/maps checks. I presume it guards
the case where the mm might already have been put, i.e., the process is
gone but we still have the /proc/<pid>/maps file open.

>
> > +     if (mmap_read_lock_killable(mm)) {
> > +             mmput(mm);
> > +             return -EINTR;
> > +     }
> > +
> > +     vma_iter_init(&iter, mm, karg.query_addr);
> > +     vma = vma_next(&iter);
> > +     if (!vma) {
> > +             err = -ENOENT;
> > +             goto out;
> > +     }
> > +     /* user wants covering VMA, not the closest next one */
> > +     if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > +         vma->vm_start > karg.query_addr) {
> > +             err = -ENOENT;
> > +             goto out;
> > +     }
> > +
> > +     karg.vma_start = vma->vm_start;
> > +     karg.vma_end = vma->vm_end;
> > +
> > +     if (vma->vm_file) {
> > +             const struct inode *inode = file_user_inode(vma->vm_file);
> > +
> > +             karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > +             karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > +             karg.dev_minor = MINOR(inode->i_sb->s_dev);
>
> So the major/minor is that of the file superblock?  Why?

Because an inode number is unique only within a given super block (and
even then it's more complicated; btrfs subvolumes add more headaches, I
believe). inode + dev maj/min is sometimes used as a key for
caching/reusing per-binary information (e.g., pre-processed DWARF
information, which is *very* expensive to produce, so anything that
allows avoiding redoing that work is helpful).
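A trivial sketch of that caching pattern (hypothetical helper, just to
illustrate why the (dev, inode) tuple matters as a key):

```python
cache = {}

def per_binary_info(dev_major, dev_minor, inode, build):
    # build() stands for the expensive step (e.g., pre-processing DWARF);
    # it runs once per unique backing file, however many VMAs map it
    key = (dev_major, dev_minor, inode)
    if key not in cache:
        cache[key] = build()
    return cache[key]
```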

>
> > +             karg.inode = inode->i_ino;
>
> What is userspace going to do with this?
>

See above.

> > +     } else {
> > +             karg.vma_offset = 0;
> > +             karg.dev_major = 0;
> > +             karg.dev_minor = 0;
> > +             karg.inode = 0;
>
> Why not set everything to 0 up above at the beginning so you never miss
> anything, and you don't miss any holes accidentally in the future.
>

Stylistic preference, I find this more explicit, but I don't care much
one way or another.

> > +     }
> > +
> > +     karg.vma_flags = 0;
> > +     if (vma->vm_flags & VM_READ)
> > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> > +     if (vma->vm_flags & VM_WRITE)
> > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> > +     if (vma->vm_flags & VM_EXEC)
> > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> > +     if (vma->vm_flags & VM_MAYSHARE)
> > +             karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> > +

[...]

> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 45e4e64fd664..fe8924a8d916 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -393,4 +393,36 @@ struct pm_scan_arg {
> >       __u64 return_mask;
> >  };
> >
> > +/* /proc/<pid>/maps ioctl */
> > +#define PROCFS_IOCTL_MAGIC 0x9f
>
> Don't you need to document this in the proper place?

I probably do, but I'm asking for help in figuring out where. procfs is
not a typical area of the kernel I work in, so any pointers are highly
appreciated.

>
> > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> > +
> > +enum procmap_query_flags {
> > +     PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> > +};
> > +
> > +enum procmap_vma_flags {
> > +     PROCFS_PROCMAP_VMA_READABLE = 0x01,
> > +     PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> > +     PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> > +     PROCFS_PROCMAP_VMA_SHARED = 0x08,
>
> Are these bits?  If so, please use the bit macro for it to make it
> obvious.
>

Yes, they are. When I tried BIT(1), it didn't compile (BIT() isn't
available to UAPI headers). I chose not to add any extra #includes to
this UAPI header, but I can figure out the necessary dependency and use
BIT(); I just didn't feel like BIT() adds much here, tbh.

> > +};
> > +
> > +struct procfs_procmap_query {
> > +     __u64 size;
> > +     __u64 query_flags;              /* in */
>
> Does this map to the procmap_vma_flags enum?  if so, please say so.

no, procmap_query_flags, and yes, I will

>
> > +     __u64 query_addr;               /* in */
> > +     __u64 vma_start;                /* out */
> > +     __u64 vma_end;                  /* out */
> > +     __u64 vma_flags;                /* out */
> > +     __u64 vma_offset;               /* out */
> > +     __u64 inode;                    /* out */
>
> What is the inode for, you have an inode for the file already, why give
> it another one?

This is inode of vma's backing file, same as /proc/<pid>/maps' file
column. What inode of file do I already have here? You mean of
/proc/<pid>/maps itself? It's useless for the intended purposes.

>
> > +     __u32 dev_major;                /* out */
> > +     __u32 dev_minor;                /* out */
>
> What is major/minor for?

This is the same information as emitted by /proc/<pid>/maps,
identifies superblock of vma's backing file. As I mentioned above, it
can be used for caching per-file (i.e., per-ELF binary) information
(for example).

>
> > +     __u32 vma_name_size;            /* in/out */
> > +     __u32 build_id_size;            /* in/out */
> > +     __u64 vma_name_addr;            /* in */
> > +     __u64 build_id_addr;            /* in */
>
> Why not document this all using kerneldoc above the structure?

Yes, sorry, I slacked a bit on adding this upfront. I knew we'd be
figuring out the best place and approach, and so wanted to avoid
documentation churn.

Would something like what we have for pm_scan_arg and the pagemap APIs
work? I see that added a few simple descriptions for the pm_scan_arg
struct, and there is Documentation/admin-guide/mm/pagemap.rst. Should I
add Documentation/admin-guide/mm/procmap.rst (the admin-guide part
feels off, though)? Anyway, I'm hoping for pointers on where all this
should be documented. Thank you!
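For what it's worth, userspace could mirror the quoted struct and ioctl
number along these lines (field order taken from the patch above; the
_IOWR() math follows the asm-generic/ioctl.h encoding; purely
illustrative and subject to change if the layout changes):

```python
import struct

# Mirror of struct procfs_procmap_query, field order as in the patch:
# size, query_flags, query_addr | vma_start, vma_end, vma_flags,
# vma_offset, inode | dev_major, dev_minor, vma_name_size, build_id_size
# | vma_name_addr, build_id_addr (all __u64 except the four __u32)
FMT = "=QQQQQQQQIIIIQQ"
SIZE = struct.calcsize(FMT)    # well under the PAGE_SIZE cap

def _IOWR(ioc_type, nr, size):
    # asm-generic/ioctl.h encoding: dir(2 bits) | size(14) | type(8) | nr(8)
    _IOC_WRITE, _IOC_READ = 1, 2
    return ((_IOC_READ | _IOC_WRITE) << 30) | (size << 16) | (ioc_type << 8) | nr

PROCFS_IOCTL_MAGIC = 0x9F
PROCFS_PROCMAP_QUERY = _IOWR(PROCFS_IOCTL_MAGIC, 1, SIZE)
```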

>
> anyway, I don't like ioctls, but there is a place for them, you just
> have to actually justify the use for them and not say "not efficient
> enough" as that normally isn't an issue overall.

I've written a demo tool in patch #5 which performs a real-world task:
mapping addresses to their VMAs (specifically calculating file offsets
and finding the vma_start + vma_end range to further access files from
/proc/<pid>/map_files/<start>-<end>). I did the implementation
faithfully, in the most optimal way for both APIs, for a "typical"
scenario (it's hard to specify what typical is, of course, too many
variables; the data was collected on a real server running a real
service, with 30 seconds of process-specific stack traces captured, if
I remember correctly). Doing exactly the same amount of work is ~35x
slower with /proc/<pid>/maps.

Take another process, another set of addresses, another anything, and
the numbers will be different, but I think it gives the right idea.

But I think we are over-pivoting on the text vs binary distinction
here. It's the more targeted querying of VMAs that's beneficial. It
allows applications to not cache anything and just re-query when doing
periodic or continuous profiling (where addresses come in not as one
batch, but as a sequence of batches spread out in time).

/proc/<pid>/maps, for all its usefulness, just can't provide this sort
of ability, as it wasn't designed to do that and targets different use
cases.

And then, the new ability to request a reliable build ID (it's not 100%
reliable today; I'm going to address that as a follow-up) is *crucial*
for some scenarios. In the mentioned Oculus use case, needing full
access to the underlying ELF binary just to get the build ID is frowned
upon, and for a good reason: the profiler only needs the build ID,
which is no secret and not sensitive information. This new (and binary,
yes) API allows adding that to the API without breaking any backwards
compatibility.

>
> thanks,
>
> greg k-h


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
@ 2024-05-04 18:37  6% Alexey Dobriyan
From: Alexey Dobriyan @ 2024-05-04 18:37 UTC (permalink / raw)
  To: gregkh
  Cc: andrii, linux-fsdevel, brauner, viro, akpm, gregkh, linux-mm,
	linux-kernel

Hi, Greg.

We've discussed this earlier.

Breaking news: /proc is slow, /sys too. Always have been.

Each /sys file is kind of fast, but there are so many files that
lookups eat all the runtime.

/proc files are bigger and thus slower. There is no way to filter
information.

If someone posted /proc today and said it is 20-50-100 times slower
(which is true) than existing interfaces, linux-kernel would not even
laugh at them.

> slow in what way?

open/read/close is slow compared to an equivalent interface not
involving file descriptors and textual processing.

> Text apis are good as everyone can handle them,

Text APIs provoke inefficient software:

Any noob can write

	for name in name_list:
	    with open(f'/sys/kernel/slab/{name}/order') as f:
	        slab_order = int(f.read().split()[0])

See the problem? It's inefficient.
No open("/sys", O_DIRECTORY|O_PATH);
No openat(sys_fd, "kernel/slab", O_DIRECTORY|O_PATH);
No openat(sys_kernel_slab, buf, O_RDONLY);

buf is probably allocated dynamically many times (it's Python, after
all). buf is longer than necessary. The pathname buffer won't be reused
for the result.

.split() conses a list, only to discard everything but the first element.

Internally, sysfs allocates 1 page, instead of putting 1 byte somewhere
in userspace memory. /proc too.

Lookup is done every time (I don't think sysfs caches dentries in the
dcache, but I may be mistaken; if not, lookup is even slower).

Multiply by the many times monitoring daemons run this (potentially
disturbing other tasks).
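The efficient version looks roughly like this (a sketch; a temp
directory stands in for /sys/kernel/slab so it runs anywhere, since the
real path needs a SLUB kernel and the access pattern is what matters):

```python
import os
import tempfile

def read_small_int(dir_fd, rel_path, buf):
    # openat()-style lookup relative to an already-resolved directory fd,
    # reading into a caller-provided reusable buffer
    fd = os.open(rel_path, os.O_RDONLY, dir_fd=dir_fd)
    try:
        n = os.readv(fd, [buf])
    finally:
        os.close(fd)
    return int(bytes(buf[:n]).split()[0])

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "order"), "w") as f:
        f.write("3\n")
    dfd = os.open(d, os.O_DIRECTORY | os.O_PATH)   # resolved once, reused
    buf = bytearray(64)                            # allocated once, reused
    order = read_small_int(dfd, "order", buf)
    os.close(dfd)
```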

> ioctls are harder for obvious reasons.

What? ioctl are hard now?

Text APIs are garbage. If it's some crap in debugfs then no one cares.
But /proc/*/maps is not in debugfs.

Specifically on /proc/*/maps:

* _very_ well written software knows that unescaping needs to be done on the pathname

* (deleted) and (unreachable) junk: readlink and /proc/*/maps don't have
  space for flags giving unambiguous deleted/unreachable status which
  doesn't eat into the pathname -- whoops


> I don't understand, is this a bug in the current files?  If so, why not
> just fix that up?

open/read DO NOT accept file-specific flags, they are dumb like that.

In theory /proc/*/maps _could_ accept

	pread(fd, buf, sizeof(buf), addr);

and return data for the VMA containing "addr", but it can't, because
"addr" is an offset into the textual file. Such an offset is not
interesting at all.

> And again "efficient" need to be quantified.

	* roll eyes *

> Some people find text easier to handle for programmatic use :)

Some people should be barred from writing software by Programming Supreme Court
or something like that.


* Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps
  2024-05-04 11:24  7% ` [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Christian Brauner
@ 2024-05-04 15:33  7%   ` Greg KH
  2024-05-04 21:50  7%     ` Andrii Nakryiko
  2024-05-04 21:50 14%   ` Andrii Nakryiko
From: Greg KH @ 2024-05-04 15:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrii Nakryiko, linux-fsdevel, viro, akpm, linux-kernel, bpf, linux-mm

On Sat, May 04, 2024 at 01:24:23PM +0200, Christian Brauner wrote:
> On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> > 
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
> 
> I don't have anything against adding a binary interface for this. But
> it's somewhat odd to do ioctls based on /proc files. I wonder if there
> isn't a more suitable place for this. prctl()? New vmstat() system call
> using a pidfd/pid as reference? ioctl() on fs/pidfs.c?

See my objection to the ioctl api in the patch review itself.

Also, as this is a new user/kernel api, it needs loads of documentation
(there was none), and probably also cc: linux-api, right?

thanks,

greg k-h


* Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
  2024-05-04  0:30  8% ` [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko
@ 2024-05-04 15:32  0%   ` Greg KH
  2024-05-04 22:13  0%     ` Andrii Nakryiko
From: Greg KH @ 2024-05-04 15:32 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, linux-mm

On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> I also did an strace run of both cases. In text-based one the tool did
> 68 read() syscalls, fetching up to 4KB of data in one go.

Why not fetch more at once?

And I have a fun 'readfile()' syscall implementation around here that
needs justification to get merged (I try so every other year or so) that
can do the open/read/close loop in one call, with the buffer size set by
userspace if you really are saying this is a "hot path" that needs that
kind of speedup.  But in the end, io_uring usually is the proper api for
that instead, why not use that here instead of slow open/read/close if
you care about speed?

> In comparison,
> ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> relevant VMAs.
> 
> It is projected that savings from processing big production applications
> would only widen the gap in favor of binary-based querying ioctl API, as
> bigger applications will tend to have even more non-executable VMA
> mappings relative to executable ones.

Define "bigger applications" please.  Is this some "large database
company workload" type of thing, or something else?

thanks,

greg k-h


* Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-04  0:30  9% ` [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
@ 2024-05-04 15:28  6%   ` Greg KH
  2024-05-04 21:50  9%     ` Andrii Nakryiko
  2024-05-04 23:36  9%   ` kernel test robot
  2024-05-07 18:10  7%   ` Liam R. Howlett
From: Greg KH @ 2024-05-04 15:28 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-fsdevel, brauner, viro, akpm, linux-kernel, bpf, linux-mm

On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> /proc/<pid>/maps file is extremely useful in practice for various tasks
> involving figuring out process memory layout, what files are backing any
> given memory range, etc. One important class of applications that
> absolutely rely on this are profilers/stack symbolizers. They would
> normally capture stack trace containing absolute memory addresses of
> some functions, and would then use the /proc/<pid>/maps file to find
> corresponding backing ELF files, file offsets within them, and then
> continue from there to get yet more information (ELF symbols, DWARF
> information) to get human-readable symbolic information.
> 
> As such, there are both performance and correctness requirements
> involved. This address-to-VMA information translation has to be done as
> efficiently as possible, but also not miss any VMA (especially in the
> case of loading/unloading shared libraries).
> 
> Unfortunately, for all the /proc/<pid>/maps file universality and
> usefulness, it doesn't fit the above 100%.

Is this a new change or has it always been this way?

> First, it's text based, which makes its programmatic use from
> applications and libraries unnecessarily cumbersome and slow due to the
> need to do text parsing to get necessary pieces of information.

slow in what way?  How has it never been noticed before as a problem?

And exact numbers are appreciated please, yes open/read/close seems
slower than open/ioctl/close, but is it really overall an issue in the
real world for anything?

Text apis are good as everyone can handle them, ioctls are harder for
obvious reasons.

> Second, its main purpose is to emit all VMAs sequentially, but in
> practice captured addresses would fall only into a small subset of all
> process' VMAs, mainly containing executable text. Yet, library would
> need to parse most or all of the contents to find needed VMAs, as there
> is no way to skip VMAs that are of no use. Efficient library can do the
> linear pass and it is still relatively efficient, but it's definitely an
> overhead that can be avoided, if there was a way to do more targeted
> querying of the relevant VMA information.

I don't understand, is this a bug in the current files?  If so, why not
just fix that up?

And again "efficient" need to be quantified.

> Another problem when writing generic stack trace symbolization library
> is an unfortunate performance-vs-correctness tradeoff that needs to be
> made.

What requirement has caused a "generic stack trace symbolization
library" to be needed at all?  What is the problem you are trying to
solve that is not already solved by existing tools?

> Library has to make a decision to either cache parsed contents of
> /proc/<pid>/maps to service future requests (if application requests to
> symbolize another set of addresses, captured at some later time, which
> is typical for periodic/continuous profiling cases) to avoid the higher
> cost of needing to re-parse this file, or to re-read and re-parse its
> contents on every request. In the former case, more memory is used for
> the cache and there is a risk of getting stale data if application
> loaded/unloaded shared libraries, or otherwise changed its set of VMAs
> through additional mmap() calls (and other means of altering memory
> address space). In the latter case, it's the performance hit that comes
> from re-opening the file and re-reading/re-parsing its contents all over
> again.

Again, "performance hit" needs to be justified; it shouldn't be much
overall.

> This patch aims to solve this problem by providing a new API built on
> top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
> interface, avoiding the cost and awkwardness of textual representation
> for programmatic use.

Some people find text easier to handle for programmatic use :)

> It's designed to be extensible and
> forward/backward compatible by including a user-specified field size
> and using the copy_struct_from_user() approach. But, most importantly,
> it allows point queries for a specific single address, specified by
> the user. And this is done efficiently using the VMA iterator.

Ok, maybe this is the main issue, you only want one at a time?

> The user has a choice to either get the VMA that covers the provided
> address, or -ENOENT if none is found (the exact, least surprising,
> case). Or, with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA),
> they can get either the VMA that covers the address (if there is one),
> or the closest next VMA (i.e., the VMA with the smallest vm_start >
> addr). The latter allows more efficient use but, given it could be
> surprising behavior, requires an explicit opt-in.
> 
> Basing this ioctl()-based API on top of the /proc/<pid>/maps FD makes
> sense given it's querying the same set of VMA data. All the permission
> checks performed on /proc/<pid>/maps opening fit here as well. The
> ioctl-based implementation fetches the remembered mm_struct reference,
> but otherwise doesn't interfere with the seq_file-based implementation
> of the /proc/<pid>/maps textual interface, and so the two can be used
> together or independently without paying any price for that.
> 
> There is one extra thing that /proc/<pid>/maps doesn't currently
> provide, and that's the ability to fetch the ELF build ID, if present.
> The user controls whether this piece of information is requested by
> setting the build_id_size field to either zero or the non-zero maximum
> size of the buffer they provide through the build_id_addr field (which
> encodes a user pointer as a __u64 field).
> 
> The need to get the ELF build ID reliably is an important aspect when
> dealing with profiling and stack trace symbolization, and the
> /proc/<pid>/maps textual representation doesn't help with this,
> requiring applications to open the underlying ELF binary through the
> /proc/<pid>/map_files/<start>-<end> symlink, which adds extra
> permission implications due to giving full access to the binary of
> (potentially) another process, while all the application is interested
> in is the build ID. The ability to request just the build ID doesn't
> introduce any additional security concerns on top of those
> /proc/<pid>/maps is already concerned with, simplifying the overall
> logic.
> 
> The kernel already implements build ID fetching, which is used by the
> BPF subsystem. We are reusing this code here, but plan follow-up
> changes to make it work better under the more relaxed assumption
> (compared to what the existing code assumes) of being called from user
> process context, in which page faults are allowed. The BPF-specific
> implementation currently bails out if the necessary part of the ELF
> file is not paged in, all due to extra BPF-specific restrictions (like
> the need to fetch the build ID in restrictive contexts such as NMI
> handlers).
> 
> Note also that fetching the VMA name (e.g., backing file path, or
> special hard-coded or user-provided names) is optional, just like the
> build ID. If the user sets vma_name_size to zero, the kernel won't
> attempt to retrieve it, saving resources.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Where is the userspace code that uses this new api you have created?

> ---
>  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fs.h |  32 ++++++++
>  2 files changed, 197 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 8e503a1635b7..cb7b1ff1a144 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -22,6 +22,7 @@
>  #include <linux/pkeys.h>
>  #include <linux/minmax.h>
>  #include <linux/overflow.h>
> +#include <linux/buildid.h>
>  
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
>  	return do_maps_open(inode, file, &proc_pid_maps_op);
>  }
>  
> +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> +{
> +	struct procfs_procmap_query karg;
> +	struct vma_iterator iter;
> +	struct vm_area_struct *vma;
> +	struct mm_struct *mm;
> +	const char *name = NULL;
> +	char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> +	__u64 usize;
> +	int err;
> +
> +	if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> +		return -EFAULT;
> +	if (usize > PAGE_SIZE)

Nice, where did you document that?  And how is that portable given that
PAGE_SIZE can be different on different systems?

And why aren't you checking the actual structure size instead?  You can
easily run off the end here without knowing it.

> +		return -E2BIG;
> +	if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> +		return -EINVAL;

Ok, so you have two checks?  How can the first one ever fail?


> +	err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> +	if (err)
> +		return err;
> +
> +	if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> +		return -EINVAL;
> +	if (!!karg.vma_name_size != !!karg.vma_name_addr)
> +		return -EINVAL;
> +	if (!!karg.build_id_size != !!karg.build_id_addr)
> +		return -EINVAL;

So you want values to be set, right?

> +
> +	mm = priv->mm;
> +	if (!mm || !mmget_not_zero(mm))
> +		return -ESRCH;

What is this error for?  Where is this documented?

> +	if (mmap_read_lock_killable(mm)) {
> +		mmput(mm);
> +		return -EINTR;
> +	}
> +
> +	vma_iter_init(&iter, mm, karg.query_addr);
> +	vma = vma_next(&iter);
> +	if (!vma) {
> +		err = -ENOENT;
> +		goto out;
> +	}
> +	/* user wants covering VMA, not the closest next one */
> +	if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> +	    vma->vm_start > karg.query_addr) {
> +		err = -ENOENT;
> +		goto out;
> +	}
> +
> +	karg.vma_start = vma->vm_start;
> +	karg.vma_end = vma->vm_end;
> +
> +	if (vma->vm_file) {
> +		const struct inode *inode = file_user_inode(vma->vm_file);
> +
> +		karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> +		karg.dev_major = MAJOR(inode->i_sb->s_dev);
> +		karg.dev_minor = MINOR(inode->i_sb->s_dev);

So the major/minor is that of the file superblock?  Why?

> +		karg.inode = inode->i_ino;

What is userspace going to do with this?

> +	} else {
> +		karg.vma_offset = 0;
> +		karg.dev_major = 0;
> +		karg.dev_minor = 0;
> +		karg.inode = 0;

Why not set everything to 0 up above at the beginning, so you never miss
anything and don't accidentally leave any holes in the future?

> +	}
> +
> +	karg.vma_flags = 0;
> +	if (vma->vm_flags & VM_READ)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> +	if (vma->vm_flags & VM_WRITE)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> +	if (vma->vm_flags & VM_EXEC)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> +	if (vma->vm_flags & VM_MAYSHARE)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> +
> +	if (karg.build_id_size) {
> +		__u32 build_id_sz = BUILD_ID_SIZE_MAX;
> +
> +		err = build_id_parse(vma, build_id_buf, &build_id_sz);
> +		if (!err) {
> +			if (karg.build_id_size < build_id_sz) {
> +				err = -ENAMETOOLONG;
> +				goto out;
> +			}
> +			karg.build_id_size = build_id_sz;
> +		}
> +	}
> +
> +	if (karg.vma_name_size) {
> +		size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
> +		const struct path *path;
> +		const char *name_fmt;
> +		size_t name_sz = 0;
> +
> +		get_vma_name(vma, &path, &name, &name_fmt);
> +
> +		if (path || name_fmt || name) {
> +			name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
> +			if (!name_buf) {
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +		}
> +		if (path) {
> +			name = d_path(path, name_buf, name_buf_sz);
> +			if (IS_ERR(name)) {
> +				err = PTR_ERR(name);
> +				goto out;
> +			}
> +			name_sz = name_buf + name_buf_sz - name;
> +		} else if (name || name_fmt) {
> +			name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
> +			name = name_buf;
> +		}
> +		if (name_sz > name_buf_sz) {
> +			err = -ENAMETOOLONG;
> +			goto out;
> +		}
> +		karg.vma_name_size = name_sz;
> +	}
> +
> +	/* unlock and put mm_struct before copying data to user */
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +
> +	if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
> +					       name, karg.vma_name_size)) {
> +		kfree(name_buf);
> +		return -EFAULT;
> +	}
> +	kfree(name_buf);
> +
> +	if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
> +					       build_id_buf, karg.build_id_size))
> +		return -EFAULT;
> +
> +	if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
> +		return -EFAULT;
> +
> +	return 0;
> +
> +out:
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +	kfree(name_buf);
> +	return err;
> +}
> +
> +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	struct seq_file *seq = file->private_data;
> +	struct proc_maps_private *priv = seq->private;
> +
> +	switch (cmd) {
> +	case PROCFS_PROCMAP_QUERY:
> +		return do_procmap_query(priv, (void __user *)arg);
> +	default:
> +		return -ENOIOCTLCMD;
> +	}
> +}
> +
>  const struct file_operations proc_pid_maps_operations = {
>  	.open		= pid_maps_open,
>  	.read		= seq_read,
>  	.llseek		= seq_lseek,
>  	.release	= proc_map_release,
> +	.unlocked_ioctl = procfs_procmap_ioctl,
> +	.compat_ioctl	= procfs_procmap_ioctl,
>  };
>  
>  /*
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 45e4e64fd664..fe8924a8d916 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -393,4 +393,36 @@ struct pm_scan_arg {
>  	__u64 return_mask;
>  };
>  
> +/* /proc/<pid>/maps ioctl */
> +#define PROCFS_IOCTL_MAGIC 0x9f

Don't you need to document this in the proper place?

> +#define PROCFS_PROCMAP_QUERY	_IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> +
> +enum procmap_query_flags {
> +	PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> +};
> +
> +enum procmap_vma_flags {
> +	PROCFS_PROCMAP_VMA_READABLE = 0x01,
> +	PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> +	PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> +	PROCFS_PROCMAP_VMA_SHARED = 0x08,

Are these bits?  If so, please use the bit macro for it to make it
obvious.

> +};
> +
> +struct procfs_procmap_query {
> +	__u64 size;
> +	__u64 query_flags;		/* in */

Does this map to the procmap_query_flags enum?  If so, please say so.

> +	__u64 query_addr;		/* in */
> +	__u64 vma_start;		/* out */
> +	__u64 vma_end;			/* out */
> +	__u64 vma_flags;		/* out */
> +	__u64 vma_offset;		/* out */
> +	__u64 inode;			/* out */

What is the inode for?  You have an inode for the file already; why
give it another one?

> +	__u32 dev_major;		/* out */
> +	__u32 dev_minor;		/* out */

What is major/minor for?

> +	__u32 vma_name_size;		/* in/out */
> +	__u32 build_id_size;		/* in/out */
> +	__u64 vma_name_addr;		/* in */
> +	__u64 build_id_addr;		/* in */

Why not document this all using kerneldoc above the structure?

Anyway, I don't like ioctls, but there is a place for them; you just
have to actually justify their use and not say "not efficient enough",
as that normally isn't an issue overall.

thanks,

greg k-h

^ permalink raw reply	[relevance 6%]

* Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps
  2024-05-04  0:30 13% [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
  2024-05-04  0:30  9% ` [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
  2024-05-04  0:30  8% ` [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko
@ 2024-05-04 11:24  7% ` Christian Brauner
  2024-05-04 15:33  7%   ` Greg KH
  2024-05-04 21:50 14%   ` Andrii Nakryiko
  2024-05-05  5:26  6% ` Ian Rogers
  3 siblings, 2 replies; 200+ results
From: Christian Brauner @ 2024-05-04 11:24 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-fsdevel, viro, akpm, linux-kernel, bpf, gregkh, linux-mm

On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> applications to query VMA information more efficiently than through textual
> processing of /proc/<pid>/maps contents. See patch #2 for the context,
> justification, and nuances of the API design.
> 
> Patch #1 is a refactoring to keep VMA name logic determination in one place.
> Patch #2 is the meat of kernel-side API.
> Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> optionally use this new ioctl()-based API, if supported.
> Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> both textual and binary interfaces) and allows benchmarking them. Patch itself
> also has performance numbers of a test based on one of the medium-sized
> internal applications taken from production.

I don't have anything against adding a binary interface for this. But
it's somewhat odd to do ioctls based on /proc files. I wonder if there
isn't a more suitable place for this. prctl()? New vmstat() system call
using a pidfd/pid as reference? ioctl() on fs/pidfs.c?

^ permalink raw reply	[relevance 7%]

* [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
  2024-05-04  0:30 13% [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
  2024-05-04  0:30  9% ` [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
@ 2024-05-04  0:30  8% ` Andrii Nakryiko
  2024-05-04 15:32  0%   ` Greg KH
  2024-05-04 11:24  7% ` [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Christian Brauner
  2024-05-05  5:26  6% ` Ian Rogers
  3 siblings, 1 reply; 200+ results
From: Andrii Nakryiko @ 2024-05-04  0:30 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, Andrii Nakryiko

Implement a simple tool/benchmark for comparing address "resolution"
logic based on textual /proc/<pid>/maps interface and new binary
ioctl-based PROCFS_PROCMAP_QUERY command.

The tool expects a file with a list of hex addresses and the relevant
PID, and provides control over whether the textual or binary
ioctl-based way of processing VMAs should be used.

The overall logic implements an efficient way to do batched processing
of a given set of (unsorted) addresses. We first sort them in
increasing order (remembering their original positions to restore the
original order, if necessary), and then process all VMAs from
/proc/<pid>/maps, matching addresses to VMAs and calculating file
offsets when matched. For the ioctl-based approach the idea is similar,
but it is implemented even more efficiently, requesting only the VMAs
that cover the given addresses and skipping all the irrelevant VMAs
altogether.

To be able to compare the efficiency of both APIs, the tool has a
"benchmark" mode. The user provides a number of processing runs to
execute in a tight loop, timing only the /proc/<pid>/maps parsing and
processing parts of the logic. Address sorting and re-sorting is
excluded. This gives a more direct way to compare the ioctl- vs.
text-based APIs.

We used a medium-sized production application to do a representative
benchmark. A bunch of stack traces were captured, resulting in 4435
user space addresses (699 unique ones, but we didn't deduplicate them).
The application itself had 702 VMAs reported in /proc/<pid>/maps.

Averaging the time taken to process all addresses over 10000 runs
showed that:
  - the text-based approach took 380 microseconds *per one batch run*;
  - the ioctl-based approach took 10 microseconds *per identical batch run*.

This gives about a ~35x speed up for doing exactly the same amount of
work (build IDs were not fetched for the ioctl-based benchmark;
fetching build IDs resulted in a 2x slowdown compared to the
no-build-ID case).

I also did an strace run of both cases. In the text-based one the tool
did 68 read() syscalls, fetching up to 4KB of data in one go. In
comparison, the ioctl-based implementation had to do only 6 ioctl()
calls to fetch all relevant VMAs.

It is projected that the savings from processing big production
applications would only widen the gap in favor of the binary querying
ioctl API, as bigger applications tend to have even more non-executable
VMA mappings relative to executable ones.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 tools/testing/selftests/bpf/.gitignore     |   1 +
 tools/testing/selftests/bpf/Makefile       |   2 +-
 tools/testing/selftests/bpf/procfs_query.c | 366 +++++++++++++++++++++
 3 files changed, 368 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/procfs_query.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index f1aebabfb017..7eaa8f417278 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -45,6 +45,7 @@ test_cpp
 /veristat
 /sign-file
 /uprobe_multi
+/procfs_query
 *.ko
 *.tmp
 xskxceiver
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index ba28d42b74db..07e17bb89767 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -131,7 +131,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
 	test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
 	xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
-	xdp_features bpf_test_no_cfi.ko
+	xdp_features bpf_test_no_cfi.ko procfs_query
 
 TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
 
diff --git a/tools/testing/selftests/bpf/procfs_query.c b/tools/testing/selftests/bpf/procfs_query.c
new file mode 100644
index 000000000000..8ca3978244ad
--- /dev/null
+++ b/tools/testing/selftests/bpf/procfs_query.c
@@ -0,0 +1,366 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+#include <argp.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <time.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <time.h>
+
+static bool verbose;
+static bool quiet;
+static bool use_ioctl;
+static bool request_build_id;
+static char *addrs_path;
+static int pid;
+static int bench_runs;
+
+const char *argp_program_version = "procfs_query 0.0";
+const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
+
+static inline uint64_t get_time_ns(void)
+{
+	struct timespec t;
+
+	clock_gettime(CLOCK_MONOTONIC, &t);
+
+	return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
+}
+
+static const struct argp_option opts[] = {
+	{ "verbose", 'v', NULL, 0, "Verbose mode" },
+	{ "quiet", 'q', NULL, 0, "Quiet mode (no output)" },
+	{ "pid", 'p', "PID", 0, "PID of the process" },
+	{ "addrs-path", 'f', "PATH", 0, "File with addresses to resolve" },
+	{ "benchmark", 'B', "RUNS", 0, "Benchmark mode" },
+	{ "query", 'Q', NULL, 0, "Use ioctl()-based point query API (by default text parsing is done)" },
+	{ "build-id", 'b', NULL, 0, "Fetch build ID, if available (only for ioctl mode)" },
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case 'v':
+		verbose = true;
+		break;
+	case 'q':
+		quiet = true;
+		break;
+	case 'Q':
+		use_ioctl = true;
+		break;
+	case 'b':
+		request_build_id = true;
+		break;
+	case 'p':
+		pid = strtol(arg, NULL, 10);
+		break;
+	case 'f':
+		addrs_path = strdup(arg);
+		break;
+	case 'B':
+		bench_runs = strtol(arg, NULL, 10);
+		if (bench_runs <= 0) {
+			fprintf(stderr, "Invalid benchmark run count: %s\n", arg);
+			return -EINVAL;
+		}
+		break;
+	case ARGP_KEY_ARG:
+		argp_usage(state);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static const struct argp argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+struct addr {
+	unsigned long long addr;
+	int idx;
+};
+
+static struct addr *addrs;
+static size_t addr_cnt, addr_cap;
+
+struct resolved_addr {
+	unsigned long long file_off;
+	const char *vma_name;
+	int build_id_sz;
+	char build_id[20];
+};
+
+static struct resolved_addr *resolved;
+
+static int resolve_addrs_ioctl(void)
+{
+	char buf[32], build_id_buf[20], vma_name[PATH_MAX];
+	struct procfs_procmap_query q;
+	int fd, err, i;
+	struct addr *a = &addrs[0];
+	struct resolved_addr *r;
+
+	snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+	fd = open(buf, O_RDONLY);
+	if (fd < 0) {
+		err = -errno;
+		fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+		return err;
+	}
+
+	memset(&q, 0, sizeof(q));
+	q.size = sizeof(q);
+	q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
+	q.vma_name_addr = (__u64)vma_name;
+	if (request_build_id)
+		q.build_id_addr = (__u64)build_id_buf;
+
+	for (i = 0; i < addr_cnt; ) {
+		char *name = NULL;
+
+		q.query_addr = (__u64)a->addr;
+		q.vma_name_size = sizeof(vma_name);
+		if (request_build_id)
+			q.build_id_size = sizeof(build_id_buf);
+
+		err = ioctl(fd, PROCFS_PROCMAP_QUERY, &q);
+		if (err < 0 && errno == ENOTTY) {
+			close(fd);
+			fprintf(stderr, "PROCFS_PROCMAP_QUERY ioctl() command is not supported on this kernel!\n");
+			return -EOPNOTSUPP; /* ioctl() not implemented yet */
+		}
+		if (err < 0 && errno == ENOENT) {
+			fprintf(stderr, "ENOENT\n");
+			i++;
+			a++;
+			continue; /* unresolved address */
+		}
+		if (err < 0) {
+			err = -errno;
+			close(fd);
+			fprintf(stderr, "PROCFS_PROCMAP_QUERY ioctl() returned error: %d\n", err);
+			return err;
+		}
+
+		/* skip addrs falling before current VMA */
+		for (; i < addr_cnt && a->addr < q.vma_start; i++, a++) {
+		}
+		/* process addrs covered by current VMA */
+		for (; i < addr_cnt && a->addr < q.vma_end; i++, a++) {
+			r = &resolved[a->idx];
+			r->file_off = a->addr - q.vma_start + q.vma_offset;
+
+			/* reuse name, if it was already strdup()'ed */
+			if (q.vma_name_size)
+				name = name ?: strdup(vma_name);
+			r->vma_name = name;
+
+			if (q.build_id_size) {
+				r->build_id_sz = q.build_id_size;
+				memcpy(r->build_id, build_id_buf, q.build_id_size);
+			}
+		}
+	}
+
+	close(fd);
+	return 0;
+}
+
+static int resolve_addrs_parse(void)
+{
+	size_t vma_start, vma_end, vma_offset, ino;
+	uint32_t dev_major, dev_minor;
+	char perms[4], buf[32], vma_name[PATH_MAX];
+	FILE *f;
+	int err, idx = 0;
+	struct addr *a = &addrs[idx];
+	struct resolved_addr *r;
+
+	snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+	f = fopen(buf, "r");
+	if (!f) {
+		err = -errno;
+		fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+		return err;
+	}
+
+	while ((err = fscanf(f, "%zx-%zx %c%c%c%c %zx %x:%x %zu %[^\n]\n",
+			     &vma_start, &vma_end,
+			     &perms[0], &perms[1], &perms[2], &perms[3],
+			     &vma_offset, &dev_major, &dev_minor, &ino, vma_name)) >= 10) {
+		const char *name = NULL;
+
+		/* skip addrs before current vma, they stay unresolved */
+		for (; idx < addr_cnt && a->addr < vma_start; idx++, a++) {
+		}
+
+		/* resolve all addrs within current vma now */
+		for (; idx < addr_cnt && a->addr < vma_end; idx++, a++) {
+			r = &resolved[a->idx];
+			r->file_off = a->addr - vma_start + vma_offset;
+
+			/* reuse name, if it was already strdup()'ed */
+			if (err > 10)
+				name = name ?: strdup(vma_name);
+			else
+				name = NULL;
+			r->vma_name = name;
+		}
+
+		/* ran out of addrs to resolve, stop early */
+		if (idx >= addr_cnt)
+			break;
+	}
+
+	fclose(f);
+	return 0;
+}
+
+static int cmp_by_addr(const void *a, const void *b)
+{
+	const struct addr *x = a, *y = b;
+
+	if (x->addr != y->addr)
+		return x->addr < y->addr ? -1 : 1;
+	return x->idx < y->idx ? -1 : 1;
+}
+
+static int cmp_by_idx(const void *a, const void *b)
+{
+	const struct addr *x = a, *y = b;
+
+	return x->idx < y->idx ? -1 : 1;
+}
+
+int main(int argc, char **argv)
+{
+	FILE* f;
+	int err, i;
+	unsigned long long addr;
+	uint64_t start_ns;
+	double total_ns;
+
+	/* Parse command line arguments */
+	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+	if (err)
+		return err;
+
+	if (pid <= 0 || !addrs_path) {
+		fprintf(stderr, "Please provide PID and file with addresses to process!\n");
+		exit(1);
+	}
+
+	if (verbose) {
+		fprintf(stderr, "PID: %d\n", pid);
+		fprintf(stderr, "PATH: %s\n", addrs_path);
+	}
+
+	f = fopen(addrs_path, "r");
+	if (!f) {
+		err = -errno;
+		fprintf(stderr, "Failed to open '%s': %d\n", addrs_path, err);
+		goto out;
+	}
+
+	while ((err = fscanf(f, "%llx\n", &addr)) == 1) {
+		if (addr_cnt == addr_cap) {
+			addr_cap = addr_cap == 0 ? 16 : (addr_cap * 3 / 2);
+			addrs = realloc(addrs, sizeof(*addrs) * addr_cap);
+			memset(addrs + addr_cnt, 0, (addr_cap - addr_cnt) * sizeof(*addrs));
+		}
+
+		addrs[addr_cnt].addr = addr;
+		addrs[addr_cnt].idx = addr_cnt;
+
+		addr_cnt++;
+	}
+	if (verbose)
+		fprintf(stderr, "READ %zu addrs!\n", addr_cnt);
+	if (!feof(f)) {
+		fprintf(stderr, "Failure parsing full list of addresses at '%s'!\n", addrs_path);
+		err = -EINVAL;
+		fclose(f);
+		goto out;
+	}
+	fclose(f);
+	if (addr_cnt == 0) {
+		fprintf(stderr, "No addresses provided, bailing out!\n");
+		err = -ENOENT;
+		goto out;
+	}
+
+	resolved = calloc(addr_cnt, sizeof(*resolved));
+
+	qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_addr);
+	if (verbose) {
+		fprintf(stderr, "SORTED ADDRS (%zu):\n", addr_cnt);
+		for (i = 0; i < addr_cnt; i++) {
+			fprintf(stderr, "ADDR #%d: %#llx\n", addrs[i].idx, addrs[i].addr);
+		}
+	}
+
+	start_ns = get_time_ns();
+	for (i = bench_runs ?: 1; i > 0; i--) {
+		if (use_ioctl) {
+			err = resolve_addrs_ioctl();
+		} else {
+			err = resolve_addrs_parse();
+		}
+		if (err) {
+			fprintf(stderr, "Failed to resolve addrs: %d!\n", err);
+			goto out;
+		}
+	}
+	total_ns = get_time_ns() - start_ns;
+
+	if (bench_runs) {
+		fprintf(stderr, "BENCHMARK MODE. RUNS: %d TOTAL TIME (ms): %.3lf TIME/RUN (ms): %.3lf TIME/ADDR (us): %.3lf\n",
+			bench_runs, total_ns / 1000000.0, total_ns / bench_runs / 1000000.0,
+			total_ns / bench_runs / addr_cnt / 1000.0);
+	}
+
+	/* sort them back into the original order */
+	qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_idx);
+
+	if (!quiet) {
+		printf("RESOLVED ADDRS (%zu):\n", addr_cnt);
+		for (i = 0; i < addr_cnt; i++) {
+			const struct addr *a = &addrs[i];
+			const struct resolved_addr *r = &resolved[a->idx];
+
+			if (r->file_off) {
+				printf("RESOLVED   #%d: %#llx -> OFF %#llx",
+					a->idx, a->addr, r->file_off);
+				if (r->vma_name)
+					printf(" NAME %s", r->vma_name);
+				if (r->build_id_sz) {
+					char build_id_str[41];
+					int j;
+
+					for (j = 0; j < r->build_id_sz; j++)
+						sprintf(&build_id_str[j * 2], "%02hhx", r->build_id[j]);
+					printf(" BUILDID %s", build_id_str);
+				}
+				printf("\n");
+			} else {
+				printf("UNRESOLVED #%d: %#llx\n", a->idx, a->addr);
+			}
+		}
+	}
+out:
+	free(addrs);
+	free(addrs_path);
+	free(resolved);
+
+	return err < 0 ? -err : 0;
+}
-- 
2.43.0


^ permalink raw reply related	[relevance 8%]

* [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  2024-05-04  0:30 13% [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
@ 2024-05-04  0:30  9% ` Andrii Nakryiko
  2024-05-04 15:28  6%   ` Greg KH
                     ` (2 more replies)
  2024-05-04  0:30  8% ` [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-04  0:30 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, Andrii Nakryiko

The /proc/<pid>/maps file is extremely useful in practice for various
tasks involving figuring out process memory layout, what files are
backing any given memory range, etc. One important class of
applications that absolutely relies on this is profilers/stack
symbolizers. They would normally capture a stack trace containing
absolute memory addresses of some functions, and would then use the
/proc/<pid>/maps file to find the corresponding backing ELF files and
file offsets within them, and then continue from there to get yet more
information (ELF symbols, DWARF information) to produce human-readable
symbolic information.

As such, there are both performance and correctness requirements
involved. This address-to-VMA information translation has to be done as
efficiently as possible, but must also not miss any VMA (especially in
the case of loading/unloading shared libraries).

Unfortunately, for all the /proc/<pid>/maps file's universality and
usefulness, it doesn't fit the above 100%.

First, it's text based, which makes its programmatic use from
applications and libraries unnecessarily cumbersome and slow due to the
need to do text parsing to get necessary pieces of information.

Second, its main purpose is to emit all VMAs sequentially, but in
practice captured addresses fall into only a small subset of all of a
process's VMAs, mainly those containing executable text. Yet, the
library would need to parse most or all of the contents to find the
needed VMAs, as there is no way to skip VMAs that are of no use. An
efficient library can do a single linear pass and still be relatively
fast, but it's definitely overhead that could be avoided if there were
a way to do more targeted querying of the relevant VMA information.

Another problem when writing a generic stack trace symbolization
library is an unfortunate performance-vs-correctness tradeoff that
needs to be made. The library has to decide whether to cache the parsed
contents of /proc/<pid>/maps to serve future requests (if the
application asks to symbolize another set of addresses captured at some
later time, which is typical for periodic/continuous profiling cases),
avoiding the cost of re-parsing the file, or to re-read the file on
every request. In the former case, more memory is used for the cache
and there is a risk of getting stale data if the application
loaded/unloaded shared libraries, or otherwise changed its set of VMAs
through additional mmap() calls (and other means of altering its memory
address space). In the latter case, there is the performance hit of
re-opening the file and re-reading/re-parsing its contents all over
again.

This patch aims to solve this problem by providing a new API built on
top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
interface, avoiding the cost and awkwardness of textual representation
for programmatic use. It's designed to be extensible and
forward/backward compatible by including a user-specified field size
and using the copy_struct_from_user() approach. But, most importantly,
it allows point queries for a specific single address, specified by the
user. And this is done efficiently using the VMA iterator.

The user has a choice to either get the VMA that covers the provided
address, or -ENOENT if none is found (the exact, least surprising,
case). Or, with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA),
they can get either the VMA that covers the address (if there is one),
or the closest next VMA (i.e., the VMA with the smallest vm_start >
addr). The latter allows more efficient use but, given it could be
surprising behavior, requires an explicit opt-in.

Basing this ioctl()-based API on top of the /proc/<pid>/maps FD makes
sense given it's querying the same set of VMA data. All the permission
checks performed on /proc/<pid>/maps opening fit here as well. The
ioctl-based implementation fetches the remembered mm_struct reference,
but otherwise doesn't interfere with the seq_file-based implementation
of the /proc/<pid>/maps textual interface, and so the two can be used
together or independently without paying any price for that.

There is one extra thing that /proc/<pid>/maps doesn't currently
provide: the ability to fetch the ELF build ID, if present. The user
controls whether this piece of information is requested by setting the
build_id_size field either to zero or to the non-zero maximum size of
the buffer they provide through the build_id_addr field (which encodes a
user pointer as a __u64 field).

The ability to get the ELF build ID reliably is an important aspect of
profiling and stack trace symbolization, and the /proc/<pid>/maps
textual representation doesn't help with this, requiring applications to
open the underlying ELF binary through the
/proc/<pid>/map_files/<start>-<end> symlink. That has extra permission
implications, as it gives full access to the binary from (potentially)
another process, while all the application is interested in is the build
ID. The ability to request just the build ID doesn't introduce any
additional security concerns on top of what /proc/<pid>/maps already
deals with, simplifying the overall logic.

The kernel already implements build ID fetching, which is used by the
BPF subsystem. We are reusing this code here, but plan follow-up changes
to make it work better under the more relaxed assumption (compared to
what the existing code assumes) of being called from user process
context, in which page faults are allowed. The BPF-specific
implementation currently bails out if the necessary part of the ELF file
is not paged in, due to extra BPF-specific restrictions (like the need
to fetch the build ID in restrictive contexts such as NMI handlers).

Note also that fetching the VMA name (e.g., the backing file path, or
special hard-coded or user-provided names) is optional, just like the
build ID. If the user sets vma_name_size to zero, the kernel won't
attempt to retrieve it, saving resources.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  32 ++++++++
 2 files changed, 197 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8e503a1635b7..cb7b1ff1a144 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -22,6 +22,7 @@
 #include <linux/pkeys.h>
 #include <linux/minmax.h>
 #include <linux/overflow.h>
+#include <linux/buildid.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
 	return do_maps_open(inode, file, &proc_pid_maps_op);
 }
 
+static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
+{
+	struct procfs_procmap_query karg;
+	struct vma_iterator iter;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	const char *name = NULL;
+	char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
+	__u64 usize;
+	int err;
+
+	if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
+		return -EFAULT;
+	if (usize > PAGE_SIZE)
+		return -E2BIG;
+	if (usize < offsetofend(struct procfs_procmap_query, query_addr))
+		return -EINVAL;
+	err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
+	if (err)
+		return err;
+
+	if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
+		return -EINVAL;
+	if (!!karg.vma_name_size != !!karg.vma_name_addr)
+		return -EINVAL;
+	if (!!karg.build_id_size != !!karg.build_id_addr)
+		return -EINVAL;
+
+	mm = priv->mm;
+	if (!mm || !mmget_not_zero(mm))
+		return -ESRCH;
+	if (mmap_read_lock_killable(mm)) {
+		mmput(mm);
+		return -EINTR;
+	}
+
+	vma_iter_init(&iter, mm, karg.query_addr);
+	vma = vma_next(&iter);
+	if (!vma) {
+		err = -ENOENT;
+		goto out;
+	}
+	/* user wants covering VMA, not the closest next one */
+	if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
+	    vma->vm_start > karg.query_addr) {
+		err = -ENOENT;
+		goto out;
+	}
+
+	karg.vma_start = vma->vm_start;
+	karg.vma_end = vma->vm_end;
+
+	if (vma->vm_file) {
+		const struct inode *inode = file_user_inode(vma->vm_file);
+
+		karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
+		karg.dev_major = MAJOR(inode->i_sb->s_dev);
+		karg.dev_minor = MINOR(inode->i_sb->s_dev);
+		karg.inode = inode->i_ino;
+	} else {
+		karg.vma_offset = 0;
+		karg.dev_major = 0;
+		karg.dev_minor = 0;
+		karg.inode = 0;
+	}
+
+	karg.vma_flags = 0;
+	if (vma->vm_flags & VM_READ)
+		karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
+	if (vma->vm_flags & VM_WRITE)
+		karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
+	if (vma->vm_flags & VM_EXEC)
+		karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
+	if (vma->vm_flags & VM_MAYSHARE)
+		karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
+
+	if (karg.build_id_size) {
+		__u32 build_id_sz = BUILD_ID_SIZE_MAX;
+
+		err = build_id_parse(vma, build_id_buf, &build_id_sz);
+		if (!err) {
+			if (karg.build_id_size < build_id_sz) {
+				err = -ENAMETOOLONG;
+				goto out;
+			}
+			karg.build_id_size = build_id_sz;
+		}
+	}
+
+	if (karg.vma_name_size) {
+		size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
+		const struct path *path;
+		const char *name_fmt;
+		size_t name_sz = 0;
+
+		get_vma_name(vma, &path, &name, &name_fmt);
+
+		if (path || name_fmt || name) {
+			name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
+			if (!name_buf) {
+				err = -ENOMEM;
+				goto out;
+			}
+		}
+		if (path) {
+			name = d_path(path, name_buf, name_buf_sz);
+			if (IS_ERR(name)) {
+				err = PTR_ERR(name);
+				goto out;
+			}
+			name_sz = name_buf + name_buf_sz - name;
+		} else if (name || name_fmt) {
+			name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
+			name = name_buf;
+		}
+		if (name_sz > name_buf_sz) {
+			err = -ENAMETOOLONG;
+			goto out;
+		}
+		karg.vma_name_size = name_sz;
+	}
+
+	/* unlock and put mm_struct before copying data to user */
+	mmap_read_unlock(mm);
+	mmput(mm);
+
+	if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
+					       name, karg.vma_name_size)) {
+		kfree(name_buf);
+		return -EFAULT;
+	}
+	kfree(name_buf);
+
+	if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
+					       build_id_buf, karg.build_id_size))
+		return -EFAULT;
+
+	if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
+		return -EFAULT;
+
+	return 0;
+
+out:
+	mmap_read_unlock(mm);
+	mmput(mm);
+	kfree(name_buf);
+	return err;
+}
+
+static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct seq_file *seq = file->private_data;
+	struct proc_maps_private *priv = seq->private;
+
+	switch (cmd) {
+	case PROCFS_PROCMAP_QUERY:
+		return do_procmap_query(priv, (void __user *)arg);
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
 const struct file_operations proc_pid_maps_operations = {
 	.open		= pid_maps_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= proc_map_release,
+	.unlocked_ioctl = procfs_procmap_ioctl,
+	.compat_ioctl	= procfs_procmap_ioctl,
 };
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 45e4e64fd664..fe8924a8d916 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -393,4 +393,36 @@ struct pm_scan_arg {
 	__u64 return_mask;
 };
 
+/* /proc/<pid>/maps ioctl */
+#define PROCFS_IOCTL_MAGIC 0x9f
+#define PROCFS_PROCMAP_QUERY	_IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
+
+enum procmap_query_flags {
+	PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
+};
+
+enum procmap_vma_flags {
+	PROCFS_PROCMAP_VMA_READABLE = 0x01,
+	PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
+	PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
+	PROCFS_PROCMAP_VMA_SHARED = 0x08,
+};
+
+struct procfs_procmap_query {
+	__u64 size;
+	__u64 query_flags;		/* in */
+	__u64 query_addr;		/* in */
+	__u64 vma_start;		/* out */
+	__u64 vma_end;			/* out */
+	__u64 vma_flags;		/* out */
+	__u64 vma_offset;		/* out */
+	__u64 inode;			/* out */
+	__u32 dev_major;		/* out */
+	__u32 dev_minor;		/* out */
+	__u32 vma_name_size;		/* in/out */
+	__u32 build_id_size;		/* in/out */
+	__u64 vma_name_addr;		/* in */
+	__u64 build_id_addr;		/* in */
+};
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.43.0



* [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps
@ 2024-05-04  0:30 13% Andrii Nakryiko
  2024-05-04  0:30  9% ` [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
                   ` (3 more replies)
  0 siblings, 4 replies; 200+ results
From: Andrii Nakryiko @ 2024-05-04  0:30 UTC (permalink / raw)
  To: linux-fsdevel, brauner, viro, akpm
  Cc: linux-kernel, bpf, gregkh, linux-mm, Andrii Nakryiko

Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
applications to query VMA information more efficiently than through textual
processing of /proc/<pid>/maps contents. See patch #2 for the context,
justification, and nuances of the API design.

Patch #1 is a refactoring to keep VMA name logic determination in one place.
Patch #2 is the meat of kernel-side API.
Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
optionally use this new ioctl()-based API, if supported.
Patch #5 implements a simple C tool to demonstrate the intended efficient use (for
both textual and binary interfaces) and allows benchmarking them. The patch
itself also includes performance numbers from a test based on one of the
medium-sized internal applications taken from production.

This patch set was based on top of the next-20240503 tag in the linux-next tree.
Not sure what the target tree for this should be; I'd appreciate any guidance,
thank you!

Andrii Nakryiko (5):
  fs/procfs: extract logic for getting VMA name constituents
  fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  tools: sync uapi/linux/fs.h header into tools subdir
  selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
  selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

 fs/proc/task_mmu.c                            | 290 +++++++++++---
 include/uapi/linux/fs.h                       |  32 ++
 .../perf/trace/beauty/include/uapi/linux/fs.h |  32 ++
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   2 +-
 tools/testing/selftests/bpf/procfs_query.c    | 366 ++++++++++++++++++
 tools/testing/selftests/bpf/test_progs.c      |   3 +
 tools/testing/selftests/bpf/test_progs.h      |   2 +
 tools/testing/selftests/bpf/trace_helpers.c   | 105 ++++-
 9 files changed, 763 insertions(+), 70 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/procfs_query.c

-- 
2.43.0



* [PATCH v7 13/16] cifs: Remove some code that's no longer used, part 1
  @ 2024-04-30 14:09  2% ` David Howells
  0 siblings, 0 replies; 200+ results
From: David Howells @ 2024-04-30 14:09 UTC (permalink / raw)
  To: Steve French
  Cc: David Howells, Jeff Layton, Matthew Wilcox, Paulo Alcantara,
	Shyam Prasad N, Tom Talpey, Christian Brauner, netfs, linux-cifs,
	linux-fsdevel, linux-mm, linux-kernel, Steve French,
	Shyam Prasad N, Rohith Surabattula

Remove some code that was #if'd out with the netfslib conversion.  This is
split into parts for file.c, as the diff generator otherwise produces a
hard-to-read diff for the part where a big chunk is cut out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/smb/client/cifsglob.h  |  12 -
 fs/smb/client/cifsproto.h |  25 --
 fs/smb/client/file.c      | 619 --------------------------------------
 fs/smb/client/fscache.c   | 111 -------
 fs/smb/client/fscache.h   |  58 ----
 5 files changed, 825 deletions(-)

diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h
index 983860bf5fbb..65574e69ba4f 100644
--- a/fs/smb/client/cifsglob.h
+++ b/fs/smb/client/cifsglob.h
@@ -1515,18 +1515,6 @@ struct cifs_io_subrequest {
 	struct smbd_mr			*mr;
 #endif
 	struct cifs_credits		credits;
-
-#if 0 // TODO: Remove following elements
-	struct list_head		list;
-	struct completion		done;
-	struct work_struct		work;
-	struct cifsFileInfo		*cfile;
-	struct address_space		*mapping;
-	struct cifs_aio_ctx		*ctx;
-	enum writeback_sync_modes	sync_mode;
-	bool				uncached;
-	struct bio_vec			*bv;
-#endif
 };
 
 /*
diff --git a/fs/smb/client/cifsproto.h b/fs/smb/client/cifsproto.h
index d46ad86150cd..c15bb5ee7eb7 100644
--- a/fs/smb/client/cifsproto.h
+++ b/fs/smb/client/cifsproto.h
@@ -601,36 +601,11 @@ void __cifs_put_smb_ses(struct cifs_ses *ses);
 extern struct cifs_ses *
 cifs_get_smb_ses(struct TCP_Server_Info *server, struct smb3_fs_context *ctx);
 
-#if 0 // TODO Remove
-void cifs_readdata_release(struct cifs_io_subrequest *rdata);
-static inline void cifs_get_readdata(struct cifs_io_subrequest *rdata)
-{
-	refcount_inc(&rdata->subreq.ref);
-}
-static inline void cifs_put_readdata(struct cifs_io_subrequest *rdata)
-{
-	if (refcount_dec_and_test(&rdata->subreq.ref))
-		cifs_readdata_release(rdata);
-}
-#endif
 int cifs_async_readv(struct cifs_io_subrequest *rdata);
 int cifs_readv_receive(struct TCP_Server_Info *server, struct mid_q_entry *mid);
 
 void cifs_async_writev(struct cifs_io_subrequest *wdata);
 void cifs_writev_complete(struct work_struct *work);
-#if 0 // TODO Remove
-struct cifs_io_subrequest *cifs_writedata_alloc(work_func_t complete);
-void cifs_writedata_release(struct cifs_io_subrequest *rdata);
-static inline void cifs_get_writedata(struct cifs_io_subrequest *wdata)
-{
-	refcount_inc(&wdata->subreq.ref);
-}
-static inline void cifs_put_writedata(struct cifs_io_subrequest *wdata)
-{
-	if (refcount_dec_and_test(&wdata->subreq.ref))
-		cifs_writedata_release(wdata);
-}
-#endif
 int cifs_query_mf_symlink(unsigned int xid, struct cifs_tcon *tcon,
 			  struct cifs_sb_info *cifs_sb,
 			  const unsigned char *path, char *pbuf,
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index bb3adde9343e..6088d0f80522 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -352,133 +352,6 @@ const struct netfs_request_ops cifs_req_ops = {
 	.issue_write		= cifs_issue_write,
 };
 
-#if 0 // TODO remove 397
-/*
- * Remove the dirty flags from a span of pages.
- */
-static void cifs_undirty_folios(struct inode *inode, loff_t start, unsigned int len)
-{
-	struct address_space *mapping = inode->i_mapping;
-	struct folio *folio;
-	pgoff_t end;
-
-	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
-
-	rcu_read_lock();
-
-	end = (start + len - 1) / PAGE_SIZE;
-	xas_for_each_marked(&xas, folio, end, PAGECACHE_TAG_DIRTY) {
-		if (xas_retry(&xas, folio))
-			continue;
-		xas_pause(&xas);
-		rcu_read_unlock();
-		folio_lock(folio);
-		folio_clear_dirty_for_io(folio);
-		folio_unlock(folio);
-		rcu_read_lock();
-	}
-
-	rcu_read_unlock();
-}
-
-/*
- * Completion of write to server.
- */
-void cifs_pages_written_back(struct inode *inode, loff_t start, unsigned int len)
-{
-	struct address_space *mapping = inode->i_mapping;
-	struct folio *folio;
-	pgoff_t end;
-
-	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
-
-	if (!len)
-		return;
-
-	rcu_read_lock();
-
-	end = (start + len - 1) / PAGE_SIZE;
-	xas_for_each(&xas, folio, end) {
-		if (xas_retry(&xas, folio))
-			continue;
-		if (!folio_test_writeback(folio)) {
-			WARN_ONCE(1, "bad %x @%llx page %lx %lx\n",
-				  len, start, folio->index, end);
-			continue;
-		}
-
-		folio_detach_private(folio);
-		folio_end_writeback(folio);
-	}
-
-	rcu_read_unlock();
-}
-
-/*
- * Failure of write to server.
- */
-void cifs_pages_write_failed(struct inode *inode, loff_t start, unsigned int len)
-{
-	struct address_space *mapping = inode->i_mapping;
-	struct folio *folio;
-	pgoff_t end;
-
-	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
-
-	if (!len)
-		return;
-
-	rcu_read_lock();
-
-	end = (start + len - 1) / PAGE_SIZE;
-	xas_for_each(&xas, folio, end) {
-		if (xas_retry(&xas, folio))
-			continue;
-		if (!folio_test_writeback(folio)) {
-			WARN_ONCE(1, "bad %x @%llx page %lx %lx\n",
-				  len, start, folio->index, end);
-			continue;
-		}
-
-		folio_set_error(folio);
-		folio_end_writeback(folio);
-	}
-
-	rcu_read_unlock();
-}
-
-/*
- * Redirty pages after a temporary failure.
- */
-void cifs_pages_write_redirty(struct inode *inode, loff_t start, unsigned int len)
-{
-	struct address_space *mapping = inode->i_mapping;
-	struct folio *folio;
-	pgoff_t end;
-
-	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
-
-	if (!len)
-		return;
-
-	rcu_read_lock();
-
-	end = (start + len - 1) / PAGE_SIZE;
-	xas_for_each(&xas, folio, end) {
-		if (!folio_test_writeback(folio)) {
-			WARN_ONCE(1, "bad %x @%llx page %lx %lx\n",
-				  len, start, folio->index, end);
-			continue;
-		}
-
-		filemap_dirty_folio(folio->mapping, folio);
-		folio_end_writeback(folio);
-	}
-
-	rcu_read_unlock();
-}
-#endif // end netfslib remove 397
-
 /*
  * Mark as invalid, all open files on tree connections since they
  * were closed when session to server was lost.
@@ -2540,92 +2413,6 @@ void cifs_write_subrequest_terminated(struct cifs_io_subrequest *wdata, ssize_t
 	netfs_write_subrequest_terminated(&wdata->subreq, result, was_async);
 }
 
-#if 0 // TODO remove 2483
-static ssize_t
-cifs_write(struct cifsFileInfo *open_file, __u32 pid, const char *write_data,
-	   size_t write_size, loff_t *offset)
-{
-	int rc = 0;
-	unsigned int bytes_written = 0;
-	unsigned int total_written;
-	struct cifs_tcon *tcon;
-	struct TCP_Server_Info *server;
-	unsigned int xid;
-	struct dentry *dentry = open_file->dentry;
-	struct cifsInodeInfo *cifsi = CIFS_I(d_inode(dentry));
-	struct cifs_io_parms io_parms = {0};
-
-	cifs_dbg(FYI, "write %zd bytes to offset %lld of %pd\n",
-		 write_size, *offset, dentry);
-
-	tcon = tlink_tcon(open_file->tlink);
-	server = tcon->ses->server;
-
-	if (!server->ops->sync_write)
-		return -ENOSYS;
-
-	xid = get_xid();
-
-	for (total_written = 0; write_size > total_written;
-	     total_written += bytes_written) {
-		rc = -EAGAIN;
-		while (rc == -EAGAIN) {
-			struct kvec iov[2];
-			unsigned int len;
-
-			if (open_file->invalidHandle) {
-				/* we could deadlock if we called
-				   filemap_fdatawait from here so tell
-				   reopen_file not to flush data to
-				   server now */
-				rc = cifs_reopen_file(open_file, false);
-				if (rc != 0)
-					break;
-			}
-
-			len = min(server->ops->wp_retry_size(d_inode(dentry)),
-				  (unsigned int)write_size - total_written);
-			/* iov[0] is reserved for smb header */
-			iov[1].iov_base = (char *)write_data + total_written;
-			iov[1].iov_len = len;
-			io_parms.pid = pid;
-			io_parms.tcon = tcon;
-			io_parms.offset = *offset;
-			io_parms.length = len;
-			rc = server->ops->sync_write(xid, &open_file->fid,
-					&io_parms, &bytes_written, iov, 1);
-		}
-		if (rc || (bytes_written == 0)) {
-			if (total_written)
-				break;
-			else {
-				free_xid(xid);
-				return rc;
-			}
-		} else {
-			spin_lock(&d_inode(dentry)->i_lock);
-			cifs_update_eof(cifsi, *offset, bytes_written);
-			spin_unlock(&d_inode(dentry)->i_lock);
-			*offset += bytes_written;
-		}
-	}
-
-	cifs_stats_bytes_written(tcon, total_written);
-
-	if (total_written > 0) {
-		spin_lock(&d_inode(dentry)->i_lock);
-		if (*offset > d_inode(dentry)->i_size) {
-			i_size_write(d_inode(dentry), *offset);
-			d_inode(dentry)->i_blocks = (512 - 1 + *offset) >> 9;
-		}
-		spin_unlock(&d_inode(dentry)->i_lock);
-	}
-	mark_inode_dirty_sync(d_inode(dentry));
-	free_xid(xid);
-	return total_written;
-}
-#endif // end netfslib remove 2483
-
 struct cifsFileInfo *find_readable_file(struct cifsInodeInfo *cifs_inode,
 					bool fsuid_only)
 {
@@ -4826,293 +4613,6 @@ int cifs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	return rc;
 }
 
-#if 0 // TODO remove 4794
-/*
- * Unlock a bunch of folios in the pagecache.
- */
-static void cifs_unlock_folios(struct address_space *mapping, pgoff_t first, pgoff_t last)
-{
-	struct folio *folio;
-	XA_STATE(xas, &mapping->i_pages, first);
-
-	rcu_read_lock();
-	xas_for_each(&xas, folio, last) {
-		folio_unlock(folio);
-	}
-	rcu_read_unlock();
-}
-
-static void cifs_readahead_complete(struct work_struct *work)
-{
-	struct cifs_io_subrequest *rdata = container_of(work,
-							struct cifs_io_subrequest, work);
-	struct folio *folio;
-	pgoff_t last;
-	bool good = rdata->result == 0 || (rdata->result == -EAGAIN && rdata->got_bytes);
-
-	XA_STATE(xas, &rdata->mapping->i_pages, rdata->subreq.start / PAGE_SIZE);
-
-	if (good)
-		cifs_readahead_to_fscache(rdata->mapping->host,
-					  rdata->subreq.start, rdata->subreq.len);
-
-	if (iov_iter_count(&rdata->subreq.io_iter) > 0)
-		iov_iter_zero(iov_iter_count(&rdata->subreq.io_iter), &rdata->subreq.io_iter);
-
-	last = (rdata->subreq.start + rdata->subreq.len - 1) / PAGE_SIZE;
-
-	rcu_read_lock();
-	xas_for_each(&xas, folio, last) {
-		if (good) {
-			flush_dcache_folio(folio);
-			folio_mark_uptodate(folio);
-		}
-		folio_unlock(folio);
-	}
-	rcu_read_unlock();
-
-	cifs_put_readdata(rdata);
-}
-
-static void cifs_readahead(struct readahead_control *ractl)
-{
-	struct cifsFileInfo *open_file = ractl->file->private_data;
-	struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(ractl->file);
-	struct TCP_Server_Info *server;
-	unsigned int xid, nr_pages, cache_nr_pages = 0;
-	unsigned int ra_pages;
-	pgoff_t next_cached = ULONG_MAX, ra_index;
-	bool caching = fscache_cookie_enabled(cifs_inode_cookie(ractl->mapping->host)) &&
-		cifs_inode_cookie(ractl->mapping->host)->cache_priv;
-	bool check_cache = caching;
-	pid_t pid;
-	int rc = 0;
-
-	/* Note that readahead_count() lags behind our dequeuing of pages from
-	 * the ractl, wo we have to keep track for ourselves.
-	 */
-	ra_pages = readahead_count(ractl);
-	ra_index = readahead_index(ractl);
-
-	xid = get_xid();
-
-	if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
-		pid = open_file->pid;
-	else
-		pid = current->tgid;
-
-	server = cifs_pick_channel(tlink_tcon(open_file->tlink)->ses);
-
-	cifs_dbg(FYI, "%s: file=%p mapping=%p num_pages=%u\n",
-		 __func__, ractl->file, ractl->mapping, ra_pages);
-
-	/*
-	 * Chop the readahead request up into rsize-sized read requests.
-	 */
-	while ((nr_pages = ra_pages)) {
-		unsigned int i;
-		struct cifs_io_subrequest *rdata;
-		struct cifs_credits credits_on_stack;
-		struct cifs_credits *credits = &credits_on_stack;
-		struct folio *folio;
-		pgoff_t fsize;
-		size_t rsize;
-
-		/*
-		 * Find out if we have anything cached in the range of
-		 * interest, and if so, where the next chunk of cached data is.
-		 */
-		if (caching) {
-			if (check_cache) {
-				rc = cifs_fscache_query_occupancy(
-					ractl->mapping->host, ra_index, nr_pages,
-					&next_cached, &cache_nr_pages);
-				if (rc < 0)
-					caching = false;
-				check_cache = false;
-			}
-
-			if (ra_index == next_cached) {
-				/*
-				 * TODO: Send a whole batch of pages to be read
-				 * by the cache.
-				 */
-				folio = readahead_folio(ractl);
-				fsize = folio_nr_pages(folio);
-				ra_pages -= fsize;
-				ra_index += fsize;
-				if (cifs_readpage_from_fscache(ractl->mapping->host,
-							       &folio->page) < 0) {
-					/*
-					 * TODO: Deal with cache read failure
-					 * here, but for the moment, delegate
-					 * that to readpage.
-					 */
-					caching = false;
-				}
-				folio_unlock(folio);
-				next_cached += fsize;
-				cache_nr_pages -= fsize;
-				if (cache_nr_pages == 0)
-					check_cache = true;
-				continue;
-			}
-		}
-
-		if (open_file->invalidHandle) {
-			rc = cifs_reopen_file(open_file, true);
-			if (rc) {
-				if (rc == -EAGAIN)
-					continue;
-				break;
-			}
-		}
-
-		if (cifs_sb->ctx->rsize == 0)
-			cifs_sb->ctx->rsize =
-				server->ops->negotiate_rsize(tlink_tcon(open_file->tlink),
-							     cifs_sb->ctx);
-
-		rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize,
-						   &rsize, credits);
-		if (rc)
-			break;
-		nr_pages = min_t(size_t, rsize / PAGE_SIZE, ra_pages);
-		if (next_cached != ULONG_MAX)
-			nr_pages = min_t(size_t, nr_pages, next_cached - ra_index);
-
-		/*
-		 * Give up immediately if rsize is too small to read an entire
-		 * page. The VFS will fall back to readpage. We should never
-		 * reach this point however since we set ra_pages to 0 when the
-		 * rsize is smaller than a cache page.
-		 */
-		if (unlikely(!nr_pages)) {
-			add_credits_and_wake_if(server, credits, 0);
-			break;
-		}
-
-		rdata = cifs_readdata_alloc(cifs_readahead_complete);
-		if (!rdata) {
-			/* best to give up if we're out of mem */
-			add_credits_and_wake_if(server, credits, 0);
-			break;
-		}
-
-		rdata->subreq.start	= ra_index * PAGE_SIZE;
-		rdata->subreq.len	= nr_pages * PAGE_SIZE;
-		rdata->cfile	= cifsFileInfo_get(open_file);
-		rdata->server	= server;
-		rdata->mapping	= ractl->mapping;
-		rdata->pid	= pid;
-		rdata->credits	= credits_on_stack;
-
-		for (i = 0; i < nr_pages; i++) {
-			if (!readahead_folio(ractl))
-				WARN_ON(1);
-		}
-		ra_pages -= nr_pages;
-		ra_index += nr_pages;
-
-		iov_iter_xarray(&rdata->subreq.io_iter, ITER_DEST, &rdata->mapping->i_pages,
-				rdata->subreq.start, rdata->subreq.len);
-
-		rc = adjust_credits(server, &rdata->credits, rdata->subreq.len);
-		if (!rc) {
-			if (rdata->cfile->invalidHandle)
-				rc = -EAGAIN;
-			else
-				rc = server->ops->async_readv(rdata);
-		}
-
-		if (rc) {
-			add_credits_and_wake_if(server, &rdata->credits, 0);
-			cifs_unlock_folios(rdata->mapping,
-					   rdata->subreq.start / PAGE_SIZE,
-					   (rdata->subreq.start + rdata->subreq.len - 1) / PAGE_SIZE);
-			/* Fallback to the readpage in error/reconnect cases */
-			cifs_put_readdata(rdata);
-			break;
-		}
-
-		cifs_put_readdata(rdata);
-	}
-
-	free_xid(xid);
-}
-
-/*
- * cifs_readpage_worker must be called with the page pinned
- */
-static int cifs_readpage_worker(struct file *file, struct page *page,
-	loff_t *poffset)
-{
-	struct inode *inode = file_inode(file);
-	struct timespec64 atime, mtime;
-	char *read_data;
-	int rc;
-
-	/* Is the page cached? */
-	rc = cifs_readpage_from_fscache(inode, page);
-	if (rc == 0)
-		goto read_complete;
-
-	read_data = kmap(page);
-	/* for reads over a certain size could initiate async read ahead */
-
-	rc = cifs_read(file, read_data, PAGE_SIZE, poffset);
-
-	if (rc < 0)
-		goto io_error;
-	else
-		cifs_dbg(FYI, "Bytes read %d\n", rc);
-
-	/* we do not want atime to be less than mtime, it broke some apps */
-	atime = inode_set_atime_to_ts(inode, current_time(inode));
-	mtime = inode_get_mtime(inode);
-	if (timespec64_compare(&atime, &mtime) < 0)
-		inode_set_atime_to_ts(inode, inode_get_mtime(inode));
-
-	if (PAGE_SIZE > rc)
-		memset(read_data + rc, 0, PAGE_SIZE - rc);
-
-	flush_dcache_page(page);
-	SetPageUptodate(page);
-	rc = 0;
-
-io_error:
-	kunmap(page);
-
-read_complete:
-	unlock_page(page);
-	return rc;
-}
-
-static int cifs_read_folio(struct file *file, struct folio *folio)
-{
-	struct page *page = &folio->page;
-	loff_t offset = page_file_offset(page);
-	int rc = -EACCES;
-	unsigned int xid;
-
-	xid = get_xid();
-
-	if (file->private_data == NULL) {
-		rc = -EBADF;
-		free_xid(xid);
-		return rc;
-	}
-
-	cifs_dbg(FYI, "read_folio %p at offset %d 0x%x\n",
-		 page, (int)offset, (int)offset);
-
-	rc = cifs_readpage_worker(file, page, &offset);
-
-	free_xid(xid);
-	return rc;
-}
-#endif // end netfslib remove 4794
-
 static int is_inode_writable(struct cifsInodeInfo *cifs_inode)
 {
 	struct cifsFileInfo *open_file;
@@ -5160,104 +4660,6 @@ bool is_size_safe_to_change(struct cifsInodeInfo *cifsInode, __u64 end_of_file,
 		return true;
 }
 
-#if 0 // TODO remove 5152
-static int cifs_write_begin(struct file *file, struct address_space *mapping,
-			loff_t pos, unsigned len,
-			struct page **pagep, void **fsdata)
-{
-	int oncethru = 0;
-	pgoff_t index = pos >> PAGE_SHIFT;
-	loff_t offset = pos & (PAGE_SIZE - 1);
-	loff_t page_start = pos & PAGE_MASK;
-	loff_t i_size;
-	struct page *page;
-	int rc = 0;
-
-	cifs_dbg(FYI, "write_begin from %lld len %d\n", (long long)pos, len);
-
-start:
-	page = grab_cache_page_write_begin(mapping, index);
-	if (!page) {
-		rc = -ENOMEM;
-		goto out;
-	}
-
-	if (PageUptodate(page))
-		goto out;
-
-	/*
-	 * If we write a full page it will be up to date, no need to read from
-	 * the server. If the write is short, we'll end up doing a sync write
-	 * instead.
-	 */
-	if (len == PAGE_SIZE)
-		goto out;
-
-	/*
-	 * optimize away the read when we have an oplock, and we're not
-	 * expecting to use any of the data we'd be reading in. That
-	 * is, when the page lies beyond the EOF, or straddles the EOF
-	 * and the write will cover all of the existing data.
-	 */
-	if (CIFS_CACHE_READ(CIFS_I(mapping->host))) {
-		i_size = i_size_read(mapping->host);
-		if (page_start >= i_size ||
-		    (offset == 0 && (pos + len) >= i_size)) {
-			zero_user_segments(page, 0, offset,
-					   offset + len,
-					   PAGE_SIZE);
-			/*
-			 * PageChecked means that the parts of the page
-			 * to which we're not writing are considered up
-			 * to date. Once the data is copied to the
-			 * page, it can be set uptodate.
-			 */
-			SetPageChecked(page);
-			goto out;
-		}
-	}
-
-	if ((file->f_flags & O_ACCMODE) != O_WRONLY && !oncethru) {
-		/*
-		 * might as well read a page, it is fast enough. If we get
-		 * an error, we don't need to return it. cifs_write_end will
-		 * do a sync write instead since PG_uptodate isn't set.
-		 */
-		cifs_readpage_worker(file, page, &page_start);
-		put_page(page);
-		oncethru = 1;
-		goto start;
-	} else {
-		/* we could try using another file handle if there is one -
-		   but how would we lock it to prevent close of that handle
-		   racing with this read? In any case
-		   this will be written out by write_end so is fine */
-	}
-out:
-	*pagep = page;
-	return rc;
-}
-
-static bool cifs_release_folio(struct folio *folio, gfp_t gfp)
-{
-	if (folio_test_private(folio))
-		return 0;
-	if (folio_test_private_2(folio)) { /* [DEPRECATED] */
-		if (current_is_kswapd() || !(gfp & __GFP_FS))
-			return false;
-		folio_wait_private_2(folio);
-	}
-	fscache_note_page_release(cifs_inode_cookie(folio->mapping->host));
-	return true;
-}
-
-static void cifs_invalidate_folio(struct folio *folio, size_t offset,
-				 size_t length)
-{
-	folio_wait_private_2(folio); /* [DEPRECATED] */
-}
-#endif // end netfslib remove 5152
-
 void cifs_oplock_break(struct work_struct *work)
 {
 	struct cifsFileInfo *cfile = container_of(work, struct cifsFileInfo,
@@ -5347,27 +4749,6 @@ void cifs_oplock_break(struct work_struct *work)
 	cifs_done_oplock_break(cinode);
 }
 
-#if 0 // TODO remove 5333
-/*
- * The presence of cifs_direct_io() in the address space ops vector
- * allowes open() O_DIRECT flags which would have failed otherwise.
- *
- * In the non-cached mode (mount with cache=none), we shunt off direct read and write requests
- * so this method should never be called.
- *
- * Direct IO is not yet supported in the cached mode.
- */
-static ssize_t
-cifs_direct_io(struct kiocb *iocb, struct iov_iter *iter)
-{
-        /*
-         * FIXME
-         * Eventually need to support direct IO for non forcedirectio mounts
-         */
-        return -EINVAL;
-}
-#endif // netfs end remove 5333
-
 static int cifs_swap_activate(struct swap_info_struct *sis,
 			      struct file *swap_file, sector_t *span)
 {
diff --git a/fs/smb/client/fscache.c b/fs/smb/client/fscache.c
index b36c493f4c56..01424a5cdb99 100644
--- a/fs/smb/client/fscache.c
+++ b/fs/smb/client/fscache.c
@@ -170,114 +170,3 @@ void cifs_fscache_release_inode_cookie(struct inode *inode)
 		cifsi->netfs.cache = NULL;
 	}
 }
-
-#if 0 // TODO remove
-/*
- * Fallback page reading interface.
- */
-static int fscache_fallback_read_page(struct inode *inode, struct page *page)
-{
-	struct netfs_cache_resources cres;
-	struct fscache_cookie *cookie = cifs_inode_cookie(inode);
-	struct iov_iter iter;
-	struct bio_vec bvec;
-	int ret;
-
-	memset(&cres, 0, sizeof(cres));
-	bvec_set_page(&bvec, page, PAGE_SIZE, 0);
-	iov_iter_bvec(&iter, ITER_DEST, &bvec, 1, PAGE_SIZE);
-
-	ret = fscache_begin_read_operation(&cres, cookie);
-	if (ret < 0)
-		return ret;
-
-	ret = fscache_read(&cres, page_offset(page), &iter, NETFS_READ_HOLE_FAIL,
-			   NULL, NULL);
-	fscache_end_operation(&cres);
-	return ret;
-}
-
-/*
- * Fallback page writing interface.
- */
-static int fscache_fallback_write_pages(struct inode *inode, loff_t start, size_t len,
-					bool no_space_allocated_yet)
-{
-	struct netfs_cache_resources cres;
-	struct fscache_cookie *cookie = cifs_inode_cookie(inode);
-	struct iov_iter iter;
-	int ret;
-
-	memset(&cres, 0, sizeof(cres));
-	iov_iter_xarray(&iter, ITER_SOURCE, &inode->i_mapping->i_pages, start, len);
-
-	ret = fscache_begin_write_operation(&cres, cookie);
-	if (ret < 0)
-		return ret;
-
-	ret = cres.ops->prepare_write(&cres, &start, &len, len, i_size_read(inode),
-				      no_space_allocated_yet);
-	if (ret == 0)
-		ret = fscache_write(&cres, start, &iter, NULL, NULL);
-	fscache_end_operation(&cres);
-	return ret;
-}
-
-/*
- * Retrieve a page from FS-Cache
- */
-int __cifs_readpage_from_fscache(struct inode *inode, struct page *page)
-{
-	int ret;
-
-	cifs_dbg(FYI, "%s: (fsc:%p, p:%p, i:0x%p\n",
-		 __func__, cifs_inode_cookie(inode), page, inode);
-
-	ret = fscache_fallback_read_page(inode, page);
-	if (ret < 0)
-		return ret;
-
-	/* Read completed synchronously */
-	SetPageUptodate(page);
-	return 0;
-}
-
-void __cifs_readahead_to_fscache(struct inode *inode, loff_t pos, size_t len)
-{
-	cifs_dbg(FYI, "%s: (fsc: %p, p: %llx, l: %zx, i: %p)\n",
-		 __func__, cifs_inode_cookie(inode), pos, len, inode);
-
-	fscache_fallback_write_pages(inode, pos, len, true);
-}
-
-/*
- * Query the cache occupancy.
- */
-int __cifs_fscache_query_occupancy(struct inode *inode,
-				   pgoff_t first, unsigned int nr_pages,
-				   pgoff_t *_data_first,
-				   unsigned int *_data_nr_pages)
-{
-	struct netfs_cache_resources cres;
-	struct fscache_cookie *cookie = cifs_inode_cookie(inode);
-	loff_t start, data_start;
-	size_t len, data_len;
-	int ret;
-
-	ret = fscache_begin_read_operation(&cres, cookie);
-	if (ret < 0)
-		return ret;
-
-	start = first * PAGE_SIZE;
-	len = nr_pages * PAGE_SIZE;
-	ret = cres.ops->query_occupancy(&cres, start, len, PAGE_SIZE,
-					&data_start, &data_len);
-	if (ret == 0) {
-		*_data_first = data_start / PAGE_SIZE;
-		*_data_nr_pages = len / PAGE_SIZE;
-	}
-
-	fscache_end_operation(&cres);
-	return ret;
-}
-#endif
diff --git a/fs/smb/client/fscache.h b/fs/smb/client/fscache.h
index 08b30f79d4cd..f06cb24f5f3c 100644
--- a/fs/smb/client/fscache.h
+++ b/fs/smb/client/fscache.h
@@ -74,43 +74,6 @@ static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags
 			   i_size_read(inode), flags);
 }
 
-#if 0 // TODO remove
-extern int __cifs_fscache_query_occupancy(struct inode *inode,
-					  pgoff_t first, unsigned int nr_pages,
-					  pgoff_t *_data_first,
-					  unsigned int *_data_nr_pages);
-
-static inline int cifs_fscache_query_occupancy(struct inode *inode,
-					       pgoff_t first, unsigned int nr_pages,
-					       pgoff_t *_data_first,
-					       unsigned int *_data_nr_pages)
-{
-	if (!cifs_inode_cookie(inode))
-		return -ENOBUFS;
-	return __cifs_fscache_query_occupancy(inode, first, nr_pages,
-					      _data_first, _data_nr_pages);
-}
-
-extern int __cifs_readpage_from_fscache(struct inode *pinode, struct page *ppage);
-extern void __cifs_readahead_to_fscache(struct inode *pinode, loff_t pos, size_t len);
-
-
-static inline int cifs_readpage_from_fscache(struct inode *inode,
-					     struct page *page)
-{
-	if (cifs_inode_cookie(inode))
-		return __cifs_readpage_from_fscache(inode, page);
-	return -ENOBUFS;
-}
-
-static inline void cifs_readahead_to_fscache(struct inode *inode,
-					     loff_t pos, size_t len)
-{
-	if (cifs_inode_cookie(inode))
-		__cifs_readahead_to_fscache(inode, pos, len);
-}
-#endif
-
 static inline bool cifs_fscache_enabled(struct inode *inode)
 {
 	return fscache_cookie_enabled(cifs_inode_cookie(inode));
@@ -133,27 +96,6 @@ static inline struct fscache_cookie *cifs_inode_cookie(struct inode *inode) { re
 static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags) {}
 static inline bool cifs_fscache_enabled(struct inode *inode) { return false; }
 
-#if 0 // TODO remove
-static inline int cifs_fscache_query_occupancy(struct inode *inode,
-					       pgoff_t first, unsigned int nr_pages,
-					       pgoff_t *_data_first,
-					       unsigned int *_data_nr_pages)
-{
-	*_data_first = ULONG_MAX;
-	*_data_nr_pages = 0;
-	return -ENOBUFS;
-}
-
-static inline int
-cifs_readpage_from_fscache(struct inode *inode, struct page *page)
-{
-	return -ENOBUFS;
-}
-
-static inline
-void cifs_readahead_to_fscache(struct inode *inode, loff_t pos, size_t len) {}
-#endif
-
 #endif /* CONFIG_CIFS_FSCACHE */
 
 #endif /* _CIFS_FSCACHE_H */


^ permalink raw reply related	[relevance 2%]

* [PATCH 09/18] fsverity: expose merkle tree geometry to callers
  @ 2024-04-30  3:21  6% ` Darrick J. Wong
  0 siblings, 0 replies; 200+ results
From: Darrick J. Wong @ 2024-04-30  3:21 UTC (permalink / raw)
  To: aalbersh, ebiggers, djwong
  Cc: linux-xfs, alexl, walters, fsverity, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

Create a function that will return selected information about the
geometry of the merkle tree.  Online fsck for XFS will need this piece
to perform basic checks of the merkle tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/verity/open.c         |   32 ++++++++++++++++++++++++++++++++
 include/linux/fsverity.h |   10 ++++++++++
 2 files changed, 42 insertions(+)


diff --git a/fs/verity/open.c b/fs/verity/open.c
index 4777130322866..aa71a4d3cbff1 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -427,6 +427,38 @@ void __fsverity_cleanup_inode(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(__fsverity_cleanup_inode);
 
+/**
+ * fsverity_merkle_tree_geometry() - return Merkle tree geometry
+ * @inode: the inode to query
+ * @block_size: will be set to the size of a merkle tree block, in bytes
+ * @tree_size: will be set to the size of the merkle tree, in bytes
+ *
+ * Callers are not required to have opened the file.
+ *
+ * Return: 0 for success, -ENODATA if verity is not enabled, or any of the
+ * error codes that can result from loading verity information while opening a
+ * file.
+ */
+int fsverity_merkle_tree_geometry(struct inode *inode, unsigned int *block_size,
+				  u64 *tree_size)
+{
+	struct fsverity_info *vi;
+	int error;
+
+	if (!IS_VERITY(inode))
+		return -ENODATA;
+
+	error = ensure_verity_info(inode);
+	if (error)
+		return error;
+
+	vi = inode->i_verity_info;
+	*block_size = vi->tree_params.block_size;
+	*tree_size = vi->tree_params.tree_size;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(fsverity_merkle_tree_geometry);
+
 void __init fsverity_init_info_cache(void)
 {
 	fsverity_info_cachep = KMEM_CACHE_USERCOPY(
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 7c51d7cf835ec..a3a5b68bed0d3 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -243,6 +243,9 @@ int __fsverity_file_open(struct inode *inode, struct file *filp);
 int __fsverity_prepare_setattr(struct dentry *dentry, struct iattr *attr);
 void __fsverity_cleanup_inode(struct inode *inode);
 
+int fsverity_merkle_tree_geometry(struct inode *inode, unsigned int *block_size,
+				  u64 *tree_size);
+
 /**
  * fsverity_cleanup_inode() - free the inode's verity info, if present
  * @inode: an inode being evicted
@@ -326,6 +329,13 @@ static inline void fsverity_cleanup_inode(struct inode *inode)
 {
 }
 
+static inline int fsverity_merkle_tree_geometry(struct inode *inode,
+						unsigned int *block_size,
+						u64 *tree_size)
+{
+	return -EOPNOTSUPP;
+}
+
 /* read_metadata.c */
 
 static inline int fsverity_ioctl_read_metadata(struct file *filp,
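
To show how the new export might be consumed, here is a hypothetical kernel-side caller sketch in the spirit of the XFS online fsck mentioned in the commit message. Only fsverity_merkle_tree_geometry() comes from this patch; check_verity_geometry() and report_corruption() are invented placeholders, and the power-of-two check is illustrative rather than XFS's actual scrub logic:

```c
/* Hypothetical caller sketch; not part of this patch series. */
static int check_verity_geometry(struct inode *inode)
{
	unsigned int block_size;
	u64 tree_size;
	int error;

	error = fsverity_merkle_tree_geometry(inode, &block_size, &tree_size);
	if (error == -ENODATA)
		return 0;	/* verity not enabled; nothing to check */
	if (error)
		return error;

	/* Merkle tree blocks are always a power-of-two size. */
	if (!is_power_of_2(block_size) || tree_size == 0)
		report_corruption(inode);	/* invented placeholder */
	return 0;
}
```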



* [PATCH v3 16/21] fs: Add FS_XFLAG_ATOMICWRITES flag
  @ 2024-04-29 17:47  5% ` John Garry
  0 siblings, 0 replies; 200+ results
From: John Garry @ 2024-04-29 17:47 UTC (permalink / raw)
  To: david, djwong, hch, viro, brauner, jack, chandan.babu, willy
  Cc: axboe, martin.petersen, linux-kernel, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, mcgrof, p.raghav, linux-xfs,
	catherine.hoang, John Garry

Add a flag indicating that a regular file is enabled for atomic writes.

This is a file attribute that mirrors an ondisk inode flag.  Actual support
for untorn file writes (for now) depends on both the iflag and the
underlying storage devices, which we can only really check at statx and
pwritev2() time.  This is the same story as FS_XFLAG_DAX, which signals to
the fs that we should try to enable the fsdax IO path on the file (instead
of the regular page cache), but applications have to query STATX_ATTR_DAX
to find out if they really got that IO path.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/uapi/linux/fs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 6a6bcb53594a..0eae5383a0b4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -160,6 +160,7 @@ struct fsxattr {
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
 /* data extent mappings for regular files must be aligned to extent size hint */
 #define FS_XFLAG_FORCEALIGN	0x00020000
+#define FS_XFLAG_ATOMICWRITES	0x00040000	/* atomic writes enabled */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.31.1



* Re: [PATCH v4 01/34] ext4: factor out a common helper to query extent map
  2024-04-10 14:29 11% ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
@ 2024-04-26 11:55  7%   ` Ritesh Harjani
  0 siblings, 0 replies; 200+ results
From: Ritesh Harjani @ 2024-04-26 11:55 UTC (permalink / raw)
  To: Zhang Yi, linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, hch, djwong, david, willy, zokeefe, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3, wangkefeng.wang

Zhang Yi <yi.zhang@huaweicloud.com> writes:

> From: Zhang Yi <yi.zhang@huawei.com>
>
> Factor out a new common helper ext4_map_query_blocks() from the
> ext4_da_map_blocks(); it queries and returns the extent map status on the
> inode's extent path, with no logic changes.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/inode.c | 57 +++++++++++++++++++++++++++----------------------
>  1 file changed, 32 insertions(+), 25 deletions(-)

Looks good to me. Straight forward refactoring.
Feel free to add - 

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 537803250ca9..6a41172c06e1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -453,6 +453,35 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
>  }
>  #endif /* ES_AGGRESSIVE_TEST */
>  
> +static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
> +				 struct ext4_map_blocks *map)
> +{
> +	unsigned int status;
> +	int retval;
> +
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> +		retval = ext4_ext_map_blocks(handle, inode, map, 0);
> +	else
> +		retval = ext4_ind_map_blocks(handle, inode, map, 0);
> +
> +	if (retval <= 0)
> +		return retval;
> +
> +	if (unlikely(retval != map->m_len)) {
> +		ext4_warning(inode->i_sb,
> +			     "ES len assertion failed for inode "
> +			     "%lu: retval %d != map->m_len %d",
> +			     inode->i_ino, retval, map->m_len);
> +		WARN_ON(1);
> +	}
> +
> +	status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> +			EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> +	ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> +			      map->m_pblk, status);
> +	return retval;
> +}
> +
>  /*
>   * The ext4_map_blocks() function tries to look up the requested blocks,
>   * and returns if the blocks are already mapped.
> @@ -1744,33 +1773,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>  	down_read(&EXT4_I(inode)->i_data_sem);
>  	if (ext4_has_inline_data(inode))
>  		retval = 0;
> -	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> -		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
>  	else
> -		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
> -	if (retval < 0) {
> -		up_read(&EXT4_I(inode)->i_data_sem);
> -		return retval;
> -	}
> -	if (retval > 0) {
> -		unsigned int status;
> -
> -		if (unlikely(retval != map->m_len)) {
> -			ext4_warning(inode->i_sb,
> -				     "ES len assertion failed for inode "
> -				     "%lu: retval %d != map->m_len %d",
> -				     inode->i_ino, retval, map->m_len);
> -			WARN_ON(1);
> -		}
> -
> -		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> -				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> -		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> -				      map->m_pblk, status);
> -		up_read(&EXT4_I(inode)->i_data_sem);
> -		return retval;
> -	}
> +		retval = ext4_map_query_blocks(NULL, inode, map);
>  	up_read(&EXT4_I(inode)->i_data_sem);
> +	if (retval)
> +		return retval;
>  
>  add_delayed:
>  	down_write(&EXT4_I(inode)->i_data_sem);
> -- 
> 2.39.2


* Re: [PATCH 08/13] fsverity: expose merkle tree geometry to callers
  2024-04-25  1:01  0%         ` Darrick J. Wong
@ 2024-04-25  1:04  0%           ` Eric Biggers
  0 siblings, 0 replies; 200+ results
From: Eric Biggers @ 2024-04-25  1:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs, linux-fsdevel, fsverity

On Wed, Apr 24, 2024 at 06:01:37PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 25, 2024 at 12:49:27AM +0000, Eric Biggers wrote:
> > On Wed, Apr 24, 2024 at 05:45:45PM -0700, Darrick J. Wong wrote:
> > > On Thu, Apr 04, 2024 at 10:50:45PM -0400, Eric Biggers wrote:
> > > > On Fri, Mar 29, 2024 at 05:34:45PM -0700, Darrick J. Wong wrote:
> > > > > +/**
> > > > > + * fsverity_merkle_tree_geometry() - return Merkle tree geometry
> > > > > + * @inode: the inode for which the Merkle tree is being built
> > > > 
> > > > This function is actually for inodes that already have fsverity enabled.  So the
> > > > above comment is misleading.
> > > 
> > > How about:
> > > 
> > > /**
> > >  * fsverity_merkle_tree_geometry() - return Merkle tree geometry
> > >  * @inode: the inode to query
> > >  * @block_size: size of a merkle tree block, in bytes
> > >  * @tree_size: size of the merkle tree, in bytes
> > >  *
> > >  * Callers are not required to have opened the file.
> > >  */
> > 
> > Looks okay, but it would be helpful to document that the two output parameters
> > are outputs, and to document the return value.
> 
> How about:
> 
>  * Callers are not required to have opened the file.  Returns 0 for success,
>  * -ENODATA if verity is not enabled, or any of the error codes that can result
>  * from loading verity information while opening a file.
> 

The wording sounds good, but since this is a kerneldoc-style comment, the
information about the return value should be in a "Return:" section.

- Eric


* Re: [PATCH 08/13] fsverity: expose merkle tree geometry to callers
  2024-04-25  0:49  0%       ` Eric Biggers
@ 2024-04-25  1:01  0%         ` Darrick J. Wong
  2024-04-25  1:04  0%           ` Eric Biggers
  0 siblings, 1 reply; 200+ results
From: Darrick J. Wong @ 2024-04-25  1:01 UTC (permalink / raw)
  To: Eric Biggers; +Cc: aalbersh, linux-xfs, linux-fsdevel, fsverity

On Thu, Apr 25, 2024 at 12:49:27AM +0000, Eric Biggers wrote:
> On Wed, Apr 24, 2024 at 05:45:45PM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 04, 2024 at 10:50:45PM -0400, Eric Biggers wrote:
> > > On Fri, Mar 29, 2024 at 05:34:45PM -0700, Darrick J. Wong wrote:
> > > > +/**
> > > > + * fsverity_merkle_tree_geometry() - return Merkle tree geometry
> > > > + * @inode: the inode for which the Merkle tree is being built
> > > 
> > > This function is actually for inodes that already have fsverity enabled.  So the
> > > above comment is misleading.
> > 
> > How about:
> > 
> > /**
> >  * fsverity_merkle_tree_geometry() - return Merkle tree geometry
> >  * @inode: the inode to query
> >  * @block_size: size of a merkle tree block, in bytes
> >  * @tree_size: size of the merkle tree, in bytes
> >  *
> >  * Callers are not required to have opened the file.
> >  */
> 
> Looks okay, but it would be helpful to document that the two output parameters
> are outputs, and to document the return value.

How about:

 * Callers are not required to have opened the file.  Returns 0 for success,
 * -ENODATA if verity is not enabled, or any of the error codes that can result
 * from loading verity information while opening a file.

--D

> - Eric
> 


* Re: [PATCH 08/13] fsverity: expose merkle tree geometry to callers
  2024-04-25  0:45  5%     ` Darrick J. Wong
@ 2024-04-25  0:49  0%       ` Eric Biggers
  2024-04-25  1:01  0%         ` Darrick J. Wong
  0 siblings, 1 reply; 200+ results
From: Eric Biggers @ 2024-04-25  0:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs, linux-fsdevel, fsverity

On Wed, Apr 24, 2024 at 05:45:45PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 04, 2024 at 10:50:45PM -0400, Eric Biggers wrote:
> > On Fri, Mar 29, 2024 at 05:34:45PM -0700, Darrick J. Wong wrote:
> > > +/**
> > > + * fsverity_merkle_tree_geometry() - return Merkle tree geometry
> > > + * @inode: the inode for which the Merkle tree is being built
> > 
> > This function is actually for inodes that already have fsverity enabled.  So the
> > above comment is misleading.
> 
> How about:
> 
> /**
>  * fsverity_merkle_tree_geometry() - return Merkle tree geometry
>  * @inode: the inode to query
>  * @block_size: size of a merkle tree block, in bytes
>  * @tree_size: size of the merkle tree, in bytes
>  *
>  * Callers are not required to have opened the file.
>  */

Looks okay, but it would be helpful to document that the two output parameters
are outputs, and to document the return value.

- Eric


* Re: [PATCH 08/13] fsverity: expose merkle tree geometry to callers
  @ 2024-04-25  0:45  5%     ` Darrick J. Wong
  2024-04-25  0:49  0%       ` Eric Biggers
  0 siblings, 1 reply; 200+ results
From: Darrick J. Wong @ 2024-04-25  0:45 UTC (permalink / raw)
  To: Eric Biggers; +Cc: aalbersh, linux-xfs, linux-fsdevel, fsverity

On Thu, Apr 04, 2024 at 10:50:45PM -0400, Eric Biggers wrote:
> On Fri, Mar 29, 2024 at 05:34:45PM -0700, Darrick J. Wong wrote:
> > +/**
> > + * fsverity_merkle_tree_geometry() - return Merkle tree geometry
> > + * @inode: the inode for which the Merkle tree is being built
> 
> This function is actually for inodes that already have fsverity enabled.  So the
> above comment is misleading.

How about:

/**
 * fsverity_merkle_tree_geometry() - return Merkle tree geometry
 * @inode: the inode to query
 * @block_size: size of a merkle tree block, in bytes
 * @tree_size: size of the merkle tree, in bytes
 *
 * Callers are not required to have opened the file.
 */


> > +int fsverity_merkle_tree_geometry(struct inode *inode, unsigned int *block_size,
> > +				  u64 *tree_size)
> > +{
> > +	struct fsverity_info *vi;
> > +	int error;
> > +
> > +	if (!IS_VERITY(inode))
> > +		return -EOPNOTSUPP;
> 
> Maybe use ENODATA, similar to fsverity_ioctl_measure() and
> bpf_get_fsverity_digest().

Done.

> > +
> > +	error = ensure_verity_info(inode);
> > +	if (error)
> > +		return error;
> > +
> > +	vi = fsverity_get_info(inode);
> 
> This can just use 'vi = inode->i_verity_info', since ensure_verity_info() was
> called.

Changed.

> It should also be documented that an open need not have been done on the file
> yet, as this behavior differs from functions like fsverity_get_digest() that
> require that an open was done first.

Done.

--D

> - Eric
> 


* Re: [PATCH v2 1/9] ext4: factor out a common helper to query extent map
  2024-04-10  3:41 11% ` [PATCH v2 1/9] ext4: factor out a common helper to query extent map Zhang Yi
@ 2024-04-24 20:05  7%   ` Jan Kara
  0 siblings, 0 replies; 200+ results
From: Jan Kara @ 2024-04-24 20:05 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, tytso, adilger.kernel, jack, yi.zhang,
	chengzhihao1, yukuai3

On Wed 10-04-24 11:41:55, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Factor out a new common helper ext4_map_query_blocks() from the
> ext4_da_map_blocks(); it queries and returns the extent map status on the
> inode's extent path, with no logic changes.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 57 +++++++++++++++++++++++++++----------------------
>  1 file changed, 32 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 537803250ca9..6a41172c06e1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -453,6 +453,35 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
>  }
>  #endif /* ES_AGGRESSIVE_TEST */
>  
> +static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
> +				 struct ext4_map_blocks *map)
> +{
> +	unsigned int status;
> +	int retval;
> +
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> +		retval = ext4_ext_map_blocks(handle, inode, map, 0);
> +	else
> +		retval = ext4_ind_map_blocks(handle, inode, map, 0);
> +
> +	if (retval <= 0)
> +		return retval;
> +
> +	if (unlikely(retval != map->m_len)) {
> +		ext4_warning(inode->i_sb,
> +			     "ES len assertion failed for inode "
> +			     "%lu: retval %d != map->m_len %d",
> +			     inode->i_ino, retval, map->m_len);
> +		WARN_ON(1);
> +	}
> +
> +	status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> +			EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> +	ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> +			      map->m_pblk, status);
> +	return retval;
> +}
> +
>  /*
>   * The ext4_map_blocks() function tries to look up the requested blocks,
>   * and returns if the blocks are already mapped.
> @@ -1744,33 +1773,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>  	down_read(&EXT4_I(inode)->i_data_sem);
>  	if (ext4_has_inline_data(inode))
>  		retval = 0;
> -	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> -		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
>  	else
> -		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
> -	if (retval < 0) {
> -		up_read(&EXT4_I(inode)->i_data_sem);
> -		return retval;
> -	}
> -	if (retval > 0) {
> -		unsigned int status;
> -
> -		if (unlikely(retval != map->m_len)) {
> -			ext4_warning(inode->i_sb,
> -				     "ES len assertion failed for inode "
> -				     "%lu: retval %d != map->m_len %d",
> -				     inode->i_ino, retval, map->m_len);
> -			WARN_ON(1);
> -		}
> -
> -		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
> -				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> -		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> -				      map->m_pblk, status);
> -		up_read(&EXT4_I(inode)->i_data_sem);
> -		return retval;
> -	}
> +		retval = ext4_map_query_blocks(NULL, inode, map);
>  	up_read(&EXT4_I(inode)->i_data_sem);
> +	if (retval)
> +		return retval;
>  
>  add_delayed:
>  	down_write(&EXT4_I(inode)->i_data_sem);
> -- 
> 2.39.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND]
  2024-04-23 13:30  0% ` [Lsf-pc] " Amir Goldstein
@ 2024-04-24 12:22  0%   ` John Groves
  0 siblings, 0 replies; 200+ results
From: John Groves @ 2024-04-24 12:22 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Bernd Schubert, lsf-pc, Jonathan Corbet,
	Dan Williams, Vishal Verma, Dave Jiang, Alexander Viro,
	Christian Brauner, Jan Kara, Matthew Wilcox, linux-cxl,
	linux-fsdevel, nvdimm, Randy Dunlap, Jon Grimm, Dave Chinner,
	john, Bharata B Rao, Jerome Glisse, gregory.price, Ajay Joshi,
	Aneesh Kumar K . V, Alistair Popple, Christoph Hellwig, Zi Yan,
	David Rientjes, Ravi Shankar, dave.hansen, John Hubbard, mykolal,
	Brian Morris, Eishan Mirakhur, Wei Xu, Theodore Ts'o,
	Srinivasulu Thanneeru, John Groves, Christoph Lameter,
	Johannes Weiner, Andrew Morton, Aravind Ramesh

On 24/04/23 04:30PM, Amir Goldstein wrote:
> On Thu, Feb 29, 2024 at 2:20 AM John Groves <John@groves.net> wrote:
> >
> > John Groves, Micron
> >
> > Micron recently released the first RFC for famfs [1]. Although famfs is not
> > CXL-specific in any way, it aims to enable hosts to share data sets in shared
> > memory (such as CXL) by providing a memory-mappable fs-dax file system
> > interface to the memory.
> >
> > Sharable disaggregated memory already exists in the lab, and will be possible
> > in the wild soon. Famfs aims to do the following:
> >
> > * Provide an access method that provides isolation between files, and does not
> >   tempt developers to mmap all the memory writable on every host.
> > * Provide an an access method that can be used by unmodified apps.
> >
> > Without something like famfs, enabling the use of sharable memory will involve
> > the temptation to do things that may destabilize systems, such as
> > mapping large shared, writable global memory ranges and hooking allocators to
> > use it (potentially sacrificing isolation), and forcing the same virtual
> > address ranges in every host/process (compromising security).
> >
> > The most obvious candidate app categories are data analytics and data lakes.
> > Both make heavy use of "zero-copy" data frames - column oriented data that
> > is laid out for efficient use via (MAP_SHARED) mmap. Moreover, these use case
> > categories are generally driven by python code that wrangles data into
> > appropriate data frames - making it straightforward to put the data frames
> > into famfs. Furthermore, these use cases usually involve the shared data being
> > read-only during computation or query jobs - meaning they are often free of
> > cache coherency concerns.
> >
> > Workloads such as these often deal with data sets that are too large to fit
> > in a single server's memory, so the data gets sharded - requiring movement via
> > a network. Sharded apps also sometimes have to do expensive reshuffling -
> > moving data to nodes with available compute resources. Avoiding the sharding
> > overheads by accessing such data sets in disaggregated shared memory looks
> > promising to make better use of memory and compute resources, and by
> > effectively de-duplicating data sets in memory.
> >
> > About sharable memory
> >
> > * Shared memory is pmem-like, in that hosts will connect in order to access
> >   pre-existing contents
> > * Onlining sharable memory as system-ram is nonsense; system-ram gets zeroed...
> > * CXL 3 provides for optionally-supported hardware-managed cache coherency
> > * But "multiple-readers, no writers" use cases don't need hardware support
> >   for coherency
> > * CXL 3.1 dynamic capacity devices (DCDs) should be thought of as devices with
> >   an allocator built in.
> > * When sharable capacity is allocated, each host that has access will see a
> >   /dev/dax device that can be found by the "tag" of the allocation. The tag is
> >   just a uuid.
> > * CXL 3.1 also allows the capacity associated with any allocated tag to be
> >   provided to each host (or host group) as either writable or read-only.
> >
> > About famfs
> >
> > Famfs is an append-only log-structured file system that places many limits
> > on what can be done. This allows famfs to tolerate clients with a stale copy
> > of metadata. All memory allocation and log maintenance is performed from user
> > space, but file extent lists are cached in the kernel for fast fault
> > resolution. The current limitations are fairly extreme, but many can be relaxed
> > by writing more code, managing Byzantine generals, etc. ;)
> >
> > A famfs-enabled kernel can be cloned at [3], and the user space repo can be
> > cloned at [4]. Even with major functional limitations in its current form
> > (e.g. famfs does not currently support deleting files), it is sufficient to
> > use in data analytics workloads - in which you 1) create a famfs file system,
> > 2) dump data sets into it, 3) run clustered jobs that consume the shared data
> > sets, and 4) dismount and deallocate the memory containing the file system.
> >
> > Famfs Open Issues
> >
> > * Volatile CXL memory is exposed as character dax devices; the famfs patch
> >   set adds the iomap API, which is required for fs-dax but until now missing
> >   from character dax.
> > * (/dev/pmem devices are block, and support the iomap api for fs-dax file
> >   systems)
> > * /dev/pmem devices can be converted to /dev/dax mode, but native /dev/dax
> >   devices cannot be converted to pmem mode.
> > * /dev/dax devices lack the iomap api that fs-dax uses with pmem, so the famfs
> >   patch set adds that.
> > * VFS layer hooks for a file system on a character device may be needed.
> > * Famfs has uncovered some previously latent bugs in the /dev/dax mmap
> >   machinery that probably require attention.
> > * Famfs currently works with either pmem or devdax devices, but our
> >   inclination is to drop pmem support to, reduce the complexity of supporting
> >   two different underlying device types - particularly since famfs is not
> >   intended for actual pmem.
> >
> >
> > Required :-
> > Dan Williams
> > Christian Brauner
> > Jonathan Cameron
> > Dave Hansen
> >
> > [LSF/MM + BPF ATTEND]
> >
> > I am the author of the famfs file system. Famfs was first introduced at LPC
> > 2023 [2]. I'm also Micron's voting member on the Software and Systems Working
> > Group (SSWG) of the CXL Consortium, and a co-author of the CXL 3.1
> > specification.
> >
> >
> > References
> >
> > [1] https://lore.kernel.org/linux-fsdevel/cover.1708709155.git.john@groves.net/#t
> > [2] https://lpc.events/event/17/contributions/1455/
> > [3] https://www.computeexpresslink.org/download-the-specification
> > [4] https://github.com/cxl-micron-reskit/famfs-linux
> >
> 
> Hi John,
> 
> Following our correspondence on your patch set [1], I am not sure that the
> details of famfs file system itself are an interesting topic for the
> LSFMM crowd??
> What I would like to do is schedule a session on:
> "Famfs: new userspace filesystem driver vs. improving FUSE/DAX"
> 
> I am hoping that Miklos and Bernd will be able to participate in this
> session remotely.
> 
> You see the last time that someone tried to introduce a specialized
> faster FUSE replacement [2], the comments from the community were
> that FUSE protocol can and should be improved instead of introducing
> another "filesystem in userspace" protocol.
> 
> Since 2019, FUSE has gained virtiofs/dax support, it recently gained
> FUSE passthrough support and Bernd is working on FUSE uring [3].
> 
> My hope is that you will be able to list the needed improvements
> to /dev/dax iomap and FUSE so that you could use the existing
> kernel infrastructure and FUSE libraries to implement famfs.
> 
> How does that sound for a discussion?
> 
> Thanks,
> Amir.
> 
> [1] https://lore.kernel.org/linux-fsdevel/3jwluwrqj6rwsxdsksfvdeo5uccgmnkh7rgefaeyxf2gu75344@ybhwncywkftx/
> [2] https://lore.kernel.org/linux-fsdevel/8d119597-4543-c6a4-917f-14f4f4a6a855@netapp.com/
> [3] https://lore.kernel.org/linux-fsdevel/20230321011047.3425786-1-bschubert@ddn.com/

Amir,

That sounds good, thanks! I'll start preparing for it!

Re: [2]: I do think there are important ways that famfs is not "another 
filesystem in user space protocol" - but I'll save it for the LSFMM session!

FYI famfs v2 patches will be going out before LSFMM (and possibly before
next week).

Thanks Amir,
John


^ permalink raw reply	[relevance 0%]

* Re: [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio
  2024-04-10 14:29  2% [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
                   ` (4 preceding siblings ...)
  2024-04-11  1:12  0% ` [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
@ 2024-04-24  8:12  0% ` Zhang Yi
  5 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-04-24  8:12 UTC (permalink / raw)
  To: linux-ext4, tytso, jack
  Cc: linux-fsdevel, linux-mm, linux-kernel, adilger.kernel,
	ritesh.list, hch, djwong, david, willy, zokeefe, yi.zhang,
	chengzhihao1, yukuai3, wangkefeng.wang

Hi Ted and Jan,

I'm almost done with the first phase of this iomap conversion for
regular file's buffered IO path. Could you please take some time to
look at this series? I'd appreciate it if I could get some feedback and
comments before the next phase of development. Or is there any plan to
merge this series?

Thanks,
Yi.

On 2024/4/10 22:29, Zhang Yi wrote:
> Hello!
> 
> This is the fourth version of the RFC patch series that converts ext4
> regular file's buffered IO path to iomap and enables large folio. I've
> rebased it on 6.9-rc3; it also **depends on my xfs/iomap fix series**,
> which has been reviewed but not merged yet [1]. Compared to the third
> version, this iteration fixes an issue discovered in the current ext4
> code, and contains two other main changes: 1) add bigalloc support and
> 2) simplify the updating logic of reserved delalloc data blocks. Both
> changes could be sent out as preliminary patch series; besides these,
> the rest are small code cleanups, performance optimizations and commit
> log improvements. Please take a look at this series; any comments are
> welcome.
> 
> This series supports ext4 with the default features and mount options
> (bigalloc is also supported). It doesn't support non-extent (ext3),
> inline_data, dax, fs_verity, fs_crypt or data=journal mode; ext4 falls
> back to the buffer_head path automatically if you enable those features
> or options. Although it has many limitations now, it can satisfy the
> requirements of most common cases and bring a significant performance
> benefit for large IOs.
> 
> The iomap path would be simpler than the buffer_head path to some extent.
> Please note that there are 4 major differences:
> 1. Always allocate unwritten extents for new blocks, which means this is
>    not controlled by the dioread_nolock mount option.
> 2. Because of 1, there is no risk of exposing stale data during append
>    writes, so we don't need to write back data before metadata; it's time
>    to drop 'data=ordered' mode automatically.
> 3. Because of 2, we don't need to reserve journal credits or use a
>    reserved handle for the extent status conversion during writeback.
> 4. We can postpone updating i_disksize to the endio handler, which avoids
>    exposing zero data during append writes and instantaneous power
>    failure.
> 
> Series details:
> Patch 1-9: this is the part 2 preparation series. It fixes a problem
> first, then makes the ext4_insert_delayed_block() call path support
> inserting multiple delalloc blocks (also supporting bigalloc), and
> finally makes ext4_da_map_blocks() buffer_head unaware. I've sent it
> out separately [2] and hope it can be merged first.
> 
> Patch 10-19: this is the part 3 preparatory series (picked out from my
> metadata reservation series [3]; these patches are not a strong
> dependency, but I'd suggest they be merged before the iomap
> conversion). These patches move ext4_da_update_reserve_space() to
> ext4_es_insert_extent(), and always set EXT4_GET_BLOCKS_DELALLOC_RESERVE
> when allocating delalloc blocks, no matter whether they come from the
> delayed allocation or non-delayed allocation (fallocate) path, which
> makes delalloc extents always delonly. This makes the delalloc
> reservation logic simpler and cleaner than before.
> 
> Patch 20-34: these patches are the main implementation of the buffered
> IO iomap conversion. They first introduce a sequence counter for the
> extent status tree, then add a new iomap aops for read, write and mmap,
> replacing the current buffer_head path. Finally, they enable the iomap
> path (except for inline_data, non-extent, dax, fs_verity, fs_crypt,
> defrag and data=journal mode) if the user specifies the "buffered_iomap"
> mount option, and also enable large folio. Please look at the following
> patches for details.
> 
> About tests:
>  - Passed kvm-xfstests in auto mode, and I keep running stress tests
>    and fault injection tests.
>  - A performance test is below (tested on my version 3 series;
>    theoretically there won't be much difference in this version).
> 
>    Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU
>    with 400GB system ram, 200GB ramdisk and 1TB nvme ssd disk.
> 
>    == buffer read ==
> 
>                   buffer head        iomap + large folio
>    type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)
>    ----------------------------------------------------
>    hole     4K    565k    2206       811k    3167
>    hole     64K   45.1k   2820       78.1k   4879
>    hole     1M    2744    2744       4890    4891
>    ramdisk  4K    436k    1703       554k    2163
>    ramdisk  64K   29.6k   1848       44.0k   2747
>    ramdisk  1M    1994    1995       2809    2809
>    nvme     4K    306k    1196       324k    1267
>    nvme     64K   19.3k   1208       24.3k   1517
>    nvme     1M    1694    1694       2256    2256
> 
>    == buffer write ==
> 
>                                         buffer head  iomap + large folio
>    type   Overwrite Sync Writeback bs   IOPS   BW    IOPS   BW
>    ------------------------------------------------------------
>    cache    N       N    N         4K   395k   1544  415k   1621
>    cache    N       N    N         64K  30.8k  1928  80.1k  5005
>    cache    N       N    N         1M   1963   1963  5641   5642
>    cache    Y       N    N         4K   423k   1652  443k   1730
>    cache    Y       N    N         64K  33.0k  2063  80.8k  5051
>    cache    Y       N    N         1M   2103   2103  5588   5589
>    ramdisk  N       N    Y         4K   362k   1416  307k   1198
>    ramdisk  N       N    Y         64K  22.4k  1399  64.8k  4050
>    ramdisk  N       N    Y         1M   1670   1670  4559   4560
>    ramdisk  N       Y    N         4K   9830   38.4  13.5k  52.8
>    ramdisk  N       Y    N         64K  5834   365   10.1k  629
>    ramdisk  N       Y    N         1M   1011   1011  2064   2064
>    ramdisk  Y       N    Y         4K   397k   1550  409k   1598
>    ramdisk  Y       N    Y         64K  29.2k  1827  73.6k  4597
>    ramdisk  Y       N    Y         1M   1837   1837  4985   4985
>    ramdisk  Y       Y    N         4K   173k   675   182k   710
>    ramdisk  Y       Y    N         64K  17.7k  1109  33.7k  2105
>    ramdisk  Y       Y    N         1M   1128   1129  1790   1791
>    nvme     N       N    Y         4K   298k   1164  290k   1134
>    nvme     N       N    Y         64K  21.5k  1343  57.4k  3590
>    nvme     N       N    Y         1M   1308   1308  3664   3664
>    nvme     N       Y    N         4K   10.7k  41.8  12.0k  46.9
>    nvme     N       Y    N         64K  5962   373   8598   537
>    nvme     N       Y    N         1M   676    677   1417   1418
>    nvme     Y       N    Y         4K   366k   1430  373k   1456
>    nvme     Y       N    Y         64K  26.7k  1670  56.8k  3547
>    nvme     Y       N    Y         1M   1745   1746  3586   3586
>    nvme     Y       Y    N         4K   59.0k  230   61.2k  239
>    nvme     Y       Y    N         64K  13.0k  813   21.0k  1311
>    nvme     Y       Y    N         1M   683    683   1368   1369
>  
> TODO
>  - Keep on doing stress tests and fixing issues.
>  - Reserve enough space for delalloc metadata blocks and try to drop
>    ext4_nonda_switch().
>  - First support defrag, and then support more of the currently
>    unsupported features and mount options.
> 
> Changes since v3:
>  - Drop the part 1 preparatory patches, which have been merged [4].
>  - Drop the two iomap patches since I've submitted them separately [1].
>  - Fix an incorrect reserved delalloc blocks count and an incorrect
>    extent status cache issue found in the current ext4 code.
>  - Pick out the part 2 preparatory patch series [2]; it makes the
>    ext4_insert_delayed_block() call path support inserting multiple
>    delalloc blocks (also supporting bigalloc) and makes
>    ext4_da_map_blocks() buffer_head unaware.
>  - Adjust and simplify the reserved delalloc blocks updating logic,
>    preparing for reserving metadata blocks for delalloc.
>  - Drop the datasync dirty check in ext4_set_iomap() for buffered
>    read/write, which improves concurrent performance on small I/Os.
>  - Prevent always holding invalid_lock in page_cache_ra_order() by
>    adding a lockless check.
>  - Disable the iomap path by default since it's experimental, and add a
>    mount option "buffered_iomap" to enable it.
>  - Some other minor fixes and change log improvements.
> Changes since v2:
>  - Update patch 1-6 to v3.
>  - iomap_zero and iomap_unshare don't need to update i_size or call
>    iomap_write_failed(), so introduce a new helper,
>    iomap_write_end_simple(), to avoid doing that.
>  - Factor out ext4_[ext|ind]_map_blocks() parts from ext4_map_blocks(),
>    introduce a new helper ext4_iomap_map_one_extent() to allocate
>    delalloc blocks in writeback, which is always under i_data_sem in
>    write mode. This is done to prevent the delalloc extents being
>    written back from becoming stale if raced by truncate.
>  - Add a lock detection in mapping_clear_large_folios().
> Changes since v1:
>  - Introduce seq count for iomap buffered write and writeback to protect
>    races from extents changes, e.g. truncate, mwrite.
>  - Always allocate unwritten extents for new blocks, drop dioread_lock
>    mode, and make no distinctions between dioread_lock and
>    dioread_nolock.
>  - Don't add dirty data ranges to jinode, drop data=ordered mode, and
>    make no distinctions between data=ordered and data=writeback mode.
>  - Postpone updating i_disksize to endio.
>  - Allow splitting extents and use reserved space in endio.
>  - Instead of reimplement a new delayed mapping helper
>    ext4_iomap_da_map_blocks() for buffer write, try to reuse
>    ext4_da_map_blocks().
>  - Add support for disabling large folio on active inodes.
>  - Support online defragmentation, make file fall back to buffer_head
>    and disable large folio in ext4_move_extents().
>  - Move ext4_nonda_switch() in advance to prevent deadlock in mwrite.
>  - Add dirty_len and pos trace info to trace_iomap_writepage_map().
>  - Update patch 1-6 to v2.
> 
> [1] https://lore.kernel.org/linux-xfs/20240320110548.2200662-1-yi.zhang@huaweicloud.com/
> [2] https://lore.kernel.org/linux-ext4/20240410034203.2188357-1-yi.zhang@huaweicloud.com/
> [3] https://lore.kernel.org/linux-ext4/20230824092619.1327976-1-yi.zhang@huaweicloud.com/
> [4] https://lore.kernel.org/linux-ext4/20240105033018.1665752-1-yi.zhang@huaweicloud.com/
> 
> Thanks,
> Yi.
> 
> ---
> v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
> v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
> v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
> 
> Zhang Yi (34):
>   ext4: factor out a common helper to query extent map
>   ext4: check the extent status again before inserting delalloc block
>   ext4: trim delalloc extent
>   ext4: drop iblock parameter
>   ext4: make ext4_es_insert_delayed_block() insert multi-blocks
>   ext4: make ext4_da_reserve_space() reserve multi-clusters
>   ext4: factor out check for whether a cluster is allocated
>   ext4: make ext4_insert_delayed_block() insert multi-blocks
>   ext4: make ext4_da_map_blocks() buffer_head unaware
>   ext4: factor out ext4_map_create_blocks() to allocate new blocks
>   ext4: optimize the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag set
>   ext4: don't set EXTENT_STATUS_DELAYED on allocated blocks
>   ext4: let __revise_pending() return newly inserted pendings
>   ext4: count removed reserved blocks for delalloc only extent entry
>   ext4: update delalloc data reserve space in ext4_es_insert_extent()
>   ext4: drop ext4_es_delayed_clu()
>   ext4: use ext4_map_query_blocks() in ext4_map_blocks()
>   ext4: drop ext4_es_is_delonly()
>   ext4: drop all delonly descriptions
>   ext4: use reserved metadata blocks when splitting extent on endio
>   ext4: introduce seq counter for the extent status entry
>   ext4: add a new iomap aops for regular file's buffered IO path
>   ext4: implement buffered read iomap path
>   ext4: implement buffered write iomap path
>   ext4: implement writeback iomap path
>   ext4: implement mmap iomap path
>   ext4: implement zero_range iomap path
>   ext4: writeback partial blocks before zeroing out range
>   ext4: fall back to buffer_head path for defrag
>   ext4: partial enable iomap for regular file's buffered IO path
>   filemap: support disable large folios on active inode
>   ext4: enable large folio for regular file with iomap buffered IO path
>   ext4: don't mark IOMAP_F_DIRTY for buffer write
>   ext4: add mount option for buffered IO iomap path
> 


^ permalink raw reply	[relevance 0%]

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND]
  2024-02-29  0:20  3% [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND] John Groves
@ 2024-04-23 13:30  0% ` Amir Goldstein
  2024-04-24 12:22  0%   ` John Groves
  0 siblings, 1 reply; 200+ results
From: Amir Goldstein @ 2024-04-23 13:30 UTC (permalink / raw)
  To: John Groves, Miklos Szeredi, Bernd Schubert
  Cc: lsf-pc, Jonathan Corbet, Dan Williams, Vishal Verma, Dave Jiang,
	Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	linux-cxl, linux-fsdevel, nvdimm, Randy Dunlap, Jon Grimm,
	Dave Chinner, john, Bharata B Rao, Jerome Glisse, gregory.price,
	Ajay Joshi, Aneesh Kumar K . V, Alistair Popple,
	Christoph Hellwig, Zi Yan, David Rientjes, Ravi Shankar,
	dave.hansen, John Hubbard, mykolal, Brian Morris,
	Eishan Mirakhur, Wei Xu, Theodore Ts'o,
	Srinivasulu Thanneeru, John Groves, Christoph Lameter,
	Johannes Weiner, Andrew Morton, Aravind Ramesh

On Thu, Feb 29, 2024 at 2:20 AM John Groves <John@groves.net> wrote:
>
> John Groves, Micron
>
> Micron recently released the first RFC for famfs [1]. Although famfs is not
> CXL-specific in any way, it aims to enable hosts to share data sets in shared
> memory (such as CXL) by providing a memory-mappable fs-dax file system
> interface to the memory.
>
> Sharable disaggregated memory already exists in the lab, and will be possible
> in the wild soon. Famfs aims to do the following:
>
> * Provide an access method that provides isolation between files, and does not
>   tempt developers to mmap all the memory writable on every host.
> > * Provide an access method that can be used by unmodified apps.
>
> Without something like famfs, enabling the use of sharable memory will involve
> the temptation to do things that may destabilize systems, such as
> mapping large shared, writable global memory ranges and hooking allocators to
> use it (potentially sacrificing isolation), and forcing the same virtual
> address ranges in every host/process (compromising security).
>
> The most obvious candidate app categories are data analytics and data lakes.
> Both make heavy use of "zero-copy" data frames - column oriented data that
> is laid out for efficient use via (MAP_SHARED) mmap. Moreover, these use case
> categories are generally driven by python code that wrangles data into
> appropriate data frames - making it straightforward to put the data frames
> into famfs. Furthermore, these use cases usually involve the shared data being
> read-only during computation or query jobs - meaning they are often free of
> cache coherency concerns.
>
> Workloads such as these often deal with data sets that are too large to fit
> in a single server's memory, so the data gets sharded - requiring movement via
> a network. Sharded apps also sometimes have to do expensive reshuffling -
> moving data to nodes with available compute resources. Avoiding the sharding
> overheads by accessing such data sets in disaggregated shared memory looks
> promising to make make better use of memory and compute resources, and by
> effectively de-duplicating data sets in memory.
>
> About sharable memory
>
> * Shared memory is pmem-like, in that hosts will connect in order to access
>   pre-existing contents
> * Onlining sharable memory as system-ram is nonsense; system-ram gets zeroed...
> * CXL 3 provides for optionally-supported hardware-managed cache coherency
> * But "multiple-readers, no writers" use cases don't need hardware support
>   for coherency
> * CXL 3.1 dynamic capacity devices (DCDs) should be thought of as devices with
>   an allocator built in.
> * When sharable capacity is allocated, each host that has access will see a
>   /dev/dax device that can be found by the "tag" of the allocation. The tag is
>   just a uuid.
> * CXL 3.1 also allows the capacity associated with any allocated tag to be
>   provided to each host (or host group) as either writable or read-only.
>
> About famfs
>
> Famfs is an append-only log-structured file system that places many limits
> on what can be done. This allows famfs to tolerate clients with a stale copy
> of metadata. All memory allocation and log maintenance is performed from user
> space, but file extent lists are cached in the kernel for fast fault
> resolution. The current limitations are fairly extreme, but many can be relaxed
> by writing more code, managing Byzantine generals, etc. ;)
>
> A famfs-enabled kernel can be cloned at [3], and the user space repo can be
> cloned at [4]. Even with major functional limitations in its current form
> (e.g. famfs does not currently support deleting files), it is sufficient to
> use in data analytics workloads - in which you 1) create a famfs file system,
> 2) dump data sets into it, 3) run clustered jobs that consume the shared data
> sets, and 4) dismount and deallocate the memory containing the file system.
>
> Famfs Open Issues
>
> * Volatile CXL memory is exposed as character dax devices; the famfs patch
>   set adds the iomap API, which is required for fs-dax but until now missing
>   from character dax.
> * (/dev/pmem devices are block, and support the iomap api for fs-dax file
>   systems)
> * /dev/pmem devices can be converted to /dev/dax mode, but native /dev/dax
>   devices cannot be converted to pmem mode.
> * /dev/dax devices lack the iomap api that fs-dax uses with pmem, so the famfs
>   patch set adds that.
> * VFS layer hooks for a file system on a character device may be needed.
> * Famfs has uncovered some previously latent bugs in the /dev/dax mmap
>   machinery that probably require attention.
> * Famfs currently works with either pmem or devdax devices, but our
> >   inclination is to drop pmem support to reduce the complexity of supporting
>   two different underlying device types - particularly since famfs is not
>   intended for actual pmem.
>
>
> Required :-
> Dan Williams
> Christian Brauner
> Jonathan Cameron
> Dave Hansen
>
> [LSF/MM + BPF ATTEND]
>
> I am the author of the famfs file system. Famfs was first introduced at LPC
> 2023 [2]. I'm also Micron's voting member on the Software and Systems Working
> Group (SSWG) of the CXL Consortium, and a co-author of the CXL 3.1
> specification.
>
>
> References
>
> [1] https://lore.kernel.org/linux-fsdevel/cover.1708709155.git.john@groves.net/#t
> [2] https://lpc.events/event/17/contributions/1455/
> [3] https://www.computeexpresslink.org/download-the-specification
> [4] https://github.com/cxl-micron-reskit/famfs-linux
>

Hi John,

Following our correspondence on your patch set [1], I am not sure that the
details of the famfs file system itself are an interesting topic for the
LSFMM crowd.
What I would like to do is schedule a session on:
"Famfs: new userspace filesystem driver vs. improving FUSE/DAX"

I am hoping that Miklos and Bernd will be able to participate in this
session remotely.

You see the last time that someone tried to introduce a specialized
faster FUSE replacement [2], the comments from the community were
that FUSE protocol can and should be improved instead of introducing
another "filesystem in userspace" protocol.

Since 2019, FUSE has gained virtiofs/dax support, it recently gained
FUSE passthrough support and Bernd is working on FUSE uring [3].

My hope is that you will be able to list the needed improvements
to /dev/dax iomap and FUSE so that you could use the existing
kernel infrastructure and FUSE libraries to implement famfs.

How does that sound for a discussion?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/3jwluwrqj6rwsxdsksfvdeo5uccgmnkh7rgefaeyxf2gu75344@ybhwncywkftx/
[2] https://lore.kernel.org/linux-fsdevel/8d119597-4543-c6a4-917f-14f4f4a6a855@netapp.com/
[3] https://lore.kernel.org/linux-fsdevel/20230321011047.3425786-1-bschubert@ddn.com/

^ permalink raw reply	[relevance 0%]

* [PATCH v15 01/11] landlock: Add IOCTL access right for character and block devices
  2024-04-19 16:11  2% [PATCH v15 00/11] Landlock: IOCTL support Günther Noack
@ 2024-04-19 16:11  6% ` Günther Noack
  2024-05-08 10:40  0% ` [PATCH v15 00/11] Landlock: IOCTL support Mickaël Salaün
  1 sibling, 0 replies; 200+ results
From: Günther Noack @ 2024-04-19 16:11 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Jeff Xu, Arnd Bergmann, Jorge Lucangeli Obes, Allen Webb,
	Dmitry Torokhov, Paul Moore, Konstantin Meskhidze,
	Matt Bobrowski, linux-fsdevel, Günther Noack,
	Christian Brauner

Introduces the LANDLOCK_ACCESS_FS_IOCTL_DEV right
and increments the Landlock ABI version to 5.

This access right applies to device-custom IOCTL commands
when they are invoked on block or character device files.

Like the truncate right, this right is associated with a file
descriptor at the time of open(2), and gets respected even when the
file descriptor is used outside of the thread which it was originally
opened in.

Therefore, a newly enabled Landlock policy does not apply to file
descriptors which are already open.

If the LANDLOCK_ACCESS_FS_IOCTL_DEV right is handled, only a small
number of safe IOCTL commands will be permitted on newly opened device
files.  These include FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC, as well
as other IOCTL commands for regular files which are implemented in
fs/ioctl.c.

Noteworthy scenarios which require special attention:

TTY devices are often passed into a process from the parent process,
and so a newly enabled Landlock policy does not retroactively apply to
them automatically.  In the past, TTY devices have often supported
IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which let
callers control the TTY input buffer (and simulate keypresses).  This
should be restricted to CAP_SYS_ADMIN programs on modern kernels,
though.

Known limitations:

The LANDLOCK_ACCESS_FS_IOCTL_DEV access right is a coarse-grained
control over IOCTL commands.

Landlock users may use path-based restrictions in combination with
their knowledge about the file system layout to control what IOCTLs
can be done.

Cc: Paul Moore <paul@paul-moore.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Günther Noack <gnoack@google.com>
---
 include/uapi/linux/landlock.h                |  38 +++-
 security/landlock/fs.c                       | 225 ++++++++++++++++++-
 security/landlock/limits.h                   |   2 +-
 security/landlock/syscalls.c                 |   2 +-
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   |   5 +-
 6 files changed, 258 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
index 25c8d7677539..68625e728f43 100644
--- a/include/uapi/linux/landlock.h
+++ b/include/uapi/linux/landlock.h
@@ -128,7 +128,7 @@ struct landlock_net_port_attr {
  * files and directories.  Files or directories opened before the sandboxing
  * are not subject to these restrictions.
  *
- * A file can only receive these access rights:
+ * The following access rights apply only to files:
  *
  * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
  * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
@@ -138,12 +138,13 @@ struct landlock_net_port_attr {
  * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
  * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
  *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
- *   ``O_TRUNC``. Whether an opened file can be truncated with
- *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
- *   same way as read and write permissions are checked during
- *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
- *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
- *   third version of the Landlock ABI.
+ *   ``O_TRUNC``.  This access right is available since the third version of the
+ *   Landlock ABI.
+ *
+ * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
+ * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
+ * read and write permissions are checked during :manpage:`open(2)` using
+ * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
  *
  * A directory can receive access rights related to files or directories.  The
  * following access right is applied to the directory itself, and the
@@ -198,13 +199,33 @@ struct landlock_net_port_attr {
  *   If multiple requirements are not met, the ``EACCES`` error code takes
  *   precedence over ``EXDEV``.
  *
+ * The following access right applies both to files and directories:
+ *
+ * - %LANDLOCK_ACCESS_FS_IOCTL_DEV: Invoke :manpage:`ioctl(2)` commands on an opened
+ *   character or block device.
+ *
+ *   This access right applies to all `ioctl(2)` commands implemented by device
+ *   drivers.  However, the following common IOCTL commands continue to be
+ *   invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL_DEV right:
+ *
+ *   * IOCTL commands targeting file descriptors (``FIOCLEX``, ``FIONCLEX``),
+ *   * IOCTL commands targeting file descriptions (``FIONBIO``, ``FIOASYNC``),
+ *   * IOCTL commands targeting file systems (``FIFREEZE``, ``FITHAW``,
+ *     ``FIGETBSZ``, ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``)
+ *   * Some IOCTL commands which do not make sense when used with devices, but
+ *     whose implementations are safe and return the right error codes
+ *     (``FS_IOC_FIEMAP``, ``FICLONE``, ``FICLONERANGE``, ``FIDEDUPERANGE``)
+ *
+ *   This access right is available since the fifth version of the Landlock
+ *   ABI.
+ *
  * .. warning::
  *
  *   It is currently not possible to restrict some file-related actions
  *   accessible through these syscall families: :manpage:`chdir(2)`,
  *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
  *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
- *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
+ *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
  *   Future Landlock evolutions will enable to restrict them.
  */
 /* clang-format off */
@@ -223,6 +244,7 @@ struct landlock_net_port_attr {
 #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
 #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
 #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
+#define LANDLOCK_ACCESS_FS_IOCTL_DEV			(1ULL << 15)
 /* clang-format on */
 
 /**
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index c15559432d3d..22d8b7c28074 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -5,8 +5,11 @@
  * Copyright © 2016-2020 Mickaël Salaün <mic@digikod.net>
  * Copyright © 2018-2020 ANSSI
  * Copyright © 2021-2022 Microsoft Corporation
+ * Copyright © 2022 Günther Noack <gnoack3000@gmail.com>
+ * Copyright © 2023-2024 Google LLC
  */
 
+#include <asm/ioctls.h>
 #include <kunit/test.h>
 #include <linux/atomic.h>
 #include <linux/bitops.h>
@@ -14,6 +17,7 @@
 #include <linux/compiler_types.h>
 #include <linux/dcache.h>
 #include <linux/err.h>
+#include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -29,6 +33,7 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/workqueue.h>
+#include <uapi/linux/fiemap.h>
 #include <uapi/linux/landlock.h>
 
 #include "common.h"
@@ -84,6 +89,160 @@ static const struct landlock_object_underops landlock_fs_underops = {
 	.release = release_inode
 };
 
+/* IOCTL helpers */
+
+/**
+ * is_masked_device_ioctl - Determine whether an IOCTL command is always
+ * permitted with Landlock for device files.  These commands can not be
+ * restricted on device files by enforcing a Landlock policy.
+ *
+ * @cmd: The IOCTL command that is supposed to be run.
+ *
+ * By default, any IOCTL on a device file requires the
+ * LANDLOCK_ACCESS_FS_IOCTL_DEV right.  However, we blanket-permit some
+ * commands, if:
+ *
+ * 1. The command is implemented in fs/ioctl.c's do_vfs_ioctl(),
+ *    not in f_ops->unlocked_ioctl() or f_ops->compat_ioctl().
+ *
+ * 2. The command is harmless when invoked on devices.
+ *
+ * We also permit commands that do not make sense for devices, but where the
+ * do_vfs_ioctl() implementation returns a more conventional error code.
+ *
+ * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
+ * should be considered for inclusion here.
+ *
+ * Returns: true if the IOCTL @cmd can not be restricted with Landlock for
+ * device files.
+ */
+static __attribute_const__ bool is_masked_device_ioctl(const unsigned int cmd)
+{
+	switch (cmd) {
+	/*
+	 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
+	 * close-on-exec and the file's buffered-IO and async flags.  These
+	 * operations are also available through fcntl(2), and are
+	 * unconditionally permitted in Landlock.
+	 */
+	case FIOCLEX:
+	case FIONCLEX:
+	case FIONBIO:
+	case FIOASYNC:
+	/*
+	 * FIOQSIZE queries the size of a regular file, directory, or link.
+	 *
+	 * We still permit it, because it always returns -ENOTTY for
+	 * other file types.
+	 */
+	case FIOQSIZE:
+	/*
+	 * FIFREEZE and FITHAW freeze and thaw the file system which the
+	 * given file belongs to.  Requires CAP_SYS_ADMIN.
+	 *
+	 * These commands operate on the file system's superblock rather
+	 * than on the file itself.  The same operations can also be
+	 * done through any other file or directory on the same file
+	 * system, so it is safe to permit these.
+	 */
+	case FIFREEZE:
+	case FITHAW:
+	/*
+	 * FS_IOC_FIEMAP queries information about the allocation of
+	 * blocks within a file.
+	 *
+	 * This IOCTL command only makes sense for regular files and is
+	 * not implemented by devices. It is harmless to permit.
+	 */
+	case FS_IOC_FIEMAP:
+	/*
+	 * FIGETBSZ queries the file system's block size for a file or
+	 * directory.
+	 *
+	 * This command operates on the file system's superblock rather
+	 * than on the file itself.  The same operation can also be done
+	 * through any other file or directory on the same file system,
+	 * so it is safe to permit it.
+	 */
+	case FIGETBSZ:
+	/*
+	 * FICLONE, FICLONERANGE and FIDEDUPERANGE make files share
+	 * their underlying storage ("reflink") between source and
+	 * destination FDs, on file systems which support that.
+	 *
+	 * These IOCTL commands only apply to regular files
+	 * and are harmless to permit for device files.
+	 */
+	case FICLONE:
+	case FICLONERANGE:
+	case FIDEDUPERANGE:
+	/*
+	 * FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH both operate on
+	 * the file system superblock, not on the specific file, so
+	 * these operations are available through any other file on the
+	 * same file system as well.
+	 */
+	case FS_IOC_GETFSUUID:
+	case FS_IOC_GETFSSYSFSPATH:
+		return true;
+
+	/*
+	 * FIONREAD, FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_FSGETXATTR and
+	 * FS_IOC_FSSETXATTR are forwarded to device implementations.
+	 */
+
+	/*
+	 * file_ioctl() commands (FIBMAP, FS_IOC_RESVSP, FS_IOC_RESVSP64,
+	 * FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64 and FS_IOC_ZERO_RANGE) are
+	 * forwarded to device implementations, so not permitted.
+	 */
+
+	/* Other commands are guarded by the access right. */
+	default:
+		return false;
+	}
+}
+
+/*
+ * is_masked_device_ioctl_compat - same as the helper above, but checking the
+ * "compat" IOCTL commands.
+ *
+ * The IOCTL commands with special handling in compat-mode should behave the
+ * same as their non-compat counterparts.
+ */
+static __attribute_const__ bool
+is_masked_device_ioctl_compat(const unsigned int cmd)
+{
+	switch (cmd) {
+	/* FICLONE is permitted, same as in the non-compat variant. */
+	case FICLONE:
+		return true;
+
+#if defined(CONFIG_X86_64)
+	/*
+	 * FS_IOC_RESVSP_32, FS_IOC_RESVSP64_32, FS_IOC_UNRESVSP_32,
+	 * FS_IOC_UNRESVSP64_32, FS_IOC_ZERO_RANGE_32: not blanket-permitted,
+	 * for consistency with their non-compat variants.
+	 */
+	case FS_IOC_RESVSP_32:
+	case FS_IOC_RESVSP64_32:
+	case FS_IOC_UNRESVSP_32:
+	case FS_IOC_UNRESVSP64_32:
+	case FS_IOC_ZERO_RANGE_32:
+#endif
+
+	/*
+	 * FS_IOC32_GETFLAGS, FS_IOC32_SETFLAGS are forwarded to their device
+	 * implementations.
+	 */
+	case FS_IOC32_GETFLAGS:
+	case FS_IOC32_SETFLAGS:
+		return false;
+	default:
+		return is_masked_device_ioctl(cmd);
+	}
+}
+
 /* Ruleset management */
 
 static struct landlock_object *get_inode_object(struct inode *const inode)
@@ -148,7 +307,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
 	LANDLOCK_ACCESS_FS_EXECUTE | \
 	LANDLOCK_ACCESS_FS_WRITE_FILE | \
 	LANDLOCK_ACCESS_FS_READ_FILE | \
-	LANDLOCK_ACCESS_FS_TRUNCATE)
+	LANDLOCK_ACCESS_FS_TRUNCATE | \
+	LANDLOCK_ACCESS_FS_IOCTL_DEV)
 /* clang-format on */
 
 /*
@@ -1332,11 +1492,18 @@ static int hook_file_alloc_security(struct file *const file)
 	return 0;
 }
 
+static bool is_device(const struct file *const file)
+{
+	const struct inode *inode = file_inode(file);
+
+	return S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
+}
+
 static int hook_file_open(struct file *const file)
 {
 	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
-	access_mask_t open_access_request, full_access_request, allowed_access;
-	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
+	access_mask_t open_access_request, full_access_request, allowed_access,
+		optional_access;
 	const struct landlock_ruleset *const dom =
 		get_fs_domain(landlock_cred(file->f_cred)->domain);
 
@@ -1354,6 +1521,10 @@ static int hook_file_open(struct file *const file)
 	 * We look up more access than what we immediately need for open(), so
 	 * that we can later authorize operations on opened files.
 	 */
+	optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
+	if (is_device(file))
+		optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
+
 	full_access_request = open_access_request | optional_access;
 
 	if (is_access_to_paths_allowed(
@@ -1410,6 +1581,52 @@ static int hook_file_truncate(struct file *const file)
 	return -EACCES;
 }
 
+static int hook_file_ioctl(struct file *file, unsigned int cmd,
+			   unsigned long arg)
+{
+	access_mask_t allowed_access = landlock_file(file)->allowed_access;
+
+	/*
+	 * It is the access rights at the time of opening the file which
+	 * determine whether IOCTL can be used on the opened file later.
+	 *
+	 * The access right is attached to the opened file in hook_file_open().
+	 */
+	if (allowed_access & LANDLOCK_ACCESS_FS_IOCTL_DEV)
+		return 0;
+
+	if (!is_device(file))
+		return 0;
+
+	if (is_masked_device_ioctl(cmd))
+		return 0;
+
+	return -EACCES;
+}
+
+static int hook_file_ioctl_compat(struct file *file, unsigned int cmd,
+				  unsigned long arg)
+{
+	access_mask_t allowed_access = landlock_file(file)->allowed_access;
+
+	/*
+	 * It is the access rights at the time of opening the file which
+	 * determine whether IOCTL can be used on the opened file later.
+	 *
+	 * The access right is attached to the opened file in hook_file_open().
+	 */
+	if (allowed_access & LANDLOCK_ACCESS_FS_IOCTL_DEV)
+		return 0;
+
+	if (!is_device(file))
+		return 0;
+
+	if (is_masked_device_ioctl_compat(cmd))
+		return 0;
+
+	return -EACCES;
+}
+
 static struct security_hook_list landlock_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(inode_free_security, hook_inode_free_security),
 
@@ -1432,6 +1649,8 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(file_alloc_security, hook_file_alloc_security),
 	LSM_HOOK_INIT(file_open, hook_file_open),
 	LSM_HOOK_INIT(file_truncate, hook_file_truncate),
+	LSM_HOOK_INIT(file_ioctl, hook_file_ioctl),
+	LSM_HOOK_INIT(file_ioctl_compat, hook_file_ioctl_compat),
 };
 
 __init void landlock_add_fs_hooks(void)
diff --git a/security/landlock/limits.h b/security/landlock/limits.h
index 93c9c6f91556..20fdb5ff3514 100644
--- a/security/landlock/limits.h
+++ b/security/landlock/limits.h
@@ -18,7 +18,7 @@
 #define LANDLOCK_MAX_NUM_LAYERS		16
 #define LANDLOCK_MAX_NUM_RULES		U32_MAX
 
-#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_TRUNCATE
+#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_IOCTL_DEV
 #define LANDLOCK_MASK_ACCESS_FS		((LANDLOCK_LAST_ACCESS_FS << 1) - 1)
 #define LANDLOCK_NUM_ACCESS_FS		__const_hweight64(LANDLOCK_MASK_ACCESS_FS)
 #define LANDLOCK_SHIFT_ACCESS_FS	0
diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
index 6788e73b6681..03b470f5a85a 100644
--- a/security/landlock/syscalls.c
+++ b/security/landlock/syscalls.c
@@ -149,7 +149,7 @@ static const struct file_operations ruleset_fops = {
 	.write = fop_dummy_write,
 };
 
-#define LANDLOCK_ABI_VERSION 4
+#define LANDLOCK_ABI_VERSION 5
 
 /**
  * sys_landlock_create_ruleset - Create a new ruleset
diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
index a6f89aaea77d..3c1e9f35b531 100644
--- a/tools/testing/selftests/landlock/base_test.c
+++ b/tools/testing/selftests/landlock/base_test.c
@@ -75,7 +75,7 @@ TEST(abi_version)
 	const struct landlock_ruleset_attr ruleset_attr = {
 		.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
 	};
-	ASSERT_EQ(4, landlock_create_ruleset(NULL, 0,
+	ASSERT_EQ(5, landlock_create_ruleset(NULL, 0,
 					     LANDLOCK_CREATE_RULESET_VERSION));
 
 	ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0,
diff --git a/tools/testing/selftests/landlock/fs_test.c b/tools/testing/selftests/landlock/fs_test.c
index 9a6036fbf289..418ad745a5dd 100644
--- a/tools/testing/selftests/landlock/fs_test.c
+++ b/tools/testing/selftests/landlock/fs_test.c
@@ -529,9 +529,10 @@ TEST_F_FORK(layout1, inval)
 	LANDLOCK_ACCESS_FS_EXECUTE | \
 	LANDLOCK_ACCESS_FS_WRITE_FILE | \
 	LANDLOCK_ACCESS_FS_READ_FILE | \
-	LANDLOCK_ACCESS_FS_TRUNCATE)
+	LANDLOCK_ACCESS_FS_TRUNCATE | \
+	LANDLOCK_ACCESS_FS_IOCTL_DEV)
 
-#define ACCESS_LAST LANDLOCK_ACCESS_FS_TRUNCATE
+#define ACCESS_LAST LANDLOCK_ACCESS_FS_IOCTL_DEV
 
 #define ACCESS_ALL ( \
 	ACCESS_FILE | \
-- 
2.44.0.769.g3c40516874-goog


^ permalink raw reply related	[relevance 6%]

* [PATCH v15 00/11] Landlock: IOCTL support
@ 2024-04-19 16:11  2% Günther Noack
  2024-04-19 16:11  6% ` [PATCH v15 01/11] landlock: Add IOCTL access right for character and block devices Günther Noack
  2024-05-08 10:40  0% ` [PATCH v15 00/11] Landlock: IOCTL support Mickaël Salaün
  0 siblings, 2 replies; 200+ results
From: Günther Noack @ 2024-04-19 16:11 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Jeff Xu, Arnd Bergmann, Jorge Lucangeli Obes, Allen Webb,
	Dmitry Torokhov, Paul Moore, Konstantin Meskhidze,
	Matt Bobrowski, linux-fsdevel, Günther Noack

Hello!

These patches add simple ioctl(2) support to Landlock.

Objective
~~~~~~~~~

Make ioctl(2) requests for device files restrictable with Landlock,
in a way that is useful for real-world applications.

Proposed approach
~~~~~~~~~~~~~~~~~

Introduce the LANDLOCK_ACCESS_FS_IOCTL_DEV right, which restricts the
use of ioctl(2) on block and character devices.

We attach this access right to opened file descriptors, as we
already do for LANDLOCK_ACCESS_FS_TRUNCATE.

If LANDLOCK_ACCESS_FS_IOCTL_DEV is handled (restricted in the
ruleset), the LANDLOCK_ACCESS_FS_IOCTL_DEV right governs the use of
all device-specific IOCTL commands.  We make exceptions for common and
known-harmless IOCTL commands such as FIOCLEX, FIONCLEX, FIONBIO and
FIOASYNC, as well as other IOCTL commands which are implemented in
fs/ioctl.c.  A full list of these IOCTL commands is listed in the
documentation.

I believe that this approach works for the majority of use cases, and
offers a good trade-off between the complexity of the Landlock API and
its implementation on one side, and the flexibility available when the
feature is used on the other.

Current limitations
~~~~~~~~~~~~~~~~~~~

With this patch set, ioctl(2) requests can *not* be filtered based on
file type, device number (dev_t) or on the ioctl(2) request number.

On the initial RFC patch set [1], we have reached consensus to start
with this simpler coarse-grained approach, and build additional IOCTL
restriction capabilities on top in subsequent steps.

[1] https://lore.kernel.org/linux-security-module/d4f1395c-d2d4-1860-3a02-2a0c023dd761@digikod.net/

Notable implications of this approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* A process's existing open file descriptors stay unaffected
  when a process enables Landlock.

  This means that in common scenarios, where the terminal file
  descriptor is inherited from the parent process, the terminal's
  IOCTLs (ioctl_tty(2)) continue to work.

* ioctl(2) continues to be available for file descriptors for
  non-device files.  Example: Network sockets, memfd_create(2),
  regular files and directories.

Examples
~~~~~~~~

Starting a sandboxed shell from $HOME with samples/landlock/sandboxer:

  LL_FS_RO=/ LL_FS_RW=. ./sandboxer /bin/bash

The LANDLOCK_ACCESS_FS_IOCTL_DEV right is part of the "read-write"
rights here, so we expect that newly opened device files outside of
$HOME don't work with most IOCTL commands.

  * "stty" works: It probes terminal properties

  * "stty </dev/tty" fails: /dev/tty can be reopened, but the IOCTL is
    denied.

  * "eject" fails: ioctls to use CD-ROM drive are denied.

  * "ls /dev" works: It uses ioctl to get the terminal size for
    columnar layout

  * The text editors "vim" and "mg" work.  (GNU Emacs fails because it
    attempts to reopen /dev/tty.)

Unaffected IOCTL commands
~~~~~~~~~~~~~~~~~~~~~~~~~

To decide which IOCTL commands should be blanket-permitted, we went
through the list of IOCTL commands which are handled directly in
fs/ioctl.c and looked at them individually to understand what they are
about.

The following commands are permitted by Landlock unconditionally:

 * FIOCLEX, FIONCLEX - these work on the file descriptor and
   manipulate the close-on-exec flag (also available through
   fcntl(2) with F_SETFD)
 * FIONBIO, FIOASYNC - these work on the struct file and enable
   nonblocking-IO and async flags (also available through
   fcntl(2) with F_SETFL)

The following commands are also unconditionally permitted by Landlock, because
they are really operating on the file system's superblock, rather than on the
file itself (the same functionality is also available from any other file on the
same file system):

 * FIFREEZE, FITHAW - work on superblock(!) to freeze/thaw the file
   system. Requires CAP_SYS_ADMIN.
 * FIGETBSZ - get file system blocksize
 * FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH - getting file system properties

Notably, the command FIONREAD is *not* blanket-permitted,
because it would be a device-specific implementation.

Detailed reasoning about each IOCTL command from fs/ioctl.c is in
is_masked_device_ioctl() in security/landlock/fs.c.


Related Work
~~~~~~~~~~~~

OpenBSD's pledge(2) [2] restricts ioctl(2) independent of the file
descriptor which is used.  The implementers maintain multiple
allow-lists of predefined ioctl(2) operations required for different
application domains such as "audio", "bpf", "tty" and "inet".

OpenBSD does not guarantee backwards compatibility to the same extent
as Linux does, so it's easier for them to update these lists in later
versions.  It might not be a feasible approach for Linux though.

[2] https://man.openbsd.org/OpenBSD-7.4/pledge.2


Implementation Rationale
~~~~~~~~~~~~~~~~~~~~~~~~

A main constraint of this implementation is that the blanket-permitted
IOCTL commands for device files should never dispatch to the
device-specific implementations in f_ops->unlocked_ioctl() and
f_ops->compat_ioctl().

There are many implementations of these f_ops operations and they are
too scattered across the kernel to give strong guarantees about them.
Additionally, some existing implementations do work before even
checking whether they support the cmd number which was passed to them.


In this implementation, we are listing the blanket-permitted IOCTL
commands in the Landlock implementation, mirroring a subset of the
IOCTL commands which are directly implemented in do_vfs_ioctl() in
fs/ioctl.c.  The trade-off is that the Landlock LSM needs to track
future developments in fs/ioctl.c to keep up to date with that, in
particular when new IOCTL commands are introduced there, or when they
are moved there from the f_ops implementations.

We mitigate this risk in this patch set by adding fs/ioctl.c to the
paths that are relevant to Landlock in the MAINTAINERS file.

The trade-off is discussed in more detail in [3].


Previous versions of this patch set have used different implementation
approaches to guarantee the main constraint above, which we have
dismissed due to the following reasons:

* V10: Introduced a new LSM hook file_vfs_ioctl, which gets invoked
  just before the call to f_ops->unlocked_ioctl().

  Not done, because it would have created an avoidable overlap between
  the file_ioctl and file_vfs_ioctl LSM hooks [4].

* V11: Introduced an indirection layer in fs/ioctl.c, so that Landlock
  could figure out the list of IOCTL commands which are handled by
  do_vfs_ioctl().

  Not done due to additional indirection and possible performance
  impact in fs/ioctl.c [5].

* V12: Introduced a special error code to be returned from the
  file_ioctl hook, and matching logic that would disallow the call to
  f_ops->unlocked_ioctl() in case that this error code is returned.

  Not done because this approach would conflict with Landlock's
  planned audit logging [6] and because LSM hooks with special error
  codes are generally discouraged and have led to problems in the
  past [7].

Thanks to Arnd Bergmann, Christian Brauner, Kent Overstreet, Mickaël Salaün and
Paul Moore for guiding this implementation on the right track!

[3] https://lore.kernel.org/all/ZgLJG0aN0psur5Z7@google.com/
[4] https://lore.kernel.org/all/CAHC9VhRojXNSU9zi2BrP8z6JmOmT3DAqGNtinvvz=tL1XhVdyg@mail.gmail.com/
[5] https://lore.kernel.org/all/32b1164e-9d5f-40c0-9a4e-001b2c9b822f@app.fastmail.com
[6] https://lore.kernel.org/all/20240326.ahyaaPa0ohs6@digikod.net
[7] https://lore.kernel.org/all/CAHC9VhQJFWYeheR-EqqdfCq0YpvcQX5Scjfgcz1q+jrWg8YsdA@mail.gmail.com/


Changes
~~~~~~~

V15:
 * Drop the commit about FS_IOC_GETFSUUID / FS_IOC_GETFSSYSFSPATH --
   it is already assumed as a prerequisite now.
 * security/landlock/fs.c:
   * Add copyright notice for my contributions (also for the truncate
     patch set)
 * Tests:
   * In commit "Test IOCTL support":
     * Test with /dev/zero instead of /dev/tty
     * Check only FIONREAD instead of both FIONREAD and TCGETS
     * Remove a now-unused SKIP()
   * In test for Named UNIX Domain Sockets:
     * Do not inline variable assignments in ASSERT() usages
   * In commit "Exhaustive test for the IOCTL allow-list":
     * Make IOCTL results deterministic:
       * Zero the input buffer
       * Close FD 0 for the ioctl() call, to avoid accidentally using it
 * Cosmetic changes and cleanups
   * Remove a leftover mention of "synthetic" access rights
   * Fix docstring format for is_masked_device_ioctl()
   * Newline and comment ordering cleanups as discussed in v14 review

V14:
 * Revise which IOCTLs are permitted.
   It is almost the same as the vfs_masked_device_ioctl() hooks from
   https://lore.kernel.org/all/20240219183539.2926165-1-mic@digikod.net/,
   with the following differences:
   * Added cases for FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH
   * Do not blanket-permit FS_IOC_{GET,SET}{FLAGS,XATTR}.
     They fall back to the device implementation.
 * fs/ioctl:
   * Small prerequisite change so that FS_IOC_GETFSUUID and
     FS_IOC_GETFSSYSFSPATH do not fall back to the device implementation.
   * Slightly rephrase wording in the warning above do_vfs_ioctl().
 * Implement compat handler
 * Improve UAPI header documentation
 * Code structure
   * Change helper function style to return a boolean
   * Reorder structure of the IOCTL hooks (much cleaner now -- thanks for the
     hint, Mickaël!)
   * Extract is_device() helper

V13:
 * Using the existing file_ioctl hook and a hardcoded list of IOCTL commands.
   (See the section on implementation rationale above.)
 * Add support for FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH.
   
V12:
 * Rebased on Arnd's proposal:
   https://lore.kernel.org/all/32b1164e-9d5f-40c0-9a4e-001b2c9b822f@app.fastmail.com/
   This means that:
   * the IOCTL security hooks can return a special value ENOFILEOPS,
     which is treated specially in fs/ioctl.c to permit the IOCTL,
     but only as long as it does not call f_ops->unlocked_ioctl or
     f_ops->compat_ioctl.
 * The only change compared to V11 is commit 1, as well as a small
   adaptation in the commit 2 (The Landlock implementation needs to
   return the new special value).  The tests and documentation commits
   are exactly the same as before.

V11:
 * Rebased on Mickaël's proposal to refactor fs/ioctl.c:
   https://lore.kernel.org/all/20240315145848.1844554-1-mic@digikod.net/
   This means that:
   * we do not add the file_vfs_ioctl() hook as in V10
   * we add vfs_get_ioctl_handler() instead, so that Landlock
     can query which of the IOCTL commands in handled in do_vfs_ioctl()

   That proposal is used here unmodified (except for minor typos in the commit
   description).
 * Use the hook_ioctl_compat LSM hook as well.

V10:
 * Major change: only restrict IOCTL invocations on device files
   * Rename access right to LANDLOCK_ACCESS_FS_IOCTL_DEV
   * Remove the notion of synthetic access rights and IOCTL right groups
 * Introduce a new LSM hook file_vfs_ioctl, which gets invoked just
   before the call to f_ops->unlocked_ioctl()
 * Documentation
   * Various complications were removed or simplified:
     * Suggestion to mount file systems as nodev is not needed any more,
       as Landlock already lets users distinguish device files.
     * Remarks about fscrypt were removed.  The fscrypt-related IOCTLs only
       applied to regular files and directories, so this patch does not affect
       them any more.
     * Various documentation of the IOCTL grouping approach was removed,
       as it's not needed any more.

V9:
 * in “landlock: Add IOCTL access right”:
   * Change IOCTL group names and grouping as discussed with Mickaël.
     This makes the grouping coarser, and we occasionally rely on the
     underlying implementation to perform the appropriate read/write
     checks.
     * Group IOCTL_RW (one of READ_FILE, WRITE_FILE or READ_DIR):
       FIONREAD, FIOQSIZE, FIGETBSZ
     * Group IOCTL_RWF (one of READ_FILE or WRITE_FILE):
       FS_IOC_FIEMAP, FIBMAP, FIDEDUPERANGE, FICLONE, FICLONERANGE,
       FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64,
       FS_IOC_ZERO_RANGE
   * Exempt pipe file descriptors from IOCTL restrictions,
     even for named pipes which are opened from the file system.
     This is to be consistent with anonymous pipes created with pipe(2).
     As discussed in https://lore.kernel.org/r/ZP7lxmXklksadvz+@google.com
   * Document rationale for the IOCTL grouping in the code
   * Use __attribute_const__
   * Rename required_ioctl_access() to get_required_ioctl_access()
 * Selftests
   * Simplify IOCTL test fixtures as a result of simpler grouping.
   * Test that IOCTLs are permitted on named pipe FDs.
   * Test that IOCTLs are permitted on named Unix Domain Socket FDs.
   * Work around compilation issue with old GCC / glibc.
     https://sourceware.org/glibc/wiki/Synchronizing_Headers
     Thanks to Huyadi <hu.yadi@h3c.com>, who pointed this out in
     https://lore.kernel.org/all/f25be6663bcc4608adf630509f045a76@h3c.com/
     and Mickaël, who fixed it through #include reordering.
 * Documentation changes
   * Reword "IOCTL commands" section a bit
   * s/permit/allow/
   * s/access right/right/, if preceded by LANDLOCK_ACCESS_FS_*
   * s/IOCTL/FS_IOCTL/ in ASCII table
   * Update IOCTL grouping documentation in header file
 * Removed a few of the earlier commits in this patch set,
   which have already been merged.

V8:
 * Documentation changes
   * userspace-api/landlock.rst:
     * Add an extra paragraph about how the IOCTL right combines
       when used with other access rights.
     * Explain better the circumstances under which passing of
       file descriptors between different Landlock domains can happen
   * limits.h: Add comment to explain public vs internal FS access rights
   * Add a paragraph in the commit to explain better why the IOCTL
     right works as it does

V7:
 * in “landlock: Add IOCTL access right”:
   * Make IOCTL_GROUPS a #define so that static_assert works even on
     old compilers (bug reported by Intel about PowerPC GCC9 config)
   * Adapt indentation of IOCTL_GROUPS definition
   * Add missing dots in kernel-doc comments.
 * in “landlock: Remove remaining "inline" modifiers in .c files”:
   * explain reasoning in commit message

V6:
 * Implementation:
   * Check that only publicly visible access rights can be used when adding a
     rule (rather than the synthetic ones).  Thanks Mickaël for spotting that!
   * Move all functionality related to IOCTL groups and synthetic access rights
     into the same place at the top of fs.c
   * Move kernel doc to the .c file in one instance
   * Smaller code style issues (upcase IOCTL, vardecl at block start)
   * Remove inline modifier from functions in .c files
 * Tests:
   * use SKIP
   * Rename 'fd' to dir_fd and file_fd where appropriate
   * Remove duplicate "ioctl" mentions from test names
   * Rename "permitted" to "allowed", in ioctl and ftruncate tests
   * Do not add rules if access is 0, in test helper

V5:
 * Implementation:
   * move IOCTL group expansion logic into fs.c (implementation suggested by
     mic)
   * rename IOCTL_CMD_G* constants to LANDLOCK_ACCESS_FS_IOCTL_GROUP*
   * fs.c: create ioctl_groups constant
   * add "const" to some variables
 * Formatting and docstring fixes (including wrong kernel-doc format)
 * samples/landlock: fix ABI version and fallback attribute (mic)
 * Documentation
   * move header documentation changes into the implementation commit
   * spell out how FIFREEZE, FITHAW and attribute-manipulation ioctls from
     fs/ioctl.c are handled
   * change ABI 4 to ABI 5 in some missing places

V4:
 * use "synthetic" IOCTL access rights, as previously discussed
 * testing changes
   * use a large fixture-based test, for more exhaustive coverage,
     and replace some of the earlier tests with it
 * rebased on mic-next

V3:
 * always permit the IOCTL commands FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC and
   FIONREAD, independent of LANDLOCK_ACCESS_FS_IOCTL
 * increment ABI version in the same commit where the feature is introduced
 * testing changes
   * use FIOQSIZE instead of TTY IOCTL commands
     (FIOQSIZE works with regular files, directories and memfds)
   * run the memfd test with both Landlock enabled and disabled
   * add a test for the always-permitted IOCTL commands

V2:
 * rebased on mic-next
 * added documentation
 * exercise ioctl(2) in the memfd test
 * test: Use layout0 for the test

---

V1: https://lore.kernel.org/all/20230502171755.9788-1-gnoack3000@gmail.com/
V2: https://lore.kernel.org/all/20230623144329.136541-1-gnoack@google.com/
V3: https://lore.kernel.org/all/20230814172816.3907299-1-gnoack@google.com/
V4: https://lore.kernel.org/all/20231103155717.78042-1-gnoack@google.com/
V5: https://lore.kernel.org/all/20231117154920.1706371-1-gnoack@google.com/
V6: https://lore.kernel.org/all/20231124173026.3257122-1-gnoack@google.com/
V7: https://lore.kernel.org/all/20231201143042.3276833-1-gnoack@google.com/
V8: https://lore.kernel.org/all/20231208155121.1943775-1-gnoack@google.com/
V9: https://lore.kernel.org/all/20240209170612.1638517-1-gnoack@google.com/
V10: https://lore.kernel.org/all/20240309075320.160128-1-gnoack@google.com/
V11: https://lore.kernel.org/all/20240322151002.3653639-1-gnoack@google.com/
V12: https://lore.kernel.org/all/20240325134004.4074874-1-gnoack@google.com/
V13: https://lore.kernel.org/all/20240327131040.158777-1-gnoack@google.com/
V14: https://lore.kernel.org/all/20240405214040.101396-1-gnoack@google.com/

Günther Noack (11):
  landlock: Add IOCTL access right for character and block devices
  selftests/landlock: Test IOCTL support
  selftests/landlock: Test IOCTL with memfds
  selftests/landlock: Test ioctl(2) and ftruncate(2) with open(O_PATH)
  selftests/landlock: Test IOCTLs on named pipes
  selftests/landlock: Check IOCTL restrictions for named UNIX domain
    sockets
  selftests/landlock: Exhaustive test for the IOCTL allow-list
  samples/landlock: Add support for LANDLOCK_ACCESS_FS_IOCTL_DEV
  landlock: Document IOCTL support
  MAINTAINERS: Notify Landlock maintainers about changes to fs/ioctl.c
  fs/ioctl: Add a comment to keep the logic in sync with LSM policies

 Documentation/userspace-api/landlock.rst     |  76 ++-
 MAINTAINERS                                  |   1 +
 fs/ioctl.c                                   |   3 +
 include/uapi/linux/landlock.h                |  38 +-
 samples/landlock/sandboxer.c                 |  13 +-
 security/landlock/fs.c                       | 225 ++++++++-
 security/landlock/limits.h                   |   2 +-
 security/landlock/syscalls.c                 |   2 +-
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   | 486 ++++++++++++++++++-
 10 files changed, 805 insertions(+), 43 deletions(-)


base-commit: fe611b72031cc211a96cf0b3b58838953950cb13
-- 
2.44.0.769.g3c40516874-goog



* Re: [PATCH 03/26] netfs: Update i_blocks when write committed to pagecache
  @ 2024-04-16 22:47  5%     ` David Howells
  0 siblings, 0 replies; 200+ results
From: David Howells @ 2024-04-16 22:47 UTC (permalink / raw)
  To: Jeff Layton, Steve French
  Cc: dhowells, Christian Brauner, Gao Xiang, Dominique Martinet,
	Matthew Wilcox, Marc Dionne, Paulo Alcantara, Shyam Prasad N,
	Tom Talpey, Eric Van Hensbergen, Ilya Dryomov, netfs,
	linux-cachefs, linux-afs, linux-cifs, linux-nfs, ceph-devel,
	v9fs, linux-erofs, linux-fsdevel, linux-mm, netdev, linux-kernel,
	Steve French, Shyam Prasad N, Rohith Surabattula

Jeff Layton <jlayton@kernel.org> wrote:

> > Update i_blocks when i_size is updated when we finish making a write to the
> > pagecache to reflect the amount of space we think will be consumed.
> > 
> 
> Umm ok, but why? I get that the i_size and i_blocks would be out of sync
> until we get back new attrs from the server, but is that a problem? I'm
> mainly curious as to what's paying attention to the i_blocks during this
> window.

This is taking over from a cifs patch that does the same thing - but in code
that is removed by my cifs-netfs branch, so I should probably let Steve speak
to that, though I think the problem with cifs is that these fields aren't
properly updated until the closure occurs and the server is consulted.

    commit dbfdff402d89854126658376cbcb08363194d3cd
    Author: Steve French <stfrench@microsoft.com>
    Date:   Thu Feb 22 00:26:52 2024 -0600

    smb3: update allocation size more accurately on write completion

    Changes to allocation size are approximated for extending writes of cached
    files until the server returns the actual value (on SMB3 close or query info
    for example), but it was setting the estimated value for number of blocks
    to larger than the file size even if the file is likely sparse which
    breaks various xfstests (e.g. generic/129, 130, 221, 228).
    
    When i_size and i_blocks are updated in write completion do not increase
    allocation size more than what was written (rounded up to 512 bytes).

David



* Re: [PATCH v14 02/12] landlock: Add IOCTL access right for character and block devices
  2024-04-05 21:40  6% ` [PATCH v14 02/12] landlock: Add IOCTL access right for character and block devices Günther Noack
@ 2024-04-12 15:16  0%   ` Mickaël Salaün
  0 siblings, 0 replies; 200+ results
From: Mickaël Salaün @ 2024-04-12 15:16 UTC (permalink / raw)
  To: Günther Noack
  Cc: linux-security-module, Jeff Xu, Arnd Bergmann,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel,
	Christian Brauner

I like this patch very much! This patch series is in linux-next and I
don't expect it to change much. Just a few comments below and for test
patches.

The only remaining question is: should we allow non-device files to
receive the LANDLOCK_ACCESS_FS_IOCTL_DEV right?

On Fri, Apr 05, 2024 at 09:40:30PM +0000, Günther Noack wrote:
> Introduces the LANDLOCK_ACCESS_FS_IOCTL_DEV right
> and increments the Landlock ABI version to 5.
> 
> This access right applies to device-custom IOCTL commands
> when they are invoked on block or character device files.
> 
> Like the truncate right, this right is associated with a file
> descriptor at the time of open(2), and gets respected even when the
> file descriptor is used outside of the thread which it was originally
> opened in.
> 
> Therefore, a newly enabled Landlock policy does not apply to file
> descriptors which are already open.
> 
> If the LANDLOCK_ACCESS_FS_IOCTL_DEV right is handled, only a small
> number of safe IOCTL commands will be permitted on newly opened device
> files.  These include FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC, as well
> as other IOCTL commands for regular files which are implemented in
> fs/ioctl.c.
> 
> Noteworthy scenarios which require special attention:
> 
> TTY devices are often passed into a process from the parent process,
> and so a newly enabled Landlock policy does not retroactively apply to
> them automatically.  In the past, TTY devices have often supported
> IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which were
> letting callers control the TTY input buffer (and simulate
> keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
> modern kernels though.
> 
> Known limitations:
> 
> The LANDLOCK_ACCESS_FS_IOCTL_DEV access right is a coarse-grained
> control over IOCTL commands.
> 
> Landlock users may use path-based restrictions in combination with
> their knowledge about the file system layout to control what IOCTLs
> can be done.
> 
> Cc: Paul Moore <paul@paul-moore.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Günther Noack <gnoack@google.com>
> ---
>  include/uapi/linux/landlock.h                |  38 +++-
>  security/landlock/fs.c                       | 221 ++++++++++++++++++-

You contributed a lot and you may want to add a copyright in this file.

>  security/landlock/limits.h                   |   2 +-
>  security/landlock/syscalls.c                 |   8 +-
>  tools/testing/selftests/landlock/base_test.c |   2 +-
>  tools/testing/selftests/landlock/fs_test.c   |   5 +-
>  6 files changed, 259 insertions(+), 17 deletions(-)
> 
> diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> index 25c8d7677539..68625e728f43 100644
> --- a/include/uapi/linux/landlock.h
> +++ b/include/uapi/linux/landlock.h
> @@ -128,7 +128,7 @@ struct landlock_net_port_attr {
>   * files and directories.  Files or directories opened before the sandboxing
>   * are not subject to these restrictions.
>   *
> - * A file can only receive these access rights:
> + * The following access rights apply only to files:
>   *
>   * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
>   * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
> @@ -138,12 +138,13 @@ struct landlock_net_port_attr {
>   * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
>   * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
>   *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
> - *   ``O_TRUNC``. Whether an opened file can be truncated with
> - *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
> - *   same way as read and write permissions are checked during
> - *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
> - *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
> - *   third version of the Landlock ABI.
> + *   ``O_TRUNC``.  This access right is available since the third version of the
> + *   Landlock ABI.
> + *
> + * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
> + * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
> + * read and write permissions are checked during :manpage:`open(2)` using
> + * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
>   *
>   * A directory can receive access rights related to files or directories.  The
>   * following access right is applied to the directory itself, and the
> @@ -198,13 +199,33 @@ struct landlock_net_port_attr {
>   *   If multiple requirements are not met, the ``EACCES`` error code takes
>   *   precedence over ``EXDEV``.
>   *
> + * The following access right applies both to files and directories:
> + *
> + * - %LANDLOCK_ACCESS_FS_IOCTL_DEV: Invoke :manpage:`ioctl(2)` commands on an opened
> + *   character or block device.
> + *
> + *   This access right applies to all `ioctl(2)` commands implemented by device
> + *   drivers.  However, the following common IOCTL commands continue to be
> + *   invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL_DEV right:
> + *
> + *   * IOCTL commands targeting file descriptors (``FIOCLEX``, ``FIONCLEX``),
> + *   * IOCTL commands targeting file descriptions (``FIONBIO``, ``FIOASYNC``),
> + *   * IOCTL commands targeting file systems (``FIFREEZE``, ``FITHAW``,
> + *     ``FIGETBSZ``, ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``)
> + *   * Some IOCTL commands which do not make sense when used with devices, but
> + *     whose implementations are safe and return the right error codes
> + *     (``FS_IOC_FIEMAP``, ``FICLONE``, ``FICLONERANGE``, ``FIDEDUPERANGE``)
> + *
> + *   This access right is available since the fifth version of the Landlock
> + *   ABI.
> + *
>   * .. warning::
>   *
>   *   It is currently not possible to restrict some file-related actions
>   *   accessible through these syscall families: :manpage:`chdir(2)`,
>   *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
>   *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
> - *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
> + *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
>   *   Future Landlock evolutions will enable to restrict them.
>   */
>  /* clang-format off */
> @@ -223,6 +244,7 @@ struct landlock_net_port_attr {
>  #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
>  #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
>  #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
> +#define LANDLOCK_ACCESS_FS_IOCTL_DEV			(1ULL << 15)
>  /* clang-format on */
>  
>  /**
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index c15559432d3d..b0857541d5e0 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -7,6 +7,7 @@
>   * Copyright © 2021-2022 Microsoft Corporation
>   */
>  
> +#include <asm/ioctls.h>
>  #include <kunit/test.h>
>  #include <linux/atomic.h>
>  #include <linux/bitops.h>
> @@ -14,6 +15,7 @@
>  #include <linux/compiler_types.h>
>  #include <linux/dcache.h>
>  #include <linux/err.h>
> +#include <linux/falloc.h>
>  #include <linux/fs.h>
>  #include <linux/init.h>
>  #include <linux/kernel.h>
> @@ -29,6 +31,7 @@
>  #include <linux/types.h>
>  #include <linux/wait_bit.h>
>  #include <linux/workqueue.h>
> +#include <uapi/linux/fiemap.h>
>  #include <uapi/linux/landlock.h>
>  
>  #include "common.h"
> @@ -84,6 +87,158 @@ static const struct landlock_object_underops landlock_fs_underops = {
>  	.release = release_inode
>  };
>  
> +/* IOCTL helpers */
> +
> +/**
> + * is_masked_device_ioctl(): Determine whether an IOCTL command is always
> + * permitted with Landlock for device files.  These commands can not be
> + * restricted on device files by enforcing a Landlock policy.
> + *
> + * @cmd: The IOCTL command that is supposed to be run.
> + *
> + * By default, any IOCTL on a device file requires the
> + * LANDLOCK_ACCESS_FS_IOCTL_DEV right.  However, we blanket-permit some
> + * commands, if:
> + *
> + * 1. The command is implemented in fs/ioctl.c's do_vfs_ioctl(),
> + *    not in f_ops->unlocked_ioctl() or f_ops->compat_ioctl().
> + *
> + * 2. The command is harmless when invoked on devices.
> + *
> + * We also permit commands that do not make sense for devices, but where the
> + * do_vfs_ioctl() implementation returns a more conventional error code.
> + *
> + * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
> + * should be considered for inclusion here.
> + *
> + * Returns: true if the IOCTL @cmd can not be restricted with Landlock for
> + * device files.
> + */

Great documentation!

> +static __attribute_const__ bool is_masked_device_ioctl(const unsigned int cmd)
> +{
> +	switch (cmd) {
> +	/*
> +	 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
> +	 * close-on-exec and the file's buffered-IO and async flags.  These
> +	 * operations are also available through fcntl(2), and are
> +	 * unconditionally permitted in Landlock.
> +	 */
> +	case FIOCLEX:
> +	case FIONCLEX:
> +	case FIONBIO:
> +	case FIOASYNC:
> +	/*
> +	 * FIOQSIZE queries the size of a regular file, directory, or link.
> +	 *
> +	 * We still permit it, because it always returns -ENOTTY for
> +	 * other file types.
> +	 */
> +	case FIOQSIZE:
> +	/*
> +	 * FIFREEZE and FITHAW freeze and thaw the file system which the
> +	 * given file belongs to.  Requires CAP_SYS_ADMIN.
> +	 *
> +	 * These commands operate on the file system's superblock rather
> +	 * than on the file itself.  The same operations can also be
> +	 * done through any other file or directory on the same file
> +	 * system, so it is safe to permit these.
> +	 */
> +	case FIFREEZE:
> +	case FITHAW:
> +	/*
> +	 * FS_IOC_FIEMAP queries information about the allocation of
> +	 * blocks within a file.
> +	 *
> +	 * This IOCTL command only makes sense for regular files and is
> +	 * not implemented by devices. It is harmless to permit.
> +	 */
> +	case FS_IOC_FIEMAP:
> +	/*
> +	 * FIGETBSZ queries the file system's block size for a file or
> +	 * directory.
> +	 *
> +	 * This command operates on the file system's superblock rather
> +	 * than on the file itself.  The same operation can also be done
> +	 * through any other file or directory on the same file system,
> +	 * so it is safe to permit it.
> +	 */
> +	case FIGETBSZ:
> +	/*
> +	 * FICLONE, FICLONERANGE and FIDEDUPERANGE make files share
> +	 * their underlying storage ("reflink") between source and
> +	 * destination FDs, on file systems which support that.
> +	 *
> +	 * These IOCTL commands only apply to regular files
> +	 * and are harmless to permit for device files.
> +	 */
> +	case FICLONE:
> +	case FICLONERANGE:
> +	case FIDEDUPERANGE:

> +	/*
> +	 * FIONREAD, FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_FSGETXATTR and
> +	 * FS_IOC_FSSETXATTR are forwarded to device implementations.
> +	 */

The above comment would fit better near the file_ioctl() one.

> +
> +	/*
> +	 * FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH both operate on
> +	 * the file system superblock, not on the specific file, so
> +	 * these operations are available through any other file on the
> +	 * same file system as well.
> +	 */
> +	case FS_IOC_GETFSUUID:
> +	case FS_IOC_GETFSSYSFSPATH:
> +		return true;
> +
> +	/*
> +	 * file_ioctl() commands (FIBMAP, FS_IOC_RESVSP, FS_IOC_RESVSP64,
> +	 * FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64 and FS_IOC_ZERO_RANGE) are
> +	 * forwarded to device implementations, so not permitted.
> +	 */
> +
> +	/* Other commands are guarded by the access right. */
> +	default:
> +		return false;
> +	}
> +}
> +
> +/*
> + * is_masked_device_ioctl_compat - same as the helper above, but checking the
> + * "compat" IOCTL commands.
> + *
> + * The IOCTL commands with special handling in compat-mode should behave the
> + * same as their non-compat counterparts.
> + */
> +static __attribute_const__ bool
> +is_masked_device_ioctl_compat(const unsigned int cmd)
> +{
> +	switch (cmd) {
> +	/* FICLONE is permitted, same as in the non-compat variant. */
> +	case FICLONE:
> +		return true;

A blank line before and after the #if/#endif would be good.

> +#if defined(CONFIG_X86_64)
> +	/*
> +	 * FS_IOC_RESVSP_32, FS_IOC_RESVSP64_32, FS_IOC_UNRESVSP_32,
> +	 * FS_IOC_UNRESVSP64_32, FS_IOC_ZERO_RANGE_32: not blanket-permitted,
> +	 * for consistency with their non-compat variants.
> +	 */
> +	case FS_IOC_RESVSP_32:
> +	case FS_IOC_RESVSP64_32:
> +	case FS_IOC_UNRESVSP_32:
> +	case FS_IOC_UNRESVSP64_32:
> +	case FS_IOC_ZERO_RANGE_32:
> +#endif
> +	/*
> +	 * FS_IOC32_GETFLAGS, FS_IOC32_SETFLAGS are forwarded to their device
> +	 * implementations.
> +	 */
> +	case FS_IOC32_GETFLAGS:
> +	case FS_IOC32_SETFLAGS:
> +		return false;
> +	default:
> +		return is_masked_device_ioctl(cmd);
> +	}
> +}
> +
>  /* Ruleset management */
>  
>  static struct landlock_object *get_inode_object(struct inode *const inode)

^ permalink raw reply	[relevance 0%]

* RE: [PATCH v6 10/10] nvme: Atomic write support
  @ 2024-04-11 23:32  4%           ` Dan Helmick
  0 siblings, 0 replies; 200+ results
From: Dan Helmick @ 2024-04-11 23:32 UTC (permalink / raw)
  To: Luis Chamberlain, John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-fsdevel, tytso, jbongio, linux-scsi, ojaswin, linux-aio,
	linux-btrfs, io-uring, nilay, ritesh.list, willy, Alan Adamson

On Thu, April 11, 2024 10:23 AM, Luis Chamberlain wrote:
> On Thu, Apr 11, 2024 at 09:59:57AM +0100, John Garry wrote:
> > On 11/04/2024 01:29, Luis Chamberlain wrote:
> > > On Tue, Mar 26, 2024 at 01:38:13PM +0000, John Garry wrote:
> > > > From: Alan Adamson <alan.adamson@oracle.com>
> > > >
> > > > Add support to set block layer request_queue atomic write limits.
> > > > The limits will be derived from either the namespace or controller
> > > > atomic parameters.
> > > >
> > > > NVMe atomic-related parameters are grouped into "normal" and "power-
> fail"
> > > > (or PF) class of parameter. For atomic write support, only PF
> > > > parameters are of interest. The "normal" parameters are concerned
> > > > with racing reads and writes (which also applies to PF). See NVM
> > > > Command Set Specification Revision 1.0d section 2.1.4 for reference.
> > > >
> > > > Whether to use per namespace or controller atomic parameters is
> > > > decided by NSFEAT bit 1 - see Figure 97: Identify – Identify
> > > > Namespace Data Structure, NVM Command Set.
> > > >
> > > > NVMe namespaces may define an atomic boundary, whereby no atomic
> > > > guarantees are provided for a write which straddles this per-lba
> > > > space boundary. The block layer merging policy is such that no
> > > > merges may occur in which the resultant request would straddle such a
> boundary.
> > > >
> > > > Unlike SCSI, NVMe specifies no granularity or alignment rules,
> > > > apart from atomic boundary rule.
> > >
> > > Larger IU drives a larger alignment *preference*, and it can be
> > > multiples of the LBA format, it's called Namespace Preferred Write
> > > Granularity (NPWG) and the NVMe driver already parses it. So say you
> > > have a 4k LBA format but a 16k NPWG. I suspect this means we'd want
> > > atomics writes to align to 16k but I can let Dan confirm.

Apologies for my delayed reply.  I confirm.  

FYI: I authored the first draft of the OPTPERF section, and at one point I also tried to help clarify the atomics section.  After my first drafts of these sections, there was a fair amount of translation from plain English into standards-type language.  So, some drive specifics were removed to accommodate other media, and so forth.

NPWG is a preference.  It does not dictate atomic behavior, though.  So, don't go assuming you can rely on it for atomicity.
 
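
As a side note, the boundary rule from the commit message ("no atomic
guarantees are provided for a write which straddles this per-lba space
boundary") reduces to a one-line check. A sketch (the boundary is given
in bytes, zero meaning the device reports no boundary; the helper name
is hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* A write [off, off + len) loses its atomic guarantee if it crosses an
 * atomic boundary (NABSPF); a zero boundary means none is reported. */
static bool straddles_atomic_boundary(uint64_t off, uint64_t len,
				      uint64_t boundary)
{
	if (!boundary || !len)
		return false;
	return off / boundary != (off + len - 1) / boundary;
}
```

This is the predicate the block layer's merging policy effectively
enforces: a merged request for which it returns true is never formed.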

> >
> > If we need to be aligned to NPWG, then the min atomic write unit would
> > also need to be NPWG. Any NPWG relation to atomic writes is not
> > defined in the spec, AFAICS.
> 
> NPWG is just a preference, not a requirement, so it is different than logical
> block size. As far as I can tell we have no block topology information to
> represent it. LBS will help users opt-in to align to the NPWG, and a respective
> NAWUPF will ensure you can also atomically write the respective sector size.
> 
> For atomics, NABSPF is what we want to use.
> 
> The above statement on the commit log just seems a bit misleading then.
> 
> > We simply use the LBA data size as the min atomic unit in this patch.
> 
> I thought NABSPF is used.

Yes, use NABSPF.  But most SSDs don't actually have boundaries.  This is more of a legacy SSD need.  

> 
> > > > Note on NABSPF:
> > > > There seems to be some vagueness in the spec as to whether NABSPF
> > > > applies for NSFEAT bit 1 being unset. Figure 97 does not
> > > > explicitly mention NABSPF and how it is affected by bit 1. However
> > > > Figure 4 does tell to check Figure
> > > > 97 for info about per-namespace parameters, which NABSPF is, so it
> > > > is implied. However currently nvme_update_disk_info() does check
> > > > namespace parameter NABO regardless of this bit.

NABO is a parameter that was carried forward; it was already in the spec.  I didn't get a chance to impact that one with my changes.

The story that was relayed to me says this parameter first existed in SATA and SCSI, and NVMe just pulled over an equivalent parameter even though the problem was resolved in the landscape NVMe SSDs ship into.  I was told that NABO was a parameter from Windows 95-ish days.  Something about the BIOS being written in 512B sectors with an ending that didn't align to 4KB.  But all the HDDs were trying to move over to 4KB for efficiencies of their ECCs.  So, there was this NABO parameter added to get the OS portion of the drive to be aligned nicely with the HDD's ECC.  

Anyways: add in the offset as queried from the drive even though it will most likely always be zero.

> > >
> > > Yeah that its quirky.
> > >
> > > Also today we set the physical block size to min(npwg, atomic) and
> > > that means for a today's average 4k IU drive if they get 16k atomic
> > > the physical block size would still be 4k. As the physical block
> > > size in practice can also lift the sector size filesystems used it
> > > would seem odd only a larger npwg could lift it.
> > It seems to me that if you want to provide atomic guarantees for this
> > large "physical block size", then it needs to be based on (N)AWUPF and NPWG.
> 
> For atomicity, I read it as needing to use NABSPF. Aligning to NPWG will just
> help performance.
> 
> The NPWG comes from an internal mapping table constructed and kept on
> DRAM on a drive in units of an IU size [0], and so not aligning to the IU just
> causes having to work with entries in the able rather than just one, and also
> incurs a read-modify-write. Contrary to the logical block size, a write below
> NPWG but respecting the logical block size is allowed, its just not optimal.
> 
> [0] https://kernelnewbies.org/KernelProjects/large-block-size#Indirection_Unit_size_increases
> 
>   Luis
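
Concretely, the IU argument above can be sketched as follows: a write
that is not aligned to the IU touches extra mapping-table entries and
forces a read-modify-write, while a sub-NPWG but logical-block-aligned
write is merely suboptimal (illustrative only; the helper names are
hypothetical and the IU size is what NPWG advertises):

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of indirection-unit-sized mapping-table entries a write
 * [off, off + len) touches, for an IU size of iu bytes. */
static uint64_t iu_entries_touched(uint64_t off, uint64_t len, uint64_t iu)
{
	if (!len)
		return 0;
	return (off + len - 1) / iu - off / iu + 1;
}

/* A write whose offset or length is not IU-aligned forces the device
 * into a read-modify-write of the partially covered IU(s). */
static bool causes_rmw(uint64_t off, uint64_t len, uint64_t iu)
{
	return (off % iu) || (len % iu);
}
```

So a 4k write at a 4k offset on a 16k-IU drive is legal and atomic per
NABSPF, but hits one IU partially and incurs the RMW penalty.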

^ permalink raw reply	[relevance 4%]

* Re: commit e57bf9cda9cd ("timerfd: convert to ->read_iter()") breaks booting on debian stable (bookworm, 12.5)
  @ 2024-04-11 22:01  1% ` Bert Karwatzki
  0 siblings, 0 replies; 200+ results
From: Bert Karwatzki @ 2024-04-11 22:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jens Axboe, linux-fsdevel

These are the kernel messages from the first failed boot.  There is no error to
see here, but the machine does not boot properly either:

[    0.000000][    T0] Linux version 6.9.0-rc3-next-20240411 (bert@lisa) (gcc
(Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1168 SMP
PREEMPT_DYNAMIC Thu Apr 11 09:28:24 CEST 2024
[    0.000000][    T0] Command line: BOOT_IMAGE=/boot/vmlinuz-6.9.0-rc3-next-
20240411 root=UUID=73e0f015-c115-4eb2-92cb-dbf7da2b6112 ro clocksource=hpet
amdgpu.noretry=0 amdgpu.mcbp=1 quiet
[    0.000000][    T0] KERNEL supported cpus:
[    0.000000][    T0]   AMD AuthenticAMD
[    0.000000][    T0] BIOS-provided physical RAM map:
[    0.000000][    T0] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff]
usable
[    0.000000][    T0] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x0000000000100000-0x0000000009bfefff]
usable
[    0.000000][    T0] BIOS-e820: [mem 0x0000000009bff000-0x000000000a000fff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x000000000a001000-0x000000000a1fffff]
usable
[    0.000000][    T0] BIOS-e820: [mem 0x000000000a200000-0x000000000a20efff]
ACPI NVS
[    0.000000][    T0] BIOS-e820: [mem 0x000000000a20f000-0x00000000e9e1ffff]
usable
[    0.000000][    T0] BIOS-e820: [mem 0x00000000e9e20000-0x00000000eb33efff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000eb33f000-0x00000000eb39efff]
ACPI data
[    0.000000][    T0] BIOS-e820: [mem 0x00000000eb39f000-0x00000000eb556fff]
ACPI NVS
[    0.000000][    T0] BIOS-e820: [mem 0x00000000eb557000-0x00000000ed17cfff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000ed17d000-0x00000000ed1fefff]
type 20
[    0.000000][    T0] BIOS-e820: [mem 0x00000000ed1ff000-0x00000000edffffff]
usable
[    0.000000][    T0] BIOS-e820: [mem 0x00000000ee000000-0x00000000f7ffffff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000fd000000-0x00000000fdffffff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000feb80000-0x00000000fec01fff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000fec10000-0x00000000fec10fff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed00fff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000fed40000-0x00000000fed44fff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed8ffff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000fedc4000-0x00000000fedc9fff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000fedcc000-0x00000000fedcefff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000fedd5000-0x00000000fedd5fff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff]
reserved
[    0.000000][    T0] BIOS-e820: [mem 0x0000000100000000-0x0000000fee2fffff]
usable
[    0.000000][    T0] BIOS-e820: [mem 0x0000000fee300000-0x000000100fffffff]
reserved
[    0.000000][    T0] NX (Execute Disable) protection: active
[    0.000000][    T0] APIC: Static calls initialized
[    0.000000][    T0] efi: EFI v2.7 by American Megatrends
[    0.000000][    T0] efi: ACPI=0xeb540000 ACPI 2.0=0xeb540014
TPMFinalLog=0xeb50c000 SMBIOS=0xed020000 SMBIOS 3.0=0xed01f000
MEMATTR=0xe6fa0018 ESRT=0xe87cb898
[    0.000000][    T0] efi: Remove mem54: MMIO range=[0xf0000000-0xf7ffffff]
(128MB) from e820 map
[    0.000000][    T0] e820: remove [mem 0xf0000000-0xf7ffffff] reserved
[    0.000000][    T0] efi: Remove mem55: MMIO range=[0xfd000000-0xfdffffff]
(16MB) from e820 map
[    0.000000][    T0] e820: remove [mem 0xfd000000-0xfdffffff] reserved
[    0.000000][    T0] efi: Remove mem56: MMIO range=[0xfeb80000-0xfec01fff]
(0MB) from e820 map
[    0.000000][    T0] e820: remove [mem 0xfeb80000-0xfec01fff] reserved
[    0.000000][    T0] efi: Not removing mem57: MMIO range=[0xfec10000-
0xfec10fff] (4KB) from e820 map
[    0.000000][    T0] efi: Not removing mem58: MMIO range=[0xfed00000-
0xfed00fff] (4KB) from e820 map
[    0.000000][    T0] efi: Not removing mem59: MMIO range=[0xfed40000-
0xfed44fff] (20KB) from e820 map
[    0.000000][    T0] efi: Not removing mem60: MMIO range=[0xfed80000-
0xfed8ffff] (64KB) from e820 map
[    0.000000][    T0] efi: Not removing mem61: MMIO range=[0xfedc4000-
0xfedc9fff] (24KB) from e820 map
[    0.000000][    T0] efi: Not removing mem62: MMIO range=[0xfedcc000-
0xfedcefff] (12KB) from e820 map
[    0.000000][    T0] efi: Not removing mem63: MMIO range=[0xfedd5000-
0xfedd5fff] (4KB) from e820 map
[    0.000000][    T0] efi: Remove mem64: MMIO range=[0xff000000-0xffffffff]
(16MB) from e820 map
[    0.000000][    T0] e820: remove [mem 0xff000000-0xffffffff] reserved
[    0.000000][    T0] SMBIOS 3.3.0 present.
[    0.000000][    T0] DMI: Micro-Star International Co., Ltd. Alpha 15
B5EEK/MS-158L, BIOS E158LAMS.107 11/10/2021
[    0.000000][    T0] tsc: Fast TSC calibration using PIT
[    0.000000][    T0] tsc: Detected 3194.029 MHz processor
[    0.000127][    T0] e820: update [mem 0x00000000-0x00000fff] usable ==>
reserved
[    0.000128][    T0] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000132][    T0] last_pfn = 0xfee300 max_arch_pfn = 0x400000000
[    0.000136][    T0] MTRR map: 5 entries (3 fixed + 2 variable; max 20), built
from 9 variable MTRRs
[    0.000137][    T0] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC-
WT
[    0.000352][    T0] e820: update [mem 0xf0000000-0xffffffff] usable ==>
reserved
[    0.000356][    T0] last_pfn = 0xee000 max_arch_pfn = 0x400000000
[    0.002717][    T0] esrt: Reserving ESRT space from 0x00000000e87cb898 to
0x00000000e87cb8d0.
[    0.002722][    T0] e820: update [mem 0xe87cb000-0xe87cbfff] usable ==>
reserved
[    0.002730][    T0] Using GB pages for direct mapping
[    0.002855][    T0] Secure boot disabled
[    0.002856][    T0] RAMDISK: [mem 0x329a9000-0x354cbfff]
[    0.002861][    T0] ACPI: Early table checksum verification disabled
[    0.002863][    T0] ACPI: RSDP 0x00000000EB540014 000024 (v02 MSI_NB)
[    0.002865][    T0] ACPI: XSDT 0x00000000EB53F728 000114 (v01 MSI_NB MEGABOOK
01072009 AMI  01000013)
[    0.002869][    T0] ACPI: FACP 0x00000000EB390000 000114 (v06 MSI_NB MEGABOOK
01072009 AMI  00010013)
[    0.002872][    T0] ACPI: DSDT 0x00000000EB383000 00C50C (v02 MSI_NB MEGABOOK
01072009 INTL 20190509)
[    0.002874][    T0] ACPI: FACS 0x00000000EB50A000 000040
[    0.002875][    T0] ACPI: SLIC 0x00000000EB39E000 000176 (v01 MSI_NB MEGABOOK
01072009 AMI  01000013)
[    0.002877][    T0] ACPI: SSDT 0x00000000EB396000 0072B0 (v02 AMD    AmdTable
00000002 MSFT 04000000)
[    0.002878][    T0] ACPI: IVRS 0x00000000EB395000 0001A4 (v02 AMD    AmdTable
00000001 AMD  00000000)
[    0.002879][    T0] ACPI: SSDT 0x00000000EB391000 003A21 (v01 AMD    AMD AOD
00000001 INTL 20190509)
[    0.002880][    T0] ACPI: FIDT 0x00000000EB382000 00009C (v01 MSI_NB MEGABOOK
01072009 AMI  00010013)
[    0.002881][    T0] ACPI: ECDT 0x00000000EB381000 0000C1 (v01 MSI_NB MEGABOOK
01072009 AMI. 00010013)
[    0.002883][    T0] ACPI: MCFG 0x00000000EB380000 00003C (v01 MSI_NB MEGABOOK
01072009 MSFT 00010013)
[    0.002884][    T0] ACPI: HPET 0x00000000EB37F000 000038 (v01 MSI_NB MEGABOOK
01072009 AMI  00000005)
[    0.002885][    T0] ACPI: VFCT 0x00000000EB371000 00D884 (v01 MSI_NB MEGABOOK
00000001 AMD  31504F47)
[    0.002886][    T0] ACPI: BGRT 0x00000000EB370000 000038 (v01 MSI_NB MEGABOOK
01072009 AMI  00010013)
[    0.002887][    T0] ACPI: TPM2 0x00000000EB36F000 00004C (v04 MSI_NB MEGABOOK
00000001 AMI  00000000)
[    0.002889][    T0] ACPI: SSDT 0x00000000EB369000 005354 (v02 AMD    AmdTable
00000001 AMD  00000001)
[    0.002890][    T0] ACPI: CRAT 0x00000000EB368000 000EE8 (v01 AMD    AmdTable
00000001 AMD  00000001)
[    0.002891][    T0] ACPI: CDIT 0x00000000EB367000 000029 (v01 AMD    AmdTable
00000001 AMD  00000001)
[    0.002892][    T0] ACPI: SSDT 0x00000000EB366000 000149 (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002893][    T0] ACPI: SSDT 0x00000000EB364000 00148E (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002894][    T0] ACPI: SSDT 0x00000000EB362000 00153F (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002895][    T0] ACPI: SSDT 0x00000000EB361000 000696 (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002897][    T0] ACPI: SSDT 0x00000000EB35F000 001A56 (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002898][    T0] ACPI: SSDT 0x00000000EB35E000 0005DE (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002899][    T0] ACPI: SSDT 0x00000000EB35A000 0036E9 (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002900][    T0] ACPI: WSMT 0x00000000EB359000 000028 (v01 MSI_NB MEGABOOK
01072009 AMI  00010013)
[    0.002901][    T0] ACPI: APIC 0x00000000EB358000 0000DE (v03 MSI_NB MEGABOOK
01072009 AMI  00010013)
[    0.002902][    T0] ACPI: SSDT 0x00000000EB357000 00008D (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002903][    T0] ACPI: SSDT 0x00000000EB356000 0008A8 (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002904][    T0] ACPI: SSDT 0x00000000EB355000 0001B7 (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002906][    T0] ACPI: SSDT 0x00000000EB354000 0007B1 (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002907][    T0] ACPI: SSDT 0x00000000EB353000 00097D (v01 AMD    AmdTable
00000001 INTL 20190509)
[    0.002908][    T0] ACPI: FPDT 0x00000000EB352000 000044 (v01 MSI_NB A M I
01072009 AMI  01000013)
[    0.002909][    T0] ACPI: Reserving FACP table memory at [mem 0xeb390000-
0xeb390113]
[    0.002910][    T0] ACPI: Reserving DSDT table memory at [mem 0xeb383000-
0xeb38f50b]
[    0.002910][    T0] ACPI: Reserving FACS table memory at [mem 0xeb50a000-
0xeb50a03f]
[    0.002911][    T0] ACPI: Reserving SLIC table memory at [mem 0xeb39e000-
0xeb39e175]
[    0.002911][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb396000-
0xeb39d2af]
[    0.002911][    T0] ACPI: Reserving IVRS table memory at [mem 0xeb395000-
0xeb3951a3]
[    0.002912][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb391000-
0xeb394a20]
[    0.002912][    T0] ACPI: Reserving FIDT table memory at [mem 0xeb382000-
0xeb38209b]
[    0.002912][    T0] ACPI: Reserving ECDT table memory at [mem 0xeb381000-
0xeb3810c0]
[    0.002913][    T0] ACPI: Reserving MCFG table memory at [mem 0xeb380000-
0xeb38003b]
[    0.002913][    T0] ACPI: Reserving HPET table memory at [mem 0xeb37f000-
0xeb37f037]
[    0.002914][    T0] ACPI: Reserving VFCT table memory at [mem 0xeb371000-
0xeb37e883]
[    0.002914][    T0] ACPI: Reserving BGRT table memory at [mem 0xeb370000-
0xeb370037]
[    0.002914][    T0] ACPI: Reserving TPM2 table memory at [mem 0xeb36f000-
0xeb36f04b]
[    0.002915][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb369000-
0xeb36e353]
[    0.002915][    T0] ACPI: Reserving CRAT table memory at [mem 0xeb368000-
0xeb368ee7]
[    0.002915][    T0] ACPI: Reserving CDIT table memory at [mem 0xeb367000-
0xeb367028]
[    0.002916][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb366000-
0xeb366148]
[    0.002916][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb364000-
0xeb36548d]
[    0.002917][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb362000-
0xeb36353e]
[    0.002917][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb361000-
0xeb361695]
[    0.002917][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb35f000-
0xeb360a55]
[    0.002918][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb35e000-
0xeb35e5dd]
[    0.002918][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb35a000-
0xeb35d6e8]
[    0.002918][    T0] ACPI: Reserving WSMT table memory at [mem 0xeb359000-
0xeb359027]
[    0.002919][    T0] ACPI: Reserving APIC table memory at [mem 0xeb358000-
0xeb3580dd]
[    0.002919][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb357000-
0xeb35708c]
[    0.002920][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb356000-
0xeb3568a7]
[    0.002920][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb355000-
0xeb3551b6]
[    0.002920][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb354000-
0xeb3547b0]
[    0.002921][    T0] ACPI: Reserving SSDT table memory at [mem 0xeb353000-
0xeb35397c]
[    0.002921][    T0] ACPI: Reserving FPDT table memory at [mem 0xeb352000-
0xeb352043]
[    0.002957][    T0] Zone ranges:
[    0.002957][    T0]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.002959][    T0]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.002959][    T0]   Normal   [mem 0x0000000100000000-0x0000000fee2fffff]
[    0.002960][    T0]   Device   empty
[    0.002961][    T0] Movable zone start for each node
[    0.002961][    T0] Early memory node ranges
[    0.002962][    T0]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
[    0.002962][    T0]   node   0: [mem 0x0000000000100000-0x0000000009bfefff]
[    0.002963][    T0]   node   0: [mem 0x000000000a001000-0x000000000a1fffff]
[    0.002963][    T0]   node   0: [mem 0x000000000a20f000-0x00000000e9e1ffff]
[    0.002964][    T0]   node   0: [mem 0x00000000ed1ff000-0x00000000edffffff]
[    0.002964][    T0]   node   0: [mem 0x0000000100000000-0x0000000fee2fffff]
[    0.002966][    T0] Initmem setup node 0 [mem 0x0000000000001000-
0x0000000fee2fffff]
[    0.002971][    T0] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.002981][    T0] On node 0, zone DMA: 96 pages in unavailable ranges
[    0.003071][    T0] On node 0, zone DMA32: 1026 pages in unavailable ranges
[    0.006707][    T0] On node 0, zone DMA32: 15 pages in unavailable ranges
[    0.006786][    T0] On node 0, zone DMA32: 13279 pages in unavailable ranges
[    0.006969][    T0] On node 0, zone Normal: 8192 pages in unavailable ranges
[    0.007002][    T0] On node 0, zone Normal: 7424 pages in unavailable ranges
[    0.007969][    T0] ACPI: PM-Timer IO Port: 0x808
[    0.007975][    T0] ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
[    0.007985][    T0] IOAPIC[0]: apic_id 33, version 33, address 0xfec00000,
GSI 0-23
[    0.007990][    T0] IOAPIC[1]: apic_id 34, version 33, address 0xfec01000,
GSI 24-55
[    0.007992][    T0] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.007993][    T0] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low
level)
[    0.007995][    T0] ACPI: Using ACPI (MADT) for SMP configuration information
[    0.007996][    T0] ACPI: HPET id: 0x10228201 base: 0xfed00000
[    0.008002][    T0] e820: update [mem 0xe62ee000-0xe63e1fff] usable ==> reserved
[    0.008013][    T0] CPU topo: Max. logical packages:   1
[    0.008013][    T0] CPU topo: Max. logical dies:       1
[    0.008013][    T0] CPU topo: Max. dies per package:   1
[    0.008016][    T0] CPU topo: Max. threads per core:   2
[    0.008017][    T0] CPU topo: Num. cores per package:     8
[    0.008017][    T0] CPU topo: Num. threads per package:  16
[    0.008017][    T0] CPU topo: Allowing 16 present CPUs plus 0 hotplug CPUs
[    0.008029][    T0] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.008030][    T0] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.008031][    T0] PM: hibernation: Registered nosave memory: [mem 0x09bff000-0x0a000fff]
[    0.008032][    T0] PM: hibernation: Registered nosave memory: [mem 0x0a200000-0x0a20efff]
[    0.008032][    T0] PM: hibernation: Registered nosave memory: [mem 0xe62ee000-0xe63e1fff]
[    0.008033][    T0] PM: hibernation: Registered nosave memory: [mem 0xe87cb000-0xe87cbfff]
[    0.008034][    T0] PM: hibernation: Registered nosave memory: [mem 0xe9e20000-0xeb33efff]
[    0.008034][    T0] PM: hibernation: Registered nosave memory: [mem 0xeb33f000-0xeb39efff]
[    0.008035][    T0] PM: hibernation: Registered nosave memory: [mem 0xeb39f000-0xeb556fff]
[    0.008035][    T0] PM: hibernation: Registered nosave memory: [mem 0xeb557000-0xed17cfff]
[    0.008035][    T0] PM: hibernation: Registered nosave memory: [mem 0xed17d000-0xed1fefff]
[    0.008036][    T0] PM: hibernation: Registered nosave memory: [mem 0xee000000-0xefffffff]
[    0.008036][    T0] PM: hibernation: Registered nosave memory: [mem 0xf0000000-0xfec0ffff]
[    0.008037][    T0] PM: hibernation: Registered nosave memory: [mem 0xfec10000-0xfec10fff]
[    0.008037][    T0] PM: hibernation: Registered nosave memory: [mem 0xfec11000-0xfecfffff]
[    0.008037][    T0] PM: hibernation: Registered nosave memory: [mem 0xfed00000-0xfed00fff]
[    0.008038][    T0] PM: hibernation: Registered nosave memory: [mem 0xfed01000-0xfed3ffff]
[    0.008038][    T0] PM: hibernation: Registered nosave memory: [mem 0xfed40000-0xfed44fff]
[    0.008038][    T0] PM: hibernation: Registered nosave memory: [mem 0xfed45000-0xfed7ffff]
[    0.008039][    T0] PM: hibernation: Registered nosave memory: [mem 0xfed80000-0xfed8ffff]
[    0.008039][    T0] PM: hibernation: Registered nosave memory: [mem 0xfed90000-0xfedc3fff]
[    0.008039][    T0] PM: hibernation: Registered nosave memory: [mem 0xfedc4000-0xfedc9fff]
[    0.008039][    T0] PM: hibernation: Registered nosave memory: [mem 0xfedca000-0xfedcbfff]
[    0.008040][    T0] PM: hibernation: Registered nosave memory: [mem 0xfedcc000-0xfedcefff]
[    0.008040][    T0] PM: hibernation: Registered nosave memory: [mem 0xfedcf000-0xfedd4fff]
[    0.008040][    T0] PM: hibernation: Registered nosave memory: [mem 0xfedd5000-0xfedd5fff]
[    0.008041][    T0] PM: hibernation: Registered nosave memory: [mem 0xfedd6000-0xffffffff]
[    0.008042][    T0] [mem 0xf0000000-0xfec0ffff] available for PCI devices
[    0.008044][    T0] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.008049][    T0] setup_percpu: NR_CPUS:16 nr_cpumask_bits:16 nr_cpu_ids:16 nr_node_ids:1
[    0.008390][    T0] percpu: Embedded 45 pages/cpu s143464 r8192 d32664 u262144
[    0.008394][    T0] pcpu-alloc: s143464 r8192 d32664 u262144 alloc=1*2097152
[    0.008395][    T0] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 [0] 08 09 10 11 12 13 14 15
[    0.008407][    T0] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.9.0-rc3-next-20240411 root=UUID=73e0f015-c115-4eb2-92cb-dbf7da2b6112 ro clocksource=hpet amdgpu.noretry=0 amdgpu.mcbp=1 quiet
[    0.008442][    T0] Unknown kernel command line parameters "BOOT_IMAGE=/boot/vmlinuz-6.9.0-rc3-next-20240411", will be passed to user space.
[    0.008469][    T0] random: crng init done
[    0.012671][    T0] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, linear)
[    0.014717][    T0] Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes, linear)
[    0.014776][    T0] Built 1 zonelists, mobility grouping on.  Total pages: 16616111
[    0.014778][    T0] mem auto-init: stack:off, heap alloc:off, heap free:off, mlocked free:off
[    0.014815][    T0] software IO TLB: area num 16.
[    0.030225][    T0] Memory: 3784692K/66464444K available (12288K kernel code, 1066K rwdata, 3580K rodata, 1372K init, 1528K bss, 1350924K reserved, 0K cma-reserved)
[    0.030332][    T0] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[    0.030373][    T0] Dynamic Preempt: full
[    0.030419][    T0] rcu: Preemptible hierarchical RCU implementation.
[    0.030420][    T0] 	Trampoline variant of Tasks RCU enabled.
[    0.030421][    T0] 	Tracing variant of Tasks RCU enabled.
[    0.030421][    T0] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[    0.030429][    T0] RCU Tasks: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1.
[    0.030432][    T0] RCU Tasks Trace: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1.
[    0.030433][    T0] NR_IRQS: 4352, nr_irqs: 1096, preallocated irqs: 16
[    0.030612][    T0] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.030671][    T0] Console: colour dummy device 80x25
[    0.030673][    T0] printk: legacy console [tty0] enabled
[    0.030689][    T0] ACPI: Core revision 20230628
[    0.030869][    T0] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
[    0.030886][    T0] APIC: Switch to symmetric I/O mode setup
[    0.031460][    T0] AMD-Vi: ivrs, add hid:AMDI0020, uid:\_SB.FUR0, rdevid:160
[    0.031462][    T0] AMD-Vi: ivrs, add hid:AMDI0020, uid:\_SB.FUR1, rdevid:160
[    0.031463][    T0] AMD-Vi: ivrs, add hid:AMDI0020, uid:\_SB.FUR2, rdevid:160
[    0.031464][    T0] AMD-Vi: ivrs, add hid:AMDI0020, uid:\_SB.FUR3, rdevid:160
[    0.031464][    T0] AMD-Vi: Using global IVHD EFR:0x206d73ef22254ade, EFR2:0x0
[    0.031716][    T0] APIC: Switched APIC routing to: physical flat
[    0.032304][    T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.036892][    T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2e0a41ca3be, max_idle_ns: 440795356728 ns
[    0.036898][    T0] Calibrating delay loop (skipped), value calculated using timer frequency.. 6388.05 BogoMIPS (lpj=3194029)
[    0.036916][    T0] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[    0.036949][    T0] LVT offset 1 assigned for vector 0xf9
[    0.037057][    T0] LVT offset 2 assigned for vector 0xf4
[    0.037081][    T0] Last level iTLB entries: 4KB 512, 2MB 512, 4MB 256
[    0.037082][    T0] Last level dTLB entries: 4KB 2048, 2MB 2048, 4MB 1024, 1GB 0
[    0.037084][    T0] process: using mwait in idle threads
[    0.037087][    T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.037089][    T0] Spectre V2 : Mitigation: Retpolines
[    0.037089][    T0] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    0.037090][    T0] Spectre V2 : Spectre v2 / SpectreRSB : Filling RSB on VMEXIT
[    0.037090][    T0] Spectre V2 : Enabling Restricted Speculation for firmware calls
[    0.037092][    T0] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    0.037092][    T0] Spectre V2 : User space: Mitigation: STIBP always-on protection
[    0.037094][    T0] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
[    0.037094][    T0] Speculative Return Stack Overflow: IBPB-extending microcode not applied!
[    0.037095][    T0] Speculative Return Stack Overflow: WARNING: See https://kernel.org/doc/html/latest/admin-guide/hw-vuln/srso.html for mitigation options.
[    0.037096][    T0] Speculative Return Stack Overflow: Vulnerable: Safe RET, no microcode
[    0.037101][    T0] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.037102][    T0] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.037102][    T0] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.037103][    T0] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[    0.037104][    T0] x86/fpu: Supporting XSAVE feature 0x800: 'Control-flow User registers'
[    0.037105][    T0] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.037106][    T0] x86/fpu: xstate_offset[9]:  832, xstate_sizes[9]:    8
[    0.037106][    T0] x86/fpu: xstate_offset[11]:  840, xstate_sizes[11]:   16
[    0.037107][    T0] x86/fpu: Enabled xstate features 0xa07, context size is 856 bytes, using 'compacted' format.
[    0.047777][    T0] Freeing SMP alternatives memory: 32K
[    0.047778][    T0] pid_max: default: 32768 minimum: 301
[    0.049596][    T0] Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.049663][    T0] Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.152196][    T1] smpboot: CPU0: AMD Ryzen 7 5800H with Radeon Graphics (family: 0x19, model: 0x50, stepping: 0x0)
[    0.152361][    T1] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[    0.152378][    T1] ... version:                0
[    0.152379][    T1] ... bit width:              48
[    0.152380][    T1] ... generic registers:      6
[    0.152380][    T1] ... value mask:             0000ffffffffffff
[    0.152381][    T1] ... max period:             00007fffffffffff
[    0.152382][    T1] ... fixed-purpose events:   0
[    0.152383][    T1] ... event mask:             000000000000003f
[    0.152449][    T1] signal: max sigframe size: 3376
[    0.152471][    T1] rcu: Hierarchical SRCU implementation.
[    0.152472][    T1] rcu: 	Max phase no-delay instances is 400.
[    0.153047][    T9] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    0.153169][    T1] smp: Bringing up secondary CPUs ...
[    0.153237][    T1] smpboot: x86: Booting SMP configuration:
[    0.153238][    T1] .... node  #0, CPUs:        #2  #4  #6  #8 #10 #12 #14  #1  #3  #5  #7  #9 #11 #13 #15
[    0.163967][    T1] Spectre V2 : Update user space SMT mitigation: STIBP always-on
[    0.170912][    T1] smp: Brought up 1 node, 16 CPUs
[    0.170912][    T1] smpboot: Total of 16 processors activated (102208.92 BogoMIPS)
[    0.227905][   T96] node 0 deferred pages initialised in 56ms
[    0.228667][    T1] devtmpfs: initialized
[    0.228667][    T1] x86/mm: Memory block size: 128MB
[    0.234179][    T1] ACPI: PM: Registering ACPI NVS region [mem 0x0a200000-0x0a20efff] (61440 bytes)
[    0.234179][    T1] ACPI: PM: Registering ACPI NVS region [mem 0xeb39f000-0xeb556fff] (1802240 bytes)
[    0.234179][    T1] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.234179][    T1] futex hash table entries: 4096 (order: 6, 262144 bytes, linear)
[    0.234179][    T1] pinctrl core: initialized pinctrl subsystem
[    0.234357][    T1] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.234470][    T1] audit: initializing netlink subsys (disabled)
[    0.234475][  T107] audit: type=2000 audit(1712836198.202:1): state=initialized audit_enabled=0 res=1
[    0.234475][    T1] thermal_sys: Registered thermal governor 'fair_share'
[    0.234475][    T1] thermal_sys: Registered thermal governor 'bang_bang'
[    0.234475][    T1] thermal_sys: Registered thermal governor 'step_wise'
[    0.234475][    T1] thermal_sys: Registered thermal governor 'user_space'
[    0.234475][    T1] thermal_sys: Registered thermal governor 'power_allocator'
[    0.234475][    T1] cpuidle: using governor ladder
[    0.234475][    T1] cpuidle: using governor teo
[    0.234931][    T1] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    0.234985][    T1] PCI: ECAM [mem 0xf0000000-0xf7ffffff] (base 0xf0000000) for domain 0000 [bus 00-7f]
[    0.234989][    T1] PCI: not using ECAM ([mem 0xf0000000-0xf7ffffff] not reserved)
[    0.234991][    T1] PCI: Using configuration type 1 for base access
[    0.234992][    T1] PCI: Using configuration type 1 for extended access
[    0.235054][    T1] kprobes: kprobe jump-optimization is enabled. All kprobes are optimized if possible.
[    0.235054][    T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
[    0.235054][    T1] HugeTLB: 16380 KiB vmemmap can be freed for a 1.00 GiB page
[    0.235054][    T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[    0.235054][    T1] HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
[    0.235054][    T1] ACPI: Added _OSI(Module Device)
[    0.235054][    T1] ACPI: Added _OSI(Processor Device)
[    0.235054][    T1] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.235054][    T1] ACPI: Added _OSI(Processor Aggregator Device)
[    0.249579][    T1] ACPI: 16 ACPI AML tables successfully acquired and loaded
[    0.250654][    T1] ACPI: EC: EC started
[    0.250655][    T1] ACPI: EC: interrupt blocked
[    0.251177][    T1] ACPI: EC: EC_CMD/EC_SC=0x66, EC_DATA=0x62
[    0.251179][    T1] ACPI: EC: Boot ECDT EC used to handle transactions
[    0.251983][    T1] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[    0.253356][    T1] ACPI: _OSC evaluation for CPUs failed, trying _PDC
[    0.254197][    T1] ACPI: Interpreter enabled
[    0.254213][    T1] ACPI: PM: (supports S0 S4 S5)
[    0.254214][    T1] ACPI: Using IOAPIC for interrupt routing
[    0.254404][    T1] PCI: ECAM [mem 0xf0000000-0xf7ffffff] (base 0xf0000000) for domain 0000 [bus 00-7f]
[    0.254434][    T1] PCI: ECAM [mem 0xf0000000-0xf7ffffff] reserved as ACPI motherboard resource
[    0.254443][    T1] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    0.254445][    T1] PCI: Using E820 reservations for host bridge windows
[    0.254854][    T1] ACPI: Enabled 1 GPEs in block 00 to 1F
[    0.255505][    T1] ACPI: \_SB_.PCI0.GPP0.M237: New power resource
[    0.255661][    T1] ACPI: \_SB_.PCI0.GPP0.SWUS.M237: New power resource
[    0.255741][    T1] ACPI: \_SB_.PCI0.GPP0.SWUS.SWDS.M237: New power resource
[    0.256753][    T1] ACPI: \_SB_.PCI0.GP17.XHC0.P0U0: New power resource
[    0.256779][    T1] ACPI: \_SB_.PCI0.GP17.XHC0.P3U0: New power resource
[    0.257316][    T1] ACPI: \_SB_.PCI0.GP17.XHC1.P0U1: New power resource
[    0.257342][    T1] ACPI: \_SB_.PCI0.GP17.XHC1.P3U1: New power resource
[    0.259149][    T1] ACPI: \_SB_.PCI0.GPP6.P0NV: New power resource
[    0.259337][    T1] ACPI: \_SB_.PCI0.GPP5.P0NX: New power resource
[    0.265534][    T1] ACPI: \_SB_.PRWB: New power resource
[    0.267301][    T1] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[    0.267306][    T1] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[    0.267378][    T1] acpi PNP0A08:00: _OSC: platform does not support [LTR]
[    0.267506][    T1] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    0.267510][    T1] acpi PNP0A08:00: [Firmware Info]: ECAM [mem 0xf0000000-0xf7ffffff] for domain 0000 [bus 00-7f] only partially covers this bridge
[    0.267937][    T1] PCI host bridge to bus 0000:00
[    0.267938][    T1] pci_bus 0000:00: root bus resource [io  0x0000-0x03af window]
[    0.267940][    T1] pci_bus 0000:00: root bus resource [io  0x03e0-0x0cf7 window]
[    0.267942][    T1] pci_bus 0000:00: root bus resource [io  0x03b0-0x03df window]
[    0.267943][    T1] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
[    0.267945][    T1] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000dffff window]
[    0.267946][    T1] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfcffffff window]
[    0.267947][    T1] pci_bus 0000:00: root bus resource [mem 0x1010000000-0xffffffffff window]
[    0.267949][    T1] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.267964][    T1] pci 0000:00:00.0: [1022:1630] type 00 class 0x060000 conventional PCI endpoint
[    0.268062][    T1] pci 0000:00:00.2: [1022:1631] type 00 class 0x080600 conventional PCI endpoint
[    0.268154][    T1] pci 0000:00:01.0: [1022:1632] type 00 class 0x060000 conventional PCI endpoint
[    0.268226][    T1] pci 0000:00:01.1: [1022:1633] type 01 class 0x060400 PCIe Root Port
[    0.268241][    T1] pci 0000:00:01.1: PCI bridge to [bus 01-03]
[    0.268246][    T1] pci 0000:00:01.1:   bridge window [mem 0xfca00000-0xfccfffff]
[    0.268252][    T1] pci 0000:00:01.1:   bridge window [mem 0xfc00000000-0xfe0fffffff 64bit pref]
[    0.268292][    T1] pci 0000:00:01.1: PME# supported from D0 D3hot D3cold
[    0.268397][    T1] pci 0000:00:02.0: [1022:1632] type 00 class 0x060000 conventional PCI endpoint
[    0.268469][    T1] pci 0000:00:02.1: [1022:1634] type 01 class 0x060400 PCIe Root Port
[    0.268484][    T1] pci 0000:00:02.1: PCI bridge to [bus 04]
[    0.268493][    T1] pci 0000:00:02.1:   bridge window [mem 0xfe30300000-0xfe304fffff 64bit pref]
[    0.268532][    T1] pci 0000:00:02.1: PME# supported from D0 D3hot D3cold
[    0.268613][    T1] pci 0000:00:02.2: [1022:1634] type 01 class 0x060400 PCIe Root Port
[    0.268628][    T1] pci 0000:00:02.2: PCI bridge to [bus 05]
[    0.268631][    T1] pci 0000:00:02.2:   bridge window [io  0xf000-0xffff]
[    0.268634][    T1] pci 0000:00:02.2:   bridge window [mem 0xfcf00000-0xfcffffff]
[    0.268644][    T1] pci 0000:00:02.2: enabling Extended Tags
[    0.268678][    T1] pci 0000:00:02.2: PME# supported from D0 D3hot D3cold
[    0.268760][    T1] pci 0000:00:02.3: [1022:1634] type 01 class 0x060400 PCIe Root Port
[    0.268774][    T1] pci 0000:00:02.3: PCI bridge to [bus 06]
[    0.268778][    T1] pci 0000:00:02.3:   bridge window [mem 0xfce00000-0xfcefffff]
[    0.268820][    T1] pci 0000:00:02.3: PME# supported from D0 D3hot D3cold
[    0.268902][    T1] pci 0000:00:02.4: [1022:1634] type 01 class 0x060400 PCIe Root Port
[    0.268917][    T1] pci 0000:00:02.4: PCI bridge to [bus 07]
[    0.268921][    T1] pci 0000:00:02.4:   bridge window [mem 0xfcd00000-0xfcdfffff]
[    0.268932][    T1] pci 0000:00:02.4: enabling Extended Tags
[    0.268965][    T1] pci 0000:00:02.4: PME# supported from D0 D3hot D3cold
[    0.269056][    T1] pci 0000:00:08.0: [1022:1632] type 00 class 0x060000 conventional PCI endpoint
[    0.269128][    T1] pci 0000:00:08.1: [1022:1635] type 01 class 0x060400 PCIe Root Port
[    0.269143][    T1] pci 0000:00:08.1: PCI bridge to [bus 08]
[    0.269146][    T1] pci 0000:00:08.1:   bridge window [io  0xe000-0xefff]
[    0.269148][    T1] pci 0000:00:08.1:   bridge window [mem 0xfc500000-0xfc9fffff]
[    0.269153][    T1] pci 0000:00:08.1:   bridge window [mem 0xfe20000000-0xfe301fffff 64bit pref]
[    0.269159][    T1] pci 0000:00:08.1: enabling Extended Tags
[    0.269192][    T1] pci 0000:00:08.1: PME# supported from D0 D3hot D3cold
[    0.269335][    T1] pci 0000:00:14.0: [1022:790b] type 00 class 0x0c0500 conventional PCI endpoint
[    0.269455][    T1] pci 0000:00:14.3: [1022:790e] type 00 class 0x060100 conventional PCI endpoint
[    0.269589][    T1] pci 0000:00:18.0: [1022:166a] type 00 class 0x060000 conventional PCI endpoint
[    0.269644][    T1] pci 0000:00:18.1: [1022:166b] type 00 class 0x060000 conventional PCI endpoint
[    0.269695][    T1] pci 0000:00:18.2: [1022:166c] type 00 class 0x060000 conventional PCI endpoint
[    0.269747][    T1] pci 0000:00:18.3: [1022:166d] type 00 class 0x060000 conventional PCI endpoint
[    0.269799][    T1] pci 0000:00:18.4: [1022:166e] type 00 class 0x060000 conventional PCI endpoint
[    0.269851][    T1] pci 0000:00:18.5: [1022:166f] type 00 class 0x060000 conventional PCI endpoint
[    0.269905][    T1] pci 0000:00:18.6: [1022:1670] type 00 class 0x060000 conventional PCI endpoint
[    0.269956][    T1] pci 0000:00:18.7: [1022:1671] type 00 class 0x060000 conventional PCI endpoint
[    0.270071][    T1] pci 0000:01:00.0: [1002:1478] type 01 class 0x060400 PCIe Switch Upstream Port
[    0.270085][    T1] pci 0000:01:00.0: BAR 0 [mem 0xfcc00000-0xfcc03fff]
[    0.270100][    T1] pci 0000:01:00.0: PCI bridge to [bus 02-03]
[    0.270107][    T1] pci 0000:01:00.0:   bridge window [mem 0xfca00000-0xfcbfffff]
[    0.270116][    T1] pci 0000:01:00.0:   bridge window [mem 0xfc00000000-0xfe0fffffff 64bit pref]
[    0.270195][    T1] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[    0.270259][    T1] pci 0000:01:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 126.024 Gb/s with 16.0 GT/s PCIe x8 link)
[    0.270364][    T1] pci 0000:00:01.1: PCI bridge to [bus 01-03]
[    0.270426][    T1] pci 0000:02:00.0: [1002:1479] type 01 class 0x060400 PCIe Switch Downstream Port
[    0.270451][    T1] pci 0000:02:00.0: PCI bridge to [bus 03]
[    0.270458][    T1] pci 0000:02:00.0:   bridge window [mem 0xfca00000-0xfcbfffff]
[    0.270467][    T1] pci 0000:02:00.0:   bridge window [mem 0xfc00000000-0xfe0fffffff 64bit pref]
[    0.270548][    T1] pci 0000:02:00.0: PME# supported from D0 D3hot D3cold
[    0.270908][    T1] pci 0000:01:00.0: PCI bridge to [bus 02-03]
[    0.270977][    T1] pci 0000:03:00.0: [1002:73ff] type 00 class 0x038000 PCIe Legacy Endpoint
[    0.270992][    T1] pci 0000:03:00.0: BAR 0 [mem 0xfc00000000-0xfdffffffff 64bit pref]
[    0.271001][    T1] pci 0000:03:00.0: BAR 2 [mem 0xfe00000000-0xfe0fffffff 64bit pref]
[    0.271011][    T1] pci 0000:03:00.0: BAR 5 [mem 0xfca00000-0xfcafffff]
[    0.271016][    T1] pci 0000:03:00.0: ROM [mem 0xfcb00000-0xfcb1ffff pref]
[    0.271103][    T1] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.271172][    T1] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    0.271265][    T1] pci 0000:03:00.1: [1002:ab28] type 00 class 0x040300 PCIe Legacy Endpoint
[    0.271276][    T1] pci 0000:03:00.1: BAR 0 [mem 0xfcb20000-0xfcb23fff]
[    0.271359][    T1] pci 0000:03:00.1: PME# supported from D1 D2 D3hot D3cold
[    0.271482][    T1] pci 0000:02:00.0: PCI bridge to [bus 03]
[    0.271556][    T1] pci 0000:04:00.0: [14c3:0608] type 00 class 0x028000 PCIe Endpoint
[    0.271575][    T1] pci 0000:04:00.0: BAR 0 [mem 0xfe30300000-0xfe303fffff 64bit pref]
[    0.271587][    T1] pci 0000:04:00.0: BAR 2 [mem 0xfe30400000-0xfe30403fff 64bit pref]
[    0.271598][    T1] pci 0000:04:00.0: BAR 4 [mem 0xfe30404000-0xfe30404fff 64bit pref]
[    0.271672][    T1] pci 0000:04:00.0: supports D1 D2
[    0.271674][    T1] pci 0000:04:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    0.271839][    T1] pci 0000:00:02.1: PCI bridge to [bus 04]
[    0.271906][    T1] pci 0000:05:00.0: [10ec:8168] type 00 class 0x020000 PCIe Endpoint
[    0.271923][    T1] pci 0000:05:00.0: BAR 0 [io  0xf000-0xf0ff]
[    0.271946][    T1] pci 0000:05:00.0: BAR 2 [mem 0xfcf04000-0xfcf04fff 64bit]
[    0.271960][    T1] pci 0000:05:00.0: BAR 4 [mem 0xfcf00000-0xfcf03fff 64bit]
[    0.272052][    T1] pci 0000:05:00.0: supports D1 D2
[    0.272053][    T1] pci 0000:05:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    0.272233][    T1] pci 0000:00:02.2: PCI bridge to [bus 05]
[    0.272627][    T1] pci 0000:06:00.0: [2646:5013] type 00 class 0x010802 PCIe Endpoint
[    0.272673][    T1] pci 0000:06:00.0: BAR 0 [mem 0xfce00000-0xfce03fff 64bit]
[    0.273226][    T1] pci 0000:06:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:02.3 (capable of 63.012 Gb/s with 16.0 GT/s PCIe x4 link)
[    0.273829][    T1] pci 0000:00:02.3: PCI bridge to [bus 06]
[    0.273911][    T1] pci 0000:07:00.0: [c0a9:2263] type 00 class 0x010802 PCIe Endpoint
[    0.273931][    T1] pci 0000:07:00.0: BAR 0 [mem 0xfcd00000-0xfcd03fff 64bit]
[    0.274227][    T1] pci 0000:00:02.4: PCI bridge to [bus 07]
[    0.274301][    T1] pci 0000:08:00.0: [1002:1638] type 00 class 0x030000 PCIe Legacy Endpoint
[    0.274312][    T1] pci 0000:08:00.0: BAR 0 [mem 0xfe20000000-0xfe2fffffff 64bit pref]
[    0.274320][    T1] pci 0000:08:00.0: BAR 2 [mem 0xfe30000000-0xfe301fffff 64bit pref]
[    0.274325][    T1] pci 0000:08:00.0: BAR 4 [io  0xe000-0xe0ff]
[    0.274331][    T1] pci 0000:08:00.0: BAR 5 [mem 0xfc900000-0xfc97ffff]
[    0.274339][    T1] pci 0000:08:00.0: enabling Extended Tags
[    0.274389][    T1] pci 0000:08:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.274423][    T1] pci 0000:08:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x16 link at 0000:00:08.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    0.274503][    T1] pci 0000:08:00.1: [1002:1637] type 00 class 0x040300 PCIe Legacy Endpoint
[    0.274510][    T1] pci 0000:08:00.1: BAR 0 [mem 0xfc9c8000-0xfc9cbfff]
[    0.274531][    T1] pci 0000:08:00.1: enabling Extended Tags
[    0.274561][    T1] pci 0000:08:00.1: PME# supported from D1 D2 D3hot D3cold
[    0.274633][    T1] pci 0000:08:00.2: [1022:15df] type 00 class 0x108000 PCIe Endpoint
[    0.274906][    T1] pci 0000:08:00.2: BAR 2 [mem 0xfc800000-0xfc8fffff]
[    0.274916][    T1] pci 0000:08:00.2: BAR 5 [mem 0xfc9ce000-0xfc9cffff]
[    0.274923][    T1] pci 0000:08:00.2: enabling Extended Tags
[    0.275022][    T1] pci 0000:08:00.3: [1022:1639] type 00 class 0x0c0330 PCIe Endpoint
[    0.275033][    T1] pci 0000:08:00.3: BAR 0 [mem 0xfc700000-0xfc7fffff 64bit]
[    0.275056][    T1] pci 0000:08:00.3: enabling Extended Tags
[    0.275088][    T1] pci 0000:08:00.3: PME# supported from D0 D3hot D3cold
[    0.275166][    T1] pci 0000:08:00.4: [1022:1639] type 00 class 0x0c0330 PCIe Endpoint
[    0.275176][    T1] pci 0000:08:00.4: BAR 0 [mem 0xfc600000-0xfc6fffff 64bit]
[    0.275199][    T1] pci 0000:08:00.4: enabling Extended Tags
[    0.275231][    T1] pci 0000:08:00.4: PME# supported from D0 D3hot D3cold
[    0.275308][    T1] pci 0000:08:00.5: [1022:15e2] type 00 class 0x048000 PCIe Endpoint
[    0.275316][    T1] pci 0000:08:00.5: BAR 0 [mem 0xfc980000-0xfc9bffff]
[    0.275336][    T1] pci 0000:08:00.5: enabling Extended Tags
[    0.275366][    T1] pci 0000:08:00.5: PME# supported from D0 D3hot D3cold
[    0.275437][    T1] pci 0000:08:00.6: [1022:15e3] type 00 class 0x040300 PCIe Endpoint
[    0.275444][    T1] pci 0000:08:00.6: BAR 0 [mem 0xfc9c0000-0xfc9c7fff]
[    0.275465][    T1] pci 0000:08:00.6: enabling Extended Tags
[    0.275495][    T1] pci 0000:08:00.6: PME# supported from D0 D3hot D3cold
[    0.275566][    T1] pci 0000:08:00.7: [1022:15e4] type 00 class 0x118000 PCIe Endpoint
[    0.275579][    T1] pci 0000:08:00.7: BAR 2 [mem 0xfc500000-0xfc5fffff]
[    0.275589][    T1] pci 0000:08:00.7: BAR 5 [mem 0xfc9cc000-0xfc9cdfff]
[    0.275596][    T1] pci 0000:08:00.7: enabling Extended Tags
[    0.275717][    T1] pci 0000:00:08.1: PCI bridge to [bus 08]
[    0.275744][    T1] pci_bus 0000:00: on NUMA node 0
[    0.276538][    T1] ACPI: PCI: Interrupt link LNKA configured for IRQ 0
[    0.276581][    T1] ACPI: PCI: Interrupt link LNKB configured for IRQ 0
[    0.276618][    T1] ACPI: PCI: Interrupt link LNKC configured for IRQ 0
[    0.276663][    T1] ACPI: PCI: Interrupt link LNKD configured for IRQ 0
[    0.276704][    T1] ACPI: PCI: Interrupt link LNKE configured for IRQ 0
[    0.276739][    T1] ACPI: PCI: Interrupt link LNKF configured for IRQ 0
[    0.276773][    T1] ACPI: PCI: Interrupt link LNKG configured for IRQ 0
[    0.276807][    T1] ACPI: PCI: Interrupt link LNKH configured for IRQ 0
[    0.277776][    T1] Low-power S0 idle used by default for system suspend
[    0.279208][    T1] ACPI: EC: interrupt unblocked
[    0.279209][    T1] ACPI: EC: event unblocked
[    0.279211][    T1] ACPI: EC: EC_CMD/EC_SC=0x66, EC_DATA=0x62
[    0.279213][    T1] ACPI: EC: GPE=0x3
[    0.279214][    T1] ACPI: \_SB_.PCI0.SBRG.EC__: Boot ECDT EC initialization complete
[    0.279216][    T1] ACPI: \_SB_.PCI0.SBRG.EC__: EC: Used to handle transactions and events
[    0.279252][    T1] iommu: Default domain type: Passthrough
[    0.279252][    T1] EDAC MC: Ver: 3.0.0
[    0.279399][    T1] efivars: Registered efivars operations
[    0.279976][    T1] PCI: Using ACPI for IRQ routing
[    0.284124][    T1] PCI: pci_cache_line_size set to 64 bytes
[    0.284895][    T1] e820: reserve RAM buffer [mem 0x09bff000-0x0bffffff]
[    0.284897][    T1] e820: reserve RAM buffer [mem 0x0a200000-0x0bffffff]
[    0.284898][    T1] e820: reserve RAM buffer [mem 0xe62ee000-0xe7ffffff]
[    0.284899][    T1] e820: reserve RAM buffer [mem 0xe87cb000-0xebffffff]
[    0.284901][    T1] e820: reserve RAM buffer [mem 0xe9e20000-0xebffffff]
[    0.284902][    T1] e820: reserve RAM buffer [mem 0xee000000-0xefffffff]
[    0.284903][    T1] e820: reserve RAM buffer [mem 0xfee300000-0xfefffffff]
[    0.284954][    T1] pci 0000:08:00.0: vgaarb: setting as boot VGA device
[    0.284954][    T1] pci 0000:08:00.0: vgaarb: bridge control possible
[    0.284954][    T1] pci 0000:08:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.284954][    T1] vgaarb: loaded
[    0.285346][    T1] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    0.285357][    T1] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[    0.286938][    T1] clocksource: Switched to clocksource hpet
[    0.287057][    T1] pnp: PnP ACPI init
[    0.287125][    T1] system 00:00: [mem 0xf0000000-0xf7ffffff] has been reserved
[    0.287578][    T1] system 00:04: [io  0x04d0-0x04d1] has been reserved
[    0.287581][    T1] system 00:04: [io  0x040b] has been reserved
[    0.287582][    T1] system 00:04: [io  0x04d6] has been reserved
[    0.287584][    T1] system 00:04: [io  0x0c00-0x0c01] has been reserved
[    0.287585][    T1] system 00:04: [io  0x0c14] has been reserved
[    0.287587][    T1] system 00:04: [io  0x0c50-0x0c51] has been reserved
[    0.287588][    T1] system 00:04: [io  0x0c52] has been reserved
[    0.287590][    T1] system 00:04: [io  0x0c6c] has been reserved
[    0.287591][    T1] system 00:04: [io  0x0c6f] has been reserved
[    0.287592][    T1] system 00:04: [io  0x0cd8-0x0cdf] has been reserved
[    0.287594][    T1] system 00:04: [io  0x0800-0x089f] has been reserved
[    0.287595][    T1] system 00:04: [io  0x0b00-0x0b0f] has been reserved
[    0.287597][    T1] system 00:04: [io  0x0b20-0x0b3f] has been reserved
[    0.287598][    T1] system 00:04: [io  0x0900-0x090f] has been reserved
[    0.287600][    T1] system 00:04: [io  0x0910-0x091f] has been reserved
[    0.287601][    T1] system 00:04: [mem 0xfec00000-0xfec00fff] could not be reserved
[    0.287603][    T1] system 00:04: [mem 0xfec01000-0xfec01fff] could not be reserved
[    0.287605][    T1] system 00:04: [mem 0xfedc0000-0xfedc0fff] has been reserved
[    0.287607][    T1] system 00:04: [mem 0xfee00000-0xfee00fff] has been reserved
[    0.287608][    T1] system 00:04: [mem 0xfed80000-0xfed8ffff] could not be reserved
[    0.287610][    T1] system 00:04: [mem 0xfec10000-0xfec10fff] has been reserved
[    0.287612][    T1] system 00:04: [mem 0xff000000-0xffffffff] has been reserved
[    0.288282][    T1] pnp: PnP ACPI: found 5 devices
[    0.294558][    T1] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    0.294640][    T1] NET: Registered PF_INET protocol family
[    0.294822][    T1] IP idents hash table entries: 262144 (order: 9, 2097152 bytes, linear)
[    0.296849][    T1] tcp_listen_portaddr_hash hash table entries: 32768 (order: 7, 524288 bytes, linear)
[    0.296891][    T1] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.296898][    T1] TCP established hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[    0.297220][    T1] TCP bind hash table entries: 65536 (order: 9, 2097152 bytes, linear)
[    0.297370][    T1] TCP: Hash tables configured (established 524288 bind 65536)
[    0.297564][    T1] MPTCP token hash table entries: 65536 (order: 8, 1572864 bytes, linear)
[    0.297613][    T1] UDP hash table entries: 32768 (order: 8, 1048576 bytes, linear)
[    0.297695][    T1] UDP-Lite hash table entries: 32768 (order: 8, 1048576 bytes, linear)
[    0.297816][    T1] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.297822][    T1] NET: Registered PF_XDP protocol family
[    0.297829][    T1] pci 0000:00:01.1: bridge window [io  0x1000-0x0fff] to [bus 01-03] add_size 1000
[    0.297837][    T1] pci 0000:00:01.1: bridge window [io  0x1000-0x1fff]: assigned
[    0.297839][    T1] pci 0000:02:00.0: PCI bridge to [bus 03]
[    0.297848][    T1] pci 0000:02:00.0:   bridge window [mem 0xfca00000-0xfcbfffff]
[    0.297851][    T1] pci 0000:02:00.0:   bridge window [mem 0xfc00000000-0xfe0fffffff 64bit pref]
[    0.297856][    T1] pci 0000:01:00.0: PCI bridge to [bus 02-03]
[    0.297860][    T1] pci 0000:01:00.0:   bridge window [mem 0xfca00000-0xfcbfffff]
[    0.297863][    T1] pci 0000:01:00.0:   bridge window [mem 0xfc00000000-0xfe0fffffff 64bit pref]
[    0.297868][    T1] pci 0000:00:01.1: PCI bridge to [bus 01-03]
[    0.297870][    T1] pci 0000:00:01.1:   bridge window [io  0x1000-0x1fff]
[    0.297872][    T1] pci 0000:00:01.1:   bridge window [mem 0xfca00000-0xfccfffff]
[    0.297875][    T1] pci 0000:00:01.1:   bridge window [mem 0xfc00000000-0xfe0fffffff 64bit pref]
[    0.297879][    T1] pci 0000:00:02.1: PCI bridge to [bus 04]
[    0.297882][    T1] pci 0000:00:02.1:   bridge window [mem 0xfe30300000-0xfe304fffff 64bit pref]
[    0.297886][    T1] pci 0000:00:02.2: PCI bridge to [bus 05]
[    0.297887][    T1] pci 0000:00:02.2:   bridge window [io  0xf000-0xffff]
[    0.297890][    T1] pci 0000:00:02.2:   bridge window [mem 0xfcf00000-0xfcffffff]
[    0.297894][    T1] pci 0000:00:02.3: PCI bridge to [bus 06]
[    0.297897][    T1] pci 0000:00:02.3:   bridge window [mem 0xfce00000-0xfcefffff]
[    0.297901][    T1] pci 0000:00:02.4: PCI bridge to [bus 07]
[    0.297904][    T1] pci 0000:00:02.4:   bridge window [mem 0xfcd00000-0xfcdfffff]
[    0.297908][    T1] pci 0000:00:08.1: PCI bridge to [bus 08]
[    0.297910][    T1] pci 0000:00:08.1:   bridge window [io  0xe000-0xefff]
[    0.297913][    T1] pci 0000:00:08.1:   bridge window [mem 0xfc500000-0xfc9fffff]
[    0.297915][    T1] pci 0000:00:08.1:   bridge window [mem 0xfe20000000-0xfe301fffff 64bit pref]
[    0.297918][    T1] pci_bus 0000:00: resource 4 [io  0x0000-0x03af window]
[    0.297920][    T1] pci_bus 0000:00: resource 5 [io  0x03e0-0x0cf7 window]
[    0.297922][    T1] pci_bus 0000:00: resource 6 [io  0x03b0-0x03df window]
[    0.297923][    T1] pci_bus 0000:00: resource 7 [io  0x0d00-0xffff window]
[    0.297924][    T1] pci_bus 0000:00: resource 8 [mem 0x000a0000-0x000dffff window]
[    0.297926][    T1] pci_bus 0000:00: resource 9 [mem 0xf0000000-0xfcffffff window]
[    0.297927][    T1] pci_bus 0000:00: resource 10 [mem 0x1010000000-0xffffffffff window]
[    0.297929][    T1] pci_bus 0000:01: resource 0 [io  0x1000-0x1fff]
[    0.297930][    T1] pci_bus 0000:01: resource 1 [mem 0xfca00000-0xfccfffff]
[    0.297931][    T1] pci_bus 0000:01: resource 2 [mem 0xfc00000000-0xfe0fffffff 64bit pref]
[    0.297933][    T1] pci_bus 0000:02: resource 1 [mem 0xfca00000-0xfcbfffff]
[    0.297934][    T1] pci_bus 0000:02: resource 2 [mem 0xfc00000000-0xfe0fffffff 64bit pref]
[    0.297935][    T1] pci_bus 0000:03: resource 1 [mem 0xfca00000-0xfcbfffff]
[    0.297937][    T1] pci_bus 0000:03: resource 2 [mem 0xfc00000000-0xfe0fffffff 64bit pref]
[    0.297938][    T1] pci_bus 0000:04: resource 2 [mem 0xfe30300000-0xfe304fffff 64bit pref]
[    0.297939][    T1] pci_bus 0000:05: resource 0 [io  0xf000-0xffff]
[    0.297941][    T1] pci_bus 0000:05: resource 1 [mem 0xfcf00000-0xfcffffff]
[    0.297942][    T1] pci_bus 0000:06: resource 1 [mem 0xfce00000-0xfcefffff]
[    0.297944][    T1] pci_bus 0000:07: resource 1 [mem 0xfcd00000-0xfcdfffff]
[    0.297945][    T1] pci_bus 0000:08: resource 0 [io  0xe000-0xefff]
[    0.297946][    T1] pci_bus 0000:08: resource 1 [mem 0xfc500000-0xfc9fffff]
[    0.297947][    T1] pci_bus 0000:08: resource 2 [mem 0xfe20000000-0xfe301fffff 64bit pref]
[    0.298041][    T1] pci 0000:03:00.1: D0 power state depends on 0000:03:00.0
[    0.298435][    T1] pci 0000:08:00.1: D0 power state depends on 0000:08:00.0
[    0.298441][    T1] pci 0000:08:00.3: extending delay after power-on from D3hot to 20 msec
[    0.298565][    T1] pci 0000:08:00.4: extending delay after power-on from D3hot to 20 msec
[    0.298634][    T1] PCI: CLS 64 bytes, default 64
[    0.298642][    T1] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.298680][  T102] Trying to unpack rootfs image as initramfs...
[    0.298683][    T1] pci 0000:00:01.0: Adding to iommu group 0
[    0.298697][    T1] pci 0000:00:01.1: Adding to iommu group 1
[    0.298718][    T1] pci 0000:00:02.0: Adding to iommu group 2
[    0.298731][    T1] pci 0000:00:02.1: Adding to iommu group 3
[    0.298743][    T1] pci 0000:00:02.2: Adding to iommu group 4
[    0.298756][    T1] pci 0000:00:02.3: Adding to iommu group 5
[    0.298769][    T1] pci 0000:00:02.4: Adding to iommu group 6
[    0.298788][    T1] pci 0000:00:08.0: Adding to iommu group 7
[    0.298801][    T1] pci 0000:00:08.1: Adding to iommu group 8
[    0.298825][    T1] pci 0000:00:14.0: Adding to iommu group 9
[    0.298837][    T1] pci 0000:00:14.3: Adding to iommu group 9
[    0.298896][    T1] pci 0000:00:18.0: Adding to iommu group 10
[    0.298909][    T1] pci 0000:00:18.1: Adding to iommu group 10
[    0.298922][    T1] pci 0000:00:18.2: Adding to iommu group 10
[    0.298934][    T1] pci 0000:00:18.3: Adding to iommu group 10
[    0.298947][    T1] pci 0000:00:18.4: Adding to iommu group 10
[    0.298959][    T1] pci 0000:00:18.5: Adding to iommu group 10
[    0.298971][    T1] pci 0000:00:18.6: Adding to iommu group 10
[    0.298985][    T1] pci 0000:00:18.7: Adding to iommu group 10
[    0.298998][    T1] pci 0000:01:00.0: Adding to iommu group 11
[    0.299011][    T1] pci 0000:02:00.0: Adding to iommu group 12
[    0.299032][    T1] pci 0000:03:00.0: Adding to iommu group 13
[    0.299048][    T1] pci 0000:03:00.1: Adding to iommu group 14
[    0.299061][    T1] pci 0000:04:00.0: Adding to iommu group 15
[    0.299074][    T1] pci 0000:05:00.0: Adding to iommu group 16
[    0.299087][    T1] pci 0000:06:00.0: Adding to iommu group 17
[    0.299100][    T1] pci 0000:07:00.0: Adding to iommu group 18
[    0.299119][    T1] pci 0000:08:00.0: Adding to iommu group 19
[    0.299132][    T1] pci 0000:08:00.1: Adding to iommu group 20
[    0.299146][    T1] pci 0000:08:00.2: Adding to iommu group 21
[    0.299160][    T1] pci 0000:08:00.3: Adding to iommu group 22
[    0.299173][    T1] pci 0000:08:00.4: Adding to iommu group 23
[    0.299187][    T1] pci 0000:08:00.5: Adding to iommu group 24
[    0.299201][    T1] pci 0000:08:00.6: Adding to iommu group 25
[    0.299214][    T1] pci 0000:08:00.7: Adding to iommu group 26
[    0.299491][    T1] AMD-Vi: Extended features (0x206d73ef22254ade, 0x0): PPR X2APIC NX GT IA GA PC GA_vAPIC
[    0.299498][    T1] AMD-Vi: Interrupt remapping enabled
[    0.299499][    T1] AMD-Vi: X2APIC enabled
[    0.299673][    T1] AMD-Vi: Virtual APIC enabled
[    0.299686][    T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    0.299687][    T1] software IO TLB: mapped [mem 0x00000000e1e6d000-0x00000000e5e6d000] (64MB)
[    0.299717][    T1] LVT offset 0 assigned for vector 0x400
[    0.302395][    T1] perf: AMD IBS detected (0x000003ff)
[    0.302540][   T20] amd_uncore: 4 amd_df counters detected
[    0.302545][   T20] amd_uncore: 6 amd_l3 counters detected
[    0.302687][    T1] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    0.303139][    T1] Initialise system trusted keyrings
[    0.303176][    T1] workingset: timestamp_bits=46 max_order=24 bucket_order=0
[    0.303181][    T1] zbud: loaded
[    0.312181][    T1] Key type asymmetric registered
[    0.312183][    T1] Asymmetric key parser 'x509' registered
[    0.312201][    T1] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 250)
[    0.312251][    T1] io scheduler bfq registered
[    0.317815][    T1] pcieport 0000:00:01.1: PME: Signaling with IRQ 43
[    0.317836][    T1] pcieport 0000:00:01.1: pciehp: Slot #0 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
[    0.317999][    T1] pcieport 0000:00:02.1: PME: Signaling with IRQ 44
[    0.318094][    T1] pcieport 0000:00:02.2: PME: Signaling with IRQ 45
[    0.318190][    T1] pcieport 0000:00:02.3: PME: Signaling with IRQ 46
[    0.318298][    T1] pcieport 0000:00:02.4: PME: Signaling with IRQ 47
[    0.318398][    T1] pcieport 0000:00:08.1: PME: Signaling with IRQ 48
[    0.318799][    T1] ACPI: video: Video Device [VGA] (multi-head: yes  rom: no post: no)
[    0.319026][    T1] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:13/LNXVIDEO:00/input/input0
[    0.319189][    T1] Estimated ratio of average max frequency by base frequency (times 1024): 1226
[    0.319203][    T9] Monitor-Mwait will be used to enter C-1 state
[    0.319209][    T1] ACPI: \_SB_.PLTF.P000: Found 3 idle states
[    0.319322][    T1] ACPI: \_SB_.PLTF.P001: Found 3 idle states
[    0.319420][    T1] ACPI: \_SB_.PLTF.P002: Found 3 idle states
[    0.323268][    T1] ACPI: \_SB_.PLTF.P003: Found 3 idle states
[    0.323416][    T1] ACPI: \_SB_.PLTF.P004: Found 3 idle states
[    0.323584][    T1] ACPI: \_SB_.PLTF.P005: Found 3 idle states
[    0.323745][    T1] ACPI: \_SB_.PLTF.P006: Found 3 idle states
[    0.323862][    T1] ACPI: \_SB_.PLTF.P007: Found 3 idle states
[    0.324013][    T1] ACPI: \_SB_.PLTF.P008: Found 3 idle states
[    0.324171][    T1] ACPI: \_SB_.PLTF.P009: Found 3 idle states
[    0.324322][    T1] ACPI: \_SB_.PLTF.P00A: Found 3 idle states
[    0.324476][    T1] ACPI: \_SB_.PLTF.P00B: Found 3 idle states
[    0.324623][    T1] ACPI: \_SB_.PLTF.P00C: Found 3 idle states
[    0.324758][    T1] ACPI: \_SB_.PLTF.P00D: Found 3 idle states
[    0.324886][    T1] ACPI: \_SB_.PLTF.P00E: Found 3 idle states
[    0.325025][    T1] ACPI: \_SB_.PLTF.P00F: Found 3 idle states
[    0.325894][    T1] thermal LNXTHERM:00: registered as thermal_zone0
[    0.325896][    T1] ACPI: thermal: Thermal Zone [THRM] (41 C)
[    0.326142][    T1] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    0.339293][    T1] tpm_crb MSFT0101:00: Disabling hwrng
[    0.339759][    T1] ACPI: bus type drm_connector registered
[    0.341800][    T1] i8042: PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
[    0.344984][    T1] serio: i8042 KBD port at 0x60,0x64 irq 1
[    0.345036][    T1] serio: i8042 AUX port at 0x60,0x64 irq 12
[    0.345372][    T1] mousedev: PS/2 mouse device common for all mice
[    0.345393][    T1] rtc_cmos 00:01: RTC can wake from S4
[    0.345734][    T1] rtc_cmos 00:01: registered as rtc0
[    0.345792][    T1] rtc_cmos 00:01: setting system clock to 2024-04-11T11:49:58 UTC (1712836198)
[    0.345822][    T1] rtc_cmos 00:01: alarms up to one month, y3k, 114 bytes nvram
[    0.349641][    T1] efifb: probing for efifb
[    0.349653][    T1] efifb: framebuffer at 0xfe20000000, using 3072k, total 3072k
[    0.349656][    T1] efifb: mode is 1024x768x32, linelength=4096, pages=1
[    0.349658][    T1] efifb: scrolling: redraw
[    0.349660][    T1] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[    0.349844][    T1] Console: switching to colour frame buffer device 128x48
[    0.356727][  T133] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
[    0.359933][    T1] fb0: EFI VGA frame buffer device
[    0.360037][    T1] NET: Registered PF_INET6 protocol family
[    0.463985][  T102] Freeing initrd memory: 44172K
[    0.470689][    T1] Segment Routing with IPv6
[    0.470705][    T1] In-situ OAM (IOAM) with IPv6
[    0.470733][    T1] mip6: Mobile IPv6
[    0.470738][    T1] NET: Registered PF_PACKET protocol family
[    0.470785][    T1] mpls_gso: MPLS GSO support
[    0.472876][    T1] microcode: Current revision: 0x0a50000c
[    0.473527][    T1] resctrl: L3 allocation detected
[    0.473529][    T1] resctrl: MB allocation detected
[    0.473530][    T1] resctrl: L3 monitoring detected
[    0.473550][    T1] IPI shorthand broadcast: enabled
[    0.474596][    T1] sched_clock: Marking stable (473003978, 1254441)->(491155385, -16896966)
[    0.474930][    T1] Timer migration: 2 hierarchy levels; 8 children per group; 2 crossnode level
[    0.475105][    T1] registered taskstats version 1
[    0.475172][    T1] Loading compiled-in X.509 certificates
[    0.523119][    T1] Loaded X.509 cert 'Build time autogenerated kernel key: d585c95a2505853bfd51f8eba85ec95fff6d0af3'
[    0.524364][    T1] Key type .fscrypt registered
[    0.524366][    T1] Key type fscrypt-provisioning registered
[    0.590489][    T1] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.GP17.MP2], AE_NOT_FOUND (20230628/psargs-330)
[    0.590497][   T97] pci_bus 0000:03: Allocating resources
[    0.590783][    T1] ACPI Error: Aborting method \_SB.GPIO._EVT due to previous error (AE_NOT_FOUND) (20230628/psparse-529)
[    0.591110][    T1] clk: Disabling unused clocks
[    0.591112][    T1] PM: genpd: Disabling unused power domains
[    0.591347][    T1] Freeing unused kernel image (initmem) memory: 1372K
[    0.591354][    T1] Write protecting the kernel read-only data: 16384k
[    0.591573][    T1] Freeing unused kernel image (rodata/data gap) memory: 516K
[    0.632221][    T1] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[    0.632225][    T1] Run /init as init process
[    0.632226][    T1]   with arguments:
[    0.632227][    T1]     /init
[    0.632228][    T1]   with environment:
[    0.632229][    T1]     HOME=/
[    0.632230][    T1]     TERM=linux
[    0.632231][    T1]     BOOT_IMAGE=/boot/vmlinuz-6.9.0-rc3-next-20240411
[    0.733106][  T260] piix4_smbus 0000:00:14.0: SMBus Host Controller at 0xb00, revision 0
[    0.733112][  T260] piix4_smbus 0000:00:14.0: Using register 0x02 for SMBus port selection
[    0.733166][  T260] piix4_smbus 0000:00:14.0: Auxiliary SMBus Host Controller at 0xb20
[    0.733293][  T265] hid: raw HID events driver (C) Jiri Kosina
[    0.734462][  T279] pcie_mp2_amd 0000:08:00.7: enabling device (0000 -> 0002)
[    0.736728][  T250] ACPI: bus type USB registered
[    0.736751][  T250] usbcore: registered new interface driver usbfs
[    0.736758][  T250] usbcore: registered new interface driver hub
[    0.736769][  T250] usbcore: registered new device driver usb
[    0.740873][  T280] r8169 0000:05:00.0 eth0: RTL8168h/8111h, d8:bb:c1:ab:dd:5e, XID 541, IRQ 54
[    0.740877][  T280] r8169 0000:05:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[    0.753906][  T297] hid-generic 0020:1022:0001.0001: hidraw0: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.753980][  T297] hid-generic 0020:1022:0001.0002: hidraw1: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.754052][  T297] hid-generic 0020:1022:0001.0003: hidraw2: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.754123][  T297] hid-generic 0020:1022:0001.0004: hidraw3: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.754182][  T297] hid-generic 0020:1022:0001.0005: hidraw4: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.754249][  T297] hid-generic 0020:1022:0001.0006: hidraw5: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.880457][   T97] input: PNP0C50:0e 06CB:7E7E Mouse as /devices/platform/AMDI0010:03/i2c-0/i2c-PNP0C50:0e/0018:06CB:7E7E.0007/input/input4
[    0.880518][   T97] input: PNP0C50:0e 06CB:7E7E Touchpad as /devices/platform/AMDI0010:03/i2c-0/i2c-PNP0C50:0e/0018:06CB:7E7E.0007/input/input5
[    0.880598][   T97] hid-generic 0018:06CB:7E7E.0007: input,hidraw6: I2C HID v1.00 Mouse [PNP0C50:0e 06CB:7E7E] on i2c-PNP0C50:0e
[    0.881436][  T254] r8169 0000:05:00.0 enp5s0: renamed from eth0
[    0.882259][  T295] hid-sensor-hub 0020:1022:0001.0001: hidraw0: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.882381][  T295] hid-sensor-hub 0020:1022:0001.0002: hidraw1: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.882493][  T295] hid-sensor-hub 0020:1022:0001.0003: hidraw2: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.882601][  T295] hid-sensor-hub 0020:1022:0001.0004: hidraw3: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.882709][  T295] hid-sensor-hub 0020:1022:0001.0005: hidraw4: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.882837][  T295] hid-sensor-hub 0020:1022:0001.0006: hidraw5: SENSOR HUB HID v0.00 Device [hid-amdsfh 1022:0001] on pcie_mp2_amd
[    0.883719][  T250] xhci_hcd 0000:08:00.3: xHCI Host Controller
[    0.883726][  T250] xhci_hcd 0000:08:00.3: new USB bus registered, assigned bus number 1
[    0.883806][  T250] xhci_hcd 0000:08:00.3: hcc params 0x0268ffe5 hci version 0x110 quirks 0x0000020000000410
[    0.884062][  T250] xhci_hcd 0000:08:00.3: xHCI Host Controller
[    0.884065][  T250] xhci_hcd 0000:08:00.3: new USB bus registered, assigned bus number 2
[    0.884067][  T250] xhci_hcd 0000:08:00.3: Host supports USB 3.1 Enhanced SuperSpeed
[    0.884103][  T250] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.09
[    0.884105][  T250] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    0.884106][  T250] usb usb1: Product: xHCI Host Controller
[    0.884108][  T250] usb usb1: Manufacturer: Linux 6.9.0-rc3-next-20240411 xhci-hcd
[    0.884109][  T250] usb usb1: SerialNumber: 0000:08:00.3
[    0.884203][   T97] nvme 0000:06:00.0: platform quirk: setting simple suspend
[    0.884206][  T102] nvme 0000:07:00.0: platform quirk: setting simple suspend
[    0.884250][  T250] hub 1-0:1.0: USB hub found
[    0.884265][  T250] hub 1-0:1.0: 4 ports detected
[    0.884272][  T102] nvme nvme1: pci function 0000:07:00.0
[    0.884273][   T97] nvme nvme0: pci function 0000:06:00.0
[    0.888973][  T250] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    0.889000][  T250] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.09
[    0.889002][  T250] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    0.889004][  T250] usb usb2: Product: xHCI Host Controller
[    0.889005][  T250] usb usb2: Manufacturer: Linux 6.9.0-rc3-next-20240411 xhci-hcd
[    0.889008][  T250] usb usb2: SerialNumber: 0000:08:00.3
[    0.889109][  T250] hub 2-0:1.0: USB hub found
[    0.889119][  T250] hub 2-0:1.0: 2 ports detected
[    0.889499][  T250] xhci_hcd 0000:08:00.4: xHCI Host Controller
[    0.889503][  T250] xhci_hcd 0000:08:00.4: new USB bus registered, assigned bus number 3
[    0.889583][  T250] xhci_hcd 0000:08:00.4: hcc params 0x0268ffe5 hci version 0x110 quirks 0x0000020000000410
[    0.889805][  T250] xhci_hcd 0000:08:00.4: xHCI Host Controller
[    0.889808][  T250] xhci_hcd 0000:08:00.4: new USB bus registered, assigned bus number 4
[    0.889810][  T250] xhci_hcd 0000:08:00.4: Host supports USB 3.1 Enhanced SuperSpeed
[    0.889869][  T250] usb usb3: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.09
[    0.889872][  T250] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    0.889874][  T250] usb usb3: Product: xHCI Host Controller
[    0.889875][  T250] usb usb3: Manufacturer: Linux 6.9.0-rc3-next-20240411 xhci-hcd
[    0.889876][  T250] usb usb3: SerialNumber: 0000:08:00.4
[    0.889971][  T250] hub 3-0:1.0: USB hub found
[    0.889982][  T250] hub 3-0:1.0: 4 ports detected
[    0.890601][  T250] usb usb4: We don't know the algorithms for LPM for this host, disabling LPM.
[    0.890622][  T250] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.09
[    0.890624][  T250] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    0.890626][  T250] usb usb4: Product: xHCI Host Controller
[    0.890627][  T250] usb usb4: Manufacturer: Linux 6.9.0-rc3-next-20240411 xhci-hcd
[    0.890629][  T250] usb usb4: SerialNumber: 0000:08:00.4
[    0.890721][  T250] hub 4-0:1.0: USB hub found
[    0.890727][  T250] hub 4-0:1.0: 2 ports detected
[    0.892506][  T102] nvme nvme1: missing or invalid SUBNQN field.
[    0.894439][   T97] nvme nvme0: D3 entry latency set to 10 seconds
[    0.897479][   T97] nvme nvme0: 16/0/0 default/read/poll queues
[    0.899635][   T11]  nvme0n1: p1 p2 p3 p4
[    0.900698][  T102] nvme nvme1: 15/0/0 default/read/poll queues
[    0.909376][   T97]  nvme1n1: p1
[    0.966669][  T249] [drm] amdgpu kernel modesetting enabled.
[    0.966687][  T249] amdgpu: vga_switcheroo: detected switching method \_SB_.PCI0.GP17.VGA_.ATPX handle
[    0.967008][  T249] amdgpu: ATPX version 1, functions 0x00000001
[    0.967045][  T249] amdgpu: ATPX Hybrid Graphics
[    0.973338][  T249] amdgpu: Virtual CRAT table created for CPU
[    0.973349][  T249] amdgpu: Topology: Add CPU node
[    0.973447][  T249] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[    0.973480][  T249] [drm] initializing kernel modesetting (DIMGREY_CAVEFISH 0x1002:0x73FF 0x1462:0x1313 0xC3).
[    0.973488][  T249] [drm] register mmio base: 0xFCA00000
[    0.973490][  T249] [drm] register mmio size: 1048576
[    0.973552][  T249] [drm] MCBP is enabled
[    0.977600][  T249] [drm] add ip block number 0 <nv_common>
[    0.977602][  T249] [drm] add ip block number 1 <gmc_v10_0>
[    0.977603][  T249] [drm] add ip block number 2 <navi10_ih>
[    0.977604][  T249] [drm] add ip block number 3 <psp>
[    0.977605][  T249] [drm] add ip block number 4 <smu>
[    0.977606][  T249] [drm] add ip block number 5 <dm>
[    0.977607][  T249] [drm] add ip block number 6 <gfx_v10_0>
[    0.977608][  T249] [drm] add ip block number 7 <sdma_v5_2>
[    0.977609][  T249] [drm] add ip block number 8 <vcn_v3_0>
[    0.977609][  T249] [drm] add ip block number 9 <jpeg_v3_0>
[    0.977617][  T249] amdgpu 0000:03:00.0: amdgpu: ACPI VFCT table present but broken (too short #2),skipping
[    0.985158][  T249] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    0.985160][  T249] amdgpu: ATOM BIOS: SWBRT77181.001
[    0.990982][  T249] [drm] VCN(0) decode is enabled in VM mode
[    0.990984][  T249] [drm] VCN(0) encode is enabled in VM mode
[    0.992536][  T249] [drm] JPEG decode is enabled in VM mode
[    0.992544][  T249] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    0.992551][  T249] [drm] GPU posting now...
[    0.992615][  T249] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[    0.992621][  T249] amdgpu 0000:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    0.992623][  T249] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    0.992631][  T249] [drm] Detected VRAM RAM=8176M, BAR=8192M
[    0.992632][  T249] [drm] RAM width 128bits GDDR6
[    0.992738][  T249] [drm] amdgpu: 8176M of VRAM memory ready
[    0.992739][  T249] [drm] amdgpu: 31853M of GTT memory ready.
[    0.992749][  T249] [drm] GART: num cpu pages 131072, num gpu pages 131072
[    0.992874][  T249] [drm] PCIE GART of 512M enabled (table at 0x00000081FEB00000).
[    1.129262][   T54] usb 1-4: new high-speed USB device number 2 using xhci_hcd
[    1.130612][   T34] usb 3-2: new low-speed USB device number 2 using xhci_hcd
[    1.266127][   T34] usb 3-2: New USB device found, idVendor=1bcf, idProduct=08a0, bcdDevice= 1.04
[    1.266129][   T34] usb 3-2: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    1.266491][   T54] usb 1-4: New USB device found, idVendor=30c9, idProduct=0042, bcdDevice= 0.03
[    1.266495][   T54] usb 1-4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    1.266497][   T54] usb 1-4: Product: Integrated Camera
[    1.266500][   T54] usb 1-4: Manufacturer: S1F0009330LB620L420004LP
[    1.266501][   T54] usb 1-4: SerialNumber: SunplusIT Inc
[    1.293204][  T276] input: HID 1bcf:08a0 Mouse as /devices/pci0000:00/0000:00:08.1/0000:08:00.4/usb3/3-2/3-2:1.0/0003:1BCF:08A0.0008/input/input7
[    1.293284][  T276] input: HID 1bcf:08a0 Keyboard as /devices/pci0000:00/0000:00:08.1/0000:08:00.4/usb3/3-2/3-2:1.0/0003:1BCF:08A0.0008/input/input8
[    1.343263][  T129] tsc: Refined TSC clocksource calibration: 3193.999 MHz
[    1.343268][  T129] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x2e0a24cf65f, max_idle_ns: 440795271781 ns
[    1.345366][  T276] input: HID 1bcf:08a0 as /devices/pci0000:00/0000:00:08.1/0000:08:00.4/usb3/3-2/3-2:1.0/0003:1BCF:08A0.0008/input/input9
[    1.345453][  T276] hid-generic 0003:1BCF:08A0.0008: input,hiddev0,hidraw7: USB HID v1.10 Mouse [HID 1bcf:08a0] on usb-0000:08:00.4-2/input0
[    1.345479][  T276] usbcore: registered new interface driver usbhid
[    1.345481][  T276] usbhid: USB HID core driver
[    1.397268][   T34] usb 3-3: new high-speed USB device number 3 using xhci_hcd
[    1.527260][   T34] usb 3-3: New USB device found, idVendor=0e8d, idProduct=0608, bcdDevice= 1.00
[    1.527262][   T34] usb 3-3: New USB device strings: Mfr=5, Product=6, SerialNumber=7
[    1.527264][   T34] usb 3-3: Product: Wireless_Device
[    1.527266][   T34] usb 3-3: Manufacturer: MediaTek Inc.
[    1.527267][   T34] usb 3-3: SerialNumber: 000000000
[    1.650264][   T34] usb 3-4: new full-speed USB device number 4 using xhci_hcd
[    1.809130][   T34] usb 3-4: New USB device found, idVendor=1462, idProduct=1563, bcdDevice= 2.00
[    1.809132][   T34] usb 3-4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    1.809134][   T34] usb 3-4: Product: MysticLight MS-1563 v0001
[    1.809135][   T34] usb 3-4: Manufacturer: MSI
[    1.809137][   T34] usb 3-4: SerialNumber: 2064386A5430
[    1.843969][   T34] hid-generic 0003:1462:1563.0009: hiddev1,hidraw8: USB HID v1.11 Device [MSI MysticLight MS-1563 v0001] on usb-0000:08:00.4-4/input0
[    3.279775][  T249] amdgpu 0000:03:00.0: amdgpu: STB initialized to 2048 entries
[    3.279836][  T249] [drm] Loading DMUB firmware via PSP: version=0x0202001E
[    3.280117][  T249] [drm] use_doorbell being set to: [true]
[    3.280128][  T249] [drm] use_doorbell being set to: [true]
[    3.280138][  T249] [drm] Found VCN firmware Version ENC: 1.27 DEC: 2 VEP: 0 Revision: 0
[    3.280144][  T249] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[    3.443691][  T249] amdgpu 0000:03:00.0: amdgpu: reserve 0xa00000 from 0x81fd000000 for PSP TMR
[    3.524926][  T249] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    3.536157][  T249] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    3.536179][  T249] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b2b00 (59.43.0)
[    3.536184][  T249] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[    3.536216][  T249] amdgpu 0000:03:00.0: amdgpu: use vbios provided pptable
[    3.587190][  T249] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!
[    3.587550][  T249] [drm] Display Core v3.2.279 initialized on DCN 3.0.2
[    3.587552][  T249] [drm] DP-HDMI FRL PCON supported
[    3.588783][  T249] [drm] DMUB hardware initialized: version=0x0202001E
[    3.621482][  T249] [drm] kiq ring mec 2 pipe 1 q 0
[    3.628043][  T249] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[    3.628407][  T249] [drm] JPEG decode initialized successfully.
[    3.656005][  T249] amdgpu: HMM registered 8176MB device memory
[    3.656750][  T249] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    3.656764][  T249] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[    3.656991][  T249] amdgpu: Virtual CRAT table created for GPU
[    3.657146][  T249] amdgpu: Topology: Add dGPU node [0x73ff:0x1002]
[    3.657148][  T249] kfd kfd: amdgpu: added device 1002:73ff
[    3.657170][  T249] amdgpu 0000:03:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 8, active_cu_number 28
[    3.657175][  T249] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    3.657177][  T249] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[    3.657178][  T249] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[    3.657180][  T249] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[    3.657181][  T249] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[    3.657182][  T249] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[    3.657184][  T249] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[    3.657185][  T249] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[    3.657186][  T249] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[    3.657188][  T249] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[    3.657189][  T249] amdgpu 0000:03:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[    3.657190][  T249] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[    3.657191][  T249] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[    3.657193][  T249] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[    3.657194][  T249] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[    3.657195][  T249] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[    3.657197][  T249] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[    3.658212][  T249] amdgpu 0000:03:00.0: amdgpu: Using BOCO for runtime pm
[    3.664390][  T249] [drm] Initialized amdgpu 3.57.0 20150101 for 0000:03:00.0 on minor 0
[    3.667597][  T249] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
[    3.667610][  T249] [drm] DSC precompute is not needed.
[    3.667785][  T249] amdgpu 0000:08:00.0: enabling device (0006 -> 0007)
[    3.667817][  T249] [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1462:0x1313 0xC5).
[    3.667825][  T249] [drm] register mmio base: 0xFC900000
[    3.667826][  T249] [drm] register mmio size: 524288
[    3.667879][  T249] [drm] MCBP is enabled
[    3.670679][  T249] [drm] add ip block number 0 <soc15_common>
[    3.670681][  T249] [drm] add ip block number 1 <gmc_v9_0>
[    3.670682][  T249] [drm] add ip block number 2 <vega10_ih>
[    3.670683][  T249] [drm] add ip block number 3 <psp>
[    3.670684][  T249] [drm] add ip block number 4 <smu>
[    3.670685][  T249] [drm] add ip block number 5 <dm>
[    3.670686][  T249] [drm] add ip block number 6 <gfx_v9_0>
[    3.670687][  T249] [drm] add ip block number 7 <sdma_v4_0>
[    3.670689][  T249] [drm] add ip block number 8 <vcn_v2_0>
[    3.670690][  T249] [drm] add ip block number 9 <jpeg_v2_0>
[    3.670698][  T249] amdgpu 0000:08:00.0: amdgpu: Fetched VBIOS from VFCT
[    3.670700][  T249] amdgpu: ATOM BIOS: 113-CEZANNE-018
[    3.673008][  T249] [drm] VCN decode is enabled in VM mode
[    3.673010][  T249] [drm] VCN encode is enabled in VM mode
[    3.674035][  T249] [drm] JPEG decode is enabled in VM mode
[    3.674107][  T249] Console: switching to colour dummy device 80x25
[    3.674125][  T249] amdgpu 0000:08:00.0: vgaarb: deactivate vga console
[    3.674128][  T249] amdgpu 0000:08:00.0: amdgpu: Trusted Memory Zone (TMZ)
feature enabled
[    3.674130][  T249] amdgpu 0000:08:00.0: amdgpu: MODE2 reset
[    3.674278][  T249] [drm] vm size is 262144 GB, 4 levels, block size is 9-
bit, fragment size is 9-bit
[    3.674283][  T249] amdgpu 0000:08:00.0: amdgpu: VRAM: 512M
0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
[    3.674285][  T249] amdgpu 0000:08:00.0: amdgpu: GART: 1024M
0x0000000000000000 - 0x000000003FFFFFFF
[    3.674290][  T249] [drm] Detected VRAM RAM=512M, BAR=512M
[    3.674291][  T249] [drm] RAM width 128bits DDR4
[    3.674348][  T249] [drm] amdgpu: 512M of VRAM memory ready
[    3.674349][  T249] [drm] amdgpu: 31853M of GTT memory ready.
[    3.674360][  T249] [drm] GART: num cpu pages 262144, num gpu pages 262144
[    3.674446][  T249] [drm] PCIE GART of 1024M enabled.
[    3.674449][  T249] [drm] PTB located at 0x000000F41FC00000
[    3.674708][  T249] [drm] Loading DMUB firmware via PSP: version=0x01010027
[    3.675066][  T249] [drm] Found VCN firmware Version ENC: 1.20 DEC: 5 VEP: 0
Revision: 3
[    3.675072][  T249] amdgpu 0000:08:00.0: amdgpu: Will use PSP to load VCN
firmware
[    4.388355][  T249] amdgpu 0000:08:00.0: amdgpu: reserve 0x400000 from
0xf41f800000 for PSP TMR
[    4.475365][  T249] amdgpu 0000:08:00.0: amdgpu: RAS: optional ras ta ucode
is not available
[    4.484120][  T249] amdgpu 0000:08:00.0: amdgpu: RAP: optional rap ta ucode
is not available
[    4.484122][  T249] amdgpu 0000:08:00.0: amdgpu: SECUREDISPLAY: securedisplay
ta ucode is not available
[    4.484870][  T249] amdgpu 0000:08:00.0: amdgpu: SMU is initialized
successfully!
[    4.485986][  T249] [drm] Display Core v3.2.279 initialized on DCN 2.1
[    4.485988][  T249] [drm] DP-HDMI FRL PCON supported
[    4.486529][  T249] [drm] DMUB hardware initialized: version=0x01010027
[    4.640450][  T249] [drm] kiq ring mec 2 pipe 1 q 0
[    4.644112][  T249] [drm] VCN decode and encode initialized
successfully(under DPG Mode).
[    4.644131][  T249] [drm] JPEG decode initialized successfully.
[    4.650935][  T249] amdgpu: HMM registered 512MB device memory
[    4.651843][  T249] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    4.651850][  T249] kfd kfd: amdgpu: Total number of KFD nodes to be created:
1
[    4.651960][  T249] amdgpu: Virtual CRAT table created for GPU
[    4.652165][  T249] amdgpu: Topology: Add dGPU node [0x1638:0x1002]
[    4.652167][  T249] kfd kfd: amdgpu: added device 1002:1638
[    4.652175][  T249] amdgpu 0000:08:00.0: amdgpu: SE 1, SH per SE 1, CU per SH
8, active_cu_number 8
[    4.652177][  T249] amdgpu 0000:08:00.0: amdgpu: ring gfx uses VM inv eng 0
on hub 0
[    4.652179][  T249] amdgpu 0000:08:00.0: amdgpu: ring gfx_low uses VM inv eng
1 on hub 0
[    4.652180][  T249] amdgpu 0000:08:00.0: amdgpu: ring gfx_high uses VM inv
eng 4 on hub 0
[    4.652181][  T249] amdgpu 0000:08:00.0: amdgpu: ring comp_1.0.0 uses VM inv
eng 5 on hub 0
[    4.652183][  T249] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.0 uses VM inv
eng 6 on hub 0
[    4.652184][  T249] amdgpu 0000:08:00.0: amdgpu: ring comp_1.2.0 uses VM inv
eng 7 on hub 0
[    4.652185][  T249] amdgpu 0000:08:00.0: amdgpu: ring comp_1.3.0 uses VM inv
eng 8 on hub 0
[    4.652186][  T249] amdgpu 0000:08:00.0: amdgpu: ring comp_1.0.1 uses VM inv
eng 9 on hub 0
[    4.652187][  T249] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 uses VM inv
eng 10 on hub 0
[    4.652189][  T249] amdgpu 0000:08:00.0: amdgpu: ring comp_1.2.1 uses VM inv
eng 11 on hub 0
[    4.652190][  T249] amdgpu 0000:08:00.0: amdgpu: ring comp_1.3.1 uses VM inv
eng 12 on hub 0
[    4.652191][  T249] amdgpu 0000:08:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv
eng 13 on hub 0
[    4.652192][  T249] amdgpu 0000:08:00.0: amdgpu: ring sdma0 uses VM inv eng 0
on hub 8
[    4.652194][  T249] amdgpu 0000:08:00.0: amdgpu: ring vcn_dec uses VM inv eng
1 on hub 8
[    4.652195][  T249] amdgpu 0000:08:00.0: amdgpu: ring vcn_enc0 uses VM inv
eng 4 on hub 8
[    4.652196][  T249] amdgpu 0000:08:00.0: amdgpu: ring vcn_enc1 uses VM inv
eng 5 on hub 8
[    4.652197][  T249] amdgpu 0000:08:00.0: amdgpu: ring jpeg_dec uses VM inv
eng 6 on hub 8
[    4.653231][  T249] amdgpu 0000:08:00.0: amdgpu: NO pm mode for runtime pm
[    4.653445][  T249] [drm] Initialized amdgpu 3.57.0 20150101 for 0000:08:00.0
on minor 1
[    4.659451][  T249] fbcon: amdgpudrmfb (fb0) is primary device
[    4.672585][  T249] Console: switching to colour frame buffer device 240x67
[    4.681073][  T249] amdgpu 0000:08:00.0: [drm] fb0: amdgpudrmfb frame buffer
device
[    4.716182][  T360] PM: Image not found (code -22)
[    4.755027][  T373] EXT4-fs (nvme0n1p2): mounted filesystem 73e0f015-c115-
4eb2-92cb-dbf7da2b6112 ro with ordered data mode. Quota mode: disabled.
[    4.991053][  T426] RPC: Registered named UNIX socket transport module.
[    4.991057][  T426] RPC: Registered udp transport module.
[    4.991058][  T426] RPC: Registered tcp transport module.
[    4.991059][  T426] RPC: Registered tcp-with-tls transport module.
[    4.991060][  T426] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    4.991233][  T437] pstore: Using crash dump compression: deflate
[    4.991273][  T433] device-mapper: uevent: version 1.0.3
[    4.991355][  T433] device-mapper: ioctl: 4.48.0-ioctl (2023-03-01)
initialised: dm-devel@lists.linux.dev
[    4.992542][  T439] fuse: init (API version 7.40)
[    4.992613][  T437] pstore: Registered efi_pstore as persistent store backend
[    4.993631][  T441] loop: module loaded
[    5.010537][  T448] cfg80211: Loading compiled-in X.509 certificates for
regulatory database
[    5.010675][  T448] Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[    5.010803][  T448] Loaded X.509 cert 'wens:
61c038651aabdcf94bd0ac7ff06c7248db18c600'
[    5.011798][  T133] cfg80211: loaded regulatory.db is malformed or signature
is missing/invalid
[    5.012504][  T452] EXT4-fs (nvme0n1p2): re-mounted 73e0f015-c115-4eb2-92cb-
dbf7da2b6112 r/w. Quota mode: disabled.
[    5.027951][  T448] mt7921e 0000:04:00.0: enabling device (0000 -> 0002)
[    5.034269][  T448] mt7921e 0000:04:00.0: ASIC revision: 79610010
[    5.109736][  T141] mt7921e 0000:04:00.0: HW/SW Version: 0x8a108a10, Build
Time: 20230117170855a
[    5.109736][  T141]
[    5.117064][  T521] input: Lid Switch as
/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:33/PNP0C09:00/PNP0C0D:00/inpu
t/input10
[    5.122183][  T141] mt7921e 0000:04:00.0: WM Firmware Version: ____010000,
Build Time: 20230117170942
[    5.125132][  T524] ACPI: AC: AC Adapter [ADP1] (on-line)
[    5.126686][  T487] ccp 0000:08:00.2: enabling device (0000 -> 0002)
[    5.129580][  T487] ccp 0000:08:00.2: tee enabled
[    5.129641][  T487] ccp 0000:08:00.2: psp enabled
[    5.131172][  T521] ACPI: button: Lid Switch [LID0]
[    5.132862][  T521] input: Power Button as
/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input11
[    5.137041][  T540] RAPL PMU: API unit is 2^-32 Joules, 1 fixed counters,
163840 ms ovfl timer
[    5.137045][  T540] RAPL PMU: hw unit of domain package 2^-16 Joules
[    5.139967][  T538] mc: Linux media interface: v0.10
[    5.157394][  T521] ACPI: button: Power Button [PWRB]
[    5.157470][  T521] input: Sleep Button as
/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0E:00/input/input12
[    5.157539][  T521] ACPI: button: Sleep Button [SLPB]
[    5.165218][  T480] input: PNP0C50:0e 06CB:7E7E Mouse as
/devices/platform/AMDI0010:03/i2c-0/i2c-
PNP0C50:0e/0018:06CB:7E7E.0007/input/input13
[    5.165371][  T480] input: PNP0C50:0e 06CB:7E7E Touchpad as
/devices/platform/AMDI0010:03/i2c-0/i2c-
PNP0C50:0e/0018:06CB:7E7E.0007/input/input14
[    5.165524][  T480] hid-multitouch 0018:06CB:7E7E.0007: input,hidraw6: I2C
HID v1.00 Mouse [PNP0C50:0e 06CB:7E7E] on i2c-PNP0C50:0e
[    5.167160][  T563] Adding 75497468k swap on /dev/nvme0n1p3.  Priority:-2
extents:1 across:75497468k SS
[    5.232350][   T11] ACPI: battery: Slot [BAT1] (battery present)
[    5.233905][  T466] input: MSI WMI hotkeys as /devices/virtual/input/input16
[    5.236367][  T523] MCE: In-kernel MCE decoding enabled.
[    5.238888][  T473] snd_rn_pci_acp3x 0000:08:00.5: enabling device (0000 ->
0002)
[    5.240439][  T538] videodev: Linux video capture interface: v2.00
[    5.250246][  T538] usb 1-4: Found UVC 1.00 device Integrated Camera
(30c9:0042)
[    5.268960][  T538] usbcore: registered new interface driver uvcvideo
[    5.279677][  T526] snd_hda_intel 0000:03:00.1: enabling device (0000 ->
0002)
[    5.279795][  T526] snd_hda_intel 0000:03:00.1: Handle vga_switcheroo audio
client
[    5.279799][  T526] snd_hda_intel 0000:03:00.1: Force to non-snoop mode
[    5.279906][  T526] snd_hda_intel 0000:08:00.1: enabling device (0000 ->
0002)
[    5.279953][  T526] snd_hda_intel 0000:08:00.1: Handle vga_switcheroo audio
client
[    5.280084][  T526] snd_hda_intel 0000:08:00.6: enabling device (0000 ->
0002)
[    5.280899][  T523] AMD Address Translation Library initialized
[    5.282987][  T525] Bluetooth: Core ver 2.22
[    5.283006][  T525] NET: Registered PF_BLUETOOTH protocol family
[    5.283007][  T525] Bluetooth: HCI device and connection manager initialized
[    5.283011][  T525] Bluetooth: HCI socket layer initialized
[    5.283013][  T525] Bluetooth: L2CAP socket layer initialized
[    5.283016][  T525] Bluetooth: SCO socket layer initialized
[    5.284911][  T522] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops
amdgpu_dm_audio_component_bind_ops [amdgpu])
[    5.286343][  T522] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops
amdgpu_dm_audio_component_bind_ops [amdgpu])
[    5.287824][  T133] input: HD-Audio Generic HDMI/DP,pcm=3 as
/devices/pci0000:00/0000:00:08.1/0000:08:00.1/sound/card2/input17
[    5.287889][  T133] input: HD-Audio Generic HDMI/DP,pcm=7 as
/devices/pci0000:00/0000:00:08.1/0000:08:00.1/sound/card2/input18
[    5.287921][  T142] input: HDA ATI HDMI HDMI/DP,pcm=3 as
/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.1/sound/ca
rd1/input21
[    5.287933][  T133] input: HD-Audio Generic HDMI/DP,pcm=8 as
/devices/pci0000:00/0000:00:08.1/0000:08:00.1/sound/card2/input19
[    5.287997][  T142] input: HDA ATI HDMI HDMI/DP,pcm=7 as
/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.1/sound/ca
rd1/input22
[    5.288023][  T133] input: HD-Audio Generic HDMI/DP,pcm=9 as
/devices/pci0000:00/0000:00:08.1/0000:08:00.1/sound/card2/input20
[    5.288452][  T142] input: HDA ATI HDMI HDMI/DP,pcm=8 as
/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.1/sound/ca
rd1/input23
[    5.289540][  T142] input: HDA ATI HDMI HDMI/DP,pcm=9 as
/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.1/sound/ca
rd1/input24
[    5.289846][  T525] usbcore: registered new interface driver btusb
[    5.289942][  T142] input: HDA ATI HDMI HDMI/DP,pcm=10 as
/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.1/sound/ca
rd1/input25
[    5.291103][  T163] bluetooth hci0: Direct firmware load for
mediatek/BT_RAM_CODE_MT7961_1a_2_hdr.bin failed with error -2
[    5.291107][  T163] Bluetooth: hci0: Failed to load firmware file (-2)
[    5.291276][  T163] Bluetooth: hci0: Failed to set up firmware (-2)
[    5.291291][  T163] Bluetooth: hci0: HCI Enhanced Setup Synchronous
Connection command is advertised, but not supported.
[    5.297300][  T538] snd_hda_codec_realtek hdaudioC3D0: autoconfig for ALC233:
line_outs=1 (0x14/0x0/0x0/0x0/0x0) type:speaker
[    5.297305][  T538] snd_hda_codec_realtek hdaudioC3D0:    speaker_outs=0
(0x0/0x0/0x0/0x0/0x0)
[    5.297307][  T538] snd_hda_codec_realtek hdaudioC3D0:    hp_outs=1
(0x21/0x0/0x0/0x0/0x0)
[    5.297309][  T538] snd_hda_codec_realtek hdaudioC3D0:    mono: mono_out=0x0
[    5.297310][  T538] snd_hda_codec_realtek hdaudioC3D0:    inputs:
[    5.297312][  T538] snd_hda_codec_realtek hdaudioC3D0:      Mic=0x19
[    5.340339][  T310] input: HD-Audio Generic Mic as
/devices/pci0000:00/0000:00:08.1/0000:08:00.6/sound/card3/input26
[    5.340395][  T310] input: HD-Audio Generic Headphone as
/devices/pci0000:00/0000:00:08.1/0000:08:00.6/sound/card3/input27
[    6.033937][  T701] EXT4-fs (nvme1n1p1): mounted filesystem 85e13cd1-3c57-
4343-a1f5-6209e530b640 r/w with ordered data mode. Quota mode: disabled.
[    6.035563][  T699] EXT4-fs (nvme0n1p4): mounted filesystem d21e6ad6-bc46-
4b61-bc20-e4d2f4bf719a r/w with ordered data mode. Quota mode: disabled.
[    6.101205][  T829] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[    6.101209][  T829] Bluetooth: BNEP filters: protocol multicast
[    6.101214][  T829] Bluetooth: BNEP socket layer initialized
[    6.188275][  T773] Generic FE-GE Realtek PHY r8169-0-500:00: attached PHY
driver (mii_bus:phy_addr=r8169-0-500:00, irq=MAC)
[    6.363505][  T130] r8169 0000:05:00.0 enp5s0: Link is Down
[    6.736542][  T469] mt7921e 0000:04:00.0 wlp4s0: renamed from wlan0
1130]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or
directory
[    9.577740][  T785] wlp4s0: authenticate with 54:67:51:3d:a2:e0 (local
address=c8:94:02:c1:bd:69)
[    9.693216][  T785] wlp4s0: send auth to 54:67:51:3d:a2:e0 (try 1/3)
[    9.697085][   T11] wlp4s0: authenticated
[    9.698222][   T11] wlp4s0: associate with 54:67:51:3d:a2:e0 (try 1/3)
[    9.725683][   T11] wlp4s0: RX AssocResp from 54:67:51:3d:a2:e0 (capab=0x1411
status=0 aid=2)
[    9.753578][   T11] wlp4s0: associated
[    9.867576][   T11] wlp4s0: Limiting TX power to 20 (20 - 0) dBm as
advertised by 54:67:51:3d:a2:e0
[   55.005089][  T773] wlp4s0: deauthenticating from 54:67:51:3d:a2:e0 by local
choice (Reason: 3=DEAUTH_LEAVING)
[   59.110719][  T785] wlp4s0: authenticate with 54:67:51:3d:a2:d2 (local
address=c8:94:02:c1:bd:69)
[   59.611020][  T785] wlp4s0: send auth to 54:67:51:3d:a2:d2 (try 1/3)
[   59.633421][   T11] wlp4s0: authenticated
[   59.634543][   T11] wlp4s0: associate with 54:67:51:3d:a2:d2 (try 1/3)
[   59.695669][   T11] wlp4s0: RX AssocResp from 54:67:51:3d:a2:d2 (capab=0x511
status=0 aid=1)
[   59.722226][   T11] wlp4s0: associated
[   60.729778][   T11] wlp4s0: deauthenticated from 54:67:51:3d:a2:d2 (Reason:
15=4WAY_HANDSHAKE_TIMEOUT)
[   65.750903][ T1189] NFSD: Unable to initialize client recovery tracking! (-
110)
[   65.750914][ T1189] NFSD: starting 90-second grace period (net f0000000)
[   90.742908][  T335] [drm] PCIE GART of 512M enabled (table at
0x00000081FEB00000).
[   90.742946][  T335] amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
[   90.919222][  T335] amdgpu 0000:03:00.0: amdgpu: reserve 0xa00000 from
0x81fd000000 for PSP TMR
[   90.999655][  T335] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode
is not available
[   91.011007][  T335] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay
ta ucode is not available
[   91.011011][  T335] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[   91.011016][  T335] amdgpu 0000:03:00.0: amdgpu: smu driver if version =
0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version =
0x003b2b00 (59.43.0)
[   91.011021][  T335] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not
matched
[   91.062712][  T335] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[   91.063966][  T335] [drm] DMUB hardware initialized: version=0x0202001E
[   91.084778][  T335] [drm] kiq ring mec 2 pipe 1 q 0
[   91.089816][  T335] [drm] VCN decode and encode initialized
successfully(under DPG Mode).
[   91.090002][  T335] [drm] JPEG decode initialized successfully.
[   91.090032][  T335] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv
eng 0 on hub 0
[   91.090035][  T335] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.1.0 uses VM inv
eng 1 on hub 0
[   91.090036][  T335] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv
eng 4 on hub 0
[   91.090037][  T335] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv
eng 5 on hub 0
[   91.090038][  T335] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv
eng 6 on hub 0
[   91.090040][  T335] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv
eng 7 on hub 0
[   91.090041][  T335] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv
eng 8 on hub 0
[   91.090042][  T335] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv
eng 9 on hub 0
[   91.090043][  T335] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv
eng 10 on hub 0
[   91.090044][  T335] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv
eng 11 on hub 0
[   91.090046][  T335] amdgpu 0000:03:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv
eng 12 on hub 0
[   91.090047][  T335] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng
13 on hub 0
[   91.090048][  T335] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng
14 on hub 0
[   91.090050][  T335] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv
eng 0 on hub 8
[   91.090051][  T335] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv
eng 1 on hub 8
[   91.090052][  T335] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv
eng 4 on hub 8
[   91.090053][  T335] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv
eng 5 on hub 8
[   91.093917][  T335] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
[  205.757123][  T129] [drm] PCIE GART of 512M enabled (table at
0x00000081FEB00000).
[  205.757166][  T129] amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
[  205.933524][  T129] amdgpu 0000:03:00.0: amdgpu: reserve 0xa00000 from
0x81fd000000 for PSP TMR
[  206.014750][  T129] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode
is not available
[  206.025993][  T129] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay
ta ucode is not available
[  206.025998][  T129] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[  206.026003][  T129] amdgpu 0000:03:00.0: amdgpu: smu driver if version =
0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version =
0x003b2b00 (59.43.0)
[  206.026008][  T129] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not
matched
[  206.076907][  T129] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[  206.078162][  T129] [drm] DMUB hardware initialized: version=0x0202001E
[  206.099136][  T129] [drm] kiq ring mec 2 pipe 1 q 0
[  206.103954][  T129] [drm] VCN decode and encode initialized
successfully(under DPG Mode).
[  206.104112][  T129] [drm] JPEG decode initialized successfully.
[  206.104142][  T129] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv
eng 0 on hub 0
[  206.104144][  T129] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.1.0 uses VM inv
eng 1 on hub 0
[  206.104145][  T129] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv
eng 4 on hub 0
[  206.104146][  T129] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv
eng 5 on hub 0
[  206.104148][  T129] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv
eng 6 on hub 0
[  206.104149][  T129] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv
eng 7 on hub 0
[  206.104150][  T129] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv
eng 8 on hub 0
[  206.104151][  T129] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv
eng 9 on hub 0
[  206.104152][  T129] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv
eng 10 on hub 0
[  206.104154][  T129] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv
eng 11 on hub 0
[  206.104155][  T129] amdgpu 0000:03:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv
eng 12 on hub 0
[  206.104156][  T129] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng
13 on hub 0
[  206.104158][  T129] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng
14 on hub 0
[  206.104159][  T129] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv
eng 0 on hub 8
[  206.104160][  T129] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv
eng 1 on hub 8
[  206.104161][  T129] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv
eng 4 on hub 8
[  206.104163][  T129] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv
eng 5 on hub 8
[  206.108228][  T129] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes


Bert Karwatzki


* Re: [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio
  2024-04-10 14:29  2% [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
                   ` (3 preceding siblings ...)
  2024-04-10 14:29  5% ` [RFC PATCH v4 23/34] ext4: implement buffered read iomap path Zhang Yi
@ 2024-04-11  1:12  0% ` Zhang Yi
  2024-04-24  8:12  0% ` Zhang Yi
  5 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-04-11  1:12 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, david, willy, zokeefe, yi.zhang,
	chengzhihao1, yukuai3, wangkefeng.wang

On 2024/4/10 22:29, Zhang Yi wrote:
> Hello!
> 
> This is the fourth version of the RFC patch series that converts ext4
> regular file's buffered IO path to iomap and enables large folios. I've
> rebased it on 6.9-rc3; it also **depends on my xfs/iomap fix series**,
> which has been reviewed but not merged yet[1]. Compared to the third
> version, this iteration fixes an issue discovered in the current ext4
> code and contains two further main changes: 1) add bigalloc support and
> 2) simplify the update logic for reserved delalloc data blocks. Both
> changes could be sent out as preliminary patch series; besides these,
> the rest are small code cleanups, performance optimizations, and commit
> log improvements. Please take a look at this series; any comments are
> welcome.
> 

I've uploaded this series and the xfs/iomap changes it depends on to my
GitHub repository; feel free to check it out.

https://github.com/zhangyi089/linux/commits/ext4_buffered_iomap_v4/

Thanks,
Yi.

> This series supports ext4 with the default features and mount options
> (bigalloc is also supported); it doesn't support non-extent (ext3),
> inline_data, dax, fs_verity, fs_crypt, or data=journal mode. ext4 falls
> back to the buffer_head path automatically if you enable those features
> or options. Although it has many limitations now, it can satisfy the
> requirements of most common cases and brings a significant performance
> benefit for large IOs.
> 
> The iomap path would be somewhat simpler than the buffer_head path;
> please note that there are 4 major differences:
> 1. Always allocate unwritten extents for new blocks, which means this
>    is not controlled by the dioread_nolock mount option.
> 2. Because of 1, there is no risk of exposing stale data during append
>    writes, so we don't need to write back data before metadata; it's
>    time to drop 'data=ordered' mode automatically.
> 3. Because of 2, we don't need to reserve journal credits and use a
>    reserved handle for the extent status conversion during writeback.
> 4. We can postpone updating i_disksize to the endio, which avoids
>    exposing zero data during an append write followed by a sudden
>    power failure.
> 
> Series details:
> Patches 1-9: the part 2 preparation series. It first fixes a problem,
> then makes the ext4_insert_delayed_block() call path support inserting
> multiple delalloc blocks (also supporting bigalloc), and finally makes
> ext4_da_map_blocks() buffer_head unaware. I've sent it out
> separately[2] and hope it can be merged first.
> 
> Patches 10-19: the part 3 preparatory changes (picked out from my
> metadata reservation series[3]; these are not strong dependencies, but
> I'd suggest merging them before the iomap conversion). These patches
> move ext4_da_update_reserve_space() to ext4_es_insert_extent(), and
> always set EXT4_GET_BLOCKS_DELALLOC_RESERVE when allocating delalloc
> blocks, no matter whether they come from the delayed allocation or the
> non-delayed allocation (fallocate) path, which makes delalloc extents
> always delonly. This makes delalloc reservation simpler and cleaner
> than before.
> 
> Patches 20-34: These patches are the main implementation of the
> buffered IO iomap conversion. They first introduce a sequence counter
> for the extent status tree, then add a new iomap aops for read, write,
> and mmap, replacing the current buffer_head path. Finally, if the user
> specifies the "buffered_iomap" mount option, they enable the iomap path
> except for inline_data, non-extent, dax, fs_verity, fs_crypt, defrag
> and data=journal mode, and also enable large folios. Please look at the
> following patches for details.
> 
> About Tests:
>  - Passes kvm-xfstests in auto mode; stress tests and fault injection
>    tests keep running.
>  - Performance tests below (run on my version 3 series; theoretically
>    there won't be much difference in this version).
> 
>    Fio tests with psync on my machine: Intel Xeon Gold 6240 CPU,
>    400GB of system RAM, a 200GB ramdisk and a 1TB NVMe SSD.
> 
>    == buffer read ==
> 
>                   buffer head        iomap + large folio
>    type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)
>    ----------------------------------------------------
>    hole     4K    565k    2206       811k    3167
>    hole     64K   45.1k   2820       78.1k   4879
>    hole     1M    2744    2744       4890    4891
>    ramdisk  4K    436k    1703       554k    2163
>    ramdisk  64K   29.6k   1848       44.0k   2747
>    ramdisk  1M    1994    1995       2809    2809
>    nvme     4K    306k    1196       324k    1267
>    nvme     64K   19.3k   1208       24.3k   1517
>    nvme     1M    1694    1694       2256    2256
> 
>    == buffer write ==
> 
>                                         buffer head  iomap + large folio
>    type   Overwrite Sync Writeback bs   IOPS   BW    IOPS   BW
>    ------------------------------------------------------------
>    cache    N       N    N         4K   395k   1544  415k   1621
>    cache    N       N    N         64K  30.8k  1928  80.1k  5005
>    cache    N       N    N         1M   1963   1963  5641   5642
>    cache    Y       N    N         4K   423k   1652  443k   1730
>    cache    Y       N    N         64K  33.0k  2063  80.8k  5051
>    cache    Y       N    N         1M   2103   2103  5588   5589
>    ramdisk  N       N    Y         4K   362k   1416  307k   1198
>    ramdisk  N       N    Y         64K  22.4k  1399  64.8k  4050
>    ramdisk  N       N    Y         1M   1670   1670  4559   4560
>    ramdisk  N       Y    N         4K   9830   38.4  13.5k  52.8
>    ramdisk  N       Y    N         64K  5834   365   10.1k  629
>    ramdisk  N       Y    N         1M   1011   1011  2064   2064
>    ramdisk  Y       N    Y         4K   397k   1550  409k   1598
>    ramdisk  Y       N    Y         64K  29.2k  1827  73.6k  4597
>    ramdisk  Y       N    Y         1M   1837   1837  4985   4985
>    ramdisk  Y       Y    N         4K   173k   675   182k   710
>    ramdisk  Y       Y    N         64K  17.7k  1109  33.7k  2105
>    ramdisk  Y       Y    N         1M   1128   1129  1790   1791
>    nvme     N       N    Y         4K   298k   1164  290k   1134
>    nvme     N       N    Y         64K  21.5k  1343  57.4k  3590
>    nvme     N       N    Y         1M   1308   1308  3664   3664
>    nvme     N       Y    N         4K   10.7k  41.8  12.0k  46.9
>    nvme     N       Y    N         64K  5962   373   8598   537
>    nvme     N       Y    N         1M   676    677   1417   1418
>    nvme     Y       N    Y         4K   366k   1430  373k   1456
>    nvme     Y       N    Y         64K  26.7k  1670  56.8k  3547
>    nvme     Y       N    Y         1M   1745   1746  3586   3586
>    nvme     Y       Y    N         4K   59.0k  230   61.2k  239
>    nvme     Y       Y    N         64K  13.0k  813   21.0k  1311
>    nvme     Y       Y    N         1M   683    683   1368   1369
>  
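As a quick reader-side sanity check (not part of the original mail): the
reported bandwidth should equal IOPS times block size. A small script,
using a few rows quoted from the buffered-read table above (iomap +
large folio column), confirms the figures are self-consistent:

```python
# Sanity-check: BW(MiB/s) ~= IOPS * block size, using rows quoted
# from the buffered-read table above (iomap + large folio column).
rows = [
    # (block size in KiB, IOPS, reported BW in MiB/s)
    (4,    811_000, 3167),   # hole, 4K
    (64,   78_100,  4879),   # hole, 64K
    (1024, 4_890,   4891),   # hole, 1M
]

for bs_kib, iops, reported_bw in rows:
    computed_bw = iops * bs_kib / 1024  # MiB/s
    # allow ~1% slack for rounding in the reported figures
    assert abs(computed_bw - reported_bw) / reported_bw < 0.01
```

All three rows agree to well under 1%, so the table's IOPS and BW
columns are mutually consistent.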
> TODO
>  - Keep on doing stress tests and fixing.
>  - Reserve enough space for delalloc metadata blocks and try to drop
>    ext4_nonda_switch().
>  - First support defrag, and then support more of the currently
>    unsupported features and mount options.
> 
> Changes since v3:
>  - Drop the part 1 preparatory patches, which have been merged [4].
>  - Drop the two iomap patches since I've submitted them separately [1].
>  - Fix an incorrect reserved delalloc block count and an incorrect
>    extent status cache issue found in the current ext4 code.
>  - Pick out the part 2 preparatory patch series [2]; it makes the
>    ext4_insert_delayed_block() call path support inserting multiple
>    delalloc blocks (also supporting bigalloc) and makes
>    ext4_da_map_blocks() buffer_head unaware.
>  - Adjust and simplify the reserved delalloc block updating logic,
>    preparing for reserving metadata blocks for delalloc.
>  - Drop the datasync dirty check in ext4_set_iomap() for buffered
>    read/write, which improves concurrent performance on small I/Os.
>  - Avoid always holding invalidate_lock in page_cache_ra_order(); add
>    a lockless check.
>  - Disable the iomap path by default since it's new and experimental;
>    add a mount option "buffered_iomap" to enable it.
>  - Some other minor fixes and changelog improvements.
> Changes since v2:
>  - Update patches 1-6 to v3.
>  - iomap_zero and iomap_unshare don't need to update i_size or call
>    iomap_write_failed(); introduce a new helper iomap_write_end_simple()
>    to avoid doing that.
>  - Factor the ext4_[ext|ind]_map_blocks() parts out of ext4_map_blocks()
>    and introduce a new helper ext4_iomap_map_one_extent() to allocate
>    delalloc blocks in writeback, which is always under i_data_sem in
>    write mode. This is done to prevent delalloc extents that are being
>    written back from becoming stale if raced by a truncate.
>  - Add lock detection in mapping_clear_large_folios().
> Changes since v1:
>  - Introduce a seq count for iomap buffered write and writeback to
>    protect against races from extent changes, e.g. truncate, mwrite.
>  - Always allocate unwritten extents for new blocks, drop dioread_lock
>    mode, and make no distinction between dioread_lock and
>    dioread_nolock.
>  - Don't add the dirty data range to jinode, drop data=ordered mode, and
>    make no distinction between data=ordered and data=writeback mode.
>  - Postpone updating i_disksize to the endio.
>  - Allow splitting extents and using reserved space in the endio.
>  - Instead of reimplementing a new delayed mapping helper
>    ext4_iomap_da_map_blocks() for buffered write, try to reuse
>    ext4_da_map_blocks().
>  - Add support for disabling large folios on active inodes.
>  - Support online defragmentation; make the file fall back to
>    buffer_head and disable large folios in ext4_move_extents().
>  - Move ext4_nonda_switch() in advance to prevent a deadlock in mwrite.
>  - Add dirty_len and pos trace info to trace_iomap_writepage_map().
>  - Update patches 1-6 to v2.
> 
> [1] https://lore.kernel.org/linux-xfs/20240320110548.2200662-1-yi.zhang@huaweicloud.com/
> [2] https://lore.kernel.org/linux-ext4/20240410034203.2188357-1-yi.zhang@huaweicloud.com/
> [3] https://lore.kernel.org/linux-ext4/20230824092619.1327976-1-yi.zhang@huaweicloud.com/
> [4] https://lore.kernel.org/linux-ext4/20240105033018.1665752-1-yi.zhang@huaweicloud.com/
> 
> Thanks,
> Yi.
> 
> ---
> v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
> v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
> v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
> 
> Zhang Yi (34):
>   ext4: factor out a common helper to query extent map
>   ext4: check the extent status again before inserting delalloc block
>   ext4: trim delalloc extent
>   ext4: drop iblock parameter
>   ext4: make ext4_es_insert_delayed_block() insert multi-blocks
>   ext4: make ext4_da_reserve_space() reserve multi-clusters
>   ext4: factor out check for whether a cluster is allocated
>   ext4: make ext4_insert_delayed_block() insert multi-blocks
>   ext4: make ext4_da_map_blocks() buffer_head unaware
>   ext4: factor out ext4_map_create_blocks() to allocate new blocks
>   ext4: optimize the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag set
>   ext4: don't set EXTENT_STATUS_DELAYED on allocated blocks
>   ext4: let __revise_pending() return newly inserted pendings
>   ext4: count removed reserved blocks for delalloc only extent entry
>   ext4: update delalloc data reserve spcae in ext4_es_insert_extent()
>   ext4: drop ext4_es_delayed_clu()
>   ext4: use ext4_map_query_blocks() in ext4_map_blocks()
>   ext4: drop ext4_es_is_delonly()
>   ext4: drop all delonly descriptions
>   ext4: use reserved metadata blocks when splitting extent on endio
>   ext4: introduce seq counter for the extent status entry
>   ext4: add a new iomap aops for regular file's buffered IO path
>   ext4: implement buffered read iomap path
>   ext4: implement buffered write iomap path
>   ext4: implement writeback iomap path
>   ext4: implement mmap iomap path
>   ext4: implement zero_range iomap path
>   ext4: writeback partial blocks before zeroing out range
>   ext4: fall back to buffer_head path for defrag
>   ext4: partial enable iomap for regular file's buffered IO path
>   filemap: support disable large folios on active inode
>   ext4: enable large folio for regular file with iomap buffered IO path
>   ext4: don't mark IOMAP_F_DIRTY for buffer write
>   ext4: add mount option for buffered IO iomap path
> 


^ permalink raw reply	[relevance 0%]

* [RFC PATCH v4 23/34] ext4: implement buffered read iomap path
  2024-04-10 14:29  2% [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
                   ` (2 preceding siblings ...)
  2024-04-10 14:29  5% ` [RFC PATCH v4 17/34] ext4: use ext4_map_query_blocks() in ext4_map_blocks() Zhang Yi
@ 2024-04-10 14:29  5% ` Zhang Yi
  2024-04-11  1:12  0% ` [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
  2024-04-24  8:12  0% ` Zhang Yi
  5 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-04-10 14:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, david, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

From: Zhang Yi <yi.zhang@huawei.com>

Add ext4_iomap_buffered_io_begin() for the iomap read path; it calls
ext4_map_blocks() to query the block mapping status and then calls
ext4_set_iomap() to convert the ext4 map to an iomap.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4c1fed516d9e..20eb772f4f62 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3523,14 +3523,46 @@ const struct iomap_ops ext4_iomap_report_ops = {
 	.iomap_begin = ext4_iomap_begin_report,
 };
 
-static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
+static int ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
+				loff_t length, unsigned int iomap_flags,
+				struct iomap *iomap, struct iomap *srcmap)
 {
+	int ret;
+	struct ext4_map_blocks map;
+	u8 blkbits = inode->i_blkbits;
+
+	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
+		return -EIO;
+	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+		return -EINVAL;
+	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+		return -ERANGE;
+
+	/* Calculate the first and last logical blocks respectively. */
+	map.m_lblk = offset >> blkbits;
+	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+
+	ret = ext4_map_blocks(NULL, inode, &map, 0);
+	if (ret < 0)
+		return ret;
+
+	ext4_set_iomap(inode, iomap, &map, offset, length, iomap_flags);
 	return 0;
 }
 
-static void ext4_iomap_readahead(struct readahead_control *rac)
+const struct iomap_ops ext4_iomap_buffered_read_ops = {
+	.iomap_begin = ext4_iomap_buffered_io_begin,
+};
+
+static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
 {
+	return iomap_read_folio(folio, &ext4_iomap_buffered_read_ops);
+}
 
+static void ext4_iomap_readahead(struct readahead_control *rac)
+{
+	iomap_readahead(rac, &ext4_iomap_buffered_read_ops);
 }
 
 static int ext4_iomap_writepages(struct address_space *mapping,
-- 
2.39.2


^ permalink raw reply related	[relevance 5%]

* [RFC PATCH v4 17/34] ext4: use ext4_map_query_blocks() in ext4_map_blocks()
  2024-04-10 14:29  2% [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
  2024-04-10 14:29 11% ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
  2024-04-10 14:29  4% ` [RFC PATCH v4 15/34] ext4: update delalloc data reserve spcae in ext4_es_insert_extent() Zhang Yi
@ 2024-04-10 14:29  5% ` Zhang Yi
  2024-04-10 14:29  5% ` [RFC PATCH v4 23/34] ext4: implement buffered read iomap path Zhang Yi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-04-10 14:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, david, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

From: Zhang Yi <yi.zhang@huawei.com>

The block mapping query logic in ext4_map_blocks() is the same as in
ext4_map_query_blocks(), so use that helper directly.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 752fc0555dc0..64bdfa9e06b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -658,27 +658,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * file system block.
 	 */
 	down_read(&EXT4_I(inode)->i_data_sem);
-	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
-		retval = ext4_ext_map_blocks(handle, inode, map, 0);
-	} else {
-		retval = ext4_ind_map_blocks(handle, inode, map, 0);
-	}
-	if (retval > 0) {
-		unsigned int status;
-
-		if (unlikely(retval != map->m_len)) {
-			ext4_warning(inode->i_sb,
-				     "ES len assertion failed for inode "
-				     "%lu: retval %d != map->m_len %d",
-				     inode->i_ino, retval, map->m_len);
-			WARN_ON(1);
-		}
-
-		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
-				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status);
-	}
+	retval = ext4_map_query_blocks(handle, inode, map);
 	up_read((&EXT4_I(inode)->i_data_sem));
 
 found:
-- 
2.39.2


^ permalink raw reply related	[relevance 5%]

* [RFC PATCH v4 15/34] ext4: update delalloc data reserve spcae in ext4_es_insert_extent()
  2024-04-10 14:29  2% [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
  2024-04-10 14:29 11% ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
@ 2024-04-10 14:29  4% ` Zhang Yi
  2024-04-10 14:29  5% ` [RFC PATCH v4 17/34] ext4: use ext4_map_query_blocks() in ext4_map_blocks() Zhang Yi
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-04-10 14:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, david, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

From: Zhang Yi <yi.zhang@huawei.com>

Currently, we update the reserved data space for delalloc after
allocating new blocks in ext4_{ind|ext}_map_blocks(). If the bigalloc
feature is enabled, we also need to query the extents_status tree and
calculate the exact number of reserved clusters. This is complicated,
and ext4_es_insert_extent() appears to be a better place to do this
job: it keeps things simple because __es_remove_extent() can count the
delalloc blocks and __revise_pending() can return the newly added
pending count.

One special case to consider is quota claiming. When bigalloc is
enabled, if the delayed cluster allocation races with another
non-delayed allocation that doesn't overlap the delayed blocks (from
fallocate, filemap, DIO, ...), we cannot claim quota as usual because
the racer has already done it, so we also need to check the counted
reserved blocks.

  |               one cluster               |
  -------------------------------------------
  |                            | delayed es |
  -------------------------------------------
  ^           ^
  | fallocate | <- don't claim quota

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c        | 37 -------------------------------------
 fs/ext4/extents_status.c | 22 +++++++++++++++++++++-
 fs/ext4/indirect.c       |  7 -------
 3 files changed, 21 insertions(+), 45 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e57054bdc5fd..8bc8a519f745 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4355,43 +4355,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		goto out;
 	}
 
-	/*
-	 * Reduce the reserved cluster count to reflect successful deferred
-	 * allocation of delayed allocated clusters or direct allocation of
-	 * clusters discovered to be delayed allocated.  Once allocated, a
-	 * cluster is not included in the reserved count.
-	 */
-	if (test_opt(inode->i_sb, DELALLOC) && allocated_clusters) {
-		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
-			/*
-			 * When allocating delayed allocated clusters, simply
-			 * reduce the reserved cluster count and claim quota
-			 */
-			ext4_da_update_reserve_space(inode, allocated_clusters,
-							1);
-		} else {
-			ext4_lblk_t lblk, len;
-			unsigned int n;
-
-			/*
-			 * When allocating non-delayed allocated clusters
-			 * (from fallocate, filemap, DIO, or clusters
-			 * allocated when delalloc has been disabled by
-			 * ext4_nonda_switch), reduce the reserved cluster
-			 * count by the number of allocated clusters that
-			 * have previously been delayed allocated.  Quota
-			 * has been claimed by ext4_mb_new_blocks() above,
-			 * so release the quota reservations made for any
-			 * previously delayed allocated clusters.
-			 */
-			lblk = EXT4_LBLK_CMASK(sbi, map->m_lblk);
-			len = allocated_clusters << sbi->s_cluster_bits;
-			n = ext4_es_delayed_clu(inode, lblk, len);
-			if (n > 0)
-				ext4_da_update_reserve_space(inode, (int) n, 0);
-		}
-	}
-
 	/*
 	 * Cache the extent and update transaction to commit on fdatasync only
 	 * when it is _not_ an unwritten extent.
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 38ec2cc5ae3b..75227f151b8f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -856,6 +856,8 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 	int err1 = 0, err2 = 0, err3 = 0;
+	struct rsvd_info rinfo;
+	int resv_used, pending = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct extent_status *es1 = NULL;
 	struct extent_status *es2 = NULL;
@@ -894,7 +896,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 		pr = __alloc_pending(true);
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
-	err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
+	err1 = __es_remove_extent(inode, lblk, end, &rinfo, es1);
 	if (err1 != 0)
 		goto error;
 	/* Free preallocated extent if it didn't get used. */
@@ -924,9 +926,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			__free_pending(pr);
 			pr = NULL;
 		}
+		pending = err3;
 	}
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
+	/*
+	 * Reduce the reserved cluster count to reflect successful deferred
+	 * allocation of delayed allocated clusters or direct allocation of
+	 * clusters discovered to be delayed allocated.  Once allocated, a
+	 * cluster is not included in the reserved count.
+	 *
+	 * When bigalloc is enabled, allocating non-delayed allocated blocks
+	 * which belong to delayed allocated clusters (from fallocate, filemap,
+	 * DIO, or clusters allocated when delalloc has been disabled by
+	 * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(),
+	 * so release the quota reservations made for any previously delayed
+	 * allocated clusters.
+	 */
+	resv_used = rinfo.delonly_cluster + pending;
+	if (resv_used)
+		ext4_da_update_reserve_space(inode, resv_used,
+					     rinfo.delonly_block);
 	if (err1 || err2 || err3 < 0)
 		goto retry;
 
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index d8ca7f64f952..7404f0935c90 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -652,13 +652,6 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
 	ext4_update_inode_fsync_trans(handle, inode, 1);
 	count = ar.len;
 
-	/*
-	 * Update reserved blocks/metadata blocks after successful block
-	 * allocation which had been deferred till now.
-	 */
-	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
-		ext4_da_update_reserve_space(inode, count, 1);
-
 got_it:
 	map->m_flags |= EXT4_MAP_MAPPED;
 	map->m_pblk = le32_to_cpu(chain[depth-1].key);
-- 
2.39.2


^ permalink raw reply related	[relevance 4%]

* [PATCH v4 01/34] ext4: factor out a common helper to query extent map
  2024-04-10 14:29  2% [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
@ 2024-04-10 14:29 11% ` Zhang Yi
  2024-04-26 11:55  7%   ` Ritesh Harjani
  2024-04-10 14:29  4% ` [RFC PATCH v4 15/34] ext4: update delalloc data reserve spcae in ext4_es_insert_extent() Zhang Yi
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 200+ results
From: Zhang Yi @ 2024-04-10 14:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, david, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

From: Zhang Yi <yi.zhang@huawei.com>

Factor out a new common helper, ext4_map_query_blocks(), from
ext4_da_map_blocks(); it queries and returns the extent map status on
the inode's extent path. No logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 57 +++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 537803250ca9..6a41172c06e1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -453,6 +453,35 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
 }
 #endif /* ES_AGGRESSIVE_TEST */
 
+static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
+				 struct ext4_map_blocks *map)
+{
+	unsigned int status;
+	int retval;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		retval = ext4_ext_map_blocks(handle, inode, map, 0);
+	else
+		retval = ext4_ind_map_blocks(handle, inode, map, 0);
+
+	if (retval <= 0)
+		return retval;
+
+	if (unlikely(retval != map->m_len)) {
+		ext4_warning(inode->i_sb,
+			     "ES len assertion failed for inode "
+			     "%lu: retval %d != map->m_len %d",
+			     inode->i_ino, retval, map->m_len);
+		WARN_ON(1);
+	}
+
+	status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+			EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+	ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+			      map->m_pblk, status);
+	return retval;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -1744,33 +1773,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	down_read(&EXT4_I(inode)->i_data_sem);
 	if (ext4_has_inline_data(inode))
 		retval = 0;
-	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
 	else
-		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
-	if (retval < 0) {
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
-	if (retval > 0) {
-		unsigned int status;
-
-		if (unlikely(retval != map->m_len)) {
-			ext4_warning(inode->i_sb,
-				     "ES len assertion failed for inode "
-				     "%lu: retval %d != map->m_len %d",
-				     inode->i_ino, retval, map->m_len);
-			WARN_ON(1);
-		}
-
-		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
-				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status);
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
+		retval = ext4_map_query_blocks(NULL, inode, map);
 	up_read(&EXT4_I(inode)->i_data_sem);
+	if (retval)
+		return retval;
 
 add_delayed:
 	down_write(&EXT4_I(inode)->i_data_sem);
-- 
2.39.2


^ permalink raw reply related	[relevance 11%]

* [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio
@ 2024-04-10 14:29  2% Zhang Yi
  2024-04-10 14:29 11% ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
                   ` (5 more replies)
  0 siblings, 6 replies; 200+ results
From: Zhang Yi @ 2024-04-10 14:29 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, david, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

Hello!

This is the fourth version of the RFC patch series that converts ext4
regular file's buffered IO path to iomap and enables large folios. I've
rebased it on 6.9-rc3; it also **depends on my xfs/iomap fix series**,
which has been reviewed but not merged yet [1]. Compared to the third
version, this iteration fixes an issue discovered in the current ext4
code and contains two other main changes: 1) add bigalloc support and
2) simplify the updating logic for reserved delalloc data blocks. Both
changes could be sent out as preliminary patch series. Besides these,
the rest are small code cleanups, performance optimizations, and commit
log improvements. Please take a look at this series; any comments are
welcome.

This series supports ext4 with the default features and mount options
(bigalloc is also supported). It doesn't support non-extent (ext3),
inline_data, dax, fs_verity, fs_crypt, or data=journal mode; ext4 falls
back to the buffer_head path automatically if you enable those features
or options. Although it has many limitations now, it satisfies the
requirements of most common cases and brings a significant performance
benefit for large IOs.

The iomap path is somewhat simpler than the buffer_head path; please
note that there are 4 major differences:
1. Always allocate unwritten extents for new blocks; this means the
   behavior is not controlled by the dioread_nolock mount option.
2. Since 1, there is no risk of exposing stale data during an append
   write, so we don't need to write back data before metadata; it's
   time to drop 'data=ordered' mode automatically.
3. Since 2, we don't need to reserve journal credits or use a reserved
   handle for the extent status conversion during writeback.
4. We can postpone updating i_disksize to the endio, which avoids
   exposing zeroed data during an append write followed by an
   instantaneous power failure.

Series details:
Patches 1-9: this is the part 2 preparation series. It first fixes a
problem, then makes the ext4_insert_delayed_block() call path support
inserting multiple delalloc blocks (bigalloc is also supported), and
finally makes ext4_da_map_blocks() buffer_head unaware. I've sent it
out separately [2] and hope it can be merged first.

Patches 10-19: these are the part 3 preparatory changes (picked out
from my metadata reservation series [3]; they are not strong
dependencies, but I'd suggest merging them before the iomap
conversion). These patches move ext4_da_update_reserve_space() into
ext4_es_insert_extent(), and always set EXT4_GET_BLOCKS_DELALLOC_RESERVE
when allocating delalloc blocks, whether from the delayed allocation or
the non-delayed allocation (fallocate) path, which makes delalloc
extents always delonly. This makes delalloc reservation simpler and
cleaner than before.

Patches 20-34: these patches are the main implementation of the
buffered IO iomap conversion. They first introduce a sequence counter
for the extent status tree, then add new iomap aops for read, write and
mmap that replace the current buffer_head path. Finally, they enable
the iomap path, except for inline data, non-extent, dax, fs_verity,
fs_crypt, defrag and data=journal mode, if the user specifies the
"buffered_iomap" mount option, and also enable large folios. Please
look at the following patches for details.

About Tests:
 - Passes kvm-xfstests in auto mode; I keep running stress tests and
   fault injection tests.
 - Performance tests are below (tested on my version 3 series;
   theoretically there won't be much difference in this version).

   Fio tests with psync on my machine with an Intel Xeon Gold 6240 CPU,
   400GB of system RAM, a 200GB ramdisk and a 1TB NVMe SSD.

   == buffer read ==

                  buffer head        iomap + large folio
   type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)
   ----------------------------------------------------
   hole     4K    565k    2206       811k    3167
   hole     64K   45.1k   2820       78.1k   4879
   hole     1M    2744    2744       4890    4891
   ramdisk  4K    436k    1703       554k    2163
   ramdisk  64K   29.6k   1848       44.0k   2747
   ramdisk  1M    1994    1995       2809    2809
   nvme     4K    306k    1196       324k    1267
   nvme     64K   19.3k   1208       24.3k   1517
   nvme     1M    1694    1694       2256    2256

   == buffer write ==

                                        buffer head  iomap + large folio
   type   Overwrite Sync Writeback bs   IOPS   BW    IOPS   BW
   ------------------------------------------------------------
   cache    N       N    N         4K   395k   1544  415k   1621
   cache    N       N    N         64K  30.8k  1928  80.1k  5005
   cache    N       N    N         1M   1963   1963  5641   5642
   cache    Y       N    N         4K   423k   1652  443k   1730
   cache    Y       N    N         64K  33.0k  2063  80.8k  5051
   cache    Y       N    N         1M   2103   2103  5588   5589
   ramdisk  N       N    Y         4K   362k   1416  307k   1198
   ramdisk  N       N    Y         64K  22.4k  1399  64.8k  4050
   ramdisk  N       N    Y         1M   1670   1670  4559   4560
   ramdisk  N       Y    N         4K   9830   38.4  13.5k  52.8
   ramdisk  N       Y    N         64K  5834   365   10.1k  629
   ramdisk  N       Y    N         1M   1011   1011  2064   2064
   ramdisk  Y       N    Y         4K   397k   1550  409k   1598
   ramdisk  Y       N    Y         64K  29.2k  1827  73.6k  4597
   ramdisk  Y       N    Y         1M   1837   1837  4985   4985
   ramdisk  Y       Y    N         4K   173k   675   182k   710
   ramdisk  Y       Y    N         64K  17.7k  1109  33.7k  2105
   ramdisk  Y       Y    N         1M   1128   1129  1790   1791
   nvme     N       N    Y         4K   298k   1164  290k   1134
   nvme     N       N    Y         64K  21.5k  1343  57.4k  3590
   nvme     N       N    Y         1M   1308   1308  3664   3664
   nvme     N       Y    N         4K   10.7k  41.8  12.0k  46.9
   nvme     N       Y    N         64K  5962   373   8598   537
   nvme     N       Y    N         1M   676    677   1417   1418
   nvme     Y       N    Y         4K   366k   1430  373k   1456
   nvme     Y       N    Y         64K  26.7k  1670  56.8k  3547
   nvme     Y       N    Y         1M   1745   1746  3586   3586
   nvme     Y       Y    N         4K   59.0k  230   61.2k  239
   nvme     Y       Y    N         64K  13.0k  813   21.0k  1311
   nvme     Y       Y    N         1M   683    683   1368   1369
 
TODO
 - Keep on doing stress tests and fixing.
 - Reserve enough space for delalloc metadata blocks and try to drop
   ext4_nonda_switch().
 - First support defrag, then other currently unsupported features
   and mount options.

Changes since v3:
 - Drop the part 1 preparatory patches, which have been merged [4].
 - Drop the two iomap patches since I've submitted separately [1].
 - Fix an incorrect reserved delalloc blocks count and incorrect extent
   status cache issue found on current ext4 code.
 - Pick out the part 2 preparatory patch series [2]; it makes the
   ext4_insert_delayed_block() call path support inserting multiple
   delalloc blocks (also supporting bigalloc) and makes
   ext4_da_map_blocks() buffer_head unaware.
 - Adjust and simplify the reserved delalloc blocks updating logic,
   preparing for reserving metadata blocks for delalloc.
 - Drop datasync dirty check in ext4_set_iomap() for buffered
   read/write, improves the concurrent performance on small I/Os.
 - Avoid always holding the invalidate_lock in page_cache_ra_order();
   add a lockless check.
 - Disable the iomap path by default since it's new and experimental;
   add a mount option, "buffered_iomap", to enable it.
 - Some other minor fixes and change log improvements.
Changes since v2:
 - Update patch 1-6 to v3.
 - iomap_zero and iomap_unshare don't need to update i_size or call
   iomap_write_failed(); introduce a new helper, iomap_write_end_simple(),
   to avoid doing that.
 - Factor out ext4_[ext|ind]_map_blocks() parts from ext4_map_blocks(),
   introduce a new helper ext4_iomap_map_one_extent() to allocate
   delalloc blocks in writeback, which is always under i_data_sem in
   write mode. This is done to prevent the delalloc extents being written
   back from becoming stale if raced by truncate.
 - Add a lock detection in mapping_clear_large_folios().
Changes since v1:
 - Introduce a seq count for iomap buffered write and writeback to
   protect against races with extent changes, e.g. truncate, mwrite.
 - Always allocate unwritten extents for new blocks, drop dioread_lock
   mode, and make no distinctions between dioread_lock and
   dioread_nolock.
 - Don't add dirty data range to jinode, drop data=ordered mode, and
   make no distinctions between data=ordered and data=writeback mode.
 - Postpone updating i_disksize to endio.
 - Allow splitting extents and use reserved space in endio.
 - Instead of reimplementing a new delayed mapping helper
   ext4_iomap_da_map_blocks() for buffer write, try to reuse
   ext4_da_map_blocks().
 - Add support for disabling large folio on active inodes.
 - Support online defragmentation, make file fall back to buffer_head
   and disable large folio in ext4_move_extents().
 - Move ext4_nonda_switch() in advance to prevent deadlock in mwrite.
 - Add dirty_len and pos trace info to trace_iomap_writepage_map().
 - Update patch 1-6 to v2.

[1] https://lore.kernel.org/linux-xfs/20240320110548.2200662-1-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-ext4/20240410034203.2188357-1-yi.zhang@huaweicloud.com/
[3] https://lore.kernel.org/linux-ext4/20230824092619.1327976-1-yi.zhang@huaweicloud.com/
[4] https://lore.kernel.org/linux-ext4/20240105033018.1665752-1-yi.zhang@huaweicloud.com/

Thanks,
Yi.

---
v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/

Zhang Yi (34):
  ext4: factor out a common helper to query extent map
  ext4: check the extent status again before inserting delalloc block
  ext4: trim delalloc extent
  ext4: drop iblock parameter
  ext4: make ext4_es_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_reserve_space() reserve multi-clusters
  ext4: factor out check for whether a cluster is allocated
  ext4: make ext4_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_map_blocks() buffer_head unaware
  ext4: factor out ext4_map_create_blocks() to allocate new blocks
  ext4: optimize the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag set
  ext4: don't set EXTENT_STATUS_DELAYED on allocated blocks
  ext4: let __revise_pending() return newly inserted pendings
  ext4: count removed reserved blocks for delalloc only extent entry
  ext4: update delalloc data reserve spcae in ext4_es_insert_extent()
  ext4: drop ext4_es_delayed_clu()
  ext4: use ext4_map_query_blocks() in ext4_map_blocks()
  ext4: drop ext4_es_is_delonly()
  ext4: drop all delonly descriptions
  ext4: use reserved metadata blocks when splitting extent on endio
  ext4: introduce seq counter for the extent status entry
  ext4: add a new iomap aops for regular file's buffered IO path
  ext4: implement buffered read iomap path
  ext4: implement buffered write iomap path
  ext4: implement writeback iomap path
  ext4: implement mmap iomap path
  ext4: implement zero_range iomap path
  ext4: writeback partial blocks before zeroing out range
  ext4: fall back to buffer_head path for defrag
  ext4: partial enable iomap for regular file's buffered IO path
  filemap: support disable large folios on active inode
  ext4: enable large folio for regular file with iomap buffered IO path
  ext4: don't mark IOMAP_F_DIRTY for buffer write
  ext4: add mount option for buffered IO iomap path

-- 
2.39.2


^ permalink raw reply	[relevance 2%]

* [RFC PATCH v4 23/34] ext4: implement buffered read iomap path
  2024-04-10 13:27  2% [RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
                   ` (2 preceding siblings ...)
  2024-04-10 13:28  5% ` [RFC PATCH v4 17/34] ext4: use ext4_map_query_blocks() in ext4_map_blocks() Zhang Yi
@ 2024-04-10 13:28  5% ` Zhang Yi
  3 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-04-10 13:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

From: Zhang Yi <yi.zhang@huawei.com>

Add ext4_iomap_buffered_io_begin() for the iomap read path; it calls
ext4_map_blocks() to query the block mapping status and then calls
ext4_set_iomap() to convert the ext4 map to an iomap.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4c1fed516d9e..20eb772f4f62 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3523,14 +3523,46 @@ const struct iomap_ops ext4_iomap_report_ops = {
 	.iomap_begin = ext4_iomap_begin_report,
 };
 
-static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
+static int ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
+				loff_t length, unsigned int iomap_flags,
+				struct iomap *iomap, struct iomap *srcmap)
 {
+	int ret;
+	struct ext4_map_blocks map;
+	u8 blkbits = inode->i_blkbits;
+
+	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
+		return -EIO;
+	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+		return -EINVAL;
+	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+		return -ERANGE;
+
+	/* Calculate the first and last logical blocks respectively. */
+	map.m_lblk = offset >> blkbits;
+	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+
+	ret = ext4_map_blocks(NULL, inode, &map, 0);
+	if (ret < 0)
+		return ret;
+
+	ext4_set_iomap(inode, iomap, &map, offset, length, iomap_flags);
 	return 0;
 }
 
-static void ext4_iomap_readahead(struct readahead_control *rac)
+const struct iomap_ops ext4_iomap_buffered_read_ops = {
+	.iomap_begin = ext4_iomap_buffered_io_begin,
+};
+
+static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
 {
+	return iomap_read_folio(folio, &ext4_iomap_buffered_read_ops);
+}
 
+static void ext4_iomap_readahead(struct readahead_control *rac)
+{
+	iomap_readahead(rac, &ext4_iomap_buffered_read_ops);
 }
 
 static int ext4_iomap_writepages(struct address_space *mapping,
-- 
2.39.2


^ permalink raw reply related	[relevance 5%]

* [PATCH v4 01/34] ext4: factor out a common helper to query extent map
  2024-04-10 13:27  2% [RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
@ 2024-04-10 13:27 11% ` Zhang Yi
  2024-04-10 13:27  4% ` [RFC PATCH v4 15/34] ext4: update delalloc data reserve space in ext4_es_insert_extent() Zhang Yi
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-04-10 13:27 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

From: Zhang Yi <yi.zhang@huawei.com>

Factor out a new common helper, ext4_map_query_blocks(), from
ext4_da_map_blocks(). It queries and returns the extent map status
on the inode's extent path; no logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 57 +++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 537803250ca9..6a41172c06e1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -453,6 +453,35 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
 }
 #endif /* ES_AGGRESSIVE_TEST */
 
+static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
+				 struct ext4_map_blocks *map)
+{
+	unsigned int status;
+	int retval;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		retval = ext4_ext_map_blocks(handle, inode, map, 0);
+	else
+		retval = ext4_ind_map_blocks(handle, inode, map, 0);
+
+	if (retval <= 0)
+		return retval;
+
+	if (unlikely(retval != map->m_len)) {
+		ext4_warning(inode->i_sb,
+			     "ES len assertion failed for inode "
+			     "%lu: retval %d != map->m_len %d",
+			     inode->i_ino, retval, map->m_len);
+		WARN_ON(1);
+	}
+
+	status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+			EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+	ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+			      map->m_pblk, status);
+	return retval;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -1744,33 +1773,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	down_read(&EXT4_I(inode)->i_data_sem);
 	if (ext4_has_inline_data(inode))
 		retval = 0;
-	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
 	else
-		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
-	if (retval < 0) {
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
-	if (retval > 0) {
-		unsigned int status;
-
-		if (unlikely(retval != map->m_len)) {
-			ext4_warning(inode->i_sb,
-				     "ES len assertion failed for inode "
-				     "%lu: retval %d != map->m_len %d",
-				     inode->i_ino, retval, map->m_len);
-			WARN_ON(1);
-		}
-
-		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
-				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status);
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
+		retval = ext4_map_query_blocks(NULL, inode, map);
 	up_read(&EXT4_I(inode)->i_data_sem);
+	if (retval)
+		return retval;
 
 add_delayed:
 	down_write(&EXT4_I(inode)->i_data_sem);
-- 
2.39.2


^ permalink raw reply related	[relevance 11%]

* [RFC PATCH v4 15/34] ext4: update delalloc data reserve space in ext4_es_insert_extent()
  2024-04-10 13:27  2% [RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
  2024-04-10 13:27 11% ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
@ 2024-04-10 13:27  4% ` Zhang Yi
  2024-04-10 13:28  5% ` [RFC PATCH v4 17/34] ext4: use ext4_map_query_blocks() in ext4_map_blocks() Zhang Yi
  2024-04-10 13:28  5% ` [RFC PATCH v4 23/34] ext4: implement buffered read iomap path Zhang Yi
  3 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-04-10 13:27 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

From: Zhang Yi <yi.zhang@huawei.com>

Currently we update the reserved data space for delalloc after
allocating new blocks in ext4_{ind|ext}_map_blocks(). If the bigalloc
feature is enabled, we also need to query the extents_status tree and
calculate the exact number of reserved clusters. This is complicated,
and ext4_es_insert_extent() appears to be a better place to do this
job: it simplifies things because __es_remove_extent() can count
delalloc blocks and __revise_pending() can return the newly added
pending count.

One special case to be aware of is quota claiming. When bigalloc is
enabled, if the delayed cluster allocation has been raced by another
non-delayed allocation that doesn't overlap the delayed blocks (from
fallocate, filemap, DIO...), we cannot claim quota as usual because the
racer has already done it, so we also need to check the counted
reserved blocks.

  |               one cluster               |
  -------------------------------------------
  |                            | delayed es |
  -------------------------------------------
  ^           ^
  | fallocate | <- don't claim quota

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c        | 37 -------------------------------------
 fs/ext4/extents_status.c | 22 +++++++++++++++++++++-
 fs/ext4/indirect.c       |  7 -------
 3 files changed, 21 insertions(+), 45 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e57054bdc5fd..8bc8a519f745 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4355,43 +4355,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		goto out;
 	}
 
-	/*
-	 * Reduce the reserved cluster count to reflect successful deferred
-	 * allocation of delayed allocated clusters or direct allocation of
-	 * clusters discovered to be delayed allocated.  Once allocated, a
-	 * cluster is not included in the reserved count.
-	 */
-	if (test_opt(inode->i_sb, DELALLOC) && allocated_clusters) {
-		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
-			/*
-			 * When allocating delayed allocated clusters, simply
-			 * reduce the reserved cluster count and claim quota
-			 */
-			ext4_da_update_reserve_space(inode, allocated_clusters,
-							1);
-		} else {
-			ext4_lblk_t lblk, len;
-			unsigned int n;
-
-			/*
-			 * When allocating non-delayed allocated clusters
-			 * (from fallocate, filemap, DIO, or clusters
-			 * allocated when delalloc has been disabled by
-			 * ext4_nonda_switch), reduce the reserved cluster
-			 * count by the number of allocated clusters that
-			 * have previously been delayed allocated.  Quota
-			 * has been claimed by ext4_mb_new_blocks() above,
-			 * so release the quota reservations made for any
-			 * previously delayed allocated clusters.
-			 */
-			lblk = EXT4_LBLK_CMASK(sbi, map->m_lblk);
-			len = allocated_clusters << sbi->s_cluster_bits;
-			n = ext4_es_delayed_clu(inode, lblk, len);
-			if (n > 0)
-				ext4_da_update_reserve_space(inode, (int) n, 0);
-		}
-	}
-
 	/*
 	 * Cache the extent and update transaction to commit on fdatasync only
 	 * when it is _not_ an unwritten extent.
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 38ec2cc5ae3b..75227f151b8f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -856,6 +856,8 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 	int err1 = 0, err2 = 0, err3 = 0;
+	struct rsvd_info rinfo;
+	int resv_used, pending = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct extent_status *es1 = NULL;
 	struct extent_status *es2 = NULL;
@@ -894,7 +896,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 		pr = __alloc_pending(true);
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
-	err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
+	err1 = __es_remove_extent(inode, lblk, end, &rinfo, es1);
 	if (err1 != 0)
 		goto error;
 	/* Free preallocated extent if it didn't get used. */
@@ -924,9 +926,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			__free_pending(pr);
 			pr = NULL;
 		}
+		pending = err3;
 	}
 error:
 	write_unlock(&EXT4_I(inode)->i_es_lock);
+	/*
+	 * Reduce the reserved cluster count to reflect successful deferred
+	 * allocation of delayed allocated clusters or direct allocation of
+	 * clusters discovered to be delayed allocated.  Once allocated, a
+	 * cluster is not included in the reserved count.
+	 *
+	 * When bigalloc is enabled, allocating non-delayed allocated blocks
+	 * which belong to delayed allocated clusters (from fallocate, filemap,
+	 * DIO, or clusters allocated when delalloc has been disabled by
+	 * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(),
+	 * so release the quota reservations made for any previously delayed
+	 * allocated clusters.
+	 */
+	resv_used = rinfo.delonly_cluster + pending;
+	if (resv_used)
+		ext4_da_update_reserve_space(inode, resv_used,
+					     rinfo.delonly_block);
 	if (err1 || err2 || err3 < 0)
 		goto retry;
 
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index d8ca7f64f952..7404f0935c90 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -652,13 +652,6 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
 	ext4_update_inode_fsync_trans(handle, inode, 1);
 	count = ar.len;
 
-	/*
-	 * Update reserved blocks/metadata blocks after successful block
-	 * allocation which had been deferred till now.
-	 */
-	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
-		ext4_da_update_reserve_space(inode, count, 1);
-
 got_it:
 	map->m_flags |= EXT4_MAP_MAPPED;
 	map->m_pblk = le32_to_cpu(chain[depth-1].key);
-- 
2.39.2


^ permalink raw reply related	[relevance 4%]

* [RFC PATCH v4 17/34] ext4: use ext4_map_query_blocks() in ext4_map_blocks()
  2024-04-10 13:27  2% [RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
  2024-04-10 13:27 11% ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
  2024-04-10 13:27  4% ` [RFC PATCH v4 15/34] ext4: update delalloc data reserve space in ext4_es_insert_extent() Zhang Yi
@ 2024-04-10 13:28  5% ` Zhang Yi
  2024-04-10 13:28  5% ` [RFC PATCH v4 23/34] ext4: implement buffered read iomap path Zhang Yi
  3 siblings, 0 replies; 200+ results
From: Zhang Yi @ 2024-04-10 13:28 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

From: Zhang Yi <yi.zhang@huawei.com>

The block map querying logic in ext4_map_blocks() is the same as in
ext4_map_query_blocks(), so use the helper directly.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 752fc0555dc0..64bdfa9e06b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -658,27 +658,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	 * file system block.
 	 */
 	down_read(&EXT4_I(inode)->i_data_sem);
-	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
-		retval = ext4_ext_map_blocks(handle, inode, map, 0);
-	} else {
-		retval = ext4_ind_map_blocks(handle, inode, map, 0);
-	}
-	if (retval > 0) {
-		unsigned int status;
-
-		if (unlikely(retval != map->m_len)) {
-			ext4_warning(inode->i_sb,
-				     "ES len assertion failed for inode "
-				     "%lu: retval %d != map->m_len %d",
-				     inode->i_ino, retval, map->m_len);
-			WARN_ON(1);
-		}
-
-		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
-				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status);
-	}
+	retval = ext4_map_query_blocks(handle, inode, map);
 	up_read((&EXT4_I(inode)->i_data_sem));
 
 found:
-- 
2.39.2


^ permalink raw reply related	[relevance 5%]

* [RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio
@ 2024-04-10 13:27  2% Zhang Yi
  2024-04-10 13:27 11% ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
                   ` (3 more replies)
  0 siblings, 4 replies; 200+ results
From: Zhang Yi @ 2024-04-10 13:27 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, linux-mm, linux-kernel, tytso, adilger.kernel,
	jack, ritesh.list, hch, djwong, willy, zokeefe, yi.zhang,
	yi.zhang, chengzhihao1, yukuai3, wangkefeng.wang

Hello!

This is the fourth version of the RFC patch series that converts the
ext4 regular file's buffered IO path to iomap and enables large folios.
I've rebased it on 6.9-rc3; it also **depends on my xfs/iomap fix
series**, which has been reviewed but not merged yet[1]. Compared to
the third version, this iteration fixes an issue discovered in the
current ext4 code and contains two more main changes: 1) add bigalloc
support and 2) simplify the updating logic of reserved delalloc data
blocks; both changes could be sent out as preliminary patch series.
Besides these, the rest are small code cleanups, performance
optimizations, and commit log improvements. Please take a look at this
series; any comments are welcome.

This series supports ext4 with the default features and mount
options (bigalloc is also supported). It doesn't support non-extent
(ext3), inline_data, dax, fs_verity, fs_crypt, or data=journal mode;
ext4 falls back to the buffer_head path automatically if you enable
those features or options. Although it has many limitations now, it can
satisfy the requirements of the most common cases and bring a
significant performance benefit for large IOs.

The iomap path would be simpler than the buffer_head path to some
extent. Please note that there are 4 major differences:
1. Always allocate unwritten extents for new blocks; this means it's
   not controlled by the dioread_nolock mount option.
2. Because of 1, there is no risk of exposing stale data during append
   writes, so we don't need to write back data before metadata, and we
   can drop 'data=ordered' mode automatically.
3. Because of 2, we don't need to reserve journal credits and use a
   reserved handle for the extent status conversion during writeback.
4. We can postpone updating i_disksize to the endio, which avoids
   exposing zeroed data during append writes and after an instantaneous
   power failure.

Series details:
Patches 1-9: this is the part 2 preparation series. It first fixes a
problem, then makes the ext4_insert_delayed_block() call path support
inserting multiple delalloc blocks (bigalloc is also supported), and
finally makes ext4_da_map_blocks() buffer_head unaware. I've sent it
out separately[2] and hope it can be merged first.

Patches 10-19: these are the part 3 preparatory changes (picked out
from my metadata reservation series[3]; they are not strong dependency
patches, but I'd suggest merging them before the iomap conversion).
These patches move ext4_da_update_reserve_space() to
ext4_es_insert_extent() and always set EXT4_GET_BLOCKS_DELALLOC_RESERVE
when allocating delalloc blocks, no matter whether it's from the
delayed allocation or non-delayed allocation (fallocate) path, which
makes delalloc extents always delonly. This makes delalloc reservation
simpler and cleaner than before.

Patches 20-34: these are the main implementation of the buffered IO
iomap conversion. They first introduce a sequence counter for the
extent status tree, then add new iomap aops for read, write, and mmap,
replacing the current buffer_head path. Finally, they enable the iomap
path except for inline data, non-extent, dax, fs_verity, fs_crypt,
defrag, and data=journal mode, if the user specifies the
"buffered_iomap" mount option, and also enable large folios. Please
look at the following patches for details.

About Tests:
 - Passes kvm-xfstests in auto mode, and keeps running stress tests and
   fault injection tests.
 - Performance tests below (tested on my version 3 series;
   theoretically there won't be much difference in this version).

   Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU
   with 400GB system ram, 200GB ramdisk and 1TB nvme ssd disk.

   == buffer read ==

                  buffer head        iomap + large folio
   type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)
   ----------------------------------------------------
   hole     4K    565k    2206       811k    3167
   hole     64K   45.1k   2820       78.1k   4879
   hole     1M    2744    2744       4890    4891
   ramdisk  4K    436k    1703       554k    2163
   ramdisk  64K   29.6k   1848       44.0k   2747
   ramdisk  1M    1994    1995       2809    2809
   nvme     4K    306k    1196       324k    1267
   nvme     64K   19.3k   1208       24.3k   1517
   nvme     1M    1694    1694       2256    2256

   == buffer write ==

                                        buffer head  iomap + large folio
   type   Overwrite Sync Writeback bs   IOPS   BW    IOPS   BW
   ------------------------------------------------------------
   cache    N       N    N         4K   395k   1544  415k   1621
   cache    N       N    N         64K  30.8k  1928  80.1k  5005
   cache    N       N    N         1M   1963   1963  5641   5642
   cache    Y       N    N         4K   423k   1652  443k   1730
   cache    Y       N    N         64K  33.0k  2063  80.8k  5051
   cache    Y       N    N         1M   2103   2103  5588   5589
   ramdisk  N       N    Y         4K   362k   1416  307k   1198
   ramdisk  N       N    Y         64K  22.4k  1399  64.8k  4050
   ramdisk  N       N    Y         1M   1670   1670  4559   4560
   ramdisk  N       Y    N         4K   9830   38.4  13.5k  52.8
   ramdisk  N       Y    N         64K  5834   365   10.1k  629
   ramdisk  N       Y    N         1M   1011   1011  2064   2064
   ramdisk  Y       N    Y         4K   397k   1550  409k   1598
   ramdisk  Y       N    Y         64K  29.2k  1827  73.6k  4597
   ramdisk  Y       N    Y         1M   1837   1837  4985   4985
   ramdisk  Y       Y    N         4K   173k   675   182k   710
   ramdisk  Y       Y    N         64K  17.7k  1109  33.7k  2105
   ramdisk  Y       Y    N         1M   1128   1129  1790   1791
   nvme     N       N    Y         4K   298k   1164  290k   1134
   nvme     N       N    Y         64K  21.5k  1343  57.4k  3590
   nvme     N       N    Y         1M   1308   1308  3664   3664
   nvme     N       Y    N         4K   10.7k  41.8  12.0k  46.9
   nvme     N       Y    N         64K  5962   373   8598   537
   nvme     N       Y    N         1M   676    677   1417   1418
   nvme     Y       N    Y         4K   366k   1430  373k   1456
   nvme     Y       N    Y         64K  26.7k  1670  56.8k  3547
   nvme     Y       N    Y         1M   1745   1746  3586   3586
   nvme     Y       Y    N         4K   59.0k  230   61.2k  239
   nvme     Y       Y    N         64K  13.0k  813   21.0k  1311
   nvme     Y       Y    N         1M   683    683   1368   1369
 
TODO
 - Keep on doing stress tests and fixing.
 - Reserve enough space for delalloc metadata blocks and try to drop
   ext4_nonda_switch().
 - First support defrag, and then support more of the currently
   unsupported features and mount options.

Changes since v3:
 - Drop the part 1 preparatory patches which have been merged [4].
 - Drop the two iomap patches since I've submitted separately [1].
 - Fix an incorrect reserved delalloc blocks count and incorrect extent
   status cache issue found on current ext4 code.
 - Pick out the part 2 preparatory patch series [2]; it makes the
   ext4_insert_delayed_block() call path support inserting multiple
   delalloc blocks (also supporting bigalloc) and makes
   ext4_da_map_blocks() buffer_head unaware.
 - Adjust and simplify the reserved delalloc blocks updating logic,
   preparing for reserving metadata blocks for delalloc.
 - Drop datasync dirty check in ext4_set_iomap() for buffered
   read/write, improves the concurrent performance on small I/Os.
 - Prevent always hold invalid_lock in page_cache_ra_order(), add
   lockless check.
 - Disable the iomap path by default since it's new and experimental;
   add a mount option "buffered_iomap" to enable it.
 - Some other minor fixes and change log improvements.
Changes since v2:
 - Update patch 1-6 to v3.
 - iomap_zero and iomap_unshare don't need to update i_size and call
   iomap_write_failed(), introduce a new helper iomap_write_end_simple()
   to avoid doing that.
 - Factor out ext4_[ext|ind]_map_blocks() parts from ext4_map_blocks(),
   introduce a new helper ext4_iomap_map_one_extent() to allocate
   delalloc blocks in writeback, which is always under i_data_sem in
   write mode. This is done to prevent the writing back delalloc
   extents become stale if it raced by truncate.
 - Add a lock detection in mapping_clear_large_folios().
Changes since v1:
 - Introduce seq count for iomap buffered write and writeback to protect
   races from extents changes, e.g. truncate, mwrite.
 - Always allocate unwritten extents for new blocks, drop dioread_lock
   mode, and make no distinctions between dioread_lock and
   dioread_nolock.
 - Don't add dirty data range to jinode, drop data=ordered mode, and
   make no distinctions between data=ordered and data=writeback mode.
 - Postpone updating i_disksize to endio.
 - Allow splitting extents and use reserved space in endio.
 - Instead of reimplement a new delayed mapping helper
   ext4_iomap_da_map_blocks() for buffer write, try to reuse
   ext4_da_map_blocks().
 - Add support for disabling large folio on active inodes.
 - Support online defragmentation, make file fall back to buffer_head
   and disable large folio in ext4_move_extents().
 - Move ext4_nonda_switch() in advance to prevent deadlock in mwrite.
 - Add dirty_len and pos trace info to trace_iomap_writepage_map().
 - Update patch 1-6 to v2.

[1] https://lore.kernel.org/linux-xfs/20240320110548.2200662-1-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-ext4/20240410034203.2188357-1-yi.zhang@huaweicloud.com/
[3] https://lore.kernel.org/linux-ext4/20230824092619.1327976-1-yi.zhang@huaweicloud.com/
[4] https://lore.kernel.org/linux-ext4/20240105033018.1665752-1-yi.zhang@huaweicloud.com/

Thanks,
Yi.

---
v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/

Zhang Yi (34):
  ext4: factor out a common helper to query extent map
  ext4: check the extent status again before inserting delalloc block
  ext4: trim delalloc extent
  ext4: drop iblock parameter
  ext4: make ext4_es_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_reserve_space() reserve multi-clusters
  ext4: factor out check for whether a cluster is allocated
  ext4: make ext4_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_map_blocks() buffer_head unaware
  ext4: factor out ext4_map_create_blocks() to allocate new blocks
  ext4: optimize the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag set
  ext4: don't set EXTENT_STATUS_DELAYED on allocated blocks
  ext4: let __revise_pending() return newly inserted pendings
  ext4: count removed reserved blocks for delalloc only extent entry
  ext4: update delalloc data reserve space in ext4_es_insert_extent()
  ext4: drop ext4_es_delayed_clu()
  ext4: use ext4_map_query_blocks() in ext4_map_blocks()
  ext4: drop ext4_es_is_delonly()
  ext4: drop all delonly descriptions
  ext4: use reserved metadata blocks when splitting extent on endio
  ext4: introduce seq counter for the extent status entry
  ext4: add a new iomap aops for regular file's buffered IO path
  ext4: implement buffered read iomap path
  ext4: implement buffered write iomap path
  ext4: implement writeback iomap path
  ext4: implement mmap iomap path
  ext4: implement zero_range iomap path
  ext4: writeback partial blocks before zeroing out range
  ext4: fall back to buffer_head path for defrag
  ext4: partial enable iomap for regular file's buffered IO path
  filemap: support disable large folios on active inode
  ext4: enable large folio for regular file with iomap buffered IO path
  ext4: don't mark IOMAP_F_DIRTY for buffer write
  ext4: add mount option for buffered IO iomap path

-- 
2.39.2


^ permalink raw reply	[relevance 2%]

* [PATCH v2 1/9] ext4: factor out a common helper to query extent map
  2024-04-10  3:41  5% [PATCH v2 0/9] ext4: support adding multi-delalloc blocks Zhang Yi
@ 2024-04-10  3:41 11% ` Zhang Yi
  2024-04-24 20:05  7%   ` Jan Kara
  0 siblings, 1 reply; 200+ results
From: Zhang Yi @ 2024-04-10  3:41 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, tytso, adilger.kernel, jack, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Factor out a new common helper, ext4_map_query_blocks(), from
ext4_da_map_blocks(). It queries and returns the extent map status
on the inode's extent path; no logic changes.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 57 +++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 537803250ca9..6a41172c06e1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -453,6 +453,35 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
 }
 #endif /* ES_AGGRESSIVE_TEST */
 
+static int ext4_map_query_blocks(handle_t *handle, struct inode *inode,
+				 struct ext4_map_blocks *map)
+{
+	unsigned int status;
+	int retval;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		retval = ext4_ext_map_blocks(handle, inode, map, 0);
+	else
+		retval = ext4_ind_map_blocks(handle, inode, map, 0);
+
+	if (retval <= 0)
+		return retval;
+
+	if (unlikely(retval != map->m_len)) {
+		ext4_warning(inode->i_sb,
+			     "ES len assertion failed for inode "
+			     "%lu: retval %d != map->m_len %d",
+			     inode->i_ino, retval, map->m_len);
+		WARN_ON(1);
+	}
+
+	status = map->m_flags & EXT4_MAP_UNWRITTEN ?
+			EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+	ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
+			      map->m_pblk, status);
+	return retval;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -1744,33 +1773,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 	down_read(&EXT4_I(inode)->i_data_sem);
 	if (ext4_has_inline_data(inode))
 		retval = 0;
-	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
-		retval = ext4_ext_map_blocks(NULL, inode, map, 0);
 	else
-		retval = ext4_ind_map_blocks(NULL, inode, map, 0);
-	if (retval < 0) {
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
-	if (retval > 0) {
-		unsigned int status;
-
-		if (unlikely(retval != map->m_len)) {
-			ext4_warning(inode->i_sb,
-				     "ES len assertion failed for inode "
-				     "%lu: retval %d != map->m_len %d",
-				     inode->i_ino, retval, map->m_len);
-			WARN_ON(1);
-		}
-
-		status = map->m_flags & EXT4_MAP_UNWRITTEN ?
-				EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
-		ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
-				      map->m_pblk, status);
-		up_read(&EXT4_I(inode)->i_data_sem);
-		return retval;
-	}
+		retval = ext4_map_query_blocks(NULL, inode, map);
 	up_read(&EXT4_I(inode)->i_data_sem);
+	if (retval)
+		return retval;
 
 add_delayed:
 	down_write(&EXT4_I(inode)->i_data_sem);
-- 
2.39.2


^ permalink raw reply related	[relevance 11%]

* [PATCH v2 0/9] ext4: support adding multi-delalloc blocks
@ 2024-04-10  3:41  5% Zhang Yi
  2024-04-10  3:41 11% ` [PATCH v2 1/9] ext4: factor out a common helper to query extent map Zhang Yi
  0 siblings, 1 reply; 200+ results
From: Zhang Yi @ 2024-04-10  3:41 UTC (permalink / raw)
  To: linux-ext4
  Cc: linux-fsdevel, tytso, adilger.kernel, jack, yi.zhang, yi.zhang,
	chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Hello!

This patch series contains the part 2 preparatory changes of the
buffered IO iomap conversion. I picked them out of my buffered IO iomap
conversion RFC series v3[1], added a fix for an issue found in the
current ext4 code, and also added bigalloc feature support. Please look
at the following patches for details.

The first 2 patches fix an incorrect delalloc reserved blocks count;
the next 6 patches make the ext4_insert_delayed_block() call path
support inserting multiple delalloc blocks at a time; and the last
patch makes ext4_da_map_blocks() buffer_head unaware, in preparation
for iomap.

This patch set has passed the 'kvm-xfstests -g auto' tests; I hope it
can be reviewed and merged first.
[1] https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/

Thanks,
Yi.

Zhang Yi (9):
  ext4: factor out a common helper to query extent map
  ext4: check the extent status again before inserting delalloc block
  ext4: trim delalloc extent
  ext4: drop iblock parameter
  ext4: make ext4_es_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_reserve_space() reserve multi-clusters
  ext4: factor out check for whether a cluster is allocated
  ext4: make ext4_insert_delayed_block() insert multi-blocks
  ext4: make ext4_da_map_blocks() buffer_head unaware

 fs/ext4/extents_status.c    |  63 ++++++----
 fs/ext4/extents_status.h    |   5 +-
 fs/ext4/inode.c             | 240 +++++++++++++++++++++++-------------
 include/trace/events/ext4.h |  26 ++--
 4 files changed, 213 insertions(+), 121 deletions(-)

-- 
2.39.2


^ permalink raw reply	[relevance 5%]

* [PATCH v1 10/18] mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()
  @ 2024-04-09 19:22  5% ` David Hildenbrand
  0 siblings, 0 replies; 200+ results
From: David Hildenbrand @ 2024-04-09 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, cgroups, linux-sh, linux-trace-kernel,
	linux-fsdevel, David Hildenbrand, Andrew Morton,
	Matthew Wilcox (Oracle),
	Peter Xu, Ryan Roberts, Yin Fengwei, Yang Shi, Zi Yan,
	Jonathan Corbet, Hugh Dickins, Yoshinori Sato, Rich Felker,
	John Paul Adrian Glaubitz, Chris Zankel, Max Filippov,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Richard Chang

We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

For tracing purposes, we use page_mapcount() in
__alloc_contig_migrate_range(). Adding that mapcount to total_mapped sounds
strange: total_migrated and total_reclaimed would count each page only
once, not multiple times.

But then, isolate_migratepages_range() adds each folio only once to the
list. So for large folios, we would query the mapcount of the
first page of the folio, which doesn't make too much sense for large
folios.

Let's simply use folio_mapped() * folio_nr_pages(), which makes more
sense as nr_migratepages is also incremented by the number of pages in
the folio in case of successful migration.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 393366d4a704..40fc0f60e021 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6389,8 +6389,12 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		if (trace_mm_alloc_contig_migrate_range_info_enabled()) {
 			total_reclaimed += nr_reclaimed;
-			list_for_each_entry(page, &cc->migratepages, lru)
-				total_mapped += page_mapcount(page);
+			list_for_each_entry(page, &cc->migratepages, lru) {
+				struct folio *folio = page_folio(page);
+
+				total_mapped += folio_mapped(folio) *
+						folio_nr_pages(folio);
+			}
 		}
 
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
-- 
2.44.0


^ permalink raw reply related	[relevance 5%]

* [PATCH 6.6 188/252] cifs: Fix caching to try to do open O_WRONLY as rdwr on server
       [not found]     <20240408125306.643546457@linuxfoundation.org>
@ 2024-04-08 12:58  4% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2024-04-08 12:58 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, David Howells, Steve French,
	Shyam Prasad N, Rohith Surabattula, Jeff Layton, linux-cifs,
	netfs, linux-fsdevel, Steve French, Sasha Levin

6.6-stable review patch.  If anyone has any objections, please let me know.

------------------

From: David Howells <dhowells@redhat.com>

[ Upstream commit e9e62243a3e2322cf639f653a0b0a88a76446ce7 ]

When we're engaged in local caching of a cifs filesystem, we cannot perform
caching of a partially written cache granule unless we can read the rest of
the granule.  This can result in unexpected access errors being reported to
the user.

Fix this as follows: if a file is opened O_WRONLY locally, but the mount
was given the "-o fsc" flag, first try opening the remote file with
GENERIC_READ|GENERIC_WRITE; if that returns -EACCES, drop GENERIC_READ
and retry the open.  If the retry succeeds, invalidate the cache for
that file as for O_DIRECT.

Fixes: 70431bfd825d ("cifs: Support fscache indexing rewrite")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/smb/client/dir.c     | 15 +++++++++++++
 fs/smb/client/file.c    | 48 ++++++++++++++++++++++++++++++++---------
 fs/smb/client/fscache.h |  6 ++++++
 3 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index 580a27a3a7e62..855468a32904e 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -189,6 +189,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	int disposition;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	*oplock = 0;
 	if (tcon->ses->server->oplocks)
@@ -200,6 +201,10 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		return PTR_ERR(full_path);
 	}
 
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (oflags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	if (tcon->unix_ext && cap_unix(tcon->ses) && !tcon->broken_posix_open &&
 	    (CIFS_UNIX_POSIX_PATH_OPS_CAP &
@@ -276,6 +281,8 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		desired_access |= GENERIC_READ; /* is this too little? */
 	if (OPEN_FMODE(oflags) & FMODE_WRITE)
 		desired_access |= GENERIC_WRITE;
+	if (rdwr_for_fscache == 1)
+		desired_access |= GENERIC_READ;
 
 	disposition = FILE_OVERWRITE_IF;
 	if ((oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
@@ -304,6 +311,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	if (!tcon->unix_ext && (mode & S_IWUGO) == 0)
 		create_options |= CREATE_OPTION_READONLY;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -317,8 +325,15 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	rc = server->ops->open(xid, &oparms, oplock, buf);
 	if (rc) {
 		cifs_dbg(FYI, "cifs_create returned 0x%x\n", rc);
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access &= ~GENERIC_READ;
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		goto out;
 	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	/*
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index c711d5eb2987e..606972a95465b 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -206,12 +206,12 @@ cifs_mark_open_files_invalid(struct cifs_tcon *tcon)
 	 */
 }
 
-static inline int cifs_convert_flags(unsigned int flags)
+static inline int cifs_convert_flags(unsigned int flags, int rdwr_for_fscache)
 {
 	if ((flags & O_ACCMODE) == O_RDONLY)
 		return GENERIC_READ;
 	else if ((flags & O_ACCMODE) == O_WRONLY)
-		return GENERIC_WRITE;
+		return rdwr_for_fscache == 1 ? (GENERIC_READ | GENERIC_WRITE) : GENERIC_WRITE;
 	else if ((flags & O_ACCMODE) == O_RDWR) {
 		/* GENERIC_ALL is too much permission to request
 		   can cause unnecessary access denied on create */
@@ -348,11 +348,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	int create_options = CREATE_NOT_DIR;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	if (!server->ops->open)
 		return -ENOSYS;
 
-	desired_access = cifs_convert_flags(f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(f_flags, rdwr_for_fscache);
 
 /*********************************************************************
  *  open flag mapping table:
@@ -389,6 +394,7 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	if (f_flags & O_DIRECT)
 		create_options |= CREATE_NO_BUFFER;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -400,8 +406,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	};
 
 	rc = server->ops->open(xid, &oparms, oplock, buf);
-	if (rc)
+	if (rc) {
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access = cifs_convert_flags(f_flags, 0);
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		return rc;
+	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 	/* TODO: Add support for calling posix query info but with passing in fid */
 	if (tcon->unix_ext)
@@ -834,11 +848,11 @@ int cifs_open(struct inode *inode, struct file *file)
 use_cache:
 	fscache_use_cookie(cifs_inode_cookie(file_inode(file)),
 			   file->f_mode & FMODE_WRITE);
-	if (file->f_flags & O_DIRECT &&
-	    (!((file->f_flags & O_ACCMODE) != O_RDONLY) ||
-	     file->f_flags & O_APPEND))
-		cifs_invalidate_cache(file_inode(file),
-				      FSCACHE_INVAL_DIO_WRITE);
+	if (!(file->f_flags & O_DIRECT))
+		goto out;
+	if ((file->f_flags & (O_ACCMODE | O_APPEND)) == O_RDONLY)
+		goto out;
+	cifs_invalidate_cache(file_inode(file), FSCACHE_INVAL_DIO_WRITE);
 
 out:
 	free_dentry_path(page);
@@ -903,6 +917,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	int disposition = FILE_OPEN;
 	int create_options = CREATE_NOT_DIR;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	xid = get_xid();
 	mutex_lock(&cfile->fh_mutex);
@@ -966,7 +981,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	}
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
 
-	desired_access = cifs_convert_flags(cfile->f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (cfile->f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(cfile->f_flags, rdwr_for_fscache);
 
 	/* O_SYNC also has bit for O_DSYNC so following check picks up either */
 	if (cfile->f_flags & O_SYNC)
@@ -978,6 +997,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	if (server->ops->get_lease_key)
 		server->ops->get_lease_key(inode, &cfile->fid);
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -1003,6 +1023,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		/* indicate that we need to relock the file */
 		oparms.reconnect = true;
 	}
+	if (rc == -EACCES && rdwr_for_fscache == 1) {
+		desired_access = cifs_convert_flags(cfile->f_flags, 0);
+		rdwr_for_fscache = 2;
+		goto retry_open;
+	}
 
 	if (rc) {
 		mutex_unlock(&cfile->fh_mutex);
@@ -1011,6 +1036,9 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		goto reopen_error_exit;
 	}
 
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 reopen_success:
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
diff --git a/fs/smb/client/fscache.h b/fs/smb/client/fscache.h
index a3d73720914f8..1f2ea9f5cc9a8 100644
--- a/fs/smb/client/fscache.h
+++ b/fs/smb/client/fscache.h
@@ -109,6 +109,11 @@ static inline void cifs_readahead_to_fscache(struct inode *inode,
 		__cifs_readahead_to_fscache(inode, pos, len);
 }
 
+static inline bool cifs_fscache_enabled(struct inode *inode)
+{
+	return fscache_cookie_enabled(cifs_inode_cookie(inode));
+}
+
 #else /* CONFIG_CIFS_FSCACHE */
 static inline
 void cifs_fscache_fill_coherency(struct inode *inode,
@@ -124,6 +129,7 @@ static inline void cifs_fscache_release_inode_cookie(struct inode *inode) {}
 static inline void cifs_fscache_unuse_inode_cookie(struct inode *inode, bool update) {}
 static inline struct fscache_cookie *cifs_inode_cookie(struct inode *inode) { return NULL; }
 static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags) {}
+static inline bool cifs_fscache_enabled(struct inode *inode) { return false; }
 
 static inline int cifs_fscache_query_occupancy(struct inode *inode,
 					       pgoff_t first, unsigned int nr_pages,
-- 
2.43.0


^ permalink raw reply related	[relevance 4%]
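The rdwr_for_fscache retry state machine in the patch above can be sketched as a small standalone model. Everything here is a simplification under stated assumptions: `fake_server_open()` stands in for `server->ops->open()`, the access-bit constants are illustrative, and the function names are hypothetical, not the cifs client's. State 0 means caching is off, state 1 means an O_WRONLY open was widened to read+write for the cache, and state 2 means we fell back to write-only after the server refused read access, in which case the local cache must be invalidated as for O_DIRECT.

```c
#include <assert.h>

#define SKETCH_EACCES 13	/* stand-in for the kernel's -EACCES */
#define ACC_READ  0x1
#define ACC_WRITE 0x2

/* Hypothetical stand-in for server->ops->open(): reject any request
 * that includes read access when allow_read is false. */
static int fake_server_open(int access, int allow_read)
{
	if ((access & ACC_READ) && !allow_read)
		return -SKETCH_EACCES;
	return 0;
}

/* Sketch of the open path: widen O_WRONLY to read+write when caching,
 * fall back to write-only on -EACCES, and report via *invalidate
 * whether the cache must be invalidated (rdwr_for_fscache == 2). */
static int open_for_cache(int allow_read, int caching, int *invalidate)
{
	int rdwr_for_fscache = caching ? 1 : 0;
	int access = ACC_WRITE | (rdwr_for_fscache ? ACC_READ : 0);
	int rc;

retry_open:
	rc = fake_server_open(access, allow_read);
	if (rc == -SKETCH_EACCES && rdwr_for_fscache == 1) {
		access &= ~ACC_READ;	/* drop GENERIC_READ and retry */
		rdwr_for_fscache = 2;
		goto retry_open;
	}
	*invalidate = (rc == 0 && rdwr_for_fscache == 2);
	return rc;
}
```

The key property is that the fallback is taken only once (state 1 → 2), so a server that rejects even the write-only open still fails cleanly instead of looping.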

* [PATCH 6.8 179/273] cifs: Fix caching to try to do open O_WRONLY as rdwr on server
       [not found]     <20240408125309.280181634@linuxfoundation.org>
@ 2024-04-08 12:57  4% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2024-04-08 12:57 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, David Howells, Steve French,
	Shyam Prasad N, Rohith Surabattula, Jeff Layton, linux-cifs,
	netfs, linux-fsdevel, Steve French, Sasha Levin

6.8-stable review patch.  If anyone has any objections, please let me know.

------------------

From: David Howells <dhowells@redhat.com>

[ Upstream commit e9e62243a3e2322cf639f653a0b0a88a76446ce7 ]

When we're engaged in local caching of a cifs filesystem, we cannot perform
caching of a partially written cache granule unless we can read the rest of
the granule.  This can result in unexpected access errors being reported to
the user.

Fix this as follows: if a file is opened O_WRONLY locally, but the mount
was given the "-o fsc" flag, first try opening the remote file with
GENERIC_READ|GENERIC_WRITE; if that returns -EACCES, drop GENERIC_READ
and retry the open.  If the retry succeeds, invalidate the cache for
that file as for O_DIRECT.

Fixes: 70431bfd825d ("cifs: Support fscache indexing rewrite")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/smb/client/dir.c     | 15 +++++++++++++
 fs/smb/client/file.c    | 48 ++++++++++++++++++++++++++++++++---------
 fs/smb/client/fscache.h |  6 ++++++
 3 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index 89333d9bce36e..37897b919dd5a 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -189,6 +189,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	int disposition;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	*oplock = 0;
 	if (tcon->ses->server->oplocks)
@@ -200,6 +201,10 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		return PTR_ERR(full_path);
 	}
 
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (oflags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	if (tcon->unix_ext && cap_unix(tcon->ses) && !tcon->broken_posix_open &&
 	    (CIFS_UNIX_POSIX_PATH_OPS_CAP &
@@ -276,6 +281,8 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		desired_access |= GENERIC_READ; /* is this too little? */
 	if (OPEN_FMODE(oflags) & FMODE_WRITE)
 		desired_access |= GENERIC_WRITE;
+	if (rdwr_for_fscache == 1)
+		desired_access |= GENERIC_READ;
 
 	disposition = FILE_OVERWRITE_IF;
 	if ((oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
@@ -304,6 +311,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	if (!tcon->unix_ext && (mode & S_IWUGO) == 0)
 		create_options |= CREATE_OPTION_READONLY;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -317,8 +325,15 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	rc = server->ops->open(xid, &oparms, oplock, buf);
 	if (rc) {
 		cifs_dbg(FYI, "cifs_create returned 0x%x\n", rc);
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access &= ~GENERIC_READ;
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		goto out;
 	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	/*
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 9d42a39009076..9ec835320f8a7 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -206,12 +206,12 @@ cifs_mark_open_files_invalid(struct cifs_tcon *tcon)
 	 */
 }
 
-static inline int cifs_convert_flags(unsigned int flags)
+static inline int cifs_convert_flags(unsigned int flags, int rdwr_for_fscache)
 {
 	if ((flags & O_ACCMODE) == O_RDONLY)
 		return GENERIC_READ;
 	else if ((flags & O_ACCMODE) == O_WRONLY)
-		return GENERIC_WRITE;
+		return rdwr_for_fscache == 1 ? (GENERIC_READ | GENERIC_WRITE) : GENERIC_WRITE;
 	else if ((flags & O_ACCMODE) == O_RDWR) {
 		/* GENERIC_ALL is too much permission to request
 		   can cause unnecessary access denied on create */
@@ -348,11 +348,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	int create_options = CREATE_NOT_DIR;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	if (!server->ops->open)
 		return -ENOSYS;
 
-	desired_access = cifs_convert_flags(f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(f_flags, rdwr_for_fscache);
 
 /*********************************************************************
  *  open flag mapping table:
@@ -389,6 +394,7 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	if (f_flags & O_DIRECT)
 		create_options |= CREATE_NO_BUFFER;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -400,8 +406,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	};
 
 	rc = server->ops->open(xid, &oparms, oplock, buf);
-	if (rc)
+	if (rc) {
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access = cifs_convert_flags(f_flags, 0);
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		return rc;
+	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 	/* TODO: Add support for calling posix query info but with passing in fid */
 	if (tcon->unix_ext)
@@ -834,11 +848,11 @@ int cifs_open(struct inode *inode, struct file *file)
 use_cache:
 	fscache_use_cookie(cifs_inode_cookie(file_inode(file)),
 			   file->f_mode & FMODE_WRITE);
-	if (file->f_flags & O_DIRECT &&
-	    (!((file->f_flags & O_ACCMODE) != O_RDONLY) ||
-	     file->f_flags & O_APPEND))
-		cifs_invalidate_cache(file_inode(file),
-				      FSCACHE_INVAL_DIO_WRITE);
+	if (!(file->f_flags & O_DIRECT))
+		goto out;
+	if ((file->f_flags & (O_ACCMODE | O_APPEND)) == O_RDONLY)
+		goto out;
+	cifs_invalidate_cache(file_inode(file), FSCACHE_INVAL_DIO_WRITE);
 
 out:
 	free_dentry_path(page);
@@ -903,6 +917,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	int disposition = FILE_OPEN;
 	int create_options = CREATE_NOT_DIR;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	xid = get_xid();
 	mutex_lock(&cfile->fh_mutex);
@@ -966,7 +981,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	}
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
 
-	desired_access = cifs_convert_flags(cfile->f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (cfile->f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(cfile->f_flags, rdwr_for_fscache);
 
 	/* O_SYNC also has bit for O_DSYNC so following check picks up either */
 	if (cfile->f_flags & O_SYNC)
@@ -978,6 +997,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	if (server->ops->get_lease_key)
 		server->ops->get_lease_key(inode, &cfile->fid);
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -1003,6 +1023,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		/* indicate that we need to relock the file */
 		oparms.reconnect = true;
 	}
+	if (rc == -EACCES && rdwr_for_fscache == 1) {
+		desired_access = cifs_convert_flags(cfile->f_flags, 0);
+		rdwr_for_fscache = 2;
+		goto retry_open;
+	}
 
 	if (rc) {
 		mutex_unlock(&cfile->fh_mutex);
@@ -1011,6 +1036,9 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		goto reopen_error_exit;
 	}
 
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 reopen_success:
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
diff --git a/fs/smb/client/fscache.h b/fs/smb/client/fscache.h
index a3d73720914f8..1f2ea9f5cc9a8 100644
--- a/fs/smb/client/fscache.h
+++ b/fs/smb/client/fscache.h
@@ -109,6 +109,11 @@ static inline void cifs_readahead_to_fscache(struct inode *inode,
 		__cifs_readahead_to_fscache(inode, pos, len);
 }
 
+static inline bool cifs_fscache_enabled(struct inode *inode)
+{
+	return fscache_cookie_enabled(cifs_inode_cookie(inode));
+}
+
 #else /* CONFIG_CIFS_FSCACHE */
 static inline
 void cifs_fscache_fill_coherency(struct inode *inode,
@@ -124,6 +129,7 @@ static inline void cifs_fscache_release_inode_cookie(struct inode *inode) {}
 static inline void cifs_fscache_unuse_inode_cookie(struct inode *inode, bool update) {}
 static inline struct fscache_cookie *cifs_inode_cookie(struct inode *inode) { return NULL; }
 static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags) {}
+static inline bool cifs_fscache_enabled(struct inode *inode) { return false; }
 
 static inline int cifs_fscache_query_occupancy(struct inode *inode,
 					       pgoff_t first, unsigned int nr_pages,
-- 
2.43.0


^ permalink raw reply related	[relevance 4%]

* [PATCH 6.1 100/138] cifs: Fix caching to try to do open O_WRONLY as rdwr on server
       [not found]     <20240408125256.218368873@linuxfoundation.org>
@ 2024-04-08 12:58  4% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2024-04-08 12:58 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, David Howells, Steve French,
	Shyam Prasad N, Rohith Surabattula, Jeff Layton, linux-cifs,
	netfs, linux-fsdevel, Steve French, Sasha Levin

6.1-stable review patch.  If anyone has any objections, please let me know.

------------------

From: David Howells <dhowells@redhat.com>

[ Upstream commit e9e62243a3e2322cf639f653a0b0a88a76446ce7 ]

When we're engaged in local caching of a cifs filesystem, we cannot perform
caching of a partially written cache granule unless we can read the rest of
the granule.  This can result in unexpected access errors being reported to
the user.

Fix this as follows: if a file is opened O_WRONLY locally, but the mount
was given the "-o fsc" flag, first try opening the remote file with
GENERIC_READ|GENERIC_WRITE; if that returns -EACCES, drop GENERIC_READ
and retry the open.  If the retry succeeds, invalidate the cache for
that file as for O_DIRECT.

Fixes: 70431bfd825d ("cifs: Support fscache indexing rewrite")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/smb/client/dir.c     | 15 +++++++++++++
 fs/smb/client/file.c    | 48 ++++++++++++++++++++++++++++++++---------
 fs/smb/client/fscache.h |  6 ++++++
 3 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index e382b794acbed..863c7bc3db86f 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -180,6 +180,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	int disposition;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	*oplock = 0;
 	if (tcon->ses->server->oplocks)
@@ -191,6 +192,10 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		return PTR_ERR(full_path);
 	}
 
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (oflags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	if (tcon->unix_ext && cap_unix(tcon->ses) && !tcon->broken_posix_open &&
 	    (CIFS_UNIX_POSIX_PATH_OPS_CAP &
@@ -267,6 +272,8 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		desired_access |= GENERIC_READ; /* is this too little? */
 	if (OPEN_FMODE(oflags) & FMODE_WRITE)
 		desired_access |= GENERIC_WRITE;
+	if (rdwr_for_fscache == 1)
+		desired_access |= GENERIC_READ;
 
 	disposition = FILE_OVERWRITE_IF;
 	if ((oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
@@ -295,6 +302,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	if (!tcon->unix_ext && (mode & S_IWUGO) == 0)
 		create_options |= CREATE_OPTION_READONLY;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -308,8 +316,15 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	rc = server->ops->open(xid, &oparms, oplock, buf);
 	if (rc) {
 		cifs_dbg(FYI, "cifs_create returned 0x%x\n", rc);
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access &= ~GENERIC_READ;
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		goto out;
 	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	/*
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 0f3405e0f2e48..c240cea7ca349 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -77,12 +77,12 @@ cifs_mark_open_files_invalid(struct cifs_tcon *tcon)
 	 */
 }
 
-static inline int cifs_convert_flags(unsigned int flags)
+static inline int cifs_convert_flags(unsigned int flags, int rdwr_for_fscache)
 {
 	if ((flags & O_ACCMODE) == O_RDONLY)
 		return GENERIC_READ;
 	else if ((flags & O_ACCMODE) == O_WRONLY)
-		return GENERIC_WRITE;
+		return rdwr_for_fscache == 1 ? (GENERIC_READ | GENERIC_WRITE) : GENERIC_WRITE;
 	else if ((flags & O_ACCMODE) == O_RDWR) {
 		/* GENERIC_ALL is too much permission to request
 		   can cause unnecessary access denied on create */
@@ -219,11 +219,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	int create_options = CREATE_NOT_DIR;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	if (!server->ops->open)
 		return -ENOSYS;
 
-	desired_access = cifs_convert_flags(f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(f_flags, rdwr_for_fscache);
 
 /*********************************************************************
  *  open flag mapping table:
@@ -260,6 +265,7 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	if (f_flags & O_DIRECT)
 		create_options |= CREATE_NO_BUFFER;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -271,8 +277,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	};
 
 	rc = server->ops->open(xid, &oparms, oplock, buf);
-	if (rc)
+	if (rc) {
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access = cifs_convert_flags(f_flags, 0);
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		return rc;
+	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 	/* TODO: Add support for calling posix query info but with passing in fid */
 	if (tcon->unix_ext)
@@ -705,11 +719,11 @@ int cifs_open(struct inode *inode, struct file *file)
 use_cache:
 	fscache_use_cookie(cifs_inode_cookie(file_inode(file)),
 			   file->f_mode & FMODE_WRITE);
-	if (file->f_flags & O_DIRECT &&
-	    (!((file->f_flags & O_ACCMODE) != O_RDONLY) ||
-	     file->f_flags & O_APPEND))
-		cifs_invalidate_cache(file_inode(file),
-				      FSCACHE_INVAL_DIO_WRITE);
+	if (!(file->f_flags & O_DIRECT))
+		goto out;
+	if ((file->f_flags & (O_ACCMODE | O_APPEND)) == O_RDONLY)
+		goto out;
+	cifs_invalidate_cache(file_inode(file), FSCACHE_INVAL_DIO_WRITE);
 
 out:
 	free_dentry_path(page);
@@ -774,6 +788,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	int disposition = FILE_OPEN;
 	int create_options = CREATE_NOT_DIR;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	xid = get_xid();
 	mutex_lock(&cfile->fh_mutex);
@@ -837,7 +852,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	}
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
 
-	desired_access = cifs_convert_flags(cfile->f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (cfile->f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(cfile->f_flags, rdwr_for_fscache);
 
 	/* O_SYNC also has bit for O_DSYNC so following check picks up either */
 	if (cfile->f_flags & O_SYNC)
@@ -849,6 +868,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	if (server->ops->get_lease_key)
 		server->ops->get_lease_key(inode, &cfile->fid);
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -874,6 +894,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		/* indicate that we need to relock the file */
 		oparms.reconnect = true;
 	}
+	if (rc == -EACCES && rdwr_for_fscache == 1) {
+		desired_access = cifs_convert_flags(cfile->f_flags, 0);
+		rdwr_for_fscache = 2;
+		goto retry_open;
+	}
 
 	if (rc) {
 		mutex_unlock(&cfile->fh_mutex);
@@ -882,6 +907,9 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		goto reopen_error_exit;
 	}
 
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 reopen_success:
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
diff --git a/fs/smb/client/fscache.h b/fs/smb/client/fscache.h
index 67b601041f0a3..c691b98b442a6 100644
--- a/fs/smb/client/fscache.h
+++ b/fs/smb/client/fscache.h
@@ -108,6 +108,11 @@ static inline void cifs_readpage_to_fscache(struct inode *inode,
 		__cifs_readpage_to_fscache(inode, page);
 }
 
+static inline bool cifs_fscache_enabled(struct inode *inode)
+{
+	return fscache_cookie_enabled(cifs_inode_cookie(inode));
+}
+
 #else /* CONFIG_CIFS_FSCACHE */
 static inline
 void cifs_fscache_fill_coherency(struct inode *inode,
@@ -123,6 +128,7 @@ static inline void cifs_fscache_release_inode_cookie(struct inode *inode) {}
 static inline void cifs_fscache_unuse_inode_cookie(struct inode *inode, bool update) {}
 static inline struct fscache_cookie *cifs_inode_cookie(struct inode *inode) { return NULL; }
 static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags) {}
+static inline bool cifs_fscache_enabled(struct inode *inode) { return false; }
 
 static inline int cifs_fscache_query_occupancy(struct inode *inode,
 					       pgoff_t first, unsigned int nr_pages,
-- 
2.43.0


^ permalink raw reply related	[relevance 4%]

* [PATCH vfs.all 25/26] buffer: add helpers to get and set bdev
  @ 2024-04-06  9:09  3% ` Yu Kuai
  0 siblings, 0 replies; 200+ results
From: Yu Kuai @ 2024-04-06  9:09 UTC (permalink / raw)
  To: jack, hch, brauner, viro, axboe
  Cc: linux-fsdevel, linux-block, yi.zhang, yangerkun, yukuai3

From: Yu Kuai <yukuai3@huawei.com>

So that we have unified APIs, there are no functional changes and
prepare to convert buffer_head to use bdev_file.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/fops.c                  |  2 +-
 drivers/md/md-bitmap.c        |  2 +-
 fs/affs/file.c                |  2 +-
 fs/buffer.c                   | 10 +++++-----
 fs/direct-io.c                |  4 ++--
 fs/ext2/xattr.c               |  2 +-
 fs/ext4/mmp.c                 |  2 +-
 fs/ext4/page-io.c             |  5 ++---
 fs/ext4/xattr.c               |  2 +-
 fs/gfs2/aops.c                |  2 +-
 fs/gfs2/meta_io.c             |  2 +-
 fs/jbd2/commit.c              |  2 +-
 fs/jbd2/journal.c             |  2 +-
 fs/jbd2/transaction.c         |  8 ++++----
 fs/mpage.c                    | 10 +++++-----
 fs/nilfs2/btnode.c            |  4 ++--
 fs/nilfs2/gcinode.c           |  2 +-
 fs/nilfs2/mdt.c               |  2 +-
 fs/nilfs2/page.c              |  4 ++--
 fs/ntfs3/inode.c              |  2 +-
 fs/reiserfs/fix_node.c        |  2 +-
 fs/reiserfs/journal.c         |  2 +-
 fs/reiserfs/prints.c          |  4 ++--
 fs/reiserfs/stree.c           |  2 +-
 fs/reiserfs/tail_conversion.c |  2 +-
 include/linux/buffer_head.h   | 20 +++++++++++++++++++-
 include/trace/events/block.h  |  2 +-
 27 files changed, 61 insertions(+), 44 deletions(-)

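The shape of the two accessors this patch introduces (the helper bodies themselves are not visible in the hunks below) can be modeled in a few lines of standalone C. This is a sketch under assumptions: the struct layouts are illustrative stand-ins, not the kernel's `struct buffer_head` or `struct file`, and the derivation of the device from the bdev_file is simplified. The point is the API shape: callers stop touching `bh->b_bdev` directly and instead go through `bh_set_bdev_file()` / `bh_bdev()`, so the buffer_head can later cache the opener's bdev_file rather than a raw block_device pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel structures. */
struct block_device { int id; };
struct bdev_file { struct block_device *bdev; };

/* Modeled buffer_head: holds the bdev_file it was opened through,
 * instead of a raw block_device pointer. */
struct buffer_head { struct bdev_file *b_bdev_file; };

/* Unified setter: all former "bh->b_bdev = ..." sites go through here. */
static void bh_set_bdev_file(struct buffer_head *bh, struct bdev_file *f)
{
	bh->b_bdev_file = f;
}

/* Unified getter: all former "bh->b_bdev" reads derive the device from
 * the cached bdev_file (NULL-safe, matching the discard_buffer() and
 * affs error paths that clear the association). */
static struct block_device *bh_bdev(const struct buffer_head *bh)
{
	return bh->b_bdev_file ? bh->b_bdev_file->bdev : NULL;
}
```

Funneling every access through a getter/setter pair is what makes the later conversion mechanical: only the two helpers change when the cached field switches from a block_device to a bdev_file.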
diff --git a/block/fops.c b/block/fops.c
index 7d177be788cd..edae216e31dd 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -407,7 +407,7 @@ static const struct iomap_ops blkdev_iomap_ops = {
 static int blkdev_get_block(struct inode *inode, sector_t iblock,
 		struct buffer_head *bh, int create)
 {
-	bh->b_bdev = I_BDEV(inode);
+	bh_set_bdev_file(bh, inode->i_private);
 	bh->b_blocknr = iblock;
 	set_buffer_mapped(bh);
 	return 0;
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 059afc24c08b..fd6c95e0c625 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -381,7 +381,7 @@ static int read_file_page(struct file *file, unsigned long index,
 			}
 
 			bh->b_blocknr = block;
-			bh->b_bdev = inode->i_sb->s_bdev;
+			bh_set_bdev_file(bh, inode->i_sb->s_bdev_file);
 			if (count < blocksize)
 				count = 0;
 			else
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 04c018e19602..f15b24202aab 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -365,7 +365,7 @@ affs_get_block(struct inode *inode, sector_t block, struct buffer_head *bh_resul
 err_alloc:
 	brelse(ext_bh);
 	clear_buffer_mapped(bh_result);
-	bh_result->b_bdev = NULL;
+	bh_set_bdev_file(bh_result, NULL);
 	// unlock cache
 	affs_unlock_ext(inode);
 	return -ENOSPC;
diff --git a/fs/buffer.c b/fs/buffer.c
index 7900720fc54b..e4d74eb63265 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -129,7 +129,7 @@ static void buffer_io_error(struct buffer_head *bh, char *msg)
 	if (!test_bit(BH_Quiet, &bh->b_state))
 		printk_ratelimited(KERN_ERR
 			"Buffer I/O error on dev %pg, logical block %llu%s\n",
-			bh->b_bdev, (unsigned long long)bh->b_blocknr, msg);
+			bh_bdev(bh), (unsigned long long)bh->b_blocknr, msg);
 }
 
 /*
@@ -1367,7 +1367,7 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
 	for (i = 0; i < BH_LRU_SIZE; i++) {
 		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
 
-		if (bh && bh->b_blocknr == block && bh->b_bdev == bdev &&
+		if (bh && bh->b_blocknr == block && bh_bdev(bh) == bdev &&
 		    bh->b_size == size) {
 			if (i) {
 				while (i) {
@@ -1564,7 +1564,7 @@ static void discard_buffer(struct buffer_head * bh)
 
 	lock_buffer(bh);
 	clear_buffer_dirty(bh);
-	bh->b_bdev = NULL;
+	bh_set_bdev_file(bh, NULL);
 	b_state = READ_ONCE(bh->b_state);
 	do {
 	} while (!try_cmpxchg(&bh->b_state, &b_state,
@@ -2005,7 +2005,7 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
 {
 	loff_t offset = (loff_t)block << inode->i_blkbits;
 
-	bh->b_bdev = iomap_bdev(iomap);
+	bh_set_bdev_file(bh, iomap->bdev_file);
 
 	/*
 	 * Block points to offset in file we need to map, iomap contains
@@ -2781,7 +2781,7 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
 	if (buffer_prio(bh))
 		opf |= REQ_PRIO;
 
-	bio = bio_alloc(bh->b_bdev, 1, opf, GFP_NOIO);
+	bio = bio_alloc(bh_bdev(bh), 1, opf, GFP_NOIO);
 
 	fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 62c97ff9e852..49475f530e0f 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -673,7 +673,7 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
 	sector = start_sector << (sdio->blkbits - 9);
 	nr_pages = bio_max_segs(sdio->pages_in_io);
 	BUG_ON(nr_pages <= 0);
-	dio_bio_alloc(dio, sdio, map_bh->b_bdev, sector, nr_pages);
+	dio_bio_alloc(dio, sdio, bh_bdev(map_bh), sector, nr_pages);
 	sdio->boundary = 0;
 out:
 	return ret;
@@ -948,7 +948,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 					map_bh->b_blocknr << sdio->blkfactor;
 				if (buffer_new(map_bh)) {
 					clean_bdev_aliases(
-						map_bh->b_bdev,
+						bh_bdev(map_bh),
 						map_bh->b_blocknr,
 						map_bh->b_size >> i_blkbits);
 				}
diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index c885dcc3bd0d..42e595e87a74 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -80,7 +80,7 @@
 	} while (0)
 # define ea_bdebug(bh, f...) do { \
 		printk(KERN_DEBUG "block %pg:%lu: ", \
-			bh->b_bdev, (unsigned long) bh->b_blocknr); \
+			bh_bdev(bh), (unsigned long) bh->b_blocknr); \
 		printk(f); \
 		printk("\n"); \
 	} while (0)
diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index bd946d0c71b7..5641bd34d021 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -384,7 +384,7 @@ int ext4_multi_mount_protect(struct super_block *sb,
 
 	BUILD_BUG_ON(sizeof(mmp->mmp_bdevname) < BDEVNAME_SIZE);
 	snprintf(mmp->mmp_bdevname, sizeof(mmp->mmp_bdevname),
-		 "%pg", bh->b_bdev);
+		 "%pg", bh_bdev(bh));
 
 	/*
 	 * Start a kernel thread to update the MMP block periodically.
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 312bc6813357..1b02b6a28eca 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -93,8 +93,7 @@ struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end)
 static void buffer_io_error(struct buffer_head *bh)
 {
 	printk_ratelimited(KERN_ERR "Buffer I/O error on device %pg, logical block %llu\n",
-		       bh->b_bdev,
-			(unsigned long long)bh->b_blocknr);
+			   bh_bdev(bh), (unsigned long long)bh->b_blocknr);
 }
 
 static void ext4_finish_bio(struct bio *bio)
@@ -397,7 +396,7 @@ static void io_submit_init_bio(struct ext4_io_submit *io,
 	 * bio_alloc will _always_ be able to allocate a bio if
 	 * __GFP_DIRECT_RECLAIM is set, see comments for bio_alloc_bioset().
 	 */
-	bio = bio_alloc(bh->b_bdev, BIO_MAX_VECS, REQ_OP_WRITE, GFP_NOIO);
+	bio = bio_alloc(bh_bdev(bh), BIO_MAX_VECS, REQ_OP_WRITE, GFP_NOIO);
 	fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_end_io = ext4_end_bio;
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index b67a176bfcf9..005af215e24a 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -68,7 +68,7 @@
 	       inode->i_sb->s_id, inode->i_ino, ##__VA_ARGS__)
 # define ea_bdebug(bh, fmt, ...)					\
 	printk(KERN_DEBUG "block %pg:%lu: " fmt "\n",			\
-	       bh->b_bdev, (unsigned long)bh->b_blocknr, ##__VA_ARGS__)
+	       bh_bdev(bh), (unsigned long)bh->b_blocknr, ##__VA_ARGS__)
 #else
 # define ea_idebug(inode, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 # define ea_bdebug(bh, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 974aca9c8ea8..24b6cf9021ca 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -622,7 +622,7 @@ static void gfs2_discard(struct gfs2_sbd *sdp, struct buffer_head *bh)
 			spin_unlock(&sdp->sd_ail_lock);
 		}
 	}
-	bh->b_bdev = NULL;
+	bh_set_bdev_file(bh, NULL);
 	clear_buffer_mapped(bh);
 	clear_buffer_req(bh);
 	clear_buffer_new(bh);
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index f814054c8cd0..2052d3fc2c24 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -218,7 +218,7 @@ static void gfs2_submit_bhs(blk_opf_t opf, struct buffer_head *bhs[], int num)
 		struct buffer_head *bh = *bhs;
 		struct bio *bio;
 
-		bio = bio_alloc(bh->b_bdev, num, opf, GFP_NOIO);
+		bio = bio_alloc(bh_bdev(bh), num, opf, GFP_NOIO);
 		bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 		while (num > 0) {
 			bh = *bhs;
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 5e122586e06e..413f32b2f308 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -1014,7 +1014,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 				clear_buffer_mapped(bh);
 				clear_buffer_new(bh);
 				clear_buffer_req(bh);
-				bh->b_bdev = NULL;
+				bh_set_bdev_file(bh, NULL);
 			}
 		}
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index abd42a6ccd0e..c1ce32d99267 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -434,7 +434,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 
 	folio_set_bh(new_bh, new_folio, new_offset);
 	new_bh->b_size = bh_in->b_size;
-	new_bh->b_bdev = journal->j_dev;
+	bh_set_bdev_file(new_bh, journal->j_dev_file);
 	new_bh->b_blocknr = blocknr;
 	new_bh->b_private = bh_in;
 	set_buffer_mapped(new_bh);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index cb0b8d6fc0c6..04021f54ca97 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -929,7 +929,7 @@ static void warn_dirty_buffer(struct buffer_head *bh)
 	       "JBD2: Spotted dirty metadata buffer (dev = %pg, blocknr = %llu). "
 	       "There's a risk of filesystem corruption in case of system "
 	       "crash.\n",
-	       bh->b_bdev, (unsigned long long)bh->b_blocknr);
+	       bh_bdev(bh), (unsigned long long)bh->b_blocknr);
 }
 
 /* Call t_frozen trigger and copy buffer data into jh->b_frozen_data. */
@@ -990,7 +990,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 	/* If it takes too long to lock the buffer, trace it */
 	time_lock = jbd2_time_diff(start_lock, jiffies);
 	if (time_lock > HZ/10)
-		trace_jbd2_lock_buffer_stall(bh->b_bdev->bd_dev,
+		trace_jbd2_lock_buffer_stall(bh_bdev(bh)->bd_dev,
 			jiffies_to_msecs(time_lock));
 
 	/* We now hold the buffer lock so it is safe to query the buffer
@@ -2374,7 +2374,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 			write_unlock(&journal->j_state_lock);
 			jbd2_journal_put_journal_head(jh);
 			/* Already zapped buffer? Nothing to do... */
-			if (!bh->b_bdev)
+			if (!bh_bdev(bh))
 				return 0;
 			return -EBUSY;
 		}
@@ -2428,7 +2428,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 	clear_buffer_new(bh);
 	clear_buffer_delay(bh);
 	clear_buffer_unwritten(bh);
-	bh->b_bdev = NULL;
+	bh_set_bdev_file(bh, NULL);
 	return may_free;
 }
 
diff --git a/fs/mpage.c b/fs/mpage.c
index fa8b99a199fa..40594afa63cb 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -126,7 +126,7 @@ static void map_buffer_to_folio(struct folio *folio, struct buffer_head *bh,
 	do {
 		if (block == page_block) {
 			page_bh->b_state = bh->b_state;
-			page_bh->b_bdev = bh->b_bdev;
+			bh_copy_bdev_file(page_bh, bh);
 			page_bh->b_blocknr = bh->b_blocknr;
 			break;
 		}
@@ -216,7 +216,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 			page_block++;
 			block_in_file++;
 		}
-		bdev = map_bh->b_bdev;
+		bdev = bh_bdev(map_bh);
 	}
 
 	/*
@@ -272,7 +272,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 			page_block++;
 			block_in_file++;
 		}
-		bdev = map_bh->b_bdev;
+		bdev = bh_bdev(map_bh);
 	}
 
 	if (first_hole != blocks_per_page) {
@@ -515,7 +515,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 				boundary_block = bh->b_blocknr;
 				boundary_bdev = bh->b_bdev;
 			}
-			bdev = bh->b_bdev;
+			bdev = bh_bdev(bh);
 		} while ((bh = bh->b_this_page) != head);
 
 		if (first_unmapped)
@@ -565,7 +565,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 		}
 		page_block++;
 		boundary = buffer_boundary(&map_bh);
-		bdev = map_bh.b_bdev;
+		bdev = bh_bdev(&map_bh);
 		if (block_in_file == last_block)
 			break;
 		block_in_file++;
diff --git a/fs/nilfs2/btnode.c b/fs/nilfs2/btnode.c
index 0131d83b912d..3f81d00fc031 100644
--- a/fs/nilfs2/btnode.c
+++ b/fs/nilfs2/btnode.c
@@ -59,7 +59,7 @@ nilfs_btnode_create_block(struct address_space *btnc, __u64 blocknr)
 		BUG();
 	}
 	memset(bh->b_data, 0, i_blocksize(inode));
-	bh->b_bdev = inode->i_sb->s_bdev;
+	bh_set_bdev_file(bh, inode->i_sb->s_bdev_file);
 	bh->b_blocknr = blocknr;
 	set_buffer_mapped(bh);
 	set_buffer_uptodate(bh);
@@ -118,7 +118,7 @@ int nilfs_btnode_submit_block(struct address_space *btnc, __u64 blocknr,
 		goto found;
 	}
 	set_buffer_mapped(bh);
-	bh->b_bdev = inode->i_sb->s_bdev;
+	bh_set_bdev_file(bh, inode->i_sb->s_bdev_file);
 	bh->b_blocknr = pblocknr; /* set block address for read */
 	bh->b_end_io = end_buffer_read_sync;
 	get_bh(bh);
diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c
index bf9a11d58817..83d2b5e034ad 100644
--- a/fs/nilfs2/gcinode.c
+++ b/fs/nilfs2/gcinode.c
@@ -84,7 +84,7 @@ int nilfs_gccache_submit_read_data(struct inode *inode, sector_t blkoff,
 	}
 
 	if (!buffer_mapped(bh)) {
-		bh->b_bdev = inode->i_sb->s_bdev;
+		bh_set_bdev_file(bh, inode->i_sb->s_bdev_file);
 		set_buffer_mapped(bh);
 	}
 	bh->b_blocknr = pbn;
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 4f792a0ad0f0..10f33017a1c9 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -89,7 +89,7 @@ static int nilfs_mdt_create_block(struct inode *inode, unsigned long block,
 	if (buffer_uptodate(bh))
 		goto failed_bh;
 
-	bh->b_bdev = sb->s_bdev;
+	bh_set_bdev_file(bh, sb->s_bdev_file);
 	err = nilfs_mdt_insert_new_block(inode, block, bh, init_block);
 	if (likely(!err)) {
 		get_bh(bh);
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 14e470fb8870..b6cc95dd13c0 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -111,7 +111,7 @@ void nilfs_copy_buffer(struct buffer_head *dbh, struct buffer_head *sbh)
 
 	dbh->b_state = sbh->b_state & NILFS_BUFFER_INHERENT_BITS;
 	dbh->b_blocknr = sbh->b_blocknr;
-	dbh->b_bdev = sbh->b_bdev;
+	bh_copy_bdev_file(dbh, sbh);
 
 	bh = dbh;
 	bits = sbh->b_state & (BIT(BH_Uptodate) | BIT(BH_Mapped));
@@ -216,7 +216,7 @@ static void nilfs_copy_folio(struct folio *dst, struct folio *src,
 		lock_buffer(dbh);
 		dbh->b_state = sbh->b_state & mask;
 		dbh->b_blocknr = sbh->b_blocknr;
-		dbh->b_bdev = sbh->b_bdev;
+		bh_copy_bdev_file(dbh, sbh);
 		sbh = sbh->b_this_page;
 		dbh = dbh->b_this_page;
 	} while (dbh != dbufs);
diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index 3c4c878f6d77..c795fd2000ee 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -609,7 +609,7 @@ static noinline int ntfs_get_block_vbo(struct inode *inode, u64 vbo,
 	lbo = ((u64)lcn << cluster_bits) + off;
 
 	set_buffer_mapped(bh);
-	bh->b_bdev = sb->s_bdev;
+	bh_set_bdev_file(bh, sb->s_bdev_file);
 	bh->b_blocknr = lbo >> sb->s_blocksize_bits;
 
 	valid = ni->i_valid;
diff --git a/fs/reiserfs/fix_node.c b/fs/reiserfs/fix_node.c
index 6c13a8d9a73c..2b288b1539d9 100644
--- a/fs/reiserfs/fix_node.c
+++ b/fs/reiserfs/fix_node.c
@@ -2332,7 +2332,7 @@ static void tb_buffer_sanity_check(struct super_block *sb,
 				       "in tree %s[%d] (%b)",
 				       descr, level, bh);
 
-		if (bh->b_bdev != sb->s_bdev)
+		if (bh_bdev(bh) != sb->s_bdev)
 			reiserfs_panic(sb, "jmacd-4", "buffer has wrong "
 				       "device %s[%d] (%b)",
 				       descr, level, bh);
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index e539ccd39e1e..724113cb79d3 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -618,7 +618,7 @@ static void reiserfs_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 	if (buffer_journaled(bh)) {
 		reiserfs_warning(NULL, "clm-2084",
 				 "pinned buffer %lu:%pg sent to disk",
-				 bh->b_blocknr, bh->b_bdev);
+				 bh->b_blocknr, bh_bdev(bh));
 	}
 	if (uptodate)
 		set_buffer_uptodate(bh);
diff --git a/fs/reiserfs/prints.c b/fs/reiserfs/prints.c
index 84a194b77f19..249a458b6e28 100644
--- a/fs/reiserfs/prints.c
+++ b/fs/reiserfs/prints.c
@@ -156,7 +156,7 @@ static int scnprintf_buffer_head(char *buf, size_t size, struct buffer_head *bh)
 {
 	return scnprintf(buf, size,
 			 "dev %pg, size %zd, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
-			 bh->b_bdev, bh->b_size,
+			 bh_bdev(bh), bh->b_size,
 			 (unsigned long long)bh->b_blocknr,
 			 atomic_read(&(bh->b_count)),
 			 bh->b_state, bh->b_page,
@@ -561,7 +561,7 @@ static int print_super_block(struct buffer_head *bh)
 		return 1;
 	}
 
-	printk("%pg\'s super block is in block %llu\n", bh->b_bdev,
+	printk("%pg\'s super block is in block %llu\n", bh_bdev(bh),
 	       (unsigned long long)bh->b_blocknr);
 	printk("Reiserfs version %s\n", version);
 	printk("Block count %u\n", sb_block_count(rs));
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 5faf702f8d15..23998f071d9c 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -331,7 +331,7 @@ static inline int key_in_buffer(
 	       || chk_path->path_length > MAX_HEIGHT,
 	       "PAP-5050: pointer to the key(%p) is NULL or invalid path length(%d)",
 	       key, chk_path->path_length);
-	RFALSE(!PATH_PLAST_BUFFER(chk_path)->b_bdev,
+	RFALSE(!bh_bdev(PATH_PLAST_BUFFER(chk_path)),
 	       "PAP-5060: device must not be NODEV");
 
 	if (comp_keys(get_lkey(chk_path, sb), key) == 1)
diff --git a/fs/reiserfs/tail_conversion.c b/fs/reiserfs/tail_conversion.c
index 2cec61af2a9e..300e6737a0db 100644
--- a/fs/reiserfs/tail_conversion.c
+++ b/fs/reiserfs/tail_conversion.c
@@ -187,7 +187,7 @@ void reiserfs_unmap_buffer(struct buffer_head *bh)
 	clear_buffer_mapped(bh);
 	clear_buffer_req(bh);
 	clear_buffer_new(bh);
-	bh->b_bdev = NULL;
+	bh_set_bdev_file(bh, NULL);
 	unlock_buffer(bh);
 }
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index d78454a4dd1f..4c6f0d0332c8 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -10,6 +10,7 @@
 
 #include <linux/types.h>
 #include <linux/blk_types.h>
+#include <linux/blkdev.h>
 #include <linux/fs.h>
 #include <linux/linkage.h>
 #include <linux/pagemap.h>
@@ -136,6 +137,23 @@ BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
 
+static __always_inline void bh_set_bdev_file(struct buffer_head *bh,
+					     struct file *bdev_file)
+{
+	bh->b_bdev = bdev_file ? file_bdev(bdev_file) : NULL;
+}
+
+static __always_inline void bh_copy_bdev_file(struct buffer_head *dbh,
+					      struct buffer_head *sbh)
+{
+	dbh->b_bdev = sbh->b_bdev;
+}
+
+static __always_inline struct block_device *bh_bdev(struct buffer_head *bh)
+{
+	return bh->b_bdev;
+}
+
 static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
 {
 	/*
@@ -377,7 +395,7 @@ static inline void
 map_bh(struct buffer_head *bh, struct super_block *sb, sector_t block)
 {
 	set_buffer_mapped(bh);
-	bh->b_bdev = sb->s_bdev;
+	bh_set_bdev_file(bh, sb->s_bdev_file);
 	bh->b_blocknr = block;
 	bh->b_size = sb->s_blocksize;
 }
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 0e128ad51460..95d3ed978864 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -26,7 +26,7 @@ DECLARE_EVENT_CLASS(block_buffer,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= bh->b_bdev->bd_dev;
+		__entry->dev		= bh_bdev(bh)->bd_dev;
 		__entry->sector		= bh->b_blocknr;
 		__entry->size		= bh->b_size;
 	),
-- 
2.39.2


^ permalink raw reply related	[relevance 3%]

* Re: [PATCH v2] xfs: allow cross-linking special files without project quota
  2024-03-15  2:48  5% ` Darrick J. Wong
  2024-03-15  9:35  0%   ` Andrey Albershteyn
@ 2024-04-05 22:22  0%   ` Andrey Albershteyn
  1 sibling, 0 replies; 200+ results
From: Andrey Albershteyn @ 2024-04-05 22:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, linux-xfs, chandan.babu

On 2024-03-14 19:48:26, Darrick J. Wong wrote:
> On Thu, Mar 14, 2024 at 06:07:02PM +0100, Andrey Albershteyn wrote:
> > There's an issue that if special files is created before quota
> > project is enabled, then it's not possible to link this file. This
> > works fine for normal files. This happens because xfs_quota skips
> > special files (no ioctls to set necessary flags). The check for
> > having the same project ID for source and destination then fails as
> > source file doesn't have any ID.
> > 
> > mkfs.xfs -f /dev/sda
> > mount -o prjquota /dev/sda /mnt/test
> > 
> > mkdir /mnt/test/foo
> > mkfifo /mnt/test/foo/fifo1
> > 
> > xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test
> > > Setting up project 9 (path /mnt/test/foo)...
> > > xfs_quota: skipping special file /mnt/test/foo/fifo1
> > > Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1).
> > 
> > ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link
> > > ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link
> 
> Aha.  So hardlinking special files within a directory subtree that all
> have the same nonzero project quota ID fails if that special file
> happened to have been created before the subtree was assigned that pqid.
> And there's nothing we can do about that, because there's no way to call
> XFS_IOC_SETFSXATTR on a special file because opening those gets you a
> different inode from the special block/fifo/chardev filesystem...
> 
> > mkfifo /mnt/test/foo/fifo2
> > ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link
> > 
> > Fix this by allowing linking of special files under project quota
> > if the special file doesn't have any ID set (ID = 0).
> 
> ...and that's the workaround for this situation.  The project quota
> accounting here will be weird because there will be (more) files in a
> directory subtree than is reported by xfs_quota, but the subtree was
> already messed up in that manner.
> 
> Question: Should we have a XFS_IOC_SETFSXATTRAT where we can pass in
> relative directory paths and actually query/update special files?

After some more thinking and looking into the code, this is probably
the only way to make it work the same way for special files. I've also
noticed that this workaround could then be applied to xfs_rename.

So, I will start with implementing XFS_IOC_SETFSXATTRAT

-- 
- Andrey

> > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
> > ---
> >  fs/xfs/xfs_inode.c | 15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 1fd94958aa97..b7be19be0132 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1240,8 +1240,19 @@ xfs_link(
> >  	 */
> >  	if (unlikely((tdp->i_diflags & XFS_DIFLAG_PROJINHERIT) &&
> >  		     tdp->i_projid != sip->i_projid)) {
> > -		error = -EXDEV;
> > -		goto error_return;
> > +		/*
> > +		 * Project quota setup skips special files which can
> > +		 * leave inodes in a PROJINHERIT directory without a
> > +		 * project ID set. We need to allow links to be made
> > +		 * to these "project-less" inodes because userspace
> > +		 * expects them to succeed after project ID setup,
> > +		 * but everything else should be rejected.
> > +		 */
> > +		if (!special_file(VFS_I(sip)->i_mode) ||
> > +		    sip->i_projid != 0) {
> > +			error = -EXDEV;
> > +			goto error_return;
> > +		}
> >  	}
> >  
> >  	if (!resblks) {
> > -- 
> > 2.42.0
> > 
> > 
> 


^ permalink raw reply	[relevance 0%]

* [PATCH v14 02/12] landlock: Add IOCTL access right for character and block devices
  2024-04-05 21:40  2% [PATCH v14 00/12] Landlock: IOCTL support Günther Noack
@ 2024-04-05 21:40  6% ` Günther Noack
  2024-04-12 15:16  0%   ` Mickaël Salaün
  0 siblings, 1 reply; 200+ results
From: Günther Noack @ 2024-04-05 21:40 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Jeff Xu, Arnd Bergmann, Jorge Lucangeli Obes, Allen Webb,
	Dmitry Torokhov, Paul Moore, Konstantin Meskhidze,
	Matt Bobrowski, linux-fsdevel, Günther Noack,
	Christian Brauner

Introduces the LANDLOCK_ACCESS_FS_IOCTL_DEV right
and increments the Landlock ABI version to 5.

This access right applies to device-custom IOCTL commands
when they are invoked on block or character device files.

Like the truncate right, this right is associated with a file
descriptor at the time of open(2), and gets respected even when the
file descriptor is used outside of the thread which it was originally
opened in.

Therefore, a newly enabled Landlock policy does not apply to file
descriptors which are already open.

If the LANDLOCK_ACCESS_FS_IOCTL_DEV right is handled, only a small
number of safe IOCTL commands will be permitted on newly opened device
files.  These include FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC, as well
as other IOCTL commands for regular files which are implemented in
fs/ioctl.c.

Noteworthy scenarios which require special attention:

TTY devices are often passed into a process from the parent process,
and so a newly enabled Landlock policy does not retroactively apply to
them automatically.  In the past, TTY devices have often supported
IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which
let callers control the TTY input buffer (and simulate
keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
modern kernels though.

Known limitations:

The LANDLOCK_ACCESS_FS_IOCTL_DEV access right is a coarse-grained
control over IOCTL commands.

Landlock users may use path-based restrictions in combination with
their knowledge about the file system layout to control what IOCTLs
can be done.

Cc: Paul Moore <paul@paul-moore.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Günther Noack <gnoack@google.com>
---
 include/uapi/linux/landlock.h                |  38 +++-
 security/landlock/fs.c                       | 221 ++++++++++++++++++-
 security/landlock/limits.h                   |   2 +-
 security/landlock/syscalls.c                 |   8 +-
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   |   5 +-
 6 files changed, 259 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
index 25c8d7677539..68625e728f43 100644
--- a/include/uapi/linux/landlock.h
+++ b/include/uapi/linux/landlock.h
@@ -128,7 +128,7 @@ struct landlock_net_port_attr {
  * files and directories.  Files or directories opened before the sandboxing
  * are not subject to these restrictions.
  *
- * A file can only receive these access rights:
+ * The following access rights apply only to files:
  *
  * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
  * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
@@ -138,12 +138,13 @@ struct landlock_net_port_attr {
  * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
  * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
  *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
- *   ``O_TRUNC``. Whether an opened file can be truncated with
- *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
- *   same way as read and write permissions are checked during
- *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
- *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
- *   third version of the Landlock ABI.
+ *   ``O_TRUNC``.  This access right is available since the third version of the
+ *   Landlock ABI.
+ *
+ * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
+ * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
+ * read and write permissions are checked during :manpage:`open(2)` using
+ * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
  *
  * A directory can receive access rights related to files or directories.  The
  * following access right is applied to the directory itself, and the
@@ -198,13 +199,33 @@ struct landlock_net_port_attr {
  *   If multiple requirements are not met, the ``EACCES`` error code takes
  *   precedence over ``EXDEV``.
  *
+ * The following access right applies both to files and directories:
+ *
+ * - %LANDLOCK_ACCESS_FS_IOCTL_DEV: Invoke :manpage:`ioctl(2)` commands on an opened
+ *   character or block device.
+ *
+ *   This access right applies to all `ioctl(2)` commands implemented by device
+ *   drivers.  However, the following common IOCTL commands continue to be
+ *   invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL_DEV right:
+ *
+ *   * IOCTL commands targeting file descriptors (``FIOCLEX``, ``FIONCLEX``),
+ *   * IOCTL commands targeting file descriptions (``FIONBIO``, ``FIOASYNC``),
+ *   * IOCTL commands targeting file systems (``FIFREEZE``, ``FITHAW``,
+ *     ``FIGETBSZ``, ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``)
+ *   * Some IOCTL commands which do not make sense when used with devices, but
+ *     whose implementations are safe and return the right error codes
+ *     (``FS_IOC_FIEMAP``, ``FICLONE``, ``FICLONERANGE``, ``FIDEDUPERANGE``)
+ *
+ *   This access right is available since the fifth version of the Landlock
+ *   ABI.
+ *
  * .. warning::
  *
  *   It is currently not possible to restrict some file-related actions
  *   accessible through these syscall families: :manpage:`chdir(2)`,
  *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
  *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
- *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
+ *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
  *   Future Landlock evolutions will enable to restrict them.
  */
 /* clang-format off */
@@ -223,6 +244,7 @@ struct landlock_net_port_attr {
 #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
 #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
 #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
+#define LANDLOCK_ACCESS_FS_IOCTL_DEV			(1ULL << 15)
 /* clang-format on */
 
 /**
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index c15559432d3d..b0857541d5e0 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -7,6 +7,7 @@
  * Copyright © 2021-2022 Microsoft Corporation
  */
 
+#include <asm/ioctls.h>
 #include <kunit/test.h>
 #include <linux/atomic.h>
 #include <linux/bitops.h>
@@ -14,6 +15,7 @@
 #include <linux/compiler_types.h>
 #include <linux/dcache.h>
 #include <linux/err.h>
+#include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -29,6 +31,7 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/workqueue.h>
+#include <uapi/linux/fiemap.h>
 #include <uapi/linux/landlock.h>
 
 #include "common.h"
@@ -84,6 +87,158 @@ static const struct landlock_object_underops landlock_fs_underops = {
 	.release = release_inode
 };
 
+/* IOCTL helpers */
+
+/**
+ * is_masked_device_ioctl(): Determine whether an IOCTL command is always
+ * permitted with Landlock for device files.  These commands can not be
+ * restricted on device files by enforcing a Landlock policy.
+ *
+ * @cmd: The IOCTL command that is supposed to be run.
+ *
+ * By default, any IOCTL on a device file requires the
+ * LANDLOCK_ACCESS_FS_IOCTL_DEV right.  However, we blanket-permit some
+ * commands, if:
+ *
+ * 1. The command is implemented in fs/ioctl.c's do_vfs_ioctl(),
+ *    not in f_ops->unlocked_ioctl() or f_ops->compat_ioctl().
+ *
+ * 2. The command is harmless when invoked on devices.
+ *
+ * We also permit commands that do not make sense for devices, but where the
+ * do_vfs_ioctl() implementation returns a more conventional error code.
+ *
+ * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
+ * should be considered for inclusion here.
+ *
+ * Returns: true if the IOCTL @cmd can not be restricted with Landlock for
+ * device files.
+ */
+static __attribute_const__ bool is_masked_device_ioctl(const unsigned int cmd)
+{
+	switch (cmd) {
+	/*
+	 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
+	 * close-on-exec and the file's buffered-IO and async flags.  These
+	 * operations are also available through fcntl(2), and are
+	 * unconditionally permitted in Landlock.
+	 */
+	case FIOCLEX:
+	case FIONCLEX:
+	case FIONBIO:
+	case FIOASYNC:
+	/*
+	 * FIOQSIZE queries the size of a regular file, directory, or link.
+	 *
+	 * We still permit it, because it always returns -ENOTTY for
+	 * other file types.
+	 */
+	case FIOQSIZE:
+	/*
+	 * FIFREEZE and FITHAW freeze and thaw the file system which the
+	 * given file belongs to.  Requires CAP_SYS_ADMIN.
+	 *
+	 * These commands operate on the file system's superblock rather
+	 * than on the file itself.  The same operations can also be
+	 * done through any other file or directory on the same file
+	 * system, so it is safe to permit these.
+	 */
+	case FIFREEZE:
+	case FITHAW:
+	/*
+	 * FS_IOC_FIEMAP queries information about the allocation of
+	 * blocks within a file.
+	 *
+	 * This IOCTL command only makes sense for regular files and is
+	 * not implemented by devices. It is harmless to permit.
+	 */
+	case FS_IOC_FIEMAP:
+	/*
+	 * FIGETBSZ queries the file system's block size for a file or
+	 * directory.
+	 *
+	 * This command operates on the file system's superblock rather
+	 * than on the file itself.  The same operation can also be done
+	 * through any other file or directory on the same file system,
+	 * so it is safe to permit it.
+	 */
+	case FIGETBSZ:
+	/*
+	 * FICLONE, FICLONERANGE and FIDEDUPERANGE make files share
+	 * their underlying storage ("reflink") between source and
+	 * destination FDs, on file systems which support that.
+	 *
+	 * These IOCTL commands only apply to regular files
+	 * and are harmless to permit for device files.
+	 */
+	case FICLONE:
+	case FICLONERANGE:
+	case FIDEDUPERANGE:
+	/*
+	 * FIONREAD, FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_FSGETXATTR and
+	 * FS_IOC_FSSETXATTR are forwarded to device implementations.
+	 */
+
+	/*
+	 * FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH both operate on
+	 * the file system superblock, not on the specific file, so
+	 * these operations are available through any other file on the
+	 * same file system as well.
+	 */
+	case FS_IOC_GETFSUUID:
+	case FS_IOC_GETFSSYSFSPATH:
+		return true;
+
+	/*
+	 * file_ioctl() commands (FIBMAP, FS_IOC_RESVSP, FS_IOC_RESVSP64,
+	 * FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64 and FS_IOC_ZERO_RANGE) are
+	 * forwarded to device implementations, so not permitted.
+	 */
+
+	/* Other commands are guarded by the access right. */
+	default:
+		return false;
+	}
+}
+
+/*
+ * is_masked_device_ioctl_compat - same as the helper above, but checking the
+ * "compat" IOCTL commands.
+ *
+ * The IOCTL commands with special handling in compat-mode should behave the
+ * same as their non-compat counterparts.
+ */
+static __attribute_const__ bool
+is_masked_device_ioctl_compat(const unsigned int cmd)
+{
+	switch (cmd) {
+	/* FICLONE is permitted, same as in the non-compat variant. */
+	case FICLONE:
+		return true;
+#if defined(CONFIG_X86_64)
+	/*
+	 * FS_IOC_RESVSP_32, FS_IOC_RESVSP64_32, FS_IOC_UNRESVSP_32,
+	 * FS_IOC_UNRESVSP64_32, FS_IOC_ZERO_RANGE_32: not blanket-permitted,
+	 * for consistency with their non-compat variants.
+	 */
+	case FS_IOC_RESVSP_32:
+	case FS_IOC_RESVSP64_32:
+	case FS_IOC_UNRESVSP_32:
+	case FS_IOC_UNRESVSP64_32:
+	case FS_IOC_ZERO_RANGE_32:
+#endif
+	/*
+	 * FS_IOC32_GETFLAGS, FS_IOC32_SETFLAGS are forwarded to their device
+	 * implementations.
+	 */
+	case FS_IOC32_GETFLAGS:
+	case FS_IOC32_SETFLAGS:
+		return false;
+	default:
+		return is_masked_device_ioctl(cmd);
+	}
+}
+
 /* Ruleset management */
 
 static struct landlock_object *get_inode_object(struct inode *const inode)
@@ -148,7 +303,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
 	LANDLOCK_ACCESS_FS_EXECUTE | \
 	LANDLOCK_ACCESS_FS_WRITE_FILE | \
 	LANDLOCK_ACCESS_FS_READ_FILE | \
-	LANDLOCK_ACCESS_FS_TRUNCATE)
+	LANDLOCK_ACCESS_FS_TRUNCATE | \
+	LANDLOCK_ACCESS_FS_IOCTL_DEV)
 /* clang-format on */
 
 /*
@@ -1332,11 +1488,18 @@ static int hook_file_alloc_security(struct file *const file)
 	return 0;
 }
 
+static bool is_device(const struct file *const file)
+{
+	const struct inode *inode = file_inode(file);
+
+	return S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
+}
+
 static int hook_file_open(struct file *const file)
 {
 	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
-	access_mask_t open_access_request, full_access_request, allowed_access;
-	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
+	access_mask_t open_access_request, full_access_request, allowed_access,
+		optional_access;
 	const struct landlock_ruleset *const dom =
 		get_fs_domain(landlock_cred(file->f_cred)->domain);
 
@@ -1354,6 +1517,10 @@ static int hook_file_open(struct file *const file)
 	 * We look up more access than what we immediately need for open(), so
 	 * that we can later authorize operations on opened files.
 	 */
+	optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
+	if (is_device(file))
+		optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
+
 	full_access_request = open_access_request | optional_access;
 
 	if (is_access_to_paths_allowed(
@@ -1410,6 +1577,52 @@ static int hook_file_truncate(struct file *const file)
 	return -EACCES;
 }
 
+static int hook_file_ioctl(struct file *file, unsigned int cmd,
+			   unsigned long arg)
+{
+	access_mask_t allowed_access = landlock_file(file)->allowed_access;
+
+	/*
+	 * It is the access rights at the time of opening the file which
+	 * determine whether IOCTL can be used on the opened file later.
+	 *
+	 * The access right is attached to the opened file in hook_file_open().
+	 */
+	if (allowed_access & LANDLOCK_ACCESS_FS_IOCTL_DEV)
+		return 0;
+
+	if (!is_device(file))
+		return 0;
+
+	if (is_masked_device_ioctl(cmd))
+		return 0;
+
+	return -EACCES;
+}
+
+static int hook_file_ioctl_compat(struct file *file, unsigned int cmd,
+				  unsigned long arg)
+{
+	access_mask_t allowed_access = landlock_file(file)->allowed_access;
+
+	/*
+	 * It is the access rights at the time of opening the file which
+	 * determine whether IOCTL can be used on the opened file later.
+	 *
+	 * The access right is attached to the opened file in hook_file_open().
+	 */
+	if (allowed_access & LANDLOCK_ACCESS_FS_IOCTL_DEV)
+		return 0;
+
+	if (!is_device(file))
+		return 0;
+
+	if (is_masked_device_ioctl_compat(cmd))
+		return 0;
+
+	return -EACCES;
+}
+
 static struct security_hook_list landlock_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(inode_free_security, hook_inode_free_security),
 
@@ -1432,6 +1645,8 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(file_alloc_security, hook_file_alloc_security),
 	LSM_HOOK_INIT(file_open, hook_file_open),
 	LSM_HOOK_INIT(file_truncate, hook_file_truncate),
+	LSM_HOOK_INIT(file_ioctl, hook_file_ioctl),
+	LSM_HOOK_INIT(file_ioctl_compat, hook_file_ioctl_compat),
 };
 
 __init void landlock_add_fs_hooks(void)
diff --git a/security/landlock/limits.h b/security/landlock/limits.h
index 93c9c6f91556..20fdb5ff3514 100644
--- a/security/landlock/limits.h
+++ b/security/landlock/limits.h
@@ -18,7 +18,7 @@
 #define LANDLOCK_MAX_NUM_LAYERS		16
 #define LANDLOCK_MAX_NUM_RULES		U32_MAX
 
-#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_TRUNCATE
+#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_IOCTL_DEV
 #define LANDLOCK_MASK_ACCESS_FS		((LANDLOCK_LAST_ACCESS_FS << 1) - 1)
 #define LANDLOCK_NUM_ACCESS_FS		__const_hweight64(LANDLOCK_MASK_ACCESS_FS)
 #define LANDLOCK_SHIFT_ACCESS_FS	0
diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
index 6788e73b6681..9ae3dfa47443 100644
--- a/security/landlock/syscalls.c
+++ b/security/landlock/syscalls.c
@@ -149,7 +149,7 @@ static const struct file_operations ruleset_fops = {
 	.write = fop_dummy_write,
 };
 
-#define LANDLOCK_ABI_VERSION 4
+#define LANDLOCK_ABI_VERSION 5
 
 /**
  * sys_landlock_create_ruleset - Create a new ruleset
@@ -321,7 +321,11 @@ static int add_rule_path_beneath(struct landlock_ruleset *const ruleset,
 	if (!path_beneath_attr.allowed_access)
 		return -ENOMSG;
 
-	/* Checks that allowed_access matches the @ruleset constraints. */
+	/*
+	 * Checks that allowed_access matches the @ruleset constraints and only
+	 * consists of publicly visible access rights (as opposed to synthetic
+	 * ones).
+	 */
 	mask = landlock_get_raw_fs_access_mask(ruleset, 0);
 	if ((path_beneath_attr.allowed_access | mask) != mask)
 		return -EINVAL;
diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
index a6f89aaea77d..3c1e9f35b531 100644
--- a/tools/testing/selftests/landlock/base_test.c
+++ b/tools/testing/selftests/landlock/base_test.c
@@ -75,7 +75,7 @@ TEST(abi_version)
 	const struct landlock_ruleset_attr ruleset_attr = {
 		.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
 	};
-	ASSERT_EQ(4, landlock_create_ruleset(NULL, 0,
+	ASSERT_EQ(5, landlock_create_ruleset(NULL, 0,
 					     LANDLOCK_CREATE_RULESET_VERSION));
 
 	ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0,
diff --git a/tools/testing/selftests/landlock/fs_test.c b/tools/testing/selftests/landlock/fs_test.c
index 9a6036fbf289..418ad745a5dd 100644
--- a/tools/testing/selftests/landlock/fs_test.c
+++ b/tools/testing/selftests/landlock/fs_test.c
@@ -529,9 +529,10 @@ TEST_F_FORK(layout1, inval)
 	LANDLOCK_ACCESS_FS_EXECUTE | \
 	LANDLOCK_ACCESS_FS_WRITE_FILE | \
 	LANDLOCK_ACCESS_FS_READ_FILE | \
-	LANDLOCK_ACCESS_FS_TRUNCATE)
+	LANDLOCK_ACCESS_FS_TRUNCATE | \
+	LANDLOCK_ACCESS_FS_IOCTL_DEV)
 
-#define ACCESS_LAST LANDLOCK_ACCESS_FS_TRUNCATE
+#define ACCESS_LAST LANDLOCK_ACCESS_FS_IOCTL_DEV
 
 #define ACCESS_ALL ( \
 	ACCESS_FILE | \
-- 
2.44.0.478.gd926399ef9-goog


^ permalink raw reply related	[relevance 6%]

* [PATCH v14 00/12] Landlock: IOCTL support
@ 2024-04-05 21:40  2% Günther Noack
  2024-04-05 21:40  6% ` [PATCH v14 02/12] landlock: Add IOCTL access right for character and block devices Günther Noack
  0 siblings, 1 reply; 200+ results
From: Günther Noack @ 2024-04-05 21:40 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Jeff Xu, Arnd Bergmann, Jorge Lucangeli Obes, Allen Webb,
	Dmitry Torokhov, Paul Moore, Konstantin Meskhidze,
	Matt Bobrowski, linux-fsdevel, Günther Noack

Hello!

These patches add simple ioctl(2) support to Landlock.

Objective
~~~~~~~~~

Make ioctl(2) requests for device files restrictable with Landlock,
in a way that is useful for real-world applications.

Proposed approach
~~~~~~~~~~~~~~~~~

Introduce the LANDLOCK_ACCESS_FS_IOCTL_DEV right, which restricts the
use of ioctl(2) on block and character devices.

We attach this access right to opened file descriptors, as we
already do for LANDLOCK_ACCESS_FS_TRUNCATE.

If LANDLOCK_ACCESS_FS_IOCTL_DEV is handled (restricted in the
ruleset), the LANDLOCK_ACCESS_FS_IOCTL_DEV right governs the use of
all device-specific IOCTL commands.  We make exceptions for common and
known-harmless IOCTL commands such as FIOCLEX, FIONCLEX, FIONBIO and
FIOASYNC, as well as other IOCTL commands which are implemented in
fs/ioctl.c.  A full list of these IOCTL commands is listed in the
documentation.

I believe that this approach works for the majority of use cases, and
offers a good trade-off between complexity of the Landlock API and
implementation and flexibility when the feature is used.

Current limitations
~~~~~~~~~~~~~~~~~~~

With this patch set, ioctl(2) requests can *not* be filtered based on
file type, device number (dev_t) or on the ioctl(2) request number.

On the initial RFC patch set [1], we have reached consensus to start
with this simpler coarse-grained approach, and build additional IOCTL
restriction capabilities on top in subsequent steps.

[1] https://lore.kernel.org/linux-security-module/d4f1395c-d2d4-1860-3a02-2a0c023dd761@digikod.net/

Notable implications of this approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* A process's existing open file descriptors stay unaffected
  when a process enables Landlock.

  This means that in common scenarios, where the terminal file
  descriptor is inherited from the parent process, the terminal's
  IOCTLs (ioctl_tty(2)) continue to work.

* ioctl(2) continues to be available for file descriptors for
  non-device files.  Example: Network sockets, memfd_create(2),
  regular files and directories.

Examples
~~~~~~~~

Starting a sandboxed shell from $HOME with samples/landlock/sandboxer:

  LL_FS_RO=/ LL_FS_RW=. ./sandboxer /bin/bash

The LANDLOCK_ACCESS_FS_IOCTL_DEV right is part of the "read-write"
rights here, so we expect that newly opened device files outside of
$HOME don't work with most IOCTL commands.

  * "stty" works: It probes terminal properties

  * "stty </dev/tty" fails: /dev/tty can be reopened, but the IOCTL is
    denied.

  * "eject" fails: ioctls to use CD-ROM drive are denied.

  * "ls /dev" works: It uses ioctl to get the terminal size for
    columnar layout

  * The text editors "vim" and "mg" work.  (GNU Emacs fails because it
    attempts to reopen /dev/tty.)

Unaffected IOCTL commands
~~~~~~~~~~~~~~~~~~~~~~~~~

To decide which IOCTL commands should be blanket-permitted, we went
through the list of IOCTL commands which are handled directly in
fs/ioctl.c and looked at them individually to understand what they are
about.

The following commands are permitted by Landlock unconditionally:

 * FIOCLEX, FIONCLEX - these work on the file descriptor and
   manipulate the close-on-exec flag (also available through
   fcntl(2) with F_SETFD)
 * FIONBIO, FIOASYNC - these work on the struct file and enable
   nonblocking-IO and async flags (also available through
   fcntl(2) with F_SETFL)

The following commands are also unconditionally permitted by Landlock, because
they are really operating on the file system's superblock, rather than on the
file itself (the same functionality is also available from any other file on the
same file system):

 * FIFREEZE, FITHAW - work on superblock(!) to freeze/thaw the file
   system. Requires CAP_SYS_ADMIN.
 * FIGETBSZ - get file system blocksize
 * FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH - getting file system properties

Notably, the command FIONREAD is *not* blanket-permitted,
because its implementation is device-specific.

Detailed reasoning about each IOCTL command from fs/ioctl.c is in
is_masked_device_ioctl() in security/landlock/fs.c.


Related Work
~~~~~~~~~~~~

OpenBSD's pledge(2) [2] restricts ioctl(2) independent of the file
descriptor which is used.  The implementers maintain multiple
allow-lists of predefined ioctl(2) operations required for different
application domains such as "audio", "bpf", "tty" and "inet".

OpenBSD does not guarantee backwards compatibility to the same extent
as Linux does, so it's easier for them to update these lists in later
versions.  It might not be a feasible approach for Linux though.

[2] https://man.openbsd.org/OpenBSD-7.4/pledge.2


Implementation Rationale
~~~~~~~~~~~~~~~~~~~~~~~~

A main constraint of this implementation is that the blanket-permitted
IOCTL commands for device files should never dispatch to the
device-specific implementations in f_ops->unlocked_ioctl() and
f_ops->compat_ioctl().

There are many implementations of these f_ops operations and they are
too scattered across the kernel to give strong guarantees about them.
Additionally, some existing implementations perform work before even
checking whether they support the cmd number which was passed to them.


In this implementation, we are listing the blanket-permitted IOCTL
commands in the Landlock implementation, mirroring a subset of the
IOCTL commands which are directly implemented in do_vfs_ioctl() in
fs/ioctl.c.  The trade-off is that the Landlock LSM needs to track
future developments in fs/ioctl.c to keep up to date with that, in
particular when new IOCTL commands are introduced there, or when they
are moved there from the f_ops implementations.

We mitigate this risk in this patch set by adding fs/ioctl.c to the
paths that are relevant to Landlock in the MAINTAINERS file.

The trade-off is discussed in more detail in [3].


Previous versions of this patch set have used different implementation
approaches to guarantee the main constraint above, which we have
dismissed due to the following reasons:

* V10: Introduced a new LSM hook file_vfs_ioctl, which gets invoked
  just before the call to f_ops->unlocked_ioctl().

  Not done, because it would have created an avoidable overlap between
  the file_ioctl and file_vfs_ioctl LSM hooks [4].

* V11: Introduced an indirection layer in fs/ioctl.c, so that Landlock
  could figure out the list of IOCTL commands which are handled by
  do_vfs_ioctl().

  Not done due to additional indirection and possible performance
  impact in fs/ioctl.c [5]

* V12: Introduced a special error code to be returned from the
  file_ioctl hook, and matching logic that would disallow the call to
  f_ops->unlocked_ioctl() in case that this error code is returned.

  Not done because this approach would conflict with Landlock's
  planned audit logging [6] and because LSM hooks with special error
  codes are generally discouraged and have lead to problems in the
  past [7].

Thanks to Arnd Bergmann, Christian Brauner, Kent Overstreet, Mickaël Salaün and
Paul Moore for guiding this implementation on the right track!

[3] https://lore.kernel.org/all/ZgLJG0aN0psur5Z7@google.com/
[4] https://lore.kernel.org/all/CAHC9VhRojXNSU9zi2BrP8z6JmOmT3DAqGNtinvvz=tL1XhVdyg@mail.gmail.com/
[5] https://lore.kernel.org/all/32b1164e-9d5f-40c0-9a4e-001b2c9b822f@app.fastmail.com
[6] https://lore.kernel.org/all/20240326.ahyaaPa0ohs6@digikod.net
[7] https://lore.kernel.org/all/CAHC9VhQJFWYeheR-EqqdfCq0YpvcQX5Scjfgcz1q+jrWg8YsdA@mail.gmail.com/


Changes
~~~~~~~

V14:
 * Revise which IOCTLs are permitted.
   It is almost the same as the vfs_masked_device_ioctl() hooks from
   https://lore.kernel.org/all/20240219183539.2926165-1-mic@digikod.net/,
   with the following differences:
   * Added cases for FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH
   * Do not blanket-permit FS_IOC_{GET,SET}{FLAGS,XATTR}.
     They fall back to the device implementation.
 * fs/ioctl:
   * Small prerequisite change so that FS_IOC_GETFSUUID and
     FS_IOC_GETFSSYSFSPATH do not fall back to the device implementation.
   * Slightly rephrase wording in the warning above do_vfs_ioctl().
 * Implement compat handler
 * Improve UAPI header documentation
 * Code structure
   * Change helper function style to return a boolean
   * Reorder structure of the IOCTL hooks (much cleaner now -- thanks for the
     hint, Mickaël!)
   * Extract is_device() helper

V13:
 * Using the existing file_ioctl hook and a hardcoded list of IOCTL commands.
   (See the section on implementation rationale above.)
 * Add support for FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH.
   
V12:
 * Rebased on Arnd's proposal:
   https://lore.kernel.org/all/32b1164e-9d5f-40c0-9a4e-001b2c9b822f@app.fastmail.com/
   This means that:
   * the IOCTL security hooks can return a special value ENOFILEOPS,
     which is treated specially in fs/ioctl.c to permit the IOCTL,
     but only as long as it does not call f_ops->unlocked_ioctl or
     f_ops->compat_ioctl.
 * The only change compared to V11 is commit 1, as well as a small
   adaptation in the commit 2 (The Landlock implementation needs to
   return the new special value).  The tests and documentation commits
   are exactly the same as before.

V11:
 * Rebased on Mickaël's proposal to refactor fs/ioctl.c:
   https://lore.kernel.org/all/20240315145848.1844554-1-mic@digikod.net/
   This means that:
   * we do not add the file_vfs_ioctl() hook as in V10
   * we add vfs_get_ioctl_handler() instead, so that Landlock
     can query which of the IOCTL commands in handled in do_vfs_ioctl()

   That proposal is used here unmodified (except for minor typos in the commit
   description).
 * Use the hook_ioctl_compat LSM hook as well.

V10:
 * Major change: only restrict IOCTL invocations on device files
   * Rename access right to LANDLOCK_ACCESS_FS_IOCTL_DEV
   * Remove the notion of synthetic access rights and IOCTL right groups
 * Introduce a new LSM hook file_vfs_ioctl, which gets invoked just
   before the call to f_ops->unlocked_ioctl()
 * Documentation
   * Various complications were removed or simplified:
     * Suggestion to mount file systems as nodev is not needed any more,
       as Landlock already lets users distinguish device files.
     * Remarks about fscrypt were removed.  The fscrypt-related IOCTLs only
       applied to regular files and directories, so this patch does not affect
       them any more.
     * Various documentation of the IOCTL grouping approach was removed,
       as it's not needed any more.

V9:
 * in “landlock: Add IOCTL access right”:
   * Change IOCTL group names and grouping as discussed with Mickaël.
     This makes the grouping coarser, and we occasionally rely on the
     underlying implementation to perform the appropriate read/write
     checks.
     * Group IOCTL_RW (one of READ_FILE, WRITE_FILE or READ_DIR):
       FIONREAD, FIOQSIZE, FIGETBSZ
     * Group IOCTL_RWF (one of READ_FILE or WRITE_FILE):
       FS_IOC_FIEMAP, FIBMAP, FIDEDUPERANGE, FICLONE, FICLONERANGE,
       FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64,
       FS_IOC_ZERO_RANGE
   * Exempt pipe file descriptors from IOCTL restrictions,
     even for named pipes which are opened from the file system.
     This is to be consistent with anonymous pipes created with pipe(2).
     As discussed in https://lore.kernel.org/r/ZP7lxmXklksadvz+@google.com
   * Document rationale for the IOCTL grouping in the code
   * Use __attribute_const__
   * Rename required_ioctl_access() to get_required_ioctl_access()
 * Selftests
   * Simplify IOCTL test fixtures as a result of simpler grouping.
   * Test that IOCTLs are permitted on named pipe FDs.
   * Test that IOCTLs are permitted on named Unix Domain Socket FDs.
   * Work around compilation issue with old GCC / glibc.
     https://sourceware.org/glibc/wiki/Synchronizing_Headers
     Thanks to Huyadi <hu.yadi@h3c.com>, who pointed this out in
     https://lore.kernel.org/all/f25be6663bcc4608adf630509f045a76@h3c.com/
     and Mickaël, who fixed it through #include reordering.
 * Documentation changes
   * Reword "IOCTL commands" section a bit
   * s/permit/allow/
   * s/access right/right/, if preceded by LANDLOCK_ACCESS_FS_*
   * s/IOCTL/FS_IOCTL/ in ASCII table
   * Update IOCTL grouping documentation in header file
 * Removed a few of the earlier commits in this patch set,
   which have already been merged.

V8:
 * Documentation changes
   * userspace-api/landlock.rst:
     * Add an extra paragraph about how the IOCTL right combines
       when used with other access rights.
     * Explain better the circumstances under which passing of
       file descriptors between different Landlock domains can happen
   * limits.h: Add comment to explain public vs internal FS access rights
   * Add a paragraph in the commit to explain better why the IOCTL
     right works as it does

V7:
 * in “landlock: Add IOCTL access right”:
   * Make IOCTL_GROUPS a #define so that static_assert works even on
     old compilers (bug reported by Intel about PowerPC GCC9 config)
   * Adapt indentation of IOCTL_GROUPS definition
   * Add missing dots in kernel-doc comments.
 * in “landlock: Remove remaining "inline" modifiers in .c files”:
   * explain reasoning in commit message

V6:
 * Implementation:
   * Check that only publicly visible access rights can be used when adding a
     rule (rather than the synthetic ones).  Thanks Mickaël for spotting that!
   * Move all functionality related to IOCTL groups and synthetic access rights
     into the same place at the top of fs.c
   * Move kernel doc to the .c file in one instance
   * Smaller code style issues (upcase IOCTL, vardecl at block start)
   * Remove inline modifier from functions in .c files
 * Tests:
   * use SKIP
   * Rename 'fd' to dir_fd and file_fd where appropriate
   * Remove duplicate "ioctl" mentions from test names
   * Rename "permitted" to "allowed", in ioctl and ftruncate tests
   * Do not add rules if access is 0, in test helper

V5:
 * Implementation:
   * move IOCTL group expansion logic into fs.c (implementation suggested by
     mic)
   * rename IOCTL_CMD_G* constants to LANDLOCK_ACCESS_FS_IOCTL_GROUP*
   * fs.c: create ioctl_groups constant
   * add "const" to some variables
 * Formatting and docstring fixes (including wrong kernel-doc format)
 * samples/landlock: fix ABI version and fallback attribute (mic)
 * Documentation
   * move header documentation changes into the implementation commit
   * spell out how FIFREEZE, FITHAW and attribute-manipulation ioctls from
     fs/ioctl.c are handled
   * change ABI 4 to ABI 5 in some missing places

V4:
 * use "synthetic" IOCTL access rights, as previously discussed
 * testing changes
   * use a large fixture-based test, for more exhaustive coverage,
     and replace some of the earlier tests with it
 * rebased on mic-next

V3:
 * always permit the IOCTL commands FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC and
   FIONREAD, independent of LANDLOCK_ACCESS_FS_IOCTL
 * increment ABI version in the same commit where the feature is introduced
 * testing changes
   * use FIOQSIZE instead of TTY IOCTL commands
     (FIOQSIZE works with regular files, directories and memfds)
   * run the memfd test with both Landlock enabled and disabled
   * add a test for the always-permitted IOCTL commands

V2:
 * rebased on mic-next
 * added documentation
 * exercise ioctl(2) in the memfd test
 * test: Use layout0 for the test

---

V1: https://lore.kernel.org/all/20230502171755.9788-1-gnoack3000@gmail.com/
V2: https://lore.kernel.org/all/20230623144329.136541-1-gnoack@google.com/
V3: https://lore.kernel.org/all/20230814172816.3907299-1-gnoack@google.com/
V4: https://lore.kernel.org/all/20231103155717.78042-1-gnoack@google.com/
V5: https://lore.kernel.org/all/20231117154920.1706371-1-gnoack@google.com/
V6: https://lore.kernel.org/all/20231124173026.3257122-1-gnoack@google.com/
V7: https://lore.kernel.org/all/20231201143042.3276833-1-gnoack@google.com/
V8: https://lore.kernel.org/all/20231208155121.1943775-1-gnoack@google.com/
V9: https://lore.kernel.org/all/20240209170612.1638517-1-gnoack@google.com/
V10: https://lore.kernel.org/all/20240309075320.160128-1-gnoack@google.com/
V11: https://lore.kernel.org/all/20240322151002.3653639-1-gnoack@google.com/
V12: https://lore.kernel.org/all/20240325134004.4074874-1-gnoack@google.com/
V13: https://lore.kernel.org/all/20240327131040.158777-1-gnoack@google.com/

Günther Noack (12):
  fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH
    fail
  landlock: Add IOCTL access right for character and block devices
  selftests/landlock: Test IOCTL support
  selftests/landlock: Test IOCTL with memfds
  selftests/landlock: Test ioctl(2) and ftruncate(2) with open(O_PATH)
  selftests/landlock: Test IOCTLs on named pipes
  selftests/landlock: Check IOCTL restrictions for named UNIX domain
    sockets
  selftests/landlock: Exhaustive test for the IOCTL allow-list
  samples/landlock: Add support for LANDLOCK_ACCESS_FS_IOCTL_DEV
  landlock: Document IOCTL support
  MAINTAINERS: Notify Landlock maintainers about changes to fs/ioctl.c
  fs/ioctl: Add a comment to keep the logic in sync with LSM policies

 Documentation/userspace-api/landlock.rst     |  76 ++-
 MAINTAINERS                                  |   1 +
 fs/ioctl.c                                   |   7 +-
 include/uapi/linux/landlock.h                |  38 +-
 samples/landlock/sandboxer.c                 |  13 +-
 security/landlock/fs.c                       | 221 ++++++++-
 security/landlock/limits.h                   |   2 +-
 security/landlock/syscalls.c                 |   8 +-
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   | 491 ++++++++++++++++++-
 10 files changed, 813 insertions(+), 46 deletions(-)


base-commit: e9df9344b6f3e5e1c745a71f125ff4b5c6ddc96b
-- 
2.44.0.478.gd926399ef9-goog


^ permalink raw reply	[relevance 2%]

* Re: [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices
  2024-04-05 16:17  0%         ` Günther Noack
@ 2024-04-05 18:01  0%           ` Mickaël Salaün
  0 siblings, 0 replies; 200+ results
From: Mickaël Salaün @ 2024-04-05 18:01 UTC (permalink / raw)
  To: Günther Noack
  Cc: linux-security-module, Jeff Xu, Arnd Bergmann,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel,
	Christian Brauner

On Fri, Apr 05, 2024 at 06:17:17PM +0200, Günther Noack wrote:
> On Wed, Apr 03, 2024 at 01:15:45PM +0200, Mickaël Salaün wrote:
> > On Tue, Apr 02, 2024 at 08:28:49PM +0200, Günther Noack wrote:
> > > On Wed, Mar 27, 2024 at 05:57:31PM +0100, Mickaël Salaün wrote:
> > > > On Wed, Mar 27, 2024 at 01:10:31PM +0000, Günther Noack wrote:
> > > > > +	case FIOQSIZE:
> > > > > +		/*
> > > > > +		 * FIOQSIZE queries the size of a regular file or directory.
> > > > > +		 *
> > > > > +		 * This IOCTL command only applies to regular files and
> > > > > +		 * directories.
> > > > > +		 */
> > > > > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > > > 
> > > > This should always be allowed because do_vfs_ioctl() never returns
> > > > -ENOIOCTLCMD for this command.  That's why I wrote
> > > > vfs_masked_device_ioctl() this way [1].  I think it would be easier to
> > > > read and maintain this code with a is_masked_device_ioctl() logic.  Listing
> > > > commands that are not masked makes it difficult to review because
> > > > allowed and denied return codes are interleaved.
> > > 
> > > Oh, I misunderstood you on [2], I think -- I was under the impression that you
> > > wanted to keep the switch case in the same order (and with the same entries?) as
> > > the original in do_vfs_ioctl.  So you'd prefer to only list the always-allowed
> > > IOCTL commands here, as you have done in vfs_masked_device_ioctl() [3]?
> > > 
> > > [2] https://lore.kernel.org/all/20240326.ooCheem1biV2@digikod.net/
> > > [3] https://lore.kernel.org/all/20240219183539.2926165-1-mic@digikod.net/
> > 
> > That was indeed unclear.  About IOCTL commands, the same order ease
> > reviewing and maintenance but we don't need to list all commands,
> > which will limit updates of this list.  However, for the current
> > unused/unmasked one, we can still add them very briefly in comments as I
> > did with FIONREAD and file_ioctl()'s ones in vfs_masked_device_ioctl().
> > Only listing the "masked" ones (for device case) shorten the list, and
> > having a list with the same semantic ("mask device-specific IOCTLs")
> > ease review and maintenance as well.
> > 
> > > 
> > > Can you please clarify how you make up your mind about what should be permitted
> > > and what should not?  I have trouble understanding the rationale for the changes
> > > that you asked for below, apart from the points that they are harmless and that
> > > the return codes should be consistent.
> > 
> > The rationale is the same: all IOCTL commands that are not
> > passed/specific to character or block devices (i.e. IOCTLs defined in
> > fs/ioctl.c) are allowed.  vfs_masked_device_ioctl() returns true if the
> > IOCTL command is not passed to the related device driver but handled by
> > fs/ioctl.c instead (i.e. handled by the VFS layer).
> 
> Thanks for clarifying -- this makes more sense now.  I traced the cases with
> -ENOIOCTLCMD through the code more thoroughly and it is more aligned now with
> what you implemented before.  The places where I ended up implementing it
> differently to your vfs_masked_device_ioctl() patch are:
> 
>  * Do not blanket-permit FS_IOC_{GET,SET}{FLAGS,XATTR}.
>    They fall back to the device implementation.
> 
>  * FS_IOC_GETUUID and FS_IOC_GETFSSYSFSPATH are now handled.
>    These return -ENOIOCTLCMD from do_vfs_ioctl(), so they do fall back to the
>    handlers in struct file_operations, so we can not permit these either.

Good catch!

> 
> These seem like pretty clear cases to me.
> 
> 
> > > The criteria that I have used in this patch set are that (a) it is implemented
> > > in do_vfs_ioctl() rather than further below, and (b) it makes sense to use that
> > > command on a device file.  (If we permit FIOQSIZE, FS_IOC_FIEMAP and others
> > > here, we will get slightly more correct error codes in these cases, but the
> > > IOCTLs will still not work, because they are not useful and not implemented for
> > > devices. -- On the other hand, we are also increasing the exposed code surface a
> > > bit.  For example, FS_IOC_FIEMAP is calling into inode->i_op->fiemap().  That is
> > > probably harmless for device files, but requires us to reason at a deeper level
> > > to convince ourselves of that.)
> > 
> > FIOQSIZE is fully handled by do_vfs_ioctl(), and FS_IOC_FIEMAP is
> > implemented as the inode level, so it should not be passed at the struct
> > file/device level unless ENOIOCTLCMD is returned (but it should not,
> > right?).  Because it depends on the inode implementation, it looks like
> > this IOCTL may work (in theory) on character or block devices too.  If
> > this is correct, we should not deny it because the semantic of
> > LANDLOCK_ACCESS_FS_IOCTL_DEV is to control IOCTLs passed to device
> > drivers.  Furthermore, as you pointed out, error codes would be
> > unaltered.
> > 
> > It would be good to test (as you suggested IIRC) the masked commands on
> > a simple device (e.g. /dev/null) to check that it returns ENOTTY,
> > EOPNOTSUPP, or EACCES according to our expectations.
> 
> Sounds good, I'll add a test.
> 
> 
> > I agree that this would increase a bit the exposed code surface but I'm
> > pretty sure that if a sandboxed process is allowed to access a device
> > file, it is also allowed to access directory or other file types as well
> > and then would still be able to reach the FS_IOC_FIEMAP implementation.
> 
> I assume you mean FIGETBSZ?  The FS_IOC_FIEMAP IOCTL is the one that returns
> file extent maps, so that user space can reason about whether a file is
> stored contiguously on disk.

I meant FS_IOC_FIEMAP for regular files.

> 
> 
> > I'd like to avoid exceptions as in the current implementation of
> > get_required_ioctl_dev_access() with a switch/case either returning 0 or
> > LANDLOCK_ACCESS_FS_IOCTL_DEV (excluding the default case of course).  An
> > alternative approach would be to group IOCTL command cases according to
> > their returned value, but I find it a bit more complex for no meaningful
> > gain.  What do you think?
> 
> I don't have strong opinions about it, as long as we don't accidentally mess up
> the fallbacks if this changes.
> 
> 
> > > In your implementation at [3], you were permitting FICLONE* and FIDEDUPERANGE,
> > > but not FS_IOC_ZERO_RANGE, which is like fallocate().  How are these cases
> > > different to each other?  Is that on purpose?
> > 
> > FICLONE* and FIDEDUPERANGE are handled by do_vfs_ioctl() for device files
> > too, and the vfs_clone_file_range()/generic_file_rw_checks() check returns
> > EINVAL for device files.  So there is no need to add exceptions for these
> > commands.
> > 
> > FS_IOC_ZERO_RANGE is only implemented for regular files (see
> > file_ioctl() call), so it is passed to device files.
> 
> Makes sense :)
> 
> 
> —Günther
> 

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices
  2024-04-03 11:15  0%       ` Mickaël Salaün
@ 2024-04-05 16:17  0%         ` Günther Noack
  2024-04-05 18:01  0%           ` Mickaël Salaün
  0 siblings, 1 reply; 200+ results
From: Günther Noack @ 2024-04-05 16:17 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-security-module, Jeff Xu, Arnd Bergmann,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel,
	Christian Brauner

On Wed, Apr 03, 2024 at 01:15:45PM +0200, Mickaël Salaün wrote:
> On Tue, Apr 02, 2024 at 08:28:49PM +0200, Günther Noack wrote:
> > On Wed, Mar 27, 2024 at 05:57:31PM +0100, Mickaël Salaün wrote:
> > > On Wed, Mar 27, 2024 at 01:10:31PM +0000, Günther Noack wrote:
> > > > +	case FIOQSIZE:
> > > > +		/*
> > > > +		 * FIOQSIZE queries the size of a regular file or directory.
> > > > +		 *
> > > > +		 * This IOCTL command only applies to regular files and
> > > > +		 * directories.
> > > > +		 */
> > > > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > > 
> > > This should always be allowed because do_vfs_ioctl() never returns
> > > -ENOIOCTLCMD for this command.  That's why I wrote
> > > vfs_masked_device_ioctl() this way [1].  I think it would be easier to
> > > read and maintain this code with a is_masked_device_ioctl() logic.  Listing
> > > commands that are not masked makes it difficult to review because
> > > allowed and denied return codes are interleaved.
> > 
> > Oh, I misunderstood you on [2], I think -- I was under the impression that you
> > wanted to keep the switch case in the same order (and with the same entries?) as
> > the original in do_vfs_ioctl.  So you'd prefer to only list the always-allowed
> > IOCTL commands here, as you have done in vfs_masked_device_ioctl() [3]?
> > 
> > [2] https://lore.kernel.org/all/20240326.ooCheem1biV2@digikod.net/
> > [3] https://lore.kernel.org/all/20240219183539.2926165-1-mic@digikod.net/
> 
> That was indeed unclear.  About IOCTL commands, keeping the same order eases
> reviewing and maintenance, but we don't need to list all commands, which
> will limit updates to this list.  However, for the currently unused/unmasked
> ones, we can still mention them very briefly in comments, as I did with
> FIONREAD and file_ioctl()'s commands in vfs_masked_device_ioctl().
> Only listing the "masked" ones (for the device case) shortens the list, and
> having a list with a single semantic ("mask device-specific IOCTLs")
> eases review and maintenance as well.
> 
> > 
> > Can you please clarify how you make up your mind about what should be permitted
> > and what should not?  I have trouble understanding the rationale for the changes
> > that you asked for below, apart from the points that they are harmless and that
> > the return codes should be consistent.
> 
> The rationale is the same: all IOCTL commands that are not
> passed/specific to character or block devices (i.e. IOCTLs defined in
> fs/ioctl.c) are allowed.  vfs_masked_device_ioctl() returns true if the
> IOCTL command is not passed to the related device driver but handled by
> fs/ioctl.c instead (i.e. handled by the VFS layer).

Thanks for clarifying -- this makes more sense now.  I traced the cases with
-ENOIOCTLCMD through the code more thoroughly and it is more aligned now with
what you implemented before.  The places where I ended up implementing it
differently to your vfs_masked_device_ioctl() patch are:

 * Do not blanket-permit FS_IOC_{GET,SET}{FLAGS,XATTR}.
   They fall back to the device implementation.

 * FS_IOC_GETUUID and FS_IOC_GETFSSYSFSPATH are now handled.
   These return -ENOIOCTLCMD from do_vfs_ioctl(), so they fall back to the
   handlers in struct file_operations, and we therefore cannot permit these either.

These seem like pretty clear cases to me.


> > The criteria that I have used in this patch set are that (a) it is implemented
> > in do_vfs_ioctl() rather than further below, and (b) it makes sense to use that
> > command on a device file.  (If we permit FIOQSIZE, FS_IOC_FIEMAP and others
> > here, we will get slightly more correct error codes in these cases, but the
> > IOCTLs will still not work, because they are not useful and not implemented for
> > devices. -- On the other hand, we are also increasing the exposed code surface a
> > bit.  For example, FS_IOC_FIEMAP is calling into inode->i_op->fiemap().  That is
> > probably harmless for device files, but requires us to reason at a deeper level
> > to convince ourselves of that.)
> 
> FIOQSIZE is fully handled by do_vfs_ioctl(), and FS_IOC_FIEMAP is
> implemented at the inode level, so it should not be passed at the struct
> file/device level unless ENOIOCTLCMD is returned (but it should not,
> right?).  Because it depends on the inode implementation, it looks like
> this IOCTL may work (in theory) on character or block devices too.  If
> this is correct, we should not deny it because the semantic of
> LANDLOCK_ACCESS_FS_IOCTL_DEV is to control IOCTLs passed to device
> drivers.  Furthermore, as you pointed out, error codes would be
> unaltered.
> 
> It would be good to test (as you suggested IIRC) the masked commands on
> a simple device (e.g. /dev/null) to check that it returns ENOTTY,
> EOPNOTSUPP, or EACCES according to our expectations.

Sounds good, I'll add a test.


> I agree that this would increase a bit the exposed code surface but I'm
> pretty sure that if a sandboxed process is allowed to access a device
> file, it is also allowed to access directory or other file types as well
> and then would still be able to reach the FS_IOC_FIEMAP implementation.

I assume you mean FIGETBSZ?  The FS_IOC_FIEMAP IOCTL is the one that returns
file extent maps, so that user space can reason about whether a file is
stored contiguously on disk.


> I'd like to avoid exceptions as in the current implementation of
> get_required_ioctl_dev_access() with a switch/case either returning 0 or
> LANDLOCK_ACCESS_FS_IOCTL_DEV (excluding the default case of course).  An
> alternative approach would be to group IOCTL command cases according to
> their returned value, but I find it a bit more complex for no meaningful
> gain.  What do you think?

I don't have strong opinions about it, as long as we don't accidentally mess up
the fallbacks if this changes.


> > In your implementation at [3], you were permitting FICLONE* and FIDEDUPERANGE,
> > but not FS_IOC_ZERO_RANGE, which is like fallocate().  How are these cases
> > different to each other?  Is that on purpose?
> 
> FICLONE* and FIDEDUPERANGE match device files and the
> vfs_clone_file_range()/generic_file_rw_checks() check returns EINVAL for
> device files.  So there is no need to add exceptions for these commands.
> 
> FS_IOC_ZERO_RANGE is only implemented for regular files (see
> file_ioctl() call), so it is passed to device files.

Makes sense :)


—Günther

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v15 9/9] fuse: auto-invalidate inode attributes in passthrough mode
  @ 2024-04-04 14:07  5%         ` Sweet Tea Dorminy
  0 siblings, 0 replies; 200+ results
From: Sweet Tea Dorminy @ 2024-04-04 14:07 UTC (permalink / raw)
  To: Amir Goldstein, Bernd Schubert; +Cc: Miklos Szeredi, linux-fsdevel


> Sweet Tea,
> 
> Can you please explain the workload where you find that this patch is needed?

I was researching before sending out my own version of attr passthrough 
- it seemed like a step in that direction, but the code in-tree wasn't 
the same.

> Is your workload using mmap writes? requires a long attribute cache timeout?
> Does your workload involve mixing passthrough IO and direct/cached IO
> on the same inode at different times or by different open fd's?
> 
> I would like to know, so I can tell you if getattr() passthrough design is
> going to help your use case.
> 
> For example, my current getattr() passthrough design (in my head)
> will not allow opening the inode in cached IO mode from lookup time
> until evict/forget, unlike the current read/write passthrough, which is
> from first open to last close.

I think the things I'd been working on is very similar.

Two possible HSM variants, both focused on doing passthrough IO with 
minimal involvement from the fuse server in at least some cases.

One would be using passthrough for temporary ingestion of some memory 
state for a workload: the user writes files, and the FUSE server can 
choose to pass them through to local storage temporarily or to send them to 
remote storage -- as ingestion requires pausing the workload and is 
therefore very expensive, I'd like to pass through attr updates to the 
backing file so that there are minimal roundtrips to the fuse server 
during write. Later the HSM would move the files to remote storage, or 
delete them.

One would be using passthrough for binaries -- providing specific sets 
of mostly binaries with some tracking on open/close, so the HSM can 
delete unused sets. Again the goal is to avoid metadata query roundtrips 
to userspace for speed; we don't expect a file open in passthrough mode 
to be opened again for FUSE-server-mediated IO until the passthrough 
version is closed.

Thanks!

Sweet Tea

^ permalink raw reply	[relevance 5%]

* Re: [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices
  2024-04-02 18:28  0%     ` Günther Noack
@ 2024-04-03 11:15  0%       ` Mickaël Salaün
  2024-04-05 16:17  0%         ` Günther Noack
  0 siblings, 1 reply; 200+ results
From: Mickaël Salaün @ 2024-04-03 11:15 UTC (permalink / raw)
  To: Günther Noack
  Cc: linux-security-module, Jeff Xu, Arnd Bergmann,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel,
	Christian Brauner

On Tue, Apr 02, 2024 at 08:28:49PM +0200, Günther Noack wrote:
> Hello!
> 
> Thanks for the review!
> 
> On Wed, Mar 27, 2024 at 05:57:31PM +0100, Mickaël Salaün wrote:
> > On Wed, Mar 27, 2024 at 01:10:31PM +0000, Günther Noack wrote:
> > > Introduces the LANDLOCK_ACCESS_FS_IOCTL_DEV right
> > > and increments the Landlock ABI version to 5.
> > > 
> > > This access right applies to device-custom IOCTL commands
> > > when they are invoked on block or character device files.
> > > 
> > > Like the truncate right, this right is associated with a file
> > > descriptor at the time of open(2), and gets respected even when the
> > > file descriptor is used outside of the thread which it was originally
> > > opened in.
> > > 
> > > Therefore, a newly enabled Landlock policy does not apply to file
> > > descriptors which are already open.
> > > 
> > > If the LANDLOCK_ACCESS_FS_IOCTL_DEV right is handled, only a small
> > > number of safe IOCTL commands will be permitted on newly opened device
> > > files.  These include FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC, as well
> > > as other IOCTL commands for regular files which are implemented in
> > > fs/ioctl.c.
> > > 
> > > Noteworthy scenarios which require special attention:
> > > 
> > > TTY devices are often passed into a process from the parent process,
> > > and so a newly enabled Landlock policy does not retroactively apply to
> > > them automatically.  In the past, TTY devices have often supported
> > > IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which were
> > > letting callers control the TTY input buffer (and simulate
> > > keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
> > > modern kernels though.
> > > 
> > > Known limitations:
> > > 
> > > The LANDLOCK_ACCESS_FS_IOCTL_DEV access right is a coarse-grained
> > > control over IOCTL commands.
> > > 
> > > Landlock users may use path-based restrictions in combination with
> > > their knowledge about the file system layout to control what IOCTLs
> > > can be done.
> > > 
> > > Cc: Paul Moore <paul@paul-moore.com>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Arnd Bergmann <arnd@arndb.de>
> > > Signed-off-by: Günther Noack <gnoack@google.com>
> > > ---
> > >  include/uapi/linux/landlock.h                |  33 +++-
> > >  security/landlock/fs.c                       | 183 ++++++++++++++++++-
> > >  security/landlock/limits.h                   |   2 +-
> > >  security/landlock/syscalls.c                 |   8 +-
> > >  tools/testing/selftests/landlock/base_test.c |   2 +-
> > >  tools/testing/selftests/landlock/fs_test.c   |   5 +-
> > >  6 files changed, 216 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> > > index 25c8d7677539..5d90e9799eb5 100644
> > > --- a/include/uapi/linux/landlock.h
> > > +++ b/include/uapi/linux/landlock.h
> > > @@ -128,7 +128,7 @@ struct landlock_net_port_attr {
> > >   * files and directories.  Files or directories opened before the sandboxing
> > >   * are not subject to these restrictions.
> > >   *
> > > - * A file can only receive these access rights:
> > > + * The following access rights apply only to files:
> > >   *
> > >   * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
> > >   * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
> > > @@ -138,12 +138,13 @@ struct landlock_net_port_attr {
> > >   * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
> > >   * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
> > >   *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
> > > - *   ``O_TRUNC``. Whether an opened file can be truncated with
> > > - *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
> > > - *   same way as read and write permissions are checked during
> > > - *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
> > > - *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
> > > - *   third version of the Landlock ABI.
> > > + *   ``O_TRUNC``.  This access right is available since the third version of the
> > > + *   Landlock ABI.
> > > + *
> > > + * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
> > > + * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
> > > + * read and write permissions are checked during :manpage:`open(2)` using
> > > + * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
> > >   *
> > >   * A directory can receive access rights related to files or directories.  The
> > >   * following access right is applied to the directory itself, and the
> > > @@ -198,13 +199,28 @@ struct landlock_net_port_attr {
> > >   *   If multiple requirements are not met, the ``EACCES`` error code takes
> > >   *   precedence over ``EXDEV``.
> > >   *
> > > + * The following access right applies both to files and directories:
> > > + *
> > > + * - %LANDLOCK_ACCESS_FS_IOCTL_DEV: Invoke :manpage:`ioctl(2)` commands on an opened
> > > + *   character or block device.
> > > + *
> > > + *   This access right applies to all `ioctl(2)` commands implemented by device
> > 
> > :manpage:`ioctl(2)`
> > 
> > > + *   drivers.  However, the following common IOCTL commands continue to be
> > > + *   invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL_DEV right:
> > 
> > This is good but explaining the rationale could help, something like
> > this (taking care of not packing lines listing commands to ease review
> > when a new command will be added):
> > 
> > IOCTL commands targeting file descriptors (``FIOCLEX``, ``FIONCLEX``),
> > file descriptions (``FIONBIO``, ``FIOASYNC``),
> > file systems (``FIOQSIZE``, ``FS_IOC_FIEMAP``, ``FICLONE``,
> > ``FICLONERANGE``, ``FIDEDUPERANGE``, ``FS_IOC_GETFLAGS``,
> > ``FS_IOC_SETFLAGS``, ``FS_IOC_FSGETXATTR``, ``FS_IOC_FSSETXATTR``),
> > or superblocks (``FIFREEZE``, ``FITHAW``, ``FIGETBSZ``,
> > ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``)
> > are never denied.  However, such IOCTL commands still require an opened
> > file and may not be available for every file type.  Read or write
> > permission may be checked by the underlying implementation, as well as
> > capabilities.
> 
> OK, I'll add some more explanation in the next version.
> 
> 
> > > + *   ``FIOCLEX``, ``FIONCLEX``, ``FIONBIO``, ``FIOASYNC``, ``FIFREEZE``,
> > > + *   ``FITHAW``, ``FIGETBSZ``, ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``
> > > + *
> > > + *   This access right is available since the fifth version of the Landlock
> > > + *   ABI.
> > > + *
> > >   * .. warning::
> > >   *
> > >   *   It is currently not possible to restrict some file-related actions
> > >   *   accessible through these syscall families: :manpage:`chdir(2)`,
> > >   *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
> > >   *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
> > > - *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
> > > + *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
> > >   *   Future Landlock evolutions will enable to restrict them.
> > >   */
> > >  /* clang-format off */
> > > @@ -223,6 +239,7 @@ struct landlock_net_port_attr {
> > >  #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
> > >  #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
> > >  #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
> > > +#define LANDLOCK_ACCESS_FS_IOCTL_DEV			(1ULL << 15)
> > >  /* clang-format on */
> > >  
> > >  /**
> > > diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> > > index c15559432d3d..2ef6c57fa20b 100644
> > > --- a/security/landlock/fs.c
> > > +++ b/security/landlock/fs.c
> > > @@ -7,6 +7,7 @@
> > >   * Copyright © 2021-2022 Microsoft Corporation
> > >   */
> > >  
> > > +#include <asm/ioctls.h>
> > >  #include <kunit/test.h>
> > >  #include <linux/atomic.h>
> > >  #include <linux/bitops.h>
> > > @@ -14,6 +15,7 @@
> > >  #include <linux/compiler_types.h>
> > >  #include <linux/dcache.h>
> > >  #include <linux/err.h>
> > > +#include <linux/falloc.h>
> > >  #include <linux/fs.h>
> > >  #include <linux/init.h>
> > >  #include <linux/kernel.h>
> > > @@ -29,6 +31,7 @@
> > >  #include <linux/types.h>
> > >  #include <linux/wait_bit.h>
> > >  #include <linux/workqueue.h>
> > > +#include <uapi/linux/fiemap.h>
> > >  #include <uapi/linux/landlock.h>
> > >  
> > >  #include "common.h"
> > > @@ -84,6 +87,141 @@ static const struct landlock_object_underops landlock_fs_underops = {
> > >  	.release = release_inode
> > >  };
> > >  
> > > +/* IOCTL helpers */
> > > +
> > > +/**
> > > + * get_required_ioctl_dev_access(): Determine required access rights for IOCTLs
> > > + * on device files.
> > > + *
> > > + * @cmd: The IOCTL command that is supposed to be run.
> > > + *
> > > + * By default, any IOCTL on a device file requires the
> > > + * LANDLOCK_ACCESS_FS_IOCTL_DEV right.  We make exceptions for commands, if:
> > > + *
> > > + * 1. The command is implemented in fs/ioctl.c's do_vfs_ioctl(),
> > > + *    not in f_ops->unlocked_ioctl() or f_ops->compat_ioctl().
> > > + *
> > > + * 2. The command can be reasonably used on a device file at all.
> > > + *
> > > + * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
> > > + * should be considered for inclusion here.
> > > + *
> > > + * Returns: The access rights that must be granted on an opened file in order to
> > > + * use the given @cmd.
> > > + */
> > > +static __attribute_const__ access_mask_t
> > > +get_required_ioctl_dev_access(const unsigned int cmd)
> > > +{
> > > +	switch (cmd) {
> > > +	case FIOCLEX:
> > > +	case FIONCLEX:
> > > +	case FIONBIO:
> > > +	case FIOASYNC:
> > > +		/*
> > > +		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
> > > +		 * close-on-exec and the file's buffered-IO and async flags.
> > > +		 * These operations are also available through fcntl(2), and are
> > > +		 * unconditionally permitted in Landlock.
> > > +		 */
> > > +		return 0;
> > > +	case FIOQSIZE:
> > > +		/*
> > > +		 * FIOQSIZE queries the size of a regular file or directory.
> > > +		 *
> > > +		 * This IOCTL command only applies to regular files and
> > > +		 * directories.
> > > +		 */
> > > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > 
> > This should always be allowed because do_vfs_ioctl() never returns
> > -ENOIOCTLCMD for this command.  That's why I wrote
> > vfs_masked_device_ioctl() this way [1].  I think it would be easier to
> > read and maintain this code with a is_masked_device_ioctl() logic.  Listing
> > commands that are not masked makes it difficult to review because
> > allowed and denied return codes are interleaved.
> 
> Oh, I misunderstood you on [2], I think -- I was under the impression that you
> wanted to keep the switch case in the same order (and with the same entries?) as
> the original in do_vfs_ioctl.  So you'd prefer to only list the always-allowed
> IOCTL commands here, as you have done in vfs_masked_device_ioctl() [3]?
> 
> [2] https://lore.kernel.org/all/20240326.ooCheem1biV2@digikod.net/
> [3] https://lore.kernel.org/all/20240219183539.2926165-1-mic@digikod.net/

That was indeed unclear.  About IOCTL commands, keeping the same order eases
reviewing and maintenance, but we don't need to list all commands, which
will limit updates to this list.  However, for the currently unused/unmasked
ones, we can still mention them very briefly in comments, as I did with
FIONREAD and file_ioctl()'s commands in vfs_masked_device_ioctl().
Only listing the "masked" ones (for the device case) shortens the list, and
having a list with a single semantic ("mask device-specific IOCTLs")
eases review and maintenance as well.

> 
> Can you please clarify how you make up your mind about what should be permitted
> and what should not?  I have trouble understanding the rationale for the changes
> that you asked for below, apart from the points that they are harmless and that
> the return codes should be consistent.

The rationale is the same: all IOCTL commands that are not
passed/specific to character or block devices (i.e. IOCTLs defined in
fs/ioctl.c) are allowed.  vfs_masked_device_ioctl() returns true if the
IOCTL command is not passed to the related device driver but handled by
fs/ioctl.c instead (i.e. handled by the VFS layer).

> 
> The criteria that I have used in this patch set are that (a) it is implemented
> in do_vfs_ioctl() rather than further below, and (b) it makes sense to use that
> command on a device file.  (If we permit FIOQSIZE, FS_IOC_FIEMAP and others
> here, we will get slightly more correct error codes in these cases, but the
> IOCTLs will still not work, because they are not useful and not implemented for
> devices. -- On the other hand, we are also increasing the exposed code surface a
> bit.  For example, FS_IOC_FIEMAP is calling into inode->i_op->fiemap().  That is
> probably harmless for device files, but requires us to reason at a deeper level
> to convince ourselves of that.)

FIOQSIZE is fully handled by do_vfs_ioctl(), and FS_IOC_FIEMAP is
implemented at the inode level, so it should not be passed at the struct
file/device level unless ENOIOCTLCMD is returned (but it should not,
right?).  Because it depends on the inode implementation, it looks like
this IOCTL may work (in theory) on character or block devices too.  If
this is correct, we should not deny it because the semantic of
LANDLOCK_ACCESS_FS_IOCTL_DEV is to control IOCTLs passed to device
drivers.  Furthermore, as you pointed out, error codes would be
unaltered.

It would be good to test (as you suggested IIRC) the masked commands on
a simple device (e.g. /dev/null) to check that it returns ENOTTY,
EOPNOTSUPP, or EACCES according to our expectations.

I agree that this would increase a bit the exposed code surface but I'm
pretty sure that if a sandboxed process is allowed to access a device
file, it is also allowed to access directory or other file types as well
and then would still be able to reach the FS_IOC_FIEMAP implementation.

I'd like to avoid exceptions as in the current implementation of
get_required_ioctl_dev_access() with a switch/case either returning 0 or
LANDLOCK_ACCESS_FS_IOCTL_DEV (excluding the default case of course).  An
alternative approach would be to group IOCTL command cases according to
their returned value, but I find it a bit more complex for no meaningful
gain.  What do you think?

> 
> In your implementation at [3], you were permitting FICLONE* and FIDEDUPERANGE,
> but not FS_IOC_ZERO_RANGE, which is like fallocate().  How are these cases
> different to each other?  Is that on purpose?

FICLONE* and FIDEDUPERANGE are handled by do_vfs_ioctl() for device files
too, and the vfs_clone_file_range()/generic_file_rw_checks() check returns
EINVAL for device files.  So there is no need to add exceptions for these
commands.

FS_IOC_ZERO_RANGE is only implemented for regular files (see
file_ioctl() call), so it is passed to device files.

> 
> 
> > [1] https://lore.kernel.org/r/20240219183539.2926165-1-mic@digikod.net
> > 
> > Your IOCTL command explanation comments are nice and they should be kept
> > in is_masked_device_ioctl() (if they mask device IOCTL commands).
> 
> OK
> 
> > 
> > > +	case FIFREEZE:
> > > +	case FITHAW:
> > > +		/*
> > > +		 * FIFREEZE and FITHAW freeze and thaw the file system which the
> > > +		 * given file belongs to.  Requires CAP_SYS_ADMIN.
> > > +		 *
> > > +		 * These commands operate on the file system's superblock rather
> > > +		 * than on the file itself.  The same operations can also be
> > > +		 * done through any other file or directory on the same file
> > > +		 * system, so it is safe to permit these.
> > > +		 */
> > > +		return 0;
> > > +	case FS_IOC_FIEMAP:
> > > +		/*
> > > +		 * FS_IOC_FIEMAP queries information about the allocation of
> > > +		 * blocks within a file.
> > > +		 *
> > > +		 * This IOCTL command only applies to regular files.
> > > +		 */
> > > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > 
> > Same here.
> > 
> > > +	case FIGETBSZ:
> > > +		/*
> > > +		 * FIGETBSZ queries the file system's block size for a file or
> > > +		 * directory.
> > > +		 *
> > > +		 * This command operates on the file system's superblock rather
> > > +		 * than on the file itself.  The same operation can also be done
> > > +		 * through any other file or directory on the same file system,
> > > +		 * so it is safe to permit it.
> > > +		 */
> > > +		return 0;
> > > +	case FICLONE:
> > > +	case FICLONERANGE:
> > > +	case FIDEDUPERANGE:
> > > +		/*
> > > +		 * FICLONE, FICLONERANGE and FIDEDUPERANGE make files share
> > > +		 * their underlying storage ("reflink") between source and
> > > +		 * destination FDs, on file systems which support that.
> > > +		 *
> > > +		 * These IOCTL commands only apply to regular files.
> > > +		 */
> > > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > 
> > ditto
> > 
> > > +	case FIONREAD:
> > > +		/*
> > > +		 * FIONREAD returns the number of bytes available for reading.
> > > +		 *
> > > +		 * We require LANDLOCK_ACCESS_FS_IOCTL_DEV for FIONREAD, because
> > > +		 * devices implement it in f_ops->unlocked_ioctl().  The
> > > +		 * implementations of this operation have varying quality and
> > > +		 * complexity, so it is hard to reason about what they do.
> > > +		 */
> > > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > > +	case FS_IOC_GETFLAGS:
> > > +	case FS_IOC_SETFLAGS:
> > > +	case FS_IOC_FSGETXATTR:
> > > +	case FS_IOC_FSSETXATTR:
> > > +		/*
> > > +		 * FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_FSGETXATTR and
> > > +		 * FS_IOC_FSSETXATTR do not apply for devices.
> > > +		 */
> > > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > > +	case FS_IOC_GETFSUUID:
> > > +	case FS_IOC_GETFSSYSFSPATH:
> > > +		/*
> > > +		 * FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH both operate on
> > > +		 * the file system superblock, not on the specific file, so
> > > +		 * these operations are available through any other file on the
> > > +		 * same file system as well.
> > > +		 */
> > > +		return 0;
> > > +	case FIBMAP:
> > > +	case FS_IOC_RESVSP:
> > > +	case FS_IOC_RESVSP64:
> > > +	case FS_IOC_UNRESVSP:
> > > +	case FS_IOC_UNRESVSP64:
> > > +	case FS_IOC_ZERO_RANGE:
> > > +		/*
> > > +		 * FIBMAP, FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP,
> > > +		 * FS_IOC_UNRESVSP64 and FS_IOC_ZERO_RANGE only apply to regular
> > > +		 * files (as implemented in file_ioctl()).
> > > +		 */
> > > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > > +	default:
> > > +		/*
> > > +		 * Other commands are guarded by the catch-all access right.
> > > +		 */
> > > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > > +	}
> > > +}
> > > +
> > >  /* Ruleset management */
> > >  
> > >  static struct landlock_object *get_inode_object(struct inode *const inode)
> > > @@ -148,7 +286,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
> > >  	LANDLOCK_ACCESS_FS_EXECUTE | \
> > >  	LANDLOCK_ACCESS_FS_WRITE_FILE | \
> > >  	LANDLOCK_ACCESS_FS_READ_FILE | \
> > > -	LANDLOCK_ACCESS_FS_TRUNCATE)
> > > +	LANDLOCK_ACCESS_FS_TRUNCATE | \
> > > +	LANDLOCK_ACCESS_FS_IOCTL_DEV)
> > >  /* clang-format on */
> > >  
> > >  /*
> > > @@ -1335,8 +1474,10 @@ static int hook_file_alloc_security(struct file *const file)
> > >  static int hook_file_open(struct file *const file)
> > >  {
> > >  	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
> > > -	access_mask_t open_access_request, full_access_request, allowed_access;
> > > -	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> > > +	access_mask_t open_access_request, full_access_request, allowed_access,
> > > +		optional_access;
> > > +	const struct inode *inode = file_inode(file);
> > > +	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
> > >  	const struct landlock_ruleset *const dom =
> > >  		get_fs_domain(landlock_cred(file->f_cred)->domain);
> > >  
> > > @@ -1354,6 +1495,10 @@ static int hook_file_open(struct file *const file)
> > >  	 * We look up more access than what we immediately need for open(), so
> > >  	 * that we can later authorize operations on opened files.
> > >  	 */
> > > +	optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> > > +	if (is_device)
> > > +		optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > > +
> > >  	full_access_request = open_access_request | optional_access;
> > >  
> > >  	if (is_access_to_paths_allowed(
> > > @@ -1410,6 +1555,36 @@ static int hook_file_truncate(struct file *const file)
> > >  	return -EACCES;
> > >  }
> > >  
> > > +static int hook_file_ioctl(struct file *file, unsigned int cmd,
> > > +			   unsigned long arg)
> > > +{
> > > +	const struct inode *inode = file_inode(file);
> > > +	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
> > > +	access_mask_t required_access, allowed_access;
> > 
> > As explained in [2], I'd like not-sandboxed tasks to not have visible
> > performance impact because of Landlock:
> > 
> >   We should first check landlock_file(file)->allowed_access as in
> >   hook_file_truncate() to return as soon as possible for non-sandboxed
> >   tasks.  Any other computation should be done after that (e.g. with an
> >   is_device() helper).
> > 
> > [2] https://lore.kernel.org/r/20240311.If7ieshaegu2@digikod.net
> > 
> > This is_device(file) helper should also replace other is_device variables.
> 
> Done.
> 
> FWIW, I have doubts that it makes a performance difference - the is_device()
> check is almost for free as well.  But we can pull the same check earlier for
> consistency with the truncate hook, if it helps people to understand that their
> own program performance should be unaffected.

Agree

> 
> > 
> > 
> > > +
> > > +	if (!is_device)
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * It is the access rights at the time of opening the file which
> > > +	 * determine whether IOCTL can be used on the opened file later.
> > > +	 *
> > > +	 * The access right is attached to the opened file in hook_file_open().
> > > +	 */
> > > +	required_access = get_required_ioctl_dev_access(cmd);
> > > +	allowed_access = landlock_file(file)->allowed_access;
> > > +	if ((allowed_access & required_access) == required_access)
> > > +		return 0;
> > > +
> > > +	return -EACCES;
> > > +}
> > > +
> > > +static int hook_file_ioctl_compat(struct file *file, unsigned int cmd,
> > > +				  unsigned long arg)
> > > +{
> > > +	return hook_file_ioctl(file, cmd, arg);
> > 
> > The compat-specific IOCTL commands are missing (e.g. FS_IOC_RESVSP_32).
> > Relying on is_masked_device_ioctl() should make this call OK though.
> 
> OK, I'll try to replicate the logic from your vfs_masked_device_ioctl() approach
> then?

Yes please, unless you catch an issue with this approach.

> 
> —Günther
> 

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged
  2024-04-03  0:10  0%       ` Colin Walters
  2024-04-03  1:39  0%         ` Darrick J. Wong
@ 2024-04-03  8:35  0%         ` Alexander Larsson
  1 sibling, 0 replies; 200+ results
From: Alexander Larsson @ 2024-04-03  8:35 UTC (permalink / raw)
  To: Colin Walters, Darrick J. Wong
  Cc: Eric Biggers, Andrey Albershteyn, xfs, linux-fsdevel, fsverity

On Tue, 2024-04-02 at 20:10 -0400, Colin Walters wrote:
> [cc alexl@, retained quotes for context]
> 
> On Tue, Apr 2, 2024, at 6:52 PM, Darrick J. Wong wrote:
> > On Tue, Apr 02, 2024 at 04:00:06PM -0400, Colin Walters wrote:
> > > 
> > > 
> > > On Fri, Mar 29, 2024, at 8:43 PM, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > There are more things that one can do with an open file
> > > > descriptor on
> > > > XFS -- query extended attributes, scan for metadata damage,
> > > > repair
> > > > metadata, etc.  None of this is possible if the fsverity
> > > > metadata are
> > > > damaged, because that prevents the file from being opened.
> > > > 
> > > > Ignore a selective set of error codes that we know
> > > > fsverity_file_open to
> > > > return if the verity descriptor is nonsense.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  fs/iomap/buffered-io.c |    8 ++++++++
> > > >  fs/xfs/xfs_file.c      |   19 ++++++++++++++++++-
> > > >  2 files changed, 26 insertions(+), 1 deletion(-)
> > > > 
> > > > 
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 9f9d929dfeebc..e68a15b72dbdd 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -487,6 +487,14 @@ static loff_t iomap_readpage_iter(const
> > > > struct 
> > > > iomap_iter *iter,
> > > >  	size_t poff, plen;
> > > >  	sector_t sector;
> > > > 
> > > > +	/*
> > > > +	 * If this verity file hasn't been activated, fail
> > > > read attempts.  This
> > > > +	 * can happen if the calling filesystem allows files
> > > > to be opened even
> > > > +	 * with damaged verity metadata.
> > > > +	 */
> > > > +	if (IS_VERITY(iter->inode) && !fsverity_active(iter-
> > > > >inode))
> > > > +		return -EIO;
> > > > +
> > > >  	if (iomap->type == IOMAP_INLINE)
> > > >  		return iomap_read_inline_data(iter, folio);
> > > > 
> > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > index c0b3e8146b753..36034eaefbf55 100644
> > > > --- a/fs/xfs/xfs_file.c
> > > > +++ b/fs/xfs/xfs_file.c
> > > > @@ -1431,8 +1431,25 @@ xfs_file_open(
> > > >  			FMODE_DIO_PARALLEL_WRITE |
> > > > FMODE_CAN_ODIRECT;
> > > > 
> > > >  	error = fsverity_file_open(inode, file);
> > > > -	if (error)
> > > > +	switch (error) {
> > > > +	case -EFBIG:
> > > > +	case -EINVAL:
> > > > +	case -EMSGSIZE:
> > > > +	case -EFSCORRUPTED:
> > > > +		/*
> > > > +		 * Be selective about which fsverity errors we
> > > > propagate to
> > > > +		 * userspace; we still want to be able to open
> > > > this file even
> > > > +		 * if reads don't work.  Someone might want to
> > > > perform an
> > > > +		 * online repair.
> > > > +		 */
> > > > +		if (has_capability_noaudit(current,
> > > > CAP_SYS_ADMIN))
> > > > +			break;
> > > 
> > > As I understand it, fsverity (and dm-verity) are desirable in
> > > high-safety and integrity requirement cases where the goal is for
> > > the
> > > system to "fail closed" if errors in general are detected;
> > > anything
> > > that would have the system be in an ill-defined state.
> > 
> > Is "open() fails if verity metadata are trashed" a hard
> > requirement?
> 
> I can't say authoritatively, but I do want to ensure we've dug into
> the semantics here, and I agree with Eric that it would make the most
> sense to have this be consistent across filesystems.

In terms of userspace I think this semantic change is fine. Even if the
metadata is broken we will still not see any non-validated data. It's
as if we didn't try to use the broken fsverity metadata until it needed
to be used. I agree with others though that having the same behavior
across all filesystems would make sense. Also, it might be useful
information that the filesystem has an error, so maybe we should log
the swallowed errors.

For kernel use, in overlayfs when using verity_mode=require, we do use
open() (in ovl_validate_verity) to trigger the initialization of
fsverity_info. However, I took a look at this code, and it seems to
properly handle (i.e. fail) the case where IS_VERITY(inode) is true but
there is no fsverity_info after open.

Similarly, IMA (in ima_get_verity_digest) relies on the digest loaded
from the header. But it also seems to handle this case correctly.

> > Reads will still fail due to (iomap) readahead returning EIO for a
> > file
> > that is IS_VERITY() && !fsverity_active().  This is (afaict) the
> > state
> > you end up with when the fsverity open fails.  ext4/f2fs don't do
> > that,
> > but they also don't have online fsck so once a file's dead it's
> > dead.
> 
> OK, right.  Allowing an open() but having read() fail seems like it
> doesn't weaken things too much in reality.  I think what makes me
> uncomfortable is the error-swallowing; but yes, in theory we should
> get the same or similar error on a subsequent read().

If anything, the explicit error list seems a bit fragile to me. What if
the underlying fs reported some new error when reading the metadata?
Should we then suddenly fail here when we didn't before?

> 
-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                            Red Hat, Inc
       alexl@redhat.com            alexander.larsson@gmail.com
He's a lonely alcoholic firefighter looking for a cure to the poison 
coursing through his veins. She's a tortured insomniac Hell's Angel on 
the trail of a serial killer. They fight crime! 


^ permalink raw reply	[relevance 0%]

* Re: [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged
  2024-04-03  0:10  0%       ` Colin Walters
@ 2024-04-03  1:39  0%         ` Darrick J. Wong
  2024-04-03  8:35  0%         ` Alexander Larsson
  1 sibling, 0 replies; 200+ results
From: Darrick J. Wong @ 2024-04-03  1:39 UTC (permalink / raw)
  To: Colin Walters
  Cc: Eric Biggers, Andrey Albershteyn, xfs, linux-fsdevel, fsverity,
	Alexander Larsson

On Tue, Apr 02, 2024 at 08:10:15PM -0400, Colin Walters wrote:
> [cc alexl@, retained quotes for context]
> 
> On Tue, Apr 2, 2024, at 6:52 PM, Darrick J. Wong wrote:
> > On Tue, Apr 02, 2024 at 04:00:06PM -0400, Colin Walters wrote:
> >> 
> >> 
> >> On Fri, Mar 29, 2024, at 8:43 PM, Darrick J. Wong wrote:
> >> > From: Darrick J. Wong <djwong@kernel.org>
> >> >
> >> > There are more things that one can do with an open file descriptor on
> >> > XFS -- query extended attributes, scan for metadata damage, repair
> >> > metadata, etc.  None of this is possible if the fsverity metadata are
> >> > damaged, because that prevents the file from being opened.
> >> >
> >> > Ignore a selective set of error codes that we know fsverity_file_open to
> >> > return if the verity descriptor is nonsense.
> >> >
> >> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> >> > ---
> >> >  fs/iomap/buffered-io.c |    8 ++++++++
> >> >  fs/xfs/xfs_file.c      |   19 ++++++++++++++++++-
> >> >  2 files changed, 26 insertions(+), 1 deletion(-)
> >> >
> >> >
> >> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> > index 9f9d929dfeebc..e68a15b72dbdd 100644
> >> > --- a/fs/iomap/buffered-io.c
> >> > +++ b/fs/iomap/buffered-io.c
> >> > @@ -487,6 +487,14 @@ static loff_t iomap_readpage_iter(const struct 
> >> > iomap_iter *iter,
> >> >  	size_t poff, plen;
> >> >  	sector_t sector;
> >> > 
> >> > +	/*
> >> > +	 * If this verity file hasn't been activated, fail read attempts.  This
> >> > +	 * can happen if the calling filesystem allows files to be opened even
> >> > +	 * with damaged verity metadata.
> >> > +	 */
> >> > +	if (IS_VERITY(iter->inode) && !fsverity_active(iter->inode))
> >> > +		return -EIO;
> >> > +
> >> >  	if (iomap->type == IOMAP_INLINE)
> >> >  		return iomap_read_inline_data(iter, folio);
> >> > 
> >> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> >> > index c0b3e8146b753..36034eaefbf55 100644
> >> > --- a/fs/xfs/xfs_file.c
> >> > +++ b/fs/xfs/xfs_file.c
> >> > @@ -1431,8 +1431,25 @@ xfs_file_open(
> >> >  			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
> >> > 
> >> >  	error = fsverity_file_open(inode, file);
> >> > -	if (error)
> >> > +	switch (error) {
> >> > +	case -EFBIG:
> >> > +	case -EINVAL:
> >> > +	case -EMSGSIZE:
> >> > +	case -EFSCORRUPTED:
> >> > +		/*
> >> > +		 * Be selective about which fsverity errors we propagate to
> >> > +		 * userspace; we still want to be able to open this file even
> >> > +		 * if reads don't work.  Someone might want to perform an
> >> > +		 * online repair.
> >> > +		 */
> >> > +		if (has_capability_noaudit(current, CAP_SYS_ADMIN))
> >> > +			break;
> >> 
> >> As I understand it, fsverity (and dm-verity) are desirable in
> >> high-safety and integrity requirement cases where the goal is for the
> >> system to "fail closed" if errors in general are detected; anything
> >> that would have the system be in an ill-defined state.
> >
> > Is "open() fails if verity metadata are trashed" a hard requirement?
> 
> I can't say authoritatively, but I do want to ensure we've dug into
> the semantics here, and I agree with Eric that it would make the most
> sense to have this be consistent across filesystems.
> 
> > Reads will still fail due to (iomap) readahead returning EIO for a file
> > that is IS_VERITY() && !fsverity_active().  This is (afaict) the state
> > you end up with when the fsverity open fails.  ext4/f2fs don't do that,
> > but they also don't have online fsck so once a file's dead it's dead.
> 
> OK, right.  Allowing an open() but having read() fail seems like it
> doesn't weaken things too much in reality.  I think what makes me
> uncomfortable is the error-swallowing; but yes, in theory we should
> get the same or similar error on a subsequent read().

<nod> I /could/ write up some tests to make sure that happens.

> > <shrug> I don't know if regular (i.e. non-verity) xattrs are one of the
> > things that get frozen by verity?  Storing fsverity metadata in private
> > namespace xattrs is unique to xfs.
> 
> No, verity only covers file contents, no other metadata.  This is one
> of the rationales for composefs (e.g. ensuring things like the suid
> bit, security.selinux xattr etc. are covered as well as in general
> complete filesystem trees).
> 
> >> I hesitate to say it but maybe there should be some ioctl for online
> >> repair use cases only, or perhaps a new O_NOVERITY special flag to
> >> openat2()?
> >
> > "openat2 but without meddling from the VFS"?  Tempting... ;)
> 
> Or really any lower-level, even filesystem-specific, API for the online
> fsck case.  Adding a blanket new special case for all CAP_SYS_ADMIN
> processes covers a lot of things that don't need that.

I suppose there could be an O_NOVALIDATION to turn off data checksum
validation on btrfs/bcachefs too.  But then you'd want careful
controls on who gets to use it.  Maybe not liblzma_la-crc64-fast.o.

--D

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged
  2024-04-02 23:45  0%       ` Eric Biggers
@ 2024-04-03  1:34  0%         ` Darrick J. Wong
  0 siblings, 0 replies; 200+ results
From: Darrick J. Wong @ 2024-04-03  1:34 UTC (permalink / raw)
  To: Eric Biggers; +Cc: Colin Walters, aalbersh, xfs, linux-fsdevel, fsverity

On Tue, Apr 02, 2024 at 04:45:58PM -0700, Eric Biggers wrote:
> On Tue, Apr 02, 2024 at 03:52:16PM -0700, Darrick J. Wong wrote:
> > On Tue, Apr 02, 2024 at 04:00:06PM -0400, Colin Walters wrote:
> > > 
> > > 
> > > On Fri, Mar 29, 2024, at 8:43 PM, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > There are more things that one can do with an open file descriptor on
> > > > XFS -- query extended attributes, scan for metadata damage, repair
> > > > metadata, etc.  None of this is possible if the fsverity metadata are
> > > > damaged, because that prevents the file from being opened.
> > > >
> > > > Ignore a selective set of error codes that we know fsverity_file_open to
> > > > return if the verity descriptor is nonsense.
> > > >
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  fs/iomap/buffered-io.c |    8 ++++++++
> > > >  fs/xfs/xfs_file.c      |   19 ++++++++++++++++++-
> > > >  2 files changed, 26 insertions(+), 1 deletion(-)
> > > >
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 9f9d929dfeebc..e68a15b72dbdd 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -487,6 +487,14 @@ static loff_t iomap_readpage_iter(const struct 
> > > > iomap_iter *iter,
> > > >  	size_t poff, plen;
> > > >  	sector_t sector;
> > > > 
> > > > +	/*
> > > > +	 * If this verity file hasn't been activated, fail read attempts.  This
> > > > +	 * can happen if the calling filesystem allows files to be opened even
> > > > +	 * with damaged verity metadata.
> > > > +	 */
> > > > +	if (IS_VERITY(iter->inode) && !fsverity_active(iter->inode))
> > > > +		return -EIO;
> > > > +
> > > >  	if (iomap->type == IOMAP_INLINE)
> > > >  		return iomap_read_inline_data(iter, folio);
> > > > 
> > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > index c0b3e8146b753..36034eaefbf55 100644
> > > > --- a/fs/xfs/xfs_file.c
> > > > +++ b/fs/xfs/xfs_file.c
> > > > @@ -1431,8 +1431,25 @@ xfs_file_open(
> > > >  			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
> > > > 
> > > >  	error = fsverity_file_open(inode, file);
> > > > -	if (error)
> > > > +	switch (error) {
> > > > +	case -EFBIG:
> > > > +	case -EINVAL:
> > > > +	case -EMSGSIZE:
> > > > +	case -EFSCORRUPTED:
> > > > +		/*
> > > > +		 * Be selective about which fsverity errors we propagate to
> > > > +		 * userspace; we still want to be able to open this file even
> > > > +		 * if reads don't work.  Someone might want to perform an
> > > > +		 * online repair.
> > > > +		 */
> > > > +		if (has_capability_noaudit(current, CAP_SYS_ADMIN))
> > > > +			break;
> > > 
> > > As I understand it, fsverity (and dm-verity) are desirable in
> > > high-safety and integrity requirement cases where the goal is for the
> > > system to "fail closed" if errors in general are detected; anything
> > > that would have the system be in an ill-defined state.
> > 
> > Is "open() fails if verity metadata are trashed" a hard requirement?
> > 
> > Reads will still fail due to (iomap) readahead returning EIO for a file
> > that is IS_VERITY() && !fsverity_active().  This is (afaict) the state
> > you end up with when the fsverity open fails.  ext4/f2fs don't do that,
> > but they also don't have online fsck so once a file's dead it's dead.
> > 
> 
> We really should have the same behavior on all filesystems, and that behavior
> should be documented in Documentation/filesystems/fsverity.rst.  I guess you
> want this for XFS_IOC_SCRUB_METADATA?

Yes.  xfs_scrub tries to open every regular file that it can, but if the
fsverity metadata is too badly damaged then the open() returns EMSGSIZE
or EINVAL or something.  The EMSGSIZE is particularly nasty since it's
not listed in the openat() manpage as a possible error code, which
surprised me.

>                                        That takes in an inode number directly,
> in xfs_scrub_metadata::sm_ino; does it even need to be executed on the same file
> it's checking?

<nod> The metadata repairs themselves can use scrub-by-handle mode, so
it's not *so* hard to handle it gracefully.

>                 Anyway, allowing the open means that the case of IS_VERITY() &&
> !fsverity_active() needs to be handled later in any case when I/O may be done to
> the file.  We need to be super careful to ensure that all cases are handled.

I /think/ most everything else is gated on IS_VERITY, right?

> Even just considering this patchset and XFS only, it looks like you got it wrong
> in xfs_file_read_iter().  You're allowing direct I/O to files that have
> IS_VERITY() && !fsverity_active().

Ahaha, yeah, that needs to be changed to:

	else if ((iocb->ki_flags & IOCB_DIRECT) && !IS_VERITY(inode))
		ret = xfs_file_dio_read(iocb, to);

Good catch.

> This change also invalidates the documentation for fsverity_active() which is:
> 
> /**
>  * fsverity_active() - do reads from the inode need to go through fs-verity?
>  * @inode: inode to check
>  *
>  * This checks whether ->i_verity_info has been set.
>  *
>  * Filesystems call this from ->readahead() to check whether the pages need to
>  * be verified or not.  Don't use IS_VERITY() for this purpose; it's subject to
>  * a race condition where the file is being read concurrently with
>  * FS_IOC_ENABLE_VERITY completing.  (S_VERITY is set before ->i_verity_info.)
>  *
>  * Return: true if reads need to go through fs-verity, otherwise false
>  */
> 
> I think that if you'd like to move forward with this, it would take a patchset
> that brings the behavior to all filesystems and considers all callers of
> fsverity_active().

<nod> If you think it's a reasonable thing to allow, then I'll of course
apply it to btr/ext4/f2fs.

> Another consideration will be whether the fsverity builtin signature not
> matching the file, not being trusted, or being malformed counts as "the fsverity
> metadata being damaged".

<shrug> Can you easily check that in the open routine?  I figured that
signature validation problems would manifest as read errors.

--D

> - Eric
> 

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged
  2024-04-02 22:52  0%     ` Darrick J. Wong
  2024-04-02 23:45  0%       ` Eric Biggers
@ 2024-04-03  0:10  0%       ` Colin Walters
  2024-04-03  1:39  0%         ` Darrick J. Wong
  2024-04-03  8:35  0%         ` Alexander Larsson
  1 sibling, 2 replies; 200+ results
From: Colin Walters @ 2024-04-03  0:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Eric Biggers, Andrey Albershteyn, xfs, linux-fsdevel, fsverity,
	Alexander Larsson

[cc alexl@, retained quotes for context]

On Tue, Apr 2, 2024, at 6:52 PM, Darrick J. Wong wrote:
> On Tue, Apr 02, 2024 at 04:00:06PM -0400, Colin Walters wrote:
>> 
>> 
>> On Fri, Mar 29, 2024, at 8:43 PM, Darrick J. Wong wrote:
>> > From: Darrick J. Wong <djwong@kernel.org>
>> >
>> > There are more things that one can do with an open file descriptor on
>> > XFS -- query extended attributes, scan for metadata damage, repair
>> > metadata, etc.  None of this is possible if the fsverity metadata are
>> > damaged, because that prevents the file from being opened.
>> >
>> > Ignore a selective set of error codes that we know fsverity_file_open to
>> > return if the verity descriptor is nonsense.
>> >
>> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
>> > ---
>> >  fs/iomap/buffered-io.c |    8 ++++++++
>> >  fs/xfs/xfs_file.c      |   19 ++++++++++++++++++-
>> >  2 files changed, 26 insertions(+), 1 deletion(-)
>> >
>> >
>> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> > index 9f9d929dfeebc..e68a15b72dbdd 100644
>> > --- a/fs/iomap/buffered-io.c
>> > +++ b/fs/iomap/buffered-io.c
>> > @@ -487,6 +487,14 @@ static loff_t iomap_readpage_iter(const struct 
>> > iomap_iter *iter,
>> >  	size_t poff, plen;
>> >  	sector_t sector;
>> > 
>> > +	/*
>> > +	 * If this verity file hasn't been activated, fail read attempts.  This
>> > +	 * can happen if the calling filesystem allows files to be opened even
>> > +	 * with damaged verity metadata.
>> > +	 */
>> > +	if (IS_VERITY(iter->inode) && !fsverity_active(iter->inode))
>> > +		return -EIO;
>> > +
>> >  	if (iomap->type == IOMAP_INLINE)
>> >  		return iomap_read_inline_data(iter, folio);
>> > 
>> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> > index c0b3e8146b753..36034eaefbf55 100644
>> > --- a/fs/xfs/xfs_file.c
>> > +++ b/fs/xfs/xfs_file.c
>> > @@ -1431,8 +1431,25 @@ xfs_file_open(
>> >  			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
>> > 
>> >  	error = fsverity_file_open(inode, file);
>> > -	if (error)
>> > +	switch (error) {
>> > +	case -EFBIG:
>> > +	case -EINVAL:
>> > +	case -EMSGSIZE:
>> > +	case -EFSCORRUPTED:
>> > +		/*
>> > +		 * Be selective about which fsverity errors we propagate to
>> > +		 * userspace; we still want to be able to open this file even
>> > +		 * if reads don't work.  Someone might want to perform an
>> > +		 * online repair.
>> > +		 */
>> > +		if (has_capability_noaudit(current, CAP_SYS_ADMIN))
>> > +			break;
>> 
>> As I understand it, fsverity (and dm-verity) are desirable in
>> high-safety and integrity requirement cases where the goal is for the
>> system to "fail closed" if errors in general are detected; anything
>> that would have the system be in an ill-defined state.
>
> Is "open() fails if verity metadata are trashed" a hard requirement?

I can't say authoritatively, but I do want to ensure we've dug into the semantics here, and I agree with Eric that it would make the most sense to have this be consistent across filesystems.

> Reads will still fail due to (iomap) readahead returning EIO for a file
> that is IS_VERITY() && !fsverity_active().  This is (afaict) the state
> you end up with when the fsverity open fails.  ext4/f2fs don't do that,
> but they also don't have online fsck so once a file's dead it's dead.

OK, right.  Allowing an open() but having read() fail seems like it doesn't weaken things too much in reality.  I think what makes me uncomfortable is the error-swallowing; but yes, in theory we should get the same or similar error on a subsequent read().

> <shrug> I don't know if regular (i.e. non-verity) xattrs are one of the
> things that get frozen by verity?  Storing fsverity metadata in private
> namespace xattrs is unique to xfs.

No, verity only covers file contents, no other metadata.  This is one of the rationales for composefs (e.g. ensuring things like the suid bit, security.selinux xattr etc. are covered as well as in general complete filesystem trees).

>> I hesitate to say it but maybe there should be some ioctl for online
>> repair use cases only, or perhaps a new O_NOVERITY special flag to
>> openat2()?
>
> "openat2 but without meddling from the VFS"?  Tempting... ;)

Or really any lower-level, even filesystem-specific, API for the online fsck case.  
Adding a blanket new special case for all CAP_SYS_ADMIN processes covers a lot of things that don't need that.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged
  2024-04-02 22:52  0%     ` Darrick J. Wong
@ 2024-04-02 23:45  0%       ` Eric Biggers
  2024-04-03  1:34  0%         ` Darrick J. Wong
  2024-04-03  0:10  0%       ` Colin Walters
  1 sibling, 1 reply; 200+ results
From: Eric Biggers @ 2024-04-02 23:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Colin Walters, aalbersh, xfs, linux-fsdevel, fsverity

On Tue, Apr 02, 2024 at 03:52:16PM -0700, Darrick J. Wong wrote:
> On Tue, Apr 02, 2024 at 04:00:06PM -0400, Colin Walters wrote:
> > 
> > 
> > On Fri, Mar 29, 2024, at 8:43 PM, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > There are more things that one can do with an open file descriptor on
> > > XFS -- query extended attributes, scan for metadata damage, repair
> > > metadata, etc.  None of this is possible if the fsverity metadata are
> > > damaged, because that prevents the file from being opened.
> > >
> > > Ignore a selective set of error codes that we know fsverity_file_open to
> > > return if the verity descriptor is nonsense.
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/iomap/buffered-io.c |    8 ++++++++
> > >  fs/xfs/xfs_file.c      |   19 ++++++++++++++++++-
> > >  2 files changed, 26 insertions(+), 1 deletion(-)
> > >
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 9f9d929dfeebc..e68a15b72dbdd 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -487,6 +487,14 @@ static loff_t iomap_readpage_iter(const struct 
> > > iomap_iter *iter,
> > >  	size_t poff, plen;
> > >  	sector_t sector;
> > > 
> > > +	/*
> > > +	 * If this verity file hasn't been activated, fail read attempts.  This
> > > +	 * can happen if the calling filesystem allows files to be opened even
> > > +	 * with damaged verity metadata.
> > > +	 */
> > > +	if (IS_VERITY(iter->inode) && !fsverity_active(iter->inode))
> > > +		return -EIO;
> > > +
> > >  	if (iomap->type == IOMAP_INLINE)
> > >  		return iomap_read_inline_data(iter, folio);
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index c0b3e8146b753..36034eaefbf55 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -1431,8 +1431,25 @@ xfs_file_open(
> > >  			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
> > > 
> > >  	error = fsverity_file_open(inode, file);
> > > -	if (error)
> > > +	switch (error) {
> > > +	case -EFBIG:
> > > +	case -EINVAL:
> > > +	case -EMSGSIZE:
> > > +	case -EFSCORRUPTED:
> > > +		/*
> > > +		 * Be selective about which fsverity errors we propagate to
> > > +		 * userspace; we still want to be able to open this file even
> > > +		 * if reads don't work.  Someone might want to perform an
> > > +		 * online repair.
> > > +		 */
> > > +		if (has_capability_noaudit(current, CAP_SYS_ADMIN))
> > > +			break;
> > 
> > As I understand it, fsverity (and dm-verity) are desirable in
> > high-safety and integrity requirement cases where the goal is for the
> > system to "fail closed" if errors in general are detected; anything
> > that would have the system be in an ill-defined state.
> 
> Is "open() fails if verity metadata are trashed" a hard requirement?
> 
> Reads will still fail due to (iomap) readahead returning EIO for a file
> that is IS_VERITY() && !fsverity_active().  This is (afaict) the state
> you end up with when the fsverity open fails.  ext4/f2fs don't do that,
> but they also don't have online fsck so once a file's dead it's dead.
> 

We really should have the same behavior on all filesystems, and that behavior
should be documented in Documentation/filesystems/fsverity.rst.  I guess you
want this for XFS_IOC_SCRUB_METADATA?  That takes in an inode number directly,
in xfs_scrub_metadata::sm_ino; does it even need to be executed on the same file
it's checking?  Anyway, allowing the open means that the case of IS_VERITY() &&
!fsverity_active() needs to be handled later in any case when I/O may be done to
the file.  We need to be super careful to ensure that all cases are handled.

Even just considering this patchset and XFS only, it looks like you got it wrong
in xfs_file_read_iter().  You're allowing direct I/O to files that have
IS_VERITY() && !fsverity_active().

This change also invalidates the documentation for fsverity_active() which is:

/**
 * fsverity_active() - do reads from the inode need to go through fs-verity?
 * @inode: inode to check
 *
 * This checks whether ->i_verity_info has been set.
 *
 * Filesystems call this from ->readahead() to check whether the pages need to
 * be verified or not.  Don't use IS_VERITY() for this purpose; it's subject to
 * a race condition where the file is being read concurrently with
 * FS_IOC_ENABLE_VERITY completing.  (S_VERITY is set before ->i_verity_info.)
 *
 * Return: true if reads need to go through fs-verity, otherwise false
 */

I think that if you'd like to move forward with this, it would take a patchset
that brings the behavior to all filesystems and considers all callers of
fsverity_active().

Another consideration will be whether the fsverity builtin signature not
matching the file, not being trusted, or being malformed counts as "the fsverity
metadata being damaged".

- Eric

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged
  2024-04-02 20:00  7%   ` Colin Walters
@ 2024-04-02 22:52  0%     ` Darrick J. Wong
  2024-04-02 23:45  0%       ` Eric Biggers
  2024-04-03  0:10  0%       ` Colin Walters
  0 siblings, 2 replies; 200+ results
From: Darrick J. Wong @ 2024-04-02 22:52 UTC (permalink / raw)
  To: Colin Walters; +Cc: Eric Biggers, aalbersh, xfs, linux-fsdevel, fsverity

On Tue, Apr 02, 2024 at 04:00:06PM -0400, Colin Walters wrote:
> 
> 
> On Fri, Mar 29, 2024, at 8:43 PM, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > There are more things that one can do with an open file descriptor on
> > XFS -- query extended attributes, scan for metadata damage, repair
> > metadata, etc.  None of this is possible if the fsverity metadata are
> > damaged, because that prevents the file from being opened.
> >
> > Ignore a selective set of error codes that we know fsverity_file_open to
> > return if the verity descriptor is nonsense.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/iomap/buffered-io.c |    8 ++++++++
> >  fs/xfs/xfs_file.c      |   19 ++++++++++++++++++-
> >  2 files changed, 26 insertions(+), 1 deletion(-)
> >
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 9f9d929dfeebc..e68a15b72dbdd 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -487,6 +487,14 @@ static loff_t iomap_readpage_iter(const struct 
> > iomap_iter *iter,
> >  	size_t poff, plen;
> >  	sector_t sector;
> > 
> > +	/*
> > +	 * If this verity file hasn't been activated, fail read attempts.  This
> > +	 * can happen if the calling filesystem allows files to be opened even
> > +	 * with damaged verity metadata.
> > +	 */
> > +	if (IS_VERITY(iter->inode) && !fsverity_active(iter->inode))
> > +		return -EIO;
> > +
> >  	if (iomap->type == IOMAP_INLINE)
> >  		return iomap_read_inline_data(iter, folio);
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index c0b3e8146b753..36034eaefbf55 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1431,8 +1431,25 @@ xfs_file_open(
> >  			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
> > 
> >  	error = fsverity_file_open(inode, file);
> > -	if (error)
> > +	switch (error) {
> > +	case -EFBIG:
> > +	case -EINVAL:
> > +	case -EMSGSIZE:
> > +	case -EFSCORRUPTED:
> > +		/*
> > +		 * Be selective about which fsverity errors we propagate to
> > +		 * userspace; we still want to be able to open this file even
> > +		 * if reads don't work.  Someone might want to perform an
> > +		 * online repair.
> > +		 */
> > +		if (has_capability_noaudit(current, CAP_SYS_ADMIN))
> > +			break;
> 
> As I understand it, fsverity (and dm-verity) are desirable in
> high-safety and integrity requirement cases where the goal is for the
> system to "fail closed" if errors in general are detected; anything
> that would have the system be in an ill-defined state.

Is "open() fails if verity metadata are trashed" a hard requirement?

Reads will still fail due to (iomap) readahead returning EIO for a file
that is IS_VERITY() && !fsverity_active().  This is (afaict) the state
you end up with when the fsverity open fails.  ext4/f2fs don't do that,
but they also don't have online fsck so once a file's dead it's dead.
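To make the state Darrick describes concrete, here is a tiny userspace toy of the read gate (stand-in types and names; the kernel checks `IS_VERITY(inode)` and `fsverity_active(inode)` inside `iomap_readpage_iter()` as in the hunk quoted above):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the relevant struct inode state. */
struct toy_inode {
	bool verity_flag;    /* S_VERITY set: the file was enrolled in fsverity */
	bool verity_active;  /* the verity descriptor parsed OK at open time */
};

/* Mirrors the iomap hunk: a verity-enrolled file whose verity info never
 * activated (e.g. damaged metadata) gets -EIO on any buffered read, even
 * though the open itself succeeded. */
int readpage_gate(const struct toy_inode *inode)
{
	if (inode->verity_flag && !inode->verity_active)
		return -EIO;
	return 0; /* proceed with the normal read path */
}
```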

> A lot of ambient processes are going to have CAP_SYS_ADMIN and this
> will just swallow these errors for those (will things on the EFSCORRUPTED
> path at least have been logged by a lower level function?)...whereas
> this is only needed just for a very few tools.
> 
> At least for composefs the quoted cases of "query extended attributes,
> scan for metadata damage, repair metadata" are all things that
> canonically live in the composefs metadata (EROFS) blob, so in theory
> there's a lot less of a need to query/inspect it for those use cases.
> (Maybe for composefs we should force canonicalize all the underlying
> files to have mode 0400 and no xattrs or something and add that to its
> repair).

<shrug> I don't know if regular (i.e. non-verity) xattrs are one of the
things that get frozen by verity?  Storing fsverity metadata in private
namespace xattrs is unique to xfs.

> I hesitate to say it but maybe there should be some ioctl for online
> repair use cases only, or perhaps a new O_NOVERITY special flag to
> openat2()?

"openat2 but without meddling from the VFS"?  Tempting... ;)

--D

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged
  2024-03-30  0:43  5% ` [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged Darrick J. Wong
  2024-04-02 18:04  0%   ` Andrey Albershteyn
@ 2024-04-02 20:00  7%   ` Colin Walters
  2024-04-02 22:52  0%     ` Darrick J. Wong
  1 sibling, 1 reply; 200+ results
From: Colin Walters @ 2024-04-02 20:00 UTC (permalink / raw)
  To: Darrick J. Wong, Eric Biggers, aalbersh; +Cc: xfs, linux-fsdevel, fsverity



On Fri, Mar 29, 2024, at 8:43 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> There are more things that one can do with an open file descriptor on
> XFS -- query extended attributes, scan for metadata damage, repair
> metadata, etc.  None of this is possible if the fsverity metadata are
> damaged, because that prevents the file from being opened.
>
> Ignore a selective set of error codes that we know fsverity_file_open to
> return if the verity descriptor is nonsense.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/iomap/buffered-io.c |    8 ++++++++
>  fs/xfs/xfs_file.c      |   19 ++++++++++++++++++-
>  2 files changed, 26 insertions(+), 1 deletion(-)
>
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 9f9d929dfeebc..e68a15b72dbdd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -487,6 +487,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>  	size_t poff, plen;
>  	sector_t sector;
> 
> +	/*
> +	 * If this verity file hasn't been activated, fail read attempts.  This
> +	 * can happen if the calling filesystem allows files to be opened even
> +	 * with damaged verity metadata.
> +	 */
> +	if (IS_VERITY(iter->inode) && !fsverity_active(iter->inode))
> +		return -EIO;
> +
>  	if (iomap->type == IOMAP_INLINE)
>  		return iomap_read_inline_data(iter, folio);
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c0b3e8146b753..36034eaefbf55 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1431,8 +1431,25 @@ xfs_file_open(
>  			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
> 
>  	error = fsverity_file_open(inode, file);
> -	if (error)
> +	switch (error) {
> +	case -EFBIG:
> +	case -EINVAL:
> +	case -EMSGSIZE:
> +	case -EFSCORRUPTED:
> +		/*
> +		 * Be selective about which fsverity errors we propagate to
> +		 * userspace; we still want to be able to open this file even
> +		 * if reads don't work.  Someone might want to perform an
> +		 * online repair.
> +		 */
> +		if (has_capability_noaudit(current, CAP_SYS_ADMIN))
> +			break;

As I understand it, fsverity (and dm-verity) are desirable in high-safety and integrity requirement cases where the goal is for the system to "fail closed" if errors in general are detected; anything that would have the system be in an ill-defined state.

A lot of ambient processes are going to have CAP_SYS_ADMIN and this will just swallow these errors for those (will things on the EFSCORRUPTED path at least have been logged by a lower level function?)...whereas this is only needed just for a very few tools.

At least for composefs the quoted cases of "query extended attributes, scan for metadata damage, repair metadata" are all things that canonically live in the composefs metadata (EROFS) blob, so in theory there's a lot less of a need to query/inspect it for those use cases.  (Maybe for composefs we should force canonicalize all the underlying files to have mode 0400 and no xattrs or something and add that to its repair).

I hesitate to say it but maybe there should be some ioctl for online repair use cases only, or perhaps a new O_NOVERITY special flag to openat2()?




^ permalink raw reply	[relevance 7%]

* Re: [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices
  2024-03-27 16:57  0%   ` Mickaël Salaün
  2024-03-28 12:01  0%     ` Mickaël Salaün
@ 2024-04-02 18:28  0%     ` Günther Noack
  2024-04-03 11:15  0%       ` Mickaël Salaün
  1 sibling, 1 reply; 200+ results
From: Günther Noack @ 2024-04-02 18:28 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-security-module, Jeff Xu, Arnd Bergmann,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel,
	Christian Brauner

Hello!

Thanks for the review!

On Wed, Mar 27, 2024 at 05:57:31PM +0100, Mickaël Salaün wrote:
> On Wed, Mar 27, 2024 at 01:10:31PM +0000, Günther Noack wrote:
> > Introduces the LANDLOCK_ACCESS_FS_IOCTL_DEV right
> > and increments the Landlock ABI version to 5.
> > 
> > This access right applies to device-custom IOCTL commands
> > when they are invoked on block or character device files.
> > 
> > Like the truncate right, this right is associated with a file
> > descriptor at the time of open(2), and gets respected even when the
> > file descriptor is used outside of the thread which it was originally
> > opened in.
> > 
> > Therefore, a newly enabled Landlock policy does not apply to file
> > descriptors which are already open.
> > 
> > If the LANDLOCK_ACCESS_FS_IOCTL_DEV right is handled, only a small
> > number of safe IOCTL commands will be permitted on newly opened device
> > files.  These include FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC, as well
> > as other IOCTL commands for regular files which are implemented in
> > fs/ioctl.c.
> > 
> > Noteworthy scenarios which require special attention:
> > 
> > TTY devices are often passed into a process from the parent process,
> > and so a newly enabled Landlock policy does not retroactively apply to
> > them automatically.  In the past, TTY devices have often supported
> > IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which were
> > letting callers control the TTY input buffer (and simulate
> > keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
> > modern kernels though.
> > 
> > Known limitations:
> > 
> > The LANDLOCK_ACCESS_FS_IOCTL_DEV access right is a coarse-grained
> > control over IOCTL commands.
> > 
> > Landlock users may use path-based restrictions in combination with
> > their knowledge about the file system layout to control what IOCTLs
> > can be done.
> > 
> > Cc: Paul Moore <paul@paul-moore.com>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Signed-off-by: Günther Noack <gnoack@google.com>
> > ---
> >  include/uapi/linux/landlock.h                |  33 +++-
> >  security/landlock/fs.c                       | 183 ++++++++++++++++++-
> >  security/landlock/limits.h                   |   2 +-
> >  security/landlock/syscalls.c                 |   8 +-
> >  tools/testing/selftests/landlock/base_test.c |   2 +-
> >  tools/testing/selftests/landlock/fs_test.c   |   5 +-
> >  6 files changed, 216 insertions(+), 17 deletions(-)
> > 
> > diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> > index 25c8d7677539..5d90e9799eb5 100644
> > --- a/include/uapi/linux/landlock.h
> > +++ b/include/uapi/linux/landlock.h
> > @@ -128,7 +128,7 @@ struct landlock_net_port_attr {
> >   * files and directories.  Files or directories opened before the sandboxing
> >   * are not subject to these restrictions.
> >   *
> > - * A file can only receive these access rights:
> > + * The following access rights apply only to files:
> >   *
> >   * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
> >   * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
> > @@ -138,12 +138,13 @@ struct landlock_net_port_attr {
> >   * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
> >   * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
> >   *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
> > - *   ``O_TRUNC``. Whether an opened file can be truncated with
> > - *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
> > - *   same way as read and write permissions are checked during
> > - *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
> > - *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
> > - *   third version of the Landlock ABI.
> > + *   ``O_TRUNC``.  This access right is available since the third version of the
> > + *   Landlock ABI.
> > + *
> > + * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
> > + * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
> > + * read and write permissions are checked during :manpage:`open(2)` using
> > + * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
> >   *
> >   * A directory can receive access rights related to files or directories.  The
> >   * following access right is applied to the directory itself, and the
> > @@ -198,13 +199,28 @@ struct landlock_net_port_attr {
> >   *   If multiple requirements are not met, the ``EACCES`` error code takes
> >   *   precedence over ``EXDEV``.
> >   *
> > + * The following access right applies both to files and directories:
> > + *
> > + * - %LANDLOCK_ACCESS_FS_IOCTL_DEV: Invoke :manpage:`ioctl(2)` commands on an opened
> > + *   character or block device.
> > + *
> > + *   This access right applies to all `ioctl(2)` commands implemented by device
> 
> :manpage:`ioctl(2)`
> 
> > + *   drivers.  However, the following common IOCTL commands continue to be
> > + *   invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL_DEV right:
> 
> This is good but explaining the rationale could help, something like
> this (taking care of not packing lines listing commands to ease review
> when a new command will be added):
> 
> IOCTL commands targeting file descriptors (``FIOCLEX``, ``FIONCLEX``),
> file descriptions (``FIONBIO``, ``FIOASYNC``),
> file systems (``FIOQSIZE``, ``FS_IOC_FIEMAP``, ``FICLONE``,
> ``FICLONERANGE``, ``FIDEDUPERANGE``, ``FS_IOC_GETFLAGS``,
> ``FS_IOC_SETFLAGS``, ``FS_IOC_FSGETXATTR``, ``FS_IOC_FSSETXATTR``),
> or superblocks (``FIFREEZE``, ``FITHAW``, ``FIGETBSZ``,
> ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``)
> are never denied.  However, such IOCTL commands still require an opened
> file and may not be available on any file type.  Read or write
> permission may be checked by the underlying implementation, as well as
> capabilities.

OK, I'll add some more explanation in the next version.


> > + *   ``FIOCLEX``, ``FIONCLEX``, ``FIONBIO``, ``FIOASYNC``, ``FIFREEZE``,
> > + *   ``FITHAW``, ``FIGETBSZ``, ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``
> > + *
> > + *   This access right is available since the fifth version of the Landlock
> > + *   ABI.
> > + *
> >   * .. warning::
> >   *
> >   *   It is currently not possible to restrict some file-related actions
> >   *   accessible through these syscall families: :manpage:`chdir(2)`,
> >   *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
> >   *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
> > - *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
> > + *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
> >   *   Future Landlock evolutions will enable to restrict them.
> >   */
> >  /* clang-format off */
> > @@ -223,6 +239,7 @@ struct landlock_net_port_attr {
> >  #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
> >  #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
> >  #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
> > +#define LANDLOCK_ACCESS_FS_IOCTL_DEV			(1ULL << 15)
> >  /* clang-format on */
> >  
> >  /**
> > diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> > index c15559432d3d..2ef6c57fa20b 100644
> > --- a/security/landlock/fs.c
> > +++ b/security/landlock/fs.c
> > @@ -7,6 +7,7 @@
> >   * Copyright © 2021-2022 Microsoft Corporation
> >   */
> >  
> > +#include <asm/ioctls.h>
> >  #include <kunit/test.h>
> >  #include <linux/atomic.h>
> >  #include <linux/bitops.h>
> > @@ -14,6 +15,7 @@
> >  #include <linux/compiler_types.h>
> >  #include <linux/dcache.h>
> >  #include <linux/err.h>
> > +#include <linux/falloc.h>
> >  #include <linux/fs.h>
> >  #include <linux/init.h>
> >  #include <linux/kernel.h>
> > @@ -29,6 +31,7 @@
> >  #include <linux/types.h>
> >  #include <linux/wait_bit.h>
> >  #include <linux/workqueue.h>
> > +#include <uapi/linux/fiemap.h>
> >  #include <uapi/linux/landlock.h>
> >  
> >  #include "common.h"
> > @@ -84,6 +87,141 @@ static const struct landlock_object_underops landlock_fs_underops = {
> >  	.release = release_inode
> >  };
> >  
> > +/* IOCTL helpers */
> > +
> > +/**
> > + * get_required_ioctl_dev_access(): Determine required access rights for IOCTLs
> > + * on device files.
> > + *
> > + * @cmd: The IOCTL command that is supposed to be run.
> > + *
> > + * By default, any IOCTL on a device file requires the
> > + * LANDLOCK_ACCESS_FS_IOCTL_DEV right.  We make exceptions for commands, if:
> > + *
> > + * 1. The command is implemented in fs/ioctl.c's do_vfs_ioctl(),
> > + *    not in f_ops->unlocked_ioctl() or f_ops->compat_ioctl().
> > + *
> > + * 2. The command can be reasonably used on a device file at all.
> > + *
> > + * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
> > + * should be considered for inclusion here.
> > + *
> > + * Returns: The access rights that must be granted on an opened file in order to
> > + * use the given @cmd.
> > + */
> > +static __attribute_const__ access_mask_t
> > +get_required_ioctl_dev_access(const unsigned int cmd)
> > +{
> > +	switch (cmd) {
> > +	case FIOCLEX:
> > +	case FIONCLEX:
> > +	case FIONBIO:
> > +	case FIOASYNC:
> > +		/*
> > +		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
> > +		 * close-on-exec and the file's buffered-IO and async flags.
> > +		 * These operations are also available through fcntl(2), and are
> > +		 * unconditionally permitted in Landlock.
> > +		 */
> > +		return 0;
> > +	case FIOQSIZE:
> > +		/*
> > +		 * FIOQSIZE queries the size of a regular file or directory.
> > +		 *
> > +		 * This IOCTL command only applies to regular files and
> > +		 * directories.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> 
> This should always be allowed because do_vfs_ioctl() never returns
> -ENOIOCTLCMD for this command.  That's why I wrote
> vfs_masked_device_ioctl() this way [1].  I think it would be easier to
> read and maintain this code with a is_masked_device_ioctl() logic.  Listing
> commands that are not masked makes it difficult to review because
> allowed and denied return codes are interleaved.

Oh, I misunderstood you on [2], I think -- I was under the impression that you
wanted to keep the switch case in the same order (and with the same entries?) as
the original in do_vfs_ioctl.  So you'd prefer to only list the always-allowed
IOCTL commands here, as you have done in vfs_masked_device_ioctl() [3]?

[2] https://lore.kernel.org/all/20240326.ooCheem1biV2@digikod.net/
[3] https://lore.kernel.org/all/20240219183539.2926165-1-mic@digikod.net/


Can you please clarify how you make up your mind about what should be permitted
and what should not?  I have trouble understanding the rationale for the changes
that you asked for below, apart from the points that they are harmless and that
the return codes should be consistent.

The criteria that I have used in this patch set are that (a) it is implemented
in do_vfs_ioctl() rather than further below, and (b) it makes sense to use that
command on a device file.  (If we permit FIOQSIZE, FS_IOC_FIEMAP and others
here, we will get slightly more correct error codes in these cases, but the
IOCTLs will still not work, because they are not useful and not implemented for
devices. -- On the other hand, we are also increasing the exposed code surface a
bit.  For example, FS_IOC_FIEMAP is calling into inode->i_op->fiemap().  That is
probably harmless for device files, but requires us to reason at a deeper level
to convince ourselves of that.)

In your implementation at [3], you were permitting FICLONE* and FIDEDUPERANGE,
but not FS_IOC_ZERO_RANGE, which is like fallocate().  How are these cases
different to each other?  Is that on purpose?
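For reference while comparing the two shapes, the allowlist style from vfs_masked_device_ioctl() [3] looks roughly like the sketch below. The exact command set is an assumption drawn from this discussion, not the final patch; commands such as FIONREAD and FICLONE deliberately fall through to the default so they stay guarded by LANDLOCK_ACCESS_FS_IOCTL_DEV:

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/ioctl.h>   /* FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC, FIONREAD */
#include <linux/fs.h>    /* FIFREEZE, FITHAW, FIGETBSZ, FICLONE */

/* Allowlist sketch: name only the commands that do_vfs_ioctl() fully
 * handles and that are safe on device files; everything else requires
 * the LANDLOCK_ACCESS_FS_IOCTL_DEV right. */
bool is_masked_device_ioctl(unsigned int cmd)
{
	switch (cmd) {
	case FIOCLEX:
	case FIONCLEX:
	case FIONBIO:
	case FIOASYNC:     /* FD / file-description flags, fcntl-equivalent */
	case FIFREEZE:
	case FITHAW:       /* superblock-wide, reachable via any other file */
	case FIGETBSZ:     /* file system block size, also superblock-wide */
#ifdef FS_IOC_GETFSUUID
	case FS_IOC_GETFSUUID:
#endif
#ifdef FS_IOC_GETFSSYSFSPATH
	case FS_IOC_GETFSSYSFSPATH:
#endif
		return true;
	default:
		return false;  /* FIONREAD, FICLONE, ... remain access-checked */
	}
}
```

One readability advantage of this shape is that every `return true` is visible in a single block, so a reviewer can audit the full set of always-permitted commands at a glance.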


> [1] https://lore.kernel.org/r/20240219183539.2926165-1-mic@digikod.net
> 
> Your IOCTL command explanation comments are nice and they should be kept
> in is_masked_device_ioctl() (if they mask device IOCTL commands).

OK

> 
> > +	case FIFREEZE:
> > +	case FITHAW:
> > +		/*
> > +		 * FIFREEZE and FITHAW freeze and thaw the file system which the
> > +		 * given file belongs to.  Requires CAP_SYS_ADMIN.
> > +		 *
> > +		 * These commands operate on the file system's superblock rather
> > +		 * than on the file itself.  The same operations can also be
> > +		 * done through any other file or directory on the same file
> > +		 * system, so it is safe to permit these.
> > +		 */
> > +		return 0;
> > +	case FS_IOC_FIEMAP:
> > +		/*
> > +		 * FS_IOC_FIEMAP queries information about the allocation of
> > +		 * blocks within a file.
> > +		 *
> > +		 * This IOCTL command only applies to regular files.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> 
> Same here.
> 
> > +	case FIGETBSZ:
> > +		/*
> > +		 * FIGETBSZ queries the file system's block size for a file or
> > +		 * directory.
> > +		 *
> > +		 * This command operates on the file system's superblock rather
> > +		 * than on the file itself.  The same operation can also be done
> > +		 * through any other file or directory on the same file system,
> > +		 * so it is safe to permit it.
> > +		 */
> > +		return 0;
> > +	case FICLONE:
> > +	case FICLONERANGE:
> > +	case FIDEDUPERANGE:
> > +		/*
> > +		 * FICLONE, FICLONERANGE and FIDEDUPERANGE make files share
> > +		 * their underlying storage ("reflink") between source and
> > +		 * destination FDs, on file systems which support that.
> > +		 *
> > +		 * These IOCTL commands only apply to regular files.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> 
> ditto
> 
> > +	case FIONREAD:
> > +		/*
> > +		 * FIONREAD returns the number of bytes available for reading.
> > +		 *
> > +		 * We require LANDLOCK_ACCESS_FS_IOCTL_DEV for FIONREAD, because
> > +		 * devices implement it in f_ops->unlocked_ioctl().  The
> > +		 * implementations of this operation have varying quality and
> > +		 * complexity, so it is hard to reason about what they do.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +	case FS_IOC_GETFLAGS:
> > +	case FS_IOC_SETFLAGS:
> > +	case FS_IOC_FSGETXATTR:
> > +	case FS_IOC_FSSETXATTR:
> > +		/*
> > +		 * FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_FSGETXATTR and
> > +		 * FS_IOC_FSSETXATTR do not apply for devices.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +	case FS_IOC_GETFSUUID:
> > +	case FS_IOC_GETFSSYSFSPATH:
> > +		/*
> > +		 * FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH both operate on
> > +		 * the file system superblock, not on the specific file, so
> > +		 * these operations are available through any other file on the
> > +		 * same file system as well.
> > +		 */
> > +		return 0;
> > +	case FIBMAP:
> > +	case FS_IOC_RESVSP:
> > +	case FS_IOC_RESVSP64:
> > +	case FS_IOC_UNRESVSP:
> > +	case FS_IOC_UNRESVSP64:
> > +	case FS_IOC_ZERO_RANGE:
> > +		/*
> > +		 * FIBMAP, FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP,
> > +		 * FS_IOC_UNRESVSP64 and FS_IOC_ZERO_RANGE only apply to regular
> > +		 * files (as implemented in file_ioctl()).
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +	default:
> > +		/*
> > +		 * Other commands are guarded by the catch-all access right.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +	}
> > +}
> > +
> >  /* Ruleset management */
> >  
> >  static struct landlock_object *get_inode_object(struct inode *const inode)
> > @@ -148,7 +286,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
> >  	LANDLOCK_ACCESS_FS_EXECUTE | \
> >  	LANDLOCK_ACCESS_FS_WRITE_FILE | \
> >  	LANDLOCK_ACCESS_FS_READ_FILE | \
> > -	LANDLOCK_ACCESS_FS_TRUNCATE)
> > +	LANDLOCK_ACCESS_FS_TRUNCATE | \
> > +	LANDLOCK_ACCESS_FS_IOCTL_DEV)
> >  /* clang-format on */
> >  
> >  /*
> > @@ -1335,8 +1474,10 @@ static int hook_file_alloc_security(struct file *const file)
> >  static int hook_file_open(struct file *const file)
> >  {
> >  	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
> > -	access_mask_t open_access_request, full_access_request, allowed_access;
> > -	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> > +	access_mask_t open_access_request, full_access_request, allowed_access,
> > +		optional_access;
> > +	const struct inode *inode = file_inode(file);
> > +	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
> >  	const struct landlock_ruleset *const dom =
> >  		get_fs_domain(landlock_cred(file->f_cred)->domain);
> >  
> > @@ -1354,6 +1495,10 @@ static int hook_file_open(struct file *const file)
> >  	 * We look up more access than what we immediately need for open(), so
> >  	 * that we can later authorize operations on opened files.
> >  	 */
> > +	optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> > +	if (is_device)
> > +		optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +
> >  	full_access_request = open_access_request | optional_access;
> >  
> >  	if (is_access_to_paths_allowed(
> > @@ -1410,6 +1555,36 @@ static int hook_file_truncate(struct file *const file)
> >  	return -EACCES;
> >  }
> >  
> > +static int hook_file_ioctl(struct file *file, unsigned int cmd,
> > +			   unsigned long arg)
> > +{
> > +	const struct inode *inode = file_inode(file);
> > +	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
> > +	access_mask_t required_access, allowed_access;
> 
> As explained in [2], I'd like not-sandboxed tasks to not have visible
> performance impact because of Landlock:
> 
>   We should first check landlock_file(file)->allowed_access as in
>   hook_file_truncate() to return as soon as possible for non-sandboxed
>   tasks.  Any other computation should be done after that (e.g. with an
>   is_device() helper).
> 
> [2] https://lore.kernel.org/r/20240311.If7ieshaegu2@digikod.net
> 
> This is_device(file) helper should also replace other is_device variables.

Done.

FWIW, I have doubts that it makes a performance difference - the is_device()
check is almost for free as well.  But we can pull the same check earlier for
consistency with the truncate hook, if it helps people to understand that their
own program performance should be unaffected.
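For what the reordering looks like, here is a stand-in sketch (toy types and a toy access bit, not the kernel hook): the per-file access mask is consulted before any inode/mode computation, so a task that already holds the right returns immediately:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

typedef unsigned short access_mask_t;       /* stand-in for the kernel type */
#define TOY_ACCESS_FS_IOCTL_DEV (1u << 15)  /* mirrors LANDLOCK_ACCESS_FS_IOCTL_DEV */

struct toy_file {
	access_mask_t allowed_access; /* stand-in for landlock_file(file)->allowed_access */
	bool is_device;               /* stand-in for S_ISBLK() || S_ISCHR() */
};

/* Reordered as requested: cheap mask check first, device check second.
 * A real hook would additionally permit the masked commands via an
 * is_masked_device_ioctl(cmd)-style helper before denying. */
int toy_hook_file_ioctl(const struct toy_file *file, unsigned int cmd)
{
	if (file->allowed_access & TOY_ACCESS_FS_IOCTL_DEV)
		return 0;          /* fast path: right already granted at open */
	if (!file->is_device)
		return 0;          /* non-device files are not restricted here */
	(void)cmd;
	return -EACCES;
}
```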

> 
> 
> > +
> > +	if (!is_device)
> > +		return 0;
> > +
> > +	/*
> > +	 * It is the access rights at the time of opening the file which
> > +	 * determine whether IOCTL can be used on the opened file later.
> > +	 *
> > +	 * The access right is attached to the opened file in hook_file_open().
> > +	 */
> > +	required_access = get_required_ioctl_dev_access(cmd);
> > +	allowed_access = landlock_file(file)->allowed_access;
> > +	if ((allowed_access & required_access) == required_access)
> > +		return 0;
> > +
> > +	return -EACCES;
> > +}
> > +
> > +static int hook_file_ioctl_compat(struct file *file, unsigned int cmd,
> > +				  unsigned long arg)
> > +{
> > +	return hook_file_ioctl(file, cmd, arg);
> 
> The compat-specific IOCTL commands are missing (e.g. FS_IOC_RESVSP_32).
> Relying on is_masked_device_ioctl() should make this call OK though.

OK, I'll try to replicate the logic from your vfs_masked_device_ioctl() approach
then?

—Günther

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged
  2024-03-30  0:43  5% ` [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged Darrick J. Wong
@ 2024-04-02 18:04  0%   ` Andrey Albershteyn
  2024-04-02 20:00  7%   ` Colin Walters
  1 sibling, 0 replies; 200+ results
From: Andrey Albershteyn @ 2024-04-02 18:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: ebiggers, linux-xfs, linux-fsdevel, fsverity

On 2024-03-29 17:43:22, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> There are more things that one can do with an open file descriptor on
> XFS -- query extended attributes, scan for metadata damage, repair
> metadata, etc.  None of this is possible if the fsverity metadata are
> damaged, because that prevents the file from being opened.
> 
> Ignore a selective set of error codes that we know fsverity_file_open to
> return if the verity descriptor is nonsense.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/iomap/buffered-io.c |    8 ++++++++
>  fs/xfs/xfs_file.c      |   19 ++++++++++++++++++-
>  2 files changed, 26 insertions(+), 1 deletion(-)
> 
> 

Looks good to me:
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>

-- 
- Andrey


^ permalink raw reply	[relevance 0%]

* [PATCH] cifs: Fix caching to try to do open O_WRONLY as rdwr on server
@ 2024-04-02  9:11  4% David Howells
  0 siblings, 0 replies; 200+ results
From: David Howells @ 2024-04-02  9:11 UTC (permalink / raw)
  To: Steve French
  Cc: dhowells, Shyam Prasad N, Rohith Surabattula, Jeff Layton,
	Naveen Mamindlapalli, linux-cifs, netfs, linux-fsdevel,
	linux-kernel

When we're engaged in local caching of a cifs filesystem, we cannot perform
caching of a partially written cache granule unless we can read the rest of
the granule.  This can result in unexpected access errors being reported to
the user.

Fix this by the following: if a file is opened O_WRONLY locally, but the
mount was given the "-o fsc" flag, try first opening the remote file with
GENERIC_READ|GENERIC_WRITE and if that returns -EACCES, try dropping the
GENERIC_READ and doing the open again.  If that last succeeds, invalidate
the cache for that file as for O_DIRECT.
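The fallback described above can be sketched in userspace as follows (toy access bits and a boolean in place of the real server round-trip; the 0/1/2 states mirror the patch's rdwr_for_fscache variable):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_GENERIC_READ  0x1  /* stand-in bits, not the SMB wire values */
#define TOY_GENERIC_WRITE 0x2

/* For a cached O_WRONLY open, first try GENERIC_READ|GENERIC_WRITE so the
 * cache can read back around partial writes; on -EACCES drop GENERIC_READ
 * and retry, remembering that the cache must then be invalidated. */
int toy_open_for_cache(bool fscache, bool wronly, bool server_denies_read,
		       bool *invalidate_cache)
{
	int rdwr_for_fscache = 0;
	int access = wronly ? TOY_GENERIC_WRITE
			    : TOY_GENERIC_READ | TOY_GENERIC_WRITE;
	int rc;

	*invalidate_cache = false;
	if (fscache && wronly) {
		rdwr_for_fscache = 1;
		access |= TOY_GENERIC_READ; /* widen for cache readback */
	}
retry_open:
	/* toy "server": denies any open that asks for read access */
	rc = (server_denies_read && (access & TOY_GENERIC_READ)) ? -EACCES : 0;
	if (rc == -EACCES && rdwr_for_fscache == 1) {
		access &= ~TOY_GENERIC_READ; /* fall back to plain O_WRONLY */
		rdwr_for_fscache = 2;
		goto retry_open;
	}
	if (rc == 0 && rdwr_for_fscache == 2)
		*invalidate_cache = true; /* as for O_DIRECT: cannot cache */
	return rc;
}
```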

Fixes: 70431bfd825d ("cifs: Support fscache indexing rewrite")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/smb/client/dir.c     |   15 +++++++++++++++
 fs/smb/client/file.c    |   48 ++++++++++++++++++++++++++++++++++++++----------
 fs/smb/client/fscache.h |    6 ++++++
 3 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index d11dc3aa458b..864b194dbaa0 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -189,6 +189,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	int disposition;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	*oplock = 0;
 	if (tcon->ses->server->oplocks)
@@ -200,6 +201,10 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		return PTR_ERR(full_path);
 	}
 
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (oflags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	if (tcon->unix_ext && cap_unix(tcon->ses) && !tcon->broken_posix_open &&
 	    (CIFS_UNIX_POSIX_PATH_OPS_CAP &
@@ -276,6 +281,8 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		desired_access |= GENERIC_READ; /* is this too little? */
 	if (OPEN_FMODE(oflags) & FMODE_WRITE)
 		desired_access |= GENERIC_WRITE;
+	if (rdwr_for_fscache == 1)
+		desired_access |= GENERIC_READ;
 
 	disposition = FILE_OVERWRITE_IF;
 	if ((oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
@@ -304,6 +311,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	if (!tcon->unix_ext && (mode & S_IWUGO) == 0)
 		create_options |= CREATE_OPTION_READONLY;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -317,8 +325,15 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	rc = server->ops->open(xid, &oparms, oplock, buf);
 	if (rc) {
 		cifs_dbg(FYI, "cifs_create returned 0x%x\n", rc);
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access &= ~GENERIC_READ;
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		goto out;
 	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	/*
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 59da572d3384..1541a4f6045d 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -206,12 +206,12 @@ cifs_mark_open_files_invalid(struct cifs_tcon *tcon)
 	 */
 }
 
-static inline int cifs_convert_flags(unsigned int flags)
+static inline int cifs_convert_flags(unsigned int flags, int rdwr_for_fscache)
 {
 	if ((flags & O_ACCMODE) == O_RDONLY)
 		return GENERIC_READ;
 	else if ((flags & O_ACCMODE) == O_WRONLY)
-		return GENERIC_WRITE;
+		return rdwr_for_fscache == 1 ? (GENERIC_READ | GENERIC_WRITE) : GENERIC_WRITE;
 	else if ((flags & O_ACCMODE) == O_RDWR) {
 		/* GENERIC_ALL is too much permission to request
 		   can cause unnecessary access denied on create */
@@ -348,11 +348,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	int create_options = CREATE_NOT_DIR;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	if (!server->ops->open)
 		return -ENOSYS;
 
-	desired_access = cifs_convert_flags(f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(f_flags, rdwr_for_fscache);
 
 /*********************************************************************
  *  open flag mapping table:
@@ -389,6 +394,7 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	if (f_flags & O_DIRECT)
 		create_options |= CREATE_NO_BUFFER;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -400,8 +406,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	};
 
 	rc = server->ops->open(xid, &oparms, oplock, buf);
-	if (rc)
+	if (rc) {
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access = cifs_convert_flags(f_flags, 0);
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		return rc;
+	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 	/* TODO: Add support for calling posix query info but with passing in fid */
 	if (tcon->unix_ext)
@@ -834,11 +848,11 @@ int cifs_open(struct inode *inode, struct file *file)
 use_cache:
 	fscache_use_cookie(cifs_inode_cookie(file_inode(file)),
 			   file->f_mode & FMODE_WRITE);
-	if (file->f_flags & O_DIRECT &&
-	    (!((file->f_flags & O_ACCMODE) != O_RDONLY) ||
-	     file->f_flags & O_APPEND))
-		cifs_invalidate_cache(file_inode(file),
-				      FSCACHE_INVAL_DIO_WRITE);
+	if (!(file->f_flags & O_DIRECT))
+		goto out;
+	if ((file->f_flags & (O_ACCMODE | O_APPEND)) == O_RDONLY)
+		goto out;
+	cifs_invalidate_cache(file_inode(file), FSCACHE_INVAL_DIO_WRITE);
 
 out:
 	free_dentry_path(page);
@@ -903,6 +917,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	int disposition = FILE_OPEN;
 	int create_options = CREATE_NOT_DIR;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	xid = get_xid();
 	mutex_lock(&cfile->fh_mutex);
@@ -966,7 +981,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	}
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
 
-	desired_access = cifs_convert_flags(cfile->f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (cfile->f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(cfile->f_flags, rdwr_for_fscache);
 
 	/* O_SYNC also has bit for O_DSYNC so following check picks up either */
 	if (cfile->f_flags & O_SYNC)
@@ -978,6 +997,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	if (server->ops->get_lease_key)
 		server->ops->get_lease_key(inode, &cfile->fid);
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -1003,6 +1023,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		/* indicate that we need to relock the file */
 		oparms.reconnect = true;
 	}
+	if (rc == -EACCES && rdwr_for_fscache == 1) {
+		desired_access = cifs_convert_flags(cfile->f_flags, 0);
+		rdwr_for_fscache = 2;
+		goto retry_open;
+	}
 
 	if (rc) {
 		mutex_unlock(&cfile->fh_mutex);
@@ -1011,6 +1036,9 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		goto reopen_error_exit;
 	}
 
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 reopen_success:
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
diff --git a/fs/smb/client/fscache.h b/fs/smb/client/fscache.h
index a3d73720914f..1f2ea9f5cc9a 100644
--- a/fs/smb/client/fscache.h
+++ b/fs/smb/client/fscache.h
@@ -109,6 +109,11 @@ static inline void cifs_readahead_to_fscache(struct inode *inode,
 		__cifs_readahead_to_fscache(inode, pos, len);
 }
 
+static inline bool cifs_fscache_enabled(struct inode *inode)
+{
+	return fscache_cookie_enabled(cifs_inode_cookie(inode));
+}
+
 #else /* CONFIG_CIFS_FSCACHE */
 static inline
 void cifs_fscache_fill_coherency(struct inode *inode,
@@ -124,6 +129,7 @@ static inline void cifs_fscache_release_inode_cookie(struct inode *inode) {}
 static inline void cifs_fscache_unuse_inode_cookie(struct inode *inode, bool update) {}
 static inline struct fscache_cookie *cifs_inode_cookie(struct inode *inode) { return NULL; }
 static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags) {}
+static inline bool cifs_fscache_enabled(struct inode *inode) { return false; }
 
 static inline int cifs_fscache_query_occupancy(struct inode *inode,
 					       pgoff_t first, unsigned int nr_pages,

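The rdwr_for_fscache variable threaded through the hunks above acts as a three-state flag: 0 (no upgrade needed), 1 (requesting GENERIC_READ|GENERIC_WRITE for an O_WRONLY open so the cache can fill around partial writes), and 2 (the server refused with -EACCES, so we retried write-only and must invalidate the cache as for O_DIRECT). A minimal userspace sketch of that state machine — the function names are invented for illustration and are not kernel APIs:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Three states of the rdwr_for_fscache flag, as used in the patch. */
enum { FSC_OFF = 0, FSC_TRY_RDWR = 1, FSC_FELL_BACK = 2 };

/* Initial state: only O_WRONLY opens on a cache-enabled inode are
 * upgraded to a read/write request on the server. */
static int fsc_initial_state(bool fscache_enabled, int f_flags)
{
	if (fscache_enabled && (f_flags & O_ACCMODE) == O_WRONLY)
		return FSC_TRY_RDWR;
	return FSC_OFF;
}

/* On -EACCES while in state 1: drop GENERIC_READ and retry once. */
static int fsc_on_eacces(int state)
{
	return state == FSC_TRY_RDWR ? FSC_FELL_BACK : state;
}

/* After a successful open in state 2 the local cache must be
 * invalidated, as for an O_DIRECT write. */
static bool fsc_must_invalidate(int state)
{
	return state == FSC_FELL_BACK;
}
```

The same transition logic appears three times in the patch (cifs_do_create, cifs_nt_open, cifs_reopen_file), each with its own retry_open label.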

^ permalink raw reply related	[relevance 4%]

* [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged
  @ 2024-03-30  0:43  5% ` Darrick J. Wong
  2024-04-02 18:04  0%   ` Andrey Albershteyn
  2024-04-02 20:00  7%   ` Colin Walters
  0 siblings, 2 replies; 200+ results
From: Darrick J. Wong @ 2024-03-30  0:43 UTC (permalink / raw)
  To: djwong, ebiggers, aalbersh; +Cc: linux-xfs, linux-fsdevel, fsverity

From: Darrick J. Wong <djwong@kernel.org>

There are more things that one can do with an open file descriptor on
XFS -- query extended attributes, scan for metadata damage, repair
metadata, etc.  None of this is possible if the fsverity metadata are
damaged, because that prevents the file from being opened.

Ignore a selective set of error codes that we know fsverity_file_open can
return when the verity descriptor is nonsense.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c |    8 ++++++++
 fs/xfs/xfs_file.c      |   19 ++++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)


diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9f9d929dfeebc..e68a15b72dbdd 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -487,6 +487,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 	size_t poff, plen;
 	sector_t sector;
 
+	/*
+	 * If this verity file hasn't been activated, fail read attempts.  This
+	 * can happen if the calling filesystem allows files to be opened even
+	 * with damaged verity metadata.
+	 */
+	if (IS_VERITY(iter->inode) && !fsverity_active(iter->inode))
+		return -EIO;
+
 	if (iomap->type == IOMAP_INLINE)
 		return iomap_read_inline_data(iter, folio);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c0b3e8146b753..36034eaefbf55 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1431,8 +1431,25 @@ xfs_file_open(
 			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
 
 	error = fsverity_file_open(inode, file);
-	if (error)
+	switch (error) {
+	case -EFBIG:
+	case -EINVAL:
+	case -EMSGSIZE:
+	case -EFSCORRUPTED:
+		/*
+		 * Be selective about which fsverity errors we propagate to
+		 * userspace; we still want to be able to open this file even
+		 * if reads don't work.  Someone might want to perform an
+		 * online repair.
+		 */
+		if (has_capability_noaudit(current, CAP_SYS_ADMIN))
+			break;
 		return error;
+	case 0:
+		break;
+	default:
+		return error;
+	}
 
 	return generic_file_open(inode, file);
 }


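The switch added to xfs_file_open() above filters a small, known set of fsverity open errors and forgives them only for a privileged caller, so an administrator can still open the file for online repair. A userspace sketch of that filter under stated assumptions: `is_admin` stands in for `has_capability_noaudit(current, CAP_SYS_ADMIN)`, and EFSCORRUPTED (a kernel-internal alias for EUCLEAN) is defined locally:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef EFSCORRUPTED
#define EFSCORRUPTED EUCLEAN	/* kernel-internal alias */
#endif

/* Sketch of the error filter in xfs_file_open(): forgive the errors that
 * fsverity_file_open() returns for a damaged verity descriptor, but only
 * for an admin; propagate everything else (e.g. -EIO) unchanged. */
static int filter_verity_open_error(int error, bool is_admin)
{
	switch (error) {
	case -EFBIG:
	case -EINVAL:
	case -EMSGSIZE:
	case -EFSCORRUPTED:
		/* Damaged verity metadata: let an admin open anyway. */
		return is_admin ? 0 : error;
	case 0:
		return 0;
	default:
		return error;
	}
}
```

Note that unforgiven errors fall through to the default case, so new error codes from fsverity_file_open() stay fatal by default.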

* RE: [PATCH v6 11/15] cifs: When caching, try to open O_WRONLY file rdwr on server
  2024-03-28 16:58  4% ` [PATCH v6 11/15] cifs: When caching, try to open O_WRONLY file rdwr on server David Howells
@ 2024-03-29  9:58  0%   ` Naveen Mamindlapalli
  0 siblings, 0 replies; 200+ results
From: Naveen Mamindlapalli @ 2024-03-29  9:58 UTC (permalink / raw)
  To: David Howells, Steve French
  Cc: Jeff Layton, Matthew Wilcox, Paulo Alcantara, Shyam Prasad N,
	Tom Talpey, Christian Brauner, netfs, linux-cifs, linux-fsdevel,
	linux-mm, netdev, linux-kernel, Steve French, Shyam Prasad N,
	Rohith Surabattula


> -----Original Message-----
> From: David Howells <dhowells@redhat.com>
> Sent: Thursday, March 28, 2024 10:28 PM
> To: Steve French <smfrench@gmail.com>
> Cc: David Howells <dhowells@redhat.com>; Jeff Layton <jlayton@kernel.org>;
> Matthew Wilcox <willy@infradead.org>; Paulo Alcantara <pc@manguebit.com>;
> Shyam Prasad N <sprasad@microsoft.com>; Tom Talpey <tom@talpey.com>;
> Christian Brauner <christian@brauner.io>; netfs@lists.linux.dev; linux-
> cifs@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Steve French
> <sfrench@samba.org>; Shyam Prasad N <nspmangalore@gmail.com>; Rohith
> Surabattula <rohiths.msft@gmail.com>
> Subject: [PATCH v6 11/15] cifs: When caching, try to open
> O_WRONLY file rdwr on server
> 
> When we're engaged in local caching of a cifs filesystem, we cannot perform
> caching of a partially written cache granule unless we can read the rest of the
> granule.  To deal with this, if a file is opened O_WRONLY locally, but the mount
> was given the "-o fsc" flag, try first opening the remote file with
> GENERIC_READ|GENERIC_WRITE and if that returns -EACCES, try dropping
> the GENERIC_READ and doing the open again.  If that last succeeds, invalidate
> the cache for that file as for O_DIRECT.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Steve French <sfrench@samba.org>
> cc: Shyam Prasad N <nspmangalore@gmail.com>
> cc: Rohith Surabattula <rohiths.msft@gmail.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: linux-cifs@vger.kernel.org
> cc: netfs@lists.linux.dev
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>  fs/smb/client/dir.c     | 15 ++++++++++++
>  fs/smb/client/file.c    | 51 +++++++++++++++++++++++++++++++++--------
>  fs/smb/client/fscache.h |  6 +++++
>  3 files changed, 62 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c index
> 89333d9bce36..37897b919dd5 100644
> --- a/fs/smb/client/dir.c
> +++ b/fs/smb/client/dir.c
> @@ -189,6 +189,7 @@ static int cifs_do_create(struct inode *inode, struct dentry
> *direntry, unsigned
>  	int disposition;
>  	struct TCP_Server_Info *server = tcon->ses->server;
>  	struct cifs_open_parms oparms;
> +	int rdwr_for_fscache = 0;
> 
>  	*oplock = 0;
>  	if (tcon->ses->server->oplocks)
> @@ -200,6 +201,10 @@ static int cifs_do_create(struct inode *inode, struct
> dentry *direntry, unsigned
>  		return PTR_ERR(full_path);
>  	}
> 
> +	/* If we're caching, we need to be able to fill in around partial writes. */
> +	if (cifs_fscache_enabled(inode) && (oflags & O_ACCMODE) ==
> O_WRONLY)
> +		rdwr_for_fscache = 1;
> +
>  #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
>  	if (tcon->unix_ext && cap_unix(tcon->ses) && !tcon->broken_posix_open
> &&
>  	    (CIFS_UNIX_POSIX_PATH_OPS_CAP &
> @@ -276,6 +281,8 @@ static int cifs_do_create(struct inode *inode, struct dentry
> *direntry, unsigned
>  		desired_access |= GENERIC_READ; /* is this too little? */
>  	if (OPEN_FMODE(oflags) & FMODE_WRITE)
>  		desired_access |= GENERIC_WRITE;
> +	if (rdwr_for_fscache == 1)
> +		desired_access |= GENERIC_READ;
> 
>  	disposition = FILE_OVERWRITE_IF;
>  	if ((oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL)) @@ -
> 304,6 +311,7 @@ static int cifs_do_create(struct inode *inode, struct dentry
> *direntry, unsigned
>  	if (!tcon->unix_ext && (mode & S_IWUGO) == 0)
>  		create_options |= CREATE_OPTION_READONLY;
> 
> +retry_open:
>  	oparms = (struct cifs_open_parms) {
>  		.tcon = tcon,
>  		.cifs_sb = cifs_sb,
> @@ -317,8 +325,15 @@ static int cifs_do_create(struct inode *inode, struct
> dentry *direntry, unsigned
>  	rc = server->ops->open(xid, &oparms, oplock, buf);
>  	if (rc) {
>  		cifs_dbg(FYI, "cifs_create returned 0x%x\n", rc);
> +		if (rc == -EACCES && rdwr_for_fscache == 1) {
> +			desired_access &= ~GENERIC_READ;
> +			rdwr_for_fscache = 2;
> +			goto retry_open;
> +		}
>  		goto out;
>  	}
> +	if (rdwr_for_fscache == 2)
> +		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
> 
>  #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
>  	/*
> diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c index
> 73573dadf90e..761a80963f76 100644
> --- a/fs/smb/client/file.c
> +++ b/fs/smb/client/file.c
> @@ -521,12 +521,12 @@ cifs_mark_open_files_invalid(struct cifs_tcon *tcon)
>  	 */
>  }
> 
> -static inline int cifs_convert_flags(unsigned int flags)
> +static inline int cifs_convert_flags(unsigned int flags, int
> +rdwr_for_fscache)
>  {
>  	if ((flags & O_ACCMODE) == O_RDONLY)
>  		return GENERIC_READ;
>  	else if ((flags & O_ACCMODE) == O_WRONLY)
> -		return GENERIC_WRITE;
> +		return rdwr_for_fscache == 1 ? (GENERIC_READ |
> GENERIC_WRITE) :
> +GENERIC_WRITE;
>  	else if ((flags & O_ACCMODE) == O_RDWR) {
>  		/* GENERIC_ALL is too much permission to request
>  		   can cause unnecessary access denied on create */ @@ -
> 663,11 +663,16 @@ static int cifs_nt_open(const char *full_path, struct inode
> *inode, struct cifs_
>  	int create_options = CREATE_NOT_DIR;
>  	struct TCP_Server_Info *server = tcon->ses->server;
>  	struct cifs_open_parms oparms;
> +	int rdwr_for_fscache = 0;
> 
>  	if (!server->ops->open)
>  		return -ENOSYS;
> 
> -	desired_access = cifs_convert_flags(f_flags);
> +	/* If we're caching, we need to be able to fill in around partial writes. */
> +	if (cifs_fscache_enabled(inode) && (f_flags & O_ACCMODE) ==
> O_WRONLY)
> +		rdwr_for_fscache = 1;
> +
> +	desired_access = cifs_convert_flags(f_flags, rdwr_for_fscache);
> 
>  /*********************************************************************
>   *  open flag mapping table:
> @@ -704,6 +709,7 @@ static int cifs_nt_open(const char *full_path, struct inode
> *inode, struct cifs_
>  	if (f_flags & O_DIRECT)
>  		create_options |= CREATE_NO_BUFFER;
> 
> +retry_open:
>  	oparms = (struct cifs_open_parms) {
>  		.tcon = tcon,
>  		.cifs_sb = cifs_sb,
> @@ -715,8 +721,16 @@ static int cifs_nt_open(const char *full_path, struct inode
> *inode, struct cifs_
>  	};
> 
>  	rc = server->ops->open(xid, &oparms, oplock, buf);
> -	if (rc)
> +	if (rc) {
> +		if (rc == -EACCES && rdwr_for_fscache == 1) {
> +			desired_access = cifs_convert_flags(f_flags, 0);
> +			rdwr_for_fscache = 2;
> +			goto retry_open;
> +		}
>  		return rc;
> +	}
> +	if (rdwr_for_fscache == 2)
> +		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
> 
>  	/* TODO: Add support for calling posix query info but with passing in fid */
>  	if (tcon->unix_ext)
> @@ -1149,11 +1163,14 @@ int cifs_open(struct inode *inode, struct file *file)
>  use_cache:
>  	fscache_use_cookie(cifs_inode_cookie(file_inode(file)),
>  			   file->f_mode & FMODE_WRITE);
> -	if (file->f_flags & O_DIRECT &&
> -	    (!((file->f_flags & O_ACCMODE) != O_RDONLY) ||
> -	     file->f_flags & O_APPEND))
> -		cifs_invalidate_cache(file_inode(file),
> -				      FSCACHE_INVAL_DIO_WRITE);
> +	//if ((file->f_flags & O_ACCMODE) == O_WRONLY)
> +	//	goto inval;

Why keep the unused code?

Thanks,
Naveen

> +	if (!(file->f_flags & O_DIRECT))
> +		goto out;
> +	if ((file->f_flags & (O_ACCMODE | O_APPEND)) == O_RDONLY)
> +		goto out;
> +//inval:
> +	cifs_invalidate_cache(file_inode(file), FSCACHE_INVAL_DIO_WRITE);
> 
>  out:
>  	free_dentry_path(page);
> @@ -1218,6 +1235,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool
> can_flush)
>  	int disposition = FILE_OPEN;
>  	int create_options = CREATE_NOT_DIR;
>  	struct cifs_open_parms oparms;
> +	int rdwr_for_fscache = 0;
> 
>  	xid = get_xid();
>  	mutex_lock(&cfile->fh_mutex);
> @@ -1281,7 +1299,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool
> can_flush)
>  	}
>  #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
> 
> -	desired_access = cifs_convert_flags(cfile->f_flags);
> +	/* If we're caching, we need to be able to fill in around partial writes. */
> +	if (cifs_fscache_enabled(inode) && (cfile->f_flags & O_ACCMODE) ==
> O_WRONLY)
> +		rdwr_for_fscache = 1;
> +
> +	desired_access = cifs_convert_flags(cfile->f_flags, rdwr_for_fscache);
> 
>  	/* O_SYNC also has bit for O_DSYNC so following check picks up either
> */
>  	if (cfile->f_flags & O_SYNC)
> @@ -1293,6 +1315,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool
> can_flush)
>  	if (server->ops->get_lease_key)
>  		server->ops->get_lease_key(inode, &cfile->fid);
> 
> +retry_open:
>  	oparms = (struct cifs_open_parms) {
>  		.tcon = tcon,
>  		.cifs_sb = cifs_sb,
> @@ -1318,6 +1341,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool
> can_flush)
>  		/* indicate that we need to relock the file */
>  		oparms.reconnect = true;
>  	}
> +	if (rc == -EACCES && rdwr_for_fscache == 1) {
> +		desired_access = cifs_convert_flags(cfile->f_flags, 0);
> +		rdwr_for_fscache = 2;
> +		goto retry_open;
> +	}
> 
>  	if (rc) {
>  		mutex_unlock(&cfile->fh_mutex);
> @@ -1326,6 +1354,9 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool
> can_flush)
>  		goto reopen_error_exit;
>  	}
> 
> +	if (rdwr_for_fscache == 2)
> +		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
> +
>  #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
>  reopen_success:
>  #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */ diff --git
> a/fs/smb/client/fscache.h b/fs/smb/client/fscache.h index
> a3d73720914f..1f2ea9f5cc9a 100644
> --- a/fs/smb/client/fscache.h
> +++ b/fs/smb/client/fscache.h
> @@ -109,6 +109,11 @@ static inline void cifs_readahead_to_fscache(struct
> inode *inode,
>  		__cifs_readahead_to_fscache(inode, pos, len);  }
> 
> +static inline bool cifs_fscache_enabled(struct inode *inode) {
> +	return fscache_cookie_enabled(cifs_inode_cookie(inode));
> +}
> +
>  #else /* CONFIG_CIFS_FSCACHE */
>  static inline
>  void cifs_fscache_fill_coherency(struct inode *inode, @@ -124,6 +129,7 @@
> static inline void cifs_fscache_release_inode_cookie(struct inode *inode) {}  static
> inline void cifs_fscache_unuse_inode_cookie(struct inode *inode, bool update) {}
> static inline struct fscache_cookie *cifs_inode_cookie(struct inode *inode) { return
> NULL; }  static inline void cifs_invalidate_cache(struct inode *inode, unsigned int
> flags) {}
> +static inline bool cifs_fscache_enabled(struct inode *inode) { return
> +false; }
> 
>  static inline int cifs_fscache_query_occupancy(struct inode *inode,
>  					       pgoff_t first, unsigned int nr_pages,
> 



* [PATCH v6 13/15] cifs: Remove some code that's no longer used, part 1
    2024-03-28 16:58  4% ` [PATCH v6 11/15] cifs: When caching, try to open O_WRONLY file rdwr on server David Howells
@ 2024-03-28 16:58  2% ` David Howells
  1 sibling, 0 replies; 200+ results
From: David Howells @ 2024-03-28 16:58 UTC (permalink / raw)
  To: Steve French
  Cc: David Howells, Jeff Layton, Matthew Wilcox, Paulo Alcantara,
	Shyam Prasad N, Tom Talpey, Christian Brauner, netfs, linux-cifs,
	linux-fsdevel, linux-mm, netdev, linux-kernel, Steve French,
	Shyam Prasad N, Rohith Surabattula

Remove some code that was #if'd out by the netfslib conversion.  The
removal is split into parts for file.c, as the diff generator otherwise
produces a hard-to-read diff for the part where a big chunk is cut out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/smb/client/cifsglob.h  |  12 -
 fs/smb/client/cifsproto.h |  25 --
 fs/smb/client/file.c      | 619 --------------------------------------
 fs/smb/client/fscache.c   | 111 -------
 fs/smb/client/fscache.h   |  58 ----
 5 files changed, 825 deletions(-)

diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h
index 639cdeb3f77e..94885bf86ff2 100644
--- a/fs/smb/client/cifsglob.h
+++ b/fs/smb/client/cifsglob.h
@@ -1510,18 +1510,6 @@ struct cifs_io_subrequest {
 	struct smbd_mr			*mr;
 #endif
 	struct cifs_credits		credits;
-
-#if 0 // TODO: Remove following elements
-	struct list_head		list;
-	struct completion		done;
-	struct work_struct		work;
-	struct cifsFileInfo		*cfile;
-	struct address_space		*mapping;
-	struct cifs_aio_ctx		*ctx;
-	enum writeback_sync_modes	sync_mode;
-	bool				uncached;
-	struct bio_vec			*bv;
-#endif
 };
 
 /*
diff --git a/fs/smb/client/cifsproto.h b/fs/smb/client/cifsproto.h
index e0ccf32d7ecd..57ec67cdc31e 100644
--- a/fs/smb/client/cifsproto.h
+++ b/fs/smb/client/cifsproto.h
@@ -600,36 +600,11 @@ void __cifs_put_smb_ses(struct cifs_ses *ses);
 extern struct cifs_ses *
 cifs_get_smb_ses(struct TCP_Server_Info *server, struct smb3_fs_context *ctx);
 
-#if 0 // TODO Remove
-void cifs_readdata_release(struct cifs_io_subrequest *rdata);
-static inline void cifs_get_readdata(struct cifs_io_subrequest *rdata)
-{
-	refcount_inc(&rdata->subreq.ref);
-}
-static inline void cifs_put_readdata(struct cifs_io_subrequest *rdata)
-{
-	if (refcount_dec_and_test(&rdata->subreq.ref))
-		cifs_readdata_release(rdata);
-}
-#endif
 int cifs_async_readv(struct cifs_io_subrequest *rdata);
 int cifs_readv_receive(struct TCP_Server_Info *server, struct mid_q_entry *mid);
 
 void cifs_async_writev(struct cifs_io_subrequest *wdata);
 void cifs_writev_complete(struct work_struct *work);
-#if 0 // TODO Remove
-struct cifs_io_subrequest *cifs_writedata_alloc(work_func_t complete);
-void cifs_writedata_release(struct cifs_io_subrequest *rdata);
-static inline void cifs_get_writedata(struct cifs_io_subrequest *wdata)
-{
-	refcount_inc(&wdata->subreq.ref);
-}
-static inline void cifs_put_writedata(struct cifs_io_subrequest *wdata)
-{
-	if (refcount_dec_and_test(&wdata->subreq.ref))
-		cifs_writedata_release(wdata);
-}
-#endif
 int cifs_query_mf_symlink(unsigned int xid, struct cifs_tcon *tcon,
 			  struct cifs_sb_info *cifs_sb,
 			  const unsigned char *path, char *pbuf,
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index c57a3638c51a..265d96f663d7 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -352,133 +352,6 @@ const struct netfs_request_ops cifs_req_ops = {
 	.issue_write		= cifs_issue_write,
 };
 
-#if 0 // TODO remove 397
-/*
- * Remove the dirty flags from a span of pages.
- */
-static void cifs_undirty_folios(struct inode *inode, loff_t start, unsigned int len)
-{
-	struct address_space *mapping = inode->i_mapping;
-	struct folio *folio;
-	pgoff_t end;
-
-	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
-
-	rcu_read_lock();
-
-	end = (start + len - 1) / PAGE_SIZE;
-	xas_for_each_marked(&xas, folio, end, PAGECACHE_TAG_DIRTY) {
-		if (xas_retry(&xas, folio))
-			continue;
-		xas_pause(&xas);
-		rcu_read_unlock();
-		folio_lock(folio);
-		folio_clear_dirty_for_io(folio);
-		folio_unlock(folio);
-		rcu_read_lock();
-	}
-
-	rcu_read_unlock();
-}
-
-/*
- * Completion of write to server.
- */
-void cifs_pages_written_back(struct inode *inode, loff_t start, unsigned int len)
-{
-	struct address_space *mapping = inode->i_mapping;
-	struct folio *folio;
-	pgoff_t end;
-
-	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
-
-	if (!len)
-		return;
-
-	rcu_read_lock();
-
-	end = (start + len - 1) / PAGE_SIZE;
-	xas_for_each(&xas, folio, end) {
-		if (xas_retry(&xas, folio))
-			continue;
-		if (!folio_test_writeback(folio)) {
-			WARN_ONCE(1, "bad %x @%llx page %lx %lx\n",
-				  len, start, folio->index, end);
-			continue;
-		}
-
-		folio_detach_private(folio);
-		folio_end_writeback(folio);
-	}
-
-	rcu_read_unlock();
-}
-
-/*
- * Failure of write to server.
- */
-void cifs_pages_write_failed(struct inode *inode, loff_t start, unsigned int len)
-{
-	struct address_space *mapping = inode->i_mapping;
-	struct folio *folio;
-	pgoff_t end;
-
-	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
-
-	if (!len)
-		return;
-
-	rcu_read_lock();
-
-	end = (start + len - 1) / PAGE_SIZE;
-	xas_for_each(&xas, folio, end) {
-		if (xas_retry(&xas, folio))
-			continue;
-		if (!folio_test_writeback(folio)) {
-			WARN_ONCE(1, "bad %x @%llx page %lx %lx\n",
-				  len, start, folio->index, end);
-			continue;
-		}
-
-		folio_set_error(folio);
-		folio_end_writeback(folio);
-	}
-
-	rcu_read_unlock();
-}
-
-/*
- * Redirty pages after a temporary failure.
- */
-void cifs_pages_write_redirty(struct inode *inode, loff_t start, unsigned int len)
-{
-	struct address_space *mapping = inode->i_mapping;
-	struct folio *folio;
-	pgoff_t end;
-
-	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
-
-	if (!len)
-		return;
-
-	rcu_read_lock();
-
-	end = (start + len - 1) / PAGE_SIZE;
-	xas_for_each(&xas, folio, end) {
-		if (!folio_test_writeback(folio)) {
-			WARN_ONCE(1, "bad %x @%llx page %lx %lx\n",
-				  len, start, folio->index, end);
-			continue;
-		}
-
-		filemap_dirty_folio(folio->mapping, folio);
-		folio_end_writeback(folio);
-	}
-
-	rcu_read_unlock();
-}
-#endif // end netfslib remove 397
-
 /*
  * Mark as invalid, all open files on tree connections since they
  * were closed when session to server was lost.
@@ -2492,92 +2365,6 @@ void cifs_write_subrequest_terminated(struct cifs_io_subrequest *wdata, ssize_t
 	netfs_write_subrequest_terminated(&wdata->subreq, result, was_async);
 }
 
-#if 0 // TODO remove 2483
-static ssize_t
-cifs_write(struct cifsFileInfo *open_file, __u32 pid, const char *write_data,
-	   size_t write_size, loff_t *offset)
-{
-	int rc = 0;
-	unsigned int bytes_written = 0;
-	unsigned int total_written;
-	struct cifs_tcon *tcon;
-	struct TCP_Server_Info *server;
-	unsigned int xid;
-	struct dentry *dentry = open_file->dentry;
-	struct cifsInodeInfo *cifsi = CIFS_I(d_inode(dentry));
-	struct cifs_io_parms io_parms = {0};
-
-	cifs_dbg(FYI, "write %zd bytes to offset %lld of %pd\n",
-		 write_size, *offset, dentry);
-
-	tcon = tlink_tcon(open_file->tlink);
-	server = tcon->ses->server;
-
-	if (!server->ops->sync_write)
-		return -ENOSYS;
-
-	xid = get_xid();
-
-	for (total_written = 0; write_size > total_written;
-	     total_written += bytes_written) {
-		rc = -EAGAIN;
-		while (rc == -EAGAIN) {
-			struct kvec iov[2];
-			unsigned int len;
-
-			if (open_file->invalidHandle) {
-				/* we could deadlock if we called
-				   filemap_fdatawait from here so tell
-				   reopen_file not to flush data to
-				   server now */
-				rc = cifs_reopen_file(open_file, false);
-				if (rc != 0)
-					break;
-			}
-
-			len = min(server->ops->wp_retry_size(d_inode(dentry)),
-				  (unsigned int)write_size - total_written);
-			/* iov[0] is reserved for smb header */
-			iov[1].iov_base = (char *)write_data + total_written;
-			iov[1].iov_len = len;
-			io_parms.pid = pid;
-			io_parms.tcon = tcon;
-			io_parms.offset = *offset;
-			io_parms.length = len;
-			rc = server->ops->sync_write(xid, &open_file->fid,
-					&io_parms, &bytes_written, iov, 1);
-		}
-		if (rc || (bytes_written == 0)) {
-			if (total_written)
-				break;
-			else {
-				free_xid(xid);
-				return rc;
-			}
-		} else {
-			spin_lock(&d_inode(dentry)->i_lock);
-			cifs_update_eof(cifsi, *offset, bytes_written);
-			spin_unlock(&d_inode(dentry)->i_lock);
-			*offset += bytes_written;
-		}
-	}
-
-	cifs_stats_bytes_written(tcon, total_written);
-
-	if (total_written > 0) {
-		spin_lock(&d_inode(dentry)->i_lock);
-		if (*offset > d_inode(dentry)->i_size) {
-			i_size_write(d_inode(dentry), *offset);
-			d_inode(dentry)->i_blocks = (512 - 1 + *offset) >> 9;
-		}
-		spin_unlock(&d_inode(dentry)->i_lock);
-	}
-	mark_inode_dirty_sync(d_inode(dentry));
-	free_xid(xid);
-	return total_written;
-}
-#endif // end netfslib remove 2483
-
 struct cifsFileInfo *find_readable_file(struct cifsInodeInfo *cifs_inode,
 					bool fsuid_only)
 {
@@ -4778,293 +4565,6 @@ int cifs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	return rc;
 }
 
-#if 0 // TODO remove 4794
-/*
- * Unlock a bunch of folios in the pagecache.
- */
-static void cifs_unlock_folios(struct address_space *mapping, pgoff_t first, pgoff_t last)
-{
-	struct folio *folio;
-	XA_STATE(xas, &mapping->i_pages, first);
-
-	rcu_read_lock();
-	xas_for_each(&xas, folio, last) {
-		folio_unlock(folio);
-	}
-	rcu_read_unlock();
-}
-
-static void cifs_readahead_complete(struct work_struct *work)
-{
-	struct cifs_io_subrequest *rdata = container_of(work,
-							struct cifs_io_subrequest, work);
-	struct folio *folio;
-	pgoff_t last;
-	bool good = rdata->result == 0 || (rdata->result == -EAGAIN && rdata->got_bytes);
-
-	XA_STATE(xas, &rdata->mapping->i_pages, rdata->subreq.start / PAGE_SIZE);
-
-	if (good)
-		cifs_readahead_to_fscache(rdata->mapping->host,
-					  rdata->subreq.start, rdata->subreq.len);
-
-	if (iov_iter_count(&rdata->subreq.io_iter) > 0)
-		iov_iter_zero(iov_iter_count(&rdata->subreq.io_iter), &rdata->subreq.io_iter);
-
-	last = (rdata->subreq.start + rdata->subreq.len - 1) / PAGE_SIZE;
-
-	rcu_read_lock();
-	xas_for_each(&xas, folio, last) {
-		if (good) {
-			flush_dcache_folio(folio);
-			folio_mark_uptodate(folio);
-		}
-		folio_unlock(folio);
-	}
-	rcu_read_unlock();
-
-	cifs_put_readdata(rdata);
-}
-
-static void cifs_readahead(struct readahead_control *ractl)
-{
-	struct cifsFileInfo *open_file = ractl->file->private_data;
-	struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(ractl->file);
-	struct TCP_Server_Info *server;
-	unsigned int xid, nr_pages, cache_nr_pages = 0;
-	unsigned int ra_pages;
-	pgoff_t next_cached = ULONG_MAX, ra_index;
-	bool caching = fscache_cookie_enabled(cifs_inode_cookie(ractl->mapping->host)) &&
-		cifs_inode_cookie(ractl->mapping->host)->cache_priv;
-	bool check_cache = caching;
-	pid_t pid;
-	int rc = 0;
-
-	/* Note that readahead_count() lags behind our dequeuing of pages from
-	 * the ractl, wo we have to keep track for ourselves.
-	 */
-	ra_pages = readahead_count(ractl);
-	ra_index = readahead_index(ractl);
-
-	xid = get_xid();
-
-	if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
-		pid = open_file->pid;
-	else
-		pid = current->tgid;
-
-	server = cifs_pick_channel(tlink_tcon(open_file->tlink)->ses);
-
-	cifs_dbg(FYI, "%s: file=%p mapping=%p num_pages=%u\n",
-		 __func__, ractl->file, ractl->mapping, ra_pages);
-
-	/*
-	 * Chop the readahead request up into rsize-sized read requests.
-	 */
-	while ((nr_pages = ra_pages)) {
-		unsigned int i;
-		struct cifs_io_subrequest *rdata;
-		struct cifs_credits credits_on_stack;
-		struct cifs_credits *credits = &credits_on_stack;
-		struct folio *folio;
-		pgoff_t fsize;
-		size_t rsize;
-
-		/*
-		 * Find out if we have anything cached in the range of
-		 * interest, and if so, where the next chunk of cached data is.
-		 */
-		if (caching) {
-			if (check_cache) {
-				rc = cifs_fscache_query_occupancy(
-					ractl->mapping->host, ra_index, nr_pages,
-					&next_cached, &cache_nr_pages);
-				if (rc < 0)
-					caching = false;
-				check_cache = false;
-			}
-
-			if (ra_index == next_cached) {
-				/*
-				 * TODO: Send a whole batch of pages to be read
-				 * by the cache.
-				 */
-				folio = readahead_folio(ractl);
-				fsize = folio_nr_pages(folio);
-				ra_pages -= fsize;
-				ra_index += fsize;
-				if (cifs_readpage_from_fscache(ractl->mapping->host,
-							       &folio->page) < 0) {
-					/*
-					 * TODO: Deal with cache read failure
-					 * here, but for the moment, delegate
-					 * that to readpage.
-					 */
-					caching = false;
-				}
-				folio_unlock(folio);
-				next_cached += fsize;
-				cache_nr_pages -= fsize;
-				if (cache_nr_pages == 0)
-					check_cache = true;
-				continue;
-			}
-		}
-
-		if (open_file->invalidHandle) {
-			rc = cifs_reopen_file(open_file, true);
-			if (rc) {
-				if (rc == -EAGAIN)
-					continue;
-				break;
-			}
-		}
-
-		if (cifs_sb->ctx->rsize == 0)
-			cifs_sb->ctx->rsize =
-				server->ops->negotiate_rsize(tlink_tcon(open_file->tlink),
-							     cifs_sb->ctx);
-
-		rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize,
-						   &rsize, credits);
-		if (rc)
-			break;
-		nr_pages = min_t(size_t, rsize / PAGE_SIZE, ra_pages);
-		if (next_cached != ULONG_MAX)
-			nr_pages = min_t(size_t, nr_pages, next_cached - ra_index);
-
-		/*
-		 * Give up immediately if rsize is too small to read an entire
-		 * page. The VFS will fall back to readpage. We should never
-		 * reach this point however since we set ra_pages to 0 when the
-		 * rsize is smaller than a cache page.
-		 */
-		if (unlikely(!nr_pages)) {
-			add_credits_and_wake_if(server, credits, 0);
-			break;
-		}
-
-		rdata = cifs_readdata_alloc(cifs_readahead_complete);
-		if (!rdata) {
-			/* best to give up if we're out of mem */
-			add_credits_and_wake_if(server, credits, 0);
-			break;
-		}
-
-		rdata->subreq.start	= ra_index * PAGE_SIZE;
-		rdata->subreq.len	= nr_pages * PAGE_SIZE;
-		rdata->cfile	= cifsFileInfo_get(open_file);
-		rdata->server	= server;
-		rdata->mapping	= ractl->mapping;
-		rdata->pid	= pid;
-		rdata->credits	= credits_on_stack;
-
-		for (i = 0; i < nr_pages; i++) {
-			if (!readahead_folio(ractl))
-				WARN_ON(1);
-		}
-		ra_pages -= nr_pages;
-		ra_index += nr_pages;
-
-		iov_iter_xarray(&rdata->subreq.io_iter, ITER_DEST, &rdata->mapping->i_pages,
-				rdata->subreq.start, rdata->subreq.len);
-
-		rc = adjust_credits(server, &rdata->credits, rdata->subreq.len);
-		if (!rc) {
-			if (rdata->cfile->invalidHandle)
-				rc = -EAGAIN;
-			else
-				rc = server->ops->async_readv(rdata);
-		}
-
-		if (rc) {
-			add_credits_and_wake_if(server, &rdata->credits, 0);
-			cifs_unlock_folios(rdata->mapping,
-					   rdata->subreq.start / PAGE_SIZE,
-					   (rdata->subreq.start + rdata->subreq.len - 1) / PAGE_SIZE);
-			/* Fallback to the readpage in error/reconnect cases */
-			cifs_put_readdata(rdata);
-			break;
-		}
-
-		cifs_put_readdata(rdata);
-	}
-
-	free_xid(xid);
-}
-
-/*
- * cifs_readpage_worker must be called with the page pinned
- */
-static int cifs_readpage_worker(struct file *file, struct page *page,
-	loff_t *poffset)
-{
-	struct inode *inode = file_inode(file);
-	struct timespec64 atime, mtime;
-	char *read_data;
-	int rc;
-
-	/* Is the page cached? */
-	rc = cifs_readpage_from_fscache(inode, page);
-	if (rc == 0)
-		goto read_complete;
-
-	read_data = kmap(page);
-	/* for reads over a certain size could initiate async read ahead */
-
-	rc = cifs_read(file, read_data, PAGE_SIZE, poffset);
-
-	if (rc < 0)
-		goto io_error;
-	else
-		cifs_dbg(FYI, "Bytes read %d\n", rc);
-
-	/* we do not want atime to be less than mtime, it broke some apps */
-	atime = inode_set_atime_to_ts(inode, current_time(inode));
-	mtime = inode_get_mtime(inode);
-	if (timespec64_compare(&atime, &mtime) < 0)
-		inode_set_atime_to_ts(inode, inode_get_mtime(inode));
-
-	if (PAGE_SIZE > rc)
-		memset(read_data + rc, 0, PAGE_SIZE - rc);
-
-	flush_dcache_page(page);
-	SetPageUptodate(page);
-	rc = 0;
-
-io_error:
-	kunmap(page);
-
-read_complete:
-	unlock_page(page);
-	return rc;
-}
-
-static int cifs_read_folio(struct file *file, struct folio *folio)
-{
-	struct page *page = &folio->page;
-	loff_t offset = page_file_offset(page);
-	int rc = -EACCES;
-	unsigned int xid;
-
-	xid = get_xid();
-
-	if (file->private_data == NULL) {
-		rc = -EBADF;
-		free_xid(xid);
-		return rc;
-	}
-
-	cifs_dbg(FYI, "read_folio %p at offset %d 0x%x\n",
-		 page, (int)offset, (int)offset);
-
-	rc = cifs_readpage_worker(file, page, &offset);
-
-	free_xid(xid);
-	return rc;
-}
-#endif // end netfslib remove 4794
-
 static int is_inode_writable(struct cifsInodeInfo *cifs_inode)
 {
 	struct cifsFileInfo *open_file;
@@ -5112,104 +4612,6 @@ bool is_size_safe_to_change(struct cifsInodeInfo *cifsInode, __u64 end_of_file,
 		return true;
 }
 
-#if 0 // TODO remove 5152
-static int cifs_write_begin(struct file *file, struct address_space *mapping,
-			loff_t pos, unsigned len,
-			struct page **pagep, void **fsdata)
-{
-	int oncethru = 0;
-	pgoff_t index = pos >> PAGE_SHIFT;
-	loff_t offset = pos & (PAGE_SIZE - 1);
-	loff_t page_start = pos & PAGE_MASK;
-	loff_t i_size;
-	struct page *page;
-	int rc = 0;
-
-	cifs_dbg(FYI, "write_begin from %lld len %d\n", (long long)pos, len);
-
-start:
-	page = grab_cache_page_write_begin(mapping, index);
-	if (!page) {
-		rc = -ENOMEM;
-		goto out;
-	}
-
-	if (PageUptodate(page))
-		goto out;
-
-	/*
-	 * If we write a full page it will be up to date, no need to read from
-	 * the server. If the write is short, we'll end up doing a sync write
-	 * instead.
-	 */
-	if (len == PAGE_SIZE)
-		goto out;
-
-	/*
-	 * optimize away the read when we have an oplock, and we're not
-	 * expecting to use any of the data we'd be reading in. That
-	 * is, when the page lies beyond the EOF, or straddles the EOF
-	 * and the write will cover all of the existing data.
-	 */
-	if (CIFS_CACHE_READ(CIFS_I(mapping->host))) {
-		i_size = i_size_read(mapping->host);
-		if (page_start >= i_size ||
-		    (offset == 0 && (pos + len) >= i_size)) {
-			zero_user_segments(page, 0, offset,
-					   offset + len,
-					   PAGE_SIZE);
-			/*
-			 * PageChecked means that the parts of the page
-			 * to which we're not writing are considered up
-			 * to date. Once the data is copied to the
-			 * page, it can be set uptodate.
-			 */
-			SetPageChecked(page);
-			goto out;
-		}
-	}
-
-	if ((file->f_flags & O_ACCMODE) != O_WRONLY && !oncethru) {
-		/*
-		 * might as well read a page, it is fast enough. If we get
-		 * an error, we don't need to return it. cifs_write_end will
-		 * do a sync write instead since PG_uptodate isn't set.
-		 */
-		cifs_readpage_worker(file, page, &page_start);
-		put_page(page);
-		oncethru = 1;
-		goto start;
-	} else {
-		/* we could try using another file handle if there is one -
-		   but how would we lock it to prevent close of that handle
-		   racing with this read? In any case
-		   this will be written out by write_end so is fine */
-	}
-out:
-	*pagep = page;
-	return rc;
-}
-
-static bool cifs_release_folio(struct folio *folio, gfp_t gfp)
-{
-	if (folio_test_private(folio))
-		return 0;
-	if (folio_test_private_2(folio)) { /* [DEPRECATED] */
-		if (current_is_kswapd() || !(gfp & __GFP_FS))
-			return false;
-		folio_wait_private_2(folio);
-	}
-	fscache_note_page_release(cifs_inode_cookie(folio->mapping->host));
-	return true;
-}
-
-static void cifs_invalidate_folio(struct folio *folio, size_t offset,
-				 size_t length)
-{
-	folio_wait_private_2(folio); /* [DEPRECATED] */
-}
-#endif // end netfslib remove 5152
-
 void cifs_oplock_break(struct work_struct *work)
 {
 	struct cifsFileInfo *cfile = container_of(work, struct cifsFileInfo,
@@ -5299,27 +4701,6 @@ void cifs_oplock_break(struct work_struct *work)
 	cifs_done_oplock_break(cinode);
 }
 
-#if 0 // TODO remove 5333
-/*
- * The presence of cifs_direct_io() in the address space ops vector
- * allowes open() O_DIRECT flags which would have failed otherwise.
- *
- * In the non-cached mode (mount with cache=none), we shunt off direct read and write requests
- * so this method should never be called.
- *
- * Direct IO is not yet supported in the cached mode.
- */
-static ssize_t
-cifs_direct_io(struct kiocb *iocb, struct iov_iter *iter)
-{
-        /*
-         * FIXME
-         * Eventually need to support direct IO for non forcedirectio mounts
-         */
-        return -EINVAL;
-}
-#endif // netfs end remove 5333
-
 static int cifs_swap_activate(struct swap_info_struct *sis,
 			      struct file *swap_file, sector_t *span)
 {
diff --git a/fs/smb/client/fscache.c b/fs/smb/client/fscache.c
index 7aa1d633c027..147e8cd38fe1 100644
--- a/fs/smb/client/fscache.c
+++ b/fs/smb/client/fscache.c
@@ -150,114 +150,3 @@ void cifs_fscache_release_inode_cookie(struct inode *inode)
 		cifsi->netfs.cache = NULL;
 	}
 }
-
-#if 0 // TODO remove
-/*
- * Fallback page reading interface.
- */
-static int fscache_fallback_read_page(struct inode *inode, struct page *page)
-{
-	struct netfs_cache_resources cres;
-	struct fscache_cookie *cookie = cifs_inode_cookie(inode);
-	struct iov_iter iter;
-	struct bio_vec bvec;
-	int ret;
-
-	memset(&cres, 0, sizeof(cres));
-	bvec_set_page(&bvec, page, PAGE_SIZE, 0);
-	iov_iter_bvec(&iter, ITER_DEST, &bvec, 1, PAGE_SIZE);
-
-	ret = fscache_begin_read_operation(&cres, cookie);
-	if (ret < 0)
-		return ret;
-
-	ret = fscache_read(&cres, page_offset(page), &iter, NETFS_READ_HOLE_FAIL,
-			   NULL, NULL);
-	fscache_end_operation(&cres);
-	return ret;
-}
-
-/*
- * Fallback page writing interface.
- */
-static int fscache_fallback_write_pages(struct inode *inode, loff_t start, size_t len,
-					bool no_space_allocated_yet)
-{
-	struct netfs_cache_resources cres;
-	struct fscache_cookie *cookie = cifs_inode_cookie(inode);
-	struct iov_iter iter;
-	int ret;
-
-	memset(&cres, 0, sizeof(cres));
-	iov_iter_xarray(&iter, ITER_SOURCE, &inode->i_mapping->i_pages, start, len);
-
-	ret = fscache_begin_write_operation(&cres, cookie);
-	if (ret < 0)
-		return ret;
-
-	ret = cres.ops->prepare_write(&cres, &start, &len, len, i_size_read(inode),
-				      no_space_allocated_yet);
-	if (ret == 0)
-		ret = fscache_write(&cres, start, &iter, NULL, NULL);
-	fscache_end_operation(&cres);
-	return ret;
-}
-
-/*
- * Retrieve a page from FS-Cache
- */
-int __cifs_readpage_from_fscache(struct inode *inode, struct page *page)
-{
-	int ret;
-
-	cifs_dbg(FYI, "%s: (fsc:%p, p:%p, i:0x%p\n",
-		 __func__, cifs_inode_cookie(inode), page, inode);
-
-	ret = fscache_fallback_read_page(inode, page);
-	if (ret < 0)
-		return ret;
-
-	/* Read completed synchronously */
-	SetPageUptodate(page);
-	return 0;
-}
-
-void __cifs_readahead_to_fscache(struct inode *inode, loff_t pos, size_t len)
-{
-	cifs_dbg(FYI, "%s: (fsc: %p, p: %llx, l: %zx, i: %p)\n",
-		 __func__, cifs_inode_cookie(inode), pos, len, inode);
-
-	fscache_fallback_write_pages(inode, pos, len, true);
-}
-
-/*
- * Query the cache occupancy.
- */
-int __cifs_fscache_query_occupancy(struct inode *inode,
-				   pgoff_t first, unsigned int nr_pages,
-				   pgoff_t *_data_first,
-				   unsigned int *_data_nr_pages)
-{
-	struct netfs_cache_resources cres;
-	struct fscache_cookie *cookie = cifs_inode_cookie(inode);
-	loff_t start, data_start;
-	size_t len, data_len;
-	int ret;
-
-	ret = fscache_begin_read_operation(&cres, cookie);
-	if (ret < 0)
-		return ret;
-
-	start = first * PAGE_SIZE;
-	len = nr_pages * PAGE_SIZE;
-	ret = cres.ops->query_occupancy(&cres, start, len, PAGE_SIZE,
-					&data_start, &data_len);
-	if (ret == 0) {
-		*_data_first = data_start / PAGE_SIZE;
-		*_data_nr_pages = len / PAGE_SIZE;
-	}
-
-	fscache_end_operation(&cres);
-	return ret;
-}
-#endif
diff --git a/fs/smb/client/fscache.h b/fs/smb/client/fscache.h
index 08b30f79d4cd..f06cb24f5f3c 100644
--- a/fs/smb/client/fscache.h
+++ b/fs/smb/client/fscache.h
@@ -74,43 +74,6 @@ static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags
 			   i_size_read(inode), flags);
 }
 
-#if 0 // TODO remove
-extern int __cifs_fscache_query_occupancy(struct inode *inode,
-					  pgoff_t first, unsigned int nr_pages,
-					  pgoff_t *_data_first,
-					  unsigned int *_data_nr_pages);
-
-static inline int cifs_fscache_query_occupancy(struct inode *inode,
-					       pgoff_t first, unsigned int nr_pages,
-					       pgoff_t *_data_first,
-					       unsigned int *_data_nr_pages)
-{
-	if (!cifs_inode_cookie(inode))
-		return -ENOBUFS;
-	return __cifs_fscache_query_occupancy(inode, first, nr_pages,
-					      _data_first, _data_nr_pages);
-}
-
-extern int __cifs_readpage_from_fscache(struct inode *pinode, struct page *ppage);
-extern void __cifs_readahead_to_fscache(struct inode *pinode, loff_t pos, size_t len);
-
-
-static inline int cifs_readpage_from_fscache(struct inode *inode,
-					     struct page *page)
-{
-	if (cifs_inode_cookie(inode))
-		return __cifs_readpage_from_fscache(inode, page);
-	return -ENOBUFS;
-}
-
-static inline void cifs_readahead_to_fscache(struct inode *inode,
-					     loff_t pos, size_t len)
-{
-	if (cifs_inode_cookie(inode))
-		__cifs_readahead_to_fscache(inode, pos, len);
-}
-#endif
-
 static inline bool cifs_fscache_enabled(struct inode *inode)
 {
 	return fscache_cookie_enabled(cifs_inode_cookie(inode));
@@ -133,27 +96,6 @@ static inline struct fscache_cookie *cifs_inode_cookie(struct inode *inode) { re
 static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags) {}
 static inline bool cifs_fscache_enabled(struct inode *inode) { return false; }
 
-#if 0 // TODO remove
-static inline int cifs_fscache_query_occupancy(struct inode *inode,
-					       pgoff_t first, unsigned int nr_pages,
-					       pgoff_t *_data_first,
-					       unsigned int *_data_nr_pages)
-{
-	*_data_first = ULONG_MAX;
-	*_data_nr_pages = 0;
-	return -ENOBUFS;
-}
-
-static inline int
-cifs_readpage_from_fscache(struct inode *inode, struct page *page)
-{
-	return -ENOBUFS;
-}
-
-static inline
-void cifs_readahead_to_fscache(struct inode *inode, loff_t pos, size_t len) {}
-#endif
-
 #endif /* CONFIG_CIFS_FSCACHE */
 
 #endif /* _CIFS_FSCACHE_H */



* [PATCH v6 11/15] cifs: When caching, try to open O_WRONLY file rdwr on server
  @ 2024-03-28 16:58  4% ` David Howells
  2024-03-29  9:58  0%   ` Naveen Mamindlapalli
  2024-03-28 16:58  2% ` [PATCH v6 13/15] cifs: Remove some code that's no longer used, part 1 David Howells
  1 sibling, 1 reply; 200+ results
From: David Howells @ 2024-03-28 16:58 UTC (permalink / raw)
  To: Steve French
  Cc: David Howells, Jeff Layton, Matthew Wilcox, Paulo Alcantara,
	Shyam Prasad N, Tom Talpey, Christian Brauner, netfs, linux-cifs,
	linux-fsdevel, linux-mm, netdev, linux-kernel, Steve French,
	Shyam Prasad N, Rohith Surabattula

When we're engaged in local caching of a cifs filesystem, we cannot
cache a partially written cache granule unless we can read the rest of
the granule.  To deal with this, if a file is opened O_WRONLY locally,
but the mount was given the "-o fsc" flag, first try opening the remote
file with GENERIC_READ|GENERIC_WRITE; if that returns -EACCES, drop the
GENERIC_READ and do the open again.  If the second open succeeds,
invalidate the cache for that file as is done for O_DIRECT.
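The fallback sequence above can be sketched in userspace C.  This is a
minimal sketch only: mock_server_open() and the GENERIC_* constants here
are stand-ins invented for illustration, not the real SMB open path or
Windows access masks, and the real code (cifs_nt_open() below) also
handles reopen and create paths.

```c
#include <errno.h>

/* Stand-in access-mask bits (illustrative values, not the SMB ones). */
#define GENERIC_READ  0x1
#define GENERIC_WRITE 0x2

/* Hypothetical stand-in for server->ops->open(): this mock denies any
 * request that includes GENERIC_READ, mimicking a share that only
 * grants write access. */
static int mock_server_open(int desired_access)
{
	if (desired_access & GENERIC_READ)
		return -EACCES;
	return 0;
}

/* For an O_WRONLY open on an "-o fsc" mount: first request
 * GENERIC_READ|GENERIC_WRITE; on -EACCES, drop GENERIC_READ and retry.
 * Returns 0 on success and sets *cache_invalidated when the read-less
 * retry was the one that succeeded (the real code then invalidates the
 * fscache data for the inode, as for O_DIRECT). */
static int open_wronly_for_fscache(int *cache_invalidated)
{
	int desired_access = GENERIC_READ | GENERIC_WRITE;
	int rdwr_for_fscache = 1;
	int rc;

retry_open:
	rc = mock_server_open(desired_access);
	if (rc == -EACCES && rdwr_for_fscache == 1) {
		desired_access &= ~GENERIC_READ;
		rdwr_for_fscache = 2;
		goto retry_open;
	}
	if (rc)
		return rc;

	*cache_invalidated = (rdwr_for_fscache == 2);
	return 0;
}
```

The tri-state rdwr_for_fscache variable (0 = not caching, 1 = trying
the widened access, 2 = fell back to write-only) mirrors how the patch
decides whether cache invalidation is needed after the open succeeds.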

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/smb/client/dir.c     | 15 ++++++++++++
 fs/smb/client/file.c    | 51 +++++++++++++++++++++++++++++++++--------
 fs/smb/client/fscache.h |  6 +++++
 3 files changed, 62 insertions(+), 10 deletions(-)

diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index 89333d9bce36..37897b919dd5 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -189,6 +189,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	int disposition;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	*oplock = 0;
 	if (tcon->ses->server->oplocks)
@@ -200,6 +201,10 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		return PTR_ERR(full_path);
 	}
 
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (oflags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	if (tcon->unix_ext && cap_unix(tcon->ses) && !tcon->broken_posix_open &&
 	    (CIFS_UNIX_POSIX_PATH_OPS_CAP &
@@ -276,6 +281,8 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 		desired_access |= GENERIC_READ; /* is this too little? */
 	if (OPEN_FMODE(oflags) & FMODE_WRITE)
 		desired_access |= GENERIC_WRITE;
+	if (rdwr_for_fscache == 1)
+		desired_access |= GENERIC_READ;
 
 	disposition = FILE_OVERWRITE_IF;
 	if ((oflags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))
@@ -304,6 +311,7 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	if (!tcon->unix_ext && (mode & S_IWUGO) == 0)
 		create_options |= CREATE_OPTION_READONLY;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -317,8 +325,15 @@ static int cifs_do_create(struct inode *inode, struct dentry *direntry, unsigned
 	rc = server->ops->open(xid, &oparms, oplock, buf);
 	if (rc) {
 		cifs_dbg(FYI, "cifs_create returned 0x%x\n", rc);
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access &= ~GENERIC_READ;
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		goto out;
 	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 	/*
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 73573dadf90e..761a80963f76 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -521,12 +521,12 @@ cifs_mark_open_files_invalid(struct cifs_tcon *tcon)
 	 */
 }
 
-static inline int cifs_convert_flags(unsigned int flags)
+static inline int cifs_convert_flags(unsigned int flags, int rdwr_for_fscache)
 {
 	if ((flags & O_ACCMODE) == O_RDONLY)
 		return GENERIC_READ;
 	else if ((flags & O_ACCMODE) == O_WRONLY)
-		return GENERIC_WRITE;
+		return rdwr_for_fscache == 1 ? (GENERIC_READ | GENERIC_WRITE) : GENERIC_WRITE;
 	else if ((flags & O_ACCMODE) == O_RDWR) {
 		/* GENERIC_ALL is too much permission to request
 		   can cause unnecessary access denied on create */
@@ -663,11 +663,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	int create_options = CREATE_NOT_DIR;
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	if (!server->ops->open)
 		return -ENOSYS;
 
-	desired_access = cifs_convert_flags(f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(f_flags, rdwr_for_fscache);
 
 /*********************************************************************
  *  open flag mapping table:
@@ -704,6 +709,7 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	if (f_flags & O_DIRECT)
 		create_options |= CREATE_NO_BUFFER;
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -715,8 +721,16 @@ static int cifs_nt_open(const char *full_path, struct inode *inode, struct cifs_
 	};
 
 	rc = server->ops->open(xid, &oparms, oplock, buf);
-	if (rc)
+	if (rc) {
+		if (rc == -EACCES && rdwr_for_fscache == 1) {
+			desired_access = cifs_convert_flags(f_flags, 0);
+			rdwr_for_fscache = 2;
+			goto retry_open;
+		}
 		return rc;
+	}
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
 
 	/* TODO: Add support for calling posix query info but with passing in fid */
 	if (tcon->unix_ext)
@@ -1149,11 +1163,14 @@ int cifs_open(struct inode *inode, struct file *file)
 use_cache:
 	fscache_use_cookie(cifs_inode_cookie(file_inode(file)),
 			   file->f_mode & FMODE_WRITE);
-	if (file->f_flags & O_DIRECT &&
-	    (!((file->f_flags & O_ACCMODE) != O_RDONLY) ||
-	     file->f_flags & O_APPEND))
-		cifs_invalidate_cache(file_inode(file),
-				      FSCACHE_INVAL_DIO_WRITE);
+	//if ((file->f_flags & O_ACCMODE) == O_WRONLY)
+	//	goto inval;
+	if (!(file->f_flags & O_DIRECT))
+		goto out;
+	if ((file->f_flags & (O_ACCMODE | O_APPEND)) == O_RDONLY)
+		goto out;
+//inval:
+	cifs_invalidate_cache(file_inode(file), FSCACHE_INVAL_DIO_WRITE);
 
 out:
 	free_dentry_path(page);
@@ -1218,6 +1235,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	int disposition = FILE_OPEN;
 	int create_options = CREATE_NOT_DIR;
 	struct cifs_open_parms oparms;
+	int rdwr_for_fscache = 0;
 
 	xid = get_xid();
 	mutex_lock(&cfile->fh_mutex);
@@ -1281,7 +1299,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	}
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
 
-	desired_access = cifs_convert_flags(cfile->f_flags);
+	/* If we're caching, we need to be able to fill in around partial writes. */
+	if (cifs_fscache_enabled(inode) && (cfile->f_flags & O_ACCMODE) == O_WRONLY)
+		rdwr_for_fscache = 1;
+
+	desired_access = cifs_convert_flags(cfile->f_flags, rdwr_for_fscache);
 
 	/* O_SYNC also has bit for O_DSYNC so following check picks up either */
 	if (cfile->f_flags & O_SYNC)
@@ -1293,6 +1315,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 	if (server->ops->get_lease_key)
 		server->ops->get_lease_key(inode, &cfile->fid);
 
+retry_open:
 	oparms = (struct cifs_open_parms) {
 		.tcon = tcon,
 		.cifs_sb = cifs_sb,
@@ -1318,6 +1341,11 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		/* indicate that we need to relock the file */
 		oparms.reconnect = true;
 	}
+	if (rc == -EACCES && rdwr_for_fscache == 1) {
+		desired_access = cifs_convert_flags(cfile->f_flags, 0);
+		rdwr_for_fscache = 2;
+		goto retry_open;
+	}
 
 	if (rc) {
 		mutex_unlock(&cfile->fh_mutex);
@@ -1326,6 +1354,9 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
 		goto reopen_error_exit;
 	}
 
+	if (rdwr_for_fscache == 2)
+		cifs_invalidate_cache(inode, FSCACHE_INVAL_DIO_WRITE);
+
 #ifdef CONFIG_CIFS_ALLOW_INSECURE_LEGACY
 reopen_success:
 #endif /* CONFIG_CIFS_ALLOW_INSECURE_LEGACY */
diff --git a/fs/smb/client/fscache.h b/fs/smb/client/fscache.h
index a3d73720914f..1f2ea9f5cc9a 100644
--- a/fs/smb/client/fscache.h
+++ b/fs/smb/client/fscache.h
@@ -109,6 +109,11 @@ static inline void cifs_readahead_to_fscache(struct inode *inode,
 		__cifs_readahead_to_fscache(inode, pos, len);
 }
 
+static inline bool cifs_fscache_enabled(struct inode *inode)
+{
+	return fscache_cookie_enabled(cifs_inode_cookie(inode));
+}
+
 #else /* CONFIG_CIFS_FSCACHE */
 static inline
 void cifs_fscache_fill_coherency(struct inode *inode,
@@ -124,6 +129,7 @@ static inline void cifs_fscache_release_inode_cookie(struct inode *inode) {}
 static inline void cifs_fscache_unuse_inode_cookie(struct inode *inode, bool update) {}
 static inline struct fscache_cookie *cifs_inode_cookie(struct inode *inode) { return NULL; }
 static inline void cifs_invalidate_cache(struct inode *inode, unsigned int flags) {}
+static inline bool cifs_fscache_enabled(struct inode *inode) { return false; }
 
 static inline int cifs_fscache_query_occupancy(struct inode *inode,
 					       pgoff_t first, unsigned int nr_pages,



* Re: [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices
  2024-03-27 16:57  0%   ` Mickaël Salaün
@ 2024-03-28 12:01  0%     ` Mickaël Salaün
  2024-04-02 18:28  0%     ` Günther Noack
  1 sibling, 0 replies; 200+ results
From: Mickaël Salaün @ 2024-03-28 12:01 UTC (permalink / raw)
  To: Günther Noack
  Cc: linux-security-module, Jeff Xu, Arnd Bergmann,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel,
	Christian Brauner

On Wed, Mar 27, 2024 at 05:57:35PM +0100, Mickaël Salaün wrote:
> On Wed, Mar 27, 2024 at 01:10:31PM +0000, Günther Noack wrote:
> > Introduces the LANDLOCK_ACCESS_FS_IOCTL_DEV right
> > and increments the Landlock ABI version to 5.
> > 
> > This access right applies to device-custom IOCTL commands
> > when they are invoked on block or character device files.
> > 
> > Like the truncate right, this right is associated with a file
> > descriptor at the time of open(2), and gets respected even when the
> > file descriptor is used outside of the thread which it was originally
> > opened in.
> > 
> > Therefore, a newly enabled Landlock policy does not apply to file
> > descriptors which are already open.
> > 
> > If the LANDLOCK_ACCESS_FS_IOCTL_DEV right is handled, only a small
> > number of safe IOCTL commands will be permitted on newly opened device
> > files.  These include FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC, as well
> > as other IOCTL commands for regular files which are implemented in
> > fs/ioctl.c.
> > 
> > Noteworthy scenarios which require special attention:
> > 
> > TTY devices are often passed into a process from the parent process,
> > and so a newly enabled Landlock policy does not retroactively apply to
> > them automatically.  In the past, TTY devices have often supported
> > IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which were
> > letting callers control the TTY input buffer (and simulate
> > keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
> > modern kernels though.
> > 
> > Known limitations:
> > 
> > The LANDLOCK_ACCESS_FS_IOCTL_DEV access right is a coarse-grained
> > control over IOCTL commands.
> > 
> > Landlock users may use path-based restrictions in combination with
> > their knowledge about the file system layout to control what IOCTLs
> > can be done.
> > 
> > Cc: Paul Moore <paul@paul-moore.com>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Signed-off-by: Günther Noack <gnoack@google.com>
> > ---
> >  include/uapi/linux/landlock.h                |  33 +++-
> >  security/landlock/fs.c                       | 183 ++++++++++++++++++-
> >  security/landlock/limits.h                   |   2 +-
> >  security/landlock/syscalls.c                 |   8 +-
> >  tools/testing/selftests/landlock/base_test.c |   2 +-
> >  tools/testing/selftests/landlock/fs_test.c   |   5 +-
> >  6 files changed, 216 insertions(+), 17 deletions(-)
> > 
> > diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> > index 25c8d7677539..5d90e9799eb5 100644
> > --- a/include/uapi/linux/landlock.h
> > +++ b/include/uapi/linux/landlock.h
> > @@ -128,7 +128,7 @@ struct landlock_net_port_attr {
> >   * files and directories.  Files or directories opened before the sandboxing
> >   * are not subject to these restrictions.
> >   *
> > - * A file can only receive these access rights:
> > + * The following access rights apply only to files:
> >   *
> >   * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
> >   * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
> > @@ -138,12 +138,13 @@ struct landlock_net_port_attr {
> >   * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
> >   * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
> >   *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
> > - *   ``O_TRUNC``. Whether an opened file can be truncated with
> > - *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
> > - *   same way as read and write permissions are checked during
> > - *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
> > - *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
> > - *   third version of the Landlock ABI.
> > + *   ``O_TRUNC``.  This access right is available since the third version of the
> > + *   Landlock ABI.
> > + *
> > + * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
> > + * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
> > + * read and write permissions are checked during :manpage:`open(2)` using
> > + * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
> >   *
> >   * A directory can receive access rights related to files or directories.  The
> >   * following access right is applied to the directory itself, and the
> > @@ -198,13 +199,28 @@ struct landlock_net_port_attr {
> >   *   If multiple requirements are not met, the ``EACCES`` error code takes
> >   *   precedence over ``EXDEV``.
> >   *
> > + * The following access right applies both to files and directories:
> > + *
> > + * - %LANDLOCK_ACCESS_FS_IOCTL_DEV: Invoke :manpage:`ioctl(2)` commands on an opened
> > + *   character or block device.
> > + *
> > + *   This access right applies to all `ioctl(2)` commands implemented by device
> 
> :manpage:`ioctl(2)`
> 
> > + *   drivers.  However, the following common IOCTL commands continue to be
> > + *   invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL_DEV right:
> 
> This is good but explaining the rationale could help, something like
> this (taking care of not packing lines listing commands to ease review
> when a new command will be added):
> 
> IOCTL commands targeting file descriptors (``FIOCLEX``, ``FIONCLEX``),
> file descriptions (``FIONBIO``, ``FIOASYNC``),
> file systems (``FIOQSIZE``, ``FS_IOC_FIEMAP``, ``FICLONE``,
> ``FICLONERANGE``, ``FIDEDUPERANGE``, ``FS_IOC_GETFLAGS``,
> ``FS_IOC_SETFLAGS``, ``FS_IOC_FSGETXATTR``, ``FS_IOC_FSSETXATTR``),
> or superblocks (``FIFREEZE``, ``FITHAW``, ``FIGETBSZ``,
> ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``)
> are never denied.  However, such IOCTL commands still require an opened
> file and may not be available on any file type.  Read or write
> permission may be checked by the underlying implementation, as well as
> capabilities.
> 
> > + *
> > + *   ``FIOCLEX``, ``FIONCLEX``, ``FIONBIO``, ``FIOASYNC``, ``FIFREEZE``,
> > + *   ``FITHAW``, ``FIGETBSZ``, ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``
> > + *
> > + *   This access right is available since the fifth version of the Landlock
> > + *   ABI.
> > + *
> >   * .. warning::
> >   *
> >   *   It is currently not possible to restrict some file-related actions
> >   *   accessible through these syscall families: :manpage:`chdir(2)`,
> >   *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
> >   *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
> > - *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
> > + *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
> >   *   Future Landlock evolutions will enable to restrict them.
> >   */
> >  /* clang-format off */
> > @@ -223,6 +239,7 @@ struct landlock_net_port_attr {
> >  #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
> >  #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
> >  #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
> > +#define LANDLOCK_ACCESS_FS_IOCTL_DEV			(1ULL << 15)
> >  /* clang-format on */
> >  
> >  /**
> > diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> > index c15559432d3d..2ef6c57fa20b 100644
> > --- a/security/landlock/fs.c
> > +++ b/security/landlock/fs.c
> > @@ -7,6 +7,7 @@
> >   * Copyright © 2021-2022 Microsoft Corporation
> >   */
> >  
> > +#include <asm/ioctls.h>
> >  #include <kunit/test.h>
> >  #include <linux/atomic.h>
> >  #include <linux/bitops.h>
> > @@ -14,6 +15,7 @@
> >  #include <linux/compiler_types.h>
> >  #include <linux/dcache.h>
> >  #include <linux/err.h>
> > +#include <linux/falloc.h>
> >  #include <linux/fs.h>
> >  #include <linux/init.h>
> >  #include <linux/kernel.h>
> > @@ -29,6 +31,7 @@
> >  #include <linux/types.h>
> >  #include <linux/wait_bit.h>
> >  #include <linux/workqueue.h>
> > +#include <uapi/linux/fiemap.h>
> >  #include <uapi/linux/landlock.h>
> >  
> >  #include "common.h"
> > @@ -84,6 +87,141 @@ static const struct landlock_object_underops landlock_fs_underops = {
> >  	.release = release_inode
> >  };
> >  
> > +/* IOCTL helpers */
> > +
> > +/**
> > + * get_required_ioctl_dev_access(): Determine required access rights for IOCTLs
> > + * on device files.
> > + *
> > + * @cmd: The IOCTL command that is supposed to be run.
> > + *
> > + * By default, any IOCTL on a device file requires the
> > + * LANDLOCK_ACCESS_FS_IOCTL_DEV right.  We make exceptions for commands, if:
> > + *
> > + * 1. The command is implemented in fs/ioctl.c's do_vfs_ioctl(),
> > + *    not in f_ops->unlocked_ioctl() or f_ops->compat_ioctl().
> > + *
> > + * 2. The command can be reasonably used on a device file at all.
> > + *
> > + * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
> > + * should be considered for inclusion here.
> > + *
> > + * Returns: The access rights that must be granted on an opened file in order to
> > + * use the given @cmd.
> > + */
> > +static __attribute_const__ access_mask_t
> > +get_required_ioctl_dev_access(const unsigned int cmd)
> > +{
> > +	switch (cmd) {
> > +	case FIOCLEX:
> > +	case FIONCLEX:
> > +	case FIONBIO:
> > +	case FIOASYNC:
> > +		/*
> > +		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
> > +		 * close-on-exec and the file's buffered-IO and async flags.
> > +		 * These operations are also available through fcntl(2), and are
> > +		 * unconditionally permitted in Landlock.
> > +		 */
> > +		return 0;
> > +	case FIOQSIZE:
> > +		/*
> > +		 * FIOQSIZE queries the size of a regular file or directory.
> > +		 *
> > +		 * This IOCTL command only applies to regular files and
> > +		 * directories.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> 
> This should always be allowed because do_vfs_ioctl() never returns
> -ENOIOCTLCMD for this command.  That's why I wrote
> vfs_masked_device_ioctl() this way [1].  I think it would be easier to
> read and maintain this code with an is_masked_device_ioctl() logic.  Listing
> commands that are not masked makes it difficult to review because
> allowed and denied return codes are interleaved.
> 
> [1] https://lore.kernel.org/r/20240219183539.2926165-1-mic@digikod.net
> 
> Your IOCTL command explanation comments are nice and they should be kept
> in is_masked_device_ioctl() (if they mask device IOCTL commands).
> 
> > +	case FIFREEZE:
> > +	case FITHAW:
> > +		/*
> > +		 * FIFREEZE and FITHAW freeze and thaw the file system which the
> > +		 * given file belongs to.  Requires CAP_SYS_ADMIN.
> > +		 *
> > +		 * These commands operate on the file system's superblock rather
> > +		 * than on the file itself.  The same operations can also be
> > +		 * done through any other file or directory on the same file
> > +		 * system, so it is safe to permit these.
> > +		 */
> > +		return 0;
> > +	case FS_IOC_FIEMAP:
> > +		/*
> > +		 * FS_IOC_FIEMAP queries information about the allocation of
> > +		 * blocks within a file.
> > +		 *
> > +		 * This IOCTL command only applies to regular files.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> 
> Same here.
> 
> > +	case FIGETBSZ:
> > +		/*
> > +		 * FIGETBSZ queries the file system's block size for a file or
> > +		 * directory.
> > +		 *
> > +		 * This command operates on the file system's superblock rather
> > +		 * than on the file itself.  The same operation can also be done
> > +		 * through any other file or directory on the same file system,
> > +		 * so it is safe to permit it.
> > +		 */
> > +		return 0;
> > +	case FICLONE:
> > +	case FICLONERANGE:
> > +	case FIDEDUPERANGE:
> > +		/*
> > +		 * FICLONE, FICLONERANGE and FIDEDUPERANGE make files share
> > +		 * their underlying storage ("reflink") between source and
> > +		 * destination FDs, on file systems which support that.
> > +		 *
> > +		 * These IOCTL commands only apply to regular files.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> 
> ditto
> 
> > +	case FIONREAD:
> > +		/*
> > +		 * FIONREAD returns the number of bytes available for reading.
> > +		 *
> > +		 * We require LANDLOCK_ACCESS_FS_IOCTL_DEV for FIONREAD, because
> > +		 * devices implement it in f_ops->unlocked_ioctl().  The
> > +		 * implementations of this operation have varying quality and
> > +		 * complexity, so it is hard to reason about what they do.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +	case FS_IOC_GETFLAGS:
> > +	case FS_IOC_SETFLAGS:
> > +	case FS_IOC_FSGETXATTR:
> > +	case FS_IOC_FSSETXATTR:
> > +		/*
> > +		 * FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_FSGETXATTR and
> > +		 * FS_IOC_FSSETXATTR do not apply for devices.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +	case FS_IOC_GETFSUUID:
> > +	case FS_IOC_GETFSSYSFSPATH:
> > +		/*
> > +		 * FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH both operate on
> > +		 * the file system superblock, not on the specific file, so
> > +		 * these operations are available through any other file on the
> > +		 * same file system as well.
> > +		 */
> > +		return 0;
> > +	case FIBMAP:
> > +	case FS_IOC_RESVSP:
> > +	case FS_IOC_RESVSP64:
> > +	case FS_IOC_UNRESVSP:
> > +	case FS_IOC_UNRESVSP64:
> > +	case FS_IOC_ZERO_RANGE:
> > +		/*
> > +		 * FIBMAP, FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP,
> > +		 * FS_IOC_UNRESVSP64 and FS_IOC_ZERO_RANGE only apply to regular
> > +		 * files (as implemented in file_ioctl()).
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +	default:
> > +		/*
> > +		 * Other commands are guarded by the catch-all access right.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +	}
> > +}
> > +
> >  /* Ruleset management */
> >  
> >  static struct landlock_object *get_inode_object(struct inode *const inode)
> > @@ -148,7 +286,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
> >  	LANDLOCK_ACCESS_FS_EXECUTE | \
> >  	LANDLOCK_ACCESS_FS_WRITE_FILE | \
> >  	LANDLOCK_ACCESS_FS_READ_FILE | \
> > -	LANDLOCK_ACCESS_FS_TRUNCATE)
> > +	LANDLOCK_ACCESS_FS_TRUNCATE | \
> > +	LANDLOCK_ACCESS_FS_IOCTL_DEV)
> >  /* clang-format on */
> >  
> >  /*
> > @@ -1335,8 +1474,10 @@ static int hook_file_alloc_security(struct file *const file)
> >  static int hook_file_open(struct file *const file)
> >  {
> >  	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
> > -	access_mask_t open_access_request, full_access_request, allowed_access;
> > -	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> > +	access_mask_t open_access_request, full_access_request, allowed_access,
> > +		optional_access;
> > +	const struct inode *inode = file_inode(file);
> > +	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
> >  	const struct landlock_ruleset *const dom =
> >  		get_fs_domain(landlock_cred(file->f_cred)->domain);
> >  
> > @@ -1354,6 +1495,10 @@ static int hook_file_open(struct file *const file)
> >  	 * We look up more access than what we immediately need for open(), so
> >  	 * that we can later authorize operations on opened files.
> >  	 */
> > +	optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> > +	if (is_device)
> > +		optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > +
> >  	full_access_request = open_access_request | optional_access;
> >  
> >  	if (is_access_to_paths_allowed(
> > @@ -1410,6 +1555,36 @@ static int hook_file_truncate(struct file *const file)
> >  	return -EACCES;
> >  }
> >  
> > +static int hook_file_ioctl(struct file *file, unsigned int cmd,
> > +			   unsigned long arg)
> > +{
> > +	const struct inode *inode = file_inode(file);
> > +	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
> > +	access_mask_t required_access, allowed_access;
> 
> As explained in [2], I'd like not-sandboxed tasks to not have visible
> performance impact because of Landlock:
> 
>   We should first check landlock_file(file)->allowed_access as in
>   hook_file_truncate() to return as soon as possible for non-sandboxed
>   tasks.  Any other computation should be done after that (e.g. with an
>   is_device() helper).
> 
> [2] https://lore.kernel.org/r/20240311.If7ieshaegu2@digikod.net
> 
> This is_device(file) helper should also replace other is_device variables.
> 
> 
> > +
> > +	if (!is_device)
> > +		return 0;
> > +
> > +	/*
> > +	 * It is the access rights at the time of opening the file which
> > +	 * determine whether IOCTL can be used on the opened file later.
> > +	 *
> > +	 * The access right is attached to the opened file in hook_file_open().
> > +	 */
> > +	required_access = get_required_ioctl_dev_access(cmd);
> > +	allowed_access = landlock_file(file)->allowed_access;
> > +	if ((allowed_access & required_access) == required_access)
> > +		return 0;
> > +
> > +	return -EACCES;
> > +}
> > +
> > +static int hook_file_ioctl_compat(struct file *file, unsigned int cmd,
> > +				  unsigned long arg)
> > +{
> > +	return hook_file_ioctl(file, cmd, arg);
> 
> The compat-specific IOCTL commands are missing (e.g. FS_IOC_RESVSP_32).
> Relying on is_masked_device_ioctl() should make this call OK though.

Well no, see vfs_masked_device_ioctl_compat().

> 
> > +}
> > +
> >  static struct security_hook_list landlock_hooks[] __ro_after_init = {
> >  	LSM_HOOK_INIT(inode_free_security, hook_inode_free_security),
> >  

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices
  2024-03-27 13:10  6% ` [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices Günther Noack
@ 2024-03-27 16:57  0%   ` Mickaël Salaün
  2024-03-28 12:01  0%     ` Mickaël Salaün
  2024-04-02 18:28  0%     ` Günther Noack
  0 siblings, 2 replies; 200+ results
From: Mickaël Salaün @ 2024-03-27 16:57 UTC (permalink / raw)
  To: Günther Noack
  Cc: linux-security-module, Jeff Xu, Arnd Bergmann,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel,
	Christian Brauner

On Wed, Mar 27, 2024 at 01:10:31PM +0000, Günther Noack wrote:
> Introduces the LANDLOCK_ACCESS_FS_IOCTL_DEV right
> and increments the Landlock ABI version to 5.
> 
> This access right applies to device-custom IOCTL commands
> when they are invoked on block or character device files.
> 
> Like the truncate right, this right is associated with a file
> descriptor at the time of open(2), and gets respected even when the
> file descriptor is used outside of the thread which it was originally
> opened in.
> 
> Therefore, a newly enabled Landlock policy does not apply to file
> descriptors which are already open.
> 
> If the LANDLOCK_ACCESS_FS_IOCTL_DEV right is handled, only a small
> number of safe IOCTL commands will be permitted on newly opened device
> files.  These include FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC, as well
> as other IOCTL commands for regular files which are implemented in
> fs/ioctl.c.
> 
> Noteworthy scenarios which require special attention:
> 
> TTY devices are often passed into a process from the parent process,
> and so a newly enabled Landlock policy does not retroactively apply to
> them automatically.  In the past, TTY devices have often supported
> IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which were
> letting callers control the TTY input buffer (and simulate
> keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
> modern kernels though.
> 
> Known limitations:
> 
> The LANDLOCK_ACCESS_FS_IOCTL_DEV access right is a coarse-grained
> control over IOCTL commands.
> 
> Landlock users may use path-based restrictions in combination with
> their knowledge about the file system layout to control what IOCTLs
> can be done.
> 
> Cc: Paul Moore <paul@paul-moore.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Günther Noack <gnoack@google.com>
> ---
>  include/uapi/linux/landlock.h                |  33 +++-
>  security/landlock/fs.c                       | 183 ++++++++++++++++++-
>  security/landlock/limits.h                   |   2 +-
>  security/landlock/syscalls.c                 |   8 +-
>  tools/testing/selftests/landlock/base_test.c |   2 +-
>  tools/testing/selftests/landlock/fs_test.c   |   5 +-
>  6 files changed, 216 insertions(+), 17 deletions(-)
> 
> diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> index 25c8d7677539..5d90e9799eb5 100644
> --- a/include/uapi/linux/landlock.h
> +++ b/include/uapi/linux/landlock.h
> @@ -128,7 +128,7 @@ struct landlock_net_port_attr {
>   * files and directories.  Files or directories opened before the sandboxing
>   * are not subject to these restrictions.
>   *
> - * A file can only receive these access rights:
> + * The following access rights apply only to files:
>   *
>   * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
>   * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
> @@ -138,12 +138,13 @@ struct landlock_net_port_attr {
>   * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
>   * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
>   *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
> - *   ``O_TRUNC``. Whether an opened file can be truncated with
> - *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
> - *   same way as read and write permissions are checked during
> - *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
> - *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
> - *   third version of the Landlock ABI.
> + *   ``O_TRUNC``.  This access right is available since the third version of the
> + *   Landlock ABI.
> + *
> + * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
> + * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
> + * read and write permissions are checked during :manpage:`open(2)` using
> + * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
>   *
>   * A directory can receive access rights related to files or directories.  The
>   * following access right is applied to the directory itself, and the
> @@ -198,13 +199,28 @@ struct landlock_net_port_attr {
>   *   If multiple requirements are not met, the ``EACCES`` error code takes
>   *   precedence over ``EXDEV``.
>   *
> + * The following access right applies both to files and directories:
> + *
> + * - %LANDLOCK_ACCESS_FS_IOCTL_DEV: Invoke :manpage:`ioctl(2)` commands on an opened
> + *   character or block device.
> + *
> + *   This access right applies to all `ioctl(2)` commands implemented by device

:manpage:`ioctl(2)`

> + *   drivers.  However, the following common IOCTL commands continue to be
> + *   invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL_DEV right:

This is good, but explaining the rationale would help; something like
this (taking care not to pack the lines listing commands, to ease review
when a new command is added):

IOCTL commands targeting file descriptors (``FIOCLEX``, ``FIONCLEX``),
file descriptions (``FIONBIO``, ``FIOASYNC``),
file systems (``FIOQSIZE``, ``FS_IOC_FIEMAP``, ``FICLONE``,
``FICLONERANGE``, ``FIDEDUPERANGE``, ``FS_IOC_GETFLAGS``,
``FS_IOC_SETFLAGS``, ``FS_IOC_FSGETXATTR``, ``FS_IOC_FSSETXATTR``),
or superblocks (``FIFREEZE``, ``FITHAW``, ``FIGETBSZ``,
``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``)
are never denied.  However, such IOCTL commands still require an opened
file and may not be available for every file type.  The underlying
implementation may still check read or write permission, as well as
capabilities.

> + *
> + *   ``FIOCLEX``, ``FIONCLEX``, ``FIONBIO``, ``FIOASYNC``, ``FIFREEZE``,
> + *   ``FITHAW``, ``FIGETBSZ``, ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``
> + *
> + *   This access right is available since the fifth version of the Landlock
> + *   ABI.
> + *
>   * .. warning::
>   *
>   *   It is currently not possible to restrict some file-related actions
>   *   accessible through these syscall families: :manpage:`chdir(2)`,
>   *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
>   *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
> - *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
> + *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
>   *   Future Landlock evolutions will enable to restrict them.
>   */
>  /* clang-format off */
> @@ -223,6 +239,7 @@ struct landlock_net_port_attr {
>  #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
>  #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
>  #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
> +#define LANDLOCK_ACCESS_FS_IOCTL_DEV			(1ULL << 15)
>  /* clang-format on */
>  
>  /**
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index c15559432d3d..2ef6c57fa20b 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -7,6 +7,7 @@
>   * Copyright © 2021-2022 Microsoft Corporation
>   */
>  
> +#include <asm/ioctls.h>
>  #include <kunit/test.h>
>  #include <linux/atomic.h>
>  #include <linux/bitops.h>
> @@ -14,6 +15,7 @@
>  #include <linux/compiler_types.h>
>  #include <linux/dcache.h>
>  #include <linux/err.h>
> +#include <linux/falloc.h>
>  #include <linux/fs.h>
>  #include <linux/init.h>
>  #include <linux/kernel.h>
> @@ -29,6 +31,7 @@
>  #include <linux/types.h>
>  #include <linux/wait_bit.h>
>  #include <linux/workqueue.h>
> +#include <uapi/linux/fiemap.h>
>  #include <uapi/linux/landlock.h>
>  
>  #include "common.h"
> @@ -84,6 +87,141 @@ static const struct landlock_object_underops landlock_fs_underops = {
>  	.release = release_inode
>  };
>  
> +/* IOCTL helpers */
> +
> +/**
> + * get_required_ioctl_dev_access(): Determine required access rights for IOCTLs
> + * on device files.
> + *
> + * @cmd: The IOCTL command that is supposed to be run.
> + *
> + * By default, any IOCTL on a device file requires the
> + * LANDLOCK_ACCESS_FS_IOCTL_DEV right.  We make exceptions for commands, if:
> + *
> + * 1. The command is implemented in fs/ioctl.c's do_vfs_ioctl(),
> + *    not in f_ops->unlocked_ioctl() or f_ops->compat_ioctl().
> + *
> + * 2. The command can be reasonably used on a device file at all.
> + *
> + * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
> + * should be considered for inclusion here.
> + *
> + * Returns: The access rights that must be granted on an opened file in order to
> + * use the given @cmd.
> + */
> +static __attribute_const__ access_mask_t
> +get_required_ioctl_dev_access(const unsigned int cmd)
> +{
> +	switch (cmd) {
> +	case FIOCLEX:
> +	case FIONCLEX:
> +	case FIONBIO:
> +	case FIOASYNC:
> +		/*
> +		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
> +		 * close-on-exec and the file's buffered-IO and async flags.
> +		 * These operations are also available through fcntl(2), and are
> +		 * unconditionally permitted in Landlock.
> +		 */
> +		return 0;
> +	case FIOQSIZE:
> +		/*
> +		 * FIOQSIZE queries the size of a regular file or directory.
> +		 *
> +		 * This IOCTL command only applies to regular files and
> +		 * directories.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;

This should always be allowed because do_vfs_ioctl() never returns
-ENOIOCTLCMD for this command.  That's why I wrote
vfs_masked_device_ioctl() this way [1].  I think it would be easier to
read and maintain this code with a is_masked_device_ioctl() logic.  Listing
commands that are not masked makes it difficult to review because
allowed and denied return codes are interleaved.

[1] https://lore.kernel.org/r/20240219183539.2926165-1-mic@digikod.net

Your IOCTL command explanation comments are nice and they should be kept
in is_masked_device_ioctl() (if they mask device IOCTL commands).

> +	case FIFREEZE:
> +	case FITHAW:
> +		/*
> +		 * FIFREEZE and FITHAW freeze and thaw the file system which the
> +		 * given file belongs to.  Requires CAP_SYS_ADMIN.
> +		 *
> +		 * These commands operate on the file system's superblock rather
> +		 * than on the file itself.  The same operations can also be
> +		 * done through any other file or directory on the same file
> +		 * system, so it is safe to permit these.
> +		 */
> +		return 0;
> +	case FS_IOC_FIEMAP:
> +		/*
> +		 * FS_IOC_FIEMAP queries information about the allocation of
> +		 * blocks within a file.
> +		 *
> +		 * This IOCTL command only applies to regular files.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;

Same here.

> +	case FIGETBSZ:
> +		/*
> +		 * FIGETBSZ queries the file system's block size for a file or
> +		 * directory.
> +		 *
> +		 * This command operates on the file system's superblock rather
> +		 * than on the file itself.  The same operation can also be done
> +		 * through any other file or directory on the same file system,
> +		 * so it is safe to permit it.
> +		 */
> +		return 0;
> +	case FICLONE:
> +	case FICLONERANGE:
> +	case FIDEDUPERANGE:
> +		/*
> +		 * FICLONE, FICLONERANGE and FIDEDUPERANGE make files share
> +		 * their underlying storage ("reflink") between source and
> +		 * destination FDs, on file systems which support that.
> +		 *
> +		 * These IOCTL commands only apply to regular files.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;

ditto

> +	case FIONREAD:
> +		/*
> +		 * FIONREAD returns the number of bytes available for reading.
> +		 *
> +		 * We require LANDLOCK_ACCESS_FS_IOCTL_DEV for FIONREAD, because
> +		 * devices implement it in f_ops->unlocked_ioctl().  The
> +		 * implementations of this operation have varying quality and
> +		 * complexity, so it is hard to reason about what they do.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> +	case FS_IOC_GETFLAGS:
> +	case FS_IOC_SETFLAGS:
> +	case FS_IOC_FSGETXATTR:
> +	case FS_IOC_FSSETXATTR:
> +		/*
> +		 * FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_FSGETXATTR and
> +		 * FS_IOC_FSSETXATTR do not apply for devices.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> +	case FS_IOC_GETFSUUID:
> +	case FS_IOC_GETFSSYSFSPATH:
> +		/*
> +		 * FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH both operate on
> +		 * the file system superblock, not on the specific file, so
> +		 * these operations are available through any other file on the
> +		 * same file system as well.
> +		 */
> +		return 0;
> +	case FIBMAP:
> +	case FS_IOC_RESVSP:
> +	case FS_IOC_RESVSP64:
> +	case FS_IOC_UNRESVSP:
> +	case FS_IOC_UNRESVSP64:
> +	case FS_IOC_ZERO_RANGE:
> +		/*
> +		 * FIBMAP, FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP,
> +		 * FS_IOC_UNRESVSP64 and FS_IOC_ZERO_RANGE only apply to regular
> +		 * files (as implemented in file_ioctl()).
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> +	default:
> +		/*
> +		 * Other commands are guarded by the catch-all access right.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
> +	}
> +}
> +
>  /* Ruleset management */
>  
>  static struct landlock_object *get_inode_object(struct inode *const inode)
> @@ -148,7 +286,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
>  	LANDLOCK_ACCESS_FS_EXECUTE | \
>  	LANDLOCK_ACCESS_FS_WRITE_FILE | \
>  	LANDLOCK_ACCESS_FS_READ_FILE | \
> -	LANDLOCK_ACCESS_FS_TRUNCATE)
> +	LANDLOCK_ACCESS_FS_TRUNCATE | \
> +	LANDLOCK_ACCESS_FS_IOCTL_DEV)
>  /* clang-format on */
>  
>  /*
> @@ -1335,8 +1474,10 @@ static int hook_file_alloc_security(struct file *const file)
>  static int hook_file_open(struct file *const file)
>  {
>  	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
> -	access_mask_t open_access_request, full_access_request, allowed_access;
> -	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> +	access_mask_t open_access_request, full_access_request, allowed_access,
> +		optional_access;
> +	const struct inode *inode = file_inode(file);
> +	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
>  	const struct landlock_ruleset *const dom =
>  		get_fs_domain(landlock_cred(file->f_cred)->domain);
>  
> @@ -1354,6 +1495,10 @@ static int hook_file_open(struct file *const file)
>  	 * We look up more access than what we immediately need for open(), so
>  	 * that we can later authorize operations on opened files.
>  	 */
> +	optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> +	if (is_device)
> +		optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
> +
>  	full_access_request = open_access_request | optional_access;
>  
>  	if (is_access_to_paths_allowed(
> @@ -1410,6 +1555,36 @@ static int hook_file_truncate(struct file *const file)
>  	return -EACCES;
>  }
>  
> +static int hook_file_ioctl(struct file *file, unsigned int cmd,
> +			   unsigned long arg)
> +{
> +	const struct inode *inode = file_inode(file);
> +	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
> +	access_mask_t required_access, allowed_access;

As explained in [2], I'd like not-sandboxed tasks to not have visible
performance impact because of Landlock:

  We should first check landlock_file(file)->allowed_access as in
  hook_file_truncate() to return as soon as possible for non-sandboxed
  tasks.  Any other computation should be done after that (e.g. with an
  is_device() helper).

[2] https://lore.kernel.org/r/20240311.If7ieshaegu2@digikod.net

This is_device(file) helper should also replace other is_device variables.


> +
> +	if (!is_device)
> +		return 0;
> +
> +	/*
> +	 * It is the access rights at the time of opening the file which
> +	 * determine whether IOCTL can be used on the opened file later.
> +	 *
> +	 * The access right is attached to the opened file in hook_file_open().
> +	 */
> +	required_access = get_required_ioctl_dev_access(cmd);
> +	allowed_access = landlock_file(file)->allowed_access;
> +	if ((allowed_access & required_access) == required_access)
> +		return 0;
> +
> +	return -EACCES;
> +}
> +
> +static int hook_file_ioctl_compat(struct file *file, unsigned int cmd,
> +				  unsigned long arg)
> +{
> +	return hook_file_ioctl(file, cmd, arg);

The compat-specific IOCTL commands are missing (e.g. FS_IOC_RESVSP_32).
Relying on is_masked_device_ioctl() should make this call OK though.

> +}
> +
>  static struct security_hook_list landlock_hooks[] __ro_after_init = {
>  	LSM_HOOK_INIT(inode_free_security, hook_inode_free_security),
>  

^ permalink raw reply	[relevance 0%]

* [ANNOUNCE] util-linux v2.40
@ 2024-03-27 15:10  1% Karel Zak
  0 siblings, 0 replies; 200+ results
From: Karel Zak @ 2024-03-27 15:10 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, util-linux

The util-linux release v2.40 is available at

  http://www.kernel.org/pub/linux/utils/util-linux/v2.40/
 
Feedback and bug reports, as always, are welcomed.
 
  Karel


util-linux 2.40 Release Notes
=============================

Release highlights
------------------

libmount:

    The libmount monitor has been enhanced to be more user-friendly for
    userspace-specific mount options (e.g., GNOME gVFS).

    The new mount kernel API can be enabled/disabled using the environment
    variable LIBMOUNT_FORCE_MOUNT2={always, never, auto}.

libsmartcols:

    The library now supports filtering expressions (refer to the scols-filter(5)
    man page). Applications can utilize the filter before gathering all data
    for output, reducing resource usage and improving performance. Currently,
    this feature is employed in lsfd and lsblk. Example:

	lsblk --filter 'NAME !~ "sd[ab]"'

    The library now supports counters (based on filters). For instance:

        lsfd --summary=only \
                   -C 'netlink sockets':'(NAME =~ "NETLINK:.*")' \
                   -C 'unix sockets':'(NAME =~ "UNIX:.*")'
     VALUE COUNTER
        57 netlink sockets
      1552 unix sockets

* liblastlog2 and pam_lastlog2:

    Introducing a new library, liblastlog2, which implements a lastlog
    replacement using a SQLite3 database in /var/lib/lastlog/lastlog2.db. This
    implementation is Y2038-safe, and the database size is independent of UIDs.

    A new command, lastlog2, is now available.

    lastlog2 is utilized in pam_lastlog2.


* libuuid: Improved support for 64-bit time.

* A new command, lsclocks, is introduced to display system clocks (realtime,
  monotonic, boottime, etc.).

* A new small command, enosys, is introduced to make syscalls fail with ENOSYS.
  The command is based on the prctl(SECCOMP_MODE_FILTER) syscall.

* A new command, exch, is introduced to atomically exchange paths between
  two files. The command is based on the renameat2(RENAME_EXCHANGE) syscall.

* A new command, setpgid, is introduced to run a program in a new process group.

* login(1) now supports systemd service credentials, including the
  'login.noauth' credential.


Security issues
---------------

This release fixes CVE-2024-28085. The wall command does not filter escape
sequences from command line arguments. The vulnerable code was introduced in
commit cdd3cc7fa4 (2013). Every version since has been vulnerable.

This allows unprivileged users to put arbitrary text on other users terminals,
if mesg is set to y and *wall is setgid*. Not all distros are affected (e.g.
CentOS, RHEL, Fedora are not; Ubuntu and Debian wall is both setgid and mesg is
set to y by default).


Changes between v2.39 and v2.40
-------------------------------

AUTHORS:
   - add tools contributed by myself  [Thomas Weißschuh]
Add Phytium FTC862 cpu model. fix:
   - #2486  [unknown]
Documentation:
   - add basic smoketest for boilerplate.c  [Thomas Weißschuh]
Fix typo:
   - octen -> octet  [zeyun chen]
agetty:
   - Load autologin user from agetty.autologin credential  [Daan De Meyer]
   - include fileutils.h  [Thomas Weißschuh]
   - remove duplicate include  [Karel Zak]
   - use get_terminal_default_type()  [Karel Zak]
   - use sd_get_sessions() for number of users (#2088)  [Thorsten Kukuk]
audit-arch:
   - add support for alpha  [Thomas Weißschuh]
autotools:
   - add dependence on libsmartcols for lsclocks  [Karel Zak]
   - add missing dist_noinst_DATA  [Karel Zak]
   - check for flex in autogen.sh  [Karel Zak]
   - cleanup lastlog2 stuff  [Karel Zak]
   - fix AC_DEFINE_UNQUOTED() use  [Karel Zak]
   - fix AC_PROG_LEX use  [Karel Zak]
   - fix librtas check  [Karel Zak]
   - fix non-Linux build  [Karel Zak, Samuel Thibault]
   - fix typos  [Karel Zak]
   - use stamp file to build filter parser, improve portability  [Karel Zak]
bash-completion:
   - (fadvise)  fix a typo  [Masatake YAMATO]
   - (lslocks)  add --bytes option to the rules  [Masatake YAMATO]
   - add -T to last  [Karel Zak]
   - make sure that "lastb" actually completes  [Eli Schwartz]
   - update for mkswap  [Karel Zak]
blkdev.h:
   - avoid some unused argument warnings  [Thomas Weißschuh]
   - update location of SCSI device types  [Christoph Anton Mitterer]
blkid:
   - fix call to err_exclusive_options  [Thomas Weißschuh]
blkpr:
   - store return value of getopt_long in int  [Thomas Weißschuh]
blkzone:
   - don't take address of struct blk_zone  [Thomas Weißschuh]
blockdev:
   - add missing verbose output for --getsz  [Christoph Anton Mitterer]
   - add support for BLKGETZONESZ  [Thomas Weißschuh]
   - properly check for BLKGETZONESZ ioctl  [Thomas Weißschuh]
build:
   - use -std=c99 and -std=c++11 by default  [Thomas Weißschuh]
build-sys:
   - (tests) validate that time_t is 64bit  [Thomas Weißschuh]
   - add --disable-exch  [Karel Zak]
   - add --disable-waitpid  [Frantisek Sumsal]
   - add AX_COMPARE_VERSION  [Thomas Weißschuh]
   - add enosys and syscalls.h to gitignore  [Enze Li]
   - backport autoconf year2038 macros  [Thomas Weißschuh]
   - don't call pkg-config --static if unnecessary  [Karel Zak]
   - fail build for untracked files  [Thomas Weißschuh]
   - fix libmount/src/hooks.c use  [Karel Zak]
   - fix po-man clean  [Karel Zak]
   - fix typo in waitpid check  [Thomas Weißschuh]
   - improve checkadoc  [Karel Zak]
   - introduce localstatedir  [Karel Zak]
   - make sure everywhere is localstatedir  [Karel Zak]
   - only build col on glibc  [Thomas Weißschuh]
   - only pass --failure-level if supported  [Thomas Weißschuh]
   - rearrange gitignore in alphabetical order  [Enze Li]
   - release++ (v2.40-rc1)  [Karel Zak]
   - release++ (v2.40-rc2)  [Karel Zak]
   - try to always use 64bit time support on glibc  [Thomas Weißschuh]
buildsys:
   - warn on usage of VLAs  [Thomas Weißschuh]
   - warn on usage of alloca()  [Thomas Weißschuh]
c.h:
   - make err_nonsys available  [Thomas Weißschuh]
cal:
   - avoid out of bound write  [Thomas Weißschuh]
   - fix error message for bad -c argument  [Jakub Wilk]
   - fix long option name for -c  [Jakub Wilk]
cfdisk:
   - add hint about labels for bootable flag  [Karel Zak]
   - ask y/n before wipe  [Karel Zak]
   - fix menu behavior after writing changes  [Karel Zak]
   - properly handle out-of-order partitions during resize  [Thomas Weißschuh]
chcpu(8):
   - document limitations of -g  [Stanislav Brabec]
chrt:
   - (man) add note about --sched-period lower limit  [Karel Zak]
   - (tests) don't mark tests as known failed  [Thomas Weißschuh]
   - (tests) increase deadline test parameters  [Thomas Weißschuh]
   - allow option separator  [Thomas Weißschuh]
chsh:
   - use libeconf to read /etc/shells  [Thorsten Kukuk]
ci:
   - (codeql) ignore cpp/uncontrolled-process-operation  [Thomas Weißschuh]
   - add OpenWrt SDK based CI jobs  [Thomas Weißschuh]
   - also use GCC 13 for sanitizer builds  [Thomas Weißschuh]
   - build on old distro  [Thomas Weißschuh]
   - build with GCC 13/11  [Thomas Weißschuh]
   - build with clang 17  [Thomas Weißschuh]
   - cache openwrt sdk  [Thomas Weißschuh]
   - cancel running jobs on push  [Frantisek Sumsal]
   - collect coverage on _exit() as well  [Frantisek Sumsal]
   - consistently use gcc 13 during CI  [Thomas Weißschuh]
   - disable cpp/path-injection rule  [Thomas Weißschuh]
   - don't combine -Werror and -fsanitize  [Thomas Weißschuh]
   - enable -Werror for meson  [Thomas Weißschuh]
   - fix indentation  [Frantisek Sumsal]
   - hide coverage-related stuff behind --enable-coverage  [Frantisek Sumsal]
   - mark source directory as safe  [Thomas Weißschuh]
   - packit  add flex  [Karel Zak]
   - prevent prompts during installation  [Thomas Weißschuh]
   - reduce aslr level to avoid issues with ASAN  [Thomas Weißschuh]
   - run full testsuite under musl libc  [Thomas Weißschuh]
   - tweak build dir's ACL when collecting coverage  [Frantisek Sumsal]
   - use clang 16  [Thomas Weißschuh]
column:
   - fix -l  [Karel Zak]
   - fix memory leak  [Thomas Weißschuh]
coverage.h:
   - mark _exit as noreturn  [Thomas Weißschuh]
ctrlaltdel:
   - remove unnecessary uid check  [JJ-Meng]
disk-utils:
   - add SPDX and Copyright notices  [Karel Zak]
dmesg:
   - (tests) validate json output  [Thomas Weißschuh]
   - -r LOG_MAKEPRI needs fac << 3  [Edward Chron]
   - Delete redundant pager setup  [Karel Zak]
   - add caller_id support  [Edward Chron]
   - add support for reserved and local facilities  [Thomas Weißschuh]
   - cleanup function names  [Karel Zak]
   - correctly print all supported facility names  [Thomas Weißschuh]
   - don't affect delta by --since  [Karel Zak]
   - error out instead of silently ignoring force_prefix  [Thomas Weißschuh]
   - fix FD leak  [Karel Zak]
   - fix delta calculation  [Karel Zak]
   - fix wrong size calculation  [Karel Zak]
   - make kmsg read() buffer big enough for kernel  [anteater]
   - man and coding style changes  [Karel Zak]
   - only write one message to json  [Thomas Weißschuh]
   - open-code LOG_MAKEPRI  [Thomas Weißschuh]
   - support for additional human readable timestamp  [Rishabh Thukral]
   - support reading kmsg format from file  [Thomas Weißschuh]
   - use symbolic defines for second conversions  [Thomas Weißschuh]
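Several of the dmesg changes above (the --since fix, single-message JSON output, reading a kmsg-format dump from a file) are easiest to see from the shell. A minimal sketch, assuming a util-linux 2.40 dmesg; the file-reading option is assumed to be spelled `--kmsg-file` per the change above:

```shell
# Messages from the last five minutes only, with human-readable timestamps.
dmesg --since '5 minutes ago' --ctime

# JSON output (one record per message after the fix above).
dmesg --json | head -c 200; echo

# Parse a saved /dev/kmsg-format dump instead of the live ring buffer;
# each record looks like "<prefix>,<seq>,<usec>,<flag>;<text>".
printf '6,1,1000,-;hello from a saved kmsg dump\n' > kmsg.dump
dmesg --kmsg-file kmsg.dump
```

Reading the live buffer may require elevated privileges (kernel.dmesg_restrict); the `--kmsg-file` path does not.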
docs:
   - add BSD-2-Clause  [Karel Zak]
   - add SPDX to boilerplate.c  [Karel Zak]
   - add enosys to ReleaseNotes  [Karel Zak]
   - add exch to ReleaseNotes  [Karel Zak]
   - add hints about systemd  [Karel Zak]
   - add note about stable branches  [Karel Zak]
   - add setpgid to ReleaseNotes  [Karel Zak]
   - cleanup public domain license texts  [Karel Zak]
   - fix a typo  [Masatake YAMATO]
   - fix typos  [Jakub Wilk]
   - improve howto-pull-request  [Karel Zak]
   - move Copyright in boilerplate.c  [Karel Zak]
   - move GPL-2.0 license text to Documentation directory  [Karel Zak]
   - remove duplicated author name in namei.1.adoc  [Emanuele Torre]
   - update AUTHORS file  [Karel Zak]
   - update v2.40-ReleaseNotes  [Karel Zak]
   - use HTTPS for GitHub clone URLs  [Jakub Wilk]
   - use proper SPDX identifier for GPL-2.0  [Karel Zak]
eject:
   - (tests) don't write mount hint to terminal  [Karel Zak]
enosys:
   - add --list  [Thomas Weißschuh]
   - add bash completion  [Thomas Weißschuh]
   - add common arguments  [Thomas Weißschuh]
   - add helpers for 64 bit integer loading  [Thomas Weißschuh]
   - add manpage  [Thomas Weißschuh]
   - add support for MIPS, PowerPC and ARC  [Thomas Weißschuh]
   - add support for ioctl blocking  [Thomas Weißschuh]
   - add support for loongarch  [Thomas Weißschuh]
   - add support for sparc  [Thomas Weißschuh]
   - add test  [Thomas Weißschuh]
   - allow CPU speculation  [Thomas Weißschuh]
   - avoid warnings when no aliases are found  [Thomas Weißschuh]
   - build BPF dynamically  [Thomas Weißschuh]
   - don't require end-of-options marker  [Thomas Weißschuh]
   - don't validate that numbers are found from headers  [Thomas Weißschuh]
   - drop unnecessary load of ioctl number  [Thomas Weißschuh]
   - enable locale handling  [Thomas Weißschuh]
   - find syscalls at build time  [Thomas Weißschuh]
   - fix build on hppa  [John David Anglin]
   - fix native arch for s390x  [Thomas Weißschuh]
   - improve checks for EXIT_NOTSUPP  [Thomas Weißschuh]
   - include sys/syscall.h  [Thomas Weißschuh]
   - list syscall numbers  [Thomas Weißschuh]
   - make messages useful for users  [Thomas Weißschuh]
   - mark variable static  [Thomas Weißschuh]
   - move from tests/helpers/test_enosys.c  [Thomas Weißschuh]
   - only build if AUDIT_ARCH_NATIVE is defined  [Thomas Weißschuh]
   - optimize bytecode when execve is not blocked  [Thomas Weißschuh]
   - optimize bytecode when no ioctls are blocked  [Thomas Weißschuh]
   - properly block execve syscall  [Thomas Weißschuh]
   - provide a nicer build message for syscalls.h generation  [Thomas Weißschuh]
   - remove long jumps from BPF  [Thomas Weißschuh]
   - remove unneeded inline variable declaration  [Thomas Weißschuh]
   - split audit arch detection into dedicated header  [Thomas Weißschuh]
   - store blocked syscalls in list instead of array  [Thomas Weißschuh]
   - syscall numbers are "long"  [Thomas Weißschuh]
   - translate messages  [Thomas Weißschuh]
   - validate syscall architecture  [Thomas Weißschuh]
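enosys runs a command under a seccomp filter that makes the selected syscalls fail with ENOSYS instead of executing. A hedged sketch of the options added above (spellings per enosys(8); the ioctl name is illustrative):

```shell
# Enumerate the syscalls known for this architecture (the --list change).
enosys --list | head

# Run a command with fallocate(2) stubbed out: the BPF filter returns
# ENOSYS, so fallocate(1) fails cleanly instead of allocating.
enosys --syscall fallocate fallocate -l 1M testfile || echo "fallocate blocked"

# Ioctls can be blocked the same way (the ioctl-blocking change);
# FIOCLEX here is just an example name.
enosys --ioctl FIOCLEX true
```

A filter only affects the listed syscalls; a child that never issues them runs unchanged.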
exch:
   - Add man page to po4a.cfg to make it translatable  [Mario Blättermann]
   - cosmetic changes  [Karel Zak]
   - fix typo  [Karel Zak]
   - new command exchanging two files atomically  [Masatake YAMATO]
   - properly terminate options array  [Thomas Weißschuh]
   - use NULL rather than zero  [Karel Zak]
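The new exch command swaps two paths atomically (renameat2(2) with RENAME_EXCHANGE under the hood), so neither name is ever missing during the swap. A small sketch:

```shell
printf 'one' > a.txt
printf 'two' > b.txt
exch a.txt b.txt        # atomic: both names exist at every instant
cat a.txt               # now holds the old contents of b.txt
cat b.txt               # now holds the old contents of a.txt
```

As with any rename, both paths must live on the same filesystem.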
exec_shell:
   - use xasprintf  [Thomas Weißschuh]
fadvise:
   - (test) don't compare fincore page counts  [Thomas Weißschuh]
   - (test) dynamically calculate expected test values  [Thomas Weißschuh]
   - (test) test with 64k blocks  [Thomas Weißschuh]
   - (tests) factor out calls to "fincore"  [Thomas Weißschuh]
   - Fix markup in man page  [Mario Blättermann]
fallocate:
   - fix the way to evaluate values returned from posix_fallocate  [Masatake YAMATO]
fdisk:
   - (man) fix typo, improve readability  [Karel Zak]
   - add support for partition resizing  [Thomas Weißschuh]
   - guard posix variable  [Thomas Weißschuh]
   - remove usage of VLA  [Thomas Weißschuh]
fileeq:
   - optimize size of ul_fileeq_method  [Thomas Weißschuh]
fincore:
   - (tests) adapt alternative testcases to new header format  [Thomas Weißschuh]
   - (tests) also use nosize error file  [Thomas Weißschuh]
   - (tests) fix double log output  [Chris Hofstaedtler]
   - add --output-all  [Thomas Weißschuh]
   - fix alignment of column listing in --help  [Thomas Weißschuh]
   - refactor output formatting  [Thomas Weißschuh]
   - report data from cachestat()  [Thomas Weißschuh]
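The fincore changes above are visible directly from the command line; a sketch, noting that the cachestat()-derived columns only populate on kernels that provide the cachestat() syscall:

```shell
# How much of a file is resident in the page cache.
f=$(mktemp)
head -c 1M /dev/urandom > "$f"
fincore "$f"

# Every available column, including the new cachestat()-backed ones.
fincore --output-all "$f"
```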
findmnt:
   - add --list-columns  [Karel Zak]
   - add -I, --dfi options for imitating the output of df -i  [Masatake YAMATO]
   - add inode-related columns for implementing "df -i" like output  [Masatake YAMATO]
   - remove deleted option from manual  [Chris Hofstaedtler]
   - use zero to separate lines in multi-line cells  [Karel Zak]
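Two of the findmnt additions in action; a quick sketch assuming a util-linux 2.40 findmnt:

```shell
# Imitate "df -i": inode usage for mounted filesystems (-I/--dfi).
findmnt -I

# Enumerate all columns findmnt can output (--list-columns).
findmnt --list-columns
```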
flock:
   - initialize timevals [-Werror=maybe-uninitialized]  [Karel Zak]
fsck:
   - initialize timevals [-Werror=maybe-uninitialized]  [Karel Zak]
fstab:
   - Fix markup in man page  [Mario Blättermann]
   - add hint about systemd reload  [Karel Zak]
github:
   - add labeler  [Karel Zak]
   - check apt-cache in more robust way  [Karel Zak]
   - check apt-cache in more robust way (v2)  [Masatake YAMATO]
   - fix build with clang and in ubuntu build-root  [Karel Zak]
gitignore:
   - ignore exch  [Thomas Weißschuh]
   - ignore setpgid binary  [Christian Göttsche]
hardlink:
   - (man) add missing comma  [Jakub Wilk]
   - Fix markup in man page  [Mario Blättermann]
   - fix fiemap use  [Karel Zak]
hexdump:
   - Add missing section header in man page  [Mario Blättermann]
   - add '--one-byte-hex' format option  [Tomasz Wojdat]
   - add new format-strings test case  [Tomasz Wojdat]
   - check blocksize when displaying data  [Karel Zak]
   - use xasprintf to build string  [Thomas Weißschuh]
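The new --one-byte-hex format displays each input byte as its own two-digit hex value; a quick sketch:

```shell
# 'A' = 0x41, 'B' = 0x42, 'C' = 0x43
printf 'ABC' | hexdump --one-byte-hex
```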
hwclock:
   - Improve set error in the face of jitter  [Eric Badger]
   - add --vl-read, --vl-clear documentation and bash-completion  [Rasmus Villemoes]
   - add support for RTC_VL_READ/RTC_VL_CLR ioctls  [Rasmus Villemoes]
   - handle failure of audit_log_user_message  [Thomas Weißschuh]
   - reuse error message  [Karel Zak]
include:
   - add DragonFlyBSD GPT partition types  [Thomas Weißschuh]
   - add U-Boot environment partition type  [Thomas Weißschuh]
   - add some more ChromeOS partition types  [Thomas Weißschuh]
   - define pidfd syscalls if needed  [Markus Mayer]
include/audit-arch:
   - add missing SPDX  [Karel Zak]
include/bitops.h:
   - Remove bswap* compatibility hack for FreeBSD  [Daniel Engberg]
include/c.h:
   - add helpers for unaligned structure access  [Thomas Weißschuh]
   - handle members of const struct  [Thomas Weißschuh]
   - implement reallocarray  [Thomas Weißschuh]
include/crc64:
   - add missing license header  [Karel Zak]
include/strutils:
   - add ul_strtold()  [Karel Zak]
irqtop:
   - fix numeric sorting  [Valery Ushakov]
jsonwrt:
   - add ul_jsonwrt_value_s_sized  [Thomas Weißschuh]
last:
   - Add -T option for tab-separated output  [Trag Date]
   - avoid out of bounds array access  [biubiuzy]
last(1):
   - Document -T option for tab-separated output  [Trag Date]
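The new -T option makes last's output machine-friendly; a sketch (flag spelling per the change above):

```shell
# Tab-separated fields survive awkward widths in user and host names,
# so the output can be split reliably with cut(1) or awk(1).
last -T | head -3
last -T | cut -f1 | sort -u
```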
lastlog:
   - cleanup function definitions  [Karel Zak]
   - improve errors printing  [Karel Zak]
lastlog2:
   - Don't print space if Service column is not printed  [Miika Alikirri]
   - Fix various issues with meson  [Fabian Vogt]
   - convert check_user() to boolean-like macro  [Karel Zak]
   - improve coding style  [Karel Zak]
   - make longopts[] static-const  [Karel Zak]
   - rename tmpfiles  [Christian Hesse]
ldattach:
   - don't call exit() from signal handler  [Thomas Weißschuh]
ldfd:
   - delete unnecessary ';'  [Masatake YAMATO]
lib:
   - remove pager.c from libcommon  [Karel Zak]
lib/ include/:
   - cleanup licence headers  [Karel Zak]
lib/buffer:
   - make buffer usable for non-string data  [Karel Zak]
lib/c_strtod:
   - fix uselocale() fallback if strtod_l() is not available  [Alan Coopersmith]
lib/caputils:
   - fix integer handling issues [coverity scan]  [Karel Zak]
lib/color-names:
   - fix licence header  [Karel Zak]
lib/colors:
   - correct documentation of colors_add_scheme()  [Thomas Weißschuh]
lib/cpuset:
   - exit early from cpulist_parse  [Thomas Weißschuh]
   - make max variable const  [Thomas Weißschuh]
lib/env:
   - avoid underflow of read_all_alloc() return value  [Thomas Weißschuh]
   - fix function name remote_entry -> remove_entry  [Thomas Weißschuh]
lib/idcache:
   - always gracefully handle null cache  [Thomas Weißschuh]
lib/jsonwrt:
   - add support for float numbers  [Karel Zak]
lib/loopdev:
   - consistently return error values from loopcxt_find_unused()  [Thomas Weißschuh]
   - document function return values  [Thomas Weißschuh]
lib/mbsalign:
   - calculate size of decoded string  [Karel Zak]
lib/mbsedit:
   - remove usage of VLA  [Thomas Weißschuh]
lib/pager:
   - Allow PAGER commands with options  [Dragan Simic]
   - Apply pager-specific fixes only when needed  [Dragan Simic]
lib/path:
   - Set errno in case of fgets failure  [Tobias Stoeckmann]
   - fix possible out of boundary access  [Tobias Stoeckmann]
   - fix typos  [Tobias Stoeckmann]
   - remove ul_prefix_fopen  [Tobias Stoeckmann]
   - remove usage of VLA  [Thomas Weißschuh]
   - set errno in case of error  [Tobias Stoeckmann]
lib/pty-session:
   - Don't ignore SIGHUP.  [Kuniyuki Iwashima]
   - initialize timevals [-Werror=maybe-uninitialized]  [Karel Zak]
lib/sha1:
   - fix for old glibc  [Karel Zak]
lib/shells:
   - Plug econf memory leak  [Tobias Stoeckmann]
   - initialize free-able variables  [Karel Zak]
   - remove space after function name  [Karel Zak]
lib/strutils:
   - add strfappend and strvfappend  [Masatake YAMATO]
   - add ul_next_string()  [Karel Zak]
   - fix typo  [Jakub Wilk]
lib/timeutils:
   - (parse_timestamp_reference) report errors on overflow  [Thomas Weißschuh]
   - (tests) add test for formatting  [Thomas Weißschuh]
   - (tests) move to struct timespec  [Thomas Weißschuh]
   - constify some arguments  [Thomas Weißschuh]
   - don't use glibc strptime extension  [Thomas Weißschuh]
   - implement nanosecond formatting  [Thomas Weißschuh]
   - implement timespec formatting  [Thomas Weißschuh]
   - print error if timestamp can't be parsed  [Thomas Weißschuh]
   - test epoch timestamp  [Thomas Weißschuh]
lib/ttyutils:
   - add get_terminal_default_type()  [Karel Zak]
libblkid:
   - (adapted_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (bcache) also calculate checksum over journal buckets  [Thomas Weißschuh]
   - (bcache) extend superblock definition  [Thomas Weißschuh]
   - (bcache) report block size  [Thomas Weißschuh]
   - (bcache) report label  [Thomas Weißschuh]
   - (bcache) report version  [Thomas Weißschuh]
   - (bcachefs) adapt to major.minor version  [Thomas Weißschuh]
   - (bcachefs) add support for 2nd superblock at 2MiB  [Thomas Weißschuh]
   - (bcachefs) add support for sub-device labels  [Thomas Weißschuh]
   - (bcachefs) add support for superblock at end of disk  [Thomas Weißschuh]
   - (bcachefs) compare against offset from idmag  [Thomas Weißschuh]
   - (bcachefs) fix compiler warning [-Werror=sign-compare]  [Karel Zak]
   - (bcachefs) fix not detecting large superblocks  [Colin Gillespie]
   - (bcachefs) fix size validation  [Thomas Weißschuh]
   - (cramfs) use magic hint  [Thomas Weißschuh]
   - (ddf_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (dev) use strdup to duplicate string  [Thomas Weißschuh]
   - (drbd) avoid unaligned accesses  [Thomas Weißschuh]
   - (drbd) reduce false-positive  [biubiuzy]
   - (drbd) use magics  [Thomas Weißschuh]
   - (drbd) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (drbd) validate zero padding  [Thomas Weißschuh]
   - (hfsplus) reduce false positive  [Karel Zak]
   - (highpoint_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (isw_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (jmicron_raid) avoid modifying shared buffer  [Thomas Weißschuh]
   - (jmicron_raid) use checksum APIs  [Thomas Weißschuh]
   - (jmicron_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (lsi_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (lvm2) read complete superblock  [Thomas Weißschuh]
   - (ntfs) validate that sector_size is a power of two  [Thomas Weißschuh]
   - (nvidia_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (probe) add magic hint  [Thomas Weißschuh]
   - (probe) allow superblock offset from end of device  [Thomas Weißschuh]
   - (probe) handle probe without chain gracefully  [Thomas Weißschuh]
   - (probe) read data in chunks  [Thomas Weißschuh]
   - (probe) remove chunking from blkid_probe_get_idmag()  [Thomas Weißschuh]
   - (probe) remove duplicate log  [Thomas Weißschuh]
   - (promise_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (silicon_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (stratis) remove usage of VLA  [Thomas Weißschuh]
   - (superblocks) add helper blkid32_to_cpu()  [Thomas Weißschuh]
   - (vfat) avoid modifying shared buffer  [Thomas Weißschuh]
   - (via_raid) validate size in standard minsz predicate  [Thomas Weißschuh]
   - (vxfs) add test files  [Thomas Weißschuh]
   - (vxfs) report endianness  [Thomas Weißschuh]
   - (vxfs) simplify prober  [Thomas Weißschuh]
   - (vxfs) use hex escape for magic  [Thomas Weißschuh]
   - (zonefs) avoid modifying shared buffer  [Thomas Weißschuh]
   - Check offset in LUKS2 header  [Milan Broz]
   - add remove_buffer helper  [Thomas Weißschuh]
   - avoid aligning out of probing area  [Thomas Weißschuh]
   - avoid memory leak of cachefile path  [Thomas Weißschuh]
   - avoid use of non-standard typeof()  [Thomas Weißschuh]
   - constify cached disk data  [Thomas Weißschuh]
   - constify return values of blkid_probe_get_sb  [Thomas Weißschuh]
   - exfat  fix failure to find volume label  [Yuezhang Mo]
   - fix topology chain types mismatch  [Karel Zak]
   - improve portability  [Samuel Thibault]
   - introduce blkid_wipe_all  [Thomas Weißschuh]
   - introduce helper to get offset for idmag  [Thomas Weißschuh]
   - iso9660  Define all fields in iso_volume_descriptor according to ECMA-119 4th edition spec  [Pali Rohár]
   - iso9660  Implement full High Sierra CDROM format support  [Pali Rohár]
   - jfs - avoid undefined shift  [Milan Broz]
   - limit read buffer size  [Thomas Weißschuh]
   - make enum libblkid_endianness lowercase  [Thomas Weißschuh]
   - protect shared buffers against modifications  [Thomas Weißschuh]
   - prune unneeded buffers  [Thomas Weißschuh]
   - reset errno before calling probefuncs  [Thomas Weißschuh]
libfdisk:
   - (dos) remove usage of VLA  [Thomas Weißschuh]
   - (sgi)  use strncpy over strcpy  [Thomas Weißschuh]
   - (sun) properly initialize partition data  [Thomas Weißschuh]
   - (tests) fix tests for removal of non-blockdev sync()  [Thomas Weißschuh]
   - add fdisk_partition_get_max_size  [Thomas Weißschuh]
   - add shortcut for Linux extended boot  [Thomas Weißschuh]
   - constify builtin fdisk_parttype  [Thomas Weißschuh]
   - fdisk_deassign_device  only sync(2) blockdevs  [наб]
   - fix typo in debug message  [Thomas Weißschuh]
   - handle allocation failure in fdisk_new_partition  [Thomas Weißschuh]
   - reset errno before calling read()  [Thomas Weißschuh]
   - use new blkid_wipe_all helper  [Thomas Weißschuh]
liblastlog2:
   - fix leaks  [Karel Zak]
   - fix pkg-config includedir  [Karel Zak]
libmount:
   - (context) avoid dead store  [Thomas Weißschuh]
   - (optlist) correctly detect ro status  [Thomas Weißschuh]
   - (python)  work around python 3.12 bug  [Thomas Weißschuh]
   - (tests) add helper for option list splitting  [Thomas Weißschuh]
   - (tests) don't require root for update tests  [Thomas Weißschuh]
   - (tests) fix --filesystems crash on invalid argument  [Thomas Weißschuh]
   - (tests) fix --filesystems test argument parsing  [Thomas Weißschuh]
   - (tests) split helper tests  [Thomas Weißschuh]
   - (utils) avoid dead store  [Thomas Weißschuh]
   - (utils) fix statx fallback  [Thomas Weißschuh]
   - (veritydev) use asprintf to build string  [Thomas Weißschuh]
   - Fix export of mnt_context_is_lazy and mnt_context_is_onlyonce  [Matt Turner]
   - Fix regression when mounting with atime  [Filipe Manana]
   - accept '\' as escape for options separator  [Karel Zak]
   - add helper to log mount messages as emitted by kernel  [Thomas Weißschuh]
   - add missing utab options after helper call  [Karel Zak]
   - add mnt_context_within_helper() wrapper  [Karel Zak]
   - add private mnt_optstr_get_missing()  [Karel Zak]
   - add sample to test fs and context relation  [Karel Zak]
   - add utab.act file  [Karel Zak]
   - always ignore user=<name>  [Karel Zak]
   - change syscall status macros to be functions  [Thomas Weißschuh]
   - check for availability of mount_setattr  [Thomas Weißschuh]
   - check for linux/mount.h  [Markus Mayer]
   - check for struct statx  [Markus Mayer]
   - cleanup --fake mode  [Karel Zak]
   - cleanup enosys returns from mount hook  [Karel Zak]
   - cleanup locking in table update code  [Karel Zak]
   - don't assume errno after failed asprintf()  [Karel Zak]
   - don't call hooks after mount.<type> helper  [Karel Zak]
   - don't call mount.<type> helper with usernames  [Karel Zak]
   - don't hold write fd to mounted device  [Jan Kara]
   - don't initialize variable twice (#2714)  [Thorsten Kukuk]
   - don't pass option "defaults" to helper  [Thomas Weißschuh]
   - fix copy & paste bug in lock initialization  [Karel Zak]
   - fix fsconfig value unescaping  [Karel Zak]
   - fix options prepend/insert and merging  [Karel Zak]
   - fix possible NULL dereference [coverity scan]  [Karel Zak]
   - fix statx() includes  [Karel Zak]
   - fix sync options between context and fs structs  [Karel Zak]
   - fix typo  [Debarshi Ray]
   - gracefully handle NULL path in mnt_resolve_target()  [Thomas Weißschuh]
   - guard against sysapi == NULL  [Thomas Weißschuh]
   - handle failure to apply flags as part of a mount operation  [Debarshi Ray]
   - ifdef statx() call  [Karel Zak]
   - ignore unwanted kernel events in monitor  [Karel Zak]
   - improve EPERM interpretation  [Karel Zak]
   - improve act file close  [Karel Zak]
   - improve mnt_table_next_child_fs()  [Karel Zak]
   - introduce /run/mount/utab.event  [Karel Zak]
   - introduce LIBMOUNT_FORCE_MOUNT2={always,never,auto}  [Karel Zak]
   - introduce reference counting for libmnt_lock  [Karel Zak]
   - make sure "option=" is used as string  [Karel Zak]
   - make stx_mnt_id use more robust  [Karel Zak]
   - reduce utab.lock permissions  [Karel Zak]
   - report all kernel messages for fd-based mount API  [Thomas Weißschuh]
   - report failed syscall name  [Karel Zak]
   - report kernel message from new API  [Karel Zak]
   - report statx in features list  [Karel Zak]
   - test utab options after helper call  [Thomas Weißschuh]
   - update documentation for MNT_ERR_APPLYFLAGS  [Debarshi Ray]
   - use mount(2) for remount on Linux < 5.14  [Karel Zak]
   - use some MS_* flags as superblock flags  [Karel Zak]
libmount (python):
   - simplify struct initialization  [Thomas Weißschuh]
libsmartcols:
   - (cell) consistently handle NULL argument  [Thomas Weißschuh]
   - (filter) Add on-demand data filler  [Karel Zak]
   - (filter) add ability to cast data  [Karel Zak]
   - (filter) add regular expression operators  [Karel Zak]
   - (filter) add upper case EQ,NE,LE,LT,GT and GE operators  [Karel Zak]
   - (filter) cleanup __filter_new_node()  [Karel Zak]
   - (filter) cleanup data types  [Karel Zak]
   - (filter) cleanup function arguments  [Karel Zak]
   - (filter) evaluate params  [Karel Zak]
   - (filter) fix dereferences and leaks [coverity scan]  [Karel Zak]
   - (filter) fix regex deallocation  [Karel Zak]
   - (filter) implement data basic operators  [Karel Zak]
   - (filter) implement logical operators  [Karel Zak]
   - (filter) improve holder status  [Karel Zak]
   - (filter) improve holder use  [Karel Zak]
   - (filter) improve scols_filter_assign_column()  [Karel Zak]
   - (filter) make holders API more generic  [Karel Zak]
   - (filter) move struct filter_expr  [Karel Zak]
   - (filter) move struct filter_param  [Karel Zak]
   - (filter) normalize param strings  [Karel Zak]
   - (filter) param data refactoring  [Karel Zak]
   - (filter) split code  [Karel Zak]
   - (filter) support empty values  [Karel Zak]
   - (filter) support period in identifier  [Karel Zak]
   - (filter) use also rpmatch() for boolean  [Karel Zak]
   - (man) fix typos  [Masatake YAMATO]
   - (sample) fix error message  [Karel Zak]
   - (samples)  fix format truncation warning  [Thomas Weißschuh]
   - (samples) remove filter.c  [Karel Zak]
   - (samples/fromfile) properly handle return value from getline()  [Thomas Weißschuh]
   - (tests) add test for continuous json output  [Thomas Weißschuh]
   - Add --highlight option to filter sample  [Karel Zak]
   - Export internally used types to API  [Karel Zak]
   - accept '% -' in column name for filters  [Karel Zak]
   - accept also '/' in column name for filters  [Karel Zak]
   - accept apostrophe as quote for strings in filter  [Karel Zak]
   - accept no data for custom wrapping cells  [Karel Zak]
   - add --{export,raw,json} to wrap sample  [Karel Zak]
   - add API to join filter and columns  [Karel Zak]
   - add filter API docs  [Karel Zak]
   - add filter sample  [Karel Zak]
   - add filter support to 'fromfile' sample  [Karel Zak]
   - add new functions to API docs  [Karel Zak]
   - add parser header files  [Karel Zak]
   - add scols-filter.5 man page  [Karel Zak]
   - add scols_cell_refer_memory()  [Karel Zak]
   - add support for zero separated wrap data  [Karel Zak]
   - add table cursor  [Karel Zak]
   - add wrap-zero test  [Karel Zak]
   - always print vertical symbol  [Karel Zak]
   - build filter scanner and parser header files too  [Karel Zak]
   - cleanup datafunc() API  [Karel Zak]
   - don't directly access struct members  [Karel Zak]
   - don't include hidden headers in column width calculation  [Thomas Weißschuh]
   - drop spurious newline in between streamed JSON objects  [Thomas Weißschuh]
   - fix columns reduction  [Karel Zak]
   - fix filter param copying  [Karel Zak]
   - fix filter parser initialization  [Karel Zak]
   - fix memory leak on filter parser error  [Karel Zak]
   - fix typo in comment  [Karel Zak]
   - fix typo in parser tokens  [Karel Zak]
   - fix uninitialized local variable in sample  [Karel Zak]
   - flush correct stream  [Thomas Weißschuh]
   - free after error in filter sample  [Karel Zak]
   - handle nameless tables in export format  [Thomas Weißschuh]
   - implement filter based counters  [Karel Zak]
   - improve and fix scols_column_set_properties()  [Karel Zak]
   - improve cell data preparation for non-wrapping cases  [Karel Zak]
   - improve filter integration, use JSON to dump  [Karel Zak]
   - improve parser error messages  [Karel Zak]
   - introduce basic files for filter implementation  [Karel Zak]
   - introduce column type  [Karel Zak]
   - make calculation more robust  [Karel Zak]
   - make cell data printing more robust  [Karel Zak]
   - make sure counter is initialized  [Karel Zak]
   - multi-line cells refactoring  [Karel Zak]
   - only recognize closed object as final element  [Thomas Weißschuh]
   - prefer float in filter expression  [Karel Zak]
   - reset cell wrapping if all done  [Karel Zak]
   - search also by normalized column names (aka 'shellvar' name)  [Karel Zak]
   - support SCOLS_JSON_FLOAT in print API  [Karel Zak]
   - support \x?? for data by samples/fromfile.c  [Karel Zak]
   - update gitignore  [Karel Zak]
libuuid:
   - (test_uuid) make reading UUIDs from file more robust  [Thomas Weißschuh]
   - Add uuid_time64 for 64bit time_t on 32bit  [Thorsten Kukuk]
   - avoid truncating clocks.txt to improve performance  [Goldwyn Rodrigues]
   - fix uint64_t printf and scanf format  [Karel Zak]
libuuid/src/gen_uuid.c:
   - fix cs_min declaration  [Fabrice Fontaine]
logger:
   - initialize socket credentials control union  [Karel Zak]
   - make sure path is terminated [coverity scan]  [Karel Zak]
   - use strncpy instead of strcpy  [Thomas Weißschuh]
login:
   - Initialize noauth from login.noauth credential  [Daan De Meyer]
   - Use pid_t for child_pid  [Tobias Stoeckmann]
   - access login.noauth file directly  [Tobias Stoeckmann]
   - document blank treatment in shell field  [Tobias Stoeckmann]
   - fix memory leak [coverity scan]  [Karel Zak]
   - ignore return of audit_log_acct_message  [Thomas Weißschuh]
   - move comment  [Tobias Stoeckmann]
   - prevent undefined ioctl and tcsetattr calls  [Tobias Stoeckmann]
   - simplify name creation  [Tobias Stoeckmann]
   - unify pw_shell script test  [Tobias Stoeckmann]
   - use correct terminal fd during setup  [Tobias Stoeckmann]
   - use xasprintf  [Tobias Stoeckmann]
login-utils:
   - Report crashes on reboot lines instead of overlapping uptimes  [Troy Rollo]
   - include libgen.h for basename API  [Khem Raj]
loopdev:
   - report lost loop devices  [Junxiao Bi, Karel Zak]
losetup:
   - add --loop-ref and REF column  [Karel Zak]
   - add MAJ and MIN for device and backing-file  [Karel Zak]
   - cleanup device node modes  [Karel Zak]
   - deduplicate find_unused() logic  [Thomas Weißschuh]
   - fix JSON MAJ MIN  [Karel Zak]
   - improve "sector boundary" warning  [Karel Zak]
   - make --output-all more usable  [Karel Zak]
   - report lost loop devices for finding free loop  [Junxiao Bi]
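The new --loop-ref option attaches a free-form reference string to a loop device so it can be found again later; a sketch (needs root and loop-device support; column names per losetup(8)):

```shell
truncate -s 8M disk.img
losetup --loop-ref backup-job -f disk.img        # attach with a reference
losetup -O NAME,REF,BACK-FILE                    # REF column shows it
dev=$(losetup --noheadings -O NAME -j disk.img)  # find by backing file
losetup -d "$dev"                                # detach again
```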
lsblk:
   - add --filter  [Karel Zak]
   - add --highlight  [Karel Zak]
   - add --list-columns  [Karel Zak]
   - add docs for filters and counters  [Karel Zak]
   - add hint that partition start is in sectors  [Karel Zak]
   - add scols counters support  [Karel Zak]
   - add separate MAJ and MIN columns  [Karel Zak]
   - always set column type  [Karel Zak]
   - define cell data-types, use raw data for SIZEs  [Karel Zak]
   - explain FSAVAIL in better way  [Karel Zak]
   - fix in-tree filtering  [Karel Zak]
   - ignore duplicate lines for counters  [Karel Zak]
   - improve --tree description  [Karel Zak]
   - make sure all line data are deallocated  [Karel Zak]
   - rename sortdata to rawdata  [Karel Zak]
   - report all unknown columns in filter  [Karel Zak]
   - split filter allocation and initialization  [Karel Zak]
   - support normalized column names on command line  [Karel Zak]
   - update after rebase  [Karel Zak]
   - use zero to separate lines in multi-line cells  [Karel Zak]
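The new lsblk filtering options build on the libsmartcols filter engine described below; a sketch — the expression grammar is defined by scols-filter(5), so the operator spellings here should be checked against that page:

```shell
# Hide everything except devices larger than 1 GiB (SIZE compares raw bytes).
lsblk --filter 'SIZE > 1073741824'

# Keep the full listing but highlight ext4 filesystems.
lsblk --highlight 'FSTYPE == "ext4"'

# Enumerate every column usable in filters and -o lists.
lsblk --list-columns
```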
lsclocks:
   - Fix markup and typos in man page  [Mario Blättermann]
   - Fix markup in man page  [Mario Blättermann]
   - add --output-all  [Thomas Weißschuh]
   - add COL_TYPE  [Thomas Weißschuh]
   - add NS_OFFSET column  [Thomas Weißschuh]
   - add column RESOL for clock resolution  [Thomas Weißschuh]
   - add relative time  [Thomas Weißschuh]
   - add support for RTC  [Thomas Weißschuh]
   - add support for cpu clocks  [Thomas Weißschuh]
   - add support for dynamic clocks  [Thomas Weißschuh]
   - automatically discover dynamic clocks  [Thomas Weißschuh]
   - don't fail without dynamic clocks  [Thomas Weißschuh]
   - factor out path based clocks  [Thomas Weißschuh]
   - improve dynamic clocks docs and completion  [Thomas Weißschuh]
   - new util to interact with system clocks  [Thomas Weißschuh]
   - refer to correct lsclocks(1) manpage  [Thomas Weißschuh]
   - remove unused code  [Karel Zak]
   - rename column RESOLUTION to RESOL_RAW  [Thomas Weißschuh]
   - split out data function  [Thomas Weißschuh]
   - trim default columns  [Thomas Weißschuh]
   - use clock id from clock_getcpuclockid in add_cpu_clock  [Alan Coopersmith]
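The new lsclocks utility summarizes the clocks covered by the changes above; a sketch (column names such as RESOL follow the changelog; exact defaults may differ by version):

```shell
# One row per clock: realtime, monotonic, boottime, RTC, ...
lsclocks

# Pick columns explicitly, including the clock-resolution column added above.
lsclocks -o TYPE,NAME,TIME,RESOL
```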
lscpu:
   - Even more Arm part numbers (early 2023)  [Jeremy Linton]
   - Use 4K buffer size instead of BUFSIZ  [Khem Raj]
   - add procfs-sysfs dump from VisionFive 2  [Jan Engelhardt]
   - cure empty output of lscpu -b/-p  [Jan Engelhardt]
   - don't use NULL sharedmap  [Karel Zak]
   - fix caches separator for --parse=<list>  [Karel Zak]
   - initialize all variables (#2714)  [Thorsten Kukuk]
   - remove redundant include  [Karel Zak]
   - remove usage of VLA  [Thomas Weißschuh]
   - restructure op-mode printing  [Thomas Weißschuh]
lscpu-cputype.c:
   - assign the value directly to both ar->bit32 and ar->bit64 instead of chaining them with the comma operator, which made clang with -Wcomma warn about "misuse of comma operator"; the assigned value is the same for both  [rilysh]
lsfd:
   - (man page) revise text decoration  [Masatake YAMATO]
   - make the code for filling SOURCE, PARTITION, and MAJMIN reusable  [Masatake YAMATO]
   - (comment) fix a typo  [Masatake YAMATO]
   - (cosmetic) normalize whitespaces  [Masatake YAMATO]
   - (filter) accept floating point numbers in expressions  [Masatake YAMATO]
   - (filter) improve error message  [Masatake YAMATO]
   - (filter) reduce duplicated code in macro definitions  [Masatake YAMATO]
   - (filter) support floating point number used in columns  [Masatake YAMATO]
   - (filter) weakly support ARRAY_STRING and ARRAY_NUMBER json types  [Masatake YAMATO]
   - (man) add bps(8) and ss(8) to the "SEE ALSO" section  [Masatake YAMATO]
   - (man) document --list-columns as the way to list columns  [Masatake YAMATO]
   - (man) document the ENDPOINT column for UNIX socket  [Masatake YAMATO]
   - (man) fix the broken page output for the description of NAME column  [Masatake YAMATO]
   - (man) fix the form for the optional argument of --inet option  [Masatake YAMATO]
   - (man) refer to scols-filter(5)  [Masatake YAMATO]
   - (man) update the description of ENDPOINTS column of UNIX-Stream sockets  [Masatake YAMATO]
   - (man) write about SOCK.SHUTDOWN column  [Masatake YAMATO]
   - (man) write about XMODE.m and classical system calls for multiplexing  [Masatake YAMATO]
   - (refactor) introduce a content data type for char devices  [Masatake YAMATO]
   - (refactor) make the code comparing struct lock objects reusable  [Masatake YAMATO]
   - (refactor) make the code for traversing threads reusable  [Masatake YAMATO]
   - (refactor) make the way to handle character devices extensible  [Masatake YAMATO]
   - (refactor) move miscdev specific code to cdev_misc_ops  [Masatake YAMATO]
   - (refactor) rename a function, s/new_process/new_proc/g  [Masatake YAMATO]
   - (refactor) rename add_nodevs to read_mountinfo  [Masatake YAMATO]
   - (refactor) unify the invocations of sysfs_get_byteorder()  [Masatake YAMATO]
   - (test) add a case for testing a unix socket including newline characters in its path name  [Masatake YAMATO]
   - (tests) don't run mqueue test on byteorder mismatch  [Thomas Weißschuh]
   - (tests) fix process leak  [Masatake YAMATO]
   - (tests) fix typo  [Thomas Weißschuh]
   - Fix typos in man page  [Mario Blättermann]
   - add "nsfs" to the nodev_table to fill SOURCE column for nsfs files  [Masatake YAMATO]
   - add 'D' flag for representing deleted files to XMODE column  [Masatake YAMATO]
   - add 'm' flag representing "multiplexed by epoll_wait(2)" to XMODE column  [Masatake YAMATO]
   - add BPF-MAP.TYPE, BPF-MAP.TYPE.RAW, and BPF-MAP.ID columns  [Masatake YAMATO]
   - add BPF-PROG.TYPE, BPF-PROG.TYPE.RAW, and BPF-PROG.ID columns  [Masatake YAMATO]
   - add BPF.NAME column  [Masatake YAMATO]
   - add EVENTFD.ID column  [Masatake YAMATO]
   - add PTMX.TTY-INDEX column  [Masatake YAMATO]
   - add SOCK.SHUTDOWN column  [Masatake YAMATO]
   - add TUN.IFFACE, a column for interfaces behind tun devices  [Masatake YAMATO]
   - add a back pointer as a member of anon_eventfd_data  [Masatake YAMATO]
   - add a helper function for adding a nodev to the nodev_table  [Masatake YAMATO]
   - add a helper function, add_endpoint  [Masatake YAMATO]
   - add a helper function, init_endpoint  [Masatake YAMATO]
   - add a helper function, new_ipc  [Masatake YAMATO]
   - add a helper macro, foreach_endpoint  [Masatake YAMATO]
   - add a new type "mqueue", a type for POSIX Mqueue  [Masatake YAMATO]
   - add a whitespace  [Masatake YAMATO]
   - add attach_xinfo and get_ipc_class methods to cdev_ops  [Masatake YAMATO]
   - add comment listing functions names importing via #include  [Masatake YAMATO]
   - add const modifier  [Thomas Weißschuh]
   - add flags, [-lL], representing file lock/lease states to XMODE column  [Masatake YAMATO]
   - add tmpfs as source of sysvipc to the nodev_table  [Masatake YAMATO]
   - add xstrfappend and xstrvfappend  [Masatake YAMATO]
   - adjust coding style  [Masatake YAMATO]
   - append SOCK.SHUTDOWN value to ENDPOINTS column of UNIX-STREAM sockets  [Masatake YAMATO]
   - assign a class to the file in new_file()  [Masatake YAMATO]
   - avoid passing NULL to qsort()  [Thomas Weißschuh]
   - avoid undefined behavior  [Thomas Weißschuh]
   - build lsfd even if kcmp.h is not available  [Masatake YAMATO]
   - cache the result of checking whether "XMODE" column is enabled or not  [Masatake YAMATO]
   - call xinfo backend method before calling socket generic method when filling columns  [Masatake YAMATO]
   - choose anon_ops declarative way  [Masatake YAMATO]
   - cleanup --list-columns  [Karel Zak]
   - collect the device number for mqueue fs in the initialization stage  [Masatake YAMATO]
   - delete redundant parentheses surrounding return value  [Masatake YAMATO]
   - don't capitalize the help strings for the columns  [Masatake YAMATO]
   - don't check the value returned from new_file()  [Masatake YAMATO]
   - don't list kernel threads unless --threads is given  [Masatake YAMATO]
   - fill ENDPOINTS column for eventfd  [Masatake YAMATO]
   - fill ENDPOINTS column for pty devices  [Masatake YAMATO]
   - fill ENDPOINTS column of POSIX Mqueue  [Masatake YAMATO]
   - fill ENDPOINTS column of unix socket using UNIX_DIAG_PEER information  [Masatake YAMATO]
   - fill NAME column of inotify files with the information about their monitoring targets  [Masatake YAMATO]
   - fix a misleading parameter name  [Masatake YAMATO]
   - fix a sentence in comment  [Masatake YAMATO]
   - fix memory leak in append_filter_expr()  [Karel Zak]
   - fix specifying wrong JSON types when building the help message  [Masatake YAMATO]
   - fix wrong inconsistency in extracting cwd and root associations  [Masatake YAMATO]
   - include common headers in lsfd.h  [Masatake YAMATO]
   - include system header files first  [Masatake YAMATO]
   - initialize pagesize in an earlier stage  [Masatake YAMATO]
   - initialize the ipc table before loading lists of unix socket peers via netlink diag  [Masatake YAMATO]
   - introduce -H, --list-columns option for making help messages short  [Masatake YAMATO]
   - introduce XMODE column, extensible variant of MODE  [Masatake YAMATO]
   - keep filter-only columns hidden  [Karel Zak]
   - make the order of calling finalize_* and initialize_* consistent  [Masatake YAMATO]
   - make the sock_xinfo layer be able to prepare an ipc_class for a given socket  [Masatake YAMATO]
   - make the way to read /proc/$pid/mountinfo robust  [Masatake YAMATO]
   - mark XMODE.m on fds monitored by poll(2) and ppoll(2)  [Masatake YAMATO]
   - mark XMODE.m on fds monitored by select(2) and pselect6(2)  [Masatake YAMATO]
   - move a local variable to a narrower scope  [Masatake YAMATO]
   - print file descriptors targeted by eventpoll files  [Masatake YAMATO]
   - print the detail of the timer associated with a timerfd  [Masatake YAMATO]
   - print the masks specified in signalfds  [Masatake YAMATO]
   - re-fill unix socket paths with sockdiag netlink interface  [Masatake YAMATO]
   - read /proc/$pid/ns/mnt earlier  [Masatake YAMATO]
   - rearrange the alignment of the help messages  [Masatake YAMATO]
   - show default columns in the help message  [Masatake YAMATO]
   - switch to c99-conformant alignment specification  [Thomas Weißschuh]
   - update the help message for XMODE column  [Masatake YAMATO]
   - use ARRAY_STRING and ARRAY_NUMBER json types in some columns  [Masatake YAMATO]
   - use SCOLS_JSON_FLOAT  [Karel Zak]
   - use \n as the separator in EVENTPOLL.TFDS column  [Masatake YAMATO]
   - use \n as the separator in INOTIFY.INODES and INOTIFY.INODES.RAW columns  [Masatake YAMATO]
   - use filter and counters from libsmartcols  [Karel Zak]
   - use helper functions in column-list-table.h  [Masatake YAMATO]
   - use scols_table_get_column_by_name  [Masatake YAMATO]
   - use the specified output stream for printing help messages  [Masatake YAMATO]
   - use xstrdup instead of xasprintf(..."%s")  [Masatake YAMATO]
   - utilize /proc/tty/drivers for filling SOURCE column of tty devices  [Masatake YAMATO]
   - write more about nsfs in comment  [Masatake YAMATO]
lsfd,test_mkfds:
   - (cosmetic) remove whitespaces between functions and their arglists  [Masatake YAMATO]
lsfd-filter:
   - constify filter logic  [Thomas Weißschuh]
lsfd.1.adoc:
   - document BPF.NAME column  [Masatake YAMATO]
   - fix a typo  [Masatake YAMATO]
   - fix typos  [Masatake YAMATO]
   - revise type names for columns  [Masatake YAMATO]
   - update for signalfds  [Masatake YAMATO]
   - write about timerfd  [Masatake YAMATO]
lsipc:
   - fix semaphore USED counter  [Karel Zak]
lslocks:
   - (fix) set JSON type for COL_SIZE even when --bytes is specified  [Masatake YAMATO]
   - (man) add missing fields  [Masatake YAMATO]
   - (man) document LEASE type  [Masatake YAMATO]
   - (man) update the note about OFDLCK  [Masatake YAMATO]
   - (preparation) add a fd number to the lock struct when loading lock info from /proc/$pid/fdinfo/$fd  [Masatake YAMATO]
   - (refactor) add a helper function returning JSON type for a given column  [Masatake YAMATO]
   - (refactor) lift up the code destroying the lock list for future extension  [Masatake YAMATO]
   - (refactor) make the data structure for storing lock information replaceable  [Masatake YAMATO]
   - (refactor) remove 'pid' global variable  [Masatake YAMATO]
   - (refactor) use a tree for storing lock information extracted from /proc/$pid/fdinfo/$fd  [Masatake YAMATO]
   - (test) add a case  [Masatake YAMATO]
   - (test) add a case for OFDLCK type locks  [Masatake YAMATO]
   - add -H option printing available columns  [Masatake YAMATO]
   - add HOLDERS column  [Masatake YAMATO]
   - add a missing "break;" in a switch/case statement  [Masatake YAMATO]
   - cleanup --list-columns  [Karel Zak]
   - don't attempt to open /proc/-1/fd/  [Jakub Wilk]
   - fix buffer overflow  [Karel Zak]
   - improve --list-columns  [Karel Zak]
   - refactor the code reading /proc/locks  [Masatake YAMATO]
   - rename functions for future extension  [Masatake YAMATO]
   - store list_add_tail when storing information extracted from /proc/$pid/fdinfo/$fd  [Masatake YAMATO]
   - use information extracted from "locks  " column of /proc/$pid/fdinfo/*  [Masatake YAMATO]
lslogins:
   - (man) fix -y option formatting  [Thomas Weißschuh]
   - fix realloc() loop allocation size  [Thomas Weißschuh]
m4:
   - update pkg.m4  [Thomas Weißschuh]
man:
   - Add enosys and lsclocks to po4a.cfg  [Mario Blättermann]
meson:
   - Only build blkzone and blkpr if the required linux header exists  [Jordan Williams]
   - Only build libmount python module if python was found  [Fabian Vogt]
   - add check for linux/mount.h  [Thomas Weißschuh]
   - add check for struct statx  [Thomas Weißschuh]
   - add conditionalization for test progs  [Zbigniew Jędrzejewski-Szmek]
   - add forgotten files to lists  [Zbigniew Jędrzejewski-Szmek]
   - add missing scols sample  [Karel Zak]
   - avoid future-deprecated feature  [Karel Zak]
   - avoid int operation with non-int  [Thomas Weißschuh]
   - build test_mount_optlist  [Thomas Weißschuh]
   - bump required version to 0.60.0  [Thomas Weißschuh]
   - check for HAVE_STRUCT_STATX_STX_MNT_ID  [Karel Zak]
   - check for _NL_TIME_WEEK_1STDAY in langinfo.h  [Christian Hesse]
   - conditionalize waitpid  [Zbigniew Jędrzejewski-Szmek]
   - create dedicated config for pam_lastlog2  [Thomas Weißschuh]
   - define _GNU_SOURCE when checking for SO_PASSCRED  [Thomas Weißschuh]
   - don't try to build test_ca without libcap-ng  [Thomas Weißschuh]
   - fix LIBBLKID_VERSION definition  [Karel Zak]
   - fix copy & paste error  [Karel Zak]
   - fix disablement check  [Zbigniew Jędrzejewski-Szmek]
   - fix mismatch with handling of lib_dl dependency  [Zbigniew Jędrzejewski-Szmek]
   - implement HAVE_PTY  [Zbigniew Jędrzejewski-Szmek]
   - include bash-completion for newgrp  [Christian Hesse]
   - include bash-completion for write  [Christian Hesse]
   - install chfn setuid  [Christian Hesse]
   - install chsh setuid  [Christian Hesse]
   - install lastlog2.h library header file  [Karel Zak]
   - install mount setuid  [Christian Hesse]
   - install newgrp setuid  [Christian Hesse]
   - install su setuid  [Christian Hesse]
   - install symlink for vigr man page  [Christian Hesse]
   - install umount setuid  [Christian Hesse]
   - install wall executable with group 'tty'  [Christian Hesse]
   - install wall setgid  [Christian Hesse]
   - install write executable with group 'tty'  [Christian Hesse]
   - install write setgid  [Christian Hesse]
   - only build liblastlog when enabled  [Thomas Weißschuh]
   - properly handle gettext non-existence  [Thomas Weißschuh]
   - remove scols filter sample  [Karel Zak]
   - require 0.57  [Thomas Weißschuh]
   - run compiler checks with -D_GNU_SOURCE when necessary  [Thomas Weißschuh]
   - run tests if with option program-tests  [sewn]
   - try to always use 64bit time support on glibc  [Thomas Weißschuh]
   - update  for libsmartcols filter  [Karel Zak]
   - use a dependency object for liblastlog2  [Thomas Weißschuh]
   - use a dependency object for sqlite3  [Thomas Weißschuh]
   - use bison --defines=HEADER  [Karel Zak]
   - use meson features instead of bash  [sewn]
misc:
   - constify some fields  [Thomas Weißschuh]
mkfs.minix:
   - handle 64bit time on 32bit system  [Thomas Weißschuh]
mkswap:
   - (tests) don't overwrite logfiles  [Thomas Weißschuh]
   - (tests) validate existence of truncate command  [Thomas Weißschuh]
   - implement --file  [Vicki Pfau]
   - implement --offset  [Thomas Weißschuh]
more:
   - avoid out-of-bound access  [Thomas Weißschuh]
   - exit if POLLERR and POLLHUP on stdin is received  [Goldwyn Rodrigues]
   - exit if POLLHUP or POLLERR on stdin is received  [Goldwyn Rodrigues]
   - fix poll() use  [Karel Zak]
   - remove second check for EOF (#2714)  [Thorsten Kukuk]
   - remove usage of alloca()  [Thomas Weißschuh]
mount:
   - (tests) don't create /dev/nul  [Thomas Weißschuh]
   - (tests) explicitly use test fstab location  [Thomas Weißschuh]
   - (tests) reuse well-known per-test fstab location  [Thomas Weißschuh]
   - (tests) test mount helper with multiple filesystems  [Thomas Weißschuh]
   - Fix markup and typos in man page  [Mario Blättermann]
   - add --map-users and --map-groups convenience options  [Chris Webb]
   - improve code readability  [Karel Zak]
nsenter:
   - (man) add --keep-caps  [Karel Zak]
   - add missing free()  [Karel Zak]
   - add option `-c` to join the cgroup of target process  [u2386]
   - avoid NULL pointer dereference [coverity scan]  [Karel Zak]
   - fix possible NULL dereferece [coverity scan]  [Karel Zak]
pg:
   - use snprintf to build string  [Thomas Weißschuh]
pipesz:
   - avoid dead store  [Thomas Weißschuh]
po:
   - add ro.po (from translationproject.org)  [Remus-Gabriel Chelu]
   - merge changes  [Karel Zak]
   - update cs.po (from translationproject.org)  [Petr Písař]
   - update de.po (from translationproject.org)  [Hermann Beckers]
   - update de.po (from translationproject.org)  [Mario Blättermann]
   - update es.po (from translationproject.org)  [Antonio Ceballos Roa]
   - update fr.po (from translationproject.org)  [Frédéric Marchal]
   - update hr.po (from translationproject.org)  [Božidar Putanec]
   - update ja.po (from translationproject.org)  [Takeshi Hamasaki]
   - update ko.po (from translationproject.org)  [Seong-ho Cho]
   - update pl.po (from translationproject.org)  [Jakub Bogusz]
   - update ro.po (from translationproject.org)  [Remus-Gabriel Chelu]
   - update sr.po (from translationproject.org)  [Мирослав Николић]
   - update tr.po (from translationproject.org)  [Emir SARI]
   - update uk.po (from translationproject.org)  [Yuri Chornoivan]
po-man:
   - add ko.po (from translationproject.org)  [Seong-ho Cho]
   - add ro.po (from translationproject.org)  [Remus-Gabriel Chelu]
   - merge changes  [Karel Zak]
   - update de.po (from translationproject.org)  [Mario Blättermann]
   - update fr.po (from translationproject.org)  [Frédéric Marchal]
   - update ro.po (from translationproject.org)  [Remus-Gabriel Chelu]
   - update sr.po (from translationproject.org)  [Мирослав Николић]
   - update uk.po (from translationproject.org)  [Yuri Chornoivan]
prlimit:
   - (man) fix formatting  [Jakub Wilk]
   - reject trailing junk in limits without " "  [Jakub Wilk]
procfs:
   - add a helper function to access /proc/$pid/syscall  [Masatake YAMATO]
readprofile:
   - use xasprintf to build string  [Thomas Weißschuh]
rename:
   - properly handle directories with trailing slash  [Thomas Weißschuh]
rev:
   - Check for wchar conversion errors  [Tim Hallmann]
runuser.1.adoc:
   - Move -m|-p|--preserve-environment in order  [Sebastian Pipping]
runuser|su:
   - Start supporting option -T|--no-pty  [Sebastian Pipping]
script-playutils:
   - close filestream in case ignore_line() fails  [Thomas Weißschuh]
scriptreplay:
   - support ctrl+s and ctrl+g  [Karel Zak]
setarch:
   - add PER_LINUX_FDPIC fallback  [Karel Zak]
   - add riscv64/riscv32 support  [Michal Biesek]
setpriv:
   - add landlock support  [Thomas Weißschuh]
   - apply landlock without configuration  [Thomas Weißschuh]
   - fix group argument completion  [Thomas Weißschuh]
setterm:
   - Document behavior of redirection  [Stanislav Brabec]
   - avoid restoring flags from uninitialized memory  [Chris Hofstaedtler]
sfdisk:
   - Fix markup in man page  [Mario Blättermann]
   - add hint about duplicate UUIDs when use dump  [Karel Zak]
sha1:
   - properly wipe variables  [Thomas Weißschuh]
strv:
   - make strv_new_api static  [Thomas Weißschuh]
su:
   - (man) add hint about sessions  [Karel Zak]
   - (man) improve formatting  [Karel Zak]
   - fix use after free in run_shell  [Tanish Yadav]
su, agetty:
   - don't use program_invocation_short_name for openlog()  [Karel Zak]
sulogin:
   - relabel terminal according to SELinux policy  [Christian Göttsche]
   - use get_terminal_default_type()  [Karel Zak]
swapon:
   - (man) fix --priority description  [Karel Zak]
   - (tests) abort test on failing commands  [Thomas Weißschuh]
sys-utils:
   - cleanup license lines, add SPDX  [Karel Zak]
   - fix SELinux context example in mount.8  [Todd Zullinger]
   - hwclock-rtc: fix pointer usage  [Karthikeyan Krishnasamy]
sys-utils/lscpu:
   - Unblock SIGSEGV before vmware_bdoor  [WanBingjiang]
   - Use ul_path_scanf where possible  [Tobias Stoeckmann]
term-utils:
   - fix indentation  [Karel Zak]
test:
   - (lsfd  column-xmode)  add missing "wait" invocation  [Masatake YAMATO]
   - (lsfd)  add a case for l and L flags in XMODE column  [Masatake YAMATO]
   - (lsfd) add a case for testing BPF-MAP.TYPE and BPF-MAP.TYPE.RAW columns  [Masatake YAMATO]
   - (lsfd) add a case for testing BPF-PROG.TYPE and BPF-PROG.TYPE.RAW columns  [Masatake YAMATO]
   - (lsfd) add a case for testing DELETED column  [Masatake YAMATO]
   - (lsfd) add a subcase for testing NAME column for a deleted file  [Masatake YAMATO]
   - (mkfds  bpf-map) new factory  [Masatake YAMATO]
   - (mkfds  bpf-prog) new factory  [Masatake YAMATO]
   - (mkfds  make-regular-file) add a parameter for file locking  [Masatake YAMATO]
   - (mkfds  make-regular-file) add a parameter for making the new file readable  [Masatake YAMATO]
   - (mkfds  make-regular-file) add a parameter for writing some bytes  [Masatake YAMATO]
   - (mkfds  make-regular-file) delete the created file when an error occurs  [Masatake YAMATO]
   - (mkfds  make-regular-file) make 'fd' local variable reusable  [Masatake YAMATO]
   - (mkfds  ro-regular-file) add a parameter for a read lease  [Masatake YAMATO]
   - (mkfds) add "make-regular-file" factory  [Masatake YAMATO]
test_enosys:
   - fix build on old kernels  [Thomas Weißschuh]
test_mkfds:
   - avoid "ignoring return value of ‘write’ declared with attribute ‘warn_unused_result’"  [Masatake YAMATO]
test_uuidd:
   - make pthread_t formatting more robust  [Thomas Weißschuh]
tests:
   - (cosmetic,lslocks) trim whitespaces at the end of line  [Masatake YAMATO]
   - (functions.sh) create variable for test fstab location  [Thomas Weißschuh]
   - (functions.sh) use per-test fstab file  [Thomas Weißschuh]
   - (lsfd  column-xmode) do rm -f the file for testing before making it  [Masatake YAMATO]
   - (lsfd  column-xmode) ignore "rwx" mappings  [Masatake YAMATO]
   - (lsfd  column-xmode) skip some subtests if OFD locks are not available  [Masatake YAMATO]
   - (lsfd  filter-floating-point-nums) use --raw output to make the case more robust  [Masatake YAMATO]
   - (lsfd  mkfds-*) alter the L4 ports for avoiding the conflict with option-inet test case  [Masatake YAMATO]
   - (lsfd  mkfds-bpf-map) chmod a+x  [Masatake YAMATO]
   - (lsfd  mkfds-inotify) consider environments not having / as a mount point  [Masatake YAMATO]
   - (lsfd  mkfds-inotify) use findmnt(1) instead of stat(1) to get bdev numbers  [Masatake YAMATO]
   - (lsfd  mkfds-socketpair) make a case for testing DGRAM a subtest and add a subtest for STREAM  [Masatake YAMATO]
   - (lsfd  mkfds-unix-dgram) don't depend on the number of whitespaces in the output  [Masatake YAMATO]
   - (lsfd  option-inet) get child-processes' pids via fifo  [Masatake YAMATO]
   - (lsfd) add a case for testing ENDPOINTS column of UNIX-STREAM sockets  [Masatake YAMATO]
   - (lsfd) add a case for testing EVENTPOLL.TFDS column  [Masatake YAMATO]
   - (lsfd) add a case for testing INOTIFY.INODES.RAW column  [Masatake YAMATO]
   - (lsfd) add a case for testing SOCK.SHUTDOWN column  [Masatake YAMATO]
   - (lsfd) add a case for testing SOURCE column for SysV shmem mappings  [Masatake YAMATO]
   - (lsfd) add a case for testing signalfd related columns  [Masatake YAMATO]
   - (lsfd) add a case for testing timerfd related columns  [Masatake YAMATO]
   - (lsfd) add a case for verifying ENDPOINTS column output in JSON mode  [Masatake YAMATO]
   - (lsfd) add a case testing 'm' flag in XMODE column  [Masatake YAMATO]
   - (lsfd) add a case testing INOTIFY.INODES.RAW column on btrfs  [Masatake YAMATO]
   - (lsfd) add a case testing NAME, SOURCE, ENDPOINTS, and PTMX.TTY-INDEX columns of pts fds  [Masatake YAMATO]
   - (lsfd) add a case testing TUN.IFACE column  [Masatake YAMATO]
   - (lsfd) add a case testing XMODE.m for classical syscalls for multiplexing  [Masatake YAMATO]
   - (lsfd) add cases for POSIX Mqueue  [Masatake YAMATO]
   - (lsfd) add cases for eventfd  [Masatake YAMATO]
   - (lsfd) add lsfd_check_mkfds_factory as a help function  [Masatake YAMATO]
   - (lsfd) avoid race conditions (part 1)  [Masatake YAMATO]
   - (lsfd) don't run the unix-stream testcase including newlines in the path on qemu-user  [Masatake YAMATO]
   - (lsfd) extend the cases for testing BPF.NAME column  [Masatake YAMATO]
   - (lsfd) extend the mkfds-socketpair case to test ENDPOINTS with SOCK.SHUTDOWN info  [Masatake YAMATO]
   - (lsfd) fix typos in an error name  [Masatake YAMATO]
   - (lsfd) show the entry for mqueue in /proc/self/mountinfo  [Masatake YAMATO]
   - (lsfd) skip mkfds-netns if SIOCGSKNS is not defined  [Masatake YAMATO]
   - (lsfd) skip some cases if NETLINK_SOCK_DIAG for AF_UNIX is not available  [Masatake YAMATO]
   - (lsfd-functions.bash,cosmetic) unify the style to define functions  [Masatake YAMATO]
   - (lsfd/filter) add a case for comparing floating point numbers  [Masatake YAMATO]
   - (lslocks) insert a sleep between taking a lock and running lslocks  [Masatake YAMATO]
   - (lslocks) add cases testing HOLDERS column  [Masatake YAMATO]
   - (lslocks) add missing ts_finalize call  [Masatake YAMATO]
   - (mkfds) add / and /etc/fstab as the monitoring targets to inotify  [Masatake YAMATO]
   - (mkfds) add a factor for opening tun device  [Masatake YAMATO]
   - (mkfds) add a factory to make SysV shmem  [Masatake YAMATO]
   - (mkfds) add a factory to make a signalfd  [Masatake YAMATO]
   - (mkfds) add a factory to make a timerfd  [Masatake YAMATO]
   - (mkfds) add a factory to make an eventpoll fd  [Masatake YAMATO]
   - (mkfds) add eventfd factory  [Masatake YAMATO]
   - (mkfds) add mqueue factory  [Masatake YAMATO]
   - (mkfds) print a whitespace only when the running factory has "report" method  [Masatake YAMATO]
   - (mkfds) provide the way to declare the number of extra printing values  [Masatake YAMATO]
   - (refactor (test_mkfds, lsfd)) use TS_EXIT_NOTSUPP instead of EXIT_ENOSYS  [Masatake YAMATO]
   - (run.sh) detect builddir from working directory  [Thomas Weißschuh]
   - (test_mkfds  inotify) add "dir" and "file" parameters  [Masatake YAMATO]
   - (test_mkfds  make-regular-file) add a new parameter, "dupfd"  [Masatake YAMATO]
   - (test_mkfds  mkfds-multiplexing) dump /proc/$pid/syscall for debugging  [Masatake YAMATO]
   - (test_mkfds  mkfds-multiplexing) make the output of ts_skip_subtest visible  [Masatake YAMATO]
   - (test_mkfds  pty) add a new factory  [Masatake YAMATO]
   - (test_mkfds  sockdiag) new factory  [Masatake YAMATO]
   - (test_mkfds  socketpair) add "halfclose" parameter  [Masatake YAMATO]
   - (test_mkfds  {bpf-prog,bpf-map}) fix memory leaks  [Masatake YAMATO]
   - (test_mkfds) add --is-available option  [Masatake YAMATO]
   - (test_mkfds) add a new factory "multiplexing"  [Masatake YAMATO]
   - (test_mkfds) add missing PARAM_END marker  [Masatake YAMATO]
   - (test_mkfds) add poll multiplexer  [Masatake YAMATO]
   - (test_mkfds) add ppoll multiplexer  [Masatake YAMATO]
   - (test_mkfds) add pselect6 and select multiplexers  [Masatake YAMATO]
   - (test_mkfds) allow to add factory-made fds to the multiplexer as event source  [Masatake YAMATO]
   - (test_mkfds) include locale headers first to define _GNU_SOURCE  [Masatake YAMATO]
   - (test_mkfds) initialize a proper union member  [Masatake YAMATO]
   - (test_mkfds) monitor stdin by default  [Masatake YAMATO]
   - (test_mkfds) revise the usage of " __attribute__((__unused__))"  [Masatake YAMATO]
   - (test_mkfds) use SYS_bpf instead of __NR_bpf  [Masatake YAMATO]
   - (test_mkfds) use err() when a system call fails  [Masatake YAMATO]
   - (test_mkfds, refactor) make the function for waiting events plugable  [Masatake YAMATO]
   - add libsmartcols filter tests  [Karel Zak]
   - add missing file and improve options-missing test  [Karel Zak]
   - add omitted files  [Karel Zak]
   - add optlist tests  [Karel Zak]
   - add sysinfo to show sizeof(time_t)  [Thomas Weißschuh]
   - add ts_skip_capability  [Masatake YAMATO]
   - add ts_skip_docker  [Thomas Weißschuh]
   - add user and user=name mount test  [Karel Zak]
   - constify a sysinfo helpers struct  [Thomas Weißschuh]
   - don't keep bison messages in tests  [Karel Zak]
   - fix capability testing  [Thomas Weißschuh]
   - fix memory leak in scols fromfile  [Karel Zak]
   - fix subtests containing spaces in their name  [Thomas Weißschuh]
   - increase delay for waitpid test  [Goldwyn Rodrigues]
   - make mount/special more robust  [Karel Zak]
   - make ts_skip_capability accept the output of older versions of getpcaps  [Masatake YAMATO]
   - skip broken tests on docker  [Thomas Weißschuh]
   - update build tests  [Karel Zak]
   - update dmesg deltas  [Karel Zak]
   - update lsfd broken filter test  [Karel Zak]
   - use array keys in more robust way  [Karel Zak]
   - use scols_column_set_properties() in 'fromfile' sample  [Karel Zak]
tests,autotools:
   - add TESTS_COMPONENTS macro for specifying test components from make cmdline  [Masatake YAMATO]
timeutils:
   - add an inline function, is_timespecset()  [Masatake YAMATO]
   - add strtimespec_relative  [Thomas Weißschuh]
tmpfiles:
   - add and install for uuidd, generate /run/uuidd & /var/lib/libuuid  [Christian Hesse]
   - depend on systemd...  [Christian Hesse]
tools:
   - (asciidoctor) explicitly require extensions module  [Thomas Weißschuh]
tools/all_syscalls:
   - use pipefail  [sewn]
   - use sh and replace awk with grep & sed  [sewn]
treewide:
   - explicitly mark unused arguments  [Thomas Weißschuh]
   - fix newlines when using fputs  [Thomas Weißschuh]
   - use (x)reallocarray() when applicable  [Thomas Weißschuh]
   - use reallocarray to allocate memory that will be reallocated  [Thomas Weißschuh]
ttyutils:
   - improve get_terminal_default_type() code  [Karel Zak]
uclampset:
   - Remove validation logic  [Qais Yousef]
   - doc: Add a reference to latest kernel documentation  [Qais Yousef]
umount:
   - handle bindmounts during --recursive  [Thomas Weißschuh]
unshare:
   - Add --map-users=all and --map-groups=all  [Chris Webb]
   - Move implementation of --keep-caps option to library function  [David Gibson]
   - Set uid and gid maps directly when run as root  [Chris Webb]
   - Support multiple ID ranges for user and group maps  [Chris Webb]
   - allow negative time offsets  [Thomas Weißschuh]
   - don't try to reset the disposition of SIGKILL  [Chris Webb]
   - fix error message for unexpected time offsets  [Thomas Weißschuh]
   - make sure map_range.next is initialized [coverity scan]  [Karel Zak]
utmpdump:
   - validate subsecond granularity  [Thomas Weißschuh]
uuidd:
   - Fix markup in man page (uuidd.8)  [Mario Blättermann]
   - add cont_clock persistence  [Michael Trapp]
   - enable cont-clock in service file  [Karel Zak]
   - improve man page for --cont-clock  [Karel Zak]
uuidd.rc:
   - create localstatedir in init script  [Christian Hesse]
uuidgen:
   - add option --count  [Karel Zak]
   - mark some options mutually exclusive  [Karel Zak]
   - use xmalloc instead of malloc (#2714)  [Thorsten Kukuk]
verity:
   - modernize example in manpage  [Luca Boccassi]
   - use <roothash>-verity as the device mapper name instead of libmnt_<image>  [Luca Boccassi]
waitpid:
   - only build when pidfd_open is available  [Thomas Weißschuh]
   - warn of "exited" only when --verbose is given  [Masatake YAMATO]
wall:
   - do not error for ttys that do not exist  [Mike Gilbert]
   - fix calloc call [-Werror=calloc-transposed-args]  [Karel Zak]
   - fix escape sequence injection [CVE-2024-28085]  [Karel Zak]
   - query logind for list of users with tty (#2088)  [Thorsten Kukuk]
wdctl:
   - properly test timeout conditions  [Thomas Weißschuh]
   - use only sysfs if sufficient  [Thomas Weißschuh]
wipefs:
   - (man) fix typos  [codefiles]
   - (tests) add test for all detected signatures  [Thomas Weißschuh]
   - (tests) remove necessity of root permissions  [Thomas Weißschuh]
   - allow storage of backups in specific location  [Thomas Weißschuh]
write:
   - Add missing section header in man page  [Mario Blättermann]
   - query logind for list of users with tty (#2088)  [Thorsten Kukuk]
xalloc.h:
   - add new functions  xstrappend, xstrputc, xstrvfappend, and xstrfappend  [Masatake YAMATO]
zramctl:
   - add hint about supported algorithms  [Karel Zak]

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com



* [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices
  2024-03-27 13:10  2% [PATCH v13 00/10] Landlock: IOCTL support Günther Noack
@ 2024-03-27 13:10  6% ` Günther Noack
  2024-03-27 16:57  0%   ` Mickaël Salaün
  0 siblings, 1 reply; 200+ results
From: Günther Noack @ 2024-03-27 13:10 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Jeff Xu, Arnd Bergmann, Jorge Lucangeli Obes, Allen Webb,
	Dmitry Torokhov, Paul Moore, Konstantin Meskhidze,
	Matt Bobrowski, linux-fsdevel, Günther Noack,
	Christian Brauner

Introduces the LANDLOCK_ACCESS_FS_IOCTL_DEV right
and increments the Landlock ABI version to 5.

This access right applies to device-custom IOCTL commands
when they are invoked on block or character device files.

Like the truncate right, this right is associated with a file
descriptor at the time of open(2), and gets respected even when the
file descriptor is used outside of the thread which it was originally
opened in.

Therefore, a newly enabled Landlock policy does not apply to file
descriptors which are already open.

If the LANDLOCK_ACCESS_FS_IOCTL_DEV right is handled, only a small
number of safe IOCTL commands will be permitted on newly opened device
files.  These include FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC, as well
as other IOCTL commands for regular files which are implemented in
fs/ioctl.c.

Noteworthy scenarios which require special attention:

TTY devices are often passed into a process from the parent process,
and so a newly enabled Landlock policy does not retroactively apply to
them automatically.  In the past, TTY devices have often supported
IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which let
callers control the TTY input buffer (and simulate keypresses).  On
modern kernels, though, these commands should be restricted to
CAP_SYS_ADMIN programs.

Known limitations:

The LANDLOCK_ACCESS_FS_IOCTL_DEV access right is a coarse-grained
control over IOCTL commands.

Landlock users may use path-based restrictions in combination with
their knowledge about the file system layout to control what IOCTLs
can be done.

Cc: Paul Moore <paul@paul-moore.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Günther Noack <gnoack@google.com>
---
 include/uapi/linux/landlock.h                |  33 +++-
 security/landlock/fs.c                       | 183 ++++++++++++++++++-
 security/landlock/limits.h                   |   2 +-
 security/landlock/syscalls.c                 |   8 +-
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   |   5 +-
 6 files changed, 216 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
index 25c8d7677539..5d90e9799eb5 100644
--- a/include/uapi/linux/landlock.h
+++ b/include/uapi/linux/landlock.h
@@ -128,7 +128,7 @@ struct landlock_net_port_attr {
  * files and directories.  Files or directories opened before the sandboxing
  * are not subject to these restrictions.
  *
- * A file can only receive these access rights:
+ * The following access rights apply only to files:
  *
  * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
  * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
@@ -138,12 +138,13 @@ struct landlock_net_port_attr {
  * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
  * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
  *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
- *   ``O_TRUNC``. Whether an opened file can be truncated with
- *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
- *   same way as read and write permissions are checked during
- *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
- *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
- *   third version of the Landlock ABI.
+ *   ``O_TRUNC``.  This access right is available since the third version of the
+ *   Landlock ABI.
+ *
+ * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
+ * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
+ * read and write permissions are checked during :manpage:`open(2)` using
+ * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
  *
  * A directory can receive access rights related to files or directories.  The
  * following access right is applied to the directory itself, and the
@@ -198,13 +199,28 @@ struct landlock_net_port_attr {
  *   If multiple requirements are not met, the ``EACCES`` error code takes
  *   precedence over ``EXDEV``.
  *
+ * The following access right applies both to files and directories:
+ *
+ * - %LANDLOCK_ACCESS_FS_IOCTL_DEV: Invoke :manpage:`ioctl(2)` commands on an opened
+ *   character or block device.
+ *
+ *   This access right applies to all `ioctl(2)` commands implemented by device
+ *   drivers.  However, the following common IOCTL commands continue to be
+ *   invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL_DEV right:
+ *
+ *   ``FIOCLEX``, ``FIONCLEX``, ``FIONBIO``, ``FIOASYNC``, ``FIFREEZE``,
+ *   ``FITHAW``, ``FIGETBSZ``, ``FS_IOC_GETFSUUID``, ``FS_IOC_GETFSSYSFSPATH``
+ *
+ *   This access right is available since the fifth version of the Landlock
+ *   ABI.
+ *
  * .. warning::
  *
  *   It is currently not possible to restrict some file-related actions
  *   accessible through these syscall families: :manpage:`chdir(2)`,
  *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
  *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
- *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
+ *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
  *   Future Landlock evolutions will enable to restrict them.
  */
 /* clang-format off */
@@ -223,6 +239,7 @@ struct landlock_net_port_attr {
 #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
 #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
 #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
+#define LANDLOCK_ACCESS_FS_IOCTL_DEV			(1ULL << 15)
 /* clang-format on */
 
 /**
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index c15559432d3d..2ef6c57fa20b 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -7,6 +7,7 @@
  * Copyright © 2021-2022 Microsoft Corporation
  */
 
+#include <asm/ioctls.h>
 #include <kunit/test.h>
 #include <linux/atomic.h>
 #include <linux/bitops.h>
@@ -14,6 +15,7 @@
 #include <linux/compiler_types.h>
 #include <linux/dcache.h>
 #include <linux/err.h>
+#include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -29,6 +31,7 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/workqueue.h>
+#include <uapi/linux/fiemap.h>
 #include <uapi/linux/landlock.h>
 
 #include "common.h"
@@ -84,6 +87,141 @@ static const struct landlock_object_underops landlock_fs_underops = {
 	.release = release_inode
 };
 
+/* IOCTL helpers */
+
+/**
+ * get_required_ioctl_dev_access(): Determine required access rights for IOCTLs
+ * on device files.
+ *
+ * @cmd: The IOCTL command that is supposed to be run.
+ *
+ * By default, any IOCTL on a device file requires the
+ * LANDLOCK_ACCESS_FS_IOCTL_DEV right.  We make exceptions for commands, if:
+ *
+ * 1. The command is implemented in fs/ioctl.c's do_vfs_ioctl(),
+ *    not in f_ops->unlocked_ioctl() or f_ops->compat_ioctl().
+ *
+ * 2. The command can be reasonably used on a device file at all.
+ *
+ * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
+ * should be considered for inclusion here.
+ *
+ * Returns: The access rights that must be granted on an opened file in order to
+ * use the given @cmd.
+ */
+static __attribute_const__ access_mask_t
+get_required_ioctl_dev_access(const unsigned int cmd)
+{
+	switch (cmd) {
+	case FIOCLEX:
+	case FIONCLEX:
+	case FIONBIO:
+	case FIOASYNC:
+		/*
+		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
+		 * close-on-exec and the file's buffered-IO and async flags.
+		 * These operations are also available through fcntl(2), and are
+		 * unconditionally permitted in Landlock.
+		 */
+		return 0;
+	case FIOQSIZE:
+		/*
+		 * FIOQSIZE queries the size of a regular file or directory.
+		 *
+		 * This IOCTL command only applies to regular files and
+		 * directories.
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
+	case FIFREEZE:
+	case FITHAW:
+		/*
+		 * FIFREEZE and FITHAW freeze and thaw the file system which the
+		 * given file belongs to.  Requires CAP_SYS_ADMIN.
+		 *
+		 * These commands operate on the file system's superblock rather
+		 * than on the file itself.  The same operations can also be
+		 * done through any other file or directory on the same file
+		 * system, so it is safe to permit these.
+		 */
+		return 0;
+	case FS_IOC_FIEMAP:
+		/*
+		 * FS_IOC_FIEMAP queries information about the allocation of
+		 * blocks within a file.
+		 *
+		 * This IOCTL command only applies to regular files.
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
+	case FIGETBSZ:
+		/*
+		 * FIGETBSZ queries the file system's block size for a file or
+		 * directory.
+		 *
+		 * This command operates on the file system's superblock rather
+		 * than on the file itself.  The same operation can also be done
+		 * through any other file or directory on the same file system,
+		 * so it is safe to permit it.
+		 */
+		return 0;
+	case FICLONE:
+	case FICLONERANGE:
+	case FIDEDUPERANGE:
+		/*
+		 * FICLONE, FICLONERANGE and FIDEDUPERANGE make files share
+		 * their underlying storage ("reflink") between source and
+		 * destination FDs, on file systems which support that.
+		 *
+		 * These IOCTL commands only apply to regular files.
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
+	case FIONREAD:
+		/*
+		 * FIONREAD returns the number of bytes available for reading.
+		 *
+		 * We require LANDLOCK_ACCESS_FS_IOCTL_DEV for FIONREAD, because
+		 * devices implement it in f_ops->unlocked_ioctl().  The
+		 * implementations of this operation have varying quality and
+		 * complexity, so it is hard to reason about what they do.
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
+	case FS_IOC_GETFLAGS:
+	case FS_IOC_SETFLAGS:
+	case FS_IOC_FSGETXATTR:
+	case FS_IOC_FSSETXATTR:
+		/*
+		 * FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_FSGETXATTR and
+		 * FS_IOC_FSSETXATTR do not apply for devices.
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
+	case FS_IOC_GETFSUUID:
+	case FS_IOC_GETFSSYSFSPATH:
+		/*
+		 * FS_IOC_GETFSUUID and FS_IOC_GETFSSYSFSPATH both operate on
+		 * the file system superblock, not on the specific file, so
+		 * these operations are available through any other file on the
+		 * same file system as well.
+		 */
+		return 0;
+	case FIBMAP:
+	case FS_IOC_RESVSP:
+	case FS_IOC_RESVSP64:
+	case FS_IOC_UNRESVSP:
+	case FS_IOC_UNRESVSP64:
+	case FS_IOC_ZERO_RANGE:
+		/*
+		 * FIBMAP, FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP,
+		 * FS_IOC_UNRESVSP64 and FS_IOC_ZERO_RANGE only apply to regular
+		 * files (as implemented in file_ioctl()).
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
+	default:
+		/*
+		 * Other commands are guarded by the catch-all access right.
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL_DEV;
+	}
+}
+
 /* Ruleset management */
 
 static struct landlock_object *get_inode_object(struct inode *const inode)
@@ -148,7 +286,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
 	LANDLOCK_ACCESS_FS_EXECUTE | \
 	LANDLOCK_ACCESS_FS_WRITE_FILE | \
 	LANDLOCK_ACCESS_FS_READ_FILE | \
-	LANDLOCK_ACCESS_FS_TRUNCATE)
+	LANDLOCK_ACCESS_FS_TRUNCATE | \
+	LANDLOCK_ACCESS_FS_IOCTL_DEV)
 /* clang-format on */
 
 /*
@@ -1335,8 +1474,10 @@ static int hook_file_alloc_security(struct file *const file)
 static int hook_file_open(struct file *const file)
 {
 	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
-	access_mask_t open_access_request, full_access_request, allowed_access;
-	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
+	access_mask_t open_access_request, full_access_request, allowed_access,
+		optional_access;
+	const struct inode *inode = file_inode(file);
+	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
 	const struct landlock_ruleset *const dom =
 		get_fs_domain(landlock_cred(file->f_cred)->domain);
 
@@ -1354,6 +1495,10 @@ static int hook_file_open(struct file *const file)
 	 * We look up more access than what we immediately need for open(), so
 	 * that we can later authorize operations on opened files.
 	 */
+	optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
+	if (is_device)
+		optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
+
 	full_access_request = open_access_request | optional_access;
 
 	if (is_access_to_paths_allowed(
@@ -1410,6 +1555,36 @@ static int hook_file_truncate(struct file *const file)
 	return -EACCES;
 }
 
+static int hook_file_ioctl(struct file *file, unsigned int cmd,
+			   unsigned long arg)
+{
+	const struct inode *inode = file_inode(file);
+	const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
+	access_mask_t required_access, allowed_access;
+
+	if (!is_device)
+		return 0;
+
+	/*
+	 * It is the access rights at the time of opening the file which
+	 * determine whether IOCTL can be used on the opened file later.
+	 *
+	 * The access right is attached to the opened file in hook_file_open().
+	 */
+	required_access = get_required_ioctl_dev_access(cmd);
+	allowed_access = landlock_file(file)->allowed_access;
+	if ((allowed_access & required_access) == required_access)
+		return 0;
+
+	return -EACCES;
+}
+
+static int hook_file_ioctl_compat(struct file *file, unsigned int cmd,
+				  unsigned long arg)
+{
+	return hook_file_ioctl(file, cmd, arg);
+}
+
 static struct security_hook_list landlock_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(inode_free_security, hook_inode_free_security),
 
@@ -1432,6 +1607,8 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(file_alloc_security, hook_file_alloc_security),
 	LSM_HOOK_INIT(file_open, hook_file_open),
 	LSM_HOOK_INIT(file_truncate, hook_file_truncate),
+	LSM_HOOK_INIT(file_ioctl, hook_file_ioctl),
+	LSM_HOOK_INIT(file_ioctl_compat, hook_file_ioctl_compat),
 };
 
 __init void landlock_add_fs_hooks(void)
diff --git a/security/landlock/limits.h b/security/landlock/limits.h
index 93c9c6f91556..20fdb5ff3514 100644
--- a/security/landlock/limits.h
+++ b/security/landlock/limits.h
@@ -18,7 +18,7 @@
 #define LANDLOCK_MAX_NUM_LAYERS		16
 #define LANDLOCK_MAX_NUM_RULES		U32_MAX
 
-#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_TRUNCATE
+#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_IOCTL_DEV
 #define LANDLOCK_MASK_ACCESS_FS		((LANDLOCK_LAST_ACCESS_FS << 1) - 1)
 #define LANDLOCK_NUM_ACCESS_FS		__const_hweight64(LANDLOCK_MASK_ACCESS_FS)
 #define LANDLOCK_SHIFT_ACCESS_FS	0
diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
index 6788e73b6681..9ae3dfa47443 100644
--- a/security/landlock/syscalls.c
+++ b/security/landlock/syscalls.c
@@ -149,7 +149,7 @@ static const struct file_operations ruleset_fops = {
 	.write = fop_dummy_write,
 };
 
-#define LANDLOCK_ABI_VERSION 4
+#define LANDLOCK_ABI_VERSION 5
 
 /**
  * sys_landlock_create_ruleset - Create a new ruleset
@@ -321,7 +321,11 @@ static int add_rule_path_beneath(struct landlock_ruleset *const ruleset,
 	if (!path_beneath_attr.allowed_access)
 		return -ENOMSG;
 
-	/* Checks that allowed_access matches the @ruleset constraints. */
+	/*
+	 * Checks that allowed_access matches the @ruleset constraints and only
+	 * consists of publicly visible access rights (as opposed to synthetic
+	 * ones).
+	 */
 	mask = landlock_get_raw_fs_access_mask(ruleset, 0);
 	if ((path_beneath_attr.allowed_access | mask) != mask)
 		return -EINVAL;
diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
index a6f89aaea77d..3c1e9f35b531 100644
--- a/tools/testing/selftests/landlock/base_test.c
+++ b/tools/testing/selftests/landlock/base_test.c
@@ -75,7 +75,7 @@ TEST(abi_version)
 	const struct landlock_ruleset_attr ruleset_attr = {
 		.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
 	};
-	ASSERT_EQ(4, landlock_create_ruleset(NULL, 0,
+	ASSERT_EQ(5, landlock_create_ruleset(NULL, 0,
 					     LANDLOCK_CREATE_RULESET_VERSION));
 
 	ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0,
diff --git a/tools/testing/selftests/landlock/fs_test.c b/tools/testing/selftests/landlock/fs_test.c
index 9a6036fbf289..418ad745a5dd 100644
--- a/tools/testing/selftests/landlock/fs_test.c
+++ b/tools/testing/selftests/landlock/fs_test.c
@@ -529,9 +529,10 @@ TEST_F_FORK(layout1, inval)
 	LANDLOCK_ACCESS_FS_EXECUTE | \
 	LANDLOCK_ACCESS_FS_WRITE_FILE | \
 	LANDLOCK_ACCESS_FS_READ_FILE | \
-	LANDLOCK_ACCESS_FS_TRUNCATE)
+	LANDLOCK_ACCESS_FS_TRUNCATE | \
+	LANDLOCK_ACCESS_FS_IOCTL_DEV)
 
-#define ACCESS_LAST LANDLOCK_ACCESS_FS_TRUNCATE
+#define ACCESS_LAST LANDLOCK_ACCESS_FS_IOCTL_DEV
 
 #define ACCESS_ALL ( \
 	ACCESS_FILE | \
-- 
2.44.0.396.g6e790dbe36-goog


^ permalink raw reply related	[relevance 6%]

* [PATCH v13 00/10] Landlock: IOCTL support
@ 2024-03-27 13:10  2% Günther Noack
  2024-03-27 13:10  6% ` [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices Günther Noack
  0 siblings, 1 reply; 200+ results
From: Günther Noack @ 2024-03-27 13:10 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Jeff Xu, Arnd Bergmann, Jorge Lucangeli Obes, Allen Webb,
	Dmitry Torokhov, Paul Moore, Konstantin Meskhidze,
	Matt Bobrowski, linux-fsdevel, Günther Noack

Hello!

These patches add simple ioctl(2) support to Landlock.

Objective
~~~~~~~~~

Make ioctl(2) requests for device files restrictable with Landlock,
in a way that is useful for real-world applications.

Proposed approach
~~~~~~~~~~~~~~~~~

Introduce the LANDLOCK_ACCESS_FS_IOCTL_DEV right, which restricts the
use of ioctl(2) on block and character devices.

We attach this access right to opened file descriptors, as we
already do for LANDLOCK_ACCESS_FS_TRUNCATE.

If LANDLOCK_ACCESS_FS_IOCTL_DEV is handled (restricted in the
ruleset), the LANDLOCK_ACCESS_FS_IOCTL_DEV right governs the use of
all device-specific IOCTL commands.  We make exceptions for common and
known-harmless IOCTL commands such as FIOCLEX, FIONCLEX, FIONBIO and
FIOASYNC, as well as other IOCTL commands which are implemented in
fs/ioctl.c.  A full list of these IOCTL commands is listed in the
documentation.

I believe that this approach works for the majority of use cases, and
offers a good trade-off between complexity of the Landlock API and
implementation and flexibility when the feature is used.

Current limitations
~~~~~~~~~~~~~~~~~~~

With this patch set, ioctl(2) requests can *not* be filtered based on
file type, device number (dev_t) or on the ioctl(2) request number.

On the initial RFC patch set [1], we have reached consensus to start
with this simpler coarse-grained approach, and build additional IOCTL
restriction capabilities on top in subsequent steps.

[1] https://lore.kernel.org/linux-security-module/d4f1395c-d2d4-1860-3a02-2a0c023dd761@digikod.net/

Notable implications of this approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* A processes' existing open file descriptors stay unaffected
  when a process enables Landlock.

  This means that in common scenarios, where the terminal file
  descriptor is inherited from the parent process, the terminal's
  IOCTLs (ioctl_tty(2)) continue to work.

* ioctl(2) continues to be available for file descriptors for
  non-device files.  Example: Network sockets, memfd_create(2),
  regular files and directories.

Examples
~~~~~~~~

Starting a sandboxed shell from $HOME with samples/landlock/sandboxer:

  LL_FS_RO=/ LL_FS_RW=. ./sandboxer /bin/bash

The LANDLOCK_ACCESS_FS_IOCTL_DEV right is part of the "read-write"
rights here, so we expect that newly opened device files outside of
$HOME don't work with most IOCTL commands.

  * "stty" works: It probes terminal properties

  * "stty </dev/tty" fails: /dev/tty can be reopened, but the IOCTL is
    denied.

  * "eject" fails: ioctls to use CD-ROM drive are denied.

  * "ls /dev" works: It uses an ioctl to get the terminal size for
    columnar layout.

  * The text editors "vim" and "mg" work.  (GNU Emacs fails because it
    attempts to reopen /dev/tty.)

Unaffected IOCTL commands
~~~~~~~~~~~~~~~~~~~~~~~~~

To decide which IOCTL commands should be blanket-permitted, we went
through the list of IOCTL commands which are handled directly in
fs/ioctl.c and looked at them individually to understand what they are
about.

The following commands are permitted by Landlock unconditionally:

 * FIOCLEX, FIONCLEX - these work on the file descriptor and
   manipulate the close-on-exec flag (also available through
   fcntl(2) with F_SETFD)
 * FIONBIO, FIOASYNC - these work on the struct file and enable
   nonblocking-IO and async flags (also available through
   fcntl(2) with F_SETFL)

The following commands are also unconditionally permitted by Landlock,
because they really operate on the file system's superblock rather than
on the file itself (the same functionality is also available through any
other file on the same file system):

 * FIFREEZE, FITHAW - work on superblock(!) to freeze/thaw the file
   system. Requires CAP_SYS_ADMIN.
 * FIGETBSZ - get file system blocksize
 * FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH - getting file system properties

Notably, the command FIONREAD is *not* blanket-permitted, because
devices implement it in their own f_ops->unlocked_ioctl() handlers.

Detailed reasoning about each IOCTL command from fs/ioctl.c is in
get_required_ioctl_dev_access() in security/landlock/fs.c.


Related Work
~~~~~~~~~~~~

OpenBSD's pledge(2) [2] restricts ioctl(2) independent of the file
descriptor which is used.  The implementers maintain multiple
allow-lists of predefined ioctl(2) operations required for different
application domains such as "audio", "bpf", "tty" and "inet".

OpenBSD does not guarantee backwards compatibility to the same extent
as Linux does, so it's easier for them to update these lists in later
versions.  This approach might not be feasible for Linux, though.

[2] https://man.openbsd.org/OpenBSD-7.4/pledge.2


Implementation Rationale
~~~~~~~~~~~~~~~~~~~~~~~~

A main constraint of this implementation is that the blanket-permitted
IOCTL commands for device files should never dispatch to the
device-specific implementations in f_ops->unlocked_ioctl() and
f_ops->compat_ioctl().

There are many implementations of these f_ops operations and they are
too scattered across the kernel to give strong guarantees about them.
Additionally, some existing implementations do work before even
checking whether they support the cmd number which was passed to them.


In this implementation, we are listing the blanket-permitted IOCTL
commands in the Landlock implementation, mirroring a subset of the
IOCTL commands which are directly implemented in do_vfs_ioctl() in
fs/ioctl.c.  The trade-off is that the Landlock LSM needs to track
future developments in fs/ioctl.c to keep up to date with that, in
particular when new IOCTL commands are introduced there, or when they
are moved there from the f_ops implementations.

We mitigate this risk in this patch set by adding fs/ioctl.c to the
paths that are relevant to Landlock in the MAINTAINERS file.

The trade-off is discussed in more detail in [3].


Previous versions of this patch set have used different implementation
approaches to guarantee the main constraint above, which we have
dismissed due to the following reasons:

* V10: Introduced a new LSM hook file_vfs_ioctl, which gets invoked
  just before the call to f_ops->unlocked_ioctl().

  Not done, because it would have created an avoidable overlap between
  the file_ioctl and file_vfs_ioctl LSM hooks [4].

* V11: Introduced an indirection layer in fs/ioctl.c, so that Landlock
  could figure out the list of IOCTL commands which are handled by
  do_vfs_ioctl().

  Not done due to additional indirection and possible performance
  impact in fs/ioctl.c [5]

* V12: Introduced a special error code to be returned from the
  file_ioctl hook, and matching logic that would disallow the call to
  f_ops->unlocked_ioctl() in case that this error code is returned.

  Not done because this approach would conflict with Landlock's
  planned audit logging [6], and because LSM hooks with special error
  codes are generally discouraged and have led to problems in the
  past [7].

Thanks to Arnd Bergmann, Christian Brauner, Mickaël Salaün and Paul
Moore for guiding this implementation on the right track!

[3] https://lore.kernel.org/all/ZgLJG0aN0psur5Z7@google.com/
[4] https://lore.kernel.org/all/CAHC9VhRojXNSU9zi2BrP8z6JmOmT3DAqGNtinvvz=tL1XhVdyg@mail.gmail.com/
[5] https://lore.kernel.org/all/32b1164e-9d5f-40c0-9a4e-001b2c9b822f@app.fastmail.com
[6] https://lore.kernel.org/all/20240326.ahyaaPa0ohs6@digikod.net
[7] https://lore.kernel.org/all/CAHC9VhQJFWYeheR-EqqdfCq0YpvcQX5Scjfgcz1q+jrWg8YsdA@mail.gmail.com/


Changes
~~~~~~~

V13:
 * Using the existing file_ioctl hook and a hardcoded list of IOCTL commands.
   (See the section on implementation rationale above.)
 * Add support for FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH.
   
V12:
 * Rebased on Arnd's proposal:
   https://lore.kernel.org/all/32b1164e-9d5f-40c0-9a4e-001b2c9b822f@app.fastmail.com/
   This means that:
   * the IOCTL security hooks can return a special value ENOFILEOPS,
     which is treated specially in fs/ioctl.c to permit the IOCTL,
     but only as long as it does not call f_ops->unlocked_ioctl or
     f_ops->compat_ioctl.
 * The only change compared to V11 is commit 1, as well as a small
   adaptation in the commit 2 (The Landlock implementation needs to
   return the new special value).  The tests and documentation commits
   are exactly the same as before.

V11:
 * Rebased on Mickaël's proposal to refactor fs/ioctl.c:
   https://lore.kernel.org/all/20240315145848.1844554-1-mic@digikod.net/
   This means that:
   * we do not add the file_vfs_ioctl() hook as in V10
   * we add vfs_get_ioctl_handler() instead, so that Landlock
     can query which of the IOCTL commands is handled in do_vfs_ioctl()

   That proposal is used here unmodified (except for minor typos in the commit
   description).
 * Use the hook_ioctl_compat LSM hook as well.

V10:
 * Major change: only restrict IOCTL invocations on device files
   * Rename access right to LANDLOCK_ACCESS_FS_IOCTL_DEV
   * Remove the notion of synthetic access rights and IOCTL right groups
 * Introduce a new LSM hook file_vfs_ioctl, which gets invoked just
   before the call to f_ops->unlocked_ioctl()
 * Documentation
   * Various complications were removed or simplified:
     * Suggestion to mount file systems as nodev is not needed any more,
       as Landlock already lets users distinguish device files.
     * Remarks about fscrypt were removed.  The fscrypt-related IOCTLs only
       applied to regular files and directories, so this patch does not affect
       them any more.
     * Various documentation of the IOCTL grouping approach was removed,
       as it's not needed any more.

V9:
 * in “landlock: Add IOCTL access right”:
   * Change IOCTL group names and grouping as discussed with Mickaël.
     This makes the grouping coarser, and we occasionally rely on the
     underlying implementation to perform the appropriate read/write
     checks.
     * Group IOCTL_RW (one of READ_FILE, WRITE_FILE or READ_DIR):
       FIONREAD, FIOQSIZE, FIGETBSZ
     * Group IOCTL_RWF (one of READ_FILE or WRITE_FILE):
       FS_IOC_FIEMAP, FIBMAP, FIDEDUPERANGE, FICLONE, FICLONERANGE,
       FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64,
       FS_IOC_ZERO_RANGE
   * Exempt pipe file descriptors from IOCTL restrictions,
     even for named pipes which are opened from the file system.
     This is to be consistent with anonymous pipes created with pipe(2).
     As discussed in https://lore.kernel.org/r/ZP7lxmXklksadvz+@google.com
   * Document rationale for the IOCTL grouping in the code
   * Use __attribute_const__
   * Rename required_ioctl_access() to get_required_ioctl_access()
 * Selftests
   * Simplify IOCTL test fixtures as a result of simpler grouping.
   * Test that IOCTLs are permitted on named pipe FDs.
   * Test that IOCTLs are permitted on named Unix Domain Socket FDs.
   * Work around compilation issue with old GCC / glibc.
     https://sourceware.org/glibc/wiki/Synchronizing_Headers
     Thanks to Huyadi <hu.yadi@h3c.com>, who pointed this out in
     https://lore.kernel.org/all/f25be6663bcc4608adf630509f045a76@h3c.com/
     and Mickaël, who fixed it through #include reordering.
 * Documentation changes
   * Reword "IOCTL commands" section a bit
   * s/permit/allow/
   * s/access right/right/, if preceded by LANDLOCK_ACCESS_FS_*
   * s/IOCTL/FS_IOCTL/ in ASCII table
   * Update IOCTL grouping documentation in header file
 * Removed a few of the earlier commits in this patch set,
   which have already been merged.

V8:
 * Documentation changes
   * userspace-api/landlock.rst:
     * Add an extra paragraph about how the IOCTL right combines
       when used with other access rights.
     * Explain better the circumstances under which passing of
       file descriptors between different Landlock domains can happen
   * limits.h: Add comment to explain public vs internal FS access rights
   * Add a paragraph in the commit to explain better why the IOCTL
     right works as it does

V7:
 * in “landlock: Add IOCTL access right”:
   * Make IOCTL_GROUPS a #define so that static_assert works even on
     old compilers (bug reported by Intel about PowerPC GCC9 config)
   * Adapt indentation of IOCTL_GROUPS definition
   * Add missing dots in kernel-doc comments.
 * in “landlock: Remove remaining "inline" modifiers in .c files”:
   * explain reasoning in commit message

V6:
 * Implementation:
   * Check that only publicly visible access rights can be used when adding a
     rule (rather than the synthetic ones).  Thanks Mickaël for spotting that!
   * Move all functionality related to IOCTL groups and synthetic access rights
     into the same place at the top of fs.c
   * Move kernel doc to the .c file in one instance
   * Smaller code style issues (upcase IOCTL, vardecl at block start)
   * Remove inline modifier from functions in .c files
 * Tests:
   * use SKIP
   * Rename 'fd' to dir_fd and file_fd where appropriate
   * Remove duplicate "ioctl" mentions from test names
   * Rename "permitted" to "allowed", in ioctl and ftruncate tests
   * Do not add rules if access is 0, in test helper

V5:
 * Implementation:
   * move IOCTL group expansion logic into fs.c (implementation suggested by
     mic)
   * rename IOCTL_CMD_G* constants to LANDLOCK_ACCESS_FS_IOCTL_GROUP*
   * fs.c: create ioctl_groups constant
   * add "const" to some variables
 * Formatting and docstring fixes (including wrong kernel-doc format)
 * samples/landlock: fix ABI version and fallback attribute (mic)
 * Documentation
   * move header documentation changes into the implementation commit
   * spell out how FIFREEZE, FITHAW and attribute-manipulation ioctls from
     fs/ioctl.c are handled
   * change ABI 4 to ABI 5 in some missing places

V4:
 * use "synthetic" IOCTL access rights, as previously discussed
 * testing changes
   * use a large fixture-based test, for more exhaustive coverage,
     and replace some of the earlier tests with it
 * rebased on mic-next

V3:
 * always permit the IOCTL commands FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC and
   FIONREAD, independent of LANDLOCK_ACCESS_FS_IOCTL
 * increment ABI version in the same commit where the feature is introduced
 * testing changes
   * use FIOQSIZE instead of TTY IOCTL commands
     (FIOQSIZE works with regular files, directories and memfds)
   * run the memfd test with both Landlock enabled and disabled
   * add a test for the always-permitted IOCTL commands

V2:
 * rebased on mic-next
 * added documentation
 * exercise ioctl(2) in the memfd test
 * test: Use layout0 for the test

---

V1: https://lore.kernel.org/linux-security-module/20230502171755.9788-1-gnoack3000@gmail.com/
V2: https://lore.kernel.org/linux-security-module/20230623144329.136541-1-gnoack@google.com/
V3: https://lore.kernel.org/linux-security-module/20230814172816.3907299-1-gnoack@google.com/
V4: https://lore.kernel.org/linux-security-module/20231103155717.78042-1-gnoack@google.com/
V5: https://lore.kernel.org/linux-security-module/20231117154920.1706371-1-gnoack@google.com/
V6: https://lore.kernel.org/linux-security-module/20231124173026.3257122-1-gnoack@google.com/
V7: https://lore.kernel.org/linux-security-module/20231201143042.3276833-1-gnoack@google.com/
V8: https://lore.kernel.org/linux-security-module/20231208155121.1943775-1-gnoack@google.com/
V9: https://lore.kernel.org/linux-security-module/20240209170612.1638517-1-gnoack@google.com/
V10: https://lore.kernel.org/linux-security-module/20240309075320.160128-1-gnoack@google.com/
V11: https://lore.kernel.org/linux-security-module/20240322151002.3653639-1-gnoack@google.com/
V12: https://lore.kernel.org/linux-security-module/20240325134004.4074874-1-gnoack@google.com/

Günther Noack (10):
  landlock: Add IOCTL access right for character and block devices
  selftests/landlock: Test IOCTL support
  selftests/landlock: Test IOCTL with memfds
  selftests/landlock: Test ioctl(2) and ftruncate(2) with open(O_PATH)
  selftests/landlock: Test IOCTLs on named pipes
  selftests/landlock: Check IOCTL restrictions for named UNIX domain
    sockets
  samples/landlock: Add support for LANDLOCK_ACCESS_FS_IOCTL_DEV
  landlock: Document IOCTL support
  MAINTAINERS: Notify Landlock maintainers about changes to fs/ioctl.c
  fs/ioctl: Add a comment to keep the logic in sync with the Landlock
    LSM

 Documentation/userspace-api/landlock.rst     |  76 +++-
 MAINTAINERS                                  |   1 +
 fs/ioctl.c                                   |   3 +
 include/uapi/linux/landlock.h                |  33 +-
 samples/landlock/sandboxer.c                 |  13 +-
 security/landlock/fs.c                       | 183 ++++++++-
 security/landlock/limits.h                   |   2 +-
 security/landlock/syscalls.c                 |   8 +-
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   | 396 ++++++++++++++++++-
 10 files changed, 673 insertions(+), 44 deletions(-)


base-commit: e9df9344b6f3e5e1c745a71f125ff4b5c6ddc96b
-- 
2.44.0.396.g6e790dbe36-goog


^ permalink raw reply	[relevance 2%]

* Re: kernel crash in mknod
  2024-03-25 21:13  0%             ` Al Viro
@ 2024-03-25 21:31  0%               ` Paulo Alcantara
  0 siblings, 0 replies; 200+ results
From: Paulo Alcantara @ 2024-03-25 21:31 UTC (permalink / raw)
  To: Al Viro, Steve French
  Cc: Christian Brauner, Roberto Sassu, LKML, linux-fsdevel, CIFS,
	Christian Brauner, Mimi Zohar, Paul Moore, linux-integrity,
	linux-security-module

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, Mar 25, 2024 at 05:47:16PM -0300, Paulo Alcantara wrote:
>> Al Viro <viro@zeniv.linux.org.uk> writes:
>> 
>> > On Mon, Mar 25, 2024 at 11:26:59AM -0500, Steve French wrote:
>> >
>> >> A loosely related question.  Do I need to change cifs.ko to return the
>> >> pointer to inode on mknod now?  dentry->inode is NULL in the case of mknod
>> >> from cifs.ko (and presumably some other fs as Al noted), unlike mkdir and
>> >> create where it is filled in.   Is there a perf advantage in filling in the
>> >> dentry->inode in the mknod path in the fs or better to leave it as is?  Is
>> >> there a good example to borrow from on this?
>> >
>> > AFAICS, that case in CIFS is the only instance of ->mknod() that does this
>> > "skip lookups, just unhash and return 0" at the moment.
>> >
>> > What's more, it really had been broken all along for one important case -
>> > AF_UNIX bind(2) with address (== socket pathname) being on the filesystem
>> > in question.
>> 
>> Yes, except that we currently return -EPERM for such cases.  I don't
>> even know if this SFU thing supports sockets.
>
> 	Sure, but we really want the rules to be reasonably simple and
> "you may leave dentry unhashed negative and return 0, provided that you
> hadn't been asked to create a socket" is anything but ;-)

Agreed :-)

>> > Note that cifs_sfu_make_node() is the only case in CIFS where that happens -
>> > other codepaths (both in cifs_make_node() and in smb2_make_node()) will
>> > instantiate.  How painful would it be for cifs_sfu_make_node()?
>> > AFAICS, you do open/sync_write/close there; would it be hard to do
>> > an equivalent of fstat and set the inode up?
>> 
>> This should be pretty straightforward as it would only require an extra
>> query info call and then {smb311_posix,cifs}_get_inode_info() ->
>> d_instantiate().  We could even make it a single compound request of
>> open/write/getinfo/close for SMB2+ case.
>
> 	If that's the case, I believe that we should simply declare that
> ->mknod() must instantiate on success and have vfs_mknod() check and
> warn if it hadn't.

LGTM.

Steve, any objections?

^ permalink raw reply	[relevance 0%]

* Re: kernel crash in mknod
  2024-03-25 20:47  5%           ` Paulo Alcantara
@ 2024-03-25 21:13  0%             ` Al Viro
  2024-03-25 21:31  0%               ` Paulo Alcantara
  0 siblings, 1 reply; 200+ results
From: Al Viro @ 2024-03-25 21:13 UTC (permalink / raw)
  To: Paulo Alcantara
  Cc: Steve French, Christian Brauner, Roberto Sassu, LKML,
	linux-fsdevel, CIFS, Christian Brauner, Mimi Zohar, Paul Moore,
	linux-integrity, linux-security-module

On Mon, Mar 25, 2024 at 05:47:16PM -0300, Paulo Alcantara wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
> 
> > On Mon, Mar 25, 2024 at 11:26:59AM -0500, Steve French wrote:
> >
> >> A loosely related question.  Do I need to change cifs.ko to return the
> >> pointer to inode on mknod now?  dentry->inode is NULL in the case of mknod
> >> from cifs.ko (and presumably some other fs as Al noted), unlike mkdir and
> >> create where it is filled in.   Is there a perf advantage in filling in the
> >> dentry->inode in the mknod path in the fs or better to leave it as is?  Is
> >> there a good example to borrow from on this?
> >
> > AFAICS, that case in CIFS is the only instance of ->mknod() that does this
> > "skip lookups, just unhash and return 0" at the moment.
> >
> > What's more, it really had been broken all along for one important case -
> > AF_UNIX bind(2) with address (== socket pathname) being on the filesystem
> > in question.
> 
> Yes, except that we currently return -EPERM for such cases.  I don't
> even know if this SFU thing supports sockets.

	Sure, but we really want the rules to be reasonably simple and
"you may leave dentry unhashed negative and return 0, provided that you
hadn't been asked to create a socket" is anything but ;-)

> > Note that cifs_sfu_make_node() is the only case in CIFS where that happens -
> > other codepaths (both in cifs_make_node() and in smb2_make_node()) will
> > instantiate.  How painful would it be for cifs_sfu_make_node()?
> > AFAICS, you do open/sync_write/close there; would it be hard to do
> > an equivalent of fstat and set the inode up?
> 
> This should be pretty straightforward as it would only require an extra
> query info call and then {smb311_posix,cifs}_get_inode_info() ->
> d_instantiate().  We could even make it a single compound request of
> open/write/getinfo/close for SMB2+ case.

	If that's the case, I believe that we should simply declare that
->mknod() must instantiate on success and have vfs_mknod() check and
warn if it hadn't.

	Rationale:

1) mknod(2) is usually followed by at least some access to created object.
Not setting the inode up won't save much anyway.
2) if some instance of ->mknod() skips setting the inode on success (i.e.
unhashes the still-negative dentry and returns 0), it can easily be
converted.  The minimal conversion would be along the lines of turning
	d_drop(dentry);
	return 0;
into
	d_drop(dentry);
	d = foofs_lookup(dir, dentry, 0);
	if (unlikely(d)) {
		if (!IS_ERR(d)) {
			dput(d);
			return -EINVAL;	// weird shit - directory got created somehow
		}
		return PTR_ERR(d);
	}
	return 0;
but there almost certainly are cheaper ways to get the inode metadata,
set the inode up and instantiate the dentry.
3) currently only one in-kernel instance is that way.
4) it makes life simpler for the users of vfs_mknod().

	Objections, anyone?

^ permalink raw reply	[relevance 0%]

* Re: kernel crash in mknod
  @ 2024-03-25 20:47  5%           ` Paulo Alcantara
  2024-03-25 21:13  0%             ` Al Viro
  0 siblings, 1 reply; 200+ results
From: Paulo Alcantara @ 2024-03-25 20:47 UTC (permalink / raw)
  To: Al Viro, Steve French
  Cc: Christian Brauner, Roberto Sassu, LKML, linux-fsdevel, CIFS,
	Christian Brauner, Mimi Zohar, Paul Moore, linux-integrity,
	linux-security-module

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Mon, Mar 25, 2024 at 11:26:59AM -0500, Steve French wrote:
>
>> A loosely related question.  Do I need to change cifs.ko to return the
>> pointer to inode on mknod now?  dentry->inode is NULL in the case of mknod
>> from cifs.ko (and presumably some other fs as Al noted), unlike mkdir and
>> create where it is filled in.   Is there a perf advantage in filling in the
>> dentry->inode in the mknod path in the fs or better to leave it as is?  Is
>> there a good example to borrow from on this?
>
> AFAICS, that case in CIFS is the only instance of ->mknod() that does this
> "skip lookups, just unhash and return 0" at the moment.
>
> What's more, it really had been broken all along for one important case -
> AF_UNIX bind(2) with address (== socket pathname) being on the filesystem
> in question.

Yes, except that we currently return -EPERM for such cases.  I don't
even know if this SFU thing supports sockets.

> Note that cifs_sfu_make_node() is the only case in CIFS where that happens -
> other codepaths (both in cifs_make_node() and in smb2_make_node()) will
> instantiate.  How painful would it be for cifs_sfu_make_node()?
> AFAICS, you do open/sync_write/close there; would it be hard to do
> an equivalent of fstat and set the inode up?

This should be pretty straightforward as it would only require an extra
query info call and then {smb311_posix,cifs}_get_inode_info() ->
d_instantiate().  We could even make it a single compound request of
open/write/getinfo/close for SMB2+ case.

^ permalink raw reply	[relevance 5%]

* [PATCH v12 0/9] Landlock: IOCTL support
@ 2024-03-25 13:39  2% Günther Noack
  0 siblings, 0 replies; 200+ results
From: Günther Noack @ 2024-03-25 13:39 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Jeff Xu, Arnd Bergmann, Jorge Lucangeli Obes, Allen Webb,
	Dmitry Torokhov, Paul Moore, Konstantin Meskhidze,
	Matt Bobrowski, linux-fsdevel, Günther Noack

Hello!

These patches add simple ioctl(2) support to Landlock.

Objective
~~~~~~~~~

Make ioctl(2) requests restrictable with Landlock,
in a way that is useful for real-world applications.

Proposed approach
~~~~~~~~~~~~~~~~~

Introduce the LANDLOCK_ACCESS_FS_IOCTL_DEV right, which restricts the
use of ioctl(2) on block and character devices.

We attach this access right to opened file descriptors, as we
already do for LANDLOCK_ACCESS_FS_TRUNCATE.

If LANDLOCK_ACCESS_FS_IOCTL_DEV is handled (restricted in the
ruleset), the LANDLOCK_ACCESS_FS_IOCTL_DEV right governs the use of
all device-specific IOCTL commands.  We make exceptions for common and
known-harmless IOCTL commands such as FIOCLEX, FIONCLEX, FIONBIO and
FIOASYNC, as well as other IOCTL commands for regular files, which are
implemented in fs/ioctl.c.  A full list of these IOCTL commands is
listed in the documentation.

I believe that this approach works for the majority of use cases, and
offers a good trade-off between complexity of the Landlock API and
implementation and flexibility when the feature is used.

Current limitations
~~~~~~~~~~~~~~~~~~~

With this patch set, ioctl(2) requests can *not* be filtered based on
file type, device number (dev_t), or the ioctl(2) request number.

On the initial RFC patch set [1], we have reached consensus to start
with this simpler coarse-grained approach, and build additional IOCTL
restriction capabilities on top in subsequent steps.

[1] https://lore.kernel.org/linux-security-module/d4f1395c-d2d4-1860-3a02-2a0c023dd761@digikod.net/

Notable implications of this approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* A process's existing open file descriptors stay unaffected
  when the process enables Landlock.

  This means in particular that in common scenarios,
  the terminal's IOCTLs (ioctl_tty(2)) continue to work.

* ioctl(2) continues to be available for file descriptors for
  non-device files.  Example: Network sockets, memfd_create(2).

Examples
~~~~~~~~

Starting a sandboxed shell from $HOME with samples/landlock/sandboxer:

  LL_FS_RO=/ LL_FS_RW=. ./sandboxer /bin/bash

The LANDLOCK_ACCESS_FS_IOCTL_DEV right is part of the "read-write"
rights here, so we expect that newly opened files outside of $HOME
don't work with most IOCTL commands.

  * "stty" works: It probes terminal properties

  * "stty </dev/tty" fails: /dev/tty can be reopened, but the IOCTL is
    denied.

  * "eject" fails: ioctls to use CD-ROM drive are denied.

  * "ls /dev" works: It uses ioctl to get the terminal size for
    columnar layout

  * The text editors "vim" and "mg" work.  (GNU Emacs fails because it
    attempts to reopen /dev/tty.)

Unaffected IOCTL commands
~~~~~~~~~~~~~~~~~~~~~~~~~

To decide which IOCTL commands should be blanket-permitted, we went
through the list of IOCTL commands which are handled directly in
fs/ioctl.c and looked at them individually to understand what they are
about.

The following commands are permitted by Landlock unconditionally:

 * FIOCLEX, FIONCLEX - these work on the file descriptor and
   manipulate the close-on-exec flag (also available through
   fcntl(2) with F_SETFD)
 * FIONBIO, FIOASYNC - these work on the struct file and enable
   nonblocking-IO and async flags (also available through
   fcntl(2) with F_SETFL)

The following commands are also technically permitted by Landlock
unconditionally, but are not supported by device files.  By permitting
them in Landlock on device files, we naturally return the normal error
code.

 * FIOQSIZE - get the size of the opened file or directory
 * FIBMAP - get the file system block numbers underlying a file
 * FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64,
   FS_IOC_ZERO_RANGE: Backwards compatibility with legacy XFS
   preallocation syscalls which predate fallocate(2).

The following commands are also technically permitted by Landlock, but
they are really operating on the file system's superblock, rather than
on the file itself:

 * FIFREEZE, FITHAW - work on superblock(!) to freeze/thaw the file
   system. Requires CAP_SYS_ADMIN.
 * FIGETBSZ - get file system blocksize

The following commands are technically permitted by Landlock:

 * FS_IOC_FIEMAP - get information about file extent mapping
   (c.f. https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt)
 * FIDEDUPERANGE, FICLONE, FICLONERANGE - manipulating shared physical storage
   between multiple files.  These only work on some COW file systems, by design.
 * Accessing file attributes:
   * FS_IOC_GETFLAGS, FS_IOC_SETFLAGS - manipulate inode flags (ioctl_iflags(2))
   * FS_IOC_FSGETXATTR, FS_IOC_FSSETXATTR - more attributes

Notably, the command FIONREAD is *not* blanket-permitted,
because for device files it is implemented by the device driver.


Related Work
~~~~~~~~~~~~

OpenBSD's pledge(2) [2] restricts ioctl(2) independent of the file
descriptor which is used.  The implementers maintain multiple
allow-lists of predefined ioctl(2) operations required for different
application domains such as "audio", "bpf", "tty" and "inet".

OpenBSD does not guarantee backwards compatibility to the same extent
as Linux does, so it's easier for them to update these lists in later
versions.  It might not be a feasible approach for Linux though.

[2] https://man.openbsd.org/OpenBSD-7.4/pledge.2


Open Questions
~~~~~~~~~~~~~~

 * Is this approach OK as a mechanism for identifying the IOCTL
   commands which are handled by do_vfs_ioctl()?


Changes
~~~~~~~

V12:
 * Rebased on Arnd's proposal:
   https://lore.kernel.org/all/32b1164e-9d5f-40c0-9a4e-001b2c9b822f@app.fastmail.com/
   This means that:
   * the IOCTL security hooks can return a special value ENOFILEOPS,
     which is treated specially in fs/ioctl.c to permit the IOCTL,
     but only as long as it does not call f_ops->unlocked_ioctl or
     f_ops->compat_ioctl.
 * The only change compared to V11 is commit 1, as well as a small
   adaptation in commit 2 (the Landlock implementation needs to
   return the new special value).  The tests and documentation commits
   are exactly the same as before.

V11:
 * Rebased on Mickaël's proposal to refactor fs/ioctl.c:
   https://lore.kernel.org/all/20240315145848.1844554-1-mic@digikod.net/
   This means that:
   * we do not add the file_vfs_ioctl() hook as in V10
   * we add vfs_get_ioctl_handler() instead, so that Landlock
     can query which of the IOCTL commands is handled in do_vfs_ioctl()

   That proposal is used here unmodified (except for minor typos in the commit
   description).
 * Use the hook_ioctl_compat LSM hook as well.

V10:
 * Major change: only restrict IOCTL invocations on device files
   * Rename access right to LANDLOCK_ACCESS_FS_IOCTL_DEV
   * Remove the notion of synthetic access rights and IOCTL right groups
 * Introduce a new LSM hook file_vfs_ioctl, which gets invoked just
   before the call to f_ops->unlocked_ioctl()
 * Documentation
   * Various complications were removed or simplified:
     * Suggestion to mount file systems as nodev is not needed any more,
       as Landlock already lets users distinguish device files.
     * Remarks about fscrypt were removed.  The fscrypt-related IOCTLs only
       applied to regular files and directories, so this patch does not affect
       them any more.
     * Various documentation of the IOCTL grouping approach was removed,
       as it's not needed any more.

V9:
 * in “landlock: Add IOCTL access right”:
   * Change IOCTL group names and grouping as discussed with Mickaël.
     This makes the grouping coarser, and we occasionally rely on the
     underlying implementation to perform the appropriate read/write
     checks.
     * Group IOCTL_RW (one of READ_FILE, WRITE_FILE or READ_DIR):
       FIONREAD, FIOQSIZE, FIGETBSZ
     * Group IOCTL_RWF (one of READ_FILE or WRITE_FILE):
       FS_IOC_FIEMAP, FIBMAP, FIDEDUPERANGE, FICLONE, FICLONERANGE,
       FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64,
       FS_IOC_ZERO_RANGE
   * Exempt pipe file descriptors from IOCTL restrictions,
     even for named pipes which are opened from the file system.
     This is to be consistent with anonymous pipes created with pipe(2).
     As discussed in https://lore.kernel.org/r/ZP7lxmXklksadvz+@google.com
   * Document rationale for the IOCTL grouping in the code
   * Use __attribute_const__
   * Rename required_ioctl_access() to get_required_ioctl_access()
 * Selftests
   * Simplify IOCTL test fixtures as a result of simpler grouping.
   * Test that IOCTLs are permitted on named pipe FDs.
   * Test that IOCTLs are permitted on named Unix Domain Socket FDs.
   * Work around compilation issue with old GCC / glibc.
     https://sourceware.org/glibc/wiki/Synchronizing_Headers
     Thanks to Huyadi <hu.yadi@h3c.com>, who pointed this out in
     https://lore.kernel.org/all/f25be6663bcc4608adf630509f045a76@h3c.com/
     and Mickaël, who fixed it through #include reordering.
 * Documentation changes
   * Reword "IOCTL commands" section a bit
   * s/permit/allow/
   * s/access right/right/, if preceded by LANDLOCK_ACCESS_FS_*
   * s/IOCTL/FS_IOCTL/ in ASCII table
   * Update IOCTL grouping documentation in header file
 * Removed a few of the earlier commits in this patch set,
   which have already been merged.

V8:
 * Documentation changes
   * userspace-api/landlock.rst:
     * Add an extra paragraph about how the IOCTL right combines
       when used with other access rights.
     * Explain better the circumstances under which passing of
       file descriptors between different Landlock domains can happen
   * limits.h: Add comment to explain public vs internal FS access rights
   * Add a paragraph in the commit to explain better why the IOCTL
     right works as it does

V7:
 * in “landlock: Add IOCTL access right”:
   * Make IOCTL_GROUPS a #define so that static_assert works even on
     old compilers (bug reported by Intel about PowerPC GCC9 config)
   * Adapt indentation of IOCTL_GROUPS definition
   * Add missing dots in kernel-doc comments.
 * in “landlock: Remove remaining "inline" modifiers in .c files”:
   * explain reasoning in commit message

V6:
 * Implementation:
   * Check that only publicly visible access rights can be used when adding a
     rule (rather than the synthetic ones).  Thanks Mickaël for spotting that!
   * Move all functionality related to IOCTL groups and synthetic access rights
     into the same place at the top of fs.c
   * Move kernel doc to the .c file in one instance
   * Smaller code style issues (upcase IOCTL, vardecl at block start)
   * Remove inline modifier from functions in .c files
 * Tests:
   * use SKIP
   * Rename 'fd' to dir_fd and file_fd where appropriate
   * Remove duplicate "ioctl" mentions from test names
   * Rename "permitted" to "allowed", in ioctl and ftruncate tests
   * Do not add rules if access is 0, in test helper

V5:
 * Implementation:
   * move IOCTL group expansion logic into fs.c (implementation suggested by
     mic)
   * rename IOCTL_CMD_G* constants to LANDLOCK_ACCESS_FS_IOCTL_GROUP*
   * fs.c: create ioctl_groups constant
   * add "const" to some variables
 * Formatting and docstring fixes (including wrong kernel-doc format)
 * samples/landlock: fix ABI version and fallback attribute (mic)
 * Documentation
   * move header documentation changes into the implementation commit
   * spell out how FIFREEZE, FITHAW and attribute-manipulation ioctls from
     fs/ioctl.c are handled
   * change ABI 4 to ABI 5 in some missing places

V4:
 * use "synthetic" IOCTL access rights, as previously discussed
 * testing changes
   * use a large fixture-based test, for more exhaustive coverage,
     and replace some of the earlier tests with it
 * rebased on mic-next

V3:
 * always permit the IOCTL commands FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC and
   FIONREAD, independent of LANDLOCK_ACCESS_FS_IOCTL
 * increment ABI version in the same commit where the feature is introduced
 * testing changes
   * use FIOQSIZE instead of TTY IOCTL commands
     (FIOQSIZE works with regular files, directories and memfds)
   * run the memfd test with both Landlock enabled and disabled
   * add a test for the always-permitted IOCTL commands

V2:
 * rebased on mic-next
 * added documentation
 * exercise ioctl(2) in the memfd test
 * test: Use layout0 for the test

---

V1: https://lore.kernel.org/linux-security-module/20230502171755.9788-1-gnoack3000@gmail.com/
V2: https://lore.kernel.org/linux-security-module/20230623144329.136541-1-gnoack@google.com/
V3: https://lore.kernel.org/linux-security-module/20230814172816.3907299-1-gnoack@google.com/
V4: https://lore.kernel.org/linux-security-module/20231103155717.78042-1-gnoack@google.com/
V5: https://lore.kernel.org/linux-security-module/20231117154920.1706371-1-gnoack@google.com/
V6: https://lore.kernel.org/linux-security-module/20231124173026.3257122-1-gnoack@google.com/
V7: https://lore.kernel.org/linux-security-module/20231201143042.3276833-1-gnoack@google.com/
V8: https://lore.kernel.org/linux-security-module/20231208155121.1943775-1-gnoack@google.com/
V9: https://lore.kernel.org/linux-security-module/20240209170612.1638517-1-gnoack@google.com/
V10: https://lore.kernel.org/linux-security-module/20240309075320.160128-1-gnoack@google.com/
V11: https://lore.kernel.org/linux-security-module/20240322151002.3653639-1-gnoack@google.com/

Günther Noack (9):
  security: Introduce ENOFILEOPS return value for IOCTL hooks
  landlock: Add IOCTL access right for character and block devices
  selftests/landlock: Test IOCTL support
  selftests/landlock: Test IOCTL with memfds
  selftests/landlock: Test ioctl(2) and ftruncate(2) with open(O_PATH)
  selftests/landlock: Test IOCTLs on named pipes
  selftests/landlock: Check IOCTL restrictions for named UNIX domain
    sockets
  samples/landlock: Add support for LANDLOCK_ACCESS_FS_IOCTL_DEV
  landlock: Document IOCTL support

 Documentation/userspace-api/landlock.rst     |  76 +++-
 fs/ioctl.c                                   |  25 +-
 include/linux/security.h                     |   6 +
 include/uapi/linux/landlock.h                |  35 +-
 samples/landlock/sandboxer.c                 |  13 +-
 security/landlock/fs.c                       |  45 ++-
 security/landlock/limits.h                   |   2 +-
 security/landlock/syscalls.c                 |   8 +-
 security/security.c                          |  10 +-
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   | 405 ++++++++++++++++++-
 11 files changed, 576 insertions(+), 51 deletions(-)


base-commit: a17c60e533f5cd832e77e0d194e2e0bb663371b6
-- 
2.44.0.396.g6e790dbe36-goog


^ permalink raw reply	[relevance 2%]

* [PATCH v11 0/9] Landlock: IOCTL support
@ 2024-03-22 15:09  2% Günther Noack
  0 siblings, 0 replies; 200+ results
From: Günther Noack @ 2024-03-22 15:09 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Jeff Xu, Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov,
	Paul Moore, Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel,
	Günther Noack

Hello!

These patches add simple ioctl(2) support to Landlock.

Objective
~~~~~~~~~

Make ioctl(2) requests restrictable with Landlock,
in a way that is useful for real-world applications.

Proposed approach
~~~~~~~~~~~~~~~~~

Introduce the LANDLOCK_ACCESS_FS_IOCTL_DEV right, which restricts the
use of ioctl(2) on block and character devices.

We attach this access right to opened file descriptors, as we
already do for LANDLOCK_ACCESS_FS_TRUNCATE.

If LANDLOCK_ACCESS_FS_IOCTL_DEV is handled (restricted in the
ruleset), the LANDLOCK_ACCESS_FS_IOCTL_DEV right governs the use of
all device-specific IOCTL commands.  We make exceptions for common and
known-harmless IOCTL commands such as FIOCLEX, FIONCLEX, FIONBIO and
FIOASYNC, as well as other IOCTL commands for regular files, which are
implemented in fs/ioctl.c.  A full list of these IOCTL commands is
listed in the documentation.

I believe that this approach works for the majority of use cases, and
offers a good trade-off between complexity of the Landlock API and
implementation and flexibility when the feature is used.

Current limitations
~~~~~~~~~~~~~~~~~~~

With this patch set, ioctl(2) requests can *not* be filtered based on
file type, device number (dev_t), or the ioctl(2) request number.

On the initial RFC patch set [1], we have reached consensus to start
with this simpler coarse-grained approach, and build additional IOCTL
restriction capabilities on top in subsequent steps.

[1] https://lore.kernel.org/linux-security-module/d4f1395c-d2d4-1860-3a02-2a0c023dd761@digikod.net/

Notable implications of this approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* A process's existing open file descriptors stay unaffected
  when the process enables Landlock.

  This means in particular that in common scenarios,
  the terminal's IOCTLs (ioctl_tty(2)) continue to work.

* ioctl(2) continues to be available for file descriptors for
  non-device files.  Example: Network sockets, memfd_create(2).

Examples
~~~~~~~~

Starting a sandboxed shell from $HOME with samples/landlock/sandboxer:

  LL_FS_RO=/ LL_FS_RW=. ./sandboxer /bin/bash

The LANDLOCK_ACCESS_FS_IOCTL_DEV right is part of the "read-write"
rights here, so we expect that newly opened files outside of $HOME
don't work with most IOCTL commands.

  * "stty" works: It probes terminal properties

  * "stty </dev/tty" fails: /dev/tty can be reopened, but the IOCTL is
    denied.

  * "eject" fails: ioctls to use CD-ROM drive are denied.

  * "ls /dev" works: It uses ioctl to get the terminal size for
    columnar layout

  * The text editors "vim" and "mg" work.  (GNU Emacs fails because it
    attempts to reopen /dev/tty.)

Unaffected IOCTL commands
~~~~~~~~~~~~~~~~~~~~~~~~~

To decide which IOCTL commands should be blanket-permitted, we went
through the list of IOCTL commands which are handled directly in
fs/ioctl.c and looked at them individually to understand what they are
about.

The following commands are permitted by Landlock unconditionally:

 * FIOCLEX, FIONCLEX - these work on the file descriptor and
   manipulate the close-on-exec flag (also available through
   fcntl(2) with F_SETFD)
 * FIONBIO, FIOASYNC - these work on the struct file and enable
   nonblocking-IO and async flags (also available through
   fcntl(2) with F_SETFL)

The following commands are also technically permitted by Landlock
unconditionally, but are not supported by device files.  By permitting
them in Landlock on device files, we naturally return the normal error
code.

 * FIOQSIZE - get the size of the opened file or directory
 * FIBMAP - get the file system block numbers underlying a file
 * FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64,
   FS_IOC_ZERO_RANGE: Backwards compatibility with legacy XFS
   preallocation syscalls which predate fallocate(2).

The following commands are also technically permitted by Landlock, but
they are really operating on the file system's superblock, rather than
on the file itself:

 * FIFREEZE, FITHAW - work on superblock(!) to freeze/thaw the file
   system. Requires CAP_SYS_ADMIN.
 * FIGETBSZ - get file system blocksize

The following commands are technically permitted by Landlock:

 * FS_IOC_FIEMAP - get information about file extent mapping
   (c.f. https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt)
 * FIDEDUPERANGE, FICLONE, FICLONERANGE - manipulating shared physical storage
   between multiple files.  These only work on some COW file systems, by design.
 * Accessing file attributes:
   * FS_IOC_GETFLAGS, FS_IOC_SETFLAGS - manipulate inode flags (ioctl_iflags(2))
   * FS_IOC_FSGETXATTR, FS_IOC_FSSETXATTR - more attributes

Notably, the command FIONREAD is *not* blanket-permitted,
because for device files it is implemented by the device driver.


Related Work
~~~~~~~~~~~~

OpenBSD's pledge(2) [2] restricts ioctl(2) independent of the file
descriptor which is used.  The implementers maintain multiple
allow-lists of predefined ioctl(2) operations required for different
application domains such as "audio", "bpf", "tty" and "inet".

OpenBSD does not guarantee backwards compatibility to the same extent
as Linux does, so it's easier for them to update these lists in later
versions.  It might not be a feasible approach for Linux though.

[2] https://man.openbsd.org/OpenBSD-7.4/pledge.2


Open Questions
~~~~~~~~~~~~~~

 * Is this approach OK as a mechanism for identifying the IOCTL
   commands which are handled by do_vfs_ioctl()?


Changes
~~~~~~~

V11:
 * Rebased on Mickaël's proposal to refactor fs/ioctl.c:
   https://lore.kernel.org/all/20240315145848.1844554-1-mic@digikod.net/
   This means that:
   * we do not add the file_vfs_ioctl() hook as in V10
   * we add vfs_get_ioctl_handler() instead, so that Landlock
     can query which of the IOCTL commands is handled in do_vfs_ioctl()

   That proposal is used here unmodified (except for minor typos in the commit
   description).
 * Use the hook_ioctl_compat LSM hook as well.

V10:
 * Major change: only restrict IOCTL invocations on device files
   * Rename access right to LANDLOCK_ACCESS_FS_IOCTL_DEV
   * Remove the notion of synthetic access rights and IOCTL right groups
 * Introduce a new LSM hook file_vfs_ioctl, which gets invoked just
   before the call to f_ops->unlocked_ioctl()
 * Documentation
   * Various complications were removed or simplified:
     * Suggestion to mount file systems as nodev is not needed any more,
       as Landlock already lets users distinguish device files.
     * Remarks about fscrypt were removed.  The fscrypt-related IOCTLs only
       applied to regular files and directories, so this patch does not affect
       them any more.
     * Various documentation of the IOCTL grouping approach was removed,
       as it's not needed any more.

V9:
 * in “landlock: Add IOCTL access right”:
   * Change IOCTL group names and grouping as discussed with Mickaël.
     This makes the grouping coarser, and we occasionally rely on the
     underlying implementation to perform the appropriate read/write
     checks.
     * Group IOCTL_RW (one of READ_FILE, WRITE_FILE or READ_DIR):
       FIONREAD, FIOQSIZE, FIGETBSZ
     * Group IOCTL_RWF (one of READ_FILE or WRITE_FILE):
       FS_IOC_FIEMAP, FIBMAP, FIDEDUPERANGE, FICLONE, FICLONERANGE,
       FS_IOC_RESVSP, FS_IOC_RESVSP64, FS_IOC_UNRESVSP, FS_IOC_UNRESVSP64,
       FS_IOC_ZERO_RANGE
   * Exempt pipe file descriptors from IOCTL restrictions,
     even for named pipes which are opened from the file system.
     This is to be consistent with anonymous pipes created with pipe(2).
     As discussed in https://lore.kernel.org/r/ZP7lxmXklksadvz+@google.com
   * Document rationale for the IOCTL grouping in the code
   * Use __attribute_const__
   * Rename required_ioctl_access() to get_required_ioctl_access()
 * Selftests
   * Simplify IOCTL test fixtures as a result of simpler grouping.
   * Test that IOCTLs are permitted on named pipe FDs.
   * Test that IOCTLs are permitted on named Unix Domain Socket FDs.
   * Work around compilation issue with old GCC / glibc.
     https://sourceware.org/glibc/wiki/Synchronizing_Headers
     Thanks to Huyadi <hu.yadi@h3c.com>, who pointed this out in
     https://lore.kernel.org/all/f25be6663bcc4608adf630509f045a76@h3c.com/
     and Mickaël, who fixed it through #include reordering.
 * Documentation changes
   * Reword "IOCTL commands" section a bit
   * s/permit/allow/
   * s/access right/right/, if preceded by LANDLOCK_ACCESS_FS_*
   * s/IOCTL/FS_IOCTL/ in ASCII table
   * Update IOCTL grouping documentation in header file
 * Removed a few of the earlier commits in this patch set,
   which have already been merged.

V8:
 * Documentation changes
   * userspace-api/landlock.rst:
     * Add an extra paragraph about how the IOCTL right combines
       when used with other access rights.
     * Explain better the circumstances under which passing of
       file descriptors between different Landlock domains can happen
   * limits.h: Add comment to explain public vs internal FS access rights
   * Add a paragraph in the commit to explain better why the IOCTL
     right works as it does

V7:
 * in “landlock: Add IOCTL access right”:
   * Make IOCTL_GROUPS a #define so that static_assert works even on
     old compilers (bug reported by Intel about PowerPC GCC9 config)
   * Adapt indentation of IOCTL_GROUPS definition
   * Add missing dots in kernel-doc comments.
 * in “landlock: Remove remaining "inline" modifiers in .c files”:
   * explain reasoning in commit message

V6:
 * Implementation:
   * Check that only publicly visible access rights can be used when adding a
     rule (rather than the synthetic ones).  Thanks Mickaël for spotting that!
   * Move all functionality related to IOCTL groups and synthetic access rights
     into the same place at the top of fs.c
   * Move kernel doc to the .c file in one instance
   * Smaller code style issues (upcase IOCTL, vardecl at block start)
   * Remove inline modifier from functions in .c files
 * Tests:
   * use SKIP
   * Rename 'fd' to dir_fd and file_fd where appropriate
   * Remove duplicate "ioctl" mentions from test names
   * Rename "permitted" to "allowed", in ioctl and ftruncate tests
   * Do not add rules if access is 0, in test helper

V5:
 * Implementation:
   * move IOCTL group expansion logic into fs.c (implementation suggested by
     mic)
   * rename IOCTL_CMD_G* constants to LANDLOCK_ACCESS_FS_IOCTL_GROUP*
   * fs.c: create ioctl_groups constant
   * add "const" to some variables
 * Formatting and docstring fixes (including wrong kernel-doc format)
 * samples/landlock: fix ABI version and fallback attribute (mic)
 * Documentation
   * move header documentation changes into the implementation commit
   * spell out how FIFREEZE, FITHAW and attribute-manipulation ioctls from
     fs/ioctl.c are handled
   * change ABI 4 to ABI 5 in some missing places

V4:
 * use "synthetic" IOCTL access rights, as previously discussed
 * testing changes
   * use a large fixture-based test, for more exhaustive coverage,
     and replace some of the earlier tests with it
 * rebased on mic-next

V3:
 * always permit the IOCTL commands FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC and
   FIONREAD, independent of LANDLOCK_ACCESS_FS_IOCTL
 * increment ABI version in the same commit where the feature is introduced
 * testing changes
   * use FIOQSIZE instead of TTY IOCTL commands
     (FIOQSIZE works with regular files, directories and memfds)
   * run the memfd test with both Landlock enabled and disabled
   * add a test for the always-permitted IOCTL commands
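
   (Editorial aside, not part of the patch set: the always-permitted
   commands are plain state queries. A minimal Python sketch exercising
   FIONREAD on an anonymous pipe shows the kind of operation involved.)

```python
import fcntl
import os
import struct
import termios

# FIONREAD reports how many bytes are queued for reading -- it reveals
# nothing the caller could not learn by simply read(2)-ing the
# descriptor, which is why it is treated as harmless.
r, w = os.pipe()
os.write(w, b"hello")

buf = struct.pack("i", 0)               # int-sized output buffer
res = fcntl.ioctl(r, termios.FIONREAD, buf)
(avail,) = struct.unpack("i", res)
print(avail)                            # 5
```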

V2:
 * rebased on mic-next
 * added documentation
 * exercise ioctl(2) in the memfd test
 * test: Use layout0 for the test

---

V1: https://lore.kernel.org/linux-security-module/20230502171755.9788-1-gnoack3000@gmail.com/
V2: https://lore.kernel.org/linux-security-module/20230623144329.136541-1-gnoack@google.com/
V3: https://lore.kernel.org/linux-security-module/20230814172816.3907299-1-gnoack@google.com/
V4: https://lore.kernel.org/linux-security-module/20231103155717.78042-1-gnoack@google.com/
V5: https://lore.kernel.org/linux-security-module/20231117154920.1706371-1-gnoack@google.com/
V6: https://lore.kernel.org/linux-security-module/20231124173026.3257122-1-gnoack@google.com/
V7: https://lore.kernel.org/linux-security-module/20231201143042.3276833-1-gnoack@google.com/
V8: https://lore.kernel.org/linux-security-module/20231208155121.1943775-1-gnoack@google.com/
V9: https://lore.kernel.org/linux-security-module/20240209170612.1638517-1-gnoack@google.com/
V10: https://lore.kernel.org/linux-security-module/20240309075320.160128-1-gnoack@google.com/

Günther Noack (8):
  landlock: Add IOCTL access right for character and block devices
  selftests/landlock: Test IOCTL support
  selftests/landlock: Test IOCTL with memfds
  selftests/landlock: Test ioctl(2) and ftruncate(2) with open(O_PATH)
  selftests/landlock: Test IOCTLs on named pipes
  selftests/landlock: Check IOCTL restrictions for named UNIX domain
    sockets
  samples/landlock: Add support for LANDLOCK_ACCESS_FS_IOCTL_DEV
  landlock: Document IOCTL support

Mickaël Salaün (1):
  fs: Add and use vfs_get_ioctl_handler()

 Documentation/userspace-api/landlock.rst     |  76 +++-
 fs/ioctl.c                                   | 213 +++++++---
 include/linux/fs.h                           |   6 +
 include/uapi/linux/landlock.h                |  35 +-
 samples/landlock/sandboxer.c                 |  13 +-
 security/landlock/fs.c                       |  52 ++-
 security/landlock/limits.h                   |   2 +-
 security/landlock/syscalls.c                 |   8 +-
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   | 405 ++++++++++++++++++-
 10 files changed, 704 insertions(+), 108 deletions(-)


base-commit: a17c60e533f5cd832e77e0d194e2e0bb663371b6
-- 
2.44.0.396.g6e790dbe36-goog


^ permalink raw reply	[relevance 2%]

* Re: [PATCH v2] xfs: allow cross-linking special files without project quota
  2024-03-15  2:48  5% ` Darrick J. Wong
@ 2024-03-15  9:35  0%   ` Andrey Albershteyn
  2024-04-05 22:22  0%   ` Andrey Albershteyn
  1 sibling, 0 replies; 200+ results
From: Andrey Albershteyn @ 2024-03-15  9:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-fsdevel, linux-xfs, chandan.babu

On 2024-03-14 19:48:26, Darrick J. Wong wrote:
> On Thu, Mar 14, 2024 at 06:07:02PM +0100, Andrey Albershteyn wrote:
> > There's an issue that if a special file is created before a quota
> > project is enabled, then it's not possible to link this file. This
> > works fine for normal files. This happens because xfs_quota skips
> > special files (no ioctls to set necessary flags). The check for
> > having the same project ID for source and destination then fails as
> > the source file doesn't have any ID.
> > 
> > mkfs.xfs -f /dev/sda
> > mount -o prjquota /dev/sda /mnt/test
> > 
> > mkdir /mnt/test/foo
> > mkfifo /mnt/test/foo/fifo1
> > 
> > xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test
> > > Setting up project 9 (path /mnt/test/foo)...
> > > xfs_quota: skipping special file /mnt/test/foo/fifo1
> > > Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1).
> > 
> > ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link
> > > ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link
> 
> Aha.  So hardlinking special files within a directory subtree that all
> have the same nonzero project quota ID fails if that special file
> happened to have been created before the subtree was assigned that pqid.
> And there's nothing we can do about that, because there's no way to call
> XFS_IOC_SETFSXATTR on a special file because opening those gets you a
> different inode from the special block/fifo/chardev filesystem...
> 
> > mkfifo /mnt/test/foo/fifo2
> > ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link
> > 
> > Fix this by allowing linking of special files to the project quota
> > if the special file doesn't have any ID set (ID = 0).
> 
> ...and that's the workaround for this situation.  The project quota
> accounting here will be weird because there will be (more) files in a
> directory subtree than is reported by xfs_quota, but the subtree was
> already messed up in that manner.

Yeah, there's already that prj ID = 0 file, so nothing changes
regarding accounting.

> Question: Should we have a XFS_IOC_SETFSXATTRAT where we can pass in
> relative directory paths and actually query/update special files?

Added to xfs_quota so that it doesn't skip them? It would probably
solve the issue, but for existing filesystems with projects this would
require going through all of the special files.

> 
> > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
> > ---
> >  fs/xfs/xfs_inode.c | 15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 1fd94958aa97..b7be19be0132 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1240,8 +1240,19 @@ xfs_link(
> >  	 */
> >  	if (unlikely((tdp->i_diflags & XFS_DIFLAG_PROJINHERIT) &&
> >  		     tdp->i_projid != sip->i_projid)) {
> > -		error = -EXDEV;
> > -		goto error_return;
> > +		/*
> > +		 * Project quota setup skips special files which can
> > +		 * leave inodes in a PROJINHERIT directory without a
> > +		 * project ID set. We need to allow links to be made
> > +		 * to these "project-less" inodes because userspace
> > +		 * expects them to succeed after project ID setup,
> > +		 * but everything else should be rejected.
> > +		 */
> > +		if (!special_file(VFS_I(sip)->i_mode) ||
> > +		    sip->i_projid != 0) {
> > +			error = -EXDEV;
> > +			goto error_return;
> > +		}
> >  	}
> >  
> >  	if (!resblks) {
> > -- 
> > 2.42.0
> > 
> > 
> 

-- 
- Andrey



* Re: Fwd: [GIT PULL] vfs uuid
  @ 2024-03-15  3:06  5%   ` Kent Overstreet
  0 siblings, 0 replies; 200+ results
From: Kent Overstreet @ 2024-03-15  3:06 UTC (permalink / raw)
  To: Steve French; +Cc: linux-fsdevel, Christian Brauner, CIFS, Kent Overstreet

On Thu, Mar 14, 2024 at 02:55:50PM -0500, Steve French wrote:
> Do you have sample programs for these programs (or even better
> mini-xfstest programs) that we can use to make sure this e.g. works
> for cifs.ko (which has similar concept to FS UUID for most remote
> filesystems etc.)?

https://evilpiepirate.org/git/query-uuid.git/


* Re: [PATCH v2] xfs: allow cross-linking special files without project quota
  @ 2024-03-15  2:48  5% ` Darrick J. Wong
  2024-03-15  9:35  0%   ` Andrey Albershteyn
  2024-04-05 22:22  0%   ` Andrey Albershteyn
  0 siblings, 2 replies; 200+ results
From: Darrick J. Wong @ 2024-03-15  2:48 UTC (permalink / raw)
  To: Andrey Albershteyn; +Cc: david, linux-fsdevel, linux-xfs, chandan.babu

On Thu, Mar 14, 2024 at 06:07:02PM +0100, Andrey Albershteyn wrote:
> There's an issue that if a special file is created before a quota
> project is enabled, then it's not possible to link this file. This
> works fine for normal files. This happens because xfs_quota skips
> special files (no ioctls to set necessary flags). The check for
> having the same project ID for source and destination then fails as
> the source file doesn't have any ID.
> 
> mkfs.xfs -f /dev/sda
> mount -o prjquota /dev/sda /mnt/test
> 
> mkdir /mnt/test/foo
> mkfifo /mnt/test/foo/fifo1
> 
> xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test
> > Setting up project 9 (path /mnt/test/foo)...
> > xfs_quota: skipping special file /mnt/test/foo/fifo1
> > Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1).
> 
> ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link
> > ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link

Aha.  So hardlinking special files within a directory subtree that all
have the same nonzero project quota ID fails if that special file
happened to have been created before the subtree was assigned that pqid.
And there's nothing we can do about that, because there's no way to call
XFS_IOC_SETFSXATTR on a special file because opening those gets you a
different inode from the special block/fifo/chardev filesystem...

> mkfifo /mnt/test/foo/fifo2
> ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link
> 
> Fix this by allowing linking of special files to the project quota
> if the special file doesn't have any ID set (ID = 0).

...and that's the workaround for this situation.  The project quota
accounting here will be weird because there will be (more) files in a
directory subtree than is reported by xfs_quota, but the subtree was
already messed up in that manner.

Question: Should we have a XFS_IOC_SETFSXATTRAT where we can pass in
relative directory paths and actually query/update special files?

> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_inode.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 1fd94958aa97..b7be19be0132 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1240,8 +1240,19 @@ xfs_link(
>  	 */
>  	if (unlikely((tdp->i_diflags & XFS_DIFLAG_PROJINHERIT) &&
>  		     tdp->i_projid != sip->i_projid)) {
> -		error = -EXDEV;
> -		goto error_return;
> +		/*
> +		 * Project quota setup skips special files which can
> +		 * leave inodes in a PROJINHERIT directory without a
> +		 * project ID set. We need to allow links to be made
> +		 * to these "project-less" inodes because userspace
> +		 * expects them to succeed after project ID setup,
> +		 * but everything else should be rejected.
> +		 */
> +		if (!special_file(VFS_I(sip)->i_mode) ||
> +		    sip->i_projid != 0) {
> +			error = -EXDEV;
> +			goto error_return;
> +		}
>  	}
>  
>  	if (!resblks) {
> -- 
> 2.42.0
> 
> 


* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2024-03-12 20:26  0%             ` Sean Christopherson
@ 2024-03-13 17:11  0%               ` Steven Price
  0 siblings, 0 replies; 200+ results
From: Steven Price @ 2024-03-13 17:11 UTC (permalink / raw)
  To: Sean Christopherson, Michael Roth
  Cc: kvm, Suzuki K Poulose, tabba, linux-coco, linux-mm, linux-crypto,
	x86, linux-kernel, linux-fsdevel, pbonzini, isaku.yamahata,
	ackerleytng, vbabka, ashish.kalra, nikunj.dadhania, jroedel,
	pankaj.gupta

On 12/03/2024 20:26, Sean Christopherson wrote:
> On Mon, Mar 11, 2024, Michael Roth wrote:
>> On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
>>> On Fri, Feb 09, 2024, Steven Price wrote:
>>>>>> One option that I've considered is to implement a seperate CCA ioctl to
>>>>>> notify KVM whether the memory should be mapped protected.
>>>>>
>>>>> That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
>>>>
>>>> Sorry, I really didn't explain that well. Yes effectively this is the
>>>> attribute flag, but there's corner cases for destruction of the VM. My
>>>> thought was that if the VMM wanted to tear down part of the protected
>>>> range (without making it shared) then a separate ioctl would be needed
>>>> to notify KVM of the unmap.
>>>
>>> No new uAPI should be needed, because the only scenario time a benign VMM should
>>> do this is if the guest also knows the memory is being removed, in which case
>>> PUNCH_HOLE will suffice.
>>>
>>>>>> This 'solves' the problem nicely except for the case where the VMM
>>>>>> deliberately punches holes in memory which the guest is using.
>>>>>
>>>>> I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
>>>>> so don't do that.
>>>>
>>>> A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
>>>> my concern here is a VMM which is trying to break the host. In this case
>>>> either the PUNCH_HOLE needs to fail, or we actually need to recover the
>>>> memory from the guest (effectively killing the guest in the process).
>>>
>>> The latter.  IIRC, we talked about this exact case somewhere in the hour-long
>>> rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
>>> this multiple times on-list, though I don't know that there is a single thread
>>> that captures the entire plan.
>>>
>>> The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
>>> instance that's attached to a given guest_memfd inode when a page is being fully
>>> removed, i.e. when a page is being freed back to the normal memory pool.  Something
>>> like this proposed SNP patch[2].
>>>
> > > Mike, do you have WIP patches you can share?
>>
>> Sorry, I missed this query earlier. I'm a bit confused though, I thought
>> the kvm_arch_gmem_invalidate() hook provided in this patch was what we
>> ended up agreeing on during the PUCK call in question.
> 
> Heh, I trust your memory of things far more than I trust mine.  I'm just proving
> Cunningham's Law.  :-)
> 
>> There was an open question about what to do if a use-case came along
>> where we needed to pass additional parameters to
>> kvm_arch_gmem_invalidate() other than just the start/end PFN range for
>> the pages being freed, but we'd determined that SNP and TDX did not
>> currently need this, so I didn't have any changes planned in this
>> regard.
>>
>> If we now have such a need, what we had proposed was to modify
>> __filemap_remove_folio()/page_cache_delete() to defer setting
>> folio->mapping to NULL so that we could still access it in
>> kvm_gmem_free_folio() so that we can still access mapping->i_private_list
>> to get the list of gmem/KVM instances and pass them on via
>> kvm_arch_gmem_invalidate().
> 
> Yeah, this is what I was remembering.  I obviously forgot that we didn't have a
> need to iterate over all bindings at this time.
> 
>> So that's doable, but it's not clear from this discussion that that's
>> needed.
> 
> Same here.  And even if it is needed, it's not your problem to solve.  The above
> blurb about needing to preserve folio->mapping being free_folio() is sufficient
> to get the ARM code moving in the right direction.
> 
> Thanks!
> 
>> If the idea to block/kill the guest if VMM tries to hole-punch,
>> and ARM CCA already has plans to wire up the shared/private flags in
>> kvm_unmap_gfn_range(), wouldn't that have all the information needed to
>> kill that guest? At that point, kvm_gmem_free_folio() can handle
>> additional per-page cleanup (with additional gmem/KVM info plumbed in
>> if necessary).

Yes, the missing piece of the puzzle was provided by "KVM: Prepare for
handling only shared mappings in mmu_notifier events"[1] - namely the
"only_shared" flag. We don't need to actually block/kill the guest until
it attempts access to the memory which has been removed from the guest -
at that point the guest cannot continue because the security properties
have been violated (the protected memory contents have been lost) so
attempts to continue the guest will fail.

You can ignore most of my other ramblings - as long as everyone is happy
with that flag then Arm CCA should be fine. I was just looking at other
options.

Thanks,

Steve

[1]
https://lore.kernel.org/lkml/20231027182217.3615211-13-seanjc@google.com/


* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  2024-03-11 17:24  5%           ` Michael Roth
@ 2024-03-12 20:26  0%             ` Sean Christopherson
  2024-03-13 17:11  0%               ` Steven Price
  0 siblings, 1 reply; 200+ results
From: Sean Christopherson @ 2024-03-12 20:26 UTC (permalink / raw)
  To: Michael Roth
  Cc: Steven Price, kvm, Suzuki K Poulose, tabba, linux-coco, linux-mm,
	linux-crypto, x86, linux-kernel, linux-fsdevel, pbonzini,
	isaku.yamahata, ackerleytng, vbabka, ashish.kalra,
	nikunj.dadhania, jroedel, pankaj.gupta

On Mon, Mar 11, 2024, Michael Roth wrote:
> On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
> > On Fri, Feb 09, 2024, Steven Price wrote:
> > > >> One option that I've considered is to implement a seperate CCA ioctl to
> > > >> notify KVM whether the memory should be mapped protected.
> > > > 
> > > > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> > > 
> > > Sorry, I really didn't explain that well. Yes effectively this is the
> > > attribute flag, but there's corner cases for destruction of the VM. My
> > > thought was that if the VMM wanted to tear down part of the protected
> > > range (without making it shared) then a separate ioctl would be needed
> > > to notify KVM of the unmap.
> > 
> > No new uAPI should be needed, because the only scenario time a benign VMM should
> > do this is if the guest also knows the memory is being removed, in which case
> > PUNCH_HOLE will suffice.
> > 
> > > >> This 'solves' the problem nicely except for the case where the VMM
> > > >> deliberately punches holes in memory which the guest is using.
> > > > 
> > > > I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> > > > so don't do that.
> > > 
> > > A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> > > my concern here is a VMM which is trying to break the host. In this case
> > > either the PUNCH_HOLE needs to fail, or we actually need to recover the
> > > memory from the guest (effectively killing the guest in the process).
> > 
> > The latter.  IIRC, we talked about this exact case somewhere in the hour-long
> > rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
> > this multiple times on-list, though I don't know that there is a single thread
> > that captures the entire plan.
> > 
> > The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
> > instance that's attached to a given guest_memfd inode when a page is being fully
> > removed, i.e. when a page is being freed back to the normal memory pool.  Something
> > like this proposed SNP patch[2].
> > 
> > Mike, do you have WIP patches you can share?
> 
> Sorry, I missed this query earlier. I'm a bit confused though, I thought
> the kvm_arch_gmem_invalidate() hook provided in this patch was what we
> ended up agreeing on during the PUCK call in question.

Heh, I trust your memory of things far more than I trust mine.  I'm just proving
Cunningham's Law.  :-)

> There was an open question about what to do if a use-case came along
> where we needed to pass additional parameters to
> kvm_arch_gmem_invalidate() other than just the start/end PFN range for
> the pages being freed, but we'd determined that SNP and TDX did not
> currently need this, so I didn't have any changes planned in this
> regard.
> 
> If we now have such a need, what we had proposed was to modify
> __filemap_remove_folio()/page_cache_delete() to defer setting
> folio->mapping to NULL so that we could still access it in
> kvm_gmem_free_folio() so that we can still access mapping->i_private_list
> to get the list of gmem/KVM instances and pass them on via
> kvm_arch_gmem_invalidate().

Yeah, this is what I was remembering.  I obviously forgot that we didn't have a
need to iterate over all bindings at this time.

> So that's doable, but it's not clear from this discussion that that's
> needed.

Same here.  And even if it is needed, it's not your problem to solve.  The above
blurb about needing to preserve folio->mapping being free_folio() is sufficient
to get the ARM code moving in the right direction.

Thanks!

> If the idea to block/kill the guest if VMM tries to hole-punch,
> and ARM CCA already has plans to wire up the shared/private flags in
> kvm_unmap_gfn_range(), wouldn't that have all the information needed to
> kill that guest? At that point, kvm_gmem_free_folio() can handle
> additional per-page cleanup (with additional gmem/KVM info plumbed in
> if necessary).


* Re: [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory
  @ 2024-03-11 17:24  5%           ` Michael Roth
  2024-03-12 20:26  0%             ` Sean Christopherson
  0 siblings, 1 reply; 200+ results
From: Michael Roth @ 2024-03-11 17:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Steven Price, kvm, Suzuki K Poulose, tabba, linux-coco, linux-mm,
	linux-crypto, x86, linux-kernel, linux-fsdevel, pbonzini,
	isaku.yamahata, ackerleytng, vbabka, ashish.kalra,
	nikunj.dadhania, jroedel, pankaj.gupta

On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
> On Fri, Feb 09, 2024, Steven Price wrote:
> > >> One option that I've considered is to implement a seperate CCA ioctl to
> > >> notify KVM whether the memory should be mapped protected.
> > > 
> > > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> > 
> > Sorry, I really didn't explain that well. Yes effectively this is the
> > attribute flag, but there's corner cases for destruction of the VM. My
> > thought was that if the VMM wanted to tear down part of the protected
> > range (without making it shared) then a separate ioctl would be needed
> > to notify KVM of the unmap.
> 
> No new uAPI should be needed, because the only scenario time a benign VMM should
> do this is if the guest also knows the memory is being removed, in which case
> PUNCH_HOLE will suffice.
> 
> > >> This 'solves' the problem nicely except for the case where the VMM
> > >> deliberately punches holes in memory which the guest is using.
> > > 
> > > I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> > > so don't do that.
> > 
> > A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> > my concern here is a VMM which is trying to break the host. In this case
> > either the PUNCH_HOLE needs to fail, or we actually need to recover the
> > memory from the guest (effectively killing the guest in the process).
> 
> The latter.  IIRC, we talked about this exact case somewhere in the hour-long
> rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
> this multiple times on-list, though I don't know that there is a single thread
> that captures the entire plan.
> 
> The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
> instance that's attached to a given guest_memfd inode when a page is being fully
> removed, i.e. when a page is being freed back to the normal memory pool.  Something
> like this proposed SNP patch[2].
> 
> Mike, do you have WIP patches you can share?

Sorry, I missed this query earlier. I'm a bit confused though, I thought
the kvm_arch_gmem_invalidate() hook provided in this patch was what we
ended up agreeing on during the PUCK call in question.

There was an open question about what to do if a use-case came along
where we needed to pass additional parameters to
kvm_arch_gmem_invalidate() other than just the start/end PFN range for
the pages being freed, but we'd determined that SNP and TDX did not
currently need this, so I didn't have any changes planned in this
regard.

If we now have such a need, what we had proposed was to modify
__filemap_remove_folio()/page_cache_delete() to defer setting
folio->mapping to NULL so that we could still access it in
kvm_gmem_free_folio() so that we can still access mapping->i_private_list
to get the list of gmem/KVM instances and pass them on via
kvm_arch_gmem_invalidate().

So that's doable, but it's not clear from this discussion that that's
needed. If the idea to block/kill the guest if VMM tries to hole-punch,
and ARM CCA already has plans to wire up the shared/private flags in
kvm_unmap_gfn_range(), wouldn't that have all the information needed to
kill that guest? At that point, kvm_gmem_free_folio() can handle
additional per-page cleanup (with additional gmem/KVM info plumbed in
if necessary).
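
(Editorial aside: the hole punching referred to throughout this thread
is the ordinary fallocate(2) operation. A minimal userspace sketch on a
plain temporary file -- not a guest_memfd -- illustrates why it is
destructive:)

```python
import ctypes
import os
import tempfile

# fallocate(2) mode flags, from <linux/falloc.h>
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

libc = ctypes.CDLL(None, use_errno=True)
# int fallocate(int fd, int mode, off_t offset, off_t len)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_int64, ctypes.c_int64]

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 65536)
os.fsync(fd)
before = os.fstat(fd).st_blocks

# Deallocate the backing blocks while keeping the file size; reads of
# the punched range then return zeroes -- the old contents are gone,
# which is what makes PUNCH_HOLE destructive for memory a guest is
# still using.
ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     0, 65536)
if ret == 0:                            # some filesystems lack support
    assert os.pread(fd, 4, 0) == b"\x00" * 4
    assert os.fstat(fd).st_blocks < before
assert os.fstat(fd).st_size == 65536    # KEEP_SIZE preserved the size

os.close(fd)
os.unlink(path)
```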

-Mike


[1] https://lore.kernel.org/kvm/20240202230611.351544-1-seanjc@google.com/T/


> 
> [1] https://drive.google.com/corp/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ
> [2] https://lore.kernel.org/all/20231230172351.574091-30-michael.roth@amd.com


* Re: [PATCH v2] statx: stx_subvol
  2024-03-11  5:30  5%             ` Miklos Szeredi
@ 2024-03-11  5:49  0%               ` Kent Overstreet
  0 siblings, 0 replies; 200+ results
From: Kent Overstreet @ 2024-03-11  5:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dave Chinner, Darrick J. Wong, Neal Gompa, linux-fsdevel,
	linux-bcachefs, linux-btrfs, linux-kernel, Josef Bacik,
	Miklos Szeredi, Christian Brauner, David Howells

On Mon, Mar 11, 2024 at 06:30:21AM +0100, Miklos Szeredi wrote:
> On Mon, 11 Mar 2024 at 03:17, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Mar 08, 2024 at 08:56:33AM -0800, Darrick J. Wong wrote:
> > > Should the XFS data and rt volumes be reported with different stx_vol
> > > values?
> >
> > No, because all the inodes are on the data volume and the same inode
> > can have data on the data volume or the rt volume. i.e. "data on rt,
> > truncate, clear rt, copy data back into data dev".  It's still the
> > same inode, and may have exactly the same data, so why should change
> > stx_vol and make it appear to userspace as being a different inode?
> 
> Because stx_vol must not be used by userspace to distinguish between
> unique inodes.  To determine if two inodes are distinct within a
> filesystem (which may have many volumes) it should query the file
> handle and compare that.
> 
> If we'll have a filesystem that has a different stx_vol but the same
> fh, all the better.

I agree that stx_vol should not be used for uniqueness testing, but
that's a non sequitur here; Dave's talking about the fact that the
volume isn't constant for a given inode on XFS. And that's a good point;
volumes on XFS don't map to the filesystem path hierarchy in a nice
clean way like on btrfs and bcachefs (and presumably ZFS).

Subvolumes on btrfs and bcachefs form a tree, and that's something we
should document about stx_subvol - recursively enumerable things are
quite nice to work with.
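
(Editorial aside: on kernels that report it, stx_subvol can be read
with statx(2). The ctypes sketch below transcribes the struct statx
layout from include/uapi/linux/stat.h; treat the field offsets and the
STATX_SUBVOL mask value as assumptions to verify against your headers.)

```python
import ctypes
import os

AT_FDCWD = -100
STATX_BASIC_STATS = 0x07ff
STATX_SUBVOL = 0x8000        # assumed value; check <linux/stat.h>

class StatxTimestamp(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_int64),
                ("tv_nsec", ctypes.c_uint32),
                ("__reserved", ctypes.c_int32)]

class Statx(ctypes.Structure):
    # Layout transcribed from struct statx; padded to 256 bytes, the
    # size the kernel fills in.
    _fields_ = [("stx_mask", ctypes.c_uint32),
                ("stx_blksize", ctypes.c_uint32),
                ("stx_attributes", ctypes.c_uint64),
                ("stx_nlink", ctypes.c_uint32),
                ("stx_uid", ctypes.c_uint32),
                ("stx_gid", ctypes.c_uint32),
                ("stx_mode", ctypes.c_uint16),
                ("__spare0", ctypes.c_uint16),
                ("stx_ino", ctypes.c_uint64),
                ("stx_size", ctypes.c_uint64),
                ("stx_blocks", ctypes.c_uint64),
                ("stx_attributes_mask", ctypes.c_uint64),
                ("stx_atime", StatxTimestamp),
                ("stx_btime", StatxTimestamp),
                ("stx_ctime", StatxTimestamp),
                ("stx_mtime", StatxTimestamp),
                ("stx_rdev_major", ctypes.c_uint32),
                ("stx_rdev_minor", ctypes.c_uint32),
                ("stx_dev_major", ctypes.c_uint32),
                ("stx_dev_minor", ctypes.c_uint32),
                ("stx_mnt_id", ctypes.c_uint64),
                ("stx_dio_mem_align", ctypes.c_uint32),
                ("stx_dio_offset_align", ctypes.c_uint32),
                ("stx_subvol", ctypes.c_uint64),
                ("__spare3", ctypes.c_uint64 * 11)]

libc = ctypes.CDLL(None, use_errno=True)

def query_statx(path):
    """Call the glibc statx(2) wrapper, asking for stx_subvol too."""
    sx = Statx()
    if libc.statx(AT_FDCWD, os.fsencode(path), 0,
                  STATX_BASIC_STATS | STATX_SUBVOL,
                  ctypes.byref(sx)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    return sx

sx = query_statx(".")
if sx.stx_mask & STATX_SUBVOL:       # only set where the fs reports it
    print("subvolume id:", sx.stx_subvol)
else:
    print("stx_subvol not reported on this kernel/filesystem")
```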


* Re: [PATCH v2] statx: stx_subvol
  @ 2024-03-11  5:30  5%             ` Miklos Szeredi
  2024-03-11  5:49  0%               ` Kent Overstreet
  0 siblings, 1 reply; 200+ results
From: Miklos Szeredi @ 2024-03-11  5:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Kent Overstreet, Neal Gompa, linux-fsdevel,
	linux-bcachefs, linux-btrfs, linux-kernel, Josef Bacik,
	Miklos Szeredi, Christian Brauner, David Howells

On Mon, 11 Mar 2024 at 03:17, Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Mar 08, 2024 at 08:56:33AM -0800, Darrick J. Wong wrote:
> > Should the XFS data and rt volumes be reported with different stx_vol
> > values?
>
> No, because all the inodes are on the data volume and the same inode
> can have data on the data volume or the rt volume. i.e. "data on rt,
> truncate, clear rt, copy data back into data dev".  It's still the
> same inode, and may have exactly the same data, so why should change
> stx_vol and make it appear to userspace as being a different inode?

Because stx_vol must not be used by userspace to distinguish between
unique inodes.  To determine if two inodes are distinct within a
filesystem (which may have many volumes) it should query the file
handle and compare that.

If we'll have a filesystem that has a different stx_vol but the same
fh, all the better.

Thanks,
Miklos

^ permalink raw reply	[relevance 5%]

* Re: [LSF/MM/BPF TOPIC] statx attributes
  2024-03-07 20:03  0%       ` Steve French
@ 2024-03-07 20:22  0%         ` Andrew Walker
  0 siblings, 0 replies; 200+ results
From: Andrew Walker @ 2024-03-07 20:22 UTC (permalink / raw)
  To: Steve French
  Cc: Kent Overstreet, Amir Goldstein, lsf-pc, CIFS, samba-technical,
	linux-fsdevel, Jan Kara, Christian Brauner, David Howells

On Thu, Mar 7, 2024 at 2:04 PM Steve French <smfrench@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 11:45 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Thu, Mar 07, 2024 at 10:37:13AM -0600, Steve French wrote:
> > > > Which API is used in other OS to query the offline bit?
> > > > Do they use SMB specific API, as Windows does?
> > >
> > > No it is not smb specific - a local fs can also report this.  It is
> > > included in the attribute bits for files and directories, it also
> > > includes a few additional bits that are used by HSM software on local
> > > drives (e.g. FILE_ATTRIBUTE_PINNED when the file may not be taken
> > > offline by HSM software)
> > > ATTRIBUTE_HIDDEN
> > > ATTRIBUTE_SYSTEM
> > > ATTRIBUTE_DIRECTORY
> > > ATTRIBUTE_ARCHIVE
> > > ATTRIBUTE_TEMPORARY
> > > ATTRIBUTE_SPARSE_FILE
> > > ATTRIBUTE_REPARSE_POINT
> > > ATTRIBUTE_COMPRESSED
> > > ATTRIBUTE_NOT_CONTENT_INDEXED
> > > ATTRIBUTE_ENCRYPTED
> > > ATTRIBUTE_OFFLINE
> >
> > we've already got some of these as inode flags available with the
> > getflags ioctl (compressed, also perhaps encrypted?) - but statx does
> > seem a better place for them.
> >
> > statx can also report when they're supported, which does make sense for
> > these.
> >
> > ATTRIBUTE_DIRECTORY, though?
> >
> > we also need to try to define the semantics for these and not just dump
> > them in as just a bunch of identifiers if we want them to be used by
> > other things - and we do.
>
> They are all pretty clearly defined, but many are already in Linux,
> and a few are not relevant (e.g. ATTRIBUTE_DIRECTORY is handled in
> mode bits).  I suspect that Macs have equivalents of most of these
> too.

MacOS and FreeBSD return many of these in stat(2) output via st_flags.
Current set of supported flags are documented in chflags(2) manpage on both
platforms.

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] statx attributes
  2024-03-07 17:45  0%     ` Kent Overstreet
@ 2024-03-07 20:03  0%       ` Steve French
  2024-03-07 20:22  0%         ` Andrew Walker
  0 siblings, 1 reply; 200+ results
From: Steve French @ 2024-03-07 20:03 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Amir Goldstein, lsf-pc, CIFS, samba-technical, linux-fsdevel,
	Jan Kara, Christian Brauner, David Howells

On Thu, Mar 7, 2024 at 11:45 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, Mar 07, 2024 at 10:37:13AM -0600, Steve French wrote:
> > > Which API is used in other OS to query the offline bit?
> > > Do they use SMB specific API, as Windows does?
> >
> > No it is not smb specific - a local fs can also report this.  It is
> > included in the attribute bits for files and directories, it also
> > includes a few additional bits that are used by HSM software on local
> > drives (e.g. FILE_ATTRIBUTE_PINNED when the file may not be taken
> > offline by HSM software)
> > ATTRIBUTE_HIDDEN
> > ATTRIBUTE_SYSTEM
> > ATTRIBUTE_DIRECTORY
> > ATTRIBUTE_ARCHIVE
> > ATTRIBUTE_TEMPORARY
> > ATTRIBUTE_SPARSE_FILE
> > ATTRIBUTE_REPARSE_POINT
> > ATTRIBUTE_COMPRESSED
> > ATTRIBUTE_NOT_CONTENT_INDEXED
> > ATTRIBUTE_ENCRYPTED
> > ATTRIBUTE_OFFLINE
>
> we've already got some of these as inode flags available with the
> getflags ioctl (compressed, also perhaps encrypted?) - but statx does
> seem a better place for them.
>
> statx can also report when they're supported, which does make sense for
> these.
>
> ATTRIBUTE_DIRECTORY, though?
>
> we also need to try to define the semantics for these and not just dump
> them in as just a bunch of identifiers if we want them to be used by
> other things - and we do.

They are all pretty clearly defined, but many are already in Linux,
and a few are not relevant (e.g. ATTRIBUTE_DIRECTORY is handled in
mode bits).  I suspect that Macs have equivalents of most of these
too.


-- 
Thanks,

Steve

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size
  2024-03-07 20:01  0% ` Jarkko Sakkinen
@ 2024-03-07 20:03  0%   ` Jarkko Sakkinen
  0 siblings, 0 replies; 200+ results
From: Jarkko Sakkinen @ 2024-03-07 20:03 UTC (permalink / raw)
  To: Jarkko Sakkinen, Christian Brauner, linux-fsdevel
  Cc: Seth Forshee, linux-integrity, linux-security-module

On Thu Mar 7, 2024 at 10:01 PM EET, Jarkko Sakkinen wrote:
> On Tue Mar 5, 2024 at 2:27 PM EET, Christian Brauner wrote:
> > The vfs_getxattr_alloc() interface is a special-purpose in-kernel api
> > that does a racy query-size+allocate-buffer+retrieve-data. It is used by
> > EVM, IMA, and fscaps to retrieve xattrs. Recently, we've seen issues
> > where 9p returned values that amount to allocating about 8000GB worth of
> > memory (cf. [1]). That's now fixed in 9p. But vfs_getxattr_alloc() has
> > no reason to allow getting xattr values that are larger than
> > XATTR_SIZE_MAX as that's the limit we use for setting and getting xattr
> > values and nothing currently goes beyond that limit afaict. Let it check
> > for that and reject requests that are larger than that.
> >
> > Link: https://lore.kernel.org/r/ZeXcQmHWcYvfCR93@do-x1extreme [1]
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  fs/xattr.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 09d927603433..a53c930e3018 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -395,6 +395,9 @@ vfs_getxattr_alloc(struct mnt_idmap *idmap, struct dentry *dentry,
> >  	if (error < 0)
> >  		return error;
> >  
> > +	if (error > XATTR_SIZE_MAX)
> > +		return -E2BIG;
> > +
> >  	if (!value || (error > xattr_size)) {
> >  		value = krealloc(*xattr_value, error + 1, flags);
> >  		if (!value)
>
> I wonder if this should even be categorized as a bug fix and get
> backported. Good catch!

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

BR, Jarkko

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size
  2024-03-05 12:27  5% [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size Christian Brauner
                   ` (2 preceding siblings ...)
  2024-03-05 16:21  0% ` Christian Brauner
@ 2024-03-07 20:01  0% ` Jarkko Sakkinen
  2024-03-07 20:03  0%   ` Jarkko Sakkinen
  3 siblings, 1 reply; 200+ results
From: Jarkko Sakkinen @ 2024-03-07 20:01 UTC (permalink / raw)
  To: Christian Brauner, linux-fsdevel
  Cc: Seth Forshee, linux-integrity, linux-security-module

On Tue Mar 5, 2024 at 2:27 PM EET, Christian Brauner wrote:
> The vfs_getxattr_alloc() interface is a special-purpose in-kernel api
> that does a racy query-size+allocate-buffer+retrieve-data. It is used by
> EVM, IMA, and fscaps to retrieve xattrs. Recently, we've seen issues
> where 9p returned values that amount to allocating about 8000GB worth of
> memory (cf. [1]). That's now fixed in 9p. But vfs_getxattr_alloc() has
> no reason to allow getting xattr values that are larger than
> XATTR_SIZE_MAX as that's the limit we use for setting and getting xattr
> values and nothing currently goes beyond that limit afaict. Let it check
> for that and reject requests that are larger than that.
>
> Link: https://lore.kernel.org/r/ZeXcQmHWcYvfCR93@do-x1extreme [1]
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/xattr.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 09d927603433..a53c930e3018 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -395,6 +395,9 @@ vfs_getxattr_alloc(struct mnt_idmap *idmap, struct dentry *dentry,
>  	if (error < 0)
>  		return error;
>  
> +	if (error > XATTR_SIZE_MAX)
> +		return -E2BIG;
> +
>  	if (!value || (error > xattr_size)) {
>  		value = krealloc(*xattr_value, error + 1, flags);
>  		if (!value)

I wonder if this should even be categorized as a bug fix and get
backported. Good catch!

BR, Jarkko

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] statx attributes
  2024-03-07 16:37  0%   ` Steve French
@ 2024-03-07 17:45  0%     ` Kent Overstreet
  2024-03-07 20:03  0%       ` Steve French
  0 siblings, 1 reply; 200+ results
From: Kent Overstreet @ 2024-03-07 17:45 UTC (permalink / raw)
  To: Steve French
  Cc: Amir Goldstein, lsf-pc, CIFS, samba-technical, linux-fsdevel,
	Jan Kara, Christian Brauner, David Howells

On Thu, Mar 07, 2024 at 10:37:13AM -0600, Steve French wrote:
> > Which API is used in other OS to query the offline bit?
> > Do they use SMB specific API, as Windows does?
> 
> No it is not smb specific - a local fs can also report this.  It is
> included in the attribute bits for files and directories, it also
> includes a few additional bits that are used by HSM software on local
> drives (e.g. FILE_ATTRIBUTE_PINNED when the file may not be taken
> offline by HSM software)
> ATTRIBUTE_HIDDEN
> ATTRIBUTE_SYSTEM
> ATTRIBUTE_DIRECTORY
> ATTRIBUTE_ARCHIVE
> ATTRIBUTE_TEMPORARY
> ATTRIBUTE_SPARSE_FILE
> ATTRIBUTE_REPARSE_POINT
> ATTRIBUTE_COMPRESSED
> ATTRIBUTE_NOT_CONTENT_INDEXED
> ATTRIBUTE_ENCRYPTED
> ATTRIBUTE_OFFLINE

we've already got some of these as inode flags available with the
getflags ioctl (compressed, also perhaps encrypted?) - but statx does
seem a better place for them.

statx can also report when they're supported, which does make sense for
these.

ATTRIBUTE_DIRECTORY, though?

we also need to try to define the semantics for these and not just dump
them in as just a bunch of identifiers if we want them to be used by
other things - and we do.

ATTRIBUTE_TEMPORARY is the one I'm eyeing; I've been planning tmpfile
support in bcachefs, it'll turn fsyncs into noops and also ensure files
are deleted on unmount/remount.

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] statx attributes
  2024-03-07  8:54  5% ` Amir Goldstein
@ 2024-03-07 16:37  0%   ` Steve French
  2024-03-07 17:45  0%     ` Kent Overstreet
  0 siblings, 1 reply; 200+ results
From: Steve French @ 2024-03-07 16:37 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: lsf-pc, CIFS, samba-technical, linux-fsdevel, Jan Kara,
	Christian Brauner, Kent Overstreet, David Howells

> Which API is used in other OS to query the offline bit?
> Do they use SMB specific API, as Windows does?

No it is not smb specific - a local fs can also report this.  It is
included in the attribute bits for files and directories, it also
includes a few additional bits that are used by HSM software on local
drives (e.g. FILE_ATTRIBUTE_PINNED when the file may not be taken
offline by HSM software)
ATTRIBUTE_HIDDEN
ATTRIBUTE_SYSTEM
ATTRIBUTE_DIRECTORY
ATTRIBUTE_ARCHIVE
ATTRIBUTE_TEMPORARY
ATTRIBUTE_SPARSE_FILE
ATTRIBUTE_REPARSE_POINT
ATTRIBUTE_COMPRESSED
ATTRIBUTE_NOT_CONTENT_INDEXED
ATTRIBUTE_ENCRYPTED
ATTRIBUTE_OFFLINE

On Thu, Mar 7, 2024 at 2:54 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Mar 7, 2024 at 7:36 AM Steve French <smfrench@gmail.com> wrote:
> >
> > Following up on a discussion a few years ago about missing STATX
> > attributes, I noticed a case recently where some tools on other OS
> > have an option to skip offline files (e.g. the Windows equivalent of
> > grep, "findstr", and some Mac tools also seem to do this).
> >
>
> Which API is used in other OS to query the offline bit?
> Do they use SMB specific API, as Windows does?
>
> > This reminded me that there are a few additional STATX attribute flags
> > that could be helpful beyond the 8 that are currently defined (e.g.
> > STATX_ATTR_COMPRESSED, STATX_ATTR_ENCRYPTED, STATX_ATTR_NO_DUMP,
> > STATX_ATTR_VERITY) and that it be worthwhile revisiting which
> > additional STATX attribute flags would be most useful.
>
> I agree that it would be interesting to talk about new STATX_ attributes,
> but it should already be covered by this talk:
> https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/
>
> We have a recent example of what I see as a good process of
> introducing new STATX_ attributes:
> https://lore.kernel.org/linux-fsdevel/20240302220203.623614-1-kent.overstreet@linux.dev/
> 1. Kent needed stx_subvol_id for bcachefs, so he proposed a patch
> 2. The minimum required bikeshedding on the name ;)
> 3. Buy in by at least one other filesystem (btrfs)
>
> w.r.t attributes that only serve one filesystem, certainly a requirement from
> general purpose userspace tools will go a long way to help when introducing
> new attributes such as STATX_ATTR_OFFLINE, so if you get userspace
> projects to request this functionality I think you should be good to go.
>
> >
> > "offline" could be helpful for fuse and cifs.ko and probably multiple
> > fs to be able to report,
>
> I am not sure why you think that "offline" will be useful to fuse?
> Is there any other network fs that already has the concept of "offline"
> attribute?
>
> > but there are likely other examples that could help various filesystems.
>
> Maybe interesting for network fs that are integrated with fscache/netfs?
> It may be useful for netfs to be able to raise the STATX_ATTR_OFFLINE
> attribute for a certain cached file in some scenarios?
>
> As a developer of HSM API [1], where files on any fs could have an
> "offline" status,
> STATX_ATTR_OFFLINE is interesting to me, but only if local disk fs
> will map it to
> persistent inode flags.
>
> When I get to it, I may pick a victim local fs and write a patch for it.
>
> Thanks,
> Amir.
>
> [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API



-- 
Thanks,

Steve

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] statx attributes
  @ 2024-03-07  8:54  5% ` Amir Goldstein
  2024-03-07 16:37  0%   ` Steve French
  0 siblings, 1 reply; 200+ results
From: Amir Goldstein @ 2024-03-07  8:54 UTC (permalink / raw)
  To: Steve French
  Cc: lsf-pc, CIFS, samba-technical, linux-fsdevel, Jan Kara,
	Christian Brauner, Kent Overstreet, David Howells

On Thu, Mar 7, 2024 at 7:36 AM Steve French <smfrench@gmail.com> wrote:
>
> Following up on a discussion a few years ago about missing STATX
> attributes, I noticed a case recently where some tools on other OS
> have an option to skip offline files (e.g. the Windows equivalent of
> grep, "findstr", and some Mac tools also seem to do this).
>

Which API is used in other OS to query the offline bit?
Do they use SMB specific API, as Windows does?

> This reminded me that there are a few additional STATX attribute flags
> that could be helpful beyond the 8 that are currently defined (e.g.
> STATX_ATTR_COMPRESSED, STATX_ATTR_ENCRYPTED, STATX_ATTR_NO_DUMP,
> STATX_ATTR_VERITY) and that it be worthwhile revisiting which
> additional STATX attribute flags would be most useful.

I agree that it would be interesting to talk about new STATX_ attributes,
but it should already be covered by this talk:
https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/

We have a recent example of what I see as a good process of
introducing new STATX_ attributes:
https://lore.kernel.org/linux-fsdevel/20240302220203.623614-1-kent.overstreet@linux.dev/
1. Kent needed stx_subvol_id for bcachefs, so he proposed a patch
2. The minimum required bikeshedding on the name ;)
3. Buy in by at least one other filesystem (btrfs)

w.r.t attributes that only serve one filesystem, certainly a requirement from
general purpose userspace tools will go a long way to help when introducing
new attributes such as STATX_ATTR_OFFLINE, so if you get userspace
projects to request this functionality I think you should be good to go.

>
> "offline" could be helpful for fuse and cifs.ko and probably multiple
> fs to be able to report,

I am not sure why you think that "offline" will be useful to fuse?
Is there any other network fs that already has the concept of "offline"
attribute?

> but there are likely other examples that could help various filesystems.

Maybe interesting for network fs that are integrated with fscache/netfs?
It may be useful for netfs to be able to raise the STATX_ATTR_OFFLINE
attribute for a certain cached file in some scenarios?

As a developer of HSM API [1], where files on any fs could have an
"offline" status,
STATX_ATTR_OFFLINE is interesting to me, but only if local disk fs
will map it to
persistent inode flags.

When I get to it, I may pick a victim local fs and write a patch for it.

Thanks,
Amir.

[1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API

^ permalink raw reply	[relevance 5%]

* Re: [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size
  2024-03-05 12:27  5% [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size Christian Brauner
  2024-03-05 14:33  0% ` Seth Forshee
  2024-03-05 15:17  0% ` Serge E. Hallyn
@ 2024-03-05 16:21  0% ` Christian Brauner
  2024-03-07 20:01  0% ` Jarkko Sakkinen
  3 siblings, 0 replies; 200+ results
From: Christian Brauner @ 2024-03-05 16:21 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner
  Cc: Seth Forshee, linux-integrity, linux-security-module

On Tue, 05 Mar 2024 13:27:06 +0100, Christian Brauner wrote:
> The vfs_getxattr_alloc() interface is a special-purpose in-kernel api
> that does a racy query-size+allocate-buffer+retrieve-data. It is used by
> EVM, IMA, and fscaps to retrieve xattrs. Recently, we've seen issues
> where 9p returned values that amount to allocating about 8000GB worth of
> memory (cf. [1]). That's now fixed in 9p. But vfs_getxattr_alloc() has
> no reason to allow getting xattr values that are larger than
> > XATTR_SIZE_MAX as that's the limit we use for setting and getting xattr
> values and nothing currently goes beyond that limit afaict. Let it check
> for that and reject requests that are larger than that.
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] xattr: restrict vfs_getxattr_alloc() allocation size
      https://git.kernel.org/vfs/vfs/c/82a4c8736d72

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size
  2024-03-05 12:27  5% [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size Christian Brauner
  2024-03-05 14:33  0% ` Seth Forshee
@ 2024-03-05 15:17  0% ` Serge E. Hallyn
  2024-03-05 16:21  0% ` Christian Brauner
  2024-03-07 20:01  0% ` Jarkko Sakkinen
  3 siblings, 0 replies; 200+ results
From: Serge E. Hallyn @ 2024-03-05 15:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Seth Forshee, linux-integrity, linux-security-module

On Tue, Mar 05, 2024 at 01:27:06PM +0100, Christian Brauner wrote:
> The vfs_getxattr_alloc() interface is a special-purpose in-kernel api
> that does a racy query-size+allocate-buffer+retrieve-data. It is used by
> EVM, IMA, and fscaps to retrieve xattrs. Recently, we've seen issues
> where 9p returned values that amount to allocating about 8000GB worth of
> memory (cf. [1]). That's now fixed in 9p. But vfs_getxattr_alloc() has
> no reason to allow getting xattr values that are larger than
> XATTR_SIZE_MAX as that's the limit we use for setting and getting xattr
> values and nothing currently goes beyond that limit afaict. Let it check
> for that and reject requests that are larger than that.
> 
> Link: https://lore.kernel.org/r/ZeXcQmHWcYvfCR93@do-x1extreme [1]
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/xattr.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 09d927603433..a53c930e3018 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -395,6 +395,9 @@ vfs_getxattr_alloc(struct mnt_idmap *idmap, struct dentry *dentry,
>  	if (error < 0)
>  		return error;
>  
> +	if (error > XATTR_SIZE_MAX)
> +		return -E2BIG;
> +
>  	if (!value || (error > xattr_size)) {
>  		value = krealloc(*xattr_value, error + 1, flags);
>  		if (!value)
> -- 
> 2.43.0
> 

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size
  2024-03-05 12:27  5% [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size Christian Brauner
@ 2024-03-05 14:33  0% ` Seth Forshee
  2024-03-05 15:17  0% ` Serge E. Hallyn
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 200+ results
From: Seth Forshee @ 2024-03-05 14:33 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, linux-integrity, linux-security-module

On Tue, Mar 05, 2024 at 01:27:06PM +0100, Christian Brauner wrote:
> The vfs_getxattr_alloc() interface is a special-purpose in-kernel api
> that does a racy query-size+allocate-buffer+retrieve-data. It is used by
> EVM, IMA, and fscaps to retrieve xattrs. Recently, we've seen issues
> where 9p returned values that amount to allocating about 8000GB worth of
> memory (cf. [1]). That's now fixed in 9p. But vfs_getxattr_alloc() has
> no reason to allow getting xattr values that are larger than
> XATTR_SIZE_MAX as that's the limit we use for setting and getting xattr
> values and nothing currently goes beyond that limit afaict. Let it check
> for that and reject requests that are larger than that.
> 
> Link: https://lore.kernel.org/r/ZeXcQmHWcYvfCR93@do-x1extreme [1]
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Makes sense.

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>

^ permalink raw reply	[relevance 0%]

* [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size
@ 2024-03-05 12:27  5% Christian Brauner
  2024-03-05 14:33  0% ` Seth Forshee
                   ` (3 more replies)
  0 siblings, 4 replies; 200+ results
From: Christian Brauner @ 2024-03-05 12:27 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christian Brauner, Seth Forshee, linux-integrity, linux-security-module

The vfs_getxattr_alloc() interface is a special-purpose in-kernel api
that does a racy query-size+allocate-buffer+retrieve-data. It is used by
EVM, IMA, and fscaps to retrieve xattrs. Recently, we've seen issues
where 9p returned values that amount to allocating about 8000GB worth of
memory (cf. [1]). That's now fixed in 9p. But vfs_getxattr_alloc() has
no reason to allow getting xattr values that are larger than
XATTR_SIZE_MAX as that's the limit we use for setting and getting xattr
values and nothing currently goes beyond that limit afaict. Let it check
for that and reject requests that are larger than that.

Link: https://lore.kernel.org/r/ZeXcQmHWcYvfCR93@do-x1extreme [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/xattr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xattr.c b/fs/xattr.c
index 09d927603433..a53c930e3018 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -395,6 +395,9 @@ vfs_getxattr_alloc(struct mnt_idmap *idmap, struct dentry *dentry,
 	if (error < 0)
 		return error;
 
+	if (error > XATTR_SIZE_MAX)
+		return -E2BIG;
+
 	if (!value || (error > xattr_size)) {
 		value = krealloc(*xattr_value, error + 1, flags);
 		if (!value)
-- 
2.43.0

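The control flow of the patched helper can be modeled in a few lines of Python. This is a sketch only; `getxattr_alloc`, `query_size`, and `fetch` are illustrative stand-ins for the kernel's two-pass getxattr pattern, not a real API:

```python
# XATTR_SIZE_MAX mirrors the kernel's 64 KiB xattr value limit.
XATTR_SIZE_MAX = 65536

class E2BIGError(Exception):
    """Stands in for returning -E2BIG."""

def getxattr_alloc(query_size, fetch):
    """Model of the racy query-size + allocate-buffer + retrieve-data
    pattern: ask the fs how big the value is, then allocate and fetch.
    With the proposed check, any reported size above XATTR_SIZE_MAX is
    rejected before the allocation happens."""
    size = query_size()
    if size < 0:
        raise OSError(-size, "getxattr failed")
    if size > XATTR_SIZE_MAX:
        raise E2BIGError(size)
    buf = bytearray(size + 1)  # kernel krealloc()s error + 1 bytes
    return fetch(buf)
```

A buggy or malicious filesystem reporting an 8000 GB value size now fails fast instead of triggering a huge allocation.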

^ permalink raw reply related	[relevance 5%]

* Re: [PATCH v15 3/9] fuse: implement ioctls to manage backing files
  @ 2024-03-05 10:57  5%                   ` Miklos Szeredi
  0 siblings, 0 replies; 200+ results
From: Miklos Szeredi @ 2024-03-05 10:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Amir Goldstein, Jingbo Xu, Bernd Schubert,
	linux-fsdevel, Alessio Balsini

On Thu, 29 Feb 2024 at 11:17, Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Feb 29, 2024 at 11:15:35AM +0100, Christian Brauner wrote:
> > On Wed, Feb 28, 2024 at 04:01:17PM +0100, Miklos Szeredi wrote:
> > > On Wed, 28 Feb 2024 at 15:32, Jens Axboe <axboe@kernel.dk> wrote:
> > > >
> > > > On 2/28/24 4:28 AM, Amir Goldstein wrote:
> > >
> > > > > Are fixed files visible to lsof?
> > > >
> > > > lsof won't show them, but you can read the fdinfo of the io_uring fd to
> > > > see them. Would probably be possible to make lsof find and show them
> > > > too, but haven't looked into that.
> >
> > I actually wrote about this before when I suggested IORING_OP_FIXED_FD_INSTALL:
> > https://patchwork.kernel.org/project/io-uring/patch/df0e24ff-f3a0-4818-8282-2a4e03b7b5a6@kernel.dk/#25629935
>
> I think that it shouldn't be a problem as long as userspace has some way
> of figuring this out. So extending lsof might just be enough for this.

Problem is fdinfo on io_uring fd just contains the last component names.

Do we want full "magic symlink" semantics for these?  I'm not sure.
But just the last component does seem too little.

I've advocated using xattr for querying virtual attributes like these.
So I'll advocate again.   Does anyone see a problem with adding

getxattr("/proc/$PID/fdinfo/$IO_URING_FD",
"io_uring:fixed_files:$SLOT:path", buf, buflen);

?

Thanks,
Miklos

^ permalink raw reply	[relevance 5%]

* [PATCH v2 09/14] fs: Add FS_XFLAG_ATOMICWRITES flag
  @ 2024-03-04 13:04  5% ` John Garry
  0 siblings, 0 replies; 200+ results
From: John Garry @ 2024-03-04 13:04 UTC (permalink / raw)
  To: djwong, hch, viro, brauner, jack, chandan.babu, david, axboe
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, ritesh.list, linux-block, John Garry

Add a flag indicating that a regular file is enabled for atomic writes.

This is a file attribute that mirrors an ondisk inode flag.  Actual support
for untorn file writes (for now) depends on both the iflag and the
underlying storage devices, which we can only really check at statx and
pwritev2() time.  This is the same story as FS_XFLAG_DAX, which signals to
the fs that we should try to enable the fsdax IO path on the file (instead
of the regular page cache), but applications have to query STATX_ATTR_DAX
to find out if they really got that IO path.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/uapi/linux/fs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 8828822331bf..aacf54381718 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -142,6 +142,7 @@ struct fsxattr {
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
 /* data extent mappings for regular files must be aligned to extent size hint */
 #define FS_XFLAG_FORCEALIGN	0x00020000
+#define FS_XFLAG_ATOMICWRITES	0x00040000	/* atomic writes enabled */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.31.1

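Userspace would observe this attribute in the fsx_xflags word returned by the FS_IOC_FSGETXATTR ioctl. A minimal sketch of the bit test, using the values from the patch's hunk (the helper name is illustrative):

```python
# Flag values copied from the include/uapi/linux/fs.h hunk above.
FS_XFLAG_FORCEALIGN   = 0x00020000
FS_XFLAG_ATOMICWRITES = 0x00040000

def has_atomic_writes(fsx_xflags: int) -> bool:
    """True if the FS_IOC_FSGETXATTR-reported xflags word has the
    atomic-writes attribute bit set."""
    return bool(fsx_xflags & FS_XFLAG_ATOMICWRITES)
```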

^ permalink raw reply related	[relevance 5%]

* [RFC 4/8] ext4: Add statx and other atomic write helper routines
  @ 2024-03-02  7:42  6% ` Ritesh Harjani (IBM)
  0 siblings, 0 replies; 200+ results
From: Ritesh Harjani (IBM) @ 2024-03-02  7:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4
  Cc: Ojaswin Mujoo, Jan Kara, Theodore Ts'o, Matthew Wilcox,
	Darrick J . Wong, Luis Chamberlain, John Garry, linux-kernel,
	Ritesh Harjani (IBM)

This patch adds the statx (STATX_WRITE_ATOMIC) support in ext4_getattr()
to query for atomic_write_unit_min(awu_min), awu_max and other
attributes for atomic writes.
This adds a new runtime mount flag (EXT4_MF_ATOMIC_WRITE_FSAWU) for
querying whether ext4 supports atomic writes using the fsawu
(filesystem atomic write unit).

Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/ext4.h  | 53 ++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/inode.c | 16 +++++++++++++++
 2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 023571f8dd1b..1d2bce26e616 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1817,7 +1817,8 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
  */
 enum {
 	EXT4_MF_MNTDIR_SAMPLED,
-	EXT4_MF_FC_INELIGIBLE	/* Fast commit ineligible */
+	EXT4_MF_FC_INELIGIBLE,		/* Fast commit ineligible */
+	EXT4_MF_ATOMIC_WRITE_FSAWU	/* Atomic write via FSAWU */
 };
 
 static inline void ext4_set_mount_flag(struct super_block *sb, int bit)
@@ -3839,6 +3840,56 @@ static inline int ext4_buffer_uptodate(struct buffer_head *bh)
 	return buffer_uptodate(bh);
 }
 
+#define ext4_can_atomic_write_fsawu(sb)				\
+	ext4_test_mount_flag(sb, EXT4_MF_ATOMIC_WRITE_FSAWU)
+
+/**
+ * ext4_atomic_write_fsawu - Return the EXT4 filesystem atomic write units.
+ * @sb: super_block
+ *  This returns the filesystem min|max atomic write units.
+ *  For !bigalloc it is filesystem blocksize (fsawu_min)
+ *  For bigalloc it should be either blocksize or multiple of blocksize
+ *  (fsawu_min)
+ */
+static inline void ext4_atomic_write_fsawu(struct super_block *sb,
+					   unsigned int *fsawu_min,
+					   unsigned int *fsawu_max)
+{
+	u8 blkbits = sb->s_blocksize_bits;
+	unsigned int blocksize = 1U << blkbits;
+	unsigned int clustersize = blocksize;
+	struct block_device *bdev = sb->s_bdev;
+	unsigned int awu_min =
+			queue_atomic_write_unit_min_bytes(bdev->bd_queue);
+	unsigned int awu_max =
+			queue_atomic_write_unit_max_bytes(bdev->bd_queue);
+
+	if (ext4_has_feature_bigalloc(sb))
+		clustersize = 1U << (EXT4_SB(sb)->s_cluster_bits + blkbits);
+
+	/* fs min|max should respect awu_[min|max] units */
+	if (unlikely(awu_min > clustersize || awu_max < blocksize))
+		goto not_supported;
+
+	/* in case of !bigalloc fsawu_[min|max] should be same as blocksize */
+	if (!ext4_has_feature_bigalloc(sb)) {
+		*fsawu_min = blocksize;
+		*fsawu_max = blocksize;
+		return;
+	}
+
+	/* bigalloc can support writes in blocksize units, so advertise it */
+	*fsawu_min = max(blocksize, awu_min);
+	*fsawu_max = min(clustersize, awu_max);
+
+	/* This should never happen, but let's keep a WARN_ON_ONCE */
+	WARN_ON_ONCE(!IS_ALIGNED(clustersize, *fsawu_min));
+	return;
+not_supported:
+	*fsawu_min = 0;
+	*fsawu_max = 0;
+}
+
 #endif	/* __KERNEL__ */
 
 #define EFSBADCRC	EBADMSG		/* Bad CRC detected */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ccf3b5e3a7c..ea009ca9085d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5536,6 +5536,22 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
 		}
 	}
 
+	if (request_mask & STATX_WRITE_ATOMIC) {
+		unsigned int fsawu_min = 0, fsawu_max = 0;
+
+		/*
+		 * Get fsawu_[min|max] value which we can advertise to userspace
+		 * in statx call, if we support atomic writes using
+		 * EXT4_MF_ATOMIC_WRITE_FSAWU.
+		 */
+		if (ext4_can_atomic_write_fsawu(inode->i_sb)) {
+			ext4_atomic_write_fsawu(inode->i_sb, &fsawu_min,
+						&fsawu_max);
+		}
+
+		generic_fill_statx_atomic_writes(stat, fsawu_min, fsawu_max);
+	}
+
 	flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
 	if (flags & EXT4_APPEND_FL)
 		stat->attributes |= STATX_ATTR_APPEND;
-- 
2.43.0

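The clamping logic in ext4_atomic_write_fsawu() can be modeled as a pure function. This is a sketch of the same computation with illustrative parameter names, not the kernel code:

```python
def ext4_atomic_write_fsawu(blocksize, clustersize, awu_min, awu_max):
    """Mirror of the patch's logic: the device's atomic write unit
    window [awu_min, awu_max] must overlap [blocksize, clustersize].
    Without bigalloc, clustersize == blocksize and both units collapse
    to the filesystem block size; with bigalloc, the fs units are the
    intersection of the fs range and the device range."""
    if awu_min > clustersize or awu_max < blocksize:
        return (0, 0)                # not supported
    if clustersize == blocksize:     # the !bigalloc case
        return (blocksize, blocksize)
    return (max(blocksize, awu_min), min(clustersize, awu_max))
```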

^ permalink raw reply related	[relevance 6%]

* Re: [PATCH] xfs: stop advertising SB_I_VERSION
  2024-02-28  4:28  6% [PATCH] xfs: stop advertising SB_I_VERSION Dave Chinner
  2024-02-28 16:08  0% ` Darrick J. Wong
@ 2024-03-01 13:42  0% ` Jeff Layton
  1 sibling, 0 replies; 200+ results
From: Jeff Layton @ 2024-03-01 13:42 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs; +Cc: djwong, linux-fsdevel

On Wed, 2024-02-28 at 15:28 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The redefinition of how NFS wants inode->i_version to be updated is
> incompatible with the XFS i_version mechanism. The VFS now wants
> inode->i_version to only change when ctime changes (i.e. it has
> become a ctime change counter, not an inode change counter). XFS has
> fine grained timestamps, so it can just use ctime for the NFS change
> cookie like it still does for V4 XFS filesystems.
> 

Are you saying that XFS has timestamp granularity finer than
current_time() reports? I thought XFS used the same clocksource as
everyone else.

At LPC, you mentioned you had some patches in progress to use the unused
bits in the tv_nsec field as a change counter to track changes that
occurred within the same timer tick.

Did that not pan out for some reason? I'd like to understand why if so.
It sounded like a reasonable solution to the problem.
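For the record, the idea referred to here can be sketched like this: tv_nsec only needs 30 bits (0..999999999), so the high bits of a 64-bit nanoseconds field can carry a small change counter for modifications within one timer tick. The names and bit layout below are purely illustrative, not the in-progress XFS patches.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_BITS	30
#define NSEC_MASK	((1ULL << NSEC_BITS) - 1)

/* Pack a sub-tick change counter into the unused high bits. */
static uint64_t pack_nsec(uint32_t nsec, uint32_t counter)
{
	return ((uint64_t)counter << NSEC_BITS) | (nsec & NSEC_MASK);
}

static uint32_t unpack_nsec(uint64_t v)
{
	return (uint32_t)(v & NSEC_MASK);
}

static uint32_t unpack_counter(uint64_t v)
{
	return (uint32_t)(v >> NSEC_BITS);
}
```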

> 
> We still want XFS to update the inode change counter as it currently
> does, so convert all the code that checks SB_I_VERSION to check for
> v5 format support. Then we can remove the SB_I_VERSION flag from the
> VFS superblock to indicate that inode->i_version is not a valid
> change counter and should not be used as such.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 15 +++++----------
>  fs/xfs/xfs_iops.c               | 16 +++-------------
>  fs/xfs/xfs_super.c              |  8 --------
>  3 files changed, 8 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 70e97ea6eee7..8071aefad728 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -97,17 +97,12 @@ xfs_trans_log_inode(
>  
>  	/*
>  	 * First time we log the inode in a transaction, bump the inode change
> -	 * counter if it is configured for this to occur. While we have the
> -	 * inode locked exclusively for metadata modification, we can usually
> -	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
> -	 * the last time it was incremented. If we have XFS_ILOG_CORE already
> -	 * set however, then go ahead and bump the i_version counter
> -	 * unconditionally.
> +	 * counter if it is configured for this to occur.
>  	 */
> -	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> -		if (IS_I_VERSION(inode) &&
> -		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> -			flags |= XFS_ILOG_IVERSION;
> +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) &&
> +	    xfs_has_crc(ip->i_mount)) {
> +		inode->i_version++;
> +		flags |= XFS_ILOG_IVERSION;
>  	}
>  
>  	iip->ili_dirty_flags |= flags;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index be102fd49560..97e792d9d79a 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -584,11 +584,6 @@ xfs_vn_getattr(
>  		}
>  	}
>  
> -	if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> -		stat->change_cookie = inode_query_iversion(inode);
> -		stat->result_mask |= STATX_CHANGE_COOKIE;
> -	}
> -
>  	/*
>  	 * Note: If you add another clause to set an attribute flag, please
>  	 * update attributes_mask below.
> @@ -1044,16 +1039,11 @@ xfs_vn_update_time(
>  	struct timespec64	now;
>  
>  	trace_xfs_update_time(ip);
> +	ASSERT(!(flags & S_VERSION));
>  
>  	if (inode->i_sb->s_flags & SB_LAZYTIME) {
> -		if (!((flags & S_VERSION) &&
> -		      inode_maybe_inc_iversion(inode, false))) {
> -			generic_update_time(inode, flags);
> -			return 0;
> -		}
> -
> -		/* Capture the iversion update that just occurred */
> -		log_flags |= XFS_ILOG_CORE;
> +		generic_update_time(inode, flags);
> +		return 0;
>  	}
>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 6ce1e6deb7ec..657ce0423f1d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1692,10 +1692,6 @@ xfs_fs_fill_super(
>  
>  	set_posix_acl_flag(sb);
>  
> -	/* version 5 superblocks support inode version counters. */
> -	if (xfs_has_crc(mp))
> -		sb->s_flags |= SB_I_VERSION;
> -
>  	if (xfs_has_dax_always(mp)) {
>  		error = xfs_setup_dax_always(mp);
>  		if (error)
> @@ -1917,10 +1913,6 @@ xfs_fs_reconfigure(
>  	int			flags = fc->sb_flags;
>  	int			error;
>  
> -	/* version 5 superblocks always support version counters. */
> -	if (xfs_has_crc(mp))
> -		fc->sb_flags |= SB_I_VERSION;
> -
>  	error = xfs_fs_validate_params(new_mp);
>  	if (error)
>  		return error;

Acked-by: Jeff Layton <jlayton@kernel.org>

* Re: [PATCH v9 1/8] landlock: Add IOCTL access right
  2024-02-28 12:57  0%     ` Günther Noack
@ 2024-03-01 12:59  0%       ` Mickaël Salaün
  0 siblings, 0 replies; 200+ results
From: Mickaël Salaün @ 2024-03-01 12:59 UTC (permalink / raw)
  To: Günther Noack
  Cc: Arnd Bergmann, Christian Brauner, linux-security-module, Jeff Xu,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel

On Wed, Feb 28, 2024 at 01:57:42PM +0100, Günther Noack wrote:
> Hello Mickaël!
> 
> On Mon, Feb 19, 2024 at 07:34:42PM +0100, Mickaël Salaün wrote:
> > Arnd, Christian, please take a look at the following RFC patch and the
> > rationale explained here.
> > 
> > On Fri, Feb 09, 2024 at 06:06:05PM +0100, Günther Noack wrote:
> > > Introduces the LANDLOCK_ACCESS_FS_IOCTL access right
> > > and increments the Landlock ABI version to 5.
> > > 
> > > Like the truncate right, these rights are associated with a file
> > > descriptor at the time of open(2), and get respected even when the
> > > file descriptor is used outside of the thread which it was originally
> > > opened in.
> > > 
> > > A newly enabled Landlock policy therefore does not apply to file
> > > descriptors which are already open.
> > > 
> > > If the LANDLOCK_ACCESS_FS_IOCTL right is handled, only a small number
> > > of safe IOCTL commands will be permitted on newly opened files.  The
> > > permitted IOCTLs can be configured through the ruleset in limited ways
> > > now.  (See documentation for details.)
> > > 
> > > Specifically, when LANDLOCK_ACCESS_FS_IOCTL is handled, granting this
> > > right on a file or directory will *not* permit to do all IOCTL
> > > commands, but only influence the IOCTL commands which are not already
> > > handled through other access rights.  The intent is to keep the groups
> > > of IOCTL commands more fine-grained.
> > > 
> > > Noteworthy scenarios which require special attention:
> > > 
> > > TTY devices are often passed into a process from the parent process,
> > > and so a newly enabled Landlock policy does not retroactively apply to
> > > them automatically.  In the past, TTY devices have often supported
> > > IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which were
> > > letting callers control the TTY input buffer (and simulate
> > > keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
> > > modern kernels though.
> > > 
> > > Some legitimate file system features, like setting up fscrypt, are
> > > exposed as IOCTL commands on regular files and directories -- users of
> > > Landlock are advised to double check that the sandboxed process does
> > > not need to invoke these IOCTLs.
> > 
> > I think we really need to allow fscrypt and fs-verity IOCTLs.
> > 
> > > 
> > > Known limitations:
> > > 
> > > The LANDLOCK_ACCESS_FS_IOCTL access right is a coarse-grained control
> > > over IOCTL commands.  Future work will enable a more fine-grained
> > > access control for IOCTLs.
> > > 
> > > In the meantime, Landlock users may use path-based restrictions in
> > > combination with their knowledge about the file system layout to
> > > control what IOCTLs can be done.  Mounting file systems with the nodev
> > > option can help to distinguish regular files and devices, and give
> > > guarantees about the affected files, which Landlock alone can not give
> > > yet.
> > 
> > I had a second though about our current approach, and it looks like we
> > can do simpler, more generic, and with less IOCTL commands specific
> > handling.
> > 
> > What we didn't take into account is that an IOCTL needs an opened file,
> > which means that the caller must already have been allowed to open this
> > file in read or write mode.
> > 
> > I think most FS-specific IOCTL commands check access rights (i.e. access
> > mode or required capability), other than implicit ones (at least read or
> > write), when appropriate.  We don't get such guarantee with device
> > drivers.
> > 
> > The main threat is IOCTLs on character or block devices because their
> > impact may be unknown (if we only look at the IOCTL command, not the
> > backing file), but we should allow IOCTLs on filesystems (e.g. fscrypt,
> > fs-verity, clone extents).  I think we should only implement a
> > LANDLOCK_ACCESS_FS_IOCTL_DEV right, which would be more explicit.  This
> > change would impact the IOCTLs grouping (not required anymore), but
> > we'll still need the list of VFS IOCTLs.
> 
> 
> I am fine with dropping the IOCTL grouping and going for this simpler approach.
> 
> This must have been a misunderstanding - I thought you wanted to align the
> access checks in Landlock with the ones done by the kernel already, so that we
> can reason about it more locally.  But I'm fine with doing it just for device
> files as well, if that is what it takes.  It's definitely simpler.

I still think we should align existing Landlock access rights with the VFS IOCTL
semantic (i.e. mostly defined in do_vfs_ioctl(), but also in the compat
ioctl syscall).  However, according to our investigations and
discussions, it looks like the groups we defined should already be
enforced by the VFS code, which means we should not need such groups
after all.  My last proposal is to still delegate access for VFS IOCTLs
to the current Landlock access rights, but it doesn't seem required to
add a specific access check if we are able to identify these VFS IOCTLs.

> 
> Before I jump into the implementation, let me paraphrase your proposal to make
> sure I understood it correctly:
> 
>  * We *only* introduce the LANDLOCK_ACCESS_FS_IOCTL_DEV right.

Yes

> 
>  * This access right governs the use of nontrivial IOCTL commands on
>    character and block device files.
> 
>    * On open()ed files which are not character or block devices,
>      all IOCTL commands keep working.

Yes

> 
>      This includes pipes and sockets, but also a variety of "anonymous" file
>      types which are possibly openable through /proc/self/*/fd/*?

Indeed, and we should document that. It should be noted that these
"anonymous" file types only come from dedicated syscalls (which are not
currently controlled by Landlock) or from this synthetic proc interface.
One thing to keep in mind is that /proc/*/fd/* can only be opened on
tasks under the same sandbox (or a child one), so we should consider
that they are explicitly allowed by the policy the same way
pre-sandboxed inherited file descriptors are.

It might be interesting to list a few such anonymous file types.  Are
there any that can act on global resources (like block/char devices
can)?

I also think that most anonymous file types should check for FD's read
and write mode when it makes sense (which is not the case for most
block/char IOCTLs), but I might be wrong.

I think this LANDLOCK_ACCESS_FS_IOCTL_DEV design would be good for now,
and probably enough for most use cases.  This would fill a major gap in
an easy-to-understand-and-document way.

> 
>  * The trivial IOCTL commands are identified using the proposed function
>    vfs_masked_device_ioctl().
> 
>    * For these commands, the implementations are in fs/ioctl.c, except for
>      FIONREAD, in some cases.  We trust these implementations to check the
>      file's type (dir/regular) and access rights (r/w) correctly.

FIONREAD is explicitly not part of vfs_masked_device_ioctl() because it
is only defined for regular files (and forwarded to the underlying
implementation otherwise), hence the "masked_device" name. If the
underlying filesystem handles this IOCTL command for directory that's
fine, and we don't need explicit exception.

> 
> 
> Open questions I have:
> 
> * What about files which are neither devices nor regular files or directories?
> 
>   The obvious ones which can be open()ed are pipes, where only FIONREAND and two
>   harmless-looking watch queue IOCTLs are implemented.
> 
>   But then I think that /proc/*/fd/* is a way through which other non-device
>   files can become accessible?  What do we do for these?  (I am getting EACCES
>   when trying to open some anon_inodes that way... is this something we can
>   count on?)

As explained above, /proc/*/fd/* is already restricted per sandbox
scopes, which seem enough.

> 
> * How did you come up with the list in vfs_masked_device_ioctl()?  I notice that
>   some of these are from the switch() statement we had before, but not all of
>   them are included.
> 
>   I can kind of see that for the fallocate()-like ones and for FIBMAP, because
>   these **only** make sense for regular files, and IOCTLs on regular files are
>   permitted anyway.

I took inspiration from get_required_ioctl_access(), and built this list
looking at which of the VFS IOCTLs go through the VFS implementation
(mostly do_vfs_ioctl() but also the compat syscall) for IOCTL requests
on *block and character devices*.

The initial assumption is that file systems cannot implement block nor
character device IOCTLs, which is why this approach seems safe and
consistent.

> 
> * What do we do for FIONREAD?  Your patch says that it should be forwarded to
>   device implementations.  But technically, devices can implement all kinds of
>   surprising behaviour for that.

FIONREAD should always be allowed for non-device files (which means on
allowed-to-be-opened and non-device files), and controlled with
LANDLOCK_ACCESS_FS_IOCTL_DEV for character and block devices.

> 
>   If you look at the ioctl implementations of different drivers, you can very
>   quickly find a surprising amount of things that happen completely independent
>   of the IOCTL command.  (Some implementations are acquiring locks and other
>   resources before they even check what the cmd value is. - and we would be
>   exposing that if we let devices handle FIONREAD).

Correct, which is why FIONREAD on devices should be controlled by
LANDLOCK_ACCESS_FS_IOCTL_DEV.  See my previous email (below) with the
"is_device" checks.

> 
> 
> Please let me know whether I understood you correctly there.

I think so, but I guess you missed the "is_device" part.

> 
> Regarding the implementation notes you left below, I think they mostly derive
> from the *_IOCTL_DEV approach in a direct way.

Yes

> 
> 
> > > +static __attribute_const__ access_mask_t
> > > +get_required_ioctl_access(const unsigned int cmd)
> > > +{
> > > +	switch (cmd) {
> > > +	case FIOCLEX:
> > > +	case FIONCLEX:
> > > +	case FIONBIO:
> > > +	case FIOASYNC:
> > > +		/*
> > > +		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
> > > +		 * close-on-exec and the file's buffered-IO and async flags.
> > > +		 * These operations are also available through fcntl(2), and are
> > > +		 * unconditionally permitted in Landlock.
> > > +		 */
> > > +		return 0;
> > > +	case FIONREAD:
> > > +	case FIOQSIZE:
> > > +	case FIGETBSZ:
> > > +		/*
> > > +		 * FIONREAD returns the number of immediately readable bytes for
> > > +		 * a file.
> > > +		 *
> > > +		 * FIOQSIZE queries the size of a file or directory.
> > > +		 *
> > > +		 * FIGETBSZ queries the file system's block size for a file or
> > > +		 * directory.
> > > +		 *
> > > +		 * These IOCTL commands are permitted for files which are opened
> > > +		 * with LANDLOCK_ACCESS_FS_READ_DIR,
> > > +		 * LANDLOCK_ACCESS_FS_READ_FILE, or
> > > +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> > > +		 */
> > 
> > Because files or directories can only be opened with
> > LANDLOCK_ACCESS_FS_{READ,WRITE}_{FILE,DIR}, and because IOCTLs can only
> > be sent on a file descriptor, this means that we can always allow these
> > 3 commands (for opened files).
> > 
> > > +		return LANDLOCK_ACCESS_FS_IOCTL_RW;
> > > +	case FS_IOC_FIEMAP:
> > > +	case FIBMAP:
> > > +		/*
> > > +		 * FS_IOC_FIEMAP and FIBMAP query information about the
> > > +		 * allocation of blocks within a file.  They are permitted for
> > > +		 * files which are opened with LANDLOCK_ACCESS_FS_READ_FILE or
> > > +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> > > +		 */
> > > +		fallthrough;
> > > +	case FIDEDUPERANGE:
> > > +	case FICLONE:
> > > +	case FICLONERANGE:
> > > +		/*
> > > +		 * FIDEDUPERANGE, FICLONE and FICLONERANGE make files share
> > > +		 * their underlying storage ("reflink") between source and
> > > +		 * destination FDs, on file systems which support that.
> > > +		 *
> > > +		 * The underlying implementations are already checking whether
> > > +		 * the involved files are opened with the appropriate read/write
> > > +		 * modes.  We rely on this being implemented correctly.
> > > +		 *
> > > +		 * These IOCTLs are permitted for files which are opened with
> > > +		 * LANDLOCK_ACCESS_FS_READ_FILE or
> > > +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> > > +		 */
> > > +		fallthrough;
> > > +	case FS_IOC_RESVSP:
> > > +	case FS_IOC_RESVSP64:
> > > +	case FS_IOC_UNRESVSP:
> > > +	case FS_IOC_UNRESVSP64:
> > > +	case FS_IOC_ZERO_RANGE:
> > > +		/*
> > > +		 * These IOCTLs reserve space, or create holes like
> > > +		 * fallocate(2).  We rely on the implementations checking the
> > > +		 * files' read/write modes.
> > > +		 *
> > > +		 * These IOCTLs are permitted for files which are opened with
> > > +		 * LANDLOCK_ACCESS_FS_READ_FILE or
> > > +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> > > +		 */
> > 
> > These 10 commands only make sense on directories, so we could also
> > always allow them on file descriptors.
> 
> I imagine that's a typo?  The commands above do make sense on regular files.

Yes, I meant they "only make sense on regular files".

> 
> 
> > > +		return LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
> > > +	default:
> > > +		/*
> > > +		 * Other commands are guarded by the catch-all access right.
> > > +		 */
> > > +		return LANDLOCK_ACCESS_FS_IOCTL;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * expand_ioctl() - Return the dst flags from either the src flag or the
> > > + * %LANDLOCK_ACCESS_FS_IOCTL flag, depending on whether the
> > > + * %LANDLOCK_ACCESS_FS_IOCTL and src access rights are handled or not.
> > > + *
> > > + * @handled: Handled access rights.
> > > + * @access: The access mask to copy values from.
> > > + * @src: A single access right to copy from in @access.
> > > + * @dst: One or more access rights to copy to.
> > > + *
> > > + * Returns: @dst, or 0.
> > > + */
> > > +static __attribute_const__ access_mask_t
> > > +expand_ioctl(const access_mask_t handled, const access_mask_t access,
> > > +	     const access_mask_t src, const access_mask_t dst)
> > > +{
> > > +	access_mask_t copy_from;
> > > +
> > > +	if (!(handled & LANDLOCK_ACCESS_FS_IOCTL))
> > > +		return 0;
> > > +
> > > +	copy_from = (handled & src) ? src : LANDLOCK_ACCESS_FS_IOCTL;
> > > +	if (access & copy_from)
> > > +		return dst;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * landlock_expand_access_fs() - Returns @access with the synthetic IOCTL group
> > > + * flags enabled if necessary.
> > > + *
> > > + * @handled: Handled FS access rights.
> > > + * @access: FS access rights to expand.
> > > + *
> > > + * Returns: @access expanded by the necessary flags for the synthetic IOCTL
> > > + * access rights.
> > > + */
> > > +static __attribute_const__ access_mask_t landlock_expand_access_fs(
> > > +	const access_mask_t handled, const access_mask_t access)
> > > +{
> > > +	return access |
> > > +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_WRITE_FILE,
> > > +			    LANDLOCK_ACCESS_FS_IOCTL_RW |
> > > +				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
> > > +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_FILE,
> > > +			    LANDLOCK_ACCESS_FS_IOCTL_RW |
> > > +				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
> > > +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_DIR,
> > > +			    LANDLOCK_ACCESS_FS_IOCTL_RW);
> > > +}
> > > +
> > > +/**
> > > + * landlock_expand_handled_access_fs() - add synthetic IOCTL access rights to an
> > > + * access mask of handled accesses.
> > > + *
> > > + * @handled: The handled accesses of a ruleset that is being created.
> > > + *
> > > + * Returns: @handled, with the bits for the synthetic IOCTL access rights set,
> > > + * if %LANDLOCK_ACCESS_FS_IOCTL is handled.
> > > + */
> > > +__attribute_const__ access_mask_t
> > > +landlock_expand_handled_access_fs(const access_mask_t handled)
> > > +{
> > > +	return landlock_expand_access_fs(handled, handled);
> > > +}
> > > +
> > >  /* Ruleset management */
> > >  
> > >  static struct landlock_object *get_inode_object(struct inode *const inode)
> > > @@ -148,7 +331,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
> > >  	LANDLOCK_ACCESS_FS_EXECUTE | \
> > >  	LANDLOCK_ACCESS_FS_WRITE_FILE | \
> > >  	LANDLOCK_ACCESS_FS_READ_FILE | \
> > > -	LANDLOCK_ACCESS_FS_TRUNCATE)
> > > +	LANDLOCK_ACCESS_FS_TRUNCATE | \
> > > +	LANDLOCK_ACCESS_FS_IOCTL)
> > >  /* clang-format on */
> > >  
> > >  /*
> > > @@ -158,6 +342,7 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
> > >  			    const struct path *const path,
> > >  			    access_mask_t access_rights)
> > >  {
> > > +	access_mask_t handled;
> > >  	int err;
> > >  	struct landlock_id id = {
> > >  		.type = LANDLOCK_KEY_INODE,
> > > @@ -170,9 +355,11 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
> > >  	if (WARN_ON_ONCE(ruleset->num_layers != 1))
> > >  		return -EINVAL;
> > >  
> > > +	handled = landlock_get_fs_access_mask(ruleset, 0);
> > > +	/* Expands the synthetic IOCTL groups. */
> > > +	access_rights |= landlock_expand_access_fs(handled, access_rights);
> > >  	/* Transforms relative access rights to absolute ones. */
> > > -	access_rights |= LANDLOCK_MASK_ACCESS_FS &
> > > -			 ~landlock_get_fs_access_mask(ruleset, 0);
> > > +	access_rights |= LANDLOCK_MASK_ACCESS_FS & ~handled;
> > >  	id.key.object = get_inode_object(d_backing_inode(path->dentry));
> > >  	if (IS_ERR(id.key.object))
> > >  		return PTR_ERR(id.key.object);
> > > @@ -1333,7 +1520,9 @@ static int hook_file_open(struct file *const file)
> > >  {
> > >  	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
> > >  	access_mask_t open_access_request, full_access_request, allowed_access;
> > > -	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> > > +	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE |
> > > +					      LANDLOCK_ACCESS_FS_IOCTL |
> > > +					      IOCTL_GROUPS;
> > >  	const struct landlock_ruleset *const dom = get_current_fs_domain();
> > >  
> > >  	if (!dom)
> > 
> > We should set optional_access according to the file type before
> > `full_access_request = open_access_request | optional_access;`
> > 
> > const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
> > 
> > optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> > if (is_device)
> >     optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > 
> > 
> > Because LANDLOCK_ACCESS_FS_IOCTL_DEV is dedicated to character or block
> > devices, we may want landlock_add_rule() to only allow this access right
> > to be tied to directories, or character devices, or block devices.  Even
> > if it would be more consistent with constraints on directory-only access
> > rights, I'm not sure about that.
> > 
> > 
> > > @@ -1375,6 +1564,16 @@ static int hook_file_open(struct file *const file)
> > >  		}
> > >  	}
> > >  
> > > +	/*
> > > +	 * Named pipes should be treated just like anonymous pipes.
> > > +	 * Therefore, we permit all IOCTLs on them.
> > > +	 */
> > > +	if (S_ISFIFO(file_inode(file)->i_mode)) {
> > > +		allowed_access |= LANDLOCK_ACCESS_FS_IOCTL |
> > > +				  LANDLOCK_ACCESS_FS_IOCTL_RW |
> > > +				  LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
> > > +	}
> > 
> > Instead of this S_ISFIFO check:
> > 
> > if (!is_device)
> >     allowed_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > 
> > > +
> > >  	/*
> > >  	 * For operations on already opened files (i.e. ftruncate()), it is the
> > >  	 * access rights at the time of open() which decide whether the
> > > @@ -1406,6 +1605,25 @@ static int hook_file_truncate(struct file *const file)
> > >  	return -EACCES;
> > >  }
> > >  
> > > +static int hook_file_ioctl(struct file *file, unsigned int cmd,
> > > +			   unsigned long arg)
> > > +{
> > > +	const access_mask_t required_access = get_required_ioctl_access(cmd);
> > 
> > const access_mask_t required_access = LANDLOCK_ACCESS_FS_IOCTL_DEV;
> > 
> > 
> > > +	const access_mask_t allowed_access =
> > > +		landlock_file(file)->allowed_access;
> > > +
> > > +	/*
> > > +	 * It is the access rights at the time of opening the file which
> > > +	 * determine whether IOCTL can be used on the opened file later.
> > > +	 *
> > > +	 * The access right is attached to the opened file in hook_file_open().
> > > +	 */
> > > +	if ((allowed_access & required_access) == required_access)
> > > +		return 0;
> > 
> > We could then check against the do_vfs_ioctl()'s commands, excluding
> > FIONREAD and file_ioctl()'s commands, to always allow VFS-related
> > commands:
> > 
> > if (vfs_masked_device_ioctl(cmd))
> >     return 0;
> > 
> > As a safeguard, we could define vfs_masked_device_ioctl(cmd) in
> > fs/ioctl.c and make it called by do_vfs_ioctl() as a safeguard to make
> > sure we keep an accurate list of VFS IOCTL commands (see next RFC patch).
> 
> 
> > The compat IOCTL hook must also be implemented.
> 
> Thanks!  I can't believe I missed that one.
> 
> 
> > What do you think? Any better idea?
> 
> It seems like a reasonable approach.  I'd like to double check with you that we
> are on the same page about it before doing the next implementation step.  (These
> iterations seems cheaper when we do them in English than when we do them in C.)

We only reached this design because of the previous iterations, reviews
and discussions.  Implementation details matter in this case and it's
good to take time to convince ourselves of the best approach (and to
understand how underlying implementations work).  Finding a "simple"
interface that makes sense to control IOCTLs in an efficient way wasn't
obvious but I'm convinced we got it now.

Thanks for your perseverance!

> 
> Thanks for the review!
> —Günther
> 

* Re: [LSF/MM/BPF TOPIC] untorn buffered writes
  @ 2024-02-29  1:07  5%     ` Dave Chinner
  0 siblings, 0 replies; 200+ results
From: Dave Chinner @ 2024-02-29  1:07 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm

On Wed, Feb 28, 2024 at 05:33:54PM -0600, Theodore Ts'o wrote:
> On Wed, Feb 28, 2024 at 02:11:06PM +0000, Matthew Wilcox wrote:
> > I'm not entirely sure that it does become a mess.  If our implementation
> > of this ensures that each write ends up in a single folio (even if the
> > entire folio is larger than the write), then we will have satisfied the
> > semantics of the flag.
> 
> What if we do a 32k write which spans two folios?  And what
> if the physical pages for those 32k in the buffer cache are not
> contiguous?  Are you going to have to join the two 16k folios
> together, or maybe two 8k folios and an 16k folio, and relocate pages
> to make a contiguous 32k folio when we do a buffered RWF_ATOMIC write
> of size 32k?

RWF_ATOMIC defines the constraint that a 32kB write must be 32kB
aligned. So the only way a 32kB write would span two folios is if
a 16kB write had already been done in this space.
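That alignment constraint can be sketched as a simple check: the length must be a power of two within the advertised unit bounds, and the file offset must be naturally aligned to the length. This is an illustrative sketch of the constraint as described here; the unit bounds are whatever statx reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the RWF_ATOMIC size/alignment constraint. */
static bool atomic_write_ok(uint64_t pos, uint64_t len,
			    uint64_t unit_min, uint64_t unit_max)
{
	if (len < unit_min || len > unit_max)
		return false;
	if (len & (len - 1))
		return false;			/* must be a power of two */
	return (pos & (len - 1)) == 0;		/* naturally aligned offset */
}
```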

We are already dealing with this problem for bs > ps with the min
order mapping constraint. We can deal with it easily at the point
where we mark the inode as supporting atomic writes. That already
ensures physical extent allocation alignment, and we can also set
the mapping folio order at the same time to ensure that we only
allocate RWF_ATOMIC-compatible aligned/sized folios....

> > I think we'd be better off treating RWF_ATOMIC like it's a bs>PS device.

Which is why Willy says this...

> > That takes two somewhat special cases and makes them use the same code
> > paths, which probably means fewer bugs as both camps will be testing
> > the same code.
> 
> But for a bs > PS device, where the logical block size is greater than
> the page size, you don't need the RWF_ATOMIC flag at all.

Yes we do - hardware already supports REQ_ATOMIC sizes larger than
64kB filesystem blocks. i.e. RWF_ATOMIC is not restricted to 64kB
or any specific filesystem block size, and can always be larger than
the filesystem block size.

> All direct
> I/O writes *must* be a multiple of the logical sector size, and
> buffered writes, if they are smaller than the block size, *must* be
> handled as a read-modify-write, since you can't send writes to the
> device smaller than the logical sector size.

The filesystem will likely need to constrain minimum RWF_ATOMIC
sizes to a single filesystem block. That's the whole point of having
the statx interface - the application is going to have to query what
the min/max atomic write sizes supported are and adjust to those.
Applications will not be able to use 2kB RWF_ATOMIC writes on a 4kB
block size filesystem, and it's no different with larger filesystem
block sizes.

-Dave.

-- 
Dave Chinner
david@fromorbit.com

* [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND]
@ 2024-02-29  0:20  3% John Groves
  2024-04-23 13:30  0% ` [Lsf-pc] " Amir Goldstein
  0 siblings, 1 reply; 200+ results
From: John Groves @ 2024-02-29  0:20 UTC (permalink / raw)
  To: lsf-pc, Jonathan Corbet, Dan Williams, Vishal Verma, Dave Jiang,
	Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	linux-cxl, linux-fsdevel, nvdimm
  Cc: John Groves, John Groves, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Randy Dunlap, Jerome Glisse,
	David Rientjes, Johannes Weiner, John Hubbard, Zi Yan,
	Bharata B Rao, Aneesh Kumar K . V, Alistair Popple,
	Christoph Lameter, Andrew Morton, Jon Grimm, Brian Morris,
	Wei Xu, Theodore Ts'o, mykolal, Aravind Ramesh, Ajay Joshi,
	Eishan Mirakhur, Ravi Shankar, Srinivasulu Thanneeru

John Groves, Micron

Micron recently released the first RFC for famfs [1]. Although famfs is not
CXL-specific in any way, it aims to enable hosts to share data sets in shared
memory (such as CXL) by providing a memory-mappable fs-dax file system
interface to the memory.

Sharable disaggregated memory already exists in the lab, and will be possible
in the wild soon. Famfs aims to do the following:

* Provide an access method that provides isolation between files, and does not
  tempt developers to mmap all the memory writable on every host.
* Provide an access method that can be used by unmodified apps.

Without something like famfs, enabling the use of sharable memory will involve
the temptation to do things that may destabilize systems, such as
mapping large shared, writable global memory ranges and hooking allocators to
use it (potentially sacrificing isolation), and forcing the same virtual
address ranges in every host/process (compromising security).

The most obvious candidate app categories are data analytics and data lakes.
Both make heavy use of "zero-copy" data frames - column oriented data that
is laid out for efficient use via (MAP_SHARED) mmap. Moreover, these use case
categories are generally driven by Python code that wrangles data into
appropriate data frames - making it straightforward to put the data frames
into famfs. Furthermore, these use cases usually involve the shared data being
read-only during computation or query jobs - meaning they are often free of
cache coherency concerns.

Workloads such as these often deal with data sets that are too large to fit
in a single server's memory, so the data gets sharded - requiring movement via
a network. Sharded apps also sometimes have to do expensive reshuffling -
moving data to nodes with available compute resources. Avoiding the sharding
overheads by accessing such data sets in disaggregated shared memory looks
promising as a way to make better use of memory and compute resources, by
effectively de-duplicating data sets in memory.

About sharable memory

* Shared memory is pmem-like, in that hosts will connect in order to access
  pre-existing contents
* Onlining sharable memory as system-ram is nonsense; system-ram gets zeroed...
* CXL 3 provides for optionally-supported hardware-managed cache coherency
* But "multiple-readers, no writers" use cases don't need hardware support
  for coherency
* CXL 3.1 dynamic capacity devices (DCDs) should be thought of as devices with
  an allocator built in.
* When sharable capacity is allocated, each host that has access will see a
  /dev/dax device that can be found by the "tag" of the allocation. The tag is
  just a UUID.
* CXL 3.1 also allows the capacity associated with any allocated tag to be
  provided to each host (or host group) as either writable or read-only.

About famfs

Famfs is an append-only log-structured file system that places many limits
on what can be done. This allows famfs to tolerate clients with a stale copy
of metadata. All memory allocation and log maintenance is performed from user
space, but file extent lists are cached in the kernel for fast fault
resolution. The current limitations are fairly extreme, but many can be relaxed
by writing more code, managing Byzantine generals, etc. ;)

A famfs-enabled kernel can be cloned at [3], and the user space repo can be
cloned at [4]. Even with major functional limitations in its current form
(e.g. famfs does not currently support deleting files), it is sufficient to
use in data analytics workloads - in which you 1) create a famfs file system,
2) dump data sets into it, 3) run clustered jobs that consume the shared data
sets, and 4) dismount and deallocate the memory containing the file system.

Famfs Open Issues

* Volatile CXL memory is exposed as character dax devices; the famfs patch
  set adds the iomap API to character dax, which is required for fs-dax but
  until now was missing there.
* (/dev/pmem devices are block devices, and already support the iomap API
  for fs-dax file systems.)
* /dev/pmem devices can be converted to /dev/dax mode, but native /dev/dax
  devices cannot be converted to pmem mode.
* VFS layer hooks for a file system on a character device may be needed.
* Famfs has uncovered some previously latent bugs in the /dev/dax mmap
  machinery that probably require attention.
* Famfs currently works with either pmem or devdax devices, but our
  inclination is to drop pmem support to reduce the complexity of supporting
  two different underlying device types - particularly since famfs is not
  intended for actual pmem.


Required :-
Dan Williams
Christian Brauner
Jonathan Cameron
Dave Hansen

[LSF/MM + BPF ATTEND]

I am the author of the famfs file system. Famfs was first introduced at LPC
2023 [2]. I'm also Micron's voting member on the Software and Systems Working
Group (SSWG) of the CXL Consortium, and a co-author of the CXL 3.1
specification.


References

[1] https://lore.kernel.org/linux-fsdevel/cover.1708709155.git.john@groves.net/#t
[2] https://lpc.events/event/17/contributions/1455/
[3] https://www.computeexpresslink.org/download-the-specification
[4] https://github.com/cxl-micron-reskit/famfs-linux

Best regards,
John Groves
Micron

^ permalink raw reply	[relevance 3%]

* Re: [PATCH] xfs: stop advertising SB_I_VERSION
  2024-02-28  4:28  6% [PATCH] xfs: stop advertising SB_I_VERSION Dave Chinner
@ 2024-02-28 16:08  0% ` Darrick J. Wong
  2024-03-01 13:42  0% ` Jeff Layton
  1 sibling, 0 replies; 200+ results
From: Darrick J. Wong @ 2024-02-28 16:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, jlayton, linux-fsdevel

On Wed, Feb 28, 2024 at 03:28:59PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The redefinition of how NFS wants inode->i_version to be updated is
> incompatible with the XFS i_version mechanism. The VFS now wants
> inode->i_version to only change when ctime changes (i.e. it has
> become a ctime change counter, not an inode change counter). XFS has
> fine grained timestamps, so it can just use ctime for the NFS change
> cookie like it still does for V4 XFS filesystems.
> 
> We still want XFS to update the inode change counter as it currently
> does, so convert all the code that checks SB_I_VERSION to check for
> v5 format support. Then we can remove the SB_I_VERSION flag from the
> VFS superblock to indicate that inode->i_version is not a valid
> change counter and should not be used as such.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Seeing as NFS and XFS' definition of i_version have diverged, I suppose
divorce is the only option.  But please, let's get rid of all the
*iversion() calls in the codebase.

With my paranoia hat on: let's add an i_changecounter to xfs_inode and
completely stop using the inode.i_version to prevent the vfs from
messing with us.

At some point we can rev the ondisk format to add a new field so that
"di_version" can be whatever u64 cookie the vfs passes us through
i_version.

--D

> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 15 +++++----------
>  fs/xfs/xfs_iops.c               | 16 +++-------------
>  fs/xfs/xfs_super.c              |  8 --------
>  3 files changed, 8 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 70e97ea6eee7..8071aefad728 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -97,17 +97,12 @@ xfs_trans_log_inode(
>  
>  	/*
>  	 * First time we log the inode in a transaction, bump the inode change
> -	 * counter if it is configured for this to occur. While we have the
> -	 * inode locked exclusively for metadata modification, we can usually
> -	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
> -	 * the last time it was incremented. If we have XFS_ILOG_CORE already
> -	 * set however, then go ahead and bump the i_version counter
> -	 * unconditionally.
> +	 * counter if it is configured for this to occur.
>  	 */
> -	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> -		if (IS_I_VERSION(inode) &&
> -		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> -			flags |= XFS_ILOG_IVERSION;
> +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) &&
> +	    xfs_has_crc(ip->i_mount)) {
> +		inode->i_version++;
> +		flags |= XFS_ILOG_IVERSION;
>  	}
>  
>  	iip->ili_dirty_flags |= flags;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index be102fd49560..97e792d9d79a 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -584,11 +584,6 @@ xfs_vn_getattr(
>  		}
>  	}
>  
> -	if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> -		stat->change_cookie = inode_query_iversion(inode);
> -		stat->result_mask |= STATX_CHANGE_COOKIE;
> -	}
> -
>  	/*
>  	 * Note: If you add another clause to set an attribute flag, please
>  	 * update attributes_mask below.
> @@ -1044,16 +1039,11 @@ xfs_vn_update_time(
>  	struct timespec64	now;
>  
>  	trace_xfs_update_time(ip);
> +	ASSERT(!(flags & S_VERSION));
>  
>  	if (inode->i_sb->s_flags & SB_LAZYTIME) {
> -		if (!((flags & S_VERSION) &&
> -		      inode_maybe_inc_iversion(inode, false))) {
> -			generic_update_time(inode, flags);
> -			return 0;
> -		}
> -
> -		/* Capture the iversion update that just occurred */
> -		log_flags |= XFS_ILOG_CORE;
> +		generic_update_time(inode, flags);
> +		return 0;
>  	}
>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 6ce1e6deb7ec..657ce0423f1d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1692,10 +1692,6 @@ xfs_fs_fill_super(
>  
>  	set_posix_acl_flag(sb);
>  
> -	/* version 5 superblocks support inode version counters. */
> -	if (xfs_has_crc(mp))
> -		sb->s_flags |= SB_I_VERSION;
> -
>  	if (xfs_has_dax_always(mp)) {
>  		error = xfs_setup_dax_always(mp);
>  		if (error)
> @@ -1917,10 +1913,6 @@ xfs_fs_reconfigure(
>  	int			flags = fc->sb_flags;
>  	int			error;
>  
> -	/* version 5 superblocks always support version counters. */
> -	if (xfs_has_crc(mp))
> -		fc->sb_flags |= SB_I_VERSION;
> -
>  	error = xfs_fs_validate_params(new_mp);
>  	if (error)
>  		return error;
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v9 1/8] landlock: Add IOCTL access right
  2024-02-19 18:34  0%   ` Mickaël Salaün
@ 2024-02-28 12:57  0%     ` Günther Noack
  2024-03-01 12:59  0%       ` Mickaël Salaün
  0 siblings, 1 reply; 200+ results
From: Günther Noack @ 2024-02-28 12:57 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Arnd Bergmann, Christian Brauner, linux-security-module, Jeff Xu,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel

Hello Mickaël!

On Mon, Feb 19, 2024 at 07:34:42PM +0100, Mickaël Salaün wrote:
> Arnd, Christian, please take a look at the following RFC patch and the
> rationale explained here.
> 
> On Fri, Feb 09, 2024 at 06:06:05PM +0100, Günther Noack wrote:
> > Introduces the LANDLOCK_ACCESS_FS_IOCTL access right
> > and increments the Landlock ABI version to 5.
> > 
> > Like the truncate right, these rights are associated with a file
> > descriptor at the time of open(2), and get respected even when the
> > file descriptor is used outside of the thread which it was originally
> > opened in.
> > 
> > A newly enabled Landlock policy therefore does not apply to file
> > descriptors which are already open.
> > 
> > If the LANDLOCK_ACCESS_FS_IOCTL right is handled, only a small number
> > of safe IOCTL commands will be permitted on newly opened files.  The
> > permitted IOCTLs can be configured through the ruleset in limited ways
> > now.  (See documentation for details.)
> > 
> > Specifically, when LANDLOCK_ACCESS_FS_IOCTL is handled, granting this
> > right on a file or directory will *not* permit to do all IOCTL
> > commands, but only influence the IOCTL commands which are not already
> > handled through other access rights.  The intent is to keep the groups
> > of IOCTL commands more fine-grained.
> > 
> > Noteworthy scenarios which require special attention:
> > 
> > TTY devices are often passed into a process from the parent process,
> > and so a newly enabled Landlock policy does not retroactively apply to
> > them automatically.  In the past, TTY devices have often supported
> > IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which were
> > letting callers control the TTY input buffer (and simulate
> > keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
> > modern kernels though.
> > 
> > Some legitimate file system features, like setting up fscrypt, are
> > exposed as IOCTL commands on regular files and directories -- users of
> > Landlock are advised to double check that the sandboxed process does
> > not need to invoke these IOCTLs.
> 
> I think we really need to allow fscrypt and fs-verity IOCTLs.
> 
> > 
> > Known limitations:
> > 
> > The LANDLOCK_ACCESS_FS_IOCTL access right is a coarse-grained control
> > over IOCTL commands.  Future work will enable a more fine-grained
> > access control for IOCTLs.
> > 
> > In the meantime, Landlock users may use path-based restrictions in
> > combination with their knowledge about the file system layout to
> > control what IOCTLs can be done.  Mounting file systems with the nodev
> > option can help to distinguish regular files and devices, and give
> > guarantees about the affected files, which Landlock alone can not give
> > yet.
> 
> I had a second though about our current approach, and it looks like we
> can do simpler, more generic, and with less IOCTL commands specific
> handling.
> 
> What we didn't take into account is that an IOCTL needs an opened file,
> which means that the caller must already have been allowed to open this
> file in read or write mode.
> 
> I think most FS-specific IOCTL commands check access rights (i.e. access
> mode or required capability), other than implicit ones (at least read or
> write), when appropriate.  We don't get such guarantee with device
> drivers.
> 
> The main threat is IOCTLs on character or block devices because their
> impact may be unknown (if we only look at the IOCTL command, not the
> backing file), but we should allow IOCTLs on filesystems (e.g. fscrypt,
> fs-verity, clone extents).  I think we should only implement a
> LANDLOCK_ACCESS_FS_IOCTL_DEV right, which would be more explicit.  This
> change would impact the IOCTLs grouping (not required anymore), but
> we'll still need the list of VFS IOCTLs.


I am fine with dropping the IOCTL grouping and going for this simpler approach.

This must have been a misunderstanding - I thought you wanted to align the
access checks in Landlock with the ones done by the kernel already, so that we
can reason about it more locally.  But I'm fine with doing it just for device
files as well, if that is what it takes.  It's definitely simpler.

Before I jump into the implementation, let me paraphrase your proposal to make
sure I understood it correctly:

 * We *only* introduce the LANDLOCK_ACCESS_FS_IOCTL_DEV right.

 * This access right governs the use of nontrivial IOCTL commands on
   character and block device files.

   * On open()ed files which are not character or block devices,
     all IOCTL commands keep working.

     This includes pipes and sockets, but also a variety of "anonymous" file
     types which are possibly openable through /proc/*/fd/*?

 * The trivial IOCTL commands are identified using the proposed function
   vfs_masked_device_ioctl().

   * For these commands, the implementations are in fs/ioctl.c, except for
     FIONREAD, in some cases.  We trust these implementations to check the
     file's type (dir/regular) and access rights (r/w) correctly.


Open questions I have:

* What about files which are neither devices nor regular files or directories?

  The obvious ones which can be open()ed are pipes, where only FIONREAD and two
  harmless-looking watch queue IOCTLs are implemented.

  But then I think that /proc/*/fd/* is a way through which other non-device
  files can become accessible?  What do we do for these?  (I am getting EACCES
  when trying to open some anon_inodes that way... is this something we can
  count on?)

* How did you come up with the list in vfs_masked_device_ioctl()?  I notice that
  some of these are from the switch() statement we had before, but not all of
  them are included.

  I can kind of see that for the fallocate()-like ones and for FIBMAP, because
  these **only** make sense for regular files, and IOCTLs on regular files are
  permitted anyway.

* What do we do for FIONREAD?  Your patch says that it should be forwarded to
  device implementations.  But technically, devices can implement all kinds of
  surprising behaviour for that.

  If you look at the ioctl implementations of different drivers, you can very
  quickly find a surprising number of things that happen completely independently
  of the IOCTL command.  (Some implementations acquire locks and other
  resources before they even check what the cmd value is - and we would be
  exposing that if we let devices handle FIONREAD.)


Please let me know whether I understood you correctly there.

Regarding the implementation notes you left below, I think they mostly derive
from the *_IOCTL_DEV approach in a direct way.


> > +static __attribute_const__ access_mask_t
> > +get_required_ioctl_access(const unsigned int cmd)
> > +{
> > +	switch (cmd) {
> > +	case FIOCLEX:
> > +	case FIONCLEX:
> > +	case FIONBIO:
> > +	case FIOASYNC:
> > +		/*
> > +		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
> > +		 * close-on-exec and the file's buffered-IO and async flags.
> > +		 * These operations are also available through fcntl(2), and are
> > +		 * unconditionally permitted in Landlock.
> > +		 */
> > +		return 0;
> > +	case FIONREAD:
> > +	case FIOQSIZE:
> > +	case FIGETBSZ:
> > +		/*
> > +		 * FIONREAD returns the number of immediately readable bytes for
> > +		 * a file.
> > +		 *
> > +		 * FIOQSIZE queries the size of a file or directory.
> > +		 *
> > +		 * FIGETBSZ queries the file system's block size for a file or
> > +		 * directory.
> > +		 *
> > +		 * These IOCTL commands are permitted for files which are opened
> > +		 * with LANDLOCK_ACCESS_FS_READ_DIR,
> > +		 * LANDLOCK_ACCESS_FS_READ_FILE, or
> > +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> > +		 */
> 
> Because files or directories can only be opened with
> LANDLOCK_ACCESS_FS_{READ,WRITE}_{FILE,DIR}, and because IOCTLs can only
> be sent on a file descriptor, this means that we can always allow these
> 3 commands (for opened files).
> 
> > +		return LANDLOCK_ACCESS_FS_IOCTL_RW;
> > +	case FS_IOC_FIEMAP:
> > +	case FIBMAP:
> > +		/*
> > +		 * FS_IOC_FIEMAP and FIBMAP query information about the
> > +		 * allocation of blocks within a file.  They are permitted for
> > +		 * files which are opened with LANDLOCK_ACCESS_FS_READ_FILE or
> > +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> > +		 */
> > +		fallthrough;
> > +	case FIDEDUPERANGE:
> > +	case FICLONE:
> > +	case FICLONERANGE:
> > +		/*
> > +		 * FIDEDUPERANGE, FICLONE and FICLONERANGE make files share
> > +		 * their underlying storage ("reflink") between source and
> > +		 * destination FDs, on file systems which support that.
> > +		 *
> > +		 * The underlying implementations are already checking whether
> > +		 * the involved files are opened with the appropriate read/write
> > +		 * modes.  We rely on this being implemented correctly.
> > +		 *
> > +		 * These IOCTLs are permitted for files which are opened with
> > +		 * LANDLOCK_ACCESS_FS_READ_FILE or
> > +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> > +		 */
> > +		fallthrough;
> > +	case FS_IOC_RESVSP:
> > +	case FS_IOC_RESVSP64:
> > +	case FS_IOC_UNRESVSP:
> > +	case FS_IOC_UNRESVSP64:
> > +	case FS_IOC_ZERO_RANGE:
> > +		/*
> > +		 * These IOCTLs reserve space, or create holes like
> > +		 * fallocate(2).  We rely on the implementations checking the
> > +		 * files' read/write modes.
> > +		 *
> > +		 * These IOCTLs are permitted for files which are opened with
> > +		 * LANDLOCK_ACCESS_FS_READ_FILE or
> > +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> > +		 */
> 
> These 10 commands only make sense on directories, so we could also
> always allow them on file descriptors.

I imagine that's a typo?  The commands above do make sense on regular files.


> > +		return LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
> > +	default:
> > +		/*
> > +		 * Other commands are guarded by the catch-all access right.
> > +		 */
> > +		return LANDLOCK_ACCESS_FS_IOCTL;
> > +	}
> > +}
> > +
> > +/**
> > + * expand_ioctl() - Return the dst flags from either the src flag or the
> > + * %LANDLOCK_ACCESS_FS_IOCTL flag, depending on whether the
> > + * %LANDLOCK_ACCESS_FS_IOCTL and src access rights are handled or not.
> > + *
> > + * @handled: Handled access rights.
> > + * @access: The access mask to copy values from.
> > + * @src: A single access right to copy from in @access.
> > + * @dst: One or more access rights to copy to.
> > + *
> > + * Returns: @dst, or 0.
> > + */
> > +static __attribute_const__ access_mask_t
> > +expand_ioctl(const access_mask_t handled, const access_mask_t access,
> > +	     const access_mask_t src, const access_mask_t dst)
> > +{
> > +	access_mask_t copy_from;
> > +
> > +	if (!(handled & LANDLOCK_ACCESS_FS_IOCTL))
> > +		return 0;
> > +
> > +	copy_from = (handled & src) ? src : LANDLOCK_ACCESS_FS_IOCTL;
> > +	if (access & copy_from)
> > +		return dst;
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * landlock_expand_access_fs() - Returns @access with the synthetic IOCTL group
> > + * flags enabled if necessary.
> > + *
> > + * @handled: Handled FS access rights.
> > + * @access: FS access rights to expand.
> > + *
> > + * Returns: @access expanded by the necessary flags for the synthetic IOCTL
> > + * access rights.
> > + */
> > +static __attribute_const__ access_mask_t landlock_expand_access_fs(
> > +	const access_mask_t handled, const access_mask_t access)
> > +{
> > +	return access |
> > +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_WRITE_FILE,
> > +			    LANDLOCK_ACCESS_FS_IOCTL_RW |
> > +				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
> > +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_FILE,
> > +			    LANDLOCK_ACCESS_FS_IOCTL_RW |
> > +				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
> > +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_DIR,
> > +			    LANDLOCK_ACCESS_FS_IOCTL_RW);
> > +}
> > +
> > +/**
> > + * landlock_expand_handled_access_fs() - add synthetic IOCTL access rights to an
> > + * access mask of handled accesses.
> > + *
> > + * @handled: The handled accesses of a ruleset that is being created.
> > + *
> > + * Returns: @handled, with the bits for the synthetic IOCTL access rights set,
> > + * if %LANDLOCK_ACCESS_FS_IOCTL is handled.
> > + */
> > +__attribute_const__ access_mask_t
> > +landlock_expand_handled_access_fs(const access_mask_t handled)
> > +{
> > +	return landlock_expand_access_fs(handled, handled);
> > +}
> > +
> >  /* Ruleset management */
> >  
> >  static struct landlock_object *get_inode_object(struct inode *const inode)
> > @@ -148,7 +331,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
> >  	LANDLOCK_ACCESS_FS_EXECUTE | \
> >  	LANDLOCK_ACCESS_FS_WRITE_FILE | \
> >  	LANDLOCK_ACCESS_FS_READ_FILE | \
> > -	LANDLOCK_ACCESS_FS_TRUNCATE)
> > +	LANDLOCK_ACCESS_FS_TRUNCATE | \
> > +	LANDLOCK_ACCESS_FS_IOCTL)
> >  /* clang-format on */
> >  
> >  /*
> > @@ -158,6 +342,7 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
> >  			    const struct path *const path,
> >  			    access_mask_t access_rights)
> >  {
> > +	access_mask_t handled;
> >  	int err;
> >  	struct landlock_id id = {
> >  		.type = LANDLOCK_KEY_INODE,
> > @@ -170,9 +355,11 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
> >  	if (WARN_ON_ONCE(ruleset->num_layers != 1))
> >  		return -EINVAL;
> >  
> > +	handled = landlock_get_fs_access_mask(ruleset, 0);
> > +	/* Expands the synthetic IOCTL groups. */
> > +	access_rights |= landlock_expand_access_fs(handled, access_rights);
> >  	/* Transforms relative access rights to absolute ones. */
> > -	access_rights |= LANDLOCK_MASK_ACCESS_FS &
> > -			 ~landlock_get_fs_access_mask(ruleset, 0);
> > +	access_rights |= LANDLOCK_MASK_ACCESS_FS & ~handled;
> >  	id.key.object = get_inode_object(d_backing_inode(path->dentry));
> >  	if (IS_ERR(id.key.object))
> >  		return PTR_ERR(id.key.object);
> > @@ -1333,7 +1520,9 @@ static int hook_file_open(struct file *const file)
> >  {
> >  	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
> >  	access_mask_t open_access_request, full_access_request, allowed_access;
> > -	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> > +	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE |
> > +					      LANDLOCK_ACCESS_FS_IOCTL |
> > +					      IOCTL_GROUPS;
> >  	const struct landlock_ruleset *const dom = get_current_fs_domain();
> >  
> >  	if (!dom)
> 
> We should set optional_access according to the file type before
> `full_access_request = open_access_request | optional_access;`
> 
> const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);
> 
> optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> if (is_device)
>     optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
> 
> 
> Because LANDLOCK_ACCESS_FS_IOCTL_DEV is dedicated to character or block
> devices, we may want landlock_add_rule() to only allow this access right
> to be tied to directories, or character devices, or block devices.  Even
> if it would be more consistent with constraints on directory-only access
> rights, I'm not sure about that.
> 
> 
> > @@ -1375,6 +1564,16 @@ static int hook_file_open(struct file *const file)
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * Named pipes should be treated just like anonymous pipes.
> > +	 * Therefore, we permit all IOCTLs on them.
> > +	 */
> > +	if (S_ISFIFO(file_inode(file)->i_mode)) {
> > +		allowed_access |= LANDLOCK_ACCESS_FS_IOCTL |
> > +				  LANDLOCK_ACCESS_FS_IOCTL_RW |
> > +				  LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
> > +	}
> 
> Instead of this S_ISFIFO check:
> 
> if (!is_device)
>     allowed_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;
> 
> > +
> >  	/*
> >  	 * For operations on already opened files (i.e. ftruncate()), it is the
> >  	 * access rights at the time of open() which decide whether the
> > @@ -1406,6 +1605,25 @@ static int hook_file_truncate(struct file *const file)
> >  	return -EACCES;
> >  }
> >  
> > +static int hook_file_ioctl(struct file *file, unsigned int cmd,
> > +			   unsigned long arg)
> > +{
> > +	const access_mask_t required_access = get_required_ioctl_access(cmd);
> 
> const access_mask_t required_access = LANDLOCK_ACCESS_FS_IOCTL_DEV;
> 
> 
> > +	const access_mask_t allowed_access =
> > +		landlock_file(file)->allowed_access;
> > +
> > +	/*
> > +	 * It is the access rights at the time of opening the file which
> > +	 * determine whether IOCTL can be used on the opened file later.
> > +	 *
> > +	 * The access right is attached to the opened file in hook_file_open().
> > +	 */
> > +	if ((allowed_access & required_access) == required_access)
> > +		return 0;
> 
> We could then check against the do_vfs_ioctl()'s commands, excluding
> FIONREAD and file_ioctl()'s commands, to always allow VFS-related
> commands:
> 
> if (vfs_masked_device_ioctl(cmd))
>     return 0;
> 
> As a safeguard, we could define vfs_masked_device_ioctl(cmd) in
> fs/ioctl.c and make it called by do_vfs_ioctl() as a safeguard to make
> sure we keep an accurate list of VFS IOCTL commands (see next RFC patch).


> The compat IOCTL hook must also be implemented.

Thanks!  I can't believe I missed that one.


> What do you think? Any better idea?

It seems like a reasonable approach.  I'd like to double check with you that we
are on the same page about it before doing the next implementation step.  (These
iterations seem cheaper when we do them in English than when we do them in C.)

Thanks for the review!
—Günther

^ permalink raw reply	[relevance 0%]

* [PATCH] xfs: stop advertising SB_I_VERSION
@ 2024-02-28  4:28  6% Dave Chinner
  2024-02-28 16:08  0% ` Darrick J. Wong
  2024-03-01 13:42  0% ` Jeff Layton
  0 siblings, 2 replies; 200+ results
From: Dave Chinner @ 2024-02-28  4:28 UTC (permalink / raw)
  To: linux-xfs; +Cc: djwong, jlayton, linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

The redefinition of how NFS wants inode->i_version to be updated is
incompatible with the XFS i_version mechanism. The VFS now wants
inode->i_version to only change when ctime changes (i.e. it has
become a ctime change counter, not an inode change counter). XFS has
fine grained timestamps, so it can just use ctime for the NFS change
cookie like it still does for V4 XFS filesystems.

We still want XFS to update the inode change counter as it currently
does, so convert all the code that checks SB_I_VERSION to check for
v5 format support. Then we can remove the SB_I_VERSION flag from the
VFS superblock to indicate that inode->i_version is not a valid
change counter and should not be used as such.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_inode.c | 15 +++++----------
 fs/xfs/xfs_iops.c               | 16 +++-------------
 fs/xfs/xfs_super.c              |  8 --------
 3 files changed, 8 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 70e97ea6eee7..8071aefad728 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -97,17 +97,12 @@ xfs_trans_log_inode(
 
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
-	 * counter if it is configured for this to occur. While we have the
-	 * inode locked exclusively for metadata modification, we can usually
-	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
-	 * the last time it was incremented. If we have XFS_ILOG_CORE already
-	 * set however, then go ahead and bump the i_version counter
-	 * unconditionally.
+	 * counter if it is configured for this to occur.
 	 */
-	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
-		if (IS_I_VERSION(inode) &&
-		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
-			flags |= XFS_ILOG_IVERSION;
+	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) &&
+	    xfs_has_crc(ip->i_mount)) {
+		inode->i_version++;
+		flags |= XFS_ILOG_IVERSION;
 	}
 
 	iip->ili_dirty_flags |= flags;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index be102fd49560..97e792d9d79a 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -584,11 +584,6 @@ xfs_vn_getattr(
 		}
 	}
 
-	if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
-		stat->change_cookie = inode_query_iversion(inode);
-		stat->result_mask |= STATX_CHANGE_COOKIE;
-	}
-
 	/*
 	 * Note: If you add another clause to set an attribute flag, please
 	 * update attributes_mask below.
@@ -1044,16 +1039,11 @@ xfs_vn_update_time(
 	struct timespec64	now;
 
 	trace_xfs_update_time(ip);
+	ASSERT(!(flags & S_VERSION));
 
 	if (inode->i_sb->s_flags & SB_LAZYTIME) {
-		if (!((flags & S_VERSION) &&
-		      inode_maybe_inc_iversion(inode, false))) {
-			generic_update_time(inode, flags);
-			return 0;
-		}
-
-		/* Capture the iversion update that just occurred */
-		log_flags |= XFS_ILOG_CORE;
+		generic_update_time(inode, flags);
+		return 0;
 	}
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 6ce1e6deb7ec..657ce0423f1d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1692,10 +1692,6 @@ xfs_fs_fill_super(
 
 	set_posix_acl_flag(sb);
 
-	/* version 5 superblocks support inode version counters. */
-	if (xfs_has_crc(mp))
-		sb->s_flags |= SB_I_VERSION;
-
 	if (xfs_has_dax_always(mp)) {
 		error = xfs_setup_dax_always(mp);
 		if (error)
@@ -1917,10 +1913,6 @@ xfs_fs_reconfigure(
 	int			flags = fc->sb_flags;
 	int			error;
 
-	/* version 5 superblocks always support version counters. */
-	if (xfs_has_crc(mp))
-		fc->sb_flags |= SB_I_VERSION;
-
 	error = xfs_fs_validate_params(new_mp);
 	if (error)
 		return error;
-- 
2.43.0


^ permalink raw reply related	[relevance 6%]

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  6:21  0%                               ` Kent Overstreet
@ 2024-02-27 15:32  0%                                 ` Paul E. McKenney
  0 siblings, 0 replies; 200+ results
From: Paul E. McKenney @ 2024-02-27 15:32 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Tue, Feb 27, 2024 at 01:21:05AM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 09:17:41PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 08:08:17PM -0500, Kent Overstreet wrote:
> > > On Mon, Feb 26, 2024 at 04:55:29PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> > > > > On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > > > > > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > > > > > Well, we won't want it getting hammered on continuously - we should be
> > > > > > > able to tune reclaim so that doesn't happen.
> > > > > > > 
> > > > > > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > > > > > is probably first order of business - minor tweak to kfree_rcu() et al.
> > > > > > > for that; there's APIs they can query to maintain that counter.
> > > > > > 
> > > > > > We can easily tell you the number of blocks of memory waiting to be freed.
> > > > > > But RCU does not know their size.  Yes, we could ferret this out on each
> > > > > > call to kfree_rcu(), but that might not be great for performance.
> > > > > > We could traverse the lists at runtime, but such traversal must be done
> > > > > > with interrupts disabled, which is also not great.
> > > > > > 
> > > > > > > then, we can add a heuristic threshold somewhere, something like
> > > > > > > 
> > > > > > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > > > > > 	kick_rcu()
> > > > > > 
> > > > > > If it is a heuristic anyway, it sounds best to base the heuristic on
> > > > > > the number of objects rather than their aggregate size.
> > > > > 
> > > > > I don't think that'll really work given that object size can vary from <
> > > > > 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> > > > > way and I positively hate it; it's really helpful for introspection and
> > > > > debuggability later to give good human understandable units to this
> > > > > stuff.
> > > > 
> > > > You might well be right, but let's please try it before adding overhead to
> > > > kfree_rcu() and friends.  I bet it will prove to be good and sufficient.
> > > > 
> > > > > And __ksize() is pretty cheap, and I think there might be room in struct
> > > > > slab to stick the object size there instead of getting it from the slab
> > > > > cache - and folio_size() is cheaper still.
> > > > 
> > > > On __ksize():
> > > > 
> > > >  * This should only be used internally to query the true size of allocations.
> > > >  * It is not meant to be a way to discover the usable size of an allocation
> > > >  * after the fact. Instead, use kmalloc_size_roundup().
> > > > 
> > > > Except that kmalloc_size_roundup() doesn't look like it is meant for
> > > > this use case.  On __ksize() being used only internally, I would not be
> > > > at all averse to kfree_rcu() and friends moving to mm.
> > > 
> > > __ksize() is the right helper to use for this; ksize() is "how much
> > > usable memory", __ksize() is "how much does this occupy".
> > > 
> > > > The idea is for kfree_rcu() to invoke __ksize() when given slab memory
> > > > and folio_size() when given vmalloc() memory?
> > > 
> > > __ksize() for slab memory, but folio_size() would be for page
> > > allocations - actually, I think compound_order() is more appropriate
> > > here, but that's willy's area. IOW, for free_pages_rcu(), which AFAIK we
> > > don't have yet but it looks like we're going to need.
> > > 
> > > I'm scanning through vmalloc.c and I don't think we have a helper yet to
> > > query the allocation size - I can write one tomorrow, giving my brain a
> > > rest today :)
> > 
> > Again, let's give the straight count of blocks a try first.  I do see
> > that you feel that the added overhead is negligible, but zero added
> > overhead is even better.
> 
> How are you going to write a heuristic that works correctly both when
> the system is cycling through nothing but 2M hugepages, and nothing but
> 128 byte whatevers?

I could simply use the same general approach that I use within RCU
itself, which currently has absolutely no idea how much memory (if any)
each callback will free.  Especially given that some callbacks
free groups of memory blocks, while others free nothing.  ;-)

Alternatively, we could gather statistics on the amount of memory freed
by each callback and use that as an estimate.

But we should instead step back and ask exactly what we are trying to
accomplish here, which just might be what Dave Chinner was getting at.

At a ridiculously high level, reclaim is looking for memory to free.
Some read-only memory can often be dropped immediately on the grounds
that its data can be read back in if needed.  Other memory can only be
dropped after being written out, which involves a delay.  There are of
course many other complications, but this will do for a start.

So, where does RCU fit in?

RCU fits in between the two.  With memory awaiting RCU, there is no need
to write anything out, but there is a delay.  As such, memory waiting
for an RCU grace period is similar to memory that is to be reclaimed
after its I/O completes.

One complication, and a complication that we are considering exploiting,
is that, unlike reclaimable memory waiting for I/O, we could often
(but not always) have some control over how quickly RCU's grace periods
complete.  And we already do this programmatically by using the choice
between synchronize_rcu() and synchronize_rcu_expedited().  The question
is whether we should expedite normal RCU grace periods during reclaim,
and if so, under what conditions.

You identified one potential condition, namely the amount of memory
waiting to be reclaimed.  One complication with this approach is that RCU
has no idea how much memory each callback represents, and for call_rcu(),
there is no way for it to find out.  For kfree_rcu(), there are ways,
but as you know, I am questioning whether those ways are reasonable from
a performance perspective.  But even if they are, we would be accepting
more error from the memory waiting via call_rcu() than we would be
accepting if we just counted blocks instead of bytes for kfree_rcu().

Let me reiterate that:  The estimation error that you are objecting to
for kfree_rcu() is completely and utterly unavoidable for call_rcu().
RCU callback functions do whatever their authors want, and we won't be
analyzing their code to estimate bytes freed without some serious advances
in code-analysis technology.  Hence my perhaps otherwise inexplicable
insistence on starting with block counts rather than byte counts.

Another complication surrounding estimating memory to be freed is that
this memory can be in any of the following states:

1.	Not yet associated with a specific grace period.  Its CPU (or
	its rcuog kthread, as the case may be) must interact with the
	RCU core and assign a grace period.  There are a few costly
	RCU_STRICT_GRACE_PERIOD tricks that can help here, usually
	involving IPIing a bunch of CPUs or awakening a bunch of rcuog
	kthreads.

2.	Associated with a grace period that has not yet started.
	This grace period must of course be started, which usually
	cannot happen until the previous grace period completes.
	Which leads us to...

3.	Associated with the current grace period.  This is where
	the rcutorture forcing of quiescent states comes in.

4.	Waiting to be invoked.	This happens from either RCU_SOFTIRQ
	context or from rcuoc kthread context, and is of course impeded
	by any of the aforementioned measures to speed things up.
	Perhaps we could crank up the priority of the relevant ksoftirqd
	or rcuog/rcuoc kthreads, though this likely has some serious
	side effects.  Besides, as far as I know, we don't mess with
	other process priorities for the benefit of reclaim, so why
	would we start with RCU?

Of these, #3 and #4 are normally the slowest, with #3 often being the
slowest under light call_rcu()/kfree_rcu() load (or in the presence of
slow RCU readers) and #4 often being the slowest under callback-flooding
conditions.  Now reclaim might cause callback flooding, but I don't
recall seeing this.  At least not to the extent of userspace-induced
callback flooding, for example, during "rm -rf" of a big filesystem tree
with lots of small files on a fast device.  So we should likely focus on #3.

So we can speed RCU up, and we could use some statistics on the quantity
of whatever is waiting for RCU.  But is that the right approach?

Keep in mind that reclaim via writeback involves I/O delays.  Now these
delays are often much shorter than they used to be, courtesy of SSDs.
But queueing is still a thing, as are limitations on write bandwidth,
so those delays are still consequential.  My experience is that reclaim
spans seconds rather than milliseconds, let alone microseconds, and RCU
grace periods are in the tens to hundreds of milliseconds.  Or am I yet
again showing my age?

So perhaps we should instead make RCU take "ongoing reclaim" into account
when prodding reluctant grace periods.  For example, RCU normally scans
for idle CPUs every 3 milliseconds.  Maybe it should speed that up to
once per millisecond when reclaim is ongoing.  And likewise for RCU's
other grace-period-prodding heuristics.

In addition, RCU uses the number of callbacks queued on a particular
CPU to instigate grace-period prodding.  Maybe these heuristics should
also depend on whether or not a reclaim is in progress.

Is there an easy, fast, and reliable way for RCU to determine whether or
not a reclaim is ongoing?  It might be both easier and more effective
for RCU to simply unconditionally react to the existence of a reclaim
than for the reclaim process to try to figure out when to prod RCU.

							Thanx, Paul

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  5:17  0%                             ` Paul E. McKenney
@ 2024-02-27  6:21  0%                               ` Kent Overstreet
  2024-02-27 15:32  0%                                 ` Paul E. McKenney
  0 siblings, 1 reply; 200+ results
From: Kent Overstreet @ 2024-02-27  6:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 09:17:41PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 08:08:17PM -0500, Kent Overstreet wrote:
> > On Mon, Feb 26, 2024 at 04:55:29PM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> > > > On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > > > > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > > > > Well, we won't want it getting hammered on continuously - we should be
> > > > > > able to tune reclaim so that doesn't happen.
> > > > > > 
> > > > > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > > > > is probably first order of business - minor tweak to kfree_rcu() et al.
> > > > > > for that; there's APIs they can query to maintain that counter.
> > > > > 
> > > > > We can easily tell you the number of blocks of memory waiting to be freed.
> > > > > But RCU does not know their size.  Yes, we could ferret this out on each
> > > > > call to kfree_rcu(), but that might not be great for performance.
> > > > > We could traverse the lists at runtime, but such traversal must be done
> > > > > with interrupts disabled, which is also not great.
> > > > > 
> > > > > > then, we can add a heuristic threshold somewhere, something like
> > > > > > 
> > > > > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > > > > 	kick_rcu()
> > > > > 
> > > > > If it is a heuristic anyway, it sounds best to base the heuristic on
> > > > > the number of objects rather than their aggregate size.
> > > > 
> > > > I don't think that'll really work given that object size can vary from <
> > > > 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> > > > way and I positively hate it; it's really helpful for introspection and
> > > > debuggability later to give good human understandable units to this
> > > > stuff.
> > > 
> > > You might well be right, but let's please try it before adding overhead to
> > > kfree_rcu() and friends.  I bet it will prove to be good and sufficient.
> > > 
> > > > And __ksize() is pretty cheap, and I think there might be room in struct
> > > > slab to stick the object size there instead of getting it from the slab
> > > > cache - and folio_size() is cheaper still.
> > > 
> > > On __ksize():
> > > 
> > >  * This should only be used internally to query the true size of allocations.
> > >  * It is not meant to be a way to discover the usable size of an allocation
> > >  * after the fact. Instead, use kmalloc_size_roundup().
> > > 
> > > Except that kmalloc_size_roundup() doesn't look like it is meant for
> > > this use case.  On __ksize() being used only internally, I would not be
> > > at all averse to kfree_rcu() and friends moving to mm.
> > 
> > __ksize() is the right helper to use for this; ksize() is "how much
> > usable memory", __ksize() is "how much does this occupy".
> > 
> > > The idea is for kfree_rcu() to invoke __ksize() when given slab memory
> > > and folio_size() when given vmalloc() memory?
> > 
> > __ksize() for slab memory, but folio_size() would be for page
> > allocations - actually, I think compound_order() is more appropriate
> > here, but that's willy's area. IOW, for free_pages_rcu(), which AFAIK we
> > don't have yet but it looks like we're going to need.
> > 
> > I'm scanning through vmalloc.c and I don't think we have a helper yet to
> > query the allocation size - I can write one tomorrow, giving my brain a
> > rest today :)
> 
> Again, let's give the straight count of blocks a try first.  I do see
> that you feel that the added overhead is negligible, but zero added
> overhead is even better.

How are you going to write a heuristic that works correctly both when
the system is cycling through nothing but 2M hugepages, and nothing but
128 byte whatevers?

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  1:08  5%                           ` Kent Overstreet
@ 2024-02-27  5:17  0%                             ` Paul E. McKenney
  2024-02-27  6:21  0%                               ` Kent Overstreet
  0 siblings, 1 reply; 200+ results
From: Paul E. McKenney @ 2024-02-27  5:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 08:08:17PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 04:55:29PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> > > On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > > > Well, we won't want it getting hammered on continuously - we should be
> > > > > able to tune reclaim so that doesn't happen.
> > > > > 
> > > > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > > > is probably first order of business - minor tweak to kfree_rcu() et al.
> > > > > for that; there's APIs they can query to maintain that counter.
> > > > 
> > > > We can easily tell you the number of blocks of memory waiting to be freed.
> > > > But RCU does not know their size.  Yes, we could ferret this out on each
> > > > call to kfree_rcu(), but that might not be great for performance.
> > > > We could traverse the lists at runtime, but such traversal must be done
> > > > with interrupts disabled, which is also not great.
> > > > 
> > > > > then, we can add a heuristic threshold somewhere, something like
> > > > > 
> > > > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > > > 	kick_rcu()
> > > > 
> > > > If it is a heuristic anyway, it sounds best to base the heuristic on
> > > > the number of objects rather than their aggregate size.
> > > 
> > > I don't think that'll really work given that object size can vary from <
> > > 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> > > way and I positively hate it; it's really helpful for introspection and
> > > debuggability later to give good human understandable units to this
> > > stuff.
> > 
> > You might well be right, but let's please try it before adding overhead to
> > kfree_rcu() and friends.  I bet it will prove to be good and sufficient.
> > 
> > > And __ksize() is pretty cheap, and I think there might be room in struct
> > > slab to stick the object size there instead of getting it from the slab
> > > cache - and folio_size() is cheaper still.
> > 
> > On __ksize():
> > 
> >  * This should only be used internally to query the true size of allocations.
> >  * It is not meant to be a way to discover the usable size of an allocation
> >  * after the fact. Instead, use kmalloc_size_roundup().
> > 
> > Except that kmalloc_size_roundup() doesn't look like it is meant for
> > this use case.  On __ksize() being used only internally, I would not be
> > at all averse to kfree_rcu() and friends moving to mm.
> 
> __ksize() is the right helper to use for this; ksize() is "how much
> usable memory", __ksize() is "how much does this occupy".
> 
> > The idea is for kfree_rcu() to invoke __ksize() when given slab memory
> > and folio_size() when given vmalloc() memory?
> 
> __ksize() for slab memory, but folio_size() would be for page
> allocations - actually, I think compound_order() is more appropriate
> here, but that's willy's area. IOW, for free_pages_rcu(), which AFAIK we
> don't have yet but it looks like we're going to need.
> 
> I'm scanning through vmalloc.c and I don't think we have a helper yet to
> query the allocation size - I can write one tomorrow, giving my brain a
> rest today :)

Again, let's give the straight count of blocks a try first.  I do see
that you feel that the added overhead is negligible, but zero added
overhead is even better.

							Thanx, Paul

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  0:55  5%                         ` Paul E. McKenney
@ 2024-02-27  1:08  5%                           ` Kent Overstreet
  2024-02-27  5:17  0%                             ` Paul E. McKenney
  0 siblings, 1 reply; 200+ results
From: Kent Overstreet @ 2024-02-27  1:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 04:55:29PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> > On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > > Well, we won't want it getting hammered on continuously - we should be
> > > > able to tune reclaim so that doesn't happen.
> > > > 
> > > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > > is probably first order of business - minor tweak to kfree_rcu() et all
> > > > for that; there's APIs they can query to maintain that counter.
> > > 
> > > We can easily tell you the number of blocks of memory waiting to be freed.
> > > But RCU does not know their size.  Yes, we could ferret this out on each
> > > call to kfree_rcu(), but that might not be great for performance.
> > > We could traverse the lists at runtime, but such traversal must be done
> > > with interrupts disabled, which is also not great.
> > > 
> > > > then, we can add a heuristic threshold somewhere, something like
> > > > 
> > > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > > 	kick_rcu()
> > > 
> > > If it is a heuristic anyway, it sounds best to base the heuristic on
> > > the number of objects rather than their aggregate size.
> > 
> > I don't think that'll really work given that object size can vary from <
> > 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> > way and I positively hate it; it's really helpful for introspection and
> > debugability later to give good human understandable units to this
> > stuff.
> 
> You might well be right, but let's please try it before adding overhead to
> kfree_rcu() and friends.  I bet it will prove to be good and sufficient.
> 
> > And __ksize() is pretty cheap, and I think there might be room in struct
> > slab to stick the object size there instead of getting it from the slab
> > cache - and folio_size() is cheaper still.
> 
> On __ksize():
> 
>  * This should only be used internally to query the true size of allocations.
>  * It is not meant to be a way to discover the usable size of an allocation
>  * after the fact. Instead, use kmalloc_size_roundup().
> 
> Except that kmalloc_size_roundup() doesn't look like it is meant for
> this use case.  On __ksize() being used only internally, I would not be
> at all averse to kfree_rcu() and friends moving to mm.

__ksize() is the right helper to use for this; ksize() is "how much
usable memory", __ksize() is "how much does this occupy".

> The idea is for kfree_rcu() to invoke __ksize() when given slab memory
> and folio_size() when given vmalloc() memory?

__ksize() for slab memory, but folio_size() would be for page
allocations - actually, I think compound_order() is more appropriate
here, but that's willy's area. IOW, for free_pages_rcu(), which AFAIK we
don't have yet but it looks like we're going to need.

I'm scanning through vmalloc.c and I don't think we have a helper yet to
query the allocation size - I can write one tomorrow, giving my brain a
rest today :)

^ permalink raw reply	[relevance 5%]

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  0:29  0%                       ` Kent Overstreet
@ 2024-02-27  0:55  5%                         ` Paul E. McKenney
  2024-02-27  1:08  5%                           ` Kent Overstreet
  0 siblings, 1 reply; 200+ results
From: Paul E. McKenney @ 2024-02-27  0:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > > Well, we won't want it getting hammered on continuously - we should be
> > > able to tune reclaim so that doesn't happen.
> > > 
> > > I think getting numbers on the amount of memory stranded waiting for RCU
> > > is probably first order of business - minor tweak to kfree_rcu() et al.
> > > for that; there's APIs they can query to maintain that counter.
> > 
> > We can easily tell you the number of blocks of memory waiting to be freed.
> > But RCU does not know their size.  Yes, we could ferret this out on each
> > call to kfree_rcu(), but that might not be great for performance.
> > We could traverse the lists at runtime, but such traversal must be done
> > with interrupts disabled, which is also not great.
> > 
> > > then, we can add a heuristic threshold somewhere, something like
> > > 
> > > if (rcu_stranded * multiplier > reclaimable_memory)
> > > 	kick_rcu()
> > 
> > If it is a heuristic anyway, it sounds best to base the heuristic on
> > the number of objects rather than their aggregate size.
> 
> I don't think that'll really work given that object size can vary from <
> 100 bytes all the way up to 2MB hugepages. The shrinker API works that
> way and I positively hate it; it's really helpful for introspection and
> debugability later to give good human understandable units to this
> stuff.

You might well be right, but let's please try it before adding overhead to
kfree_rcu() and friends.  I bet it will prove to be good and sufficient.

> And __ksize() is pretty cheap, and I think there might be room in struct
> slab to stick the object size there instead of getting it from the slab
> cache - and folio_size() is cheaper still.

On __ksize():

 * This should only be used internally to query the true size of allocations.
 * It is not meant to be a way to discover the usable size of an allocation
 * after the fact. Instead, use kmalloc_size_roundup().

Except that kmalloc_size_roundup() doesn't look like it is meant for
this use case.  On __ksize() being used only internally, I would not be
at all averse to kfree_rcu() and friends moving to mm.

The idea is for kfree_rcu() to invoke __ksize() when given slab memory
and folio_size() when given vmalloc() memory?

							Thanx, Paul

^ permalink raw reply	[relevance 5%]

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 23:29  5%                   ` Kent Overstreet
  2024-02-27  0:05  0%                     ` Paul E. McKenney
@ 2024-02-27  0:43  0%                     ` Dave Chinner
  1 sibling, 0 replies; 200+ results
From: Dave Chinner @ 2024-02-27  0:43 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Paul E. McKenney, Matthew Wilcox, Linus Torvalds, Al Viro,
	Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
	Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> > > > RCU allocating and freeing of memory can already be fairly significant
> > > > depending on workload, and I'd expect that to grow - we really just need
> > > > a way for reclaim to kick RCU when needed (and probably add a percpu
> > > > counter for "amount of memory stranded until the next RCU grace
> > > > period").
> > 
> > There are some APIs for that, though they are sharp-edged and mainly
> > intended for rcutorture, and there are some hooks for a CI Kconfig
> > option called RCU_STRICT_GRACE_PERIOD that could be organized into
> > something useful.
> > 
> > Of course, if there is a long-running RCU reader, there is nothing
> > RCU can do.  By definition, it must wait on all pre-existing readers,
> > no exceptions.
> > 
> > But my guess is that you instead are thinking of memory-exhaustion
> > emergencies where you would like RCU to burn more CPU than usual to
> > reduce grace-period latency, there are definitely things that can be done.
> > 
> > I am sure that there are more questions that I should ask, but the one
> > that comes immediately to mind is "Is this API call an occasional thing,
> > or does RCU need to tolerate many CPUs hammering it frequently?"
> > Either answer is fine, I just need to know.  ;-)
> 
> Well, we won't want it getting hammered on continuously - we should be
> able to tune reclaim so that doesn't happen.

If we are under sustained memory pressure, there will be a
relatively steady state of "stranded memory" - every rcu grace
period will be stranding and freeing roughly the same amount of
memory, because reclaim progress across all caches won't change
significantly from grace period to grace period.

I really haven't seen "stranded memory" from reclaimable slab caches
(like inodes and dentries) ever causing issues with allocation or
causing OOM kills.  Hence I'm not sure that there is any real need
for expediting the freeing of RCU memory in the general case - it's
probably only when we get near OOM (i.e. reclaim priority is
approaching 0) that expediting rcu_free()d memory may make any
difference to allocation success...

> I think getting numbers on the amount of memory stranded waiting for RCU
> is probably first order of business - minor tweak to kfree_rcu() et al.
> for that; there's APIs they can query to maintain that counter.

Yes, please. Get some numbers that show there is an actual problem
here that needs solving.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  0:05  0%                     ` Paul E. McKenney
@ 2024-02-27  0:29  0%                       ` Kent Overstreet
  2024-02-27  0:55  5%                         ` Paul E. McKenney
  0 siblings, 1 reply; 200+ results
From: Kent Overstreet @ 2024-02-27  0:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> > Well, we won't want it getting hammered on continuously - we should be
> > able to tune reclaim so that doesn't happen.
> > 
> > I think getting numbers on the amount of memory stranded waiting for RCU
> > is probably first order of business - minor tweak to kfree_rcu() et al.
> > for that; there's APIs they can query to maintain that counter.
> 
> We can easily tell you the number of blocks of memory waiting to be freed.
> But RCU does not know their size.  Yes, we could ferret this out on each
> call to kfree_rcu(), but that might not be great for performance.
> We could traverse the lists at runtime, but such traversal must be done
> with interrupts disabled, which is also not great.
> 
> > then, we can add a heuristic threshold somewhere, something like
> > 
> > if (rcu_stranded * multiplier > reclaimable_memory)
> > 	kick_rcu()
> 
> If it is a heuristic anyway, it sounds best to base the heuristic on
> the number of objects rather than their aggregate size.

I don't think that'll really work given that object size can vary from <
100 bytes all the way up to 2MB hugepages. The shrinker API works that
way and I positively hate it; it's really helpful for introspection and
debuggability later to give good human understandable units to this
stuff.

And __ksize() is pretty cheap, and I think there might be room in struct
slab to stick the object size there instead of getting it from the slab
cache - and folio_size() is cheaper still.

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 23:29  5%                   ` Kent Overstreet
@ 2024-02-27  0:05  0%                     ` Paul E. McKenney
  2024-02-27  0:29  0%                       ` Kent Overstreet
  2024-02-27  0:43  0%                     ` Dave Chinner
  1 sibling, 1 reply; 200+ results
From: Paul E. McKenney @ 2024-02-27  0:05 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> > > +cc Paul
> > > 
> > > On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> > > > On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> > > > > On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > > > > > Willy - tangential side note: I looked closer at the issue that you
> > > > > > reported (indirectly) with the small reads during heavy write
> > > > > > activity.
> > > > > > 
> > > > > > Our _reading_ side is very optimized and has none of the write-side
> > > > > > oddities that I can see, and we just have
> > > > > > 
> > > > > >   filemap_read ->
> > > > > >     filemap_get_pages ->
> > > > > >         filemap_get_read_batch ->
> > > > > >           folio_try_get_rcu()
> > > > > > 
> > > > > > and there is no page locking or other locking involved (assuming the
> > > > > > page is cached and marked uptodate etc, of course).
> > > > > > 
> > > > > > So afaik, it really is just that *one* atomic access (and the matching
> > > > > > page ref decrement afterwards).
> > > > > 
> > > > > Yep, that was what the customer reported on their ancient kernel, and
> > > > > we at least didn't make that worse ...
> > > > > 
> > > > > > We could easily do all of this without getting any ref to the page at
> > > > > > all if we did the page cache release with RCU (and the user copy with
> > > > > > "copy_to_user_atomic()").  Honestly, anything else looks like a
> > > > > > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > > > > > really *only* for tiny reads where we could have that buffer on the
> > > > > > stack.
> > > > > > 
> > > > > > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> > > > > > for to that degree?
> > > > > > 
> > > > > > In contrast, the RCU-delaying of the page cache might be a good idea
> > > > > > in general. We've had other situations where that would have been
> > > > > > nice. The main worry would be low-memory situations, I suspect.
> > > > > > 
> > > > > > The "tiny read" optimization smells like a benchmark thing to me. Even
> > > > > > with the cacheline possibly bouncing, the system call overhead for
> > > > > > tiny reads (particularly with all the mitigations) should be orders of
> > > > > > magnitude higher than two atomic accesses.
> > > > > 
> > > > > Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
> > > > > I suspect that this customer would simply disable them; afaik the machine
> > > > > is an appliance and one interacts with it purely by sending transactions
> > > > > to it (it's not even an SQL system, much less a "run arbitrary javascript"
> > > > > kind of system).  But that makes it even more special case, inapplicable
> > > > > to the majority of workloads and closer to smelling like a benchmark.
> > > > > 
> > > > > I've thought about and rejected RCU delaying of the page cache in the
> > > > > past.  With the majority of memory in anon memory & file memory, it just
> > > > > feels too risky to have so much memory waiting to be reused.  We could
> > > > > also improve gup-fast if we could rely on RCU freeing of anon memory.
> > > > > Not sure what workloads might benefit from that, though.
> > > > 
> > > > RCU allocating and freeing of memory can already be fairly significant
> > > > depending on workload, and I'd expect that to grow - we really just need
> > > > a way for reclaim to kick RCU when needed (and probably add a percpu
> > > > counter for "amount of memory stranded until the next RCU grace
> > > > period").
> > 
> > There are some APIs for that, though they are sharp-edged and mainly
> > intended for rcutorture, and there are some hooks for a CI Kconfig
> > option called RCU_STRICT_GRACE_PERIOD that could be organized into
> > something useful.
> > 
> > Of course, if there is a long-running RCU reader, there is nothing
> > RCU can do.  By definition, it must wait on all pre-existing readers,
> > no exceptions.
> > 
> > But my guess is that you instead are thinking of memory-exhaustion
> > emergencies where you would like RCU to burn more CPU than usual to
> > reduce grace-period latency; if so, there are definitely things that can be done.
> > 
> > I am sure that there are more questions that I should ask, but the one
> > that comes immediately to mind is "Is this API call an occasional thing,
> > or does RCU need to tolerate many CPUs hammering it frequently?"
> > Either answer is fine, I just need to know.  ;-)
> 
> Well, we won't want it getting hammered on continuously - we should be
> able to tune reclaim so that doesn't happen.
> 
> I think getting numbers on the amount of memory stranded waiting for RCU
> is probably the first order of business - a minor tweak to kfree_rcu() et al.
> for that; there are APIs they can query to maintain that counter.

We can easily tell you the number of blocks of memory waiting to be freed.
But RCU does not know their size.  Yes, we could ferret this out on each
call to kfree_rcu(), but that might not be great for performance.
We could traverse the lists at runtime, but such traversal must be done
with interrupts disabled, which is also not great.

> then, we can add a heuristic threshold somewhere, something like
> 
> if (rcu_stranded * multiplier > reclaimable_memory)
> 	kick_rcu()

If it is a heuristic anyway, it sounds best to base the heuristic on
the number of objects rather than their aggregate size.

							Thanx, Paul

^ permalink raw reply	[relevance 0%]

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  @ 2024-02-26 23:29  5%                   ` Kent Overstreet
  2024-02-27  0:05  0%                     ` Paul E. McKenney
  2024-02-27  0:43  0%                     ` Dave Chinner
  0 siblings, 2 replies; 200+ results
From: Kent Overstreet @ 2024-02-26 23:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> > +cc Paul
> > 
> > On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> > > On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> > > > On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> > > > > Willy - tangential side note: I looked closer at the issue that you
> > > > > reported (indirectly) with the small reads during heavy write
> > > > > activity.
> > > > > 
> > > > > Our _reading_ side is very optimized and has none of the write-side
> > > > > oddities that I can see, and we just have
> > > > > 
> > > > >   filemap_read ->
> > > > >     filemap_get_pages ->
> > > > >         filemap_get_read_batch ->
> > > > >           folio_try_get_rcu()
> > > > > 
> > > > > and there is no page locking or other locking involved (assuming the
> > > > > page is cached and marked uptodate etc, of course).
> > > > > 
> > > > > So afaik, it really is just that *one* atomic access (and the matching
> > > > > page ref decrement afterwards).
> > > > 
> > > > Yep, that was what the customer reported on their ancient kernel, and
> > > > we at least didn't make that worse ...
> > > > 
> > > > > We could easily do all of this without getting any ref to the page at
> > > > > all if we did the page cache release with RCU (and the user copy with
> > > > > "copy_to_user_atomic()").  Honestly, anything else looks like a
> > > > > complete disaster. For tiny reads, a temporary buffer sounds ok, but
> > > > > really *only* for tiny reads where we could have that buffer on the
> > > > > stack.
> > > > > 
> > > > > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> > > > > for to that degree?
> > > > > 
> > > > > In contrast, the RCU-delaying of the page cache might be a good idea
> > > > > in general. We've had other situations where that would have been
> > > > > nice. The main worry would be low-memory situations, I suspect.
> > > > > 
> > > > > The "tiny read" optimization smells like a benchmark thing to me. Even
> > > > > with the cacheline possibly bouncing, the system call overhead for
> > > > > tiny reads (particularly with all the mitigations) should be orders of
> > > > > magnitude higher than two atomic accesses.
> > > > 
> > > > Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
> > > > I suspect that this customer would simply disable them; afaik the machine
> > > > is an appliance and one interacts with it purely by sending transactions
> > > > to it (it's not even an SQL system, much less a "run arbitrary javascript"
> > > > kind of system).  But that makes it even more special case, inapplicable
> > > > to the majority of workloads and closer to smelling like a benchmark.
> > > > 
> > > > I've thought about and rejected RCU delaying of the page cache in the
> > > > past.  With the majority of memory in anon memory & file memory, it just
> > > > feels too risky to have so much memory waiting to be reused.  We could
> > > > also improve gup-fast if we could rely on RCU freeing of anon memory.
> > > > Not sure what workloads might benefit from that, though.
> > > 
> > > RCU allocating and freeing of memory can already be fairly significant
> > > depending on workload, and I'd expect that to grow - we really just need
> > > a way for reclaim to kick RCU when needed (and probably add a percpu
> > > counter for "amount of memory stranded until the next RCU grace
> > > period").
> 
> There are some APIs for that, though they are sharp-edged and mainly
> intended for rcutorture, and there are some hooks for a CI Kconfig
> option called RCU_STRICT_GRACE_PERIOD that could be organized into
> something useful.
> 
> Of course, if there is a long-running RCU reader, there is nothing
> RCU can do.  By definition, it must wait on all pre-existing readers,
> no exceptions.
> 
> But my guess is that you instead are thinking of memory-exhaustion
> emergencies where you would like RCU to burn more CPU than usual to
> reduce grace-period latency; if so, there are definitely things that can be done.
> 
> I am sure that there are more questions that I should ask, but the one
> that comes immediately to mind is "Is this API call an occasional thing,
> or does RCU need to tolerate many CPUs hammering it frequently?"
> Either answer is fine, I just need to know.  ;-)

Well, we won't want it getting hammered on continuously - we should be
able to tune reclaim so that doesn't happen.

I think getting numbers on the amount of memory stranded waiting for RCU
is probably the first order of business - a minor tweak to kfree_rcu() et al.
for that; there are APIs they can query to maintain that counter.

then, we can add a heuristic threshold somewhere, something like

if (rcu_stranded * multiplier > reclaimable_memory)
	kick_rcu()

^ permalink raw reply	[relevance 5%]

* Re: [PATCH 2/3] check: add support for --list-group-tests
  2024-02-21 16:45  0%     ` Luis Chamberlain
@ 2024-02-25 16:08  0%       ` Zorro Lang
  0 siblings, 0 replies; 200+ results
From: Zorro Lang @ 2024-02-25 16:08 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Dave Chinner, fstests, anand.jain, aalbersh, djwong,
	linux-fsdevel, kdevops, patches

On Wed, Feb 21, 2024 at 08:45:37AM -0800, Luis Chamberlain wrote:
> On Mon, Feb 19, 2024 at 02:38:12PM +1100, Dave Chinner wrote:
> > On Fri, Feb 16, 2024 at 10:18:58AM -0800, Luis Chamberlain wrote:
> > > Since the prior commit adds the ability to list groups but is used
> > > only when we use --start-after, let's add an option which leverages this
> > > to also allow us to easily query which tests are part of the groups
> > > specified.
> > > 
> > > This can be used for dynamic test configuration suites such as kdevops
> > > which may want to take advantage of this information to deterministically
> > > determine if a test falls part of a specific group.
> > > Demo:
> > > 
> > > root@demo-xfs-reflink /var/lib/xfstests # ./check --list-group-tests -g soak
> > > 
> > > generic/019 generic/388 generic/475 generic/476 generic/521 generic/522 generic/616 generic/617 generic/642 generic/648 generic/650 xfs/285 xfs/517 xfs/560 xfs/561 xfs/562 xfs/565 xfs/570 xfs/571 xfs/572 xfs/573 xfs/574 xfs/575 xfs/576 xfs/577 xfs/578 xfs/579 xfs/580 xfs/581 xfs/582 xfs/583 xfs/584 xfs/585 xfs/586 xfs/587 xfs/588 xfs/589 xfs/590 xfs/591 xfs/592 xfs/593 xfs/594 xfs/595 xfs/727 xfs/729 xfs/800
> > 
> > So how is this different to ./check -n -g soak?
> > 
> > '-n' is supposed to show you want tests are going to be run
> > without actually running them, so why can't you use that?
> 
> '-n' will replicate as if you are running all tests but just skip them, while
> --list-group-tests will just look for the tests in the group and bail right
> away, and its output is machine-readable.

What do you mean by "replicate as if you are running all tests but just skip"?
Sorry, I don't understand this explanation 100%. Can you show us some examples
of the kind of job you hope "--list-group-tests" will do that "-n" can't?

Thanks,
Zorro

> 
>   Luis
> 


^ permalink raw reply	[relevance 0%]

* Re: [Lsf-pc] [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more
  @ 2024-02-22 16:08  5%             ` Jan Kara
  0 siblings, 0 replies; 200+ results
From: Jan Kara @ 2024-02-22 16:08 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jan Kara, Kent Overstreet, Josef Bacik, linux-kernel,
	linux-bcachefs, linux-fsdevel, lsf-pc, linux-btrfs

On Thu 22-02-24 13:48:45, Miklos Szeredi wrote:
> On Thu, 22 Feb 2024 at 12:01, Jan Kara <jack@suse.cz> wrote:
> 
> > I think for "unique inode identifier" we don't even have to come up with
> > new APIs. The file handle + fsid pair is an established way to do this,
> 
> Why not uuid?
> 
> fsid seems to be just a degraded uuid.   We can do better with statx
> and/or statmount.

fanotify uses fsid because we have a standard interface for querying fsid
(statfs(2)) and because not all filesystems (in particular virtual ones)
bother with a uuid. At least the first thing is being changed now.

> > fanotify successfully uses this as object identifier and Amir did quite
> > some work for this to be usable for vast majority of filesystems (including
> 
> Vast majority != all.

True. If we are going to use this scheme more widely, we need to have a
look at whether the remaining cases need fixing or we can just ignore them.
They were not very interesting for fanotify so we moved on.

> Also even uuid is just a statistically unique
> identifier, while st_dev was guaranteed to be unique (but not
> persistent, like uuid).

Well, everything is just statistically true in this world :) If you have
conflicting uuids, you are likely to also see other problems, so I would not
be too concerned about that.

> If we are going to start fixing userspace, then we better make sure to
> use the right interfaces, that won't have issues in the future.

I agree we should give a good thought to which identification of a
filesystem is best.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[relevance 5%]

* [RFC v4 linux-next 19/19] fs & block: remove bdev->bd_inode
  @ 2024-02-22 12:45  1% ` Yu Kuai
  0 siblings, 0 replies; 200+ results
From: Yu Kuai @ 2024-02-22 12:45 UTC (permalink / raw)
  To: jack, hch, brauner, axboe
  Cc: linux-fsdevel, linux-block, yukuai3, yukuai1, yi.zhang, yangerkun

From: Yu Kuai <yukuai3@huawei.com>

In prior patches we introduced the ability to open block devices as
files and made all filesystems stash the opened block device files. With
this patch we remove bdev->bd_inode from struct block_device.

Using files allows us to stop passing struct block_device directly to
almost all buffer_head helpers. Whenever access to the inode of the
block device is needed bdev_file_inode(bdev_file) can be used instead of
bdev->bd_inode.

The only user that doesn't rely on files is the block layer itself in
block/fops.c, where we only have access to the block device, since the bdev
filesystem obviously doesn't open block devices as files.

This introduces a union into struct buffer_head and struct iomap. The
union encompasses both struct block_device and struct file. In both
cases a flag is used to differentiate whether a block device or a proper
file was stashed. Simple accessors bh_bdev() and iomap_bdev() are used
to return the block device in the really low-level functions where it's
needed. These are overall just a few callsites.

Originally-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/bdev.c                  |   8 ++-
 block/fops.c                  |   2 +
 drivers/md/md-bitmap.c        |   2 +-
 fs/affs/file.c                |   2 +-
 fs/btrfs/inode.c              |   2 +-
 fs/buffer.c                   | 103 +++++++++++++++++++---------------
 fs/direct-io.c                |   4 +-
 fs/erofs/data.c               |   3 +-
 fs/erofs/internal.h           |   1 +
 fs/erofs/zmap.c               |   2 +-
 fs/ext2/inode.c               |   4 +-
 fs/ext2/xattr.c               |   2 +-
 fs/ext4/inode.c               |   8 +--
 fs/ext4/mmp.c                 |   2 +-
 fs/ext4/page-io.c             |   5 +-
 fs/ext4/super.c               |   4 +-
 fs/ext4/xattr.c               |   2 +-
 fs/f2fs/data.c                |   7 ++-
 fs/f2fs/f2fs.h                |   1 +
 fs/fuse/dax.c                 |   2 +-
 fs/gfs2/aops.c                |   2 +-
 fs/gfs2/bmap.c                |   2 +-
 fs/gfs2/meta_io.c             |   2 +-
 fs/hpfs/file.c                |   2 +-
 fs/iomap/buffered-io.c        |   8 +--
 fs/iomap/direct-io.c          |  11 ++--
 fs/iomap/swapfile.c           |   2 +-
 fs/iomap/trace.h              |   2 +-
 fs/jbd2/commit.c              |   2 +-
 fs/jbd2/journal.c             |   8 +--
 fs/jbd2/recovery.c            |   8 +--
 fs/jbd2/revoke.c              |  13 +++--
 fs/jbd2/transaction.c         |   8 +--
 fs/mpage.c                    |  26 ++++++---
 fs/nilfs2/btnode.c            |   4 +-
 fs/nilfs2/gcinode.c           |   2 +-
 fs/nilfs2/mdt.c               |   2 +-
 fs/nilfs2/page.c              |   4 +-
 fs/nilfs2/recovery.c          |  27 +++++----
 fs/ntfs3/fsntfs.c             |   8 +--
 fs/ntfs3/inode.c              |   2 +-
 fs/ntfs3/super.c              |   2 +-
 fs/ocfs2/journal.c            |   2 +-
 fs/reiserfs/fix_node.c        |   2 +-
 fs/reiserfs/journal.c         |  10 ++--
 fs/reiserfs/prints.c          |   4 +-
 fs/reiserfs/reiserfs.h        |   6 +-
 fs/reiserfs/stree.c           |   2 +-
 fs/reiserfs/tail_conversion.c |   2 +-
 fs/xfs/xfs_iomap.c            |   4 +-
 fs/zonefs/file.c              |   4 +-
 include/linux/blk_types.h     |   1 -
 include/linux/blkdev.h        |   2 +
 include/linux/buffer_head.h   |  73 +++++++++++++++---------
 include/linux/iomap.h         |  14 ++++-
 include/trace/events/block.h  |   2 +-
 56 files changed, 259 insertions(+), 182 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index b7af04d34af2..98c192ff81ec 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -412,7 +412,6 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 	spin_lock_init(&bdev->bd_size_lock);
 	mutex_init(&bdev->bd_holder_lock);
 	bdev->bd_partno = partno;
-	bdev->bd_inode = inode;
 	bdev->bd_queue = disk->queue;
 	if (partno)
 		bdev->bd_has_submit_bio = disk->part0->bd_has_submit_bio;
@@ -1230,6 +1229,13 @@ struct folio *bdev_read_folio(struct block_device *bdev, loff_t pos)
 }
 EXPORT_SYMBOL_GPL(bdev_read_folio);
 
+void clean_bdev_aliases2(struct block_device *bdev, sector_t block,
+			 sector_t len)
+{
+	return __clean_bdev_aliases(bdev_inode(bdev), block, len);
+}
+EXPORT_SYMBOL_GPL(clean_bdev_aliases2);
+
 static int __init setup_bdev_allow_write_mounted(char *str)
 {
 	if (kstrtobool(str, &bdev_allow_write_mounted))
diff --git a/block/fops.c b/block/fops.c
index 1fcbdb131a8f..5550f8b53c21 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -386,6 +386,7 @@ static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	loff_t isize = i_size_read(inode);
 
 	iomap->bdev = bdev;
+	iomap->flags |= IOMAP_F_BDEV;
 	iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
 	if (iomap->offset >= isize)
 		return -EIO;
@@ -407,6 +408,7 @@ static int blkdev_get_block(struct inode *inode, sector_t iblock,
 	bh->b_bdev = I_BDEV(inode);
 	bh->b_blocknr = iblock;
 	set_buffer_mapped(bh);
+	set_buffer_bdev(bh);
 	return 0;
 }
 
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 9672f75c3050..689f5f543520 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -380,7 +380,7 @@ static int read_file_page(struct file *file, unsigned long index,
 			}
 
 			bh->b_blocknr = block;
-			bh->b_bdev = inode->i_sb->s_bdev;
+			bh->b_bdev_file = inode->i_sb->s_bdev_file;
 			if (count < blocksize)
 				count = 0;
 			else
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 04c018e19602..c0583831c58f 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -365,7 +365,7 @@ affs_get_block(struct inode *inode, sector_t block, struct buffer_head *bh_resul
 err_alloc:
 	brelse(ext_bh);
 	clear_buffer_mapped(bh_result);
-	bh_result->b_bdev = NULL;
+	bh_result->b_bdev_file = NULL;
 	// unlock cache
 	affs_unlock_ext(inode);
 	return -ENOSPC;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index df55dd891137..b3b2e01093dd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7716,7 +7716,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 		iomap->type = IOMAP_MAPPED;
 	}
 	iomap->offset = start;
-	iomap->bdev = fs_info->fs_devices->latest_dev->bdev;
+	iomap->bdev_file = fs_info->fs_devices->latest_dev->bdev_file;
 	iomap->length = len;
 	free_extent_map(em);
 
diff --git a/fs/buffer.c b/fs/buffer.c
index b55dea034a5d..5753c068ec78 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -129,7 +129,7 @@ static void buffer_io_error(struct buffer_head *bh, char *msg)
 	if (!test_bit(BH_Quiet, &bh->b_state))
 		printk_ratelimited(KERN_ERR
 			"Buffer I/O error on dev %pg, logical block %llu%s\n",
-			bh->b_bdev, (unsigned long long)bh->b_blocknr, msg);
+			bh_bdev(bh), (unsigned long long)bh->b_blocknr, msg);
 }
 
 /*
@@ -187,9 +187,9 @@ EXPORT_SYMBOL(end_buffer_write_sync);
  * succeeds, there is no need to take i_private_lock.
  */
 static struct buffer_head *
-__find_get_block_slow(struct block_device *bdev, sector_t block)
+__find_get_block_slow(struct file *bdev_file, sector_t block)
 {
-	struct inode *bd_inode = bdev->bd_inode;
+	struct inode *bd_inode = file_inode(bdev_file);
 	struct address_space *bd_mapping = bd_inode->i_mapping;
 	struct buffer_head *ret = NULL;
 	pgoff_t index;
@@ -232,7 +232,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
 		       "device %pg blocksize: %d\n",
 		       (unsigned long long)block,
 		       (unsigned long long)bh->b_blocknr,
-		       bh->b_state, bh->b_size, bdev,
+		       bh->b_state, bh->b_size, file_bdev(bdev_file),
 		       1 << bd_inode->i_blkbits);
 	}
 out_unlock:
@@ -473,7 +473,7 @@ EXPORT_SYMBOL(mark_buffer_async_write);
  * try_to_free_buffers() will be operating against the *blockdev* mapping
  * at the time, not against the S_ISREG file which depends on those buffers.
  * So the locking for i_private_list is via the i_private_lock in the address_space
- * which backs the buffers.  Which is different from the address_space 
+ * which backs the buffers.  Which is different from the address_space
  * against which the buffers are listed.  So for a particular address_space,
  * mapping->i_private_lock does *not* protect mapping->i_private_list!  In fact,
  * mapping->i_private_list will always be protected by the backing blockdev's
@@ -655,10 +655,12 @@ EXPORT_SYMBOL(generic_buffers_fsync);
  * `bblock + 1' is probably a dirty indirect block.  Hunt it down and, if it's
  * dirty, schedule it for IO.  So that indirects merge nicely with their data.
  */
-void write_boundary_block(struct block_device *bdev,
+void write_boundary_block(struct file *bdev_file,
 			sector_t bblock, unsigned blocksize)
 {
-	struct buffer_head *bh = __find_get_block(bdev, bblock + 1, blocksize);
+	struct buffer_head *bh =
+		__find_get_block(bdev_file, bblock + 1, blocksize);
+
 	if (bh) {
 		if (buffer_dirty(bh))
 			write_dirty_buffer(bh, 0);
@@ -994,8 +996,9 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
  * Initialise the state of a blockdev folio's buffers.
  */ 
 static sector_t folio_init_buffers(struct folio *folio,
-		struct block_device *bdev, unsigned size)
+		struct file *bdev_file, unsigned int size)
 {
+	struct block_device *bdev = file_bdev(bdev_file);
 	struct buffer_head *head = folio_buffers(folio);
 	struct buffer_head *bh = head;
 	bool uptodate = folio_test_uptodate(folio);
@@ -1006,7 +1009,7 @@ static sector_t folio_init_buffers(struct folio *folio,
 		if (!buffer_mapped(bh)) {
 			bh->b_end_io = NULL;
 			bh->b_private = NULL;
-			bh->b_bdev = bdev;
+			bh->b_bdev_file = bdev_file;
 			bh->b_blocknr = block;
 			if (uptodate)
 				set_buffer_uptodate(bh);
@@ -1031,10 +1034,10 @@ static sector_t folio_init_buffers(struct folio *folio,
  * Returns false if we have a failure which cannot be cured by retrying
  * without sleeping.  Returns true if we succeeded, or the caller should retry.
  */
-static bool grow_dev_folio(struct block_device *bdev, sector_t block,
+static bool grow_dev_folio(struct file *bdev_file, sector_t block,
 		pgoff_t index, unsigned size, gfp_t gfp)
 {
-	struct inode *inode = bdev->bd_inode;
+	struct inode *inode = file_inode(bdev_file);
 	struct folio *folio;
 	struct buffer_head *bh;
 	sector_t end_block = 0;
@@ -1047,7 +1050,7 @@ static bool grow_dev_folio(struct block_device *bdev, sector_t block,
 	bh = folio_buffers(folio);
 	if (bh) {
 		if (bh->b_size == size) {
-			end_block = folio_init_buffers(folio, bdev, size);
+			end_block = folio_init_buffers(folio, bdev_file, size);
 			goto unlock;
 		}
 
@@ -1075,7 +1078,7 @@ static bool grow_dev_folio(struct block_device *bdev, sector_t block,
 	 */
 	spin_lock(&inode->i_mapping->i_private_lock);
 	link_dev_buffers(folio, bh);
-	end_block = folio_init_buffers(folio, bdev, size);
+	end_block = folio_init_buffers(folio, bdev_file, size);
 	spin_unlock(&inode->i_mapping->i_private_lock);
 unlock:
 	folio_unlock(folio);
@@ -1088,7 +1091,7 @@ static bool grow_dev_folio(struct block_device *bdev, sector_t block,
  * that folio was dirty, the buffers are set dirty also.  Returns false
  * if we've hit a permanent error.
  */
-static bool grow_buffers(struct block_device *bdev, sector_t block,
+static bool grow_buffers(struct file *bdev_file, sector_t block,
 		unsigned size, gfp_t gfp)
 {
 	loff_t pos;
@@ -1100,18 +1103,19 @@ static bool grow_buffers(struct block_device *bdev, sector_t block,
 	if (check_mul_overflow(block, (sector_t)size, &pos) || pos > MAX_LFS_FILESIZE) {
 		printk(KERN_ERR "%s: requested out-of-range block %llu for device %pg\n",
 			__func__, (unsigned long long)block,
-			bdev);
+			file_bdev(bdev_file));
 		return false;
 	}
 
 	/* Create a folio with the proper size buffers */
-	return grow_dev_folio(bdev, block, pos / PAGE_SIZE, size, gfp);
+	return grow_dev_folio(bdev_file, block, pos / PAGE_SIZE, size, gfp);
 }
 
 static struct buffer_head *
-__getblk_slow(struct block_device *bdev, sector_t block,
-	     unsigned size, gfp_t gfp)
+__getblk_slow(struct file *bdev_file, sector_t block, unsigned size, gfp_t gfp)
 {
+	struct block_device *bdev = file_bdev(bdev_file);
+
 	/* Size must be multiple of hard sectorsize */
 	if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
 			(size < 512 || size > PAGE_SIZE))) {
@@ -1127,11 +1131,11 @@ __getblk_slow(struct block_device *bdev, sector_t block,
 	for (;;) {
 		struct buffer_head *bh;
 
-		bh = __find_get_block(bdev, block, size);
+		bh = __find_get_block(bdev_file, block, size);
 		if (bh)
 			return bh;
 
-		if (!grow_buffers(bdev, block, size, gfp))
+		if (!grow_buffers(bdev_file, block, size, gfp))
 			return NULL;
 	}
 }
@@ -1367,7 +1371,7 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
 	for (i = 0; i < BH_LRU_SIZE; i++) {
 		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
 
-		if (bh && bh->b_blocknr == block && bh->b_bdev == bdev &&
+		if (bh && bh->b_blocknr == block && bh_bdev(bh) == bdev &&
 		    bh->b_size == size) {
 			if (i) {
 				while (i) {
@@ -1392,13 +1396,14 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
  * NULL
  */
 struct buffer_head *
-__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
+__find_get_block(struct file *bdev_file, sector_t block, unsigned int size)
 {
-	struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
+	struct buffer_head *bh = lookup_bh_lru(file_bdev(bdev_file), block,
+					       size);
 
 	if (bh == NULL) {
 		/* __find_get_block_slow will mark the page accessed */
-		bh = __find_get_block_slow(bdev, block);
+		bh = __find_get_block_slow(bdev_file, block);
 		if (bh)
 			bh_lru_install(bh);
 	} else
@@ -1410,32 +1415,32 @@ EXPORT_SYMBOL(__find_get_block);
 
 /**
  * bdev_getblk - Get a buffer_head in a block device's buffer cache.
- * @bdev: The block device.
+ * @bdev_file: The opened block device.
  * @block: The block number.
- * @size: The size of buffer_heads for this @bdev.
+ * @size: The size of buffer_heads for this block device.
  * @gfp: The memory allocation flags to use.
  *
  * Return: The buffer head, or NULL if memory could not be allocated.
  */
-struct buffer_head *bdev_getblk(struct block_device *bdev, sector_t block,
+struct buffer_head *bdev_getblk(struct file *bdev_file, sector_t block,
 		unsigned size, gfp_t gfp)
 {
-	struct buffer_head *bh = __find_get_block(bdev, block, size);
+	struct buffer_head *bh = __find_get_block(bdev_file, block, size);
 
 	might_alloc(gfp);
 	if (bh)
 		return bh;
 
-	return __getblk_slow(bdev, block, size, gfp);
+	return __getblk_slow(bdev_file, block, size, gfp);
 }
 EXPORT_SYMBOL(bdev_getblk);
 
 /*
  * Do async read-ahead on a buffer..
  */
-void __breadahead(struct block_device *bdev, sector_t block, unsigned size)
+void __breadahead(struct file *bdev_file, sector_t block, unsigned int size)
 {
-	struct buffer_head *bh = bdev_getblk(bdev, block, size,
+	struct buffer_head *bh = bdev_getblk(bdev_file, block, size,
 			GFP_NOWAIT | __GFP_MOVABLE);
 
 	if (likely(bh)) {
@@ -1447,7 +1452,7 @@ EXPORT_SYMBOL(__breadahead);
 
 /**
  *  __bread_gfp() - reads a specified block and returns the bh
- *  @bdev: the block_device to read from
+ *  @bdev_file: the opened block_device to read from
  *  @block: number of block
  *  @size: size (in bytes) to read
  *  @gfp: page allocation flag
@@ -1458,12 +1463,11 @@ EXPORT_SYMBOL(__breadahead);
  *  It returns NULL if the block was unreadable.
  */
 struct buffer_head *
-__bread_gfp(struct block_device *bdev, sector_t block,
-		   unsigned size, gfp_t gfp)
+__bread_gfp(struct file *bdev_file, sector_t block, unsigned int size, gfp_t gfp)
 {
 	struct buffer_head *bh;
 
-	gfp |= mapping_gfp_constraint(bdev->bd_inode->i_mapping, ~__GFP_FS);
+	gfp |= mapping_gfp_constraint(bdev_file->f_mapping, ~__GFP_FS);
 
 	/*
 	 * Prefer looping in the allocator rather than here, at least that
@@ -1471,7 +1475,7 @@ __bread_gfp(struct block_device *bdev, sector_t block,
 	 */
 	gfp |= __GFP_NOFAIL;
 
-	bh = bdev_getblk(bdev, block, size, gfp);
+	bh = bdev_getblk(bdev_file, block, size, gfp);
 
 	if (likely(bh) && !buffer_uptodate(bh))
 		bh = __bread_slow(bh);
@@ -1556,7 +1560,7 @@ EXPORT_SYMBOL(folio_set_bh);
 /* Bits that are cleared during an invalidate */
 #define BUFFER_FLAGS_DISCARD \
 	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
-	 1 << BH_Delay | 1 << BH_Unwritten)
+	 1 << BH_Delay | 1 << BH_Unwritten | 1 << BH_Bdev)
 
 static void discard_buffer(struct buffer_head * bh)
 {
@@ -1564,7 +1568,7 @@ static void discard_buffer(struct buffer_head * bh)
 
 	lock_buffer(bh);
 	clear_buffer_dirty(bh);
-	bh->b_bdev = NULL;
+	bh->b_bdev_file = NULL;
 	b_state = READ_ONCE(bh->b_state);
 	do {
 	} while (!try_cmpxchg(&bh->b_state, &b_state,
@@ -1675,8 +1679,8 @@ struct buffer_head *create_empty_buffers(struct folio *folio,
 EXPORT_SYMBOL(create_empty_buffers);
 
 /**
- * clean_bdev_aliases: clean a range of buffers in block device
- * @bdev: Block device to clean buffers in
+ * __clean_bdev_aliases: clean a range of buffers in block device
+ * @inode: Block device inode to clean buffers in
  * @block: Start of a range of blocks to clean
  * @len: Number of blocks to clean
  *
@@ -1694,9 +1698,8 @@ EXPORT_SYMBOL(create_empty_buffers);
  * I/O in bforget() - it's more efficient to wait on the I/O only if we really
  * need to.  That happens here.
  */
-void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
+void __clean_bdev_aliases(struct inode *bd_inode, sector_t block, sector_t len)
 {
-	struct inode *bd_inode = bdev->bd_inode;
 	struct address_space *bd_mapping = bd_inode->i_mapping;
 	struct folio_batch fbatch;
 	pgoff_t index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
@@ -1746,7 +1749,7 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
 			break;
 	}
 }
-EXPORT_SYMBOL(clean_bdev_aliases);
+EXPORT_SYMBOL(__clean_bdev_aliases);
 
 static struct buffer_head *folio_create_buffers(struct folio *folio,
 						struct inode *inode,
@@ -2003,7 +2006,17 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
 {
 	loff_t offset = (loff_t)block << inode->i_blkbits;
 
-	bh->b_bdev = iomap->bdev;
+	if (iomap->flags & IOMAP_F_BDEV) {
+		/*
+		 * If this request originated directly from the block layer we
+		 * only have access to the plain block device. Mark the
+		 * buffer_head similarly.
+		 */
+		bh->b_bdev = iomap->bdev;
+		set_buffer_bdev(bh);
+	} else {
+		bh->b_bdev_file = iomap->bdev_file;
+	}
 
 	/*
 	 * Block points to offset in file we need to map, iomap contains
@@ -2778,7 +2791,7 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
 	if (buffer_prio(bh))
 		opf |= REQ_PRIO;
 
-	bio = bio_alloc(bh->b_bdev, 1, opf, GFP_NOIO);
+	bio = bio_alloc(bh_bdev(bh), 1, opf, GFP_NOIO);
 
 	fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 60456263a338..77691f2b2565 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -671,7 +671,7 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
 	sector = start_sector << (sdio->blkbits - 9);
 	nr_pages = bio_max_segs(sdio->pages_in_io);
 	BUG_ON(nr_pages <= 0);
-	dio_bio_alloc(dio, sdio, map_bh->b_bdev, sector, nr_pages);
+	dio_bio_alloc(dio, sdio, bh_bdev(map_bh), sector, nr_pages);
 	sdio->boundary = 0;
 out:
 	return ret;
@@ -946,7 +946,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 					map_bh->b_blocknr << sdio->blkfactor;
 				if (buffer_new(map_bh)) {
 					clean_bdev_aliases(
-						map_bh->b_bdev,
+						map_bh->b_bdev_file,
 						map_bh->b_blocknr,
 						map_bh->b_size >> i_blkbits);
 				}
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index dc2d43abe8c5..6127ff1ba453 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -204,6 +204,7 @@ int erofs_map_dev(struct super_block *sb, struct erofs_map_dev *map)
 	int id;
 
 	map->m_bdev = sb->s_bdev;
+	map->m_bdev_file = sb->s_bdev_file;
 	map->m_daxdev = EROFS_SB(sb)->dax_dev;
 	map->m_dax_part_off = EROFS_SB(sb)->dax_part_off;
 	map->m_fscache = EROFS_SB(sb)->s_fscache;
@@ -278,7 +279,7 @@ static int erofs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	if (flags & IOMAP_DAX)
 		iomap->dax_dev = mdev.m_daxdev;
 	else
-		iomap->bdev = mdev.m_bdev;
+		iomap->bdev_file = mdev.m_bdev_file;
 	iomap->length = map.m_llen;
 	iomap->flags = 0;
 	iomap->private = NULL;
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index 0f0706325b7b..50f8a7f161fd 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -377,6 +377,7 @@ enum {
 
 struct erofs_map_dev {
 	struct erofs_fscache *m_fscache;
+	struct file *m_bdev_file;
 	struct block_device *m_bdev;
 	struct dax_device *m_daxdev;
 	u64 m_dax_part_off;
diff --git a/fs/erofs/zmap.c b/fs/erofs/zmap.c
index e313c936351d..6da3083e8252 100644
--- a/fs/erofs/zmap.c
+++ b/fs/erofs/zmap.c
@@ -739,7 +739,7 @@ static int z_erofs_iomap_begin_report(struct inode *inode, loff_t offset,
 	if (ret < 0)
 		return ret;
 
-	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->bdev_file = inode->i_sb->s_bdev_file;
 	iomap->offset = map.m_la;
 	iomap->length = map.m_llen;
 	if (map.m_flags & EROFS_MAP_MAPPED) {
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f3d570a9302b..32555734e727 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -744,7 +744,7 @@ static int ext2_get_blocks(struct inode *inode,
 		 * We must unmap blocks before zeroing so that writeback cannot
 		 * overwrite zeros with stale data from block device page cache.
 		 */
-		clean_bdev_aliases(inode->i_sb->s_bdev,
+		clean_bdev_aliases(inode->i_sb->s_bdev_file,
 				   le32_to_cpu(chain[depth-1].key),
 				   count);
 		/*
@@ -842,7 +842,7 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	if (flags & IOMAP_DAX)
 		iomap->dax_dev = sbi->s_daxdev;
 	else
-		iomap->bdev = inode->i_sb->s_bdev;
+		iomap->bdev_file = inode->i_sb->s_bdev_file;
 
 	if (ret == 0) {
 		/*
diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index c885dcc3bd0d..42e595e87a74 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -80,7 +80,7 @@
 	} while (0)
 # define ea_bdebug(bh, f...) do { \
 		printk(KERN_DEBUG "block %pg:%lu: ", \
-			bh->b_bdev, (unsigned long) bh->b_blocknr); \
+			bh_bdev(bh), (unsigned long) bh->b_blocknr); \
 		printk(f); \
 		printk("\n"); \
 	} while (0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ccf3b5e3a7c..eb861ca94e63 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1791,11 +1791,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
  * reserve space for a single block.
  *
  * For delayed buffer_head we have BH_Mapped, BH_New, BH_Delay set.
- * We also have b_blocknr = -1 and b_bdev initialized properly
+ * We also have b_blocknr = -1 and b_bdev_file initialized properly
  *
  * For unwritten buffer_head we have BH_Mapped, BH_New, BH_Unwritten set.
- * We also have b_blocknr = physicalblock mapping unwritten extent and b_bdev
- * initialized properly.
+ * We also have b_blocknr = physicalblock mapping unwritten extent and
+ * b_bdev_file initialized properly.
  */
 int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 			   struct buffer_head *bh, int create)
@@ -3235,7 +3235,7 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 	if (flags & IOMAP_DAX)
 		iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
 	else
-		iomap->bdev = inode->i_sb->s_bdev;
+		iomap->bdev_file = inode->i_sb->s_bdev_file;
 	iomap->offset = (u64) map->m_lblk << blkbits;
 	iomap->length = (u64) map->m_len << blkbits;
 
diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index bd946d0c71b7..5641bd34d021 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -384,7 +384,7 @@ int ext4_multi_mount_protect(struct super_block *sb,
 
 	BUILD_BUG_ON(sizeof(mmp->mmp_bdevname) < BDEVNAME_SIZE);
 	snprintf(mmp->mmp_bdevname, sizeof(mmp->mmp_bdevname),
-		 "%pg", bh->b_bdev);
+		 "%pg", bh_bdev(bh));
 
 	/*
 	 * Start a kernel thread to update the MMP block periodically.
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 312bc6813357..b0c3de39daa1 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -93,8 +93,7 @@ struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end)
 static void buffer_io_error(struct buffer_head *bh)
 {
 	printk_ratelimited(KERN_ERR "Buffer I/O error on device %pg, logical block %llu\n",
-		       bh->b_bdev,
-			(unsigned long long)bh->b_blocknr);
+		       bh_bdev(bh), (unsigned long long)bh->b_blocknr);
 }
 
 static void ext4_finish_bio(struct bio *bio)
@@ -397,7 +396,7 @@ static void io_submit_init_bio(struct ext4_io_submit *io,
 	 * bio_alloc will _always_ be able to allocate a bio if
 	 * __GFP_DIRECT_RECLAIM is set, see comments for bio_alloc_bioset().
 	 */
-	bio = bio_alloc(bh->b_bdev, BIO_MAX_VECS, REQ_OP_WRITE, GFP_NOIO);
+	bio = bio_alloc(bh_bdev(bh), BIO_MAX_VECS, REQ_OP_WRITE, GFP_NOIO);
 	fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_end_io = ext4_end_bio;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4df1a5cfe0a5..d2ca92bf5f7e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -261,7 +261,7 @@ struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
 
 void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
 {
-	struct buffer_head *bh = bdev_getblk(sb->s_bdev, block,
+	struct buffer_head *bh = bdev_getblk(sb->s_bdev_file, block,
 			sb->s_blocksize, GFP_NOWAIT | __GFP_NOWARN);
 
 	if (likely(bh)) {
@@ -5862,7 +5862,7 @@ static struct file *ext4_get_journal_blkdev(struct super_block *sb,
 	sb_block = EXT4_MIN_BLOCK_SIZE / blocksize;
 	offset = EXT4_MIN_BLOCK_SIZE % blocksize;
 	set_blocksize(bdev, blocksize);
-	bh = __bread(bdev, sb_block, blocksize);
+	bh = __bread(bdev_file, sb_block, blocksize);
 	if (!bh) {
 		ext4_msg(sb, KERN_ERR, "couldn't read superblock of "
 		       "external journal");
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 82dc5e673d5c..41128ccec2ec 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -68,7 +68,7 @@
 	       inode->i_sb->s_id, inode->i_ino, ##__VA_ARGS__)
 # define ea_bdebug(bh, fmt, ...)					\
 	printk(KERN_DEBUG "block %pg:%lu: " fmt "\n",			\
-	       bh->b_bdev, (unsigned long)bh->b_blocknr, ##__VA_ARGS__)
+	       bh_bdev(bh), (unsigned long)bh->b_blocknr, ##__VA_ARGS__)
 #else
 # define ea_idebug(inode, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
 # define ea_bdebug(bh, fmt, ...)	no_printk(fmt, ##__VA_ARGS__)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 05158f89ef32..8ec12b3716bc 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1606,6 +1606,7 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map, int flag)
 		goto out;
 
 	map->m_bdev = inode->i_sb->s_bdev;
+	map->m_bdev_file = inode->i_sb->s_bdev_file;
 	map->m_multidev_dio =
 		f2fs_allow_multi_device_dio(F2FS_I_SB(inode), flag);
 
@@ -1724,8 +1725,10 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map, int flag)
 		map->m_pblk = blkaddr;
 		map->m_len = 1;
 
-		if (map->m_multidev_dio)
+		if (map->m_multidev_dio) {
 			map->m_bdev = FDEV(bidx).bdev;
+			map->m_bdev_file = FDEV(bidx).bdev_file;
+		}
 	} else if ((map->m_pblk != NEW_ADDR &&
 			blkaddr == (map->m_pblk + ofs)) ||
 			(map->m_pblk == NEW_ADDR && blkaddr == NEW_ADDR) ||
@@ -4250,7 +4253,7 @@ static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		iomap->length = blks_to_bytes(inode, map.m_len);
 		iomap->type = IOMAP_MAPPED;
 		iomap->flags |= IOMAP_F_MERGED;
-		iomap->bdev = map.m_bdev;
+		iomap->bdev_file = map.m_bdev_file;
 		iomap->addr = blks_to_bytes(inode, map.m_pblk);
 	} else {
 		if (flags & IOMAP_WRITE)
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index cc481d7b9287..ed36c11325cd 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -697,6 +697,7 @@ struct extent_tree_info {
 				F2FS_MAP_DELALLOC)
 
 struct f2fs_map_blocks {
+	struct file *m_bdev_file;	/* for multi-device dio */
 	struct block_device *m_bdev;	/* for multi-device dio */
 	block_t m_pblk;
 	block_t m_lblk;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d170bb..24966e93a237 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -575,7 +575,7 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
 
 	iomap->offset = pos;
 	iomap->flags = 0;
-	iomap->bdev = NULL;
+	iomap->bdev_file = NULL;
 	iomap->dax_dev = fc->dax->dev;
 
 	/*
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 974aca9c8ea8..0e4e295ebf49 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -622,7 +622,7 @@ static void gfs2_discard(struct gfs2_sbd *sdp, struct buffer_head *bh)
 			spin_unlock(&sdp->sd_ail_lock);
 		}
 	}
-	bh->b_bdev = NULL;
+	bh->b_bdev_file = NULL;
 	clear_buffer_mapped(bh);
 	clear_buffer_req(bh);
 	clear_buffer_new(bh);
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 789af5c8fade..ef4e7ad83d4c 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -926,7 +926,7 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
 		iomap->flags |= IOMAP_F_GFS2_BOUNDARY;
 
 out:
-	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->bdev_file = inode->i_sb->s_bdev_file;
 unlock:
 	up_read(&ip->i_rw_mutex);
 	return ret;
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index f814054c8cd0..2052d3fc2c24 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -218,7 +218,7 @@ static void gfs2_submit_bhs(blk_opf_t opf, struct buffer_head *bhs[], int num)
 		struct buffer_head *bh = *bhs;
 		struct bio *bio;
 
-		bio = bio_alloc(bh->b_bdev, num, opf, GFP_NOIO);
+		bio = bio_alloc(bh_bdev(bh), num, opf, GFP_NOIO);
 		bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 		while (num > 0) {
 			bh = *bhs;
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 1bb8d97cd9ae..7353d0e2f35a 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -128,7 +128,7 @@ static int hpfs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	if (WARN_ON_ONCE(flags & (IOMAP_WRITE | IOMAP_ZERO)))
 		return -EINVAL;
 
-	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->bdev_file = inode->i_sb->s_bdev_file;
 	iomap->offset = offset;
 
 	hpfs_lock(sb);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2ad0e287c704..2fc8abd693da 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -415,7 +415,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 
 		if (ctx->rac) /* same as readahead_gfp_mask */
 			gfp |= __GFP_NORETRY | __GFP_NOWARN;
-		ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
+		ctx->bio = bio_alloc(iomap_bdev(iomap), bio_max_segs(nr_vecs),
 				     REQ_OP_READ, gfp);
 		/*
 		 * If the bio_alloc fails, try it again for a single page to
@@ -423,7 +423,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		 * what do_mpage_read_folio does.
 		 */
 		if (!ctx->bio) {
-			ctx->bio = bio_alloc(iomap->bdev, 1, REQ_OP_READ,
+			ctx->bio = bio_alloc(iomap_bdev(iomap), 1, REQ_OP_READ,
 					     orig_gfp);
 		}
 		if (ctx->rac)
@@ -662,7 +662,7 @@ static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
 	struct bio_vec bvec;
 	struct bio bio;
 
-	bio_init(&bio, iomap->bdev, &bvec, 1, REQ_OP_READ);
+	bio_init(&bio, iomap_bdev(iomap), &bvec, 1, REQ_OP_READ);
 	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
 	bio_add_folio_nofail(&bio, folio, plen, poff);
 	return submit_bio_wait(&bio);
@@ -1684,7 +1684,7 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 	struct iomap_ioend *ioend;
 	struct bio *bio;
 
-	bio = bio_alloc_bioset(wpc->iomap.bdev, BIO_MAX_VECS,
+	bio = bio_alloc_bioset(iomap_bdev(&wpc->iomap), BIO_MAX_VECS,
 			       REQ_OP_WRITE | wbc_to_write_flags(wbc),
 			       GFP_NOFS, &iomap_ioend_bioset);
 	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..42518754c65d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -56,9 +56,9 @@ static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
 		struct iomap_dio *dio, unsigned short nr_vecs, blk_opf_t opf)
 {
 	if (dio->dops && dio->dops->bio_set)
-		return bio_alloc_bioset(iter->iomap.bdev, nr_vecs, opf,
+		return bio_alloc_bioset(iomap_bdev(&iter->iomap), nr_vecs, opf,
 					GFP_KERNEL, dio->dops->bio_set);
-	return bio_alloc(iter->iomap.bdev, nr_vecs, opf, GFP_KERNEL);
+	return bio_alloc(iomap_bdev(&iter->iomap), nr_vecs, opf, GFP_KERNEL);
 }
 
 static void iomap_dio_submit_bio(const struct iomap_iter *iter,
@@ -288,8 +288,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
-	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
-	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
+	if ((pos | length) & (bdev_logical_block_size(iomap_bdev(iomap)) - 1) ||
+	    !bdev_iter_is_aligned(iomap_bdev(iomap), dio->submit.iter))
 		return -EINVAL;
 
 	if (iomap->type == IOMAP_UNWRITTEN) {
@@ -316,7 +316,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
 		    (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
-		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
+		    (bdev_fua(iomap_bdev(iomap)) ||
+			      !bdev_write_cache(iomap_bdev(iomap))))
 			use_fua = true;
 		else if (dio->flags & IOMAP_DIO_NEED_SYNC)
 			dio->flags &= ~IOMAP_DIO_CALLER_COMP;
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index 5fc0ac36dee3..20bd67e85d15 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -116,7 +116,7 @@ static loff_t iomap_swapfile_iter(const struct iomap_iter *iter,
 		return iomap_swapfile_fail(isi, "has shared extents");
 
 	/* Only one bdev per swap file. */
-	if (iomap->bdev != isi->sis->bdev)
+	if (iomap_bdev(iomap) != isi->sis->bdev)
 		return iomap_swapfile_fail(isi, "outside the main device");
 
 	if (isi->iomap.length == 0) {
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index c16fd55f5595..43fb3ce21674 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -134,7 +134,7 @@ DECLARE_EVENT_CLASS(iomap_class,
 		__entry->length = iomap->length;
 		__entry->type = iomap->type;
 		__entry->flags = iomap->flags;
-		__entry->bdev = iomap->bdev ? iomap->bdev->bd_dev : 0;
+		__entry->bdev = iomap_bdev(iomap) ? iomap_bdev(iomap)->bd_dev : 0;
 	),
 	TP_printk("dev %d:%d ino 0x%llx bdev %d:%d addr 0x%llx offset 0x%llx "
 		  "length 0x%llx type %s flags %s",
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 5e122586e06e..fffb1b4e2068 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -1014,7 +1014,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 				clear_buffer_mapped(bh);
 				clear_buffer_new(bh);
 				clear_buffer_req(bh);
-				bh->b_bdev = NULL;
+				bh->b_bdev_file = NULL;
 			}
 		}
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index abd42a6ccd0e..bbe5d02801b6 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -434,7 +434,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 
 	folio_set_bh(new_bh, new_folio, new_offset);
 	new_bh->b_size = bh_in->b_size;
-	new_bh->b_bdev = journal->j_dev;
+	new_bh->b_bdev_file = journal->j_dev_file;
 	new_bh->b_blocknr = blocknr;
 	new_bh->b_private = bh_in;
 	set_buffer_mapped(new_bh);
@@ -880,7 +880,7 @@ int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out)
 	if (ret)
 		return ret;
 
-	bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
+	bh = __getblk(journal->j_dev_file, pblock, journal->j_blocksize);
 	if (!bh)
 		return -ENOMEM;
 
@@ -1007,7 +1007,7 @@ jbd2_journal_get_descriptor_buffer(transaction_t *transaction, int type)
 	if (err)
 		return NULL;
 
-	bh = __getblk(journal->j_dev, blocknr, journal->j_blocksize);
+	bh = __getblk(journal->j_dev_file, blocknr, journal->j_blocksize);
 	if (!bh)
 		return NULL;
 	atomic_dec(&transaction->t_outstanding_credits);
@@ -1461,7 +1461,7 @@ static int journal_load_superblock(journal_t *journal)
 	struct buffer_head *bh;
 	journal_superblock_t *sb;
 
-	bh = getblk_unmovable(journal->j_dev, journal->j_blk_offset,
+	bh = getblk_unmovable(journal->j_dev_file, journal->j_blk_offset,
 			      journal->j_blocksize);
 	if (bh)
 		err = bh_read(bh, 0);
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 1f7664984d6e..7b561e2c6a7c 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -92,7 +92,7 @@ static int do_readahead(journal_t *journal, unsigned int start)
 			goto failed;
 		}
 
-		bh = __getblk(journal->j_dev, blocknr, journal->j_blocksize);
+		bh = __getblk(journal->j_dev_file, blocknr, journal->j_blocksize);
 		if (!bh) {
 			err = -ENOMEM;
 			goto failed;
@@ -148,7 +148,7 @@ static int jread(struct buffer_head **bhp, journal_t *journal,
 		return err;
 	}
 
-	bh = __getblk(journal->j_dev, blocknr, journal->j_blocksize);
+	bh = __getblk(journal->j_dev_file, blocknr, journal->j_blocksize);
 	if (!bh)
 		return -ENOMEM;
 
@@ -370,7 +370,7 @@ int jbd2_journal_skip_recovery(journal_t *journal)
 		journal->j_head = journal->j_first;
 	} else {
 #ifdef CONFIG_JBD2_DEBUG
-		int dropped = info.end_transaction - 
+		int dropped = info.end_transaction -
 			be32_to_cpu(journal->j_superblock->s_sequence);
 		jbd2_debug(1,
 			  "JBD2: ignoring %d transaction%s from the journal.\n",
@@ -672,7 +672,7 @@ static int do_one_pass(journal_t *journal,
 
 					/* Find a buffer for the new
 					 * data being restored */
-					nbh = __getblk(journal->j_fs_dev,
+					nbh = __getblk(journal->j_fs_dev_file,
 							blocknr,
 							journal->j_blocksize);
 					if (nbh == NULL) {
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 4556e4689024..99c2758539a8 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -328,7 +328,7 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
 {
 	struct buffer_head *bh = NULL;
 	journal_t *journal;
-	struct block_device *bdev;
+	struct file *file;
 	int err;
 
 	might_sleep();
@@ -341,11 +341,11 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
 		return -EINVAL;
 	}
 
-	bdev = journal->j_fs_dev;
+	file = journal->j_fs_dev_file;
 	bh = bh_in;
 
 	if (!bh) {
-		bh = __find_get_block(bdev, blocknr, journal->j_blocksize);
+		bh = __find_get_block(file, blocknr, journal->j_blocksize);
 		if (bh)
 			BUFFER_TRACE(bh, "found on hash");
 	}
@@ -355,7 +355,7 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
 
 		/* If there is a different buffer_head lying around in
 		 * memory anywhere... */
-		bh2 = __find_get_block(bdev, blocknr, journal->j_blocksize);
+		bh2 = __find_get_block(file, blocknr, journal->j_blocksize);
 		if (bh2) {
 			/* ... and it has RevokeValid status... */
 			if (bh2 != bh && buffer_revokevalid(bh2))
@@ -466,7 +466,8 @@ int jbd2_journal_cancel_revoke(handle_t *handle, struct journal_head *jh)
 	 * state machine will get very upset later on. */
 	if (need_cancel) {
 		struct buffer_head *bh2;
-		bh2 = __find_get_block(bh->b_bdev, bh->b_blocknr, bh->b_size);
+		bh2 = __find_get_block(bh->b_bdev_file, bh->b_blocknr,
+				       bh->b_size);
 		if (bh2) {
 			if (bh2 != bh)
 				clear_buffer_revoked(bh2);
@@ -495,7 +496,7 @@ void jbd2_clear_buffer_revoked_flags(journal_t *journal)
 			struct jbd2_revoke_record_s *record;
 			struct buffer_head *bh;
 			record = (struct jbd2_revoke_record_s *)list_entry;
-			bh = __find_get_block(journal->j_fs_dev,
+			bh = __find_get_block(journal->j_fs_dev_file,
 					      record->blocknr,
 					      journal->j_blocksize);
 			if (bh) {
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index cb0b8d6fc0c6..30ebc93dc430 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -929,7 +929,7 @@ static void warn_dirty_buffer(struct buffer_head *bh)
 	       "JBD2: Spotted dirty metadata buffer (dev = %pg, blocknr = %llu). "
 	       "There's a risk of filesystem corruption in case of system "
 	       "crash.\n",
-	       bh->b_bdev, (unsigned long long)bh->b_blocknr);
+	       bh_bdev(bh), (unsigned long long)bh->b_blocknr);
 }
 
 /* Call t_frozen trigger and copy buffer data into jh->b_frozen_data. */
@@ -990,7 +990,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 	/* If it takes too long to lock the buffer, trace it */
 	time_lock = jbd2_time_diff(start_lock, jiffies);
 	if (time_lock > HZ/10)
-		trace_jbd2_lock_buffer_stall(bh->b_bdev->bd_dev,
+		trace_jbd2_lock_buffer_stall(bh_bdev(bh)->bd_dev,
 			jiffies_to_msecs(time_lock));
 
 	/* We now hold the buffer lock so it is safe to query the buffer
@@ -2374,7 +2374,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 			write_unlock(&journal->j_state_lock);
 			jbd2_journal_put_journal_head(jh);
 			/* Already zapped buffer? Nothing to do... */
-			if (!bh->b_bdev)
+			if (!bh_bdev(bh))
 				return 0;
 			return -EBUSY;
 		}
@@ -2428,7 +2428,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 	clear_buffer_new(bh);
 	clear_buffer_delay(bh);
 	clear_buffer_unwritten(bh);
-	bh->b_bdev = NULL;
+	bh->b_bdev_file = NULL;
 	return may_free;
 }
 
diff --git a/fs/mpage.c b/fs/mpage.c
index 738882e0766d..ef6e72eec312 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -126,7 +126,12 @@ static void map_buffer_to_folio(struct folio *folio, struct buffer_head *bh,
 	do {
 		if (block == page_block) {
 			page_bh->b_state = bh->b_state;
-			page_bh->b_bdev = bh->b_bdev;
+			if (buffer_bdev(bh)) {
+				page_bh->b_bdev = bh->b_bdev;
+				set_buffer_bdev(page_bh);
+			} else {
+				page_bh->b_bdev_file = bh->b_bdev_file;
+			}
 			page_bh->b_blocknr = bh->b_blocknr;
 			break;
 		}
@@ -216,7 +221,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 			page_block++;
 			block_in_file++;
 		}
-		bdev = map_bh->b_bdev;
+		bdev = bh_bdev(map_bh);
 	}
 
 	/*
@@ -272,7 +277,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 			page_block++;
 			block_in_file++;
 		}
-		bdev = map_bh->b_bdev;
+		bdev = bh_bdev(map_bh);
 	}
 
 	if (first_hole != blocks_per_page) {
@@ -472,7 +477,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 	struct block_device *bdev = NULL;
 	int boundary = 0;
 	sector_t boundary_block = 0;
-	struct block_device *boundary_bdev = NULL;
+	struct file *boundary_bdev_file = NULL;
 	size_t length;
 	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
@@ -513,9 +518,9 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 			boundary = buffer_boundary(bh);
 			if (boundary) {
 				boundary_block = bh->b_blocknr;
-				boundary_bdev = bh->b_bdev;
+				boundary_bdev_file = bh->b_bdev_file;
 			}
-			bdev = bh->b_bdev;
+			bdev = bh_bdev(bh);
 		} while ((bh = bh->b_this_page) != head);
 
 		if (first_unmapped)
@@ -549,13 +554,16 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 		map_bh.b_size = 1 << blkbits;
 		if (mpd->get_block(inode, block_in_file, &map_bh, 1))
 			goto confused;
+		/* This helper cannot be used from the block layer directly. */
+		if (WARN_ON_ONCE(buffer_bdev(&map_bh)))
+			goto confused;
 		if (!buffer_mapped(&map_bh))
 			goto confused;
 		if (buffer_new(&map_bh))
 			clean_bdev_bh_alias(&map_bh);
 		if (buffer_boundary(&map_bh)) {
 			boundary_block = map_bh.b_blocknr;
-			boundary_bdev = map_bh.b_bdev;
+			boundary_bdev_file = map_bh.b_bdev_file;
 		}
 		if (page_block) {
 			if (map_bh.b_blocknr != first_block + page_block)
@@ -565,7 +573,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 		}
 		page_block++;
 		boundary = buffer_boundary(&map_bh);
-		bdev = map_bh.b_bdev;
+		bdev = bh_bdev(&map_bh);
 		if (block_in_file == last_block)
 			break;
 		block_in_file++;
@@ -627,7 +635,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 	if (boundary || (first_unmapped != blocks_per_page)) {
 		bio = mpage_bio_submit_write(bio);
 		if (boundary_block) {
-			write_boundary_block(boundary_bdev,
+			write_boundary_block(boundary_bdev_file,
 					boundary_block, 1 << blkbits);
 		}
 	} else {
diff --git a/fs/nilfs2/btnode.c b/fs/nilfs2/btnode.c
index 0131d83b912d..0620bccbf6e0 100644
--- a/fs/nilfs2/btnode.c
+++ b/fs/nilfs2/btnode.c
@@ -59,7 +59,7 @@ nilfs_btnode_create_block(struct address_space *btnc, __u64 blocknr)
 		BUG();
 	}
 	memset(bh->b_data, 0, i_blocksize(inode));
-	bh->b_bdev = inode->i_sb->s_bdev;
+	bh->b_bdev_file = inode->i_sb->s_bdev_file;
 	bh->b_blocknr = blocknr;
 	set_buffer_mapped(bh);
 	set_buffer_uptodate(bh);
@@ -118,7 +118,7 @@ int nilfs_btnode_submit_block(struct address_space *btnc, __u64 blocknr,
 		goto found;
 	}
 	set_buffer_mapped(bh);
-	bh->b_bdev = inode->i_sb->s_bdev;
+	bh->b_bdev_file = inode->i_sb->s_bdev_file;
 	bh->b_blocknr = pblocknr; /* set block address for read */
 	bh->b_end_io = end_buffer_read_sync;
 	get_bh(bh);
diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c
index bf9a11d58817..77d4b9275b87 100644
--- a/fs/nilfs2/gcinode.c
+++ b/fs/nilfs2/gcinode.c
@@ -84,7 +84,7 @@ int nilfs_gccache_submit_read_data(struct inode *inode, sector_t blkoff,
 	}
 
 	if (!buffer_mapped(bh)) {
-		bh->b_bdev = inode->i_sb->s_bdev;
+		bh->b_bdev_file = inode->i_sb->s_bdev_file;
 		set_buffer_mapped(bh);
 	}
 	bh->b_blocknr = pbn;
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 4f792a0ad0f0..99cf302ce116 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -89,7 +89,7 @@ static int nilfs_mdt_create_block(struct inode *inode, unsigned long block,
 	if (buffer_uptodate(bh))
 		goto failed_bh;
 
-	bh->b_bdev = sb->s_bdev;
+	bh->b_bdev_file = sb->s_bdev_file;
 	err = nilfs_mdt_insert_new_block(inode, block, bh, init_block);
 	if (likely(!err)) {
 		get_bh(bh);
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 14e470fb8870..f893d7e2e472 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -111,7 +111,7 @@ void nilfs_copy_buffer(struct buffer_head *dbh, struct buffer_head *sbh)
 
 	dbh->b_state = sbh->b_state & NILFS_BUFFER_INHERENT_BITS;
 	dbh->b_blocknr = sbh->b_blocknr;
-	dbh->b_bdev = sbh->b_bdev;
+	dbh->b_bdev_file = sbh->b_bdev_file;
 
 	bh = dbh;
 	bits = sbh->b_state & (BIT(BH_Uptodate) | BIT(BH_Mapped));
@@ -216,7 +216,7 @@ static void nilfs_copy_folio(struct folio *dst, struct folio *src,
 		lock_buffer(dbh);
 		dbh->b_state = sbh->b_state & mask;
 		dbh->b_blocknr = sbh->b_blocknr;
-		dbh->b_bdev = sbh->b_bdev;
+		dbh->b_bdev_file = sbh->b_bdev_file;
 		sbh = sbh->b_this_page;
 		dbh = dbh->b_this_page;
 	} while (dbh != dbufs);
diff --git a/fs/nilfs2/recovery.c b/fs/nilfs2/recovery.c
index a9b8d77c8c1d..e2f5dcc923c7 100644
--- a/fs/nilfs2/recovery.c
+++ b/fs/nilfs2/recovery.c
@@ -107,7 +107,8 @@ static int nilfs_compute_checksum(struct the_nilfs *nilfs,
 		do {
 			struct buffer_head *bh;
 
-			bh = __bread(nilfs->ns_bdev, ++start, blocksize);
+			bh = __bread(nilfs->ns_sb->s_bdev_file, ++start,
+				     blocksize);
 			if (!bh)
 				return -EIO;
 			check_bytes -= size;
@@ -136,7 +137,8 @@ int nilfs_read_super_root_block(struct the_nilfs *nilfs, sector_t sr_block,
 	int ret;
 
 	*pbh = NULL;
-	bh_sr = __bread(nilfs->ns_bdev, sr_block, nilfs->ns_blocksize);
+	bh_sr = __bread(nilfs->ns_sb->s_bdev_file, sr_block,
+			nilfs->ns_blocksize);
 	if (unlikely(!bh_sr)) {
 		ret = NILFS_SEG_FAIL_IO;
 		goto failed;
@@ -183,7 +185,8 @@ nilfs_read_log_header(struct the_nilfs *nilfs, sector_t start_blocknr,
 {
 	struct buffer_head *bh_sum;
 
-	bh_sum = __bread(nilfs->ns_bdev, start_blocknr, nilfs->ns_blocksize);
+	bh_sum = __bread(nilfs->ns_sb->s_bdev_file, start_blocknr,
+			 nilfs->ns_blocksize);
 	if (bh_sum)
 		*sum = (struct nilfs_segment_summary *)bh_sum->b_data;
 	return bh_sum;
@@ -250,7 +253,7 @@ static void *nilfs_read_summary_info(struct the_nilfs *nilfs,
 	if (bytes > (*pbh)->b_size - *offset) {
 		blocknr = (*pbh)->b_blocknr;
 		brelse(*pbh);
-		*pbh = __bread(nilfs->ns_bdev, blocknr + 1,
+		*pbh = __bread(nilfs->ns_sb->s_bdev_file, blocknr + 1,
 			       nilfs->ns_blocksize);
 		if (unlikely(!*pbh))
 			return NULL;
@@ -289,7 +292,7 @@ static void nilfs_skip_summary_info(struct the_nilfs *nilfs,
 		*offset = bytes * (count - (bcnt - 1) * nitem_per_block);
 
 		brelse(*pbh);
-		*pbh = __bread(nilfs->ns_bdev, blocknr + bcnt,
+		*pbh = __bread(nilfs->ns_sb->s_bdev_file, blocknr + bcnt,
 			       nilfs->ns_blocksize);
 	}
 }
@@ -318,7 +321,8 @@ static int nilfs_scan_dsync_log(struct the_nilfs *nilfs, sector_t start_blocknr,
 
 	sumbytes = le32_to_cpu(sum->ss_sumbytes);
 	blocknr = start_blocknr + DIV_ROUND_UP(sumbytes, nilfs->ns_blocksize);
-	bh = __bread(nilfs->ns_bdev, start_blocknr, nilfs->ns_blocksize);
+	bh = __bread(nilfs->ns_sb->s_bdev_file, start_blocknr,
+		     nilfs->ns_blocksize);
 	if (unlikely(!bh))
 		goto out;
 
@@ -478,7 +482,8 @@ static int nilfs_recovery_copy_block(struct the_nilfs *nilfs,
 	size_t from = pos & ~PAGE_MASK;
 	void *kaddr;
 
-	bh_org = __bread(nilfs->ns_bdev, rb->blocknr, nilfs->ns_blocksize);
+	bh_org = __bread(nilfs->ns_sb->s_bdev_file, rb->blocknr,
+			 nilfs->ns_blocksize);
 	if (unlikely(!bh_org))
 		return -EIO;
 
@@ -697,7 +702,8 @@ static void nilfs_finish_roll_forward(struct the_nilfs *nilfs,
 	    nilfs_get_segnum_of_block(nilfs, ri->ri_super_root))
 		return;
 
-	bh = __getblk(nilfs->ns_bdev, ri->ri_lsegs_start, nilfs->ns_blocksize);
+	bh = __getblk(nilfs->ns_sb->s_bdev_file, ri->ri_lsegs_start,
+		      nilfs->ns_blocksize);
 	BUG_ON(!bh);
 	memset(bh->b_data, 0, bh->b_size);
 	set_buffer_dirty(bh);
@@ -823,7 +829,8 @@ int nilfs_search_super_root(struct the_nilfs *nilfs,
 	/* Read ahead segment */
 	b = seg_start;
 	while (b <= seg_end)
-		__breadahead(nilfs->ns_bdev, b++, nilfs->ns_blocksize);
+		__breadahead(nilfs->ns_sb->s_bdev_file, b++,
+			     nilfs->ns_blocksize);
 
 	for (;;) {
 		brelse(bh_sum);
@@ -869,7 +876,7 @@ int nilfs_search_super_root(struct the_nilfs *nilfs,
 		if (pseg_start == seg_start) {
 			nilfs_get_segment_range(nilfs, nextnum, &b, &end);
 			while (b <= end)
-				__breadahead(nilfs->ns_bdev, b++,
+				__breadahead(nilfs->ns_sb->s_bdev_file, b++,
 					     nilfs->ns_blocksize);
 		}
 		if (!(flags & NILFS_SS_SR)) {
diff --git a/fs/ntfs3/fsntfs.c b/fs/ntfs3/fsntfs.c
index ae2ef5c11868..def075a25b2c 100644
--- a/fs/ntfs3/fsntfs.c
+++ b/fs/ntfs3/fsntfs.c
@@ -1033,14 +1033,13 @@ struct buffer_head *ntfs_bread(struct super_block *sb, sector_t block)
 
 int ntfs_sb_read(struct super_block *sb, u64 lbo, size_t bytes, void *buffer)
 {
-	struct block_device *bdev = sb->s_bdev;
 	u32 blocksize = sb->s_blocksize;
 	u64 block = lbo >> sb->s_blocksize_bits;
 	u32 off = lbo & (blocksize - 1);
 	u32 op = blocksize - off;
 
 	for (; bytes; block += 1, off = 0, op = blocksize) {
-		struct buffer_head *bh = __bread(bdev, block, blocksize);
+		struct buffer_head *bh = __bread(sb->s_bdev_file, block, blocksize);
 
 		if (!bh)
 			return -EIO;
@@ -1063,7 +1062,6 @@ int ntfs_sb_write(struct super_block *sb, u64 lbo, size_t bytes,
 		  const void *buf, int wait)
 {
 	u32 blocksize = sb->s_blocksize;
-	struct block_device *bdev = sb->s_bdev;
 	sector_t block = lbo >> sb->s_blocksize_bits;
 	u32 off = lbo & (blocksize - 1);
 	u32 op = blocksize - off;
@@ -1077,14 +1075,14 @@ int ntfs_sb_write(struct super_block *sb, u64 lbo, size_t bytes,
 			op = bytes;
 
 		if (op < blocksize) {
-			bh = __bread(bdev, block, blocksize);
+			bh = __bread(sb->s_bdev_file, block, blocksize);
 			if (!bh) {
 				ntfs_err(sb, "failed to read block %llx",
 					 (u64)block);
 				return -EIO;
 			}
 		} else {
-			bh = __getblk(bdev, block, blocksize);
+			bh = __getblk(sb->s_bdev_file, block, blocksize);
 			if (!bh)
 				return -ENOMEM;
 		}
diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index 3c4c878f6d77..a97eedc5130f 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -609,7 +609,7 @@ static noinline int ntfs_get_block_vbo(struct inode *inode, u64 vbo,
 	lbo = ((u64)lcn << cluster_bits) + off;
 
 	set_buffer_mapped(bh);
-	bh->b_bdev = sb->s_bdev;
+	bh->b_bdev_file = sb->s_bdev_file;
 	bh->b_blocknr = lbo >> sb->s_blocksize_bits;
 
 	valid = ni->i_valid;
diff --git a/fs/ntfs3/super.c b/fs/ntfs3/super.c
index cef5467fd928..aa7c6a8b04de 100644
--- a/fs/ntfs3/super.c
+++ b/fs/ntfs3/super.c
@@ -1642,7 +1642,7 @@ void ntfs_unmap_meta(struct super_block *sb, CLST lcn, CLST len)
 		limit >>= 1;
 
 	while (blocks--) {
-		clean_bdev_aliases(bdev, devblock++, 1);
+		clean_bdev_aliases(sb->s_bdev_file, devblock++, 1);
 		if (cnt++ >= limit) {
 			sync_blockdev(bdev);
 			cnt = 0;
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 604fea3a26ff..4ad64997f3c7 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -1209,7 +1209,7 @@ static int ocfs2_force_read_journal(struct inode *inode)
 		}
 
 		for (i = 0; i < p_blocks; i++, p_blkno++) {
-			bh = __find_get_block(osb->sb->s_bdev, p_blkno,
+			bh = __find_get_block(osb->sb->s_bdev_file, p_blkno,
 					osb->sb->s_blocksize);
 			/* block not cached. */
 			if (!bh)
diff --git a/fs/reiserfs/fix_node.c b/fs/reiserfs/fix_node.c
index 6c13a8d9a73c..2b288b1539d9 100644
--- a/fs/reiserfs/fix_node.c
+++ b/fs/reiserfs/fix_node.c
@@ -2332,7 +2332,7 @@ static void tb_buffer_sanity_check(struct super_block *sb,
 				       "in tree %s[%d] (%b)",
 				       descr, level, bh);
 
-		if (bh->b_bdev != sb->s_bdev)
+		if (bh_bdev(bh) != sb->s_bdev)
 			reiserfs_panic(sb, "jmacd-4", "buffer has wrong "
 				       "device %s[%d] (%b)",
 				       descr, level, bh);
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 6474529c4253..4d07d2f26317 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -618,7 +618,7 @@ static void reiserfs_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 	if (buffer_journaled(bh)) {
 		reiserfs_warning(NULL, "clm-2084",
 				 "pinned buffer %lu:%pg sent to disk",
-				 bh->b_blocknr, bh->b_bdev);
+				 bh->b_blocknr, bh_bdev(bh));
 	}
 	if (uptodate)
 		set_buffer_uptodate(bh);
@@ -2315,7 +2315,7 @@ static int journal_read_transaction(struct super_block *sb,
  * from other places.
  * Note: Do not use journal_getblk/sb_getblk functions here!
  */
-static struct buffer_head *reiserfs_breada(struct block_device *dev,
+static struct buffer_head *reiserfs_breada(struct file *bdev_file,
 					   b_blocknr_t block, int bufsize,
 					   b_blocknr_t max_block)
 {
@@ -2324,7 +2324,7 @@ static struct buffer_head *reiserfs_breada(struct block_device *dev,
 	struct buffer_head *bh;
 	int i, j;
 
-	bh = __getblk(dev, block, bufsize);
+	bh = __getblk(bdev_file, block, bufsize);
 	if (!bh || buffer_uptodate(bh))
 		return (bh);
 
@@ -2334,7 +2334,7 @@ static struct buffer_head *reiserfs_breada(struct block_device *dev,
 	bhlist[0] = bh;
 	j = 1;
 	for (i = 1; i < blocks; i++) {
-		bh = __getblk(dev, block + i, bufsize);
+		bh = __getblk(bdev_file, block + i, bufsize);
 		if (!bh)
 			break;
 		if (buffer_uptodate(bh)) {
@@ -2447,7 +2447,7 @@ static int journal_read(struct super_block *sb)
 		 * device and journal device to be the same
 		 */
 		d_bh =
-		    reiserfs_breada(file_bdev(journal->j_bdev_file), cur_dblock,
+		    reiserfs_breada(journal->j_bdev_file, cur_dblock,
 				    sb->s_blocksize,
 				    SB_ONDISK_JOURNAL_1st_BLOCK(sb) +
 				    SB_ONDISK_JOURNAL_SIZE(sb));
diff --git a/fs/reiserfs/prints.c b/fs/reiserfs/prints.c
index 84a194b77f19..249a458b6e28 100644
--- a/fs/reiserfs/prints.c
+++ b/fs/reiserfs/prints.c
@@ -156,7 +156,7 @@ static int scnprintf_buffer_head(char *buf, size_t size, struct buffer_head *bh)
 {
 	return scnprintf(buf, size,
 			 "dev %pg, size %zd, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
-			 bh->b_bdev, bh->b_size,
+			 bh_bdev(bh), bh->b_size,
 			 (unsigned long long)bh->b_blocknr,
 			 atomic_read(&(bh->b_count)),
 			 bh->b_state, bh->b_page,
@@ -561,7 +561,7 @@ static int print_super_block(struct buffer_head *bh)
 		return 1;
 	}
 
-	printk("%pg\'s super block is in block %llu\n", bh->b_bdev,
+	printk("%pg\'s super block is in block %llu\n", bh_bdev(bh),
 	       (unsigned long long)bh->b_blocknr);
 	printk("Reiserfs version %s\n", version);
 	printk("Block count %u\n", sb_block_count(rs));
diff --git a/fs/reiserfs/reiserfs.h b/fs/reiserfs/reiserfs.h
index f0e1f29f20ee..49caa7c42fb7 100644
--- a/fs/reiserfs/reiserfs.h
+++ b/fs/reiserfs/reiserfs.h
@@ -2810,10 +2810,10 @@ struct reiserfs_journal_header {
 
 /* We need these to make journal.c code more readable */
 #define journal_find_get_block(s, block) __find_get_block(\
-		file_bdev(SB_JOURNAL(s)->j_bdev_file), block, s->s_blocksize)
-#define journal_getblk(s, block) __getblk(file_bdev(SB_JOURNAL(s)->j_bdev_file),\
+		SB_JOURNAL(s)->j_bdev_file, block, s->s_blocksize)
+#define journal_getblk(s, block) __getblk(SB_JOURNAL(s)->j_bdev_file,\
 		block, s->s_blocksize)
-#define journal_bread(s, block) __bread(file_bdev(SB_JOURNAL(s)->j_bdev_file),\
+#define journal_bread(s, block) __bread(SB_JOURNAL(s)->j_bdev_file,\
 		block, s->s_blocksize)
 
 enum reiserfs_bh_state_bits {
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 5faf702f8d15..23998f071d9c 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -331,7 +331,7 @@ static inline int key_in_buffer(
 	       || chk_path->path_length > MAX_HEIGHT,
 	       "PAP-5050: pointer to the key(%p) is NULL or invalid path length(%d)",
 	       key, chk_path->path_length);
-	RFALSE(!PATH_PLAST_BUFFER(chk_path)->b_bdev,
+	RFALSE(!bh_bdev(PATH_PLAST_BUFFER(chk_path)),
 	       "PAP-5060: device must not be NODEV");
 
 	if (comp_keys(get_lkey(chk_path, sb), key) == 1)
diff --git a/fs/reiserfs/tail_conversion.c b/fs/reiserfs/tail_conversion.c
index 2cec61af2a9e..f38dfae74e32 100644
--- a/fs/reiserfs/tail_conversion.c
+++ b/fs/reiserfs/tail_conversion.c
@@ -187,7 +187,7 @@ void reiserfs_unmap_buffer(struct buffer_head *bh)
 	clear_buffer_mapped(bh);
 	clear_buffer_req(bh);
 	clear_buffer_new(bh);
-	bh->b_bdev = NULL;
+	bh->b_bdev_file = NULL;
 	unlock_buffer(bh);
 }
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 18c8f168b153..c06d41bbb919 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -125,7 +125,7 @@ xfs_bmbt_to_iomap(
 	if (mapping_flags & IOMAP_DAX)
 		iomap->dax_dev = target->bt_daxdev;
 	else
-		iomap->bdev = target->bt_bdev;
+		iomap->bdev_file = target->bt_bdev_file;
 	iomap->flags = iomap_flags;
 
 	if (xfs_ipincount(ip) &&
@@ -150,7 +150,7 @@ xfs_hole_to_iomap(
 	iomap->type = IOMAP_HOLE;
 	iomap->offset = XFS_FSB_TO_B(ip->i_mount, offset_fsb);
 	iomap->length = XFS_FSB_TO_B(ip->i_mount, end_fsb - offset_fsb);
-	iomap->bdev = target->bt_bdev;
+	iomap->bdev_file = target->bt_bdev_file;
 	iomap->dax_dev = target->bt_daxdev;
 }
 
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 8dab4c2ad300..e454d08ad7d0 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -38,7 +38,7 @@ static int zonefs_read_iomap_begin(struct inode *inode, loff_t offset,
 	 * act as if there is a hole up to the file maximum size.
 	 */
 	mutex_lock(&zi->i_truncate_mutex);
-	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->bdev_file = inode->i_sb->s_bdev_file;
 	iomap->offset = ALIGN_DOWN(offset, sb->s_blocksize);
 	isize = i_size_read(inode);
 	if (iomap->offset >= isize) {
@@ -88,7 +88,7 @@ static int zonefs_write_iomap_begin(struct inode *inode, loff_t offset,
 	 * write pointer) and unwriten beyond.
 	 */
 	mutex_lock(&zi->i_truncate_mutex);
-	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->bdev_file = inode->i_sb->s_bdev_file;
 	iomap->offset = ALIGN_DOWN(offset, sb->s_blocksize);
 	iomap->addr = (z->z_sector << SECTOR_SHIFT) + iomap->offset;
 	isize = i_size_read(inode);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1c07848dea7e..79c652f42e57 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -49,7 +49,6 @@ struct block_device {
 	bool			bd_write_holder;
 	bool			bd_has_submit_bio;
 	dev_t			bd_dev;
-	struct inode		*bd_inode;	/* will die */
 
 	atomic_t		bd_openers;
 	spinlock_t		bd_size_lock; /* for bd_inode->i_size updates */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3fb02e3a527a..f3bc2e77999a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1524,6 +1524,8 @@ struct block_device *I_BDEV(struct inode *inode);
 struct block_device *file_bdev(struct file *bdev_file);
 bool disk_live(struct gendisk *disk);
 unsigned int block_size(struct block_device *bdev);
+void clean_bdev_aliases2(struct block_device *bdev, sector_t block,
+			 sector_t len);
 
 #ifdef CONFIG_BLOCK
 void invalidate_bdev(struct block_device *bdev);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index d78454a4dd1f..863af22f24c4 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -10,6 +10,7 @@
 
 #include <linux/types.h>
 #include <linux/blk_types.h>
+#include <linux/blkdev.h>
 #include <linux/fs.h>
 #include <linux/linkage.h>
 #include <linux/pagemap.h>
@@ -34,6 +35,7 @@ enum bh_state_bits {
 	BH_Meta,	/* Buffer contains metadata */
 	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
 	BH_Defer_Completion, /* Defer AIO completion to workqueue */
+	BH_Bdev,
 
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
@@ -68,7 +70,10 @@ struct buffer_head {
 	size_t b_size;			/* size of mapping */
 	char *b_data;			/* pointer to data within the page */
 
-	struct block_device *b_bdev;
+	union {
+		struct file *b_bdev_file;
+		struct block_device *b_bdev;
+	};
 	bh_end_io_t *b_end_io;		/* I/O completion */
  	void *b_private;		/* reserved for b_end_io */
 	struct list_head b_assoc_buffers; /* associated with another mapping */
@@ -135,6 +140,14 @@ BUFFER_FNS(Unwritten, unwritten)
 BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
+BUFFER_FNS(Bdev, bdev)
+
+static __always_inline struct block_device *bh_bdev(struct buffer_head *bh)
+{
+	if (buffer_bdev(bh))
+		return bh->b_bdev;
+	return file_bdev(bh->b_bdev_file);
+}
 
 static __always_inline void set_buffer_uptodate(struct buffer_head *bh)
 {
@@ -212,24 +225,33 @@ int generic_buffers_fsync_noflush(struct file *file, loff_t start, loff_t end,
 				  bool datasync);
 int generic_buffers_fsync(struct file *file, loff_t start, loff_t end,
 			  bool datasync);
-void clean_bdev_aliases(struct block_device *bdev, sector_t block,
-			sector_t len);
+void __clean_bdev_aliases(struct inode *inode, sector_t block, sector_t len);
+
+static inline void clean_bdev_aliases(struct file *bdev_file, sector_t block,
+				      sector_t len)
+{
+	return __clean_bdev_aliases(file_inode(bdev_file), block, len);
+}
+
 static inline void clean_bdev_bh_alias(struct buffer_head *bh)
 {
-	clean_bdev_aliases(bh->b_bdev, bh->b_blocknr, 1);
+	if (buffer_bdev(bh))
+		clean_bdev_aliases2(bh->b_bdev, bh->b_blocknr, 1);
+	else
+		clean_bdev_aliases(bh->b_bdev_file, bh->b_blocknr, 1);
 }
 
 void mark_buffer_async_write(struct buffer_head *bh);
 void __wait_on_buffer(struct buffer_head *);
 wait_queue_head_t *bh_waitq_head(struct buffer_head *bh);
-struct buffer_head *__find_get_block(struct block_device *bdev, sector_t block,
+struct buffer_head *__find_get_block(struct file *bdev_file, sector_t block,
 			unsigned size);
-struct buffer_head *bdev_getblk(struct block_device *bdev, sector_t block,
+struct buffer_head *bdev_getblk(struct file *bdev_file, sector_t block,
 		unsigned size, gfp_t gfp);
 void __brelse(struct buffer_head *);
 void __bforget(struct buffer_head *);
-void __breadahead(struct block_device *, sector_t block, unsigned int size);
-struct buffer_head *__bread_gfp(struct block_device *,
+void __breadahead(struct file *bdev_file, sector_t block, unsigned int size);
+struct buffer_head *__bread_gfp(struct file *bdev_file,
 				sector_t block, unsigned size, gfp_t gfp);
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
 void free_buffer_head(struct buffer_head * bh);
@@ -239,7 +261,7 @@ int sync_dirty_buffer(struct buffer_head *bh);
 int __sync_dirty_buffer(struct buffer_head *bh, blk_opf_t op_flags);
 void write_dirty_buffer(struct buffer_head *bh, blk_opf_t op_flags);
 void submit_bh(blk_opf_t, struct buffer_head *);
-void write_boundary_block(struct block_device *bdev,
+void write_boundary_block(struct file *bdev_file,
 			sector_t bblock, unsigned blocksize);
 int bh_uptodate_or_lock(struct buffer_head *bh);
 int __bh_read(struct buffer_head *bh, blk_opf_t op_flags, bool wait);
@@ -318,66 +340,67 @@ static inline void bforget(struct buffer_head *bh)
 static inline struct buffer_head *
 sb_bread(struct super_block *sb, sector_t block)
 {
-	return __bread_gfp(sb->s_bdev, block, sb->s_blocksize, __GFP_MOVABLE);
+	return __bread_gfp(sb->s_bdev_file, block, sb->s_blocksize,
+			   __GFP_MOVABLE);
 }
 
 static inline struct buffer_head *
 sb_bread_unmovable(struct super_block *sb, sector_t block)
 {
-	return __bread_gfp(sb->s_bdev, block, sb->s_blocksize, 0);
+	return __bread_gfp(sb->s_bdev_file, block, sb->s_blocksize, 0);
 }
 
 static inline void
 sb_breadahead(struct super_block *sb, sector_t block)
 {
-	__breadahead(sb->s_bdev, block, sb->s_blocksize);
+	__breadahead(sb->s_bdev_file, block, sb->s_blocksize);
 }
 
-static inline struct buffer_head *getblk_unmovable(struct block_device *bdev,
+static inline struct buffer_head *getblk_unmovable(struct file *bdev_file,
 		sector_t block, unsigned size)
 {
 	gfp_t gfp;
 
-	gfp = mapping_gfp_constraint(bdev->bd_inode->i_mapping, ~__GFP_FS);
+	gfp = mapping_gfp_constraint(bdev_file->f_mapping, ~__GFP_FS);
 	gfp |= __GFP_NOFAIL;
 
-	return bdev_getblk(bdev, block, size, gfp);
+	return bdev_getblk(bdev_file, block, size, gfp);
 }
 
-static inline struct buffer_head *__getblk(struct block_device *bdev,
+static inline struct buffer_head *__getblk(struct file *bdev_file,
 		sector_t block, unsigned size)
 {
 	gfp_t gfp;
 
-	gfp = mapping_gfp_constraint(bdev->bd_inode->i_mapping, ~__GFP_FS);
+	gfp = mapping_gfp_constraint(bdev_file->f_mapping, ~__GFP_FS);
 	gfp |= __GFP_MOVABLE | __GFP_NOFAIL;
 
-	return bdev_getblk(bdev, block, size, gfp);
+	return bdev_getblk(bdev_file, block, size, gfp);
 }
 
 static inline struct buffer_head *sb_getblk(struct super_block *sb,
 		sector_t block)
 {
-	return __getblk(sb->s_bdev, block, sb->s_blocksize);
+	return __getblk(sb->s_bdev_file, block, sb->s_blocksize);
 }
 
 static inline struct buffer_head *sb_getblk_gfp(struct super_block *sb,
 		sector_t block, gfp_t gfp)
 {
-	return bdev_getblk(sb->s_bdev, block, sb->s_blocksize, gfp);
+	return bdev_getblk(sb->s_bdev_file, block, sb->s_blocksize, gfp);
 }
 
 static inline struct buffer_head *
 sb_find_get_block(struct super_block *sb, sector_t block)
 {
-	return __find_get_block(sb->s_bdev, block, sb->s_blocksize);
+	return __find_get_block(sb->s_bdev_file, block, sb->s_blocksize);
 }
 
 static inline void
 map_bh(struct buffer_head *bh, struct super_block *sb, sector_t block)
 {
 	set_buffer_mapped(bh);
-	bh->b_bdev = sb->s_bdev;
+	bh->b_bdev_file = sb->s_bdev_file;
 	bh->b_blocknr = block;
 	bh->b_size = sb->s_blocksize;
 }
@@ -438,7 +461,7 @@ static inline void bh_readahead_batch(int nr, struct buffer_head *bhs[],
 
 /**
  *  __bread() - reads a specified block and returns the bh
- *  @bdev: the block_device to read from
+ *  @bdev_file: the opened block_device to read from
  *  @block: number of block
  *  @size: size (in bytes) to read
  *
@@ -447,9 +470,9 @@ static inline void bh_readahead_batch(int nr, struct buffer_head *bhs[],
  *  It returns NULL if the block was unreadable.
  */
 static inline struct buffer_head *
-__bread(struct block_device *bdev, sector_t block, unsigned size)
+__bread(struct file *bdev_file, sector_t block, unsigned int size)
 {
-	return __bread_gfp(bdev, block, size, __GFP_MOVABLE);
+	return __bread_gfp(bdev_file, block, size, __GFP_MOVABLE);
 }
 
 /**
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..176b202a2c7d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -77,6 +77,7 @@ struct vm_fault;
  */
 #define IOMAP_F_SIZE_CHANGED	(1U << 8)
 #define IOMAP_F_STALE		(1U << 9)
+#define IOMAP_F_BDEV		(1U << 10)
 
 /*
  * Flags from 0x1000 up are for file system specific usage:
@@ -97,7 +98,11 @@ struct iomap {
 	u64			length;	/* length of mapping, bytes */
 	u16			type;	/* type of mapping */
 	u16			flags;	/* flags for mapping */
-	struct block_device	*bdev;	/* block device for I/O */
+	union {
+		/* block device for I/O */
+		struct block_device	*bdev;
+		struct file		*bdev_file;
+	};
 	struct dax_device	*dax_dev; /* dax_dev for dax operations */
 	void			*inline_data;
 	void			*private; /* filesystem private */
@@ -105,6 +110,13 @@ struct iomap {
 	u64			validity_cookie; /* used with .iomap_valid() */
 };
 
+static inline struct block_device *iomap_bdev(const struct iomap *iomap)
+{
+	if (iomap->flags & IOMAP_F_BDEV)
+		return iomap->bdev;
+	return file_bdev(iomap->bdev_file);
+}
+
 static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
 {
 	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 0e128ad51460..95d3ed978864 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -26,7 +26,7 @@ DECLARE_EVENT_CLASS(block_buffer,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= bh->b_bdev->bd_dev;
+		__entry->dev		= bh_bdev(bh)->bd_dev;
 		__entry->sector		= bh->b_blocknr;
 		__entry->size		= bh->b_size;
 	),
-- 
2.39.2


^ permalink raw reply related	[relevance 1%]

* Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more
  2024-02-22 10:25  5%         ` Miklos Szeredi
@ 2024-02-22 11:19  0%           ` Kent Overstreet
  0 siblings, 0 replies; 200+ results
From: Kent Overstreet @ 2024-02-22 11:19 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Josef Bacik, linux-bcachefs, linux-btrfs, linux-fsdevel,
	linux-kernel, lsf-pc

On Thu, Feb 22, 2024 at 11:25:12AM +0100, Miklos Szeredi wrote:
> On Thu, 22 Feb 2024 at 10:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > Yeah no, you can't crap multiple 64 bit inode number spaces into 64
> > bits: pigeonhole principle.
> 
> Obviously not.  And I have no idea about the inode number allocation
> strategy of bcachefs and how many bits would be needed for subvolumes,
> etc.  I was just explaining what overlayfs does and why.  It's a
> pragmatic solution that works.  I'd very much like to move to better
> interfaces, but creating good interfaces is never easy.

You say "creating good interfaces is never easy" - but we've got a
proposal that's bounced around a fair bit, and you aren't saying
anything concrete.

> > We need something better than "hacks".
> 
> That's the end goal, obviously.   But we also need to take care of
> legacy.  Always have.

So what are you proposing?

> > This isn't a serious proposal.
> 
> If not, then what is?
> 
> BTW to expand on the st_dev_v2 idea, it can be done by adding a
> STATX_DEV_V2 query mask.

Didn't you see Josef just say they're trying to get away from st_dev?

> The other issue is adding subvolume ID.  You seem to think that it's
> okay to add that to statx and let userspace use (st_ino, st_subvoid)
> to identify the inode.  I'm saying this is wrong, because it doesn't
> work in the general case.

No, I explicitly said that when INO_NOT_UNIQUE is set the _filehandle_
would be the means to identify the file.


* Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more
  @ 2024-02-22 10:25  5%         ` Miklos Szeredi
  2024-02-22 11:19  0%           ` Kent Overstreet
From: Miklos Szeredi @ 2024-02-22 10:25 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Josef Bacik, linux-bcachefs, linux-btrfs, linux-fsdevel,
	linux-kernel, lsf-pc

On Thu, 22 Feb 2024 at 10:42, Kent Overstreet <kent.overstreet@linux.dev> wrote:

> Yeah no, you can't crap multiple 64 bit inode number spaces into 64
> bits: pigeonhole principle.

Obviously not.  And I have no idea about the inode number allocation
strategy of bcachefs and how many bits would be needed for subvolumes,
etc.  I was just explaining what overlayfs does and why.  It's a
pragmatic solution that works.  I'd very much like to move to better
interfaces, but creating good interfaces is never easy.

> We need something better than "hacks".

That's the end goal, obviously.   But we also need to take care of
legacy.  Always have.

> This isn't a serious proposal.

If not, then what is?

BTW to expand on the st_dev_v2 idea, it can be done by adding a
STATX_DEV_V2 query mask.

That way userspace can ask for the uniform stx_dev if it wants,
knowing full well that stx_ino will be non-unique within that
filesystem.  Then the kernel is free to return with or without
STATX_DEV_V2, which is basically what you proposed.  Except it's now
negotiated and not forced upon legacy interfaces.

The other issue is adding subvolume ID.  You seem to think that it's
okay to add that to statx and let userspace use (st_ino, st_subvolid)
to identify the inode.  I'm saying this is wrong, because it doesn't
work in the general case.

It doesn't work for overlayfs, for example, and we definitely want to
avoid having userspace do filesystem specific things *if it isn't
absolutely necessary*.  So for example "tar" should not care about
subvolumes as long as it has not been explicitly told to care.  And that
means for hard link detection it should use the file handle +
st_dev_v2 instead of st_ino + st_subvolid + st_dev_v2.  So if that
field is added to statx it must come with a stern warning about this
type of usage.

Thanks,
Miklos


* Re: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more
  @ 2024-02-21 21:04  5%   ` Kent Overstreet
From: Kent Overstreet @ 2024-02-21 21:04 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-bcachefs, linux-btrfs, linux-fsdevel, linux-kernel, lsf-pc,
	NeilBrown

On Wed, Feb 21, 2024 at 04:06:34PM +0100, Miklos Szeredi wrote:
> On Wed, 21 Feb 2024 at 01:51, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Recently we had a pretty long discussion on statx extensions, which
> > eventually got a bit offtopic but nevertheless hashed out all the major
> > issues.
> >
> > To summarize:
> >  - guaranteeing inode number uniqueness is becoming increasingly
> >    infeasible, we need a bit to tell userspace "inode number is not
> >    unique, use filehandle instead"
> 
> This is a tough one.   POSIX says "The st_ino and st_dev fields taken
> together uniquely identify the file within the system."
> 
> Adding a bit that says "from now the above POSIX rule is invalid"
> doesn't instantly fix all the existing applications that rely on it.

Even POSIX must bend when faced with reality. 64 bits is getting
uncomfortably cramped already and with filesystems getting bigger it's
going to break sooner or later.

We don't want to be abusing st_dev, and snapshots and inode number
sharding mean we're basically out of bits today.

> doing (see documentation) is generally the right direction.  It makes
> various compromises but not to uniqueness, and we haven't had
> complaints (fingers crossed).

I haven't seen anything in overlayfs that looked like a real solution,
just hacks that would break sooner or later if more filesystems are
being stacked.

> Nudging userspace developers to use file handles would also be good,
> but they should do so unconditionally, not based on a flag that has no
> well defined meaning.

If we define it, it has a perfectly well defined meaning.

I wouldn't be against telling userspace to use file handles
unconditionally; they should only need to query it for a file that has
handlinks, anyways.

But I think we _do_ need this bit, if nothing else, as exactly that
nudge.


* Re: [PATCH 2/3] check: add support for --list-group-tests
  2024-02-19  3:38  0%   ` Dave Chinner
@ 2024-02-21 16:45  0%     ` Luis Chamberlain
  2024-02-25 16:08  0%       ` Zorro Lang
From: Luis Chamberlain @ 2024-02-21 16:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: fstests, anand.jain, aalbersh, djwong, linux-fsdevel, kdevops, patches

On Mon, Feb 19, 2024 at 02:38:12PM +1100, Dave Chinner wrote:
> On Fri, Feb 16, 2024 at 10:18:58AM -0800, Luis Chamberlain wrote:
> > Since the prior commit adds the ability to list groups but is used
> > only when we use --start-after, let's add an option which leverages this
> > to also allow us to easily query which tests are part of the groups
> > specified.
> > 
> > This can be used for dynamic test configuration suites such as kdevops
> > which may want to take advantage of this information to deterministically
> > determine if a test is part of a specific group.
> > Demo:
> > 
> > root@demo-xfs-reflink /var/lib/xfstests # ./check --list-group-tests -g soak
> > 
> > generic/019 generic/388 generic/475 generic/476 generic/521 generic/522 generic/616 generic/617 generic/642 generic/648 generic/650 xfs/285 xfs/517 xfs/560 xfs/561 xfs/562 xfs/565 xfs/570 xfs/571 xfs/572 xfs/573 xfs/574 xfs/575 xfs/576 xfs/577 xfs/578 xfs/579 xfs/580 xfs/581 xfs/582 xfs/583 xfs/584 xfs/585 xfs/586 xfs/587 xfs/588 xfs/589 xfs/590 xfs/591 xfs/592 xfs/593 xfs/594 xfs/595 xfs/727 xfs/729 xfs/800
> 
> So how is this different to ./check -n -g soak?
> 
> '-n' is supposed to show you what tests are going to be run
> without actually running them, so why can't you use that?

'-n' goes through the motions of running every test but skips each one,
while --list-group-tests only looks up the tests belonging to the given
groups and bails right away, and its output is machine readable.

  Luis


* Re: [PATCH] test_xarray: fix soft lockup for advanced-api tests
  2024-02-20  2:28  0% ` Andrew Morton
@ 2024-02-20 17:45  0%   ` Luis Chamberlain
From: Luis Chamberlain @ 2024-02-20 17:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: willy, linux-fsdevel, linux-mm, linux-kernel, gost.dev, p.raghav,
	da.gomez, kernel test robot

On Mon, Feb 19, 2024 at 06:28:08PM -0800, Andrew Morton wrote:
> On Fri, 16 Feb 2024 11:43:29 -0800 Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> > The new advanced API tests
> 
> So this is a fix against the mm-unstable series "test_xarray: advanced
> API multi-index tests", v2.

Yes.

> > want to vet that the xarray API is doing what it
> > promises by manually iterating over a set of possible indexes on its
> > own, and using a query operation which holds the RCU lock and then
> > releases it. So it deliberately does not use the helper loop options
> > which xarray provides. Any loop which iterates over 1 million entries
> > (which is possible with order 20, so emulating say a 4 GiB block size)
> > just to RCU lock and unlock will eventually end up triggering a soft
> > lockup on systems which don't preempt and have lock proving and RCU
> > proving enabled.
> > 
> > xarray users already use XA_CHECK_SCHED for loops which may take a long
> > time, in our case we don't want to RCU unlock and lock as the caller
> > does that already, but rather just force a schedule every XA_CHECK_SCHED
> > iterations since the test is trying to not trust and rather test that
> > xarray is doing the right thing.
> > 
> > [0] https://lkml.kernel.org/r/202402071613.70f28243-lkp@intel.com
> > 
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> 
> As the above link shows, this should be
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202402071613.70f28243-lkp@intel.com

Thanks, yes...

> > --- a/lib/test_xarray.c
> > +++ b/lib/test_xarray.c
> > @@ -781,6 +781,7 @@ static noinline void *test_get_entry(struct xarray *xa, unsigned long index)
> >  {
> >  	XA_STATE(xas, xa, index);
> >  	void *p;
> > +	static unsigned int i = 0;
> 
> I don't think this needs static storage.

Actually it does; without it the schedule never happens, and we end up
with the soft lockup in the splat below:

> PetPeeve: it is unexpected that `i' has unsigned type.  Can a more
> communicative identifier be used?

Sure,

The static, however, is needed; otherwise we end up with:

Feb 20 14:37:09 small kernel: Linux version 6.8.0-rc4-next-20240212+ (mcgrof@deb-101020-bm01) (gcc (Debian 13.2.0-4) 13.2.0, GNU ld (GNU Binutils for Debian) 2.41) #23 SMP PREEMPT_DYNAMIC Tue Feb 20 14:34:35 UTC 2024
Feb 20 14:37:09 small kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-rc4-next-20240212+ root=UUID=79e12315-47fe-462c-b69d-270b4fa13487 ro console=tty0 console=tty1 console=ttyS0,115200n8 elevator=noop scsi_mod.use_blk_mq=Y net.ifnames=0 biosdevname=0
Feb 20 14:37:09 small kernel: BIOS-provided physical RAM map:
Feb 20 14:37:09 small kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable

...

Feb 20 14:37:09 small kernel: Freeing initrd memory: 95720K
Feb 20 14:37:09 small kernel: Block layer SCSI generic (bsg) driver version 0.4 loaded (major 248)
Feb 20 14:37:09 small kernel: io scheduler mq-deadline registered

...

And the soft lockup:

Feb 20 14:37:09 small kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:1]
Feb 20 14:37:09 small kernel: Modules linked in:
Feb 20 14:37:09 small kernel: irq event stamp: 1786208
Feb 20 14:37:09 small kernel: hardirqs last  enabled at (1786207): [<ffffffff839633c4>] _raw_spin_unlock_irq+0x24/0x50
Feb 20 14:37:09 small kernel: hardirqs last disabled at (1786208): [<ffffffff8394aafa>] sysvec_apic_timer_interrupt+0xa/0xc0
Feb 20 14:37:09 small kernel: softirqs last  enabled at (1786198): [<ffffffff82e96746>] __irq_exit_rcu+0x76/0xd0
Feb 20 14:37:09 small kernel: softirqs last disabled at (1786193): [<ffffffff82e96746>] __irq_exit_rcu+0x76/0xd0
Feb 20 14:37:09 small kernel: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc4-next-20240212+ #23
Feb 20 14:37:09 small kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Feb 20 14:37:09 small kernel: RIP: 0010:lock_is_held_type+0xee/0x120
Feb 20 14:37:09 small kernel: Code: 77 da f2 83 e8 83 0b 00 00 b8 ff ff ff ff 65 0f c1 05 6e 21 6d 7c 83 f8 01 75 20 41 f7 c7 00 02 00 00 74 06 fb 0f 1f 44 00 00 <5b> 89 e8 5d 41 5c 41 5d 41 5e 41 5f c3 31 ed eb c2 0f 0b 48 c7 c7
Feb 20 14:37:09 small kernel: RSP: 0000:ffffbf4400017d48 EFLAGS: 00000206
Feb 20 14:37:09 small kernel: RAX: 0000000000000001 RBX: ffff9bfe4180ce98 RCX: 0000000000000001
Feb 20 14:37:09 small kernel: RDX: 0000000000000000 RSI: ffffffff83f2da77 RDI: ffffffff83f5c6bf
Feb 20 14:37:09 small kernel: RBP: 0000000000000000 R08: 0000000000000019 R09: 0000000000000019
Feb 20 14:37:09 small kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff84355b38
Feb 20 14:37:09 small kernel: R13: ffff9bfe4180c400 R14: 00000000ffffffff R15: 0000000000000246
Feb 20 14:37:09 small kernel: FS:  0000000000000000(0000) GS:ffff9bfebdc00000(0000) knlGS:0000000000000000
Feb 20 14:37:09 small kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 14:37:09 small kernel: CR2: ffff9bfe53601000 CR3: 0000000011e23001 CR4: 0000000000770ef0
Feb 20 14:37:09 small kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 20 14:37:09 small kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 20 14:37:09 small kernel: PKRU: 55555554
Feb 20 14:37:09 small kernel: Call Trace:
Feb 20 14:37:09 small kernel:  <IRQ>
Feb 20 14:37:09 small kernel:  ? watchdog_timer_fn+0x271/0x310
Feb 20 14:37:09 small kernel:  ? softlockup_fn+0x70/0x70
Feb 20 14:37:09 small kernel:  ? __hrtimer_run_queues+0x19e/0x360
Feb 20 14:37:09 small kernel:  ? hrtimer_interrupt+0xfe/0x230
Feb 20 14:37:09 small kernel:  ? __sysvec_apic_timer_interrupt+0x84/0x1d0
Feb 20 14:37:09 small kernel:  ? sysvec_apic_timer_interrupt+0x98/0xc0
Feb 20 14:37:09 small kernel:  </IRQ>
Feb 20 14:37:09 small kernel:  <TASK>
Feb 20 14:37:09 small kernel:  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Feb 20 14:37:09 small kernel:  ? lock_is_held_type+0xee/0x120
Feb 20 14:37:09 small kernel:  ? lock_is_held_type+0xcd/0x120
Feb 20 14:37:09 small kernel:  xas_descend+0xc9/0x190
Feb 20 14:37:09 small kernel:  xas_load+0x39/0x50
Feb 20 14:37:09 small kernel:  test_get_entry.constprop.0+0x91/0x170
Feb 20 14:37:09 small kernel:  check_xa_multi_store_adv.constprop.0+0x21c/0x4c0
Feb 20 14:37:09 small kernel:  check_multi_store_advanced.constprop.0+0x3a/0x60
Feb 20 14:37:09 small kernel:  ? check_xas_retry.constprop.0+0x9a0/0x9a0
Feb 20 14:37:09 small kernel:  xarray_checks+0x4f/0xe0
Feb 20 14:37:09 small kernel:  do_one_initcall+0x5d/0x350
Feb 20 14:37:09 small kernel:  kernel_init_freeable+0x24d/0x410
Feb 20 14:37:09 small kernel:  ? rest_init+0x190/0x190
Feb 20 14:37:09 small kernel:  kernel_init+0x16/0x1b0
Feb 20 14:37:09 small kernel:  ret_from_fork+0x2d/0x50
Feb 20 14:37:09 small kernel:  ? rest_init+0x190/0x190
Feb 20 14:37:09 small kernel:  ret_from_fork_asm+0x11/0x20
Feb 20 14:37:09 small kernel:  </TASK>
Feb 20 14:37:09 small kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 52s! [swapper/0:1]
Feb 20 14:37:09 small kernel: Modules linked in:
Feb 20 14:37:09 small kernel: irq event stamp: 1838538
Feb 20 14:37:09 small kernel: hardirqs last  enabled at (1838537): [<ffffffff83a00d06>] asm_sysvec_apic_timer_interrupt+0x16/0x20
Feb 20 14:37:09 small kernel: hardirqs last disabled at (1838538): [<ffffffff8394aafa>] sysvec_apic_timer_interrupt+0xa/0xc0
Feb 20 14:37:09 small kernel: softirqs last  enabled at (1838508): [<ffffffff82e96746>] __irq_exit_rcu+0x76/0xd0
Feb 20 14:37:09 small kernel: softirqs last disabled at (1838503): [<ffffffff82e96746>] __irq_exit_rcu+0x76/0xd0
Feb 20 14:37:09 small kernel: CPU: 0 PID: 1 Comm: swapper/0 Tainted: G             L     6.8.0-rc4-next-20240212+ #23
Feb 20 14:37:09 small kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Feb 20 14:37:09 small kernel: RIP: 0010:lock_is_held_type+0xee/0x120
Feb 20 14:37:09 small kernel: Code: 77 da f2 83 e8 83 0b 00 00 b8 ff ff ff ff 65 0f c1 05 6e 21 6d 7c 83 f8 01 75 20 41 f7 c7 00 02 00 00 74 06 fb 0f 1f 44 00 00 <5b> 89 e8 5d 41 5c 41 5d 41 5e 41 5f c3 31 ed eb c2 0f 0b 48 c7 c7
Feb 20 14:37:09 small kernel: RSP: 0000:ffffbf4400017d48 EFLAGS: 00000206
Feb 20 14:37:09 small kernel: RAX: 0000000000000001 RBX: ffff9bfe4180ce70 RCX: 0000000000000001
Feb 20 14:37:09 small kernel: RDX: 0000000000000000 RSI: ffffffff83f2da77 RDI: ffffffff83f5c6bf
Feb 20 14:37:09 small kernel: RBP: 0000000000000001 R08: 0000000000000019 R09: 0000000000000019
Feb 20 14:37:09 small kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff842d1040
Feb 20 14:37:09 small kernel: R13: ffff9bfe4180c400 R14: 00000000ffffffff R15: 0000000000000246
Feb 20 14:37:09 small kernel: FS:  0000000000000000(0000) GS:ffff9bfebdc00000(0000) knlGS:0000000000000000
Feb 20 14:37:09 small kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 20 14:37:09 small kernel: CR2: ffff9bfe53601000 CR3: 0000000011e23001 CR4: 0000000000770ef0
Feb 20 14:37:09 small kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 20 14:37:09 small kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 20 14:37:09 small kernel: PKRU: 55555554
Feb 20 14:37:09 small kernel: Call Trace:
Feb 20 14:37:09 small kernel:  <IRQ>
Feb 20 14:37:09 small kernel:  ? watchdog_timer_fn+0x271/0x310
Feb 20 14:37:09 small kernel:  ? softlockup_fn+0x70/0x70
Feb 20 14:37:09 small kernel:  ? __hrtimer_run_queues+0x19e/0x360
Feb 20 14:37:09 small kernel:  ? hrtimer_interrupt+0xfe/0x230
Feb 20 14:37:09 small kernel:  ? __sysvec_apic_timer_interrupt+0x84/0x1d0
Feb 20 14:37:09 small kernel:  ? sysvec_apic_timer_interrupt+0x98/0xc0
Feb 20 14:37:09 small kernel:  </IRQ>
Feb 20 14:37:09 small kernel:  <TASK>
Feb 20 14:37:09 small kernel:  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Feb 20 14:37:09 small kernel:  ? lock_is_held_type+0xee/0x120
Feb 20 14:37:09 small kernel:  ? lock_is_held_type+0xcd/0x120
Feb 20 14:37:09 small kernel:  xas_descend+0xd6/0x190
Feb 20 14:37:09 small kernel:  xas_load+0x39/0x50
Feb 20 14:37:09 small kernel:  test_get_entry.constprop.0+0x91/0x170
Feb 20 14:37:09 small kernel:  check_xa_multi_store_adv.constprop.0+0x3b1/0x4c0
Feb 20 14:37:09 small kernel:  check_multi_store_advanced.constprop.0+0x3a/0x60
Feb 20 14:37:09 small kernel:  ? check_xas_retry.constprop.0+0x9a0/0x9a0
Feb 20 14:37:09 small kernel:  xarray_checks+0x4f/0xe0
Feb 20 14:37:09 small kernel:  do_one_initcall+0x5d/0x350
Feb 20 14:37:09 small kernel:  kernel_init_freeable+0x24d/0x410
Feb 20 14:37:09 small kernel:  ? rest_init+0x190/0x190
Feb 20 14:37:09 small kernel:  kernel_init+0x16/0x1b0
Feb 20 14:37:09 small kernel:  ret_from_fork+0x2d/0x50
Feb 20 14:37:09 small kernel:  ? rest_init+0x190/0x190
Feb 20 14:37:09 small kernel:  ret_from_fork_asm+0x11/0x20
Feb 20 14:37:09 small kernel:  </TASK>
Feb 20 14:37:09 small kernel: XArray: 148257077 of 148257077 tests passed

  Luis

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] test_xarray: fix soft lockup for advanced-api tests
  2024-02-16 19:43  5% [PATCH] test_xarray: fix soft lockup for advanced-api tests Luis Chamberlain
@ 2024-02-20  2:28  0% ` Andrew Morton
  2024-02-20 17:45  0%   ` Luis Chamberlain
  0 siblings, 1 reply; 200+ results
From: Andrew Morton @ 2024-02-20  2:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: willy, linux-fsdevel, linux-mm, linux-kernel, gost.dev, p.raghav,
	da.gomez, kernel test robot

On Fri, 16 Feb 2024 11:43:29 -0800 Luis Chamberlain <mcgrof@kernel.org> wrote:

> The new advanced API tests

So this is a fix against the mm-unstable series "test_xarray: advanced
API multi-index tests", v2.

> want to vet that the xarray API is doing what it
> promises by manually iterating over a set of possible indexes on its
> own, and using a query operation which holds the RCU lock and then
> releases it. So it is not using the helper loop options which xarray
> provides on purpose. Any loop which iterates over 1 million entries
> (which is possible with order 20, so emulating, say, a 4 GiB block size)
> just to RCU lock and unlock will eventually end up triggering a soft
> lockup on systems which don't preempt and have lock proving and RCU
> proving enabled.
> 
> xarray users already use XA_CHECK_SCHED for loops which may take a long
> time; in our case we don't want to RCU unlock and lock, as the caller
> already does that, but rather just force a schedule every XA_CHECK_SCHED
> iterations, since the test intentionally does not trust xarray but
> rather verifies that it is doing the right thing.
> 
> [0] https://lkml.kernel.org/r/202402071613.70f28243-lkp@intel.com
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>

As the above link shows, this should be

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202402071613.70f28243-lkp@intel.com

> --- a/lib/test_xarray.c
> +++ b/lib/test_xarray.c
> @@ -781,6 +781,7 @@ static noinline void *test_get_entry(struct xarray *xa, unsigned long index)
>  {
>  	XA_STATE(xas, xa, index);
>  	void *p;
> +	static unsigned int i = 0;

I don't think this needs static storage.

PetPeeve: it is unexpected that `i' has unsigned type.  Can a more
communicative identifier be used?


I shall queue your patch as a fixup patch against
test_xarray-add-tests-for-advanced-multi-index-use and shall add the
below on top.  Please check.

--- a/lib/test_xarray.c~test_xarray-fix-soft-lockup-for-advanced-api-tests-fix
+++ a/lib/test_xarray.c
@@ -728,7 +728,7 @@ static noinline void *test_get_entry(str
 {
 	XA_STATE(xas, xa, index);
 	void *p;
-	static unsigned int i = 0;
+	unsigned int loops = 0;
 
 	rcu_read_lock();
 repeat:
@@ -746,7 +746,7 @@ repeat:
 	 * APIs won't be stupid, proper page cache APIs loop over the proper
 	 * order so when using a larger order we skip shared entries.
 	 */
-	if (++i % XA_CHECK_SCHED == 0)
+	if (++loops % XA_CHECK_SCHED == 0)
 		schedule();
 
 	return p;
_



* Re: [PATCH v9 1/8] landlock: Add IOCTL access right
  2024-02-09 17:06  5% ` [PATCH v9 1/8] landlock: Add IOCTL access right Günther Noack
  2024-02-16 17:19  0%   ` Mickaël Salaün
@ 2024-02-19 18:34  0%   ` Mickaël Salaün
  2024-02-28 12:57  0%     ` Günther Noack
  1 sibling, 1 reply; 200+ results
From: Mickaël Salaün @ 2024-02-19 18:34 UTC (permalink / raw)
  To: Günther Noack, Arnd Bergmann, Christian Brauner
  Cc: linux-security-module, Jeff Xu, Jorge Lucangeli Obes, Allen Webb,
	Dmitry Torokhov, Paul Moore, Konstantin Meskhidze,
	Matt Bobrowski, linux-fsdevel

Arnd, Christian, please take a look at the following RFC patch and the
rationale explained here.

On Fri, Feb 09, 2024 at 06:06:05PM +0100, Günther Noack wrote:
> Introduces the LANDLOCK_ACCESS_FS_IOCTL access right
> and increments the Landlock ABI version to 5.
> 
> Like the truncate right, these rights are associated with a file
> descriptor at the time of open(2), and get respected even when the
> file descriptor is used outside of the thread which it was originally
> opened in.
> 
> A newly enabled Landlock policy therefore does not apply to file
> descriptors which are already open.
> 
> If the LANDLOCK_ACCESS_FS_IOCTL right is handled, only a small number
> of safe IOCTL commands will be permitted on newly opened files.  The
> permitted IOCTLs can be configured through the ruleset in limited ways
> now.  (See documentation for details.)
> 
> Specifically, when LANDLOCK_ACCESS_FS_IOCTL is handled, granting this
> right on a file or directory will *not* permit all IOCTL
> commands, but only influence the IOCTL commands which are not already
> handled through other access rights.  The intent is to keep the groups
> of IOCTL commands more fine-grained.
> 
> Noteworthy scenarios which require special attention:
> 
> TTY devices are often passed into a process from the parent process,
> and so a newly enabled Landlock policy does not retroactively apply to
> them automatically.  In the past, TTY devices have often supported
> IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which were
> letting callers control the TTY input buffer (and simulate
> keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
> modern kernels though.
> 
> Some legitimate file system features, like setting up fscrypt, are
> exposed as IOCTL commands on regular files and directories -- users of
> Landlock are advised to double check that the sandboxed process does
> not need to invoke these IOCTLs.

I think we really need to allow fscrypt and fs-verity IOCTLs.

> 
> Known limitations:
> 
> The LANDLOCK_ACCESS_FS_IOCTL access right is a coarse-grained control
> over IOCTL commands.  Future work will enable a more fine-grained
> access control for IOCTLs.
> 
> In the meantime, Landlock users may use path-based restrictions in
> combination with their knowledge about the file system layout to
> control what IOCTLs can be done.  Mounting file systems with the nodev
> option can help to distinguish regular files and devices, and give
> guarantees about the affected files, which Landlock alone can not give
> yet.

I had a second thought about our current approach, and it looks like we
can do something simpler, more generic, and with less
IOCTL-command-specific handling.

What we didn't take into account is that an IOCTL needs an opened file,
which means that the caller must already have been allowed to open this
file in read or write mode.

I think most FS-specific IOCTL commands check access rights (i.e. access
mode or required capability), other than implicit ones (at least read or
write), when appropriate.  We don't get such a guarantee with device
drivers.

The main threat is IOCTLs on character or block devices because their
impact may be unknown (if we only look at the IOCTL command, not the
backing file), but we should allow IOCTLs on filesystems (e.g. fscrypt,
fs-verity, clone extents).  I think we should only implement a
LANDLOCK_ACCESS_FS_IOCTL_DEV right, which would be more explicit.  This
change would impact the IOCTLs grouping (not required anymore), but
we'll still need the list of VFS IOCTLs.


> 
> Signed-off-by: Günther Noack <gnoack@google.com>
> ---
>  include/uapi/linux/landlock.h                |  55 ++++-
>  security/landlock/fs.c                       | 227 ++++++++++++++++++-
>  security/landlock/fs.h                       |   3 +
>  security/landlock/limits.h                   |  11 +-
>  security/landlock/ruleset.h                  |   2 +-
>  security/landlock/syscalls.c                 |  19 +-
>  tools/testing/selftests/landlock/base_test.c |   2 +-
>  tools/testing/selftests/landlock/fs_test.c   |   5 +-
>  8 files changed, 302 insertions(+), 22 deletions(-)

> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index 73997e63734f..84efea3f7c0f 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c

> @@ -84,6 +87,186 @@ static const struct landlock_object_underops landlock_fs_underops = {
>  	.release = release_inode
>  };
>  
> +/* IOCTL helpers */
> +
> +/*
> + * These are synthetic access rights, which are only used within the kernel, but
> + * not exposed to callers in userspace.  The mapping between these access rights
> + * and IOCTL commands is defined in the get_required_ioctl_access() helper function.
> + */
> +#define LANDLOCK_ACCESS_FS_IOCTL_RW (LANDLOCK_LAST_PUBLIC_ACCESS_FS << 1)
> +#define LANDLOCK_ACCESS_FS_IOCTL_RW_FILE (LANDLOCK_LAST_PUBLIC_ACCESS_FS << 2)
> +
> +/* ioctl_groups - all synthetic access rights for IOCTL command groups */
> +/* clang-format off */
> +#define IOCTL_GROUPS (				\
> +	LANDLOCK_ACCESS_FS_IOCTL_RW |		\
> +	LANDLOCK_ACCESS_FS_IOCTL_RW_FILE)
> +/* clang-format on */
> +
> +static_assert((IOCTL_GROUPS & LANDLOCK_MASK_ACCESS_FS) == IOCTL_GROUPS);
> +
> +/**
> + * get_required_ioctl_access(): Determine required IOCTL access rights.
> + *
> + * @cmd: The IOCTL command that is supposed to be run.
> + *
> + * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
> + * should be considered for inclusion here.
> + *
> + * Returns: The access rights that must be granted on an opened file in order to
> + * use the given @cmd.
> + */
> +static __attribute_const__ access_mask_t
> +get_required_ioctl_access(const unsigned int cmd)
> +{
> +	switch (cmd) {
> +	case FIOCLEX:
> +	case FIONCLEX:
> +	case FIONBIO:
> +	case FIOASYNC:
> +		/*
> +		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
> +		 * close-on-exec and the file's buffered-IO and async flags.
> +		 * These operations are also available through fcntl(2), and are
> +		 * unconditionally permitted in Landlock.
> +		 */
> +		return 0;
> +	case FIONREAD:
> +	case FIOQSIZE:
> +	case FIGETBSZ:
> +		/*
> +		 * FIONREAD returns the number of immediately readable bytes for
> +		 * a file.
> +		 *
> +		 * FIOQSIZE queries the size of a file or directory.
> +		 *
> +		 * FIGETBSZ queries the file system's block size for a file or
> +		 * directory.
> +		 *
> +		 * These IOCTL commands are permitted for files which are opened
> +		 * with LANDLOCK_ACCESS_FS_READ_DIR,
> +		 * LANDLOCK_ACCESS_FS_READ_FILE, or
> +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> +		 */

Because files or directories can only be opened with
LANDLOCK_ACCESS_FS_{READ,WRITE}_{FILE,DIR}, and because IOCTLs can only
be sent on a file descriptor, this means that we can always allow these
3 commands (for opened files).

> +		return LANDLOCK_ACCESS_FS_IOCTL_RW;
> +	case FS_IOC_FIEMAP:
> +	case FIBMAP:
> +		/*
> +		 * FS_IOC_FIEMAP and FIBMAP query information about the
> +		 * allocation of blocks within a file.  They are permitted for
> +		 * files which are opened with LANDLOCK_ACCESS_FS_READ_FILE or
> +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> +		 */
> +		fallthrough;
> +	case FIDEDUPERANGE:
> +	case FICLONE:
> +	case FICLONERANGE:
> +		/*
> +		 * FIDEDUPERANGE, FICLONE and FICLONERANGE make files share
> +		 * their underlying storage ("reflink") between source and
> +		 * destination FDs, on file systems which support that.
> +		 *
> +		 * The underlying implementations are already checking whether
> +		 * the involved files are opened with the appropriate read/write
> +		 * modes.  We rely on this being implemented correctly.
> +		 *
> +		 * These IOCTLs are permitted for files which are opened with
> +		 * LANDLOCK_ACCESS_FS_READ_FILE or
> +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> +		 */
> +		fallthrough;
> +	case FS_IOC_RESVSP:
> +	case FS_IOC_RESVSP64:
> +	case FS_IOC_UNRESVSP:
> +	case FS_IOC_UNRESVSP64:
> +	case FS_IOC_ZERO_RANGE:
> +		/*
> +		 * These IOCTLs reserve space, or create holes like
> +		 * fallocate(2).  We rely on the implementations checking the
> +		 * files' read/write modes.
> +		 *
> +		 * These IOCTLs are permitted for files which are opened with
> +		 * LANDLOCK_ACCESS_FS_READ_FILE or
> +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> +		 */

These 10 commands only make sense on regular files, so we could also
always allow them on file descriptors.

> +		return LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
> +	default:
> +		/*
> +		 * Other commands are guarded by the catch-all access right.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL;
> +	}
> +}
> +
> +/**
> + * expand_ioctl() - Return the dst flags from either the src flag or the
> + * %LANDLOCK_ACCESS_FS_IOCTL flag, depending on whether the
> + * %LANDLOCK_ACCESS_FS_IOCTL and src access rights are handled or not.
> + *
> + * @handled: Handled access rights.
> + * @access: The access mask to copy values from.
> + * @src: A single access right to copy from in @access.
> + * @dst: One or more access rights to copy to.
> + *
> + * Returns: @dst, or 0.
> + */
> +static __attribute_const__ access_mask_t
> +expand_ioctl(const access_mask_t handled, const access_mask_t access,
> +	     const access_mask_t src, const access_mask_t dst)
> +{
> +	access_mask_t copy_from;
> +
> +	if (!(handled & LANDLOCK_ACCESS_FS_IOCTL))
> +		return 0;
> +
> +	copy_from = (handled & src) ? src : LANDLOCK_ACCESS_FS_IOCTL;
> +	if (access & copy_from)
> +		return dst;
> +
> +	return 0;
> +}
> +
> +/**
> + * landlock_expand_access_fs() - Returns @access with the synthetic IOCTL group
> + * flags enabled if necessary.
> + *
> + * @handled: Handled FS access rights.
> + * @access: FS access rights to expand.
> + *
> + * Returns: @access expanded by the necessary flags for the synthetic IOCTL
> + * access rights.
> + */
> +static __attribute_const__ access_mask_t landlock_expand_access_fs(
> +	const access_mask_t handled, const access_mask_t access)
> +{
> +	return access |
> +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_WRITE_FILE,
> +			    LANDLOCK_ACCESS_FS_IOCTL_RW |
> +				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
> +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_FILE,
> +			    LANDLOCK_ACCESS_FS_IOCTL_RW |
> +				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
> +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_DIR,
> +			    LANDLOCK_ACCESS_FS_IOCTL_RW);
> +}
> +
> +/**
> + * landlock_expand_handled_access_fs() - add synthetic IOCTL access rights to an
> + * access mask of handled accesses.
> + *
> + * @handled: The handled accesses of a ruleset that is being created.
> + *
> + * Returns: @handled, with the bits for the synthetic IOCTL access rights set,
> + * if %LANDLOCK_ACCESS_FS_IOCTL is handled.
> + */
> +__attribute_const__ access_mask_t
> +landlock_expand_handled_access_fs(const access_mask_t handled)
> +{
> +	return landlock_expand_access_fs(handled, handled);
> +}
> +
>  /* Ruleset management */
>  
>  static struct landlock_object *get_inode_object(struct inode *const inode)
> @@ -148,7 +331,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
>  	LANDLOCK_ACCESS_FS_EXECUTE | \
>  	LANDLOCK_ACCESS_FS_WRITE_FILE | \
>  	LANDLOCK_ACCESS_FS_READ_FILE | \
> -	LANDLOCK_ACCESS_FS_TRUNCATE)
> +	LANDLOCK_ACCESS_FS_TRUNCATE | \
> +	LANDLOCK_ACCESS_FS_IOCTL)
>  /* clang-format on */
>  
>  /*
> @@ -158,6 +342,7 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
>  			    const struct path *const path,
>  			    access_mask_t access_rights)
>  {
> +	access_mask_t handled;
>  	int err;
>  	struct landlock_id id = {
>  		.type = LANDLOCK_KEY_INODE,
> @@ -170,9 +355,11 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
>  	if (WARN_ON_ONCE(ruleset->num_layers != 1))
>  		return -EINVAL;
>  
> +	handled = landlock_get_fs_access_mask(ruleset, 0);
> +	/* Expands the synthetic IOCTL groups. */
> +	access_rights |= landlock_expand_access_fs(handled, access_rights);
>  	/* Transforms relative access rights to absolute ones. */
> -	access_rights |= LANDLOCK_MASK_ACCESS_FS &
> -			 ~landlock_get_fs_access_mask(ruleset, 0);
> +	access_rights |= LANDLOCK_MASK_ACCESS_FS & ~handled;
>  	id.key.object = get_inode_object(d_backing_inode(path->dentry));
>  	if (IS_ERR(id.key.object))
>  		return PTR_ERR(id.key.object);
> @@ -1333,7 +1520,9 @@ static int hook_file_open(struct file *const file)
>  {
>  	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
>  	access_mask_t open_access_request, full_access_request, allowed_access;
> -	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> +	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE |
> +					      LANDLOCK_ACCESS_FS_IOCTL |
> +					      IOCTL_GROUPS;
>  	const struct landlock_ruleset *const dom = get_current_fs_domain();
>  
>  	if (!dom)

We should set optional_access according to the file type before
`full_access_request = open_access_request | optional_access;`

const bool is_device = S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode);

optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
if (is_device)
    optional_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;


Because LANDLOCK_ACCESS_FS_IOCTL_DEV is dedicated to character or block
devices, we may want landlock_add_rule() to only allow this access right
to be tied to directories, or character devices, or block devices.  Even
though it would be more consistent with constraints on directory-only access
rights, I'm not sure about that.


> @@ -1375,6 +1564,16 @@ static int hook_file_open(struct file *const file)
>  		}
>  	}
>  
> +	/*
> +	 * Named pipes should be treated just like anonymous pipes.
> +	 * Therefore, we permit all IOCTLs on them.
> +	 */
> +	if (S_ISFIFO(file_inode(file)->i_mode)) {
> +		allowed_access |= LANDLOCK_ACCESS_FS_IOCTL |
> +				  LANDLOCK_ACCESS_FS_IOCTL_RW |
> +				  LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
> +	}

Instead of this S_ISFIFO check:

if (!is_device)
    allowed_access |= LANDLOCK_ACCESS_FS_IOCTL_DEV;

> +
>  	/*
>  	 * For operations on already opened files (i.e. ftruncate()), it is the
>  	 * access rights at the time of open() which decide whether the
> @@ -1406,6 +1605,25 @@ static int hook_file_truncate(struct file *const file)
>  	return -EACCES;
>  }
>  
> +static int hook_file_ioctl(struct file *file, unsigned int cmd,
> +			   unsigned long arg)
> +{
> +	const access_mask_t required_access = get_required_ioctl_access(cmd);

const access_mask_t required_access = LANDLOCK_ACCESS_FS_IOCTL_DEV;


> +	const access_mask_t allowed_access =
> +		landlock_file(file)->allowed_access;
> +
> +	/*
> +	 * It is the access rights at the time of opening the file which
> +	 * determine whether IOCTL can be used on the opened file later.
> +	 *
> +	 * The access right is attached to the opened file in hook_file_open().
> +	 */
> +	if ((allowed_access & required_access) == required_access)
> +		return 0;

We could then check against do_vfs_ioctl()'s commands, excluding
FIONREAD and file_ioctl()'s commands, to always allow VFS-related
commands:

if (vfs_masked_device_ioctl(cmd))
    return 0;

We could define vfs_masked_device_ioctl(cmd) in fs/ioctl.c and have
do_vfs_ioctl() call it as a safeguard, to make sure we keep an accurate
list of VFS IOCTL commands (see the next RFC patch).

The compat IOCTL hook must also be implemented.

What do you think? Any better idea?


> +
> +	return -EACCES;
> +}
> +
>  static struct security_hook_list landlock_hooks[] __ro_after_init = {
>  	LSM_HOOK_INIT(inode_free_security, hook_inode_free_security),
>  
> @@ -1428,6 +1646,7 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = {


* Re: [PATCH 2/3] check: add support for --list-group-tests
  2024-02-16 18:18  4% ` [PATCH 2/3] check: add support for --list-group-tests Luis Chamberlain
@ 2024-02-19  3:38  0%   ` Dave Chinner
  2024-02-21 16:45  0%     ` Luis Chamberlain
  0 siblings, 1 reply; 200+ results
From: Dave Chinner @ 2024-02-19  3:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: fstests, anand.jain, aalbersh, djwong, linux-fsdevel, kdevops, patches

On Fri, Feb 16, 2024 at 10:18:58AM -0800, Luis Chamberlain wrote:
> The prior commit adds the ability to list groups, but it is only used
> with --start-after. Let's add an option which leverages this to also
> allow us to easily query which tests are part of the specified groups.
> 
> This can be used for dynamic test configuration suites such as kdevops
> which may want to take advantage of this information to deterministically
> determine if a test is part of a specific group.
> 
> Demo:
> 
> root@demo-xfs-reflink /var/lib/xfstests # ./check --list-group-tests -g soak
> 
> generic/019 generic/388 generic/475 generic/476 generic/521 generic/522 generic/616 generic/617 generic/642 generic/648 generic/650 xfs/285 xfs/517 xfs/560 xfs/561 xfs/562 xfs/565 xfs/570 xfs/571 xfs/572 xfs/573 xfs/574 xfs/575 xfs/576 xfs/577 xfs/578 xfs/579 xfs/580 xfs/581 xfs/582 xfs/583 xfs/584 xfs/585 xfs/586 xfs/587 xfs/588 xfs/589 xfs/590 xfs/591 xfs/592 xfs/593 xfs/594 xfs/595 xfs/727 xfs/729 xfs/800

So how is this different to ./check -n -g soak?

'-n' is supposed to show you what tests are going to be run
without actually running them, so why can't you use that?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* [PATCH] test_xarray: fix soft lockup for advanced-api tests
@ 2024-02-16 19:43  5% Luis Chamberlain
  2024-02-20  2:28  0% ` Andrew Morton
  0 siblings, 1 reply; 200+ results
From: Luis Chamberlain @ 2024-02-16 19:43 UTC (permalink / raw)
  To: akpm, willy
  Cc: linux-fsdevel, linux-mm, linux-kernel, gost.dev, p.raghav,
	da.gomez, mcgrof, kernel test robot

The new advanced API tests want to vet that the xarray API is doing what it
promises by manually iterating over a set of possible indexes on its
own, and using a query operation which holds the RCU lock and then
releases it. So it is not using the helper loop options which xarray
provides on purpose. Any loop which iterates over 1 million entries
(which is possible with order 20, so emulating, say, a 4 GiB block size)
just to RCU lock and unlock will eventually end up triggering a soft
lockup on systems which don't preempt and have lock proving and RCU
proving enabled.

xarray users already use XA_CHECK_SCHED for loops which may take a long
time; in our case we don't want to RCU unlock and lock, as the caller
already does that, but rather just force a schedule every XA_CHECK_SCHED
iterations, since the test intentionally does not trust xarray but
rather verifies that it is doing the right thing.

[0] https://lkml.kernel.org/r/202402071613.70f28243-lkp@intel.com

Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 lib/test_xarray.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index d4e55b4867dc..ac162025cc59 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -781,6 +781,7 @@ static noinline void *test_get_entry(struct xarray *xa, unsigned long index)
 {
 	XA_STATE(xas, xa, index);
 	void *p;
+	static unsigned int i = 0;
 
 	rcu_read_lock();
 repeat:
@@ -790,6 +791,17 @@ static noinline void *test_get_entry(struct xarray *xa, unsigned long index)
 		goto repeat;
 	rcu_read_unlock();
 
+	/*
+	 * This is not part of the page cache, this selftest is pretty
+	 * aggressive and does not want to trust the xarray API but rather
+	 * test it, and for order 20 (4 GiB block size) we can loop
+	 * over a million entries which can cause a soft lockup. Page cache
+	 * APIs won't be stupid, proper page cache APIs loop over the proper
+	 * order so when using a larger order we skip shared entries.
+	 */
+	if (++i % XA_CHECK_SCHED == 0)
+		schedule();
+
 	return p;
 }
 
-- 
2.42.0



* [PATCH fstests 0/3] few enhancements
@ 2024-02-16 18:18  5% Luis Chamberlain
  2024-02-16 18:18  2% ` [PATCH 1/3] tests: augment soak test group Luis Chamberlain
  2024-02-16 18:18  4% ` [PATCH 2/3] check: add support for --list-group-tests Luis Chamberlain
  0 siblings, 2 replies; 200+ results
From: Luis Chamberlain @ 2024-02-16 18:18 UTC (permalink / raw)
  To: fstests, anand.jain, aalbersh, djwong
  Cc: linux-fsdevel, kdevops, patches, Luis Chamberlain

This adds a few enhancements picked up from experience with
kdevops. The first one is to augment the set of tests which are part of
the soak group. That required careful review of all of our tests, so we
might as well update the tests with this information.

To allow us to verify this, we add an option to let us list the tests
which are part of a group with --list-group-tests. This came out of
recent discussions where Darrick proposed that it might be possible
to query this [0]; this is an attempt to help with that. The goal here
is to ensure that any test which does use SOAK_DURATION will be
augmented properly in the future as part of the soak group.

The last patch is a simple fstests watchdog enhancement which lets
applications that only monitor a guest's kernel buffer know when
the show has started and ended. It is completely optional to use, but
kdevops has been using this in its own wrapper oscheck.sh for years now.

[0] https://lkml.kernel.org/r/20240125222956.GD6188@frogsfrogsfrogs

Luis Chamberlain (3):
  tests: augment soak test group
  check: add support for --list-group-tests
  check: add --print-start-done to enhance watchdogs

 check             | 32 +++++++++++++++++++++++++++++++-
 tests/generic/019 |  2 +-
 tests/generic/388 |  2 +-
 tests/generic/475 |  2 +-
 tests/generic/642 |  2 +-
 tests/generic/648 |  2 +-
 tests/xfs/285     |  2 +-
 tests/xfs/517     |  2 +-
 tests/xfs/560     |  2 +-
 tests/xfs/561     |  2 +-
 tests/xfs/562     |  2 +-
 tests/xfs/565     |  2 +-
 tests/xfs/570     |  2 +-
 tests/xfs/571     |  2 +-
 tests/xfs/572     |  2 +-
 tests/xfs/573     |  2 +-
 tests/xfs/574     |  2 +-
 tests/xfs/575     |  2 +-
 tests/xfs/576     |  2 +-
 tests/xfs/577     |  2 +-
 tests/xfs/578     |  2 +-
 tests/xfs/579     |  2 +-
 tests/xfs/580     |  2 +-
 tests/xfs/581     |  2 +-
 tests/xfs/582     |  2 +-
 tests/xfs/583     |  2 +-
 tests/xfs/584     |  2 +-
 tests/xfs/585     |  2 +-
 tests/xfs/586     |  2 +-
 tests/xfs/587     |  2 +-
 tests/xfs/588     |  2 +-
 tests/xfs/589     |  2 +-
 tests/xfs/590     |  2 +-
 tests/xfs/591     |  2 +-
 tests/xfs/592     |  2 +-
 tests/xfs/593     |  2 +-
 tests/xfs/594     |  2 +-
 tests/xfs/595     |  2 +-
 tests/xfs/727     |  2 +-
 tests/xfs/729     |  2 +-
 tests/xfs/800     |  2 +-
 41 files changed, 71 insertions(+), 41 deletions(-)

-- 
2.42.0


^ permalink raw reply	[relevance 5%]

* [PATCH 2/3] check: add support for --list-group-tests
  2024-02-16 18:18  5% [PATCH fstests 0/3] few enhancements Luis Chamberlain
  2024-02-16 18:18  2% ` [PATCH 1/3] tests: augment soak test group Luis Chamberlain
@ 2024-02-16 18:18  4% ` Luis Chamberlain
  2024-02-19  3:38  0%   ` Dave Chinner
  1 sibling, 1 reply; 200+ results
From: Luis Chamberlain @ 2024-02-16 18:18 UTC (permalink / raw)
  To: fstests, anand.jain, aalbersh, djwong
  Cc: linux-fsdevel, kdevops, patches, Luis Chamberlain

The prior commit added the ability to list the tests in a group, but it
is only used with --start-after. Let's add an option which leverages
this to also allow us to easily query which tests are part of the
specified groups.

This can be used by dynamic test configuration suites such as kdevops,
which may want to take advantage of this information to deterministically
determine whether a test is part of a specific group.

Demo:

root@demo-xfs-reflink /var/lib/xfstests # ./check --list-group-tests -g soak

generic/019 generic/388 generic/475 generic/476 generic/521 generic/522 generic/616 generic/617 generic/642 generic/648 generic/650 xfs/285 xfs/517 xfs/560 xfs/561 xfs/562 xfs/565 xfs/570 xfs/571 xfs/572 xfs/573 xfs/574 xfs/575 xfs/576 xfs/577 xfs/578 xfs/579 xfs/580 xfs/581 xfs/582 xfs/583 xfs/584 xfs/585 xfs/586 xfs/587 xfs/588 xfs/589 xfs/590 xfs/591 xfs/592 xfs/593 xfs/594 xfs/595 xfs/727 xfs/729 xfs/800
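The single-line output above is intended for machine scraping; a consumer
could check group membership with a small helper along these lines (a
hypothetical sketch, not kdevops code; `in_group_list()` is a name invented
here):

```c
#include <string.h>

/*
 * Check whether @test appears as a whole word in the space-separated
 * list printed by "check --list-group-tests -g <group>".
 */
static int in_group_list(const char *list, const char *test)
{
	size_t n = strlen(test);
	const char *p = list;

	while ((p = strstr(p, test)) != NULL) {
		int starts = (p == list || p[-1] == ' ');
		int ends = (p[n] == '\0' || p[n] == ' ' || p[n] == '\n');

		/* Only accept matches bounded by spaces or the ends of
		 * the line, so "xfs/51" does not match "xfs/517". */
		if (starts && ends)
			return 1;
		p++;
	}
	return 0;
}
```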

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 check | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/check b/check
index f081bf8ce685..523cf024c139 100755
--- a/check
+++ b/check
@@ -19,6 +19,7 @@ have_test_arg=false
 randomize=false
 exact_order=false
 start_after_test=""
+list_group_tests=false
 export here=`pwd`
 xfile=""
 subdir_xfile=""
@@ -81,6 +82,7 @@ check options
     -b			brief test summary
     -R fmt[,fmt]	generate report in formats specified. Supported formats: xunit, xunit-quiet
     --large-fs		optimise scratch device for large filesystems
+    --list-group-tests	only list tests part of the groups you specified, do not run the tests
     --start-after	only start testing after the test specified
     -s section		run only specified section from config file
     -S section		exclude the specified section from the config file
@@ -276,8 +278,16 @@ _prepare_test_list()
 			done
 			group_all="$group_all $list"
 		done
+
+		group_all=$(echo $group_all | sed -e 's|tests/||g')
+
+		# Keep it simple, allow for easy machine scraping
+		if $list_group_tests ; then
+			echo $group_all
+			exit 0
+		fi
+
 		if [[ "$start_after_test" != "" && $start_after_found -ne 1 ]]; then
-			group_all=$(echo $group_all | sed -e 's|tests/||g')
 			echo "Start after test $start_after_test not found in any group specified."
 			echo "Be sure you specify a test present in one of your test run groups if using --start-after."
 			echo
@@ -366,6 +376,9 @@ while [ $# -gt 0 ]; do
 		start_after_test="$2"
 		shift
 		;;
+	--list-group-tests)
+		list_group_tests=true
+		;;
 	-s)	RUN_SECTION="$RUN_SECTION $2"; shift ;;
 	-S)	EXCLUDE_SECTION="$EXCLUDE_SECTION $2"; shift ;;
 	-l)	diff="diff" ;;
-- 
2.42.0


^ permalink raw reply related	[relevance 4%]

* [PATCH 1/3] tests: augment soak test group
  2024-02-16 18:18  5% [PATCH fstests 0/3] few enhancements Luis Chamberlain
@ 2024-02-16 18:18  2% ` Luis Chamberlain
  2024-02-16 18:18  4% ` [PATCH 2/3] check: add support for --list-group-tests Luis Chamberlain
  1 sibling, 0 replies; 200+ results
From: Luis Chamberlain @ 2024-02-16 18:18 UTC (permalink / raw)
  To: fstests, anand.jain, aalbersh, djwong
  Cc: linux-fsdevel, kdevops, patches, Luis Chamberlain

Many tests use SOAK_DURATION but have not been added to the soak
group. We want a deterministic way to query which tests are part of
the soak group, so that test frameworks which use fstests can get an
idea when a test may have exceeded the expected amount of time to
complete. Of course such a time is subjective to a test environment
and system, however max values are possible and can be used for an
initial test run, and later an enhanced test environment can also
leverage prior known test times from check.time. That is exactly what
kdevops uses to determine a timeout.

In kdevops we have to maintain a static list of tests which use soak;
with this, we should be able to grow that set dynamically.

Tests either use SOAK_DURATION directly or through a helper loop such as
_soak_loop_running(). XFS also uses SOAK_DURATION with helpers such as
_scratch_xfs_stress_scrub().
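The timeout derivation mentioned above can be sketched as follows (an
illustrative guess at the kind of logic a framework could use, not kdevops'
actual implementation; check.time stores one "testname seconds" pair per
line, and `watchdog_timeout()` is a name invented here):

```c
#include <stdio.h>
#include <string.h>

/*
 * Sketch: derive a per-run watchdog timeout from a check.time buffer.
 * Each line is "testname seconds"; the timeout is the longest prior
 * runtime, scaled by a slack factor to absorb environment variance.
 */
static long watchdog_timeout(const char *check_time, long slack_factor)
{
	char name[64];
	long secs, max = 0;
	const char *p = check_time;

	while (p && sscanf(p, "%63s %ld", name, &secs) == 2) {
		if (secs > max)
			max = secs;
		/* Advance to the start of the next line, if any. */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return max * slack_factor;
}
```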

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 tests/generic/019 | 2 +-
 tests/generic/388 | 2 +-
 tests/generic/475 | 2 +-
 tests/generic/642 | 2 +-
 tests/generic/648 | 2 +-
 tests/xfs/285     | 2 +-
 tests/xfs/517     | 2 +-
 tests/xfs/560     | 2 +-
 tests/xfs/561     | 2 +-
 tests/xfs/562     | 2 +-
 tests/xfs/565     | 2 +-
 tests/xfs/570     | 2 +-
 tests/xfs/571     | 2 +-
 tests/xfs/572     | 2 +-
 tests/xfs/573     | 2 +-
 tests/xfs/574     | 2 +-
 tests/xfs/575     | 2 +-
 tests/xfs/576     | 2 +-
 tests/xfs/577     | 2 +-
 tests/xfs/578     | 2 +-
 tests/xfs/579     | 2 +-
 tests/xfs/580     | 2 +-
 tests/xfs/581     | 2 +-
 tests/xfs/582     | 2 +-
 tests/xfs/583     | 2 +-
 tests/xfs/584     | 2 +-
 tests/xfs/585     | 2 +-
 tests/xfs/586     | 2 +-
 tests/xfs/587     | 2 +-
 tests/xfs/588     | 2 +-
 tests/xfs/589     | 2 +-
 tests/xfs/590     | 2 +-
 tests/xfs/591     | 2 +-
 tests/xfs/592     | 2 +-
 tests/xfs/593     | 2 +-
 tests/xfs/594     | 2 +-
 tests/xfs/595     | 2 +-
 tests/xfs/727     | 2 +-
 tests/xfs/729     | 2 +-
 tests/xfs/800     | 2 +-
 40 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/tests/generic/019 b/tests/generic/019
index b81c1d17ba65..a77ce1e3dad6 100755
--- a/tests/generic/019
+++ b/tests/generic/019
@@ -8,7 +8,7 @@
 # check filesystem consistency at the end.
 #
 . ./common/preamble
-_begin_fstest aio dangerous enospc rw stress recoveryloop
+_begin_fstest aio dangerous enospc rw stress recoveryloop soak
 
 fio_config=$tmp.fio
 
diff --git a/tests/generic/388 b/tests/generic/388
index 4a5be6698cbd..523f4b310b8a 100755
--- a/tests/generic/388
+++ b/tests/generic/388
@@ -15,7 +15,7 @@
 # spurious corruption reports and/or mount failures.
 #
 . ./common/preamble
-_begin_fstest shutdown auto log metadata recoveryloop
+_begin_fstest shutdown auto log metadata recoveryloop soak
 
 # Override the default cleanup function.
 _cleanup()
diff --git a/tests/generic/475 b/tests/generic/475
index ce7fe013b1fc..cfbbcedf80e2 100755
--- a/tests/generic/475
+++ b/tests/generic/475
@@ -12,7 +12,7 @@
 # testing efforts.
 #
 . ./common/preamble
-_begin_fstest shutdown auto log metadata eio recoveryloop smoketest
+_begin_fstest shutdown auto log metadata eio recoveryloop smoketest soak
 
 # Override the default cleanup function.
 _cleanup()
diff --git a/tests/generic/642 b/tests/generic/642
index 4d0c41fd5d51..9c367c653807 100755
--- a/tests/generic/642
+++ b/tests/generic/642
@@ -8,7 +8,7 @@
 # bugs in the xattr code.
 #
 . ./common/preamble
-_begin_fstest auto soak attr long_rw stress smoketest
+_begin_fstest auto soak attr long_rw stress smoketest soak
 
 _cleanup()
 {
diff --git a/tests/generic/648 b/tests/generic/648
index 3b3544ff49c3..e3f4ce7af801 100755
--- a/tests/generic/648
+++ b/tests/generic/648
@@ -12,7 +12,7 @@
 # in writeback on the host that cause VM guests to fail to recover.
 #
 . ./common/preamble
-_begin_fstest shutdown auto log metadata eio recoveryloop
+_begin_fstest shutdown auto log metadata eio recoveryloop soak
 
 _cleanup()
 {
diff --git a/tests/xfs/285 b/tests/xfs/285
index 0056baeb1c73..e0510d7f6696 100755
--- a/tests/xfs/285
+++ b/tests/xfs/285
@@ -8,7 +8,7 @@
 # or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	cd /
diff --git a/tests/xfs/517 b/tests/xfs/517
index 68438e544ea0..815c1fb40cc1 100755
--- a/tests/xfs/517
+++ b/tests/xfs/517
@@ -7,7 +7,7 @@
 # Race freeze and fsmap for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest auto quick fsmap freeze
+_begin_fstest auto quick fsmap freeze soak
 
 _register_cleanup "_cleanup" BUS
 
diff --git a/tests/xfs/560 b/tests/xfs/560
index 28b45d5f5e72..a931da7bc239 100755
--- a/tests/xfs/560
+++ b/tests/xfs/560
@@ -7,7 +7,7 @@
 # Race GETFSMAP and ro remount for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest auto quick fsmap remount
+_begin_fstest auto quick fsmap remount soak
 
 # Override the default cleanup function.
 _cleanup()
diff --git a/tests/xfs/561 b/tests/xfs/561
index c1d68c6fe62c..10277e8a6d75 100755
--- a/tests/xfs/561
+++ b/tests/xfs/561
@@ -8,7 +8,7 @@
 # crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 # Override the default cleanup function.
 _cleanup()
diff --git a/tests/xfs/562 b/tests/xfs/562
index a5c6e88875fc..a7304cd3b551 100755
--- a/tests/xfs/562
+++ b/tests/xfs/562
@@ -8,7 +8,7 @@
 # or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 # Override the default cleanup function.
 _cleanup()
diff --git a/tests/xfs/565 b/tests/xfs/565
index 826bc5354a77..8000984bdee6 100755
--- a/tests/xfs/565
+++ b/tests/xfs/565
@@ -8,7 +8,7 @@
 # or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	cd /
diff --git a/tests/xfs/570 b/tests/xfs/570
index 9f3ba873ae3d..e8c3a315d325 100755
--- a/tests/xfs/570
+++ b/tests/xfs/570
@@ -7,7 +7,7 @@
 # Race fsstress and superblock scrub for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/571 b/tests/xfs/571
index 9d22de8f45c5..4e5ad4b0460e 100755
--- a/tests/xfs/571
+++ b/tests/xfs/571
@@ -7,7 +7,7 @@
 # Race fsstress and AGF scrub for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/572 b/tests/xfs/572
index b0e352af4e40..dfbed43ffa83 100755
--- a/tests/xfs/572
+++ b/tests/xfs/572
@@ -7,7 +7,7 @@
 # Race fsstress and AGFL scrub for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/573 b/tests/xfs/573
index a2b6bef3cf3b..5be8fea7676e 100755
--- a/tests/xfs/573
+++ b/tests/xfs/573
@@ -7,7 +7,7 @@
 # Race fsstress and AGI scrub for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/574 b/tests/xfs/574
index 5a4bad00162d..847a99bc01b7 100755
--- a/tests/xfs/574
+++ b/tests/xfs/574
@@ -8,7 +8,7 @@
 # crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/575 b/tests/xfs/575
index 3d29620f2c4b..66731af213eb 100755
--- a/tests/xfs/575
+++ b/tests/xfs/575
@@ -8,7 +8,7 @@
 # crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/576 b/tests/xfs/576
index e11476d452fd..d1b99b968068 100755
--- a/tests/xfs/576
+++ b/tests/xfs/576
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/577 b/tests/xfs/577
index d1abe6fafb15..dad9b3f400cc 100755
--- a/tests/xfs/577
+++ b/tests/xfs/577
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/578 b/tests/xfs/578
index 8160b7ef515e..28db2c53ba83 100755
--- a/tests/xfs/578
+++ b/tests/xfs/578
@@ -8,7 +8,7 @@
 # or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/579 b/tests/xfs/579
index a00ae02aa74e..bd187852419d 100755
--- a/tests/xfs/579
+++ b/tests/xfs/579
@@ -8,7 +8,7 @@
 # or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/580 b/tests/xfs/580
index f49cba6427c4..1094f04e730c 100755
--- a/tests/xfs/580
+++ b/tests/xfs/580
@@ -8,7 +8,7 @@
 # if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/581 b/tests/xfs/581
index 1d08bc7df3e6..e733bf3962ce 100755
--- a/tests/xfs/581
+++ b/tests/xfs/581
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/582 b/tests/xfs/582
index 7a8c330befd1..97c2bfde1453 100755
--- a/tests/xfs/582
+++ b/tests/xfs/582
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/583 b/tests/xfs/583
index a6121a83bb65..9eb4cefe05f0 100755
--- a/tests/xfs/583
+++ b/tests/xfs/583
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/584 b/tests/xfs/584
index c80ba57550cb..81ab3e82120b 100755
--- a/tests/xfs/584
+++ b/tests/xfs/584
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/585 b/tests/xfs/585
index ea47dada7bc3..74493ba1f3d7 100755
--- a/tests/xfs/585
+++ b/tests/xfs/585
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/586 b/tests/xfs/586
index e802ee718887..8d1e960fe0d4 100755
--- a/tests/xfs/586
+++ b/tests/xfs/586
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/587 b/tests/xfs/587
index 71e1ce69ae0b..dd9442c203ae 100755
--- a/tests/xfs/587
+++ b/tests/xfs/587
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/588 b/tests/xfs/588
index f56c50ace5f2..824f47fc8d05 100755
--- a/tests/xfs/588
+++ b/tests/xfs/588
@@ -7,7 +7,7 @@
 # Race fsstress and data fork scrub for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/589 b/tests/xfs/589
index d9cd81e02be8..2ca3dd3d0d41 100755
--- a/tests/xfs/589
+++ b/tests/xfs/589
@@ -7,7 +7,7 @@
 # Race fsstress and attr fork scrub for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/590 b/tests/xfs/590
index 4e39109abd9a..587c0be19cca 100755
--- a/tests/xfs/590
+++ b/tests/xfs/590
@@ -7,7 +7,7 @@
 # Race fsstress and cow fork scrub for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/591 b/tests/xfs/591
index 00d5114e06ef..79492e8aeefb 100755
--- a/tests/xfs/591
+++ b/tests/xfs/591
@@ -7,7 +7,7 @@
 # Race fsstress and directory scrub for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/592 b/tests/xfs/592
index 02ac456b5e2b..aacd95cbfad4 100755
--- a/tests/xfs/592
+++ b/tests/xfs/592
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/593 b/tests/xfs/593
index cf2ac506ca72..7a8b4a6010fc 100755
--- a/tests/xfs/593
+++ b/tests/xfs/593
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/594 b/tests/xfs/594
index 323b191b59ae..2f6287396be1 100755
--- a/tests/xfs/594
+++ b/tests/xfs/594
@@ -8,7 +8,7 @@
 # We can't open symlink files directly for scrubbing, so we use xfs_scrub(8).
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/595 b/tests/xfs/595
index fc2a89ed8625..4e431258ce58 100755
--- a/tests/xfs/595
+++ b/tests/xfs/595
@@ -9,7 +9,7 @@
 # xfs_scrub(8).
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/727 b/tests/xfs/727
index 6c5ac7db5e47..81be43cc521d 100755
--- a/tests/xfs/727
+++ b/tests/xfs/727
@@ -8,7 +8,7 @@
 # livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/729 b/tests/xfs/729
index 235cb175d259..70ed67eb24f3 100755
--- a/tests/xfs/729
+++ b/tests/xfs/729
@@ -7,7 +7,7 @@
 # Race fsstress and nlinks scrub for a while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	_scratch_xfs_stress_scrub_cleanup &> /dev/null
diff --git a/tests/xfs/800 b/tests/xfs/800
index a23e47338e59..6086a4ee2fa2 100755
--- a/tests/xfs/800
+++ b/tests/xfs/800
@@ -8,7 +8,7 @@
 # while to see if we crash or livelock.
 #
 . ./common/preamble
-_begin_fstest scrub dangerous_fsstress_scrub
+_begin_fstest scrub dangerous_fsstress_scrub soak
 
 _cleanup() {
 	cd /
-- 
2.42.0


^ permalink raw reply related	[relevance 2%]

* Re: [PATCH v9 1/8] landlock: Add IOCTL access right
  2024-02-09 17:06  5% ` [PATCH v9 1/8] landlock: Add IOCTL access right Günther Noack
@ 2024-02-16 17:19  0%   ` Mickaël Salaün
  2024-02-19 18:34  0%   ` Mickaël Salaün
  1 sibling, 0 replies; 200+ results
From: Mickaël Salaün @ 2024-02-16 17:19 UTC (permalink / raw)
  To: Günther Noack
  Cc: linux-security-module, Jeff Xu, Arnd Bergmann,
	Jorge Lucangeli Obes, Allen Webb, Dmitry Torokhov, Paul Moore,
	Konstantin Meskhidze, Matt Bobrowski, linux-fsdevel

On Fri, Feb 09, 2024 at 06:06:05PM +0100, Günther Noack wrote:
> Introduces the LANDLOCK_ACCESS_FS_IOCTL access right
> and increments the Landlock ABI version to 5.
> 
> Like the truncate right, these rights are associated with a file
> descriptor at the time of open(2), and get respected even when the
> file descriptor is used outside of the thread which it was originally
> opened in.
> 
> A newly enabled Landlock policy therefore does not apply to file
> descriptors which are already open.
> 
> If the LANDLOCK_ACCESS_FS_IOCTL right is handled, only a small number
> of safe IOCTL commands will be permitted on newly opened files.  The
> permitted IOCTLs can be configured through the ruleset in limited ways
> now.  (See documentation for details.)
> 
> Specifically, when LANDLOCK_ACCESS_FS_IOCTL is handled, granting this
> right on a file or directory will *not* permit to do all IOCTL
> commands, but only influence the IOCTL commands which are not already
> handled through other access rights.  The intent is to keep the groups
> of IOCTL commands more fine-grained.
> 
> Noteworthy scenarios which require special attention:
> 
> TTY devices are often passed into a process from the parent process,
> and so a newly enabled Landlock policy does not retroactively apply to
> them automatically.  In the past, TTY devices have often supported
> IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which were
> letting callers control the TTY input buffer (and simulate
> keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
> modern kernels though.
> 
> Some legitimate file system features, like setting up fscrypt, are
> exposed as IOCTL commands on regular files and directories -- users of
> Landlock are advised to double check that the sandboxed process does
> not need to invoke these IOCTLs.
> 
> Known limitations:
> 
> The LANDLOCK_ACCESS_FS_IOCTL access right is a coarse-grained control
> over IOCTL commands.  Future work will enable a more fine-grained
> access control for IOCTLs.
> 
> In the meantime, Landlock users may use path-based restrictions in
> combination with their knowledge about the file system layout to
> control what IOCTLs can be done.  Mounting file systems with the nodev
> option can help to distinguish regular files and devices, and give
> guarantees about the affected files, which Landlock alone can not give
> yet.
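The command grouping this patch describes can be mirrored in a small
userspace sketch (a hedged illustration of the documented dispatch, not the
kernel implementation; the enum and `classify_ioctl()` are names invented
here, while the IOCTL constants come from <sys/ioctl.h> and <linux/fs.h>):

```c
#include <sys/ioctl.h>	/* FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC, FIONREAD */
#include <linux/fs.h>	/* FIGETBSZ, FS_IOC_SETFLAGS */

/* Synthetic classification mirroring the documented IOCTL groups. */
enum ioctl_class {
	IOCTL_ALWAYS_ALLOWED,	/* never restricted by LANDLOCK_ACCESS_FS_IOCTL */
	IOCTL_NEEDS_RW,		/* unlocked by READ_FILE/WRITE_FILE/READ_DIR */
	IOCTL_NEEDS_IOCTL,	/* governed by LANDLOCK_ACCESS_FS_IOCTL itself */
};

static enum ioctl_class classify_ioctl(unsigned int cmd)
{
	switch (cmd) {
	case FIOCLEX:
	case FIONCLEX:
	case FIONBIO:
	case FIOASYNC:
		return IOCTL_ALWAYS_ALLOWED;
	case FIONREAD:
	case FIGETBSZ:
		return IOCTL_NEEDS_RW;
	default:	/* e.g. FS_IOC_SETFLAGS, FIFREEZE, ... */
		return IOCTL_NEEDS_IOCTL;
	}
}
```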
> 
> Signed-off-by: Günther Noack <gnoack@google.com>
> ---
>  include/uapi/linux/landlock.h                |  55 ++++-
>  security/landlock/fs.c                       | 227 ++++++++++++++++++-
>  security/landlock/fs.h                       |   3 +
>  security/landlock/limits.h                   |  11 +-
>  security/landlock/ruleset.h                  |   2 +-
>  security/landlock/syscalls.c                 |  19 +-
>  tools/testing/selftests/landlock/base_test.c |   2 +-
>  tools/testing/selftests/landlock/fs_test.c   |   5 +-
>  8 files changed, 302 insertions(+), 22 deletions(-)
> 
> diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> index 25c8d7677539..16d7d72804f8 100644
> --- a/include/uapi/linux/landlock.h
> +++ b/include/uapi/linux/landlock.h
> @@ -128,7 +128,7 @@ struct landlock_net_port_attr {
>   * files and directories.  Files or directories opened before the sandboxing
>   * are not subject to these restrictions.
>   *
> - * A file can only receive these access rights:
> + * The following access rights apply only to files:
>   *
>   * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
>   * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
> @@ -138,12 +138,13 @@ struct landlock_net_port_attr {
>   * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
>   * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
>   *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
> - *   ``O_TRUNC``. Whether an opened file can be truncated with
> - *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
> - *   same way as read and write permissions are checked during
> - *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
> - *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
> - *   third version of the Landlock ABI.
> + *   ``O_TRUNC``.  This access right is available since the third version of the
> + *   Landlock ABI.
> + *
> + * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
> + * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
> + * read and write permissions are checked during :manpage:`open(2)` using
> + * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
>   *
>   * A directory can receive access rights related to files or directories.  The
>   * following access right is applied to the directory itself, and the
> @@ -198,13 +199,50 @@ struct landlock_net_port_attr {
>   *   If multiple requirements are not met, the ``EACCES`` error code takes
>   *   precedence over ``EXDEV``.
>   *
> + * The following access right applies both to files and directories:
> + *
> + * - %LANDLOCK_ACCESS_FS_IOCTL: Invoke :manpage:`ioctl(2)` commands on an opened
> + *   file or directory.
> + *
> + *   This access right applies to all :manpage:`ioctl(2)` commands, except for
> + *   ``FIOCLEX``, ``FIONCLEX``, ``FIONBIO`` and ``FIOASYNC``.  These commands
> + *   continue to be invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL
> + *   access right.
> + *
> + *   When certain other access rights are handled in the ruleset, in addition to
> + *   %LANDLOCK_ACCESS_FS_IOCTL, granting these access rights will unlock access
> + *   to additional groups of IOCTL commands, on the affected files:
> + *
> + *   * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE unlock
> + *     access to ``FIOQSIZE``, ``FIONREAD``, ``FIGETBSZ``, ``FS_IOC_FIEMAP``,
> + *     ``FIBMAP``, ``FIDEDUPERANGE``, ``FICLONE``, ``FICLONERANGE``,
> + *     ``FS_IOC_RESVSP``, ``FS_IOC_RESVSP64``, ``FS_IOC_UNRESVSP``,
> + *     ``FS_IOC_UNRESVSP64``, ``FS_IOC_ZERO_RANGE``.
> + *
> + *   * %LANDLOCK_ACCESS_FS_READ_DIR unlocks access to ``FIOQSIZE``,
> + *     ``FIONREAD``, ``FIGETBSZ``.
> + *
> + *   When these access rights are handled in the ruleset, the availability of
> + *   the affected IOCTL commands is not governed by %LANDLOCK_ACCESS_FS_IOCTL
> + *   any more, but by the respective access right.
> + *
> + *   All other IOCTL commands are not handled specially, and are governed by
> + *   %LANDLOCK_ACCESS_FS_IOCTL.  This includes %FS_IOC_GETFLAGS and
> + *   %FS_IOC_SETFLAGS for manipulating inode flags (:manpage:`ioctl_iflags(2)`),
> + *   %FS_IOC_FSGETXATTR and %FS_IOC_FSSETXATTR for manipulating extended
> + *   attributes, as well as %FIFREEZE and %FITHAW for freezing and thawing file
> + *   systems.
> + *
> + *   This access right is available since the fifth version of the Landlock
> + *   ABI.
> + *
>   * .. warning::
>   *
>   *   It is currently not possible to restrict some file-related actions
>   *   accessible through these syscall families: :manpage:`chdir(2)`,
>   *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
>   *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
> - *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
> + *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
>   *   Future Landlock evolutions will enable to restrict them.
>   */
>  /* clang-format off */
> @@ -223,6 +261,7 @@ struct landlock_net_port_attr {
>  #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
>  #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
>  #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
> +#define LANDLOCK_ACCESS_FS_IOCTL			(1ULL << 15)
>  /* clang-format on */
>  
>  /**
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index 73997e63734f..84efea3f7c0f 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -7,6 +7,7 @@
>   * Copyright © 2021-2022 Microsoft Corporation
>   */
>  
> +#include <asm/ioctls.h>
>  #include <kunit/test.h>
>  #include <linux/atomic.h>
>  #include <linux/bitops.h>
> @@ -14,6 +15,7 @@
>  #include <linux/compiler_types.h>
>  #include <linux/dcache.h>
>  #include <linux/err.h>
> +#include <linux/falloc.h>
>  #include <linux/fs.h>
>  #include <linux/init.h>
>  #include <linux/kernel.h>
> @@ -29,6 +31,7 @@
>  #include <linux/types.h>
>  #include <linux/wait_bit.h>
>  #include <linux/workqueue.h>
> +#include <uapi/linux/fiemap.h>
>  #include <uapi/linux/landlock.h>
>  
>  #include "common.h"
> @@ -84,6 +87,186 @@ static const struct landlock_object_underops landlock_fs_underops = {
>  	.release = release_inode
>  };
>  
> +/* IOCTL helpers */
> +
> +/*
> + * These are synthetic access rights, which are only used within the kernel, but
> + * not exposed to callers in userspace.  The mapping between these access rights
> + * and IOCTL commands is defined in the get_required_ioctl_access() helper function.
> + */
> +#define LANDLOCK_ACCESS_FS_IOCTL_RW (LANDLOCK_LAST_PUBLIC_ACCESS_FS << 1)
> +#define LANDLOCK_ACCESS_FS_IOCTL_RW_FILE (LANDLOCK_LAST_PUBLIC_ACCESS_FS << 2)
> +
> +/* ioctl_groups - all synthetic access rights for IOCTL command groups */
> +/* clang-format off */
> +#define IOCTL_GROUPS (				\
> +	LANDLOCK_ACCESS_FS_IOCTL_RW |		\
> +	LANDLOCK_ACCESS_FS_IOCTL_RW_FILE)
> +/* clang-format on */
> +
> +static_assert((IOCTL_GROUPS & LANDLOCK_MASK_ACCESS_FS) == IOCTL_GROUPS);
> +
> +/**
> + * get_required_ioctl_access(): Determine required IOCTL access rights.
> + *
> + * @cmd: The IOCTL command that is supposed to be run.
> + *
> + * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
> + * should be considered for inclusion here.

It might be a good idea to add a similar comment in
fs/ioctl.c:do_vfs_ioctl(), just before the "default" case, to make sure
nobody forgets to Cc us if a new command is added.

> + *
> + * Returns: The access rights that must be granted on an opened file in order to
> + * use the given @cmd.
> + */
> +static __attribute_const__ access_mask_t
> +get_required_ioctl_access(const unsigned int cmd)
> +{
> +	switch (cmd) {
> +	case FIOCLEX:
> +	case FIONCLEX:
> +	case FIONBIO:
> +	case FIOASYNC:
> +		/*
> +		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
> +		 * close-on-exec and the file's buffered-IO and async flags.
> +		 * These operations are also available through fcntl(2), and are
> +		 * unconditionally permitted in Landlock.
> +		 */
> +		return 0;
> +	case FIONREAD:
> +	case FIOQSIZE:
> +	case FIGETBSZ:
> +		/*
> +		 * FIONREAD returns the number of immediately readable bytes for
> +		 * a file.
> +		 *
> +		 * FIOQSIZE queries the size of a file or directory.
> +		 *
> +		 * FIGETBSZ queries the file system's block size for a file or
> +		 * directory.
> +		 *
> +		 * These IOCTL commands are permitted for files which are opened
> +		 * with LANDLOCK_ACCESS_FS_READ_DIR,
> +		 * LANDLOCK_ACCESS_FS_READ_FILE, or
> +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL_RW;
> +	case FS_IOC_FIEMAP:
> +	case FIBMAP:
> +		/*
> +		 * FS_IOC_FIEMAP and FIBMAP query information about the
> +		 * allocation of blocks within a file.  They are permitted for
> +		 * files which are opened with LANDLOCK_ACCESS_FS_READ_FILE or
> +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> +		 */
> +		fallthrough;
> +	case FIDEDUPERANGE:
> +	case FICLONE:
> +	case FICLONERANGE:
> +		/*
> +		 * FIDEDUPERANGE, FICLONE and FICLONERANGE make files share
> +		 * their underlying storage ("reflink") between source and
> +		 * destination FDs, on file systems which support that.
> +		 *
> +		 * The underlying implementations are already checking whether
> +		 * the involved files are opened with the appropriate read/write
> +		 * modes.  We rely on this being implemented correctly.
> +		 *
> +		 * These IOCTLs are permitted for files which are opened with
> +		 * LANDLOCK_ACCESS_FS_READ_FILE or
> +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> +		 */
> +		fallthrough;
> +	case FS_IOC_RESVSP:
> +	case FS_IOC_RESVSP64:
> +	case FS_IOC_UNRESVSP:
> +	case FS_IOC_UNRESVSP64:
> +	case FS_IOC_ZERO_RANGE:
> +		/*
> +		 * These IOCTLs reserve space, or create holes like
> +		 * fallocate(2).  We rely on the implementations checking the
> +		 * files' read/write modes.
> +		 *
> +		 * These IOCTLs are permitted for files which are opened with
> +		 * LANDLOCK_ACCESS_FS_READ_FILE or
> +		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
> +	default:
> +		/*
> +		 * Other commands are guarded by the catch-all access right.
> +		 */
> +		return LANDLOCK_ACCESS_FS_IOCTL;
> +	}

Good documentation and better grouping!

> +}
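
As a side note for readers following along, the resulting grouping can be
mirrored in a small userspace sketch (toy bit values and a trimmed command
list, purely illustrative — not the kernel's internal values):

```c
#include <assert.h>
#include <string.h>

/* Toy stand-ins for the synthetic access rights (made-up bit values). */
#define IOCTL_RW       (1u << 0) /* needs READ_FILE, WRITE_FILE or READ_DIR */
#define IOCTL_RW_FILE  (1u << 1) /* needs READ_FILE or WRITE_FILE */
#define IOCTL_CATCHALL (1u << 2) /* guarded by LANDLOCK_ACCESS_FS_IOCTL */

/* Trimmed mirror of get_required_ioctl_access(): command -> required group. */
static unsigned int required_group(const char *cmd_name)
{
	/* FIOCLEX, FIONCLEX, FIONBIO, FIOASYNC: unconditionally permitted. */
	if (!strcmp(cmd_name, "FIOCLEX") || !strcmp(cmd_name, "FIOASYNC"))
		return 0;
	/* FIONREAD, FIOQSIZE, FIGETBSZ: read or write, incl. directories. */
	if (!strcmp(cmd_name, "FIONREAD") || !strcmp(cmd_name, "FIGETBSZ"))
		return IOCTL_RW;
	/* FIEMAP, FIBMAP, FICLONE, ...: read or write on regular files. */
	if (!strcmp(cmd_name, "FS_IOC_FIEMAP") || !strcmp(cmd_name, "FICLONE"))
		return IOCTL_RW_FILE;
	/* Everything else falls under the catch-all access right. */
	return IOCTL_CATCHALL;
}
```

(The command names are strings here only so the mapping is easy to poke at
outside the kernel.)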
> +
> +/**
> + * expand_ioctl() - Return the dst flags from either the src flag or the
> + * %LANDLOCK_ACCESS_FS_IOCTL flag, depending on whether the
> + * %LANDLOCK_ACCESS_FS_IOCTL and src access rights are handled or not.
> + *
> + * @handled: Handled access rights.
> + * @access: The access mask to copy values from.
> + * @src: A single access right to copy from in @access.
> + * @dst: One or more access rights to copy to.
> + *
> + * Returns: @dst, or 0.
> + */
> +static __attribute_const__ access_mask_t
> +expand_ioctl(const access_mask_t handled, const access_mask_t access,
> +	     const access_mask_t src, const access_mask_t dst)
> +{
> +	access_mask_t copy_from;
> +
> +	if (!(handled & LANDLOCK_ACCESS_FS_IOCTL))
> +		return 0;
> +
> +	copy_from = (handled & src) ? src : LANDLOCK_ACCESS_FS_IOCTL;
> +	if (access & copy_from)
> +		return dst;
> +
> +	return 0;
> +}
> +
> +/**
> + * landlock_expand_access_fs() - Returns @access with the synthetic IOCTL group
> + * flags enabled if necessary.
> + *
> + * @handled: Handled FS access rights.
> + * @access: FS access rights to expand.
> + *
> + * Returns: @access expanded by the necessary flags for the synthetic IOCTL
> + * access rights.
> + */
> +static __attribute_const__ access_mask_t landlock_expand_access_fs(
> +	const access_mask_t handled, const access_mask_t access)
> +{
> +	return access |
> +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_WRITE_FILE,
> +			    LANDLOCK_ACCESS_FS_IOCTL_RW |
> +				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
> +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_FILE,
> +			    LANDLOCK_ACCESS_FS_IOCTL_RW |
> +				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
> +	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_DIR,
> +			    LANDLOCK_ACCESS_FS_IOCTL_RW);
> +}
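
The expansion rule is easy to check in userspace as well; here is a sketch
with toy bit values (same relative order as the kernel's, but not the real
constants):

```c
#include <assert.h>

/* Toy bit values mirroring only the relative order of the real rights. */
#define A_READ_FILE     (1u << 0)
#define A_WRITE_FILE    (1u << 1)
#define A_READ_DIR      (1u << 2)
#define A_IOCTL         (1u << 3)
#define A_IOCTL_RW      (1u << 4) /* synthetic */
#define A_IOCTL_RW_FILE (1u << 5) /* synthetic */

/* Mirror of expand_ioctl(): copy @dst if @src (or the IOCTL fallback,
 * when @src is not handled) is present in @access. */
static unsigned int expand_ioctl(unsigned int handled, unsigned int access,
				 unsigned int src, unsigned int dst)
{
	unsigned int copy_from;

	if (!(handled & A_IOCTL))
		return 0;
	copy_from = (handled & src) ? src : A_IOCTL;
	return (access & copy_from) ? dst : 0;
}

/* Mirror of landlock_expand_access_fs(). */
static unsigned int expand_access_fs(unsigned int handled, unsigned int access)
{
	return access |
	       expand_ioctl(handled, access, A_WRITE_FILE,
			    A_IOCTL_RW | A_IOCTL_RW_FILE) |
	       expand_ioctl(handled, access, A_READ_FILE,
			    A_IOCTL_RW | A_IOCTL_RW_FILE) |
	       expand_ioctl(handled, access, A_READ_DIR, A_IOCTL_RW);
}
```

E.g. a ruleset handling IOCTL and READ_FILE, granting READ_FILE on a file,
implicitly grants both IOCTL groups; if IOCTL is not handled, nothing is
expanded.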
> +
> +/**
> + * landlock_expand_handled_access_fs() - add synthetic IOCTL access rights to an
> + * access mask of handled accesses.
> + *
> + * @handled: The handled accesses of a ruleset that is being created.
> + *
> + * Returns: @handled, with the bits for the synthetic IOCTL access rights set,
> + * if %LANDLOCK_ACCESS_FS_IOCTL is handled.
> + */
> +__attribute_const__ access_mask_t
> +landlock_expand_handled_access_fs(const access_mask_t handled)
> +{
> +	return landlock_expand_access_fs(handled, handled);
> +}
> +
>  /* Ruleset management */
>  
>  static struct landlock_object *get_inode_object(struct inode *const inode)
> @@ -148,7 +331,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
>  	LANDLOCK_ACCESS_FS_EXECUTE | \
>  	LANDLOCK_ACCESS_FS_WRITE_FILE | \
>  	LANDLOCK_ACCESS_FS_READ_FILE | \
> -	LANDLOCK_ACCESS_FS_TRUNCATE)
> +	LANDLOCK_ACCESS_FS_TRUNCATE | \
> +	LANDLOCK_ACCESS_FS_IOCTL)
>  /* clang-format on */
>  
>  /*
> @@ -158,6 +342,7 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
>  			    const struct path *const path,
>  			    access_mask_t access_rights)
>  {
> +	access_mask_t handled;
>  	int err;
>  	struct landlock_id id = {
>  		.type = LANDLOCK_KEY_INODE,
> @@ -170,9 +355,11 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
>  	if (WARN_ON_ONCE(ruleset->num_layers != 1))
>  		return -EINVAL;
>  
> +	handled = landlock_get_fs_access_mask(ruleset, 0);
> +	/* Expands the synthetic IOCTL groups. */
> +	access_rights |= landlock_expand_access_fs(handled, access_rights);
>  	/* Transforms relative access rights to absolute ones. */
> -	access_rights |= LANDLOCK_MASK_ACCESS_FS &
> -			 ~landlock_get_fs_access_mask(ruleset, 0);
> +	access_rights |= LANDLOCK_MASK_ACCESS_FS & ~handled;
>  	id.key.object = get_inode_object(d_backing_inode(path->dentry));
>  	if (IS_ERR(id.key.object))
>  		return PTR_ERR(id.key.object);
> @@ -1333,7 +1520,9 @@ static int hook_file_open(struct file *const file)
>  {
>  	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
>  	access_mask_t open_access_request, full_access_request, allowed_access;
> -	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
> +	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE |
> +					      LANDLOCK_ACCESS_FS_IOCTL |
> +					      IOCTL_GROUPS;
>  	const struct landlock_ruleset *const dom = get_current_fs_domain();
>  
>  	if (!dom)
> @@ -1375,6 +1564,16 @@ static int hook_file_open(struct file *const file)
>  		}
>  	}
>  
> +	/*
> +	 * Named pipes should be treated just like anonymous pipes.
> +	 * Therefore, we permit all IOCTLs on them.
> +	 */
> +	if (S_ISFIFO(file_inode(file)->i_mode)) {
> +		allowed_access |= LANDLOCK_ACCESS_FS_IOCTL |
> +				  LANDLOCK_ACCESS_FS_IOCTL_RW |
> +				  LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
> +	}

This should not be required, cf. other thread.

> +
>  	/*
>  	 * For operations on already opened files (i.e. ftruncate()), it is the
>  	 * access rights at the time of open() which decide whether the
> @@ -1406,6 +1605,25 @@ static int hook_file_truncate(struct file *const file)
>  	return -EACCES;
>  }
>  
> +static int hook_file_ioctl(struct file *file, unsigned int cmd,
> +			   unsigned long arg)
> +{
> +	const access_mask_t required_access = get_required_ioctl_access(cmd);
> +	const access_mask_t allowed_access =
> +		landlock_file(file)->allowed_access;
> +
> +	/*
> +	 * It is the access rights at the time of opening the file which
> +	 * determine whether IOCTL can be used on the opened file later.
> +	 *
> +	 * The access right is attached to the opened file in hook_file_open().
> +	 */
> +	if ((allowed_access & required_access) == required_access)
> +		return 0;
> +
> +	return -EACCES;
> +}
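
The permission test itself is a plain subset check, which can be sketched in
userspace as:

```c
#include <assert.h>
#include <stdbool.h>

/* An ioctl is allowed iff every required bit was granted at open time.
 * A zero @required (e.g. FIOCLEX) is therefore always allowed. */
static bool ioctl_allowed(unsigned int allowed, unsigned int required)
{
	return (allowed & required) == required;
}
```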
> +
>  static struct security_hook_list landlock_hooks[] __ro_after_init = {
>  	LSM_HOOK_INIT(inode_free_security, hook_inode_free_security),
>  
> @@ -1428,6 +1646,7 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = {
>  	LSM_HOOK_INIT(file_alloc_security, hook_file_alloc_security),
>  	LSM_HOOK_INIT(file_open, hook_file_open),
>  	LSM_HOOK_INIT(file_truncate, hook_file_truncate),
> +	LSM_HOOK_INIT(file_ioctl, hook_file_ioctl),
>  };
>  
>  __init void landlock_add_fs_hooks(void)
> diff --git a/security/landlock/fs.h b/security/landlock/fs.h
> index 488e4813680a..086576b8386b 100644
> --- a/security/landlock/fs.h
> +++ b/security/landlock/fs.h
> @@ -92,4 +92,7 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
>  			    const struct path *const path,
>  			    access_mask_t access_hierarchy);
>  
> +__attribute_const__ access_mask_t
> +landlock_expand_handled_access_fs(const access_mask_t handled);
> +
>  #endif /* _SECURITY_LANDLOCK_FS_H */
> diff --git a/security/landlock/limits.h b/security/landlock/limits.h
> index 93c9c6f91556..ecbdc8bbf906 100644
> --- a/security/landlock/limits.h
> +++ b/security/landlock/limits.h
> @@ -18,7 +18,16 @@
>  #define LANDLOCK_MAX_NUM_LAYERS		16
>  #define LANDLOCK_MAX_NUM_RULES		U32_MAX
>  
> -#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_TRUNCATE
> +/*
> + * For file system access rights, Landlock distinguishes between the publicly
> + * visible access rights (1 to LANDLOCK_LAST_PUBLIC_ACCESS_FS) and the private
> + * ones which are not exposed to userspace (LANDLOCK_LAST_PUBLIC_ACCESS_FS + 1
> + * to LANDLOCK_LAST_ACCESS_FS).  The private access rights are defined in fs.c.
> + */
> +#define LANDLOCK_LAST_PUBLIC_ACCESS_FS	LANDLOCK_ACCESS_FS_IOCTL
> +#define LANDLOCK_MASK_PUBLIC_ACCESS_FS	((LANDLOCK_LAST_PUBLIC_ACCESS_FS << 1) - 1)
> +
> +#define LANDLOCK_LAST_ACCESS_FS		(LANDLOCK_LAST_PUBLIC_ACCESS_FS << 2)
>  #define LANDLOCK_MASK_ACCESS_FS		((LANDLOCK_LAST_ACCESS_FS << 1) - 1)
>  #define LANDLOCK_NUM_ACCESS_FS		__const_hweight64(LANDLOCK_MASK_ACCESS_FS)
>  #define LANDLOCK_SHIFT_ACCESS_FS	0
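
The new limit definitions can be sanity-checked with a quick userspace
computation (same formulas, userspace types): 16 public bits plus the two
synthetic groups give 18 used bits, which no longer fit in a u16, hence the
access_mask_t widening elsewhere in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define ACCESS_FS_IOCTL       (1ULL << 15) /* last public FS access right */
#define LAST_PUBLIC_ACCESS_FS ACCESS_FS_IOCTL
#define MASK_PUBLIC_ACCESS_FS ((LAST_PUBLIC_ACCESS_FS << 1) - 1)
#define LAST_ACCESS_FS        (LAST_PUBLIC_ACCESS_FS << 2)
#define MASK_ACCESS_FS        ((LAST_ACCESS_FS << 1) - 1)

/* Software popcount, standing in for __const_hweight64(). */
static int hweight64_sw(uint64_t v)
{
	int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}
```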
> diff --git a/security/landlock/ruleset.h b/security/landlock/ruleset.h
> index c7f1526784fd..5a28ea8e1c3d 100644
> --- a/security/landlock/ruleset.h
> +++ b/security/landlock/ruleset.h
> @@ -30,7 +30,7 @@
>  	LANDLOCK_ACCESS_FS_REFER)
>  /* clang-format on */
>  
> -typedef u16 access_mask_t;
> +typedef u32 access_mask_t;
>  /* Makes sure all filesystem access rights can be stored. */
>  static_assert(BITS_PER_TYPE(access_mask_t) >= LANDLOCK_NUM_ACCESS_FS);
>  /* Makes sure all network access rights can be stored. */
> diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> index 898358f57fa0..f0bc50003b46 100644
> --- a/security/landlock/syscalls.c
> +++ b/security/landlock/syscalls.c
> @@ -137,7 +137,7 @@ static const struct file_operations ruleset_fops = {
>  	.write = fop_dummy_write,
>  };
>  
> -#define LANDLOCK_ABI_VERSION 4
> +#define LANDLOCK_ABI_VERSION 5
>  
>  /**
>   * sys_landlock_create_ruleset - Create a new ruleset
> @@ -192,8 +192,8 @@ SYSCALL_DEFINE3(landlock_create_ruleset,
>  		return err;
>  
>  	/* Checks content (and 32-bits cast). */
> -	if ((ruleset_attr.handled_access_fs | LANDLOCK_MASK_ACCESS_FS) !=
> -	    LANDLOCK_MASK_ACCESS_FS)
> +	if ((ruleset_attr.handled_access_fs | LANDLOCK_MASK_PUBLIC_ACCESS_FS) !=
> +	    LANDLOCK_MASK_PUBLIC_ACCESS_FS)
>  		return -EINVAL;
>  
>  	/* Checks network content (and 32-bits cast). */
> @@ -201,6 +201,10 @@ SYSCALL_DEFINE3(landlock_create_ruleset,
>  	    LANDLOCK_MASK_ACCESS_NET)
>  		return -EINVAL;
>  
> +	/* Expands synthetic IOCTL groups. */
> +	ruleset_attr.handled_access_fs = landlock_expand_handled_access_fs(
> +		ruleset_attr.handled_access_fs);
> +
>  	/* Checks arguments and transforms to kernel struct. */
>  	ruleset = landlock_create_ruleset(ruleset_attr.handled_access_fs,
>  					  ruleset_attr.handled_access_net);
> @@ -309,8 +313,13 @@ static int add_rule_path_beneath(struct landlock_ruleset *const ruleset,
>  	if (!path_beneath_attr.allowed_access)
>  		return -ENOMSG;
>  
> -	/* Checks that allowed_access matches the @ruleset constraints. */
> -	mask = landlock_get_raw_fs_access_mask(ruleset, 0);
> +	/*
> +	 * Checks that allowed_access matches the @ruleset constraints and only
> +	 * consists of publicly visible access rights (as opposed to synthetic
> +	 * ones).
> +	 */
> +	mask = landlock_get_raw_fs_access_mask(ruleset, 0) &
> +	       LANDLOCK_MASK_PUBLIC_ACCESS_FS;
>  	if ((path_beneath_attr.allowed_access | mask) != mask)
>  		return -EINVAL;
>  
> diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
> index 646f778dfb1e..d292b419ccba 100644
> --- a/tools/testing/selftests/landlock/base_test.c
> +++ b/tools/testing/selftests/landlock/base_test.c
> @@ -75,7 +75,7 @@ TEST(abi_version)
>  	const struct landlock_ruleset_attr ruleset_attr = {
>  		.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
>  	};
> -	ASSERT_EQ(4, landlock_create_ruleset(NULL, 0,
> +	ASSERT_EQ(5, landlock_create_ruleset(NULL, 0,
>  					     LANDLOCK_CREATE_RULESET_VERSION));
>  
>  	ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0,
> diff --git a/tools/testing/selftests/landlock/fs_test.c b/tools/testing/selftests/landlock/fs_test.c
> index 2d6d9b43d958..3203f4a5bc85 100644
> --- a/tools/testing/selftests/landlock/fs_test.c
> +++ b/tools/testing/selftests/landlock/fs_test.c
> @@ -527,9 +527,10 @@ TEST_F_FORK(layout1, inval)
>  	LANDLOCK_ACCESS_FS_EXECUTE | \
>  	LANDLOCK_ACCESS_FS_WRITE_FILE | \
>  	LANDLOCK_ACCESS_FS_READ_FILE | \
> -	LANDLOCK_ACCESS_FS_TRUNCATE)
> +	LANDLOCK_ACCESS_FS_TRUNCATE | \
> +	LANDLOCK_ACCESS_FS_IOCTL)
>  
> -#define ACCESS_LAST LANDLOCK_ACCESS_FS_TRUNCATE
> +#define ACCESS_LAST LANDLOCK_ACCESS_FS_IOCTL
>  
>  #define ACCESS_ALL ( \
>  	ACCESS_FILE | \
> -- 
> 2.43.0.687.g38aa6559b0-goog
> 
> 

^ permalink raw reply	[relevance 0%]

* [PATCH v6 8/9] Introduce cpu_dcache_is_aliasing() across all architectures
  @ 2024-02-15 14:46  8% ` Mathieu Desnoyers
  0 siblings, 0 replies; 200+ results
From: Mathieu Desnoyers @ 2024-02-15 14:46 UTC (permalink / raw)
  To: Dan Williams, Arnd Bergmann, Dave Chinner
  Cc: linux-kernel, Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Russell King,
	linux-arch, linux-cxl, linux-fsdevel, linux-mm, linux-xfs,
	dm-devel, nvdimm, linux-s390

Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios depending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
             commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".
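
For example, a subsystem that is incompatible with aliasing data caches
(such as FS_DAX) could guard itself as follows. This is a userspace sketch of
the intended usage; in the kernel the definition comes from
<linux/cacheinfo.h> and the per-arch <asm/cachetype.h>:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace mock of the proposed interface. */
#ifdef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
#define cpu_dcache_is_aliasing()	true   /* e.g. arc, csky, m68k, parisc, sh */
#else
#define cpu_dcache_is_aliasing()	false  /* architectures in group C */
#endif

/* Refuse to enable a dcache-aliasing-incompatible feature. */
static int dax_supported_check(void)
{
	if (cpu_dcache_is_aliasing())
		return -1; /* would be -EOPNOTSUPP in the kernel */
	return 0;
}
```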

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/20030910210416.GA24258@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arch@vger.kernel.org
Cc: linux-cxl@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvdimm@lists.linux.dev
---
 arch/arc/Kconfig                    |  1 +
 arch/arc/include/asm/cachetype.h    |  9 +++++++++
 arch/arm/Kconfig                    |  1 +
 arch/arm/include/asm/cachetype.h    |  2 ++
 arch/csky/Kconfig                   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +++++++++
 arch/m68k/Kconfig                   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +++++++++
 arch/mips/Kconfig                   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +++++++++
 arch/nios2/Kconfig                  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++++++++++
 arch/parisc/Kconfig                 |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +++++++++
 arch/sh/Kconfig                     |  1 +
 arch/sh/include/asm/cachetype.h     |  9 +++++++++
 arch/sparc/Kconfig                  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++++++++++++++
 arch/xtensa/Kconfig                 |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++++++++++
 include/linux/cacheinfo.h           |  6 ++++++
 mm/Kconfig                          |  6 ++++++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
 	def_bool y
 	select ARC_TIMERS
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_CACHE_LINE_SIZE
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index 000000000000..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
 	select ARCH_32BIT_OFF_T
 	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
 	select ARCH_HAS_BINFMT_FLAT
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h
index e8c30430be33..b9dbe1d4c8fe 100644
--- a/arch/arm/include/asm/cachetype.h
+++ b/arch/arm/include/asm/cachetype.h
@@ -20,6 +20,8 @@ extern unsigned int cacheid;
 #define icache_is_vipt_aliasing()	cacheid_is(CACHEID_VIPT_I_ALIASING)
 #define icache_is_pipt()		cacheid_is(CACHEID_PIPT)
 
+#define cpu_dcache_is_aliasing()	(cache_is_vivt() || cache_is_vipt_aliasing())
+
 /*
  * __LINUX_ARM_ARCH__ is the minimum supported CPU architecture
  * Mask out support which will never be present on newer CPUs.
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index cf2a6fd7dff8..8a91eccf76dc 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -2,6 +2,7 @@
 config CSKY
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_DMA_PREP_COHERENT
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_SYNC_DMA_FOR_CPU
diff --git a/arch/csky/include/asm/cachetype.h b/arch/csky/include/asm/cachetype.h
new file mode 100644
index 000000000000..98cbe3af662f
--- /dev/null
+++ b/arch/csky/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_CSKY_CACHETYPE_H
+#define __ASM_CSKY_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 4b3e93cac723..a9c3e3de0c6d 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -3,6 +3,7 @@ config M68K
 	bool
 	default y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_BINFMT_FLAT
 	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
diff --git a/arch/m68k/include/asm/cachetype.h b/arch/m68k/include/asm/cachetype.h
new file mode 100644
index 000000000000..7fad5d9ab8fe
--- /dev/null
+++ b/arch/m68k/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_M68K_CACHETYPE_H
+#define __ASM_M68K_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 797ae590ebdb..ab1c8bd96666 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -4,6 +4,7 @@ config MIPS
 	default y
 	select ARCH_32BIT_OFF_T if !64BIT
 	select ARCH_BINFMT_ELF_STATE if MIPS_FP_SUPPORT
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_CPU_FINALIZE_INIT
 	select ARCH_HAS_CURRENT_STACK_POINTER if !CC_IS_CLANG || CLANG_VERSION >= 140000
 	select ARCH_HAS_DEBUG_VIRTUAL if !64BIT
diff --git a/arch/mips/include/asm/cachetype.h b/arch/mips/include/asm/cachetype.h
new file mode 100644
index 000000000000..9f4ba2fe1155
--- /dev/null
+++ b/arch/mips/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_MIPS_CACHETYPE_H
+#define __ASM_MIPS_CACHETYPE_H
+
+#include <asm/cpu-features.h>
+
+#define cpu_dcache_is_aliasing()	cpu_has_dc_aliases
+
+#endif
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index d54464021a61..760fb541ecd2 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -2,6 +2,7 @@
 config NIOS2
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_DMA_PREP_COHERENT
 	select ARCH_HAS_SYNC_DMA_FOR_CPU
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
diff --git a/arch/nios2/include/asm/cachetype.h b/arch/nios2/include/asm/cachetype.h
new file mode 100644
index 000000000000..eb9c416b8a1c
--- /dev/null
+++ b/arch/nios2/include/asm/cachetype.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_NIOS2_CACHETYPE_H
+#define __ASM_NIOS2_CACHETYPE_H
+
+#include <asm/page.h>
+#include <asm/cache.h>
+
+#define cpu_dcache_is_aliasing()	(NIOS2_DCACHE_SIZE > PAGE_SIZE)
+
+#endif
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index d14ccc948a29..0f25c227f74b 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -8,6 +8,7 @@ config PARISC
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_SYSCALL_TRACEPOINTS
 	select ARCH_WANT_FRAME_POINTERS
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_DMA_ALLOC if PA11
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/parisc/include/asm/cachetype.h b/arch/parisc/include/asm/cachetype.h
new file mode 100644
index 000000000000..e0868a1d3c47
--- /dev/null
+++ b/arch/parisc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_PARISC_CACHETYPE_H
+#define __ASM_PARISC_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 7500521b2b98..2ad3e29f0ebe 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -2,6 +2,7 @@
 config SUPERH
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && MMU
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if SPARSEMEM && MMU
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A)
diff --git a/arch/sh/include/asm/cachetype.h b/arch/sh/include/asm/cachetype.h
new file mode 100644
index 000000000000..a5fffe536068
--- /dev/null
+++ b/arch/sh/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_SH_CACHETYPE_H
+#define __ASM_SH_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 49849790e66d..5ba627da15d7 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -13,6 +13,7 @@ config 64BIT
 config SPARC
 	bool
 	default y
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_MIGHT_HAVE_PC_PARPORT if SPARC64 && PCI
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select DMA_OPS
diff --git a/arch/sparc/include/asm/cachetype.h b/arch/sparc/include/asm/cachetype.h
new file mode 100644
index 000000000000..caf1c0045892
--- /dev/null
+++ b/arch/sparc/include/asm/cachetype.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_SPARC_CACHETYPE_H
+#define __ASM_SPARC_CACHETYPE_H
+
+#include <asm/page.h>
+
+#ifdef CONFIG_SPARC32
+extern int vac_cache_size;
+#define cpu_dcache_is_aliasing()	(vac_cache_size > PAGE_SIZE)
+#else
+#define cpu_dcache_is_aliasing()	(L1DCACHE_SIZE > PAGE_SIZE)
+#endif
+
+#endif
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 7d792077e5fd..2dfde54d1a84 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -2,6 +2,7 @@
 config XTENSA
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_BINFMT_FLAT if !MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VM_PGTABLE
diff --git a/arch/xtensa/include/asm/cachetype.h b/arch/xtensa/include/asm/cachetype.h
new file mode 100644
index 000000000000..51bd49e2a1c5
--- /dev/null
+++ b/arch/xtensa/include/asm/cachetype.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_XTENSA_CACHETYPE_H
+#define __ASM_XTENSA_CACHETYPE_H
+
+#include <asm/cache.h>
+#include <asm/page.h>
+
+#define cpu_dcache_is_aliasing()	(DCACHE_WAY_SIZE > PAGE_SIZE)
+
+#endif
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index d504eb4b49ab..2cb15fe4fe12 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -138,4 +138,10 @@ static inline int get_cpu_cacheinfo_id(int cpu, int level)
 #define use_arch_cache_info()	(false)
 #endif
 
+#ifndef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
+#define cpu_dcache_is_aliasing()	false
+#else
+#include <asm/cachetype.h>
+#endif
+
 #endif /* _LINUX_CACHEINFO_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 57cd378c73d6..db09c9ad15c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1016,6 +1016,12 @@ config IDLE_PAGE_TRACKING
 	  See Documentation/admin-guide/mm/idle_page_tracking.rst for
 	  more details.
 
+# Architectures which implement cpu_dcache_is_aliasing() to query
+# whether the data caches are aliased (VIVT or VIPT with dcache
+# aliasing) need to select this.
+config ARCH_HAS_CPU_CACHE_ALIASING
+	bool
+
 config ARCH_HAS_CACHE_LINE_SIZE
 	bool
 
-- 
2.39.2


^ permalink raw reply related	[relevance 8%]

* Re: [PATCH v3 10/15] block: Add fops atomic write support
  2024-02-14  9:38  5%         ` Nilay Shroff
@ 2024-02-14 11:29  0%           ` John Garry
  0 siblings, 0 replies; 200+ results
From: John Garry @ 2024-02-14 11:29 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: axboe, brauner, bvanassche, dchinner, djwong, hch, jack, jbongio,
	jejb, kbusch, linux-block, linux-fsdevel, linux-kernel,
	linux-nvme, linux-scsi, linux-xfs, martin.petersen, ming.lei,
	ojaswin, sagi, tytso, viro

On 14/02/2024 09:38, Nilay Shroff wrote:
> 
> 
> On 2/13/24 17:22, John Garry wrote:
>> On 13/02/2024 11:08, Nilay Shroff wrote:
>>>> It's relied that atomic_write_unit_max is <= atomic_write_boundary and both are a power-of-2. Please see the NVMe patch, which this is checked. Indeed, it would not make sense if atomic_write_unit_max > atomic_write_boundary (when non-zero).
>>>>
>>>> So if the write is naturally aligned and its size is <= atomic_write_unit_max, then it cannot be straddling a boundary.
>>> Ok fine but in case the device doesn't support namespace atomic boundary size (i.e. NABSPF is zero) then still do we need
>>> to restrict IO which crosses the atomic boundary?
>>
>> Is there a boundary if NABSPF is zero?
> If NABSPF is zero then there's no boundary, and so we may not need to worry about an IO crossing a boundary.
> 
> Even though the atomic boundary is not defined, this function doesn't allow an atomic write crossing atomic_write_unit_max_bytes.
> For instance, if AWUPF is 63 and an IO starts an atomic write from logical block #32 and the number of logical blocks to be written

When you say "IO", you need to be clearer. Do you mean a write from 
userspace or a merged atomic write?

If userspace issues an atomic write which is 64 blocks at offset 32, 
then it will be rejected.

It will be rejected as it is not naturally aligned, e.g. a 64-block 
write can only start at offset 0, 64, 128, and so on.
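
The natural-alignment rule boils down to this (sketch, in units of logical
blocks):

```c
#include <assert.h>
#include <stdbool.h>

/* A write of @len blocks at @offset blocks is naturally aligned iff
 * @len is a power of two and @offset is a multiple of @len. */
static bool naturally_aligned(unsigned long long offset,
			      unsigned long long len)
{
	if (!len || (len & (len - 1)))
		return false;
	return (offset % len) == 0;
}
```

So 64 blocks at offset 32 fails, and 16 blocks at offset 8 (your later
example) fails for the same reason.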

> in this IO equals 64, then it's not allowed.
> However, if this same IO starts from logical block #0, then it's allowed.
> So my point here is: can this restriction be avoided when the atomic boundary is zero (or not defined)?

We want a consistent set of rules for userspace to follow, whether the 
atomic boundary is zero or non-zero.

Currently the atomic boundary only comes into play for merging writes, 
i.e. we cannot merge a write in which the resultant IO straddles a boundary.

> 
> Also, it seems that the restrictions implemented for an atomic write to succeed are very strict. For example, an atomic write can't
> succeed if an IO starts from logical block #8 and the number of logical blocks to be written in this IO equals 16.
> In this particular case, the IO is well within the atomic boundary (if it's defined) and atomic size limit, so why do we NOT want to
> allow it? Is it intentional? I think the spec doesn't mention such a limitation.

According to the NVMe spec, this is ok. However we don't want the user 
to have to deal with things like NVMe boundaries. Indeed, for FSes, we 
do not have a direct linear map from FS blocks to physical blocks, so it 
would be impossible for the user to know about a boundary condition in 
this context.

We are trying to formulate rules which work for the somewhat orthogonal 
HW features of both SCSI and NVMe for both block devices and FSes, while 
also dealing with alignment concerns of extent-based FSes, like XFS.

> 
>>
>>>
>>> I am quoting this from NVMe spec (Command Set Specification, revision 1.0a, Section 2.1.4.3) :
>>> "To ensure backwards compatibility, the values reported for AWUN, AWUPF, and ACWU shall be set such that
>>> they  are  supported  even  if  a  write  crosses  an  atomic  boundary.  If  a  controller  does  not
>>> guarantee atomicity across atomic boundaries, the controller shall set AWUN, AWUPF, and ACWU to 0h (1 LBA)."
>>
>> How about respond to the NVMe patch in this series, asking this question?
>>
> Yes I will send this query to the NVMe patch in this series.

Thanks,
John


^ permalink raw reply	[relevance 0%]

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-13 23:41  0%                   ` Dave Chinner
@ 2024-02-14 11:06  0%                     ` John Garry
  0 siblings, 0 replies; 200+ results
From: John Garry @ 2024-02-14 11:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin


> 
>> Setting the rtvol extsize at mkfs time or enabling atomic writes
>> FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
>> terms of atomic writes.
> 
> Which is wrong. mkfs.xfs gets physical information about the volume
> from the kernel and builds the filesystem according to that
> information. That's how we do stripe alignment, sector sizing, etc.
> Atomic write support and setting up alignment constraints should be
> no different.

Yes, I was just looking at adding a mkfs option to format for atomic 
writes, which would check physical information about the volume and 
whether it suits rtextsize and then subsequently also set 
XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

> 
> Yes, mkfs allows the user to override the hardware configs it
> probes, but it also warns when the override is doing something
> sub-optimal (like aligning all AG headers to the same disk in a
> stripe).
> 
> IOWs, mkfs should be pulling this atomic write info from the
> hardware and configuring the filesystem around that information.
> That's the target we should be aiming the kernel implementation at
> and optimising for - a filesystem that is correctly configured
> according to published hardware capability.

Right

So, for example, if the atomic writes option is set and the rtextsize 
set by the user is so much larger than what HW can support in terms of 
atomic writes, then we should let the user know about this.

> 
> Everything else is in the "make it behave correctly, but we don't
> care if it's slow" category.
> 
>> This check is not done as it is not fixed what the bdev can do in terms of
>> atomic writes - or, more specifically, what the request_queue reports is
>> not fixed. There are things which can change this. For example, a FW
>> update could change all the atomic write capabilities of a disk. Or even if
>> we swapped a SCSI disk into another host the atomic write limits may change,
>> as the atomic write unit max depends on the SCSI HBA DMA limits. Changing
>> BIO_MAX_VECS - which could come from a kernel update - could also change
>> what we report as atomic write limit in the request queue.
> 
> If that sort of thing happens, then that's too bad. We already have
> these sorts of "do not do if you care about performance"
> constraints. e.g. don't do a RAID restripe that changes the
> alignment/size of the RAID device (e.g. add a single disk and make
> the stripe width wider) because the physical filesystem structure
> will no longer be aligned to the underlying hardware. Instead, you
> have to grow striped volumes with compatible stripes in compatible
> sizes to ensure the filesystem remains aligned to the storage...
> 
> We don't try to cater for every single possible permutation of
> storage hardware configurations - that way lies madness. Design and
> optimise for the common case of correctly configured and well
> behaved storage, and everything else we just don't care about beyond
> "don't corrupt or lose data".

ok

> 
>>>>> And therein lies the problem.
>>>>>

...

>>
>> That sounds fine.
>>
>> My question then is how we determine this max atomic write size granularity.
>>
>> We don't explicitly tell the FS what atomic write size we want for a file.
>> Rather we mkfs with some extsize value which should match our atomic write
>> maximal value and then tell the FS we want to do atomic writes on a file,
>> and if this is accepted then we can query the atomic write min and max unit
>> size, and this would be [FS block size, min(bdev atomic write limit,
>> rtextsize)].
>>
>> If rtextsize is 16KB, then we have a good idea that we want 16KB atomic
>> writes support. So then we could use rtextsize as this max atomic write
>> size.
> 
> Maybe, but I think continuing to focus on this as 'atomic writes
> requires' is wrong.
> 
> The filesystem does not care about atomic writes. What it cares
> about is the allocation constraints that need to be implemented.
> That constraint is that all BMBT extent operations need to be
> aligned to a specific extent size, not filesystem blocks.
> 
> The current extent size hint (and rtextsize) only impact the
> -allocation of extents-. They are not directly placing constraints
> on the BMBT layout, they are placing constraints on the free space
> search that the allocator runs on the BNO/CNT btrees to select an
> extent that is then inserted into the BMBT.
> 
> The problem is that unwritten extent conversion, truncate, hole
> punching, etc also all need to be correctly aligned for files that
> are configured to support atomic writes. These operations place
> constraints on how the BMBT can modify the existing extent list.
> 
> These are different constraints to what rtextsize/extszhint apply,
> and that's the fundamental behavioural difference between existing
> extent size hint behaviour and the behaviour needed by atomic
> writes.
> 
>> But I am not 100% sure that this is your idea (apologies if I am wrong - I
>> am sincerely trying to follow your idea), but rather it would be
>> min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and bdev
>> atomic write limit is 16KB, then there is not much point in dealing in 1MB
>> blocks for this unwritten extent conversion alignment.
> 
> Exactly my point - there really is no relationship between rtextsize
> and atomic write constraints and that it is a mistake to use
> rtextsize as it stands as a placeholder for atomic write
> constraints.
> 

ok

>> If so, then my
>> concern is that the bdev atomic write upper limit is not fixed. This can be
>> solved, but I would still like to be clear on this max atomic write size.
> 
> Right, we haven't clearly defined how XFS is supposed to align BMBT
> operations in a way that is compatible for atomic write operations.
> 
> What the patchset does is try to extend and infer things from
> existing allocation alignment constraints, but then not apply those
> constraints to critical extent state operations (pure BMBT
> modifications) that atomic writes also need constrained to work
> correctly and efficiently.

Right. I also mentioned previously that we could explicitly request 
the atomic write size per-inode, but a drawback is that this would 
require an on-disk format change.

> 
>>> i.e. atomic writes need to use max write size granularity for all IO
>>> operations, not filesystem block granularity.
>>>
>>> And that also means things like rtextsize and extsize hints need to
>>> match these atomic write requirements, too....
>>
>> As above, I am not 100% sure if you mean these to be the atomic write
>> maximal value.
> 
> Yes, they either need to be the same as the atomic write max value
> or, much better, once a hint has been set, then resultant size is
> then aligned up to be compatible with the atomic write constraints.
> 
> e.g. set an inode extent size hint of 960kB on a device with 128kB
> atomic write capability. If the inode has the atomic write flag set,
> then allocations need to round the extent size up from 960kB to 1MB
> so that the BMBT extent layout and alignment is compatible with 128kB
> atomic writes.
> 
> At this point, zeroing, punching, unwritten extent conversion, etc
> also need to be done on aligned 128kB ranges to be compatible with
> atomic writes, rather than filesystem block boundaries that would
> normally be used if just the extent size hint was set.
> 
> This is effectively what the original "force align" inode flag bit
> did - it told the inode that all BMBT manipulations need to be
> extent size hint aligned, not just allocation. It's a generic flag
> that implements specific extent manipulation constraints that happen
> to be compatible with atomic writes when the right extent size hint
> is set.
> 
> So at this point, I'm not sure that we should have an "atomic
> writes" flag in the inode. 

Another motivation for this flag is that we can explicitly enable some 
software-based atomic write support for an inode when the backing device 
does not have HW support.

In addition, in this series setting FS_XFLAG_ATOMICWRITES means 
XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something 
similar for other OSes, and for those other OSes it may also mean some 
other special alignment feature enabled. We want a consistent user 
experience.

> We need to tell BMBT modifications
> to behave in a particular way - forced alignment - not that atomic
> writes are being used in the filesystem....
> 
> At this point, the filesystem can do all the extent modification
> alignment stuff that atomic writes require without caring if the
> block device supports atomic writes or even if the
> application is using atomic writes.
> 
> This means we can test the BMBT functionality in fstests without
> requiring hardware (or emulation) that supports atomic writes - all
> we need to do is set the forced align flag, an extent size hint and
> go run fsx on it...
> 

The current idea was that the forcealign feature would be required for 
the second phase of atomic write support - non-rtvol support. Darrick 
did send that series out separately over the New Year's break.

I think that you wanted to progress the following series first:
https://lore.kernel.org/linux-xfs/20231004001943.349265-1-david@fromorbit.com/

Right?

Thanks,
John




^ permalink raw reply	[relevance 0%]

* Re: [PATCH v3 10/15] block: Add fops atomic write support
  @ 2024-02-14  9:38  5%         ` Nilay Shroff
  2024-02-14 11:29  0%           ` John Garry
  0 siblings, 1 reply; 200+ results
From: Nilay Shroff @ 2024-02-14  9:38 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, bvanassche, dchinner, djwong, hch, jack, jbongio,
	jejb, kbusch, linux-block, linux-fsdevel, linux-kernel,
	linux-nvme, linux-scsi, linux-xfs, martin.petersen, ming.lei,
	ojaswin, sagi, tytso, viro



On 2/13/24 17:22, John Garry wrote:
> On 13/02/2024 11:08, Nilay Shroff wrote:
>>> It is relied upon that atomic_write_unit_max is <= atomic_write_boundary and both are a power-of-2. Please see the NVMe patch, in which this is checked. Indeed, it would not make sense if atomic_write_unit_max > atomic_write_boundary (when non-zero).
>>>
>>> So if the write is naturally aligned and its size is <= atomic_write_unit_max, then it cannot be straddling a boundary.
>> Ok, fine, but if the device doesn't support a namespace atomic boundary size (i.e. NABSPF is zero), do we still need
>> to restrict IO which crosses the atomic boundary?
> 
> Is there a boundary if NABSPF is zero?
If NABSPF is zero then there's no boundary, and so we may not need to worry about IO crossing a boundary.

Even when the atomic boundary is not defined, this function doesn't allow an atomic write to cross atomic_write_unit_max_bytes.
For instance, if AWUPF is 63 and an IO starts an atomic write from logical block #32 and the number of logical blocks to be written
in this IO equals 64, then it's not allowed. However, if this same IO starts from logical block #0, then it's allowed.
So my point here is: can this restriction be avoided when the atomic boundary is zero (or not defined)?

Also, it seems that the restrictions implemented for an atomic write to succeed are very strict. For example, an atomic write can't
succeed if an IO starts from logical block #8 and the number of logical blocks to be written in this IO equals 16.
In this particular case, the IO is well within the atomic boundary (if it's defined) and atomic size limit, so why do we NOT want to
allow it? Is it intentional? I think the spec doesn't mention such a limitation.

> 
>>
>> I am quoting this from NVMe spec (Command Set Specification, revision 1.0a, Section 2.1.4.3) :
>> "To ensure backwards compatibility, the values reported for AWUN, AWUPF, and ACWU shall be set such that
>> they  are  supported  even  if  a  write  crosses  an  atomic  boundary.  If  a  controller  does  not
>> guarantee atomicity across atomic boundaries, the controller shall set AWUN, AWUPF, and ACWU to 0h (1 LBA)."
> 
> How about respond to the NVMe patch in this series, asking this question?
> 
Yes I will send this query to the NVMe patch in this series.

Thanks,
--Nilay

^ permalink raw reply	[relevance 5%]

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-09 12:47  4%                 ` John Garry
@ 2024-02-13 23:41  0%                   ` Dave Chinner
  2024-02-14 11:06  0%                     ` John Garry
  0 siblings, 1 reply; 200+ results
From: Dave Chinner @ 2024-02-13 23:41 UTC (permalink / raw)
  To: John Garry
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Fri, Feb 09, 2024 at 12:47:38PM +0000, John Garry wrote:
> > > > Why should we jump through crazy hoops to try to make filesystems
> > > > optimised for large IOs with mismatched, overlapping small atomic
> > > > writes?
> > > 
> > > As mentioned, typically the atomic writes will be the same size, but we may
> > > have other writes of smaller size.
> > 
> > Then we need the tiny write to allocate and zero according to the
> > maximum sized atomic write bounds. Then we just don't care about
> > large atomic IO overlapping small IO, because the extent on disk
> > aligned to the large atomic IO is then always guaranteed to be the
> > correct size and shape.
> 
> I think it's worth mentioning that there is currently a separation between
> how we configure the FS extent size for atomic writes and what the bdev can
> actually support in terms of atomic writes.

And that's part of what is causing all the issues here - we're
trying to jump through hoops at the fs level to handle cases that
the device doesn't support and vice versa.

> Setting the rtvol extsize at mkfs time or enabling atomic writes
> FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
> terms of atomic writes.

Which is wrong. mkfs.xfs gets physical information about the volume
from the kernel and builds the filesystem according to that
information. That's how we do stripe alignment, sector sizing, etc.
Atomic write support and setting up alignment constraints should be
no different.

Yes, mkfs allows the user to override the hardware configs it
probes, but it also warns when the override is doing something
sub-optimal (like aligning all AG headers to the same disk in a
stripe).

IOWs, mkfs should be pulling this atomic write info from the
hardware and configuring the filesystem around that information.
That's the target we should be aiming the kernel implementation at
and optimising for - a filesystem that is correctly configured
according to published hardware capability.

Everything else is in the "make it behave correctly, but we don't
care if it's slow" category.

> This check is not done as it is not fixed what the bdev can do in terms of
> atomic writes - or, more specifically, what the request_queue reports is
> not fixed. There are things which can change this. For example, a FW
> update could change all the atomic write capabilities of a disk. Or even if
> we swapped a SCSI disk into another host the atomic write limits may change,
> as the atomic write unit max depends on the SCSI HBA DMA limits. Changing
> BIO_MAX_VECS - which could come from a kernel update - could also change
> what we report as atomic write limit in the request queue.

If that sort of thing happens, then that's too bad. We already have
these sorts of "do not do if you care about performance"
constraints. e.g. don't do a RAID restripe that changes the
alignment/size of the RAID device (e.g. add a single disk and make
the stripe width wider) because the physical filesystem structure
will no longer be aligned to the underlying hardware. Instead, you
have to grow striped volumes with compatible stripes in compatible
sizes to ensure the filesystem remains aligned to the storage...

We don't try to cater for every single possible permutation of
storage hardware configurations - that way lies madness. Design and
optimise for the common case of correctly configured and well
behaved storage, and everything else we just don't care about beyond
"don't corrupt or lose data".

> > > > And therein lies the problem.
> > > > 
> > > > If you are doing sub-rtextent IO at all, then you are forcing the
> > > > filesystem down the path of explicitly using unwritten extents and
> > > > requiring O_DSYNC direct IO to do journal flushes in IO completion
> > > > context and then performance just goes down hill from them.
> > > > 
> > > > The requirement for unwritten extents to track sub-rtextsize written
> > > > regions is what you're trying to work around with XFS_BMAPI_ZERO so
> > > > that atomic writes will always see "atomic write aligned" allocated
> > > > regions.
> > > > 
> > > > Do you see the problem here? You've explicitly told the filesystem
> > > > that allocation is aligned to 64kB chunks, then because the
> > > > filesystem block size is 4kB, it's allowed to track unwritten
> > > > regions at 4kB boundaries. Then you do 4kB aligned file IO, which
> > > > then changes unwritten extents at 4kB boundaries. Then you do an
> > > > overlapping 16kB IO that *requires* 16kB allocation alignment, and
> > > > things go BOOM.
> > > > 
> > > > Yes, they should go BOOM.
> > > > 
> > > > This is a horrible configuration - it is incompatible with 16kB
> > > > aligned and sized atomic IO.
> > > 
> > > Just because the DB may do 16KB atomic writes most of the time should not
> > > disallow it from any other form of writes.
> > 
> > That's not what I said. I said that using sub-rtextsize atomic writes
> > with single FSB unwritten extent tracking is horrible and
> > incompatible with doing 16kB atomic writes.
> > 
> > This setup will not work at all well with your patches and should go
> > BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
> > has uncoordinated extent allocation and unwritten conversion
> > granularity.
> > 
> > That's the fundamental design problem with your approach - it allows
> > unwritten conversion at *minimum IO sizes* and that does not work
> > with atomic IOs with larger alignment requirements.
> > 
> > The fundamental design principle is this: for maximally sized atomic
> > writes to always succeed we require every allocation, zeroing and
> > unwritten conversion operation to use alignments and sizes that are
> > compatible with the maximum atomic write sizes being used.
> > 
> 
> That sounds fine.
> 
> My question then is how we determine this max atomic write size granularity.
> 
> We don't explicitly tell the FS what atomic write size we want for a file.
> Rather we mkfs with some extsize value which should match our atomic write
> maximal value and then tell the FS we want to do atomic writes on a file,
> and if this is accepted then we can query the atomic write min and max unit
> size, and this would be [FS block size, min(bdev atomic write limit,
> rtextsize)].
> 
> If rtextsize is 16KB, then we have a good idea that we want 16KB atomic
> writes support. So then we could use rtextsize as this max atomic write
> size.

Maybe, but I think continuing to focus on this as 'atomic writes
requires' is wrong.

The filesystem does not care about atomic writes. What it cares
about is the allocation constraints that need to be implemented.
That constraint is that all BMBT extent operations need to be
aligned to a specific extent size, not filesystem blocks.

The current extent size hint (and rtextsize) only impact the
-allocation of extents-. They are not directly placing constraints
on the BMBT layout, they are placing constraints on the free space
search that the allocator runs on the BNO/CNT btrees to select an
extent that is then inserted into the BMBT.

The problem is that unwritten extent conversion, truncate, hole
punching, etc also all need to be correctly aligned for files that
are configured to support atomic writes. These operations place
constraints on how the BMBT can modify the existing extent list.

These are different constraints to what rtextsize/extszhint apply,
and that's the fundamental behavioural difference between existing
extent size hint behaviour and the behaviour needed by atomic
writes.

> But I am not 100% sure that this is your idea (apologies if I am wrong - I
> am sincerely trying to follow your idea), but rather it would be
> min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and bdev
> atomic write limit is 16KB, then there is not much point in dealing in 1MB
> blocks for this unwritten extent conversion alignment.

Exactly my point - there really is no relationship between rtextsize
and atomic write constraints and that it is a mistake to use
rtextsize as it stands as a placeholder for atomic write
constraints.

> If so, then my
> concern is that the bdev atomic write upper limit is not fixed. This can be
> solved, but I would still like to be clear on this max atomic write size.

Right, we haven't clearly defined how XFS is supposed to align BMBT
operations in a way that is compatible for atomic write operations.

What the patchset does is try to extend and infer things from
existing allocation alignment constraints, but then not apply those
constraints to critical extent state operations (pure BMBT
modifications) that atomic writes also need constrained to work
correctly and efficiently.

> > i.e. atomic writes need to use max write size granularity for all IO
> > operations, not filesystem block granularity.
> > 
> > And that also means things like rtextsize and extsize hints need to
> > match these atomic write requirements, too....
> 
> As above, I am not 100% sure if you mean these to be the atomic write
> maximal value.

Yes, they either need to be the same as the atomic write max value
or, much better, once a hint has been set, then resultant size is
then aligned up to be compatible with the atomic write constraints.

e.g. set an inode extent size hint of 960kB on a device with 128kB
atomic write capability. If the inode has the atomic write flag set,
then allocations need to round the extent size up from 960kB to 1MB
so that the BMBT extent layout and alignment is compatible with 128kB
atomic writes.

At this point, zeroing, punching, unwritten extent conversion, etc
also need to be done on aligned 128kB ranges to be compatible with
atomic writes, rather than filesystem block boundaries that would
normally be used if just the extent size hint was set.

This is effectively what the original "force align" inode flag bit
did - it told the inode that all BMBT manipulations need to be
extent size hint aligned, not just allocation. It's a generic flag
that implements specific extent manipulation constraints that happen
to be compatible with atomic writes when the right extent size hint
is set.

So at this point, I'm not sure that we should have an "atomic
writes" flag in the inode. We need to tell BMBT modifications
to behave in a particular way - forced alignment - not that atomic
writes are being used in the filesystem....

At this point, the filesystem can do all the extent modification
alignment stuff that atomic writes require without caring if the
block device supports atomic writes or even if the
application is using atomic writes.

This means we can test the BMBT functionality in fstests without
requiring hardware (or emulation) that supports atomic writes - all
we need to do is set the forced align flag, an extent size hint and
go run fsx on it...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag
  @ 2024-02-13 17:08  0%       ` Darrick J. Wong
  0 siblings, 0 replies; 200+ results
From: Darrick J. Wong @ 2024-02-13 17:08 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Mon, Feb 05, 2024 at 12:58:30PM +0000, John Garry wrote:
> On 02/02/2024 17:57, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:41PM +0000, John Garry wrote:
> > > Add a flag indicating that a regular file is enabled for atomic writes.
> > 
> > This is a file attribute that mirrors an ondisk inode flag.  Actual
> > support for untorn file writes (for now) depends on both the iflag and
> > the underlying storage devices, which we can only really check at statx
> > and pwrite time.  This is the same story as FS_XFLAG_DAX, which signals
> > to the fs that we should try to enable the fsdax IO path on the file
> > (instead of the regular page cache), but applications have to query
> > STATX_ATTR_DAX to find out if they really got that IO path.
> 
> To be clear, are you suggesting to add this info to the commit message?

That and an S_ATOMICW flag for the inode that triggers the proposed
STATX_ATTR_ATOMICWRITES flag.

> > "try to enable atomic writes", perhaps? >
> > (and the comment for FS_XFLAG_DAX ought to read "try to use DAX for IO")
> 
> To me that sounds like "try to use DAX for IO, and, if not possible, fall
> back on some other method" - is that the reality of what that flag does?

As hch said, yes.

--D

> Thanks,
> John
> 
> > 
> > --D
> > 
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   include/uapi/linux/fs.h | 1 +
> > >   1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index a0975ae81e64..b5b4e1db9576 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -140,6 +140,7 @@ struct fsxattr {
> > >   #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
> > >   #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
> > >   #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
> > > +#define FS_XFLAG_ATOMICWRITES	0x00020000	/* atomic writes enabled */
> > >   #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
> > >   /* the read-only stuff doesn't really belong here, but any other place is
> > > -- 
> > > 2.31.1
> > > 
> > > 
> 
> 

^ permalink raw reply	[relevance 0%]

* [PATCH v5 7/8] Introduce cpu_dcache_is_aliasing() across all architectures
  @ 2024-02-12 16:31  8% ` Mathieu Desnoyers
  0 siblings, 0 replies; 200+ results
From: Mathieu Desnoyers @ 2024-02-12 16:31 UTC (permalink / raw)
  To: Dan Williams, Arnd Bergmann, Dave Chinner
  Cc: linux-kernel, Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Russell King,
	linux-arch, linux-cxl, linux-fsdevel, linux-mm, linux-xfs,
	dm-devel, nvdimm, linux-s390

Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios depending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
             commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/20030910210416.GA24258@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arch@vger.kernel.org
Cc: linux-cxl@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvdimm@lists.linux.dev
---
 arch/arc/Kconfig                    |  1 +
 arch/arc/include/asm/cachetype.h    |  9 +++++++++
 arch/arm/Kconfig                    |  1 +
 arch/arm/include/asm/cachetype.h    |  2 ++
 arch/csky/Kconfig                   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +++++++++
 arch/m68k/Kconfig                   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +++++++++
 arch/mips/Kconfig                   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +++++++++
 arch/nios2/Kconfig                  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++++++++++
 arch/parisc/Kconfig                 |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +++++++++
 arch/sh/Kconfig                     |  1 +
 arch/sh/include/asm/cachetype.h     |  9 +++++++++
 arch/sparc/Kconfig                  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++++++++++++++
 arch/xtensa/Kconfig                 |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++++++++++
 include/linux/cacheinfo.h           |  6 ++++++
 mm/Kconfig                          |  6 ++++++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
 	def_bool y
 	select ARC_TIMERS
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_CACHE_LINE_SIZE
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index 000000000000..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
 	select ARCH_32BIT_OFF_T
 	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
 	select ARCH_HAS_BINFMT_FLAT
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h
index e8c30430be33..b9dbe1d4c8fe 100644
--- a/arch/arm/include/asm/cachetype.h
+++ b/arch/arm/include/asm/cachetype.h
@@ -20,6 +20,8 @@ extern unsigned int cacheid;
 #define icache_is_vipt_aliasing()	cacheid_is(CACHEID_VIPT_I_ALIASING)
 #define icache_is_pipt()		cacheid_is(CACHEID_PIPT)
 
+#define cpu_dcache_is_aliasing()	(cache_is_vivt() || cache_is_vipt_aliasing())
+
 /*
  * __LINUX_ARM_ARCH__ is the minimum supported CPU architecture
  * Mask out support which will never be present on newer CPUs.
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index cf2a6fd7dff8..8a91eccf76dc 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -2,6 +2,7 @@
 config CSKY
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_DMA_PREP_COHERENT
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_SYNC_DMA_FOR_CPU
diff --git a/arch/csky/include/asm/cachetype.h b/arch/csky/include/asm/cachetype.h
new file mode 100644
index 000000000000..98cbe3af662f
--- /dev/null
+++ b/arch/csky/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_CSKY_CACHETYPE_H
+#define __ASM_CSKY_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 4b3e93cac723..a9c3e3de0c6d 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -3,6 +3,7 @@ config M68K
 	bool
 	default y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_BINFMT_FLAT
 	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
diff --git a/arch/m68k/include/asm/cachetype.h b/arch/m68k/include/asm/cachetype.h
new file mode 100644
index 000000000000..7fad5d9ab8fe
--- /dev/null
+++ b/arch/m68k/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_M68K_CACHETYPE_H
+#define __ASM_M68K_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 797ae590ebdb..ab1c8bd96666 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -4,6 +4,7 @@ config MIPS
 	default y
 	select ARCH_32BIT_OFF_T if !64BIT
 	select ARCH_BINFMT_ELF_STATE if MIPS_FP_SUPPORT
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_CPU_FINALIZE_INIT
 	select ARCH_HAS_CURRENT_STACK_POINTER if !CC_IS_CLANG || CLANG_VERSION >= 140000
 	select ARCH_HAS_DEBUG_VIRTUAL if !64BIT
diff --git a/arch/mips/include/asm/cachetype.h b/arch/mips/include/asm/cachetype.h
new file mode 100644
index 000000000000..9f4ba2fe1155
--- /dev/null
+++ b/arch/mips/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_MIPS_CACHETYPE_H
+#define __ASM_MIPS_CACHETYPE_H
+
+#include <asm/cpu-features.h>
+
+#define cpu_dcache_is_aliasing()	cpu_has_dc_aliases
+
+#endif
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index d54464021a61..760fb541ecd2 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -2,6 +2,7 @@
 config NIOS2
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_DMA_PREP_COHERENT
 	select ARCH_HAS_SYNC_DMA_FOR_CPU
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
diff --git a/arch/nios2/include/asm/cachetype.h b/arch/nios2/include/asm/cachetype.h
new file mode 100644
index 000000000000..eb9c416b8a1c
--- /dev/null
+++ b/arch/nios2/include/asm/cachetype.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_NIOS2_CACHETYPE_H
+#define __ASM_NIOS2_CACHETYPE_H
+
+#include <asm/page.h>
+#include <asm/cache.h>
+
+#define cpu_dcache_is_aliasing()	(NIOS2_DCACHE_SIZE > PAGE_SIZE)
+
+#endif
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index d14ccc948a29..0f25c227f74b 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -8,6 +8,7 @@ config PARISC
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_SYSCALL_TRACEPOINTS
 	select ARCH_WANT_FRAME_POINTERS
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_DMA_ALLOC if PA11
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/parisc/include/asm/cachetype.h b/arch/parisc/include/asm/cachetype.h
new file mode 100644
index 000000000000..e0868a1d3c47
--- /dev/null
+++ b/arch/parisc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_PARISC_CACHETYPE_H
+#define __ASM_PARISC_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 7500521b2b98..2ad3e29f0ebe 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -2,6 +2,7 @@
 config SUPERH
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && MMU
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if SPARSEMEM && MMU
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A)
diff --git a/arch/sh/include/asm/cachetype.h b/arch/sh/include/asm/cachetype.h
new file mode 100644
index 000000000000..a5fffe536068
--- /dev/null
+++ b/arch/sh/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_SH_CACHETYPE_H
+#define __ASM_SH_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 49849790e66d..5ba627da15d7 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -13,6 +13,7 @@ config 64BIT
 config SPARC
 	bool
 	default y
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_MIGHT_HAVE_PC_PARPORT if SPARC64 && PCI
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select DMA_OPS
diff --git a/arch/sparc/include/asm/cachetype.h b/arch/sparc/include/asm/cachetype.h
new file mode 100644
index 000000000000..caf1c0045892
--- /dev/null
+++ b/arch/sparc/include/asm/cachetype.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_SPARC_CACHETYPE_H
+#define __ASM_SPARC_CACHETYPE_H
+
+#include <asm/page.h>
+
+#ifdef CONFIG_SPARC32
+extern int vac_cache_size;
+#define cpu_dcache_is_aliasing()	(vac_cache_size > PAGE_SIZE)
+#else
+#define cpu_dcache_is_aliasing()	(L1DCACHE_SIZE > PAGE_SIZE)
+#endif
+
+#endif
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 7d792077e5fd..2dfde54d1a84 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -2,6 +2,7 @@
 config XTENSA
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_BINFMT_FLAT if !MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VM_PGTABLE
diff --git a/arch/xtensa/include/asm/cachetype.h b/arch/xtensa/include/asm/cachetype.h
new file mode 100644
index 000000000000..51bd49e2a1c5
--- /dev/null
+++ b/arch/xtensa/include/asm/cachetype.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_XTENSA_CACHETYPE_H
+#define __ASM_XTENSA_CACHETYPE_H
+
+#include <asm/cache.h>
+#include <asm/page.h>
+
+#define cpu_dcache_is_aliasing()	(DCACHE_WAY_SIZE > PAGE_SIZE)
+
+#endif
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index d504eb4b49ab..2cb15fe4fe12 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -138,4 +138,10 @@ static inline int get_cpu_cacheinfo_id(int cpu, int level)
 #define use_arch_cache_info()	(false)
 #endif
 
+#ifndef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
+#define cpu_dcache_is_aliasing()	false
+#else
+#include <asm/cachetype.h>
+#endif
+
 #endif /* _LINUX_CACHEINFO_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 57cd378c73d6..db09c9ad15c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1016,6 +1016,12 @@ config IDLE_PAGE_TRACKING
 	  See Documentation/admin-guide/mm/idle_page_tracking.rst for
 	  more details.
 
+# Architectures which implement cpu_dcache_is_aliasing() to query
+# whether the data caches are aliased (VIVT or VIPT with dcache
+# aliasing) need to select this.
+config ARCH_HAS_CPU_CACHE_ALIASING
+	bool
+
 config ARCH_HAS_CACHE_LINE_SIZE
 	bool
 
-- 
2.39.2


^ permalink raw reply related	[relevance 8%]

* Re: [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large folio
    @ 2024-02-12  6:18  0% ` Darrick J. Wong
  1 sibling, 0 replies; 200+ results
From: Darrick J. Wong @ 2024-02-12  6:18 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-mm, linux-kernel, tytso,
	adilger.kernel, jack, ritesh.list, hch, willy, zokeefe, yi.zhang,
	chengzhihao1, yukuai3, wangkefeng.wang

On Sat, Jan 27, 2024 at 09:57:59AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Hello,
> 
> This is the third version of RFC patch series that convert ext4 regular
> file's buffered IO path to iomap and enable large folio. It's rebased on
> 6.7 and Christoph's "map multiple blocks per ->map_blocks in iomap
> writeback" series [1]. I've fixed all issues found in the last about 3
> weeks of stress tests and fault injection tests in v2. I hope I've
> covered most of the corner cases, and any comments are welcome. :)
> 
> Changes since v2:
>  - Update patch 1-6 to v3 [2].
>  - iomap_zero and iomap_unshare don't need to update i_size and call
>    iomap_write_failed(), introduce a new helper iomap_write_end_simple()
>    to avoid doing that.
>  - Factor out ext4_[ext|ind]_map_blocks() parts from ext4_map_blocks(),
>    introduce a new helper ext4_iomap_map_one_extent() to allocate
>    delalloc blocks in writeback, which is always under i_data_sem in
>    write mode. This is done to prevent the writing back delalloc
>    extents become stale if it raced by truncate.
>  - Add a lock detection in mapping_clear_large_folios().
> Changes since v1:
>  - Introduce seq count for iomap buffered write and writeback to protect
>    races from extents changes, e.g. truncate, mwrite.
>  - Always allocate unwritten extents for new blocks, drop dioread_lock
>    mode, and make no distinctions between dioread_lock and
>    dioread_nolock.
>  - Don't add dirty data range to jinode, drop data=ordered mode, and
>    make no distinctions between data=ordered and data=writeback mode.
>  - Postpone updating i_disksize to endio.
>  - Allow splitting extents and use reserved space in endio.
>  - Instead of reimplement a new delayed mapping helper
>    ext4_iomap_da_map_blocks() for buffer write, try to reuse
>    ext4_da_map_blocks().
>  - Add support for disabling large folio on active inodes.
>  - Support online defragmentation, make file fall back to buffer_head
>    and disable large folio in ext4_move_extents().
>  - Move ext4_nonda_switch() in advance to prevent deadlock in mwrite.
>  - Add dirty_len and pos trace info to trace_iomap_writepage_map().
>  - Update patch 1-6 to v2.
> 
> This series only support ext4 with the default features and mount
> options, doesn't support inline_data, bigalloc, dax, fs_verity, fs_crypt
> and data=journal mode, ext4 would fall back to buffer_head path

Do you plan to add bigalloc or !extents support as a part 2 patchset?

An ext2 port to iomap has been (vaguely) in the works for a while,
though iirc willy never got the performance to match because iomap
didn't have a mechanism for the caller to tell it "run the IO now even
though you don't have a complete page, because the indirect block is the
next block after the 11th block".

--D

> automatically if you enabled these features/options. Although it has
> many limitations now, it can satisfy the requirements of common cases
> and bring a great performance benefit.
> 
> Patch 1-6: this is a preparation series, it changes ext4_map_blocks()
> and ext4_set_iomap() to recognize delayed only extents, I've send it out
> separately [2].
> 
> Patch 7-8: these are two minor iomap changes, the first one is don't
> update i_size and don't call iomap_write_failed() in zero_range, the
> second one is for debugging in the iomap writeback path that I've discussed with
> Christoph [3].
> 
> Patch 9-15: this is another preparation series, including some changes
> for delayed extents. Firstly, it factors out buffer_head handling from
> ext4_da_map_blocks(), making it support adding multiple blocks at a
> time. Then it makes the unwritten-to-written extent conversion in endio
> use reserved space, reducing the risk of potential data loss. Finally,
> introduce a sequence counter for extent status tree, which is useful
> for iomap buffer write and write back.
> 
> Patch 16-22: Implement buffered IO iomap path for read, write, mmap,
> zero range, truncate and writeback, replace current buffered_head path.
> Please look at the following patch for details.
> 
> Patch 23-26: Convert to iomap for regular file's buffered IO path
> besides inline_data, bigalloc, dax, fs_verity, fs_crypt, and
> data=journal mode, and enable large folio. It should be noted that the
> buffered iomap path doesn't support online defrag yet, so we need to
> fall back to buffer_head and disable large folio automatically if the
> user calls EXT4_IOC_MOVE_EXT.
> 
> About Tests:
>  - kvm-xfstests in auto mode, and about 3 weeks of stress tests and
>    fault injection tests.
>  - A performance tests below.
> 
>    Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU
>    with 400GB system ram, 200GB ramdisk and 1TB nvme ssd disk.
> 
>    == buffer read ==
> 
>                   buffer head        iomap with large folio
>    type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)
>    ----------------------------------------------------
>    hole     4K    565k    2206       811k    3167
>    hole     64K   45.1k   2820       78.1k   4879
>    hole     1M    2744    2744       4890    4891
>    ramdisk  4K    436k    1703       554k    2163
>    ramdisk  64K   29.6k   1848       44.0k   2747
>    ramdisk  1M    1994    1995       2809    2809
>    nvme     4K    306k    1196       324k    1267
>    nvme     64K   19.3k   1208       24.3k   1517
>    nvme     1M    1694    1694       2256    2256
> 
>    == buffer write ==
> 
>                                        buffer head    ext4_iomap    
>    type   Overwrite Sync Writeback bs  IOPS   BW      IOPS   BW
>    -------------------------------------------------------------
>    cache    N       N    N         4K   395k   1544   415k   1621
>    cache    N       N    N         64K  30.8k  1928   80.1k  5005
>    cache    N       N    N         1M   1963   1963   5641   5642
>    cache    Y       N    N         4K   423k   1652   443k   1730
>    cache    Y       N    N         64K  33.0k  2063   80.8k  5051
>    cache    Y       N    N         1M   2103   2103   5588   5589
>    ramdisk  N       N    Y         4K   362k   1416   307k   1198
>    ramdisk  N       N    Y         64K  22.4k  1399   64.8k  4050
>    ramdisk  N       N    Y         1M   1670   1670   4559   4560
>    ramdisk  N       Y    N         4K   9830   38.4   13.5k  52.8
>    ramdisk  N       Y    N         64K  5834   365    10.1k  629
>    ramdisk  N       Y    N         1M   1011   1011   2064   2064
>    ramdisk  Y       N    Y         4K   397k   1550   409k   1598
>    ramdisk  Y       N    Y         64K  29.2k  1827   73.6k  4597
>    ramdisk  Y       N    Y         1M   1837   1837   4985   4985
>    ramdisk  Y       Y    N         4K   173k   675    182k   710
>    ramdisk  Y       Y    N         64K  17.7k  1109   33.7k  2105
>    ramdisk  Y       Y    N         1M   1128   1129   1790   1791
>    nvme     N       N    Y         4K   298k   1164   290k   1134
>    nvme     N       N    Y         64K  21.5k  1343   57.4k  3590
>    nvme     N       N    Y         1M   1308   1308   3664   3664
>    nvme     N       Y    N         4K   10.7k  41.8   12.0k  46.9
>    nvme     N       Y    N         64K  5962   373    8598   537
>    nvme     N       Y    N         1M   676    677    1417   1418
>    nvme     Y       N    Y         4K   366k   1430   373k   1456
>    nvme     Y       N    Y         64K  26.7k  1670   56.8k  3547
>    nvme     Y       N    Y         1M   1745   1746   3586   3586
>    nvme     Y       Y    N         4K   59.0k  230    61.2k  239
>    nvme     Y       Y    N         64K  13.0k  813    21.0k  1311
>    nvme     Y       Y    N         1M   683    683    1368   1369
>  
> TODO
>  - Keep on doing stress tests and fixing.
>  - I will rebase and resend my other patch set "ext4: more accurate
>    metadata reservation for delalloc mount option[4]" later; it's useful
>    for iomap conversion. After this series, I suppose we could totally
>    drop ext4_nonda_switch() and prevent the risk of data loss caused by
>    extents splitting.
>  - Support for more features and mount options in the future.
> 
> [1] https://lore.kernel.org/linux-fsdevel/20231207072710.176093-1-hch@lst.de/
> [2] https://lore.kernel.org/linux-ext4/20240105033018.1665752-1-yi.zhang@huaweicloud.com/
> [3] https://lore.kernel.org/linux-fsdevel/20231207150311.GA18830@lst.de/
> [4] https://lore.kernel.org/linux-ext4/20230824092619.1327976-1-yi.zhang@huaweicloud.com/
> 
> Thanks,
> Yi.
> 
> ---
> v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
> v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
> 
> Zhang Yi (26):
>   ext4: refactor ext4_da_map_blocks()
>   ext4: convert to exclusive lock while inserting delalloc extents
>   ext4: correct the hole length returned by ext4_map_blocks()
>   ext4: add a hole extent entry in cache after punch
>   ext4: make ext4_map_blocks() distinguish delalloc only extent
>   ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
>   iomap: don't increase i_size if it's not a write operation
>   iomap: add pos and dirty_len into trace_iomap_writepage_map
>   ext4: allow inserting delalloc extents with multi-blocks
>   ext4: correct delalloc extent length
>   ext4: also mark extent as delalloc if it's been unwritten
>   ext4: factor out bh handles to ext4_da_get_block_prep()
>   ext4: use reserved metadata blocks when splitting extent in endio
>   ext4: factor out ext4_map_{create|query}_blocks()
>   ext4: introduce seq counter for extent entry
>   ext4: add a new iomap aops for regular file's buffered IO path
>   ext4: implement buffered read iomap path
>   ext4: implement buffered write iomap path
>   ext4: implement writeback iomap path
>   ext4: implement mmap iomap path
>   ext4: implement zero_range iomap path
>   ext4: writeback partial blocks before zero range
>   ext4: fall back to buffer_head path for defrag
>   ext4: partially enable iomap for regular file's buffered IO path
>   filemap: support disable large folios on active inode
>   ext4: enable large folio for regular file with iomap buffered IO path
> 
>  fs/ext4/ext4.h              |  14 +-
>  fs/ext4/ext4_jbd2.c         |   6 +
>  fs/ext4/ext4_jbd2.h         |   7 +
>  fs/ext4/extents.c           | 149 +++---
>  fs/ext4/extents_status.c    |  39 +-
>  fs/ext4/extents_status.h    |   4 +-
>  fs/ext4/file.c              |  19 +-
>  fs/ext4/ialloc.c            |   5 +
>  fs/ext4/inode.c             | 891 +++++++++++++++++++++++++++---------
>  fs/ext4/move_extent.c       |  35 ++
>  fs/ext4/page-io.c           | 107 +++++
>  fs/ext4/super.c             |   3 +
>  fs/iomap/buffered-io.c      |  30 +-
>  fs/iomap/trace.h            |  43 +-
>  include/linux/pagemap.h     |  14 +
>  include/trace/events/ext4.h |  31 +-
>  mm/readahead.c              |   6 +-
>  17 files changed, 1109 insertions(+), 294 deletions(-)
> 
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[relevance 0%]

* [PATCH v9 1/8] landlock: Add IOCTL access right
  @ 2024-02-09 17:06  5% ` Günther Noack
  2024-02-16 17:19  0%   ` Mickaël Salaün
  2024-02-19 18:34  0%   ` Mickaël Salaün
  0 siblings, 2 replies; 200+ results
From: Günther Noack @ 2024-02-09 17:06 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Jeff Xu, Arnd Bergmann, Jorge Lucangeli Obes, Allen Webb,
	Dmitry Torokhov, Paul Moore, Konstantin Meskhidze,
	Matt Bobrowski, linux-fsdevel, Günther Noack

Introduces the LANDLOCK_ACCESS_FS_IOCTL access right
and increments the Landlock ABI version to 5.

Like the truncate right, these rights are associated with a file
descriptor at the time of open(2), and get respected even when the
file descriptor is used outside of the thread which it was originally
opened in.

A newly enabled Landlock policy therefore does not apply to file
descriptors which are already open.

If the LANDLOCK_ACCESS_FS_IOCTL right is handled, only a small number
of safe IOCTL commands will be permitted on newly opened files.  The
permitted IOCTLs can be configured through the ruleset in limited ways
now.  (See documentation for details.)

Specifically, when LANDLOCK_ACCESS_FS_IOCTL is handled, granting this
right on a file or directory will *not* permit to do all IOCTL
commands, but only influence the IOCTL commands which are not already
handled through other access rights.  The intent is to keep the groups
of IOCTL commands more fine-grained.

Noteworthy scenarios which require special attention:

TTY devices are often passed into a process from the parent process,
and so a newly enabled Landlock policy does not retroactively apply to
them automatically.  In the past, TTY devices have often supported
IOCTL commands like TIOCSTI and some TIOCLINUX subcommands, which let
callers control the TTY input buffer (and simulate
keypresses).  This should be restricted to CAP_SYS_ADMIN programs on
modern kernels though.

Some legitimate file system features, like setting up fscrypt, are
exposed as IOCTL commands on regular files and directories -- users of
Landlock are advised to double check that the sandboxed process does
not need to invoke these IOCTLs.

Known limitations:

The LANDLOCK_ACCESS_FS_IOCTL access right is a coarse-grained control
over IOCTL commands.  Future work will enable a more fine-grained
access control for IOCTLs.

In the meantime, Landlock users may use path-based restrictions in
combination with their knowledge about the file system layout to
control what IOCTLs can be done.  Mounting file systems with the nodev
option can help to distinguish regular files and devices, and give
guarantees about the affected files, which Landlock alone can not give
yet.

Signed-off-by: Günther Noack <gnoack@google.com>
---
 include/uapi/linux/landlock.h                |  55 ++++-
 security/landlock/fs.c                       | 227 ++++++++++++++++++-
 security/landlock/fs.h                       |   3 +
 security/landlock/limits.h                   |  11 +-
 security/landlock/ruleset.h                  |   2 +-
 security/landlock/syscalls.c                 |  19 +-
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   |   5 +-
 8 files changed, 302 insertions(+), 22 deletions(-)

diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
index 25c8d7677539..16d7d72804f8 100644
--- a/include/uapi/linux/landlock.h
+++ b/include/uapi/linux/landlock.h
@@ -128,7 +128,7 @@ struct landlock_net_port_attr {
  * files and directories.  Files or directories opened before the sandboxing
  * are not subject to these restrictions.
  *
- * A file can only receive these access rights:
+ * The following access rights apply only to files:
  *
  * - %LANDLOCK_ACCESS_FS_EXECUTE: Execute a file.
  * - %LANDLOCK_ACCESS_FS_WRITE_FILE: Open a file with write access. Note that
@@ -138,12 +138,13 @@ struct landlock_net_port_attr {
  * - %LANDLOCK_ACCESS_FS_READ_FILE: Open a file with read access.
  * - %LANDLOCK_ACCESS_FS_TRUNCATE: Truncate a file with :manpage:`truncate(2)`,
  *   :manpage:`ftruncate(2)`, :manpage:`creat(2)`, or :manpage:`open(2)` with
- *   ``O_TRUNC``. Whether an opened file can be truncated with
- *   :manpage:`ftruncate(2)` is determined during :manpage:`open(2)`, in the
- *   same way as read and write permissions are checked during
- *   :manpage:`open(2)` using %LANDLOCK_ACCESS_FS_READ_FILE and
- *   %LANDLOCK_ACCESS_FS_WRITE_FILE. This access right is available since the
- *   third version of the Landlock ABI.
+ *   ``O_TRUNC``.  This access right is available since the third version of the
+ *   Landlock ABI.
+ *
+ * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
+ * with :manpage:`ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
+ * read and write permissions are checked during :manpage:`open(2)` using
+ * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE.
  *
  * A directory can receive access rights related to files or directories.  The
  * following access right is applied to the directory itself, and the
@@ -198,13 +199,50 @@ struct landlock_net_port_attr {
  *   If multiple requirements are not met, the ``EACCES`` error code takes
  *   precedence over ``EXDEV``.
  *
+ * The following access right applies both to files and directories:
+ *
+ * - %LANDLOCK_ACCESS_FS_IOCTL: Invoke :manpage:`ioctl(2)` commands on an opened
+ *   file or directory.
+ *
+ *   This access right applies to all :manpage:`ioctl(2)` commands, except for
+ *   ``FIOCLEX``, ``FIONCLEX``, ``FIONBIO`` and ``FIOASYNC``.  These commands
+ *   continue to be invokable independent of the %LANDLOCK_ACCESS_FS_IOCTL
+ *   access right.
+ *
+ *   When certain other access rights are handled in the ruleset, in addition to
+ *   %LANDLOCK_ACCESS_FS_IOCTL, granting these access rights will unlock access
+ *   to additional groups of IOCTL commands, on the affected files:
+ *
+ *   * %LANDLOCK_ACCESS_FS_READ_FILE and %LANDLOCK_ACCESS_FS_WRITE_FILE unlock
+ *     access to ``FIOQSIZE``, ``FIONREAD``, ``FIGETBSZ``, ``FS_IOC_FIEMAP``,
+ *     ``FIBMAP``, ``FIDEDUPERANGE``, ``FICLONE``, ``FICLONERANGE``,
+ *     ``FS_IOC_RESVSP``, ``FS_IOC_RESVSP64``, ``FS_IOC_UNRESVSP``,
+ *     ``FS_IOC_UNRESVSP64``, ``FS_IOC_ZERO_RANGE``.
+ *
+ *   * %LANDLOCK_ACCESS_FS_READ_DIR unlocks access to ``FIOQSIZE``,
+ *     ``FIONREAD``, ``FIGETBSZ``.
+ *
+ *   When these access rights are handled in the ruleset, the availability of
+ *   the affected IOCTL commands is not governed by %LANDLOCK_ACCESS_FS_IOCTL
+ *   any more, but by the respective access right.
+ *
+ *   All other IOCTL commands are not handled specially, and are governed by
+ *   %LANDLOCK_ACCESS_FS_IOCTL.  This includes %FS_IOC_GETFLAGS and
+ *   %FS_IOC_SETFLAGS for manipulating inode flags (:manpage:`ioctl_iflags(2)`),
+ *   %FS_IOC_FSGETXATTR and %FS_IOC_FSSETXATTR for manipulating extended
+ *   attributes, as well as %FIFREEZE and %FITHAW for freezing and thawing file
+ *   systems.
+ *
+ *   This access right is available since the fifth version of the Landlock
+ *   ABI.
+ *
  * .. warning::
  *
  *   It is currently not possible to restrict some file-related actions
  *   accessible through these syscall families: :manpage:`chdir(2)`,
  *   :manpage:`stat(2)`, :manpage:`flock(2)`, :manpage:`chmod(2)`,
  *   :manpage:`chown(2)`, :manpage:`setxattr(2)`, :manpage:`utime(2)`,
- *   :manpage:`ioctl(2)`, :manpage:`fcntl(2)`, :manpage:`access(2)`.
+ *   :manpage:`fcntl(2)`, :manpage:`access(2)`.
  *   Future Landlock evolutions will enable to restrict them.
  */
 /* clang-format off */
@@ -223,6 +261,7 @@ struct landlock_net_port_attr {
 #define LANDLOCK_ACCESS_FS_MAKE_SYM			(1ULL << 12)
 #define LANDLOCK_ACCESS_FS_REFER			(1ULL << 13)
 #define LANDLOCK_ACCESS_FS_TRUNCATE			(1ULL << 14)
+#define LANDLOCK_ACCESS_FS_IOCTL			(1ULL << 15)
 /* clang-format on */
 
 /**
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 73997e63734f..84efea3f7c0f 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -7,6 +7,7 @@
  * Copyright © 2021-2022 Microsoft Corporation
  */
 
+#include <asm/ioctls.h>
 #include <kunit/test.h>
 #include <linux/atomic.h>
 #include <linux/bitops.h>
@@ -14,6 +15,7 @@
 #include <linux/compiler_types.h>
 #include <linux/dcache.h>
 #include <linux/err.h>
+#include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -29,6 +31,7 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/workqueue.h>
+#include <uapi/linux/fiemap.h>
 #include <uapi/linux/landlock.h>
 
 #include "common.h"
@@ -84,6 +87,186 @@ static const struct landlock_object_underops landlock_fs_underops = {
 	.release = release_inode
 };
 
+/* IOCTL helpers */
+
+/*
+ * These are synthetic access rights, which are only used within the kernel, but
+ * not exposed to callers in userspace.  The mapping between these access rights
+ * and IOCTL commands is defined in the get_required_ioctl_access() helper function.
+ */
+#define LANDLOCK_ACCESS_FS_IOCTL_RW (LANDLOCK_LAST_PUBLIC_ACCESS_FS << 1)
+#define LANDLOCK_ACCESS_FS_IOCTL_RW_FILE (LANDLOCK_LAST_PUBLIC_ACCESS_FS << 2)
+
+/* ioctl_groups - all synthetic access rights for IOCTL command groups */
+/* clang-format off */
+#define IOCTL_GROUPS (				\
+	LANDLOCK_ACCESS_FS_IOCTL_RW |		\
+	LANDLOCK_ACCESS_FS_IOCTL_RW_FILE)
+/* clang-format on */
+
+static_assert((IOCTL_GROUPS & LANDLOCK_MASK_ACCESS_FS) == IOCTL_GROUPS);
+
+/**
+ * get_required_ioctl_access() - Determine required IOCTL access rights.
+ *
+ * @cmd: The IOCTL command that is supposed to be run.
+ *
+ * Any new IOCTL commands that are implemented in fs/ioctl.c's do_vfs_ioctl()
+ * should be considered for inclusion here.
+ *
+ * Returns: The access rights that must be granted on an opened file in order to
+ * use the given @cmd.
+ */
+static __attribute_const__ access_mask_t
+get_required_ioctl_access(const unsigned int cmd)
+{
+	switch (cmd) {
+	case FIOCLEX:
+	case FIONCLEX:
+	case FIONBIO:
+	case FIOASYNC:
+		/*
+		 * FIOCLEX, FIONCLEX, FIONBIO and FIOASYNC manipulate the FD's
+		 * close-on-exec and the file's buffered-IO and async flags.
+		 * These operations are also available through fcntl(2), and are
+		 * unconditionally permitted in Landlock.
+		 */
+		return 0;
+	case FIONREAD:
+	case FIOQSIZE:
+	case FIGETBSZ:
+		/*
+		 * FIONREAD returns the number of bytes that are immediately
+		 * available for reading from a file, i.e. how much data a
+		 * read could return right away.
+		 *
+		 * FIOQSIZE queries the size of a file or directory.
+		 *
+		 * FIGETBSZ queries the file system's block size for a file or
+		 * directory.
+		 *
+		 * These IOCTL commands are permitted for files which are opened
+		 * with LANDLOCK_ACCESS_FS_READ_DIR,
+		 * LANDLOCK_ACCESS_FS_READ_FILE, or
+		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL_RW;
+	case FS_IOC_FIEMAP:
+	case FIBMAP:
+		/*
+		 * FS_IOC_FIEMAP and FIBMAP query information about the
+		 * allocation of blocks within a file.  They are permitted for
+		 * files which are opened with LANDLOCK_ACCESS_FS_READ_FILE or
+		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
+		 */
+		fallthrough;
+	case FIDEDUPERANGE:
+	case FICLONE:
+	case FICLONERANGE:
+		/*
+		 * FIDEDUPERANGE, FICLONE and FICLONERANGE make files share
+		 * their underlying storage ("reflink") between source and
+		 * destination FDs, on file systems which support that.
+		 *
+		 * The underlying implementations are already checking whether
+		 * the involved files are opened with the appropriate read/write
+		 * modes.  We rely on this being implemented correctly.
+		 *
+		 * These IOCTLs are permitted for files which are opened with
+		 * LANDLOCK_ACCESS_FS_READ_FILE or
+		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
+		 */
+		fallthrough;
+	case FS_IOC_RESVSP:
+	case FS_IOC_RESVSP64:
+	case FS_IOC_UNRESVSP:
+	case FS_IOC_UNRESVSP64:
+	case FS_IOC_ZERO_RANGE:
+		/*
+		 * These IOCTLs reserve space, or create holes like
+		 * fallocate(2).  We rely on the implementations checking the
+		 * files' read/write modes.
+		 *
+		 * These IOCTLs are permitted for files which are opened with
+		 * LANDLOCK_ACCESS_FS_READ_FILE or
+		 * LANDLOCK_ACCESS_FS_WRITE_FILE.
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
+	default:
+		/*
+		 * Other commands are guarded by the catch-all access right.
+		 */
+		return LANDLOCK_ACCESS_FS_IOCTL;
+	}
+}
+
+/**
+ * expand_ioctl() - Return the dst flags from either the src flag or the
+ * %LANDLOCK_ACCESS_FS_IOCTL flag, depending on whether the
+ * %LANDLOCK_ACCESS_FS_IOCTL and src access rights are handled or not.
+ *
+ * @handled: Handled access rights.
+ * @access: The access mask to copy values from.
+ * @src: A single access right to copy from in @access.
+ * @dst: One or more access rights to copy to.
+ *
+ * Returns: @dst, or 0.
+ */
+static __attribute_const__ access_mask_t
+expand_ioctl(const access_mask_t handled, const access_mask_t access,
+	     const access_mask_t src, const access_mask_t dst)
+{
+	access_mask_t copy_from;
+
+	if (!(handled & LANDLOCK_ACCESS_FS_IOCTL))
+		return 0;
+
+	copy_from = (handled & src) ? src : LANDLOCK_ACCESS_FS_IOCTL;
+	if (access & copy_from)
+		return dst;
+
+	return 0;
+}
+
+/**
+ * landlock_expand_access_fs() - Returns @access with the synthetic IOCTL group
+ * flags enabled if necessary.
+ *
+ * @handled: Handled FS access rights.
+ * @access: FS access rights to expand.
+ *
+ * Returns: @access expanded by the necessary flags for the synthetic IOCTL
+ * access rights.
+ */
+static __attribute_const__ access_mask_t landlock_expand_access_fs(
+	const access_mask_t handled, const access_mask_t access)
+{
+	return access |
+	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_WRITE_FILE,
+			    LANDLOCK_ACCESS_FS_IOCTL_RW |
+				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
+	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_FILE,
+			    LANDLOCK_ACCESS_FS_IOCTL_RW |
+				    LANDLOCK_ACCESS_FS_IOCTL_RW_FILE) |
+	       expand_ioctl(handled, access, LANDLOCK_ACCESS_FS_READ_DIR,
+			    LANDLOCK_ACCESS_FS_IOCTL_RW);
+}
+
+/**
+ * landlock_expand_handled_access_fs() - add synthetic IOCTL access rights to an
+ * access mask of handled accesses.
+ *
+ * @handled: The handled accesses of a ruleset that is being created.
+ *
+ * Returns: @handled, with the bits for the synthetic IOCTL access rights set,
+ * if %LANDLOCK_ACCESS_FS_IOCTL is handled.
+ */
+__attribute_const__ access_mask_t
+landlock_expand_handled_access_fs(const access_mask_t handled)
+{
+	return landlock_expand_access_fs(handled, handled);
+}
+
 /* Ruleset management */
 
 static struct landlock_object *get_inode_object(struct inode *const inode)
@@ -148,7 +331,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
 	LANDLOCK_ACCESS_FS_EXECUTE | \
 	LANDLOCK_ACCESS_FS_WRITE_FILE | \
 	LANDLOCK_ACCESS_FS_READ_FILE | \
-	LANDLOCK_ACCESS_FS_TRUNCATE)
+	LANDLOCK_ACCESS_FS_TRUNCATE | \
+	LANDLOCK_ACCESS_FS_IOCTL)
 /* clang-format on */
 
 /*
@@ -158,6 +342,7 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
 			    const struct path *const path,
 			    access_mask_t access_rights)
 {
+	access_mask_t handled;
 	int err;
 	struct landlock_id id = {
 		.type = LANDLOCK_KEY_INODE,
@@ -170,9 +355,11 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
 	if (WARN_ON_ONCE(ruleset->num_layers != 1))
 		return -EINVAL;
 
+	handled = landlock_get_fs_access_mask(ruleset, 0);
+	/* Expands the synthetic IOCTL groups. */
+	access_rights |= landlock_expand_access_fs(handled, access_rights);
 	/* Transforms relative access rights to absolute ones. */
-	access_rights |= LANDLOCK_MASK_ACCESS_FS &
-			 ~landlock_get_fs_access_mask(ruleset, 0);
+	access_rights |= LANDLOCK_MASK_ACCESS_FS & ~handled;
 	id.key.object = get_inode_object(d_backing_inode(path->dentry));
 	if (IS_ERR(id.key.object))
 		return PTR_ERR(id.key.object);
@@ -1333,7 +1520,9 @@ static int hook_file_open(struct file *const file)
 {
 	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_FS] = {};
 	access_mask_t open_access_request, full_access_request, allowed_access;
-	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE;
+	const access_mask_t optional_access = LANDLOCK_ACCESS_FS_TRUNCATE |
+					      LANDLOCK_ACCESS_FS_IOCTL |
+					      IOCTL_GROUPS;
 	const struct landlock_ruleset *const dom = get_current_fs_domain();
 
 	if (!dom)
@@ -1375,6 +1564,16 @@ static int hook_file_open(struct file *const file)
 		}
 	}
 
+	/*
+	 * Named pipes should be treated just like anonymous pipes.
+	 * Therefore, we permit all IOCTLs on them.
+	 */
+	if (S_ISFIFO(file_inode(file)->i_mode)) {
+		allowed_access |= LANDLOCK_ACCESS_FS_IOCTL |
+				  LANDLOCK_ACCESS_FS_IOCTL_RW |
+				  LANDLOCK_ACCESS_FS_IOCTL_RW_FILE;
+	}
+
 	/*
 	 * For operations on already opened files (i.e. ftruncate()), it is the
 	 * access rights at the time of open() which decide whether the
@@ -1406,6 +1605,25 @@ static int hook_file_truncate(struct file *const file)
 	return -EACCES;
 }
 
+static int hook_file_ioctl(struct file *file, unsigned int cmd,
+			   unsigned long arg)
+{
+	const access_mask_t required_access = get_required_ioctl_access(cmd);
+	const access_mask_t allowed_access =
+		landlock_file(file)->allowed_access;
+
+	/*
+	 * It is the access rights at the time of opening the file which
+	 * determine whether IOCTL can be used on the opened file later.
+	 *
+	 * The access right is attached to the opened file in hook_file_open().
+	 */
+	if ((allowed_access & required_access) == required_access)
+		return 0;
+
+	return -EACCES;
+}
+
 static struct security_hook_list landlock_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(inode_free_security, hook_inode_free_security),
 
@@ -1428,6 +1646,7 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = {
 	LSM_HOOK_INIT(file_alloc_security, hook_file_alloc_security),
 	LSM_HOOK_INIT(file_open, hook_file_open),
 	LSM_HOOK_INIT(file_truncate, hook_file_truncate),
+	LSM_HOOK_INIT(file_ioctl, hook_file_ioctl),
 };
 
 __init void landlock_add_fs_hooks(void)
diff --git a/security/landlock/fs.h b/security/landlock/fs.h
index 488e4813680a..086576b8386b 100644
--- a/security/landlock/fs.h
+++ b/security/landlock/fs.h
@@ -92,4 +92,7 @@ int landlock_append_fs_rule(struct landlock_ruleset *const ruleset,
 			    const struct path *const path,
 			    access_mask_t access_hierarchy);
 
+__attribute_const__ access_mask_t
+landlock_expand_handled_access_fs(const access_mask_t handled);
+
 #endif /* _SECURITY_LANDLOCK_FS_H */
diff --git a/security/landlock/limits.h b/security/landlock/limits.h
index 93c9c6f91556..ecbdc8bbf906 100644
--- a/security/landlock/limits.h
+++ b/security/landlock/limits.h
@@ -18,7 +18,16 @@
 #define LANDLOCK_MAX_NUM_LAYERS		16
 #define LANDLOCK_MAX_NUM_RULES		U32_MAX
 
-#define LANDLOCK_LAST_ACCESS_FS		LANDLOCK_ACCESS_FS_TRUNCATE
+/*
+ * For file system access rights, Landlock distinguishes between the publicly
+ * visible access rights (1 to LANDLOCK_LAST_PUBLIC_ACCESS_FS) and the private
+ * ones which are not exposed to userspace (LANDLOCK_LAST_PUBLIC_ACCESS_FS + 1
+ * to LANDLOCK_LAST_ACCESS_FS).  The private access rights are defined in fs.c.
+ */
+#define LANDLOCK_LAST_PUBLIC_ACCESS_FS	LANDLOCK_ACCESS_FS_IOCTL
+#define LANDLOCK_MASK_PUBLIC_ACCESS_FS	((LANDLOCK_LAST_PUBLIC_ACCESS_FS << 1) - 1)
+
+#define LANDLOCK_LAST_ACCESS_FS		(LANDLOCK_LAST_PUBLIC_ACCESS_FS << 2)
 #define LANDLOCK_MASK_ACCESS_FS		((LANDLOCK_LAST_ACCESS_FS << 1) - 1)
 #define LANDLOCK_NUM_ACCESS_FS		__const_hweight64(LANDLOCK_MASK_ACCESS_FS)
 #define LANDLOCK_SHIFT_ACCESS_FS	0
diff --git a/security/landlock/ruleset.h b/security/landlock/ruleset.h
index c7f1526784fd..5a28ea8e1c3d 100644
--- a/security/landlock/ruleset.h
+++ b/security/landlock/ruleset.h
@@ -30,7 +30,7 @@
 	LANDLOCK_ACCESS_FS_REFER)
 /* clang-format on */
 
-typedef u16 access_mask_t;
+typedef u32 access_mask_t;
 /* Makes sure all filesystem access rights can be stored. */
 static_assert(BITS_PER_TYPE(access_mask_t) >= LANDLOCK_NUM_ACCESS_FS);
 /* Makes sure all network access rights can be stored. */
diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
index 898358f57fa0..f0bc50003b46 100644
--- a/security/landlock/syscalls.c
+++ b/security/landlock/syscalls.c
@@ -137,7 +137,7 @@ static const struct file_operations ruleset_fops = {
 	.write = fop_dummy_write,
 };
 
-#define LANDLOCK_ABI_VERSION 4
+#define LANDLOCK_ABI_VERSION 5
 
 /**
  * sys_landlock_create_ruleset - Create a new ruleset
@@ -192,8 +192,8 @@ SYSCALL_DEFINE3(landlock_create_ruleset,
 		return err;
 
 	/* Checks content (and 32-bits cast). */
-	if ((ruleset_attr.handled_access_fs | LANDLOCK_MASK_ACCESS_FS) !=
-	    LANDLOCK_MASK_ACCESS_FS)
+	if ((ruleset_attr.handled_access_fs | LANDLOCK_MASK_PUBLIC_ACCESS_FS) !=
+	    LANDLOCK_MASK_PUBLIC_ACCESS_FS)
 		return -EINVAL;
 
 	/* Checks network content (and 32-bits cast). */
@@ -201,6 +201,10 @@ SYSCALL_DEFINE3(landlock_create_ruleset,
 	    LANDLOCK_MASK_ACCESS_NET)
 		return -EINVAL;
 
+	/* Expands synthetic IOCTL groups. */
+	ruleset_attr.handled_access_fs = landlock_expand_handled_access_fs(
+		ruleset_attr.handled_access_fs);
+
 	/* Checks arguments and transforms to kernel struct. */
 	ruleset = landlock_create_ruleset(ruleset_attr.handled_access_fs,
 					  ruleset_attr.handled_access_net);
@@ -309,8 +313,13 @@ static int add_rule_path_beneath(struct landlock_ruleset *const ruleset,
 	if (!path_beneath_attr.allowed_access)
 		return -ENOMSG;
 
-	/* Checks that allowed_access matches the @ruleset constraints. */
-	mask = landlock_get_raw_fs_access_mask(ruleset, 0);
+	/*
+	 * Checks that allowed_access matches the @ruleset constraints and only
+	 * consists of publicly visible access rights (as opposed to synthetic
+	 * ones).
+	 */
+	mask = landlock_get_raw_fs_access_mask(ruleset, 0) &
+	       LANDLOCK_MASK_PUBLIC_ACCESS_FS;
 	if ((path_beneath_attr.allowed_access | mask) != mask)
 		return -EINVAL;
 
diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
index 646f778dfb1e..d292b419ccba 100644
--- a/tools/testing/selftests/landlock/base_test.c
+++ b/tools/testing/selftests/landlock/base_test.c
@@ -75,7 +75,7 @@ TEST(abi_version)
 	const struct landlock_ruleset_attr ruleset_attr = {
 		.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
 	};
-	ASSERT_EQ(4, landlock_create_ruleset(NULL, 0,
+	ASSERT_EQ(5, landlock_create_ruleset(NULL, 0,
 					     LANDLOCK_CREATE_RULESET_VERSION));
 
 	ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0,
diff --git a/tools/testing/selftests/landlock/fs_test.c b/tools/testing/selftests/landlock/fs_test.c
index 2d6d9b43d958..3203f4a5bc85 100644
--- a/tools/testing/selftests/landlock/fs_test.c
+++ b/tools/testing/selftests/landlock/fs_test.c
@@ -527,9 +527,10 @@ TEST_F_FORK(layout1, inval)
 	LANDLOCK_ACCESS_FS_EXECUTE | \
 	LANDLOCK_ACCESS_FS_WRITE_FILE | \
 	LANDLOCK_ACCESS_FS_READ_FILE | \
-	LANDLOCK_ACCESS_FS_TRUNCATE)
+	LANDLOCK_ACCESS_FS_TRUNCATE | \
+	LANDLOCK_ACCESS_FS_IOCTL)
 
-#define ACCESS_LAST LANDLOCK_ACCESS_FS_TRUNCATE
+#define ACCESS_LAST LANDLOCK_ACCESS_FS_IOCTL
 
 #define ACCESS_ALL ( \
 	ACCESS_FILE | \
-- 
2.43.0.687.g38aa6559b0-goog



* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  @ 2024-02-09 12:47  4%                 ` John Garry
  2024-02-13 23:41  0%                   ` Dave Chinner
  0 siblings, 1 reply; 200+ results
From: John Garry @ 2024-02-09 12:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

>>
>> Playing devil's advocate here, at least this behavior should be documented.
> 
> That's what man pages are for, yes?
> 
> Are you expecting your deployments to be run on highly suboptimal
> configurations and so the code needs to be optimised for this
> behaviour, or are you expecting them to be run on correctly
> configured systems which would never see these issues?

The latter hopefully

> 
> 
>>> The whole reason for rtextsize existing is to optimise the rtextent
>>> allocation to the typical minimum IO size done to that volume. If
>>> all your IO is sub-rtextsize size and alignment, then all that has
>>> been done is forcing the entire rt device IO into a corner it was
>>> never really intended nor optimised for.
>>
>> Sure, but just because we are optimized for a certain IO write size should
>> not mean that other writes are disallowed or quite problematic.
> 
> Atomic writes are just "other writes". They are writes that are
> *expected to fail* if they cannot be done atomically.

Agreed

> 
> Application writers will quickly learn how to do sane, fast,
> reliable atomic write IO if we reject anything that is going to
> requires some complex, sub-optimal workaround in the kernel to make
> it work. The simplest solution is to -fail the write-, because
> userspace *must* be prepared for *any* atomic write to fail.

Sure, but it needs to be such that the application writer at least knows 
why it failed, which so far had not been documented.

> 
>>> Why should we jump through crazy hoops to try to make filesystems
>>> optimised for large IOs with mismatched, overlapping small atomic
>>> writes?
>>
>> As mentioned, typically the atomic writes will be the same size, but we may
>> have other writes of smaller size.
> 
> Then we need the tiny write to allocate and zero according to the
> maximum sized atomic write bounds. Then we just don't care about
> large atomic IO overlapping small IO, because the extent on disk
> aligned to the large atomic IO is then always guaranteed to be the
> correct size and shape.

I think it's worth mentioning that there is currently a separation 
between how we configure the FS extent size for atomic writes and what 
the bdev can actually support in terms of atomic writes.

Setting the rtvol extsize at mkfs time or enabling the atomic writes 
flag (FS_XFLAG_ATOMICWRITES) doesn't check what the underlying bdev can 
do in terms of atomic writes.

This check is not done because what the bdev can do in terms of atomic 
writes is not fixed - or, more specifically, what the request_queue 
reports is not fixed. There are things which can change this. For 
example, a FW update could change all the atomic write capabilities of a 
disk. Or even if we swapped a SCSI disk into another host the atomic 
write limits may change, as the atomic write unit max depends on the 
SCSI HBA DMA limits. Changing BIO_MAX_VECS - which could come from a 
kernel update - could also change what we report as atomic write limit 
in the request queue.

> 
> 
>>>> With the change in this patch, instead we have something like this after the
>>>> first write:
>>>>
>>>> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
>>>> wrote 4096 bytes at pos 0 write_size=4096
>>>> # filefrag -v mnt/file
>>>> Filesystem type is: 58465342
>>>> File size of mnt/file is 4096 (1 block of 4096 bytes)
>>>>     ext:     logical_offset:        physical_offset: length:   expected:
>>>> flags:
>>>>       0:        0..       3:         24..        27:      4:
>>>> last,eof
>>>> mnt/file: 1 extent found
>>>> #
>>>>
>>>> So the 16KB extent is in written state and the 2nd 16KB write would iter
>>>> once, producing a single BIO.
>>> Sure, I know how it works. My point is that it's a terrible way to
>>> go about allowing that second atomic write to succeed.
>> I think 'terrible' is a bit too strong a word here.
> 
> Doing anything in a way that a user can DOS the entire filesystem
> is *terrible*. No ifs, buts or otherwise.

Understood

> 
>> Indeed, you suggest to
>> manually zero the file to solve this problem, below, while this code change
>> does the same thing automatically.
> 
> Yes, but I also outlined a way that it can be done automatically
> without being terrible. There are multiple options here, I outlined
> two different approaches that are acceptible.

I think that I need to check these alternate solutions in more detail. 
More below.

> 
>>>>>> In this
>>>>>> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
>>>>>> ensure that the extent is completely written whenever we allocate it. At
>>>>>> least that is my idea.
>>>>> So return an unaligned extent, and then the IOMAP_ATOMIC checks you
>>>>> add below say "no" and then the application has to do things the
>>>>> slow, safe way....
>>>> We have been porting atomic write support to some database apps and they
>>>> (database developers) have had to do something like manually zero the
>>>> complete file to get around this issue, but that's not a good user
>>>> experience.
>>> Better the application zeros the file when it is being initialised
>>> and doesn't have performance constraints rather than forcing the
>>> filesystem to do it in the IO fast path when IO performance and
>>> latency actually matters to the application.
>>
>> Can't we do both? I mean, the well-informed user can still pre-zero the file
>> just to ensure we aren't doing this zero'ing with the extent allocation.
> 
> I never said we can't do zeroing. I just said that it's normally
> better when the application controls zeroing directly.

ok

> 
>>> And therein lies the problem.
>>>
>>> If you are doing sub-rtextent IO at all, then you are forcing the
>>> filesystem down the path of explicitly using unwritten extents and
>>> requiring O_DSYNC direct IO to do journal flushes in IO completion
>>> context and then performance just goes down hill from them.
>>>
>>> The requirement for unwritten extents to track sub-rtextsize written
>>> regions is what you're trying to work around with XFS_BMAPI_ZERO so
>>> that atomic writes will always see "atomic write aligned" allocated
>>> regions.
>>>
>>> Do you see the problem here? You've explicitly told the filesystem
>>> that allocation is aligned to 64kB chunks, then because the
>>> filesystem block size is 4kB, it's allowed to track unwritten
>>> regions at 4kB boundaries. Then you do 4kB aligned file IO, which
>>> then changes unwritten extents at 4kB boundaries. Then you do a
>>> overlapping 16kB IO that*requires*  16kB allocation alignment, and
>>> things go BOOM.
>>>
>>> Yes, they should go BOOM.
>>>
>>> This is a horrible configuration - it is incomaptible with 16kB
>>> aligned and sized atomic IO.
>>
>> Just because the DB may do 16KB atomic writes most of the time should not
>> disallow it from any other form of writes.
> 
> That's not what I said. I said that using sub-rtextsize atomic writes
> with single FSB unwritten extent tracking is horrible and
> incompatible with doing 16kB atomic writes.
> 
> This setup will not work at all well with your patches and should go
> BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
> has uncoordinated extent allocation and unwritten conversion
> granularity.
> 
> That's the fundamental design problem with your approach - it allows
> unwritten conversion at *minimum IO sizes* and that does not work
> with atomic IOs with larger alignment requirements.
> 
> The fundamental design principle is this: for maximally sized atomic
> writes to always succeed we require every allocation, zeroing and
> unwritten conversion operation to use alignments and sizes that are
> compatible with the maximum atomic write sizes being used.
> 

That sounds fine.

My question then is how we determine this max atomic write size granularity.

We don't explicitly tell the FS what atomic write size we want for a 
file. Rather we mkfs with some extsize value which should match our 
atomic write maximal value and then tell the FS we want to do atomic 
writes on a file, and if this is accepted then we can query the atomic 
write min and max unit size, and this would be [FS block size, min(bdev 
atomic write limit, rtextsize)].

If rtextsize is 16KB, then we have a good idea that we want 16KB atomic 
write support. So then we could use rtextsize as this max atomic write 
size. But I am not 100% sure that this is your idea (apologies if I am 
wrong - I am sincerely trying to follow it); perhaps instead it would be 
min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and 
the bdev atomic write limit is 16KB, then there is not much point in 
dealing in 1MB blocks for this unwritten extent conversion alignment. If 
so, then my concern is that the bdev atomic write upper limit is not 
fixed. This can be solved, but I would still like to be clear on this 
max atomic write size.

> i.e. atomic writes need to use max write size granularity for all IO
> operations, not filesystem block granularity.
> 
> And that also means things like rtextsize and extsize hints need to
> match these atomic write requirements, too....
> 

As above, I am not 100% sure if you mean these to be the atomic write 
maximal value.

>>> Allocation is aligned to 64kB, written
>>> region tracking is aligned to 4kB, and there's nothing to tell the
>>> filesystem that it should be maintaining 16kB "written alignment" so
>>> that 16kB atomic writes can always be issued atomically.

Please note that in my previous example the mkfs rtextsize arg should 
really have been 16KB, and that the intention would have been to enable 
16KB atomic writes. I used 64KB casually as I thought it should be 
possible to support sub-rtextsize atomic writes. The point which I was 
trying to make was that the 16KB atomic write and 4KB regular write 
intermixing was problematic.

>>>
>>> i.e. if we are going to do 16kB aligned atomic IO, then all the
>>> allocation and unwritten tracking needs to be done in 16kB aligned
>>> chunks, not 4kB. That means a 4KB write into an unwritten region or
>>> a hole actually needs to zero the rest of the 16KB range it sits
>>> within.
>>>
>>> The direct IO code can do this, but it needs extension of the
>>> unaligned IO serialisation in XFS (the alignment checks in
>>> xfs_file_dio_write()) and the the sub-block zeroing in
>>> iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
>>> allocation size, not the fsblock size) to do this safely.
>>>
>>> Regardless of how we do it, all IO concurrency on this file is shot
>>> if we have sub-rtextent sized IOs being done. That is true even with
>>> this patch set - XFS_BMAPI_ZERO is done whilst holding the
>>> XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
>>> zeroing is being done.
>>>
>>> IOWs, anything to do with sub-rtextent IO really has to be treated
>>> like sub-fsblock DIO - i.e. exclusive inode access until the
>>> sub-rtextent zeroing has been completed.
>>
>> I do understand that this is not perfect that we may have mixed block sizes
>> being written, but I don't think that we should disallow it and throw an
>> error.
> 
> Ummmm, did you read what you quoted?
> 
> The above is an outline of the IO path modifications that will allow
> mixed IO sizes to be used with atomic writes without requiring the
> XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment
> zeroing out to the existing DIO sub-block zeroing, hence ensuring
> that we only ever convert unwritten extents on max sized atomic
> write boundaries for atomic write enabled inodes.

ok, I get this idea. And, indeed, it does sound better than the 
XFS_BMAPI_ZERO proposal.

> 
> At no point have I said "no mixed writes".

For sure

> I've said no to the
> XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue
> that it works around and given you a decent amount of detail on how
> to sanely implementing mixed write support that will work (slowly)
> with those configurations and IO patterns.
> 
> So it's your choice - you can continue to believe I don't want mixed
> writes to work at all, or you can go back and try to understand the
> IO path changes I've suggested that will allow mixed atomic writes
> to work as well as they possibly can....
> 

Ack

Much appreciated,
John




* RE: [PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures
  2024-02-08 18:49  8% ` [PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures Mathieu Desnoyers
@ 2024-02-08 21:52  0%   ` Dan Williams
  0 siblings, 0 replies; 200+ results
From: Dan Williams @ 2024-02-08 21:52 UTC (permalink / raw)
  To: Mathieu Desnoyers, Dan Williams, Arnd Bergmann, Dave Chinner
  Cc: linux-kernel, Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Russell King,
	linux-arch, linux-cxl, linux-fsdevel, linux-mm, linux-xfs,
	dm-devel, nvdimm, linux-s390

Mathieu Desnoyers wrote:
> Introduce a generic way to query whether the data cache is virtually
> aliased on all architectures. Its purpose is to ensure that subsystems
> which are incompatible with virtually aliased data caches (e.g. FS_DAX)
> can reliably query this.
> 
> For data cache aliasing, there are three scenarios depending on the
> architecture. Here is a breakdown based on my understanding:
> 
> A) The data cache is always aliasing:
> 
> * arc
> * csky
> * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
> * sh
> * parisc
> 
> B) The data cache aliasing is statically known or depends on querying CPU
>    state at runtime:
> 
> * arm (cache_is_vivt() || cache_is_vipt_aliasing())
> * mips (cpu_has_dc_aliases)
> * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
> * sparc32 (vac_cache_size > PAGE_SIZE)
> * sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
> * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)
> 
> C) The data cache is never aliasing:
> 
> * alpha
> * arm64 (aarch64)
> * hexagon
> * loongarch (but with incoherent write buffers, which are disabled since
>              commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
> * microblaze
> * openrisc
> * powerpc
> * riscv
> * s390
> * um
> * x86
> 
> Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
> implement "cpu_dcache_is_aliasing()".
> 
> Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
> cpu_dcache_is_aliasing() simply evaluates to "false".
> 
> Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
> work. This would be useful to gate features like XIP on architectures
> which have aliasing CPU dcache-icache but not CPU dcache-dcache.
> 
> Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
> to clarify that we really mean "CPU data cache" and "CPU cache" to
> eliminate any possible confusion with VFS "dentry cache" and "page
> cache".
> 
> Link: https://lore.kernel.org/lkml/20030910210416.GA24258@mail.jlokier.co.uk/
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-cxl@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-xfs@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvdimm@lists.linux.dev
> ---
>  arch/arc/Kconfig                    |  1 +
>  arch/arc/include/asm/cachetype.h    |  9 +++++++++
>  arch/arm/Kconfig                    |  1 +
>  arch/arm/include/asm/cachetype.h    |  2 ++
>  arch/csky/Kconfig                   |  1 +
>  arch/csky/include/asm/cachetype.h   |  9 +++++++++
>  arch/m68k/Kconfig                   |  1 +
>  arch/m68k/include/asm/cachetype.h   |  9 +++++++++
>  arch/mips/Kconfig                   |  1 +
>  arch/mips/include/asm/cachetype.h   |  9 +++++++++
>  arch/nios2/Kconfig                  |  1 +
>  arch/nios2/include/asm/cachetype.h  | 10 ++++++++++
>  arch/parisc/Kconfig                 |  1 +
>  arch/parisc/include/asm/cachetype.h |  9 +++++++++
>  arch/sh/Kconfig                     |  1 +
>  arch/sh/include/asm/cachetype.h     |  9 +++++++++
>  arch/sparc/Kconfig                  |  1 +
>  arch/sparc/include/asm/cachetype.h  | 14 ++++++++++++++
>  arch/xtensa/Kconfig                 |  1 +
>  arch/xtensa/include/asm/cachetype.h | 10 ++++++++++
>  include/linux/cacheinfo.h           |  6 ++++++
>  mm/Kconfig                          |  6 ++++++
>  22 files changed, 112 insertions(+)
>  create mode 100644 arch/arc/include/asm/cachetype.h
>  create mode 100644 arch/csky/include/asm/cachetype.h
>  create mode 100644 arch/m68k/include/asm/cachetype.h
>  create mode 100644 arch/mips/include/asm/cachetype.h
>  create mode 100644 arch/nios2/include/asm/cachetype.h
>  create mode 100644 arch/parisc/include/asm/cachetype.h
>  create mode 100644 arch/sh/include/asm/cachetype.h
>  create mode 100644 arch/sparc/include/asm/cachetype.h
>  create mode 100644 arch/xtensa/include/asm/cachetype.h
> 
[..]
> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
> index d504eb4b49ab..2cb15fe4fe12 100644
> --- a/include/linux/cacheinfo.h
> +++ b/include/linux/cacheinfo.h
> @@ -138,4 +138,10 @@ static inline int get_cpu_cacheinfo_id(int cpu, int level)
>  #define use_arch_cache_info()	(false)
>  #endif
>  
> +#ifndef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
> +#define cpu_dcache_is_aliasing()	false
> +#else
> +#include <asm/cachetype.h>
> +#endif
> +
>  #endif /* _LINUX_CACHEINFO_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 57cd378c73d6..db09c9ad15c9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1016,6 +1016,12 @@ config IDLE_PAGE_TRACKING
>  	  See Documentation/admin-guide/mm/idle_page_tracking.rst for
>  	  more details.
>  
> +# Architectures which implement cpu_dcache_is_aliasing() to query
> +# whether the data caches are aliased (VIVT or VIPT with dcache
> +# aliasing) need to select this.
> +config ARCH_HAS_CPU_CACHE_ALIASING
> +	bool
> +
>  config ARCH_HAS_CACHE_LINE_SIZE
>  	bool

I can't speak to the specific arch changes, but the generic support
above looks ok.

If you get any pushback on the per-arch changes then maybe this could be
split into a patch that simply does the coarse-grained select of
CONFIG_ARCH_HAS_CPU_CACHE_ALIASING for ARM, MIPS, and SPARC. Then,
follow on with per-arch patches to do the more fine-grained option.

Certainly Andrew's tree is great for simultaneous cross arch changes
like this.

^ permalink raw reply	[relevance 0%]

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] tracing the source of errors
  2024-02-07 11:00  0% ` [Lsf-pc] " Jan Kara
@ 2024-02-08 20:39  0%   ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 200+ results
From: Gabriel Krisman Bertazi @ 2024-02-08 20:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, lsf-pc, linux-fsdevel, linux-kernel

Jan Kara <jack@suse.cz> writes:

> On Wed 07-02-24 10:54:34, Miklos Szeredi via Lsf-pc wrote:
>> [I'm not planning to attend LSF this year, but I thought this topic
>> might be of interest to those who will.]
>> 
>> The errno thing is really ancient and yet quite usable.  But when
>> trying to find out where a particular EINVAL is coming from, that's
>> often mission impossible.
>> 
>> Would it make sense to add infrastructure to allow tracing the source
>> of errors?  E.g.
>> 
>> strace --errno-trace ls -l foo
>> ...
>> statx(AT_FDCWD, "foo", ...) = -1 ENOENT [fs/namei.c:1852]
>> ...
>> 
>> Don't know about others, but this issue comes up quite often for me.
>
> Yes, having this available would be really useful at times. Sometimes I
> had to resort to kprobes or good old printks.
>
>> I would implement this with macros that record the place where a
>> particular error has originated, and some way to query the last one
>> (which wouldn't be 100% accurate, but good enough I guess).
>
> The problem always has been how to implement this functionality in a
> transparent way so the code does not become a mess. So if you have some
> idea, I'd say go for it :)

I had a proposal to provide the LoC of filesystem errors as part of an
extended record of the FAN_FS_ERROR messages (fanotify interface).  It
might be a sensible interface to expose this information if not
prohibitively expensive.

One might record the position with a macro and do the fsnotify_sb_error
from a safer context.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[relevance 0%]

* [PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures
  @ 2024-02-08 18:49  8% ` Mathieu Desnoyers
  2024-02-08 21:52  0%   ` Dan Williams
  0 siblings, 1 reply; 200+ results
From: Mathieu Desnoyers @ 2024-02-08 18:49 UTC (permalink / raw)
  To: Dan Williams, Arnd Bergmann, Dave Chinner
  Cc: linux-kernel, Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
	Vishal Verma, Dave Jiang, Matthew Wilcox, Russell King,
	linux-arch, linux-cxl, linux-fsdevel, linux-mm, linux-xfs,
	dm-devel, nvdimm, linux-s390

Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios depending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
             commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/20030910210416.GA24258@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arch@vger.kernel.org
Cc: linux-cxl@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvdimm@lists.linux.dev
---
 arch/arc/Kconfig                    |  1 +
 arch/arc/include/asm/cachetype.h    |  9 +++++++++
 arch/arm/Kconfig                    |  1 +
 arch/arm/include/asm/cachetype.h    |  2 ++
 arch/csky/Kconfig                   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +++++++++
 arch/m68k/Kconfig                   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +++++++++
 arch/mips/Kconfig                   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +++++++++
 arch/nios2/Kconfig                  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++++++++++
 arch/parisc/Kconfig                 |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +++++++++
 arch/sh/Kconfig                     |  1 +
 arch/sh/include/asm/cachetype.h     |  9 +++++++++
 arch/sparc/Kconfig                  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++++++++++++++
 arch/xtensa/Kconfig                 |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++++++++++
 include/linux/cacheinfo.h           |  6 ++++++
 mm/Kconfig                          |  6 ++++++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
 	def_bool y
 	select ARC_TIMERS
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_CACHE_LINE_SIZE
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index 000000000000..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
 	select ARCH_32BIT_OFF_T
 	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
 	select ARCH_HAS_BINFMT_FLAT
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h
index e8c30430be33..b9dbe1d4c8fe 100644
--- a/arch/arm/include/asm/cachetype.h
+++ b/arch/arm/include/asm/cachetype.h
@@ -20,6 +20,8 @@ extern unsigned int cacheid;
 #define icache_is_vipt_aliasing()	cacheid_is(CACHEID_VIPT_I_ALIASING)
 #define icache_is_pipt()		cacheid_is(CACHEID_PIPT)
 
+#define cpu_dcache_is_aliasing()	(cache_is_vivt() || cache_is_vipt_aliasing())
+
 /*
  * __LINUX_ARM_ARCH__ is the minimum supported CPU architecture
  * Mask out support which will never be present on newer CPUs.
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index cf2a6fd7dff8..8a91eccf76dc 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -2,6 +2,7 @@
 config CSKY
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_DMA_PREP_COHERENT
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_SYNC_DMA_FOR_CPU
diff --git a/arch/csky/include/asm/cachetype.h b/arch/csky/include/asm/cachetype.h
new file mode 100644
index 000000000000..98cbe3af662f
--- /dev/null
+++ b/arch/csky/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_CSKY_CACHETYPE_H
+#define __ASM_CSKY_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 4b3e93cac723..a9c3e3de0c6d 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -3,6 +3,7 @@ config M68K
 	bool
 	default y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_BINFMT_FLAT
 	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
diff --git a/arch/m68k/include/asm/cachetype.h b/arch/m68k/include/asm/cachetype.h
new file mode 100644
index 000000000000..7fad5d9ab8fe
--- /dev/null
+++ b/arch/m68k/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_M68K_CACHETYPE_H
+#define __ASM_M68K_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 797ae590ebdb..ab1c8bd96666 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -4,6 +4,7 @@ config MIPS
 	default y
 	select ARCH_32BIT_OFF_T if !64BIT
 	select ARCH_BINFMT_ELF_STATE if MIPS_FP_SUPPORT
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_CPU_FINALIZE_INIT
 	select ARCH_HAS_CURRENT_STACK_POINTER if !CC_IS_CLANG || CLANG_VERSION >= 140000
 	select ARCH_HAS_DEBUG_VIRTUAL if !64BIT
diff --git a/arch/mips/include/asm/cachetype.h b/arch/mips/include/asm/cachetype.h
new file mode 100644
index 000000000000..9f4ba2fe1155
--- /dev/null
+++ b/arch/mips/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_MIPS_CACHETYPE_H
+#define __ASM_MIPS_CACHETYPE_H
+
+#include <asm/cpu-features.h>
+
+#define cpu_dcache_is_aliasing()	cpu_has_dc_aliases
+
+#endif
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index d54464021a61..760fb541ecd2 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -2,6 +2,7 @@
 config NIOS2
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_DMA_PREP_COHERENT
 	select ARCH_HAS_SYNC_DMA_FOR_CPU
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
diff --git a/arch/nios2/include/asm/cachetype.h b/arch/nios2/include/asm/cachetype.h
new file mode 100644
index 000000000000..eb9c416b8a1c
--- /dev/null
+++ b/arch/nios2/include/asm/cachetype.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_NIOS2_CACHETYPE_H
+#define __ASM_NIOS2_CACHETYPE_H
+
+#include <asm/page.h>
+#include <asm/cache.h>
+
+#define cpu_dcache_is_aliasing()	(NIOS2_DCACHE_SIZE > PAGE_SIZE)
+
+#endif
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index d14ccc948a29..0f25c227f74b 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -8,6 +8,7 @@ config PARISC
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_SYSCALL_TRACEPOINTS
 	select ARCH_WANT_FRAME_POINTERS
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_DMA_ALLOC if PA11
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/parisc/include/asm/cachetype.h b/arch/parisc/include/asm/cachetype.h
new file mode 100644
index 000000000000..e0868a1d3c47
--- /dev/null
+++ b/arch/parisc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_PARISC_CACHETYPE_H
+#define __ASM_PARISC_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 7500521b2b98..2ad3e29f0ebe 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -2,6 +2,7 @@
 config SUPERH
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && MMU
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if SPARSEMEM && MMU
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A)
diff --git a/arch/sh/include/asm/cachetype.h b/arch/sh/include/asm/cachetype.h
new file mode 100644
index 000000000000..a5fffe536068
--- /dev/null
+++ b/arch/sh/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_SH_CACHETYPE_H
+#define __ASM_SH_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 49849790e66d..5ba627da15d7 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -13,6 +13,7 @@ config 64BIT
 config SPARC
 	bool
 	default y
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_MIGHT_HAVE_PC_PARPORT if SPARC64 && PCI
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select DMA_OPS
diff --git a/arch/sparc/include/asm/cachetype.h b/arch/sparc/include/asm/cachetype.h
new file mode 100644
index 000000000000..caf1c0045892
--- /dev/null
+++ b/arch/sparc/include/asm/cachetype.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_SPARC_CACHETYPE_H
+#define __ASM_SPARC_CACHETYPE_H
+
+#include <asm/page.h>
+
+#ifdef CONFIG_SPARC32
+extern int vac_cache_size;
+#define cpu_dcache_is_aliasing()	(vac_cache_size > PAGE_SIZE)
+#else
+#define cpu_dcache_is_aliasing()	(L1DCACHE_SIZE > PAGE_SIZE)
+#endif
+
+#endif
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 7d792077e5fd..2dfde54d1a84 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -2,6 +2,7 @@
 config XTENSA
 	def_bool y
 	select ARCH_32BIT_OFF_T
+	select ARCH_HAS_CPU_CACHE_ALIASING
 	select ARCH_HAS_BINFMT_FLAT if !MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VM_PGTABLE
diff --git a/arch/xtensa/include/asm/cachetype.h b/arch/xtensa/include/asm/cachetype.h
new file mode 100644
index 000000000000..51bd49e2a1c5
--- /dev/null
+++ b/arch/xtensa/include/asm/cachetype.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_XTENSA_CACHETYPE_H
+#define __ASM_XTENSA_CACHETYPE_H
+
+#include <asm/cache.h>
+#include <asm/page.h>
+
+#define cpu_dcache_is_aliasing()	(DCACHE_WAY_SIZE > PAGE_SIZE)
+
+#endif
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index d504eb4b49ab..2cb15fe4fe12 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -138,4 +138,10 @@ static inline int get_cpu_cacheinfo_id(int cpu, int level)
 #define use_arch_cache_info()	(false)
 #endif
 
+#ifndef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
+#define cpu_dcache_is_aliasing()	false
+#else
+#include <asm/cachetype.h>
+#endif
+
 #endif /* _LINUX_CACHEINFO_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 57cd378c73d6..db09c9ad15c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1016,6 +1016,12 @@ config IDLE_PAGE_TRACKING
 	  See Documentation/admin-guide/mm/idle_page_tracking.rst for
 	  more details.
 
+# Architectures which implement cpu_dcache_is_aliasing() to query
+# whether the data caches are aliased (VIVT or VIPT with dcache
+# aliasing) need to select this.
+config ARCH_HAS_CPU_CACHE_ALIASING
+	bool
+
 config ARCH_HAS_CACHE_LINE_SIZE
 	bool
 
-- 
2.39.2


^ permalink raw reply related	[relevance 8%]

* Re: [LSF/MM/BPF TOPIC] tracing the source of errors
  2024-02-07  9:54  5% [LSF/MM/BPF TOPIC] tracing the source of errors Miklos Szeredi
  2024-02-07 11:00  0% ` [Lsf-pc] " Jan Kara
@ 2024-02-07 17:16  0% ` Darrick J. Wong
  1 sibling, 0 replies; 200+ results
From: Darrick J. Wong @ 2024-02-07 17:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: lsf-pc, linux-kernel, linux-fsdevel

On Wed, Feb 07, 2024 at 10:54:34AM +0100, Miklos Szeredi wrote:
> [I'm not planning to attend LSF this year, but I thought this topic
> might be of interest to those who will.]
> 
> The errno thing is really ancient and yet quite usable.  But when
> trying to find out where a particular EINVAL is coming from, that's
> often mission impossible.
> 
> Would it make sense to add infrastructure to allow tracing the source
> of errors?  E.g.
> 
> strace --errno-trace ls -l foo
> ...
> statx(AT_FDCWD, "foo", ...) = -1 ENOENT [fs/namei.c:1852]
> ...
> 
> Don't know about others, but this issue comes up quite often for me.
> 
> I would implement this with macros that record the place where a
> particular error has originated, and some way to query the last one
> (which wouldn't be 100% accurate, but good enough I guess).

Hmmm, weren't Kent and Suren working on code tagging for memory
allocation profiling?  It would be kinda nice to wrap that up in the
error return paths as well.

Granted then we end up with some nasty macro mess like:

[Pretend that there are struct errno_tag, DEFINE_ALLOC_TAG, and
__alloc_tag_add symbols that look mostly like struct alloc_tag from [1]
and then (backslashes elided)]

#define Err(x)
({
	int __errno = (x);
	DEFINE_ERRNO_TAG(_errno_tag);

	trace_return_errno(__this_address, __errno);
	__errno_tag_add(&_errno_tag, __errno);
	__errno;
})

	foo = kmalloc(...);
	if (!foo)
		return Err(-ENOMEM);

or

	if (fs_is_messed_up())
		return Err(-EINVAL);

This would get us the ability to ftrace for where errno returns
initiate, as well as collect counters for how often we're actually
doing that in production.  You could even add time_stats too, but
annotating the entire kernel might be a stretch.

--D

[1] https://lwn.net/Articles/906660/

> Thanks,
> Miklos
> 

^ permalink raw reply	[relevance 0%]

* Re: Fanotify: concurrent work and handling files being executed
  @ 2024-02-07 11:15  5%           ` Amir Goldstein
  0 siblings, 0 replies; 200+ results
From: Amir Goldstein @ 2024-02-07 11:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: Sargun Dhillon, Linux FS-devel Mailing List, Sweet Tea Dorminy

On Wed, Feb 7, 2024 at 12:44 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 06-02-24 18:44:47, Amir Goldstein wrote:
> > On Tue, Feb 6, 2024 at 6:30 PM Sargun Dhillon <sargun@sargun.me> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 6:50 AM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 06-02-24 09:44:29, Amir Goldstein wrote:
> > > > > On Tue, Feb 6, 2024 at 1:24 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > > > >
> > > > > > One of the issues we've hit recently while using fanotify in an HSM is
> > > > > > racing with files that are opened for execution.
> > > > > >
> > > > > > There is a race that can result in ETXTBUSY.
> > > > > > Pid 1: You have a file marked with FAN_OPEN_EXEC_PERM.
> > > > > > Pid 2: execve(file_by_path)
> > > > > > Pid 1: gets notification, with file.fd
> > > > > > Pid 2: blocked, waiting for notification to resolve
> > > > > > Pid 1: Does work with FD (populates the file)
> > > > > > Pid 1: writes FAN_ALLOW to the fanotify file descriptor allowing the event.
> > > > > > Pid 2: continues, and falls through to deny_write_access (and fails)
> > > > > > Pid 1: closes fd
> > > >
> > > > Right, this is kind of nasty.
> > > >
> > > > > > Pid 1 can close the FD before responding, but this can result in a
> > > > > > race if fanotify is being handled in a multi-threaded
> > > > > > manner.
> > > >
> > > > Yep.
> > > >
> > > > > > I.e. if there are two threads operating on the same fanotify group,
> > > > > > and an event's FD has been closed, that can be reused
> > > > > > by another event. This is largely not a problem because the
> > > > > > outstanding events are added in a FIFO manner to the outstanding
> > > > > > event list, and as long as the earlier event is closed and responded
> > > > > > to without interruption, it should be okay, but it's difficult
> > > > > > to guarantee that this happens, unless event responses are serialized
> > > > > > in some fashion, with strict ordering between
> > > > > > responses.
> > > >
> > > > Yes, essentially you must make sure you will not read any new events from
> > > > the notification queue between fd close & writing of the response. Frankly,
> > > > I find this as quite ugly and asking for trouble (subtle bugs down the
> > > > line).
> > > >
> > > Is there a preference for either refactoring fanotify_event_metadata, or
> > > adding this new ID type as a piece of metadata?
> > >
> > > I almost feel like the FD should move to being metadata, and we should
> > > use ID in place of fd in fanotify_event_metadata. If we use an xarray,
> > > it should be reasonable to use a 32-bit identifier, so we don't need
> > > to modify the fanotify_event_metadata structure at all.
> >
> > I have a strong preference for FANOTIFY_METADATA_VERSION 4
> > because I really would like event->key to be 64bit and in the header,
> > but I have a feeling that Jan may have a different opinion..
>
> I also think 64-bit ID would be potentially more useful for userspace
(mostly because of guaranteed uniqueness). I'm just not yet sure how
you plan to use it for persistent events because there are several
> options how to generate the ID. I'd hate to add yet-another-id in the near
> future.

The use cases of permission events and persistent async events do not
overlap, and for the future they should not be mixed in the same group
at all. Mixing permission events and async events in the same group
was probably a mistake and we should not repeat it.

The event->id namespace is per group.
permission events are always realtime events so they do not persist
and event->id can be a runtime id used for responding.
A future group for persistent async events would be something like
https://learn.microsoft.com/en-us/windows/win32/fileio/change-journal-records
and then the 64-bit event->id would have a different meaning.
It means "ACK that events up to ID are consumed" or "query the events since ID".

>
> Regarding FANOTIFY_METADATA_VERSION 4: What are your reasons to want the ID
> in the header?

Just a matter of personal taste - I see event->id as fundamental
information and not "extra" information.

Looking forward at persistent events, it would be easier to iterate and skip
up to ID, if event->id is in the header.

> I think we'd need an explicit init flag to enable event ID
> reporting anyway? But admittedly I can see some appeal of having ID in the
> header if we are going to use the ID for matching responses to permission
> events.

Yes, as you wrote,
permission events with FAN_REPORT_FID are a clean start.
I don't mind if this setup requires an explicit FAN_REPORT_EVENT_ID flag.

Also, regarding your suggestion to report FID instead of event->fd,
I think we can do it like that:
1. With FAN_REPORT_FID in permission events, event->fd should
    NOT be used to access the file
2. Instead, it should be used as a mount_fd arg to open_by_handle_at()
    along with the reported fid
3. In that case, event->fd may be an O_PATH fd (we need a
    patch to allow O_PATH fd as mount_fd)
4. An fd that is open with open_by_handle_at(event->fd, ...
    will have the FMODE_NONOTIFY flag, so it is safe to access the file

This model also solves my problem with rename(), because a single
mount_fd could be used to open both old and new parent path fds.

Does that sound like a plan?

Thanks,
Amir.

^ permalink raw reply	[relevance 5%]

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] tracing the source of errors
  2024-02-07  9:54  5% [LSF/MM/BPF TOPIC] tracing the source of errors Miklos Szeredi
@ 2024-02-07 11:00  0% ` Jan Kara
  2024-02-08 20:39  0%   ` Gabriel Krisman Bertazi
  2024-02-07 17:16  0% ` Darrick J. Wong
  1 sibling, 1 reply; 200+ results
From: Jan Kara @ 2024-02-07 11:00 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: lsf-pc, linux-fsdevel, linux-kernel

On Wed 07-02-24 10:54:34, Miklos Szeredi via Lsf-pc wrote:
> [I'm not planning to attend LSF this year, but I thought this topic
> might be of interest to those who will.]
> 
> The errno thing is really ancient and yet quite usable.  But when
> trying to find out where a particular EINVAL is coming from, that's
> often mission impossible.
> 
> Would it make sense to add infrastructure to allow tracing the source
> of errors?  E.g.
> 
> strace --errno-trace ls -l foo
> ...
> statx(AT_FDCWD, "foo", ...) = -1 ENOENT [fs/namei.c:1852]
> ...
> 
> Don't know about others, but this issue comes up quite often for me.

Yes, having this available would be really useful at times. Sometimes I
had to resort to kprobes or good old printks.

> I would implement this with macros that record the place where a
> particular error has originated, and some way to query the last one
> (which wouldn't be 100% accurate, but good enough I guess).

The problem always has been how to implement this functionality in a
transparent way so the code does not become a mess. So if you have some
idea, I'd say go for it :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[relevance 0%]

* [LSF/MM/BPF TOPIC] tracing the source of errors
@ 2024-02-07  9:54  5% Miklos Szeredi
  2024-02-07 11:00  0% ` [Lsf-pc] " Jan Kara
  2024-02-07 17:16  0% ` Darrick J. Wong
  0 siblings, 2 replies; 200+ results
From: Miklos Szeredi @ 2024-02-07  9:54 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-kernel, linux-fsdevel

[I'm not planning to attend LSF this year, but I thought this topic
might be of interest to those who will.]

The errno thing is really ancient and yet quite usable.  But when
trying to find out where a particular EINVAL is coming from, that's
often mission impossible.

Would it make sense to add infrastructure to allow tracing the source
of errors?  E.g.

strace --errno-trace ls -l foo
...
statx(AT_FDCWD, "foo", ...) = -1 ENOENT [fs/namei.c:1852]
...

Don't know about others, but this issue comes up quite often for me.

I would implement this with macros that record the place where a
particular error has originated, and some way to query the last one
(which wouldn't be 100% accurate, but good enough I guess).

Thanks,
Miklos

^ permalink raw reply	[relevance 5%]

* Re: [PATCH v3 3/7] fs: FS_IOC_GETUUID
  2024-02-07  6:41  0%   ` Amir Goldstein
@ 2024-02-07  6:46  0%     ` Amir Goldstein
  0 siblings, 0 replies; 200+ results
From: Amir Goldstein @ 2024-02-07  6:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-fsdevel, brauner, linux-btrfs, Jan Kara, Dave Chinner,
	Darrick J. Wong, Theodore Ts'o

On Wed, Feb 7, 2024 at 8:41 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 4:57 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > Add a new generic ioctls for querying the filesystem UUID.
> >
> > These are lifted versions of the ext4 ioctls, with one change: we're not
> > using a flexible array member, because UUIDs will never be more than 16
> > bytes.
> >
> > This patch adds a generic implementation of FS_IOC_GETFSUUID, which
> > reads from super_block->s_uuid. We're not lifting SETFSUUID from ext4 -
> > that can be done on offline filesystems by the people who need it,
> > trying to do it online is just asking for too much trouble.
> >
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dave Chinner <dchinner@redhat.com>
> > Cc: "Darrick J. Wong" <djwong@kernel.org>
> > Cc: Theodore Ts'o <tytso@mit.edu>
> > Cc: linux-fsdevel@vger.kernel.or

typo in list address.

Thanks,
Amir.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v3 3/7] fs: FS_IOC_GETUUID
  2024-02-07  2:56  4% ` [PATCH v3 3/7] fs: FS_IOC_GETUUID Kent Overstreet
@ 2024-02-07  6:41  0%   ` Amir Goldstein
  2024-02-07  6:46  0%     ` Amir Goldstein
  0 siblings, 1 reply; 200+ results
From: Amir Goldstein @ 2024-02-07  6:41 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-fsdevel, brauner, linux-btrfs, Jan Kara, Dave Chinner,
	Darrick J. Wong, Theodore Ts'o, linux-fsdevel

On Wed, Feb 7, 2024 at 4:57 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> Add a new generic ioctls for querying the filesystem UUID.
>
> These are lifted versions of the ext4 ioctls, with one change: we're not
> using a flexible array member, because UUIDs will never be more than 16
> bytes.
>
> This patch adds a generic implementation of FS_IOC_GETFSUUID, which
> reads from super_block->s_uuid. We're not lifting SETFSUUID from ext4 -
> that can be done on offline filesystems by the people who need it,
> trying to do it online is just asking for too much trouble.
>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.or
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  .../userspace-api/ioctl/ioctl-number.rst         |  3 ++-
>  fs/ioctl.c                                       | 16 ++++++++++++++++
>  include/uapi/linux/fs.h                          | 16 ++++++++++++++++
>  3 files changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 457e16f06e04..3731ecf1e437 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -82,8 +82,9 @@ Code  Seq#    Include File                                           Comments
>  0x10  00-0F  drivers/char/s390/vmcp.h
>  0x10  10-1F  arch/s390/include/uapi/sclp_ctl.h
>  0x10  20-2F  arch/s390/include/uapi/asm/hypfs.h
> -0x12  all    linux/fs.h
> +0x12  all    linux/fs.h                                              BLK* ioctls
>               linux/blkpg.h
> +0x15  all    linux/fs.h                                              FS_IOC_* ioctls
>  0x1b  all                                                            InfiniBand Subsystem
>                                                                       <http://infiniband.sourceforge.net/>
>  0x20  all    drivers/cdrom/cm206.h
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 76cf22ac97d7..74eab9549383 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -763,6 +763,19 @@ static int ioctl_fssetxattr(struct file *file, void __user *argp)
>         return err;
>  }
>
> +static int ioctl_getfsuuid(struct file *file, void __user *argp)
> +{
> +       struct super_block *sb = file_inode(file)->i_sb;
> +       struct fsuuid2 u = { .len = sb->s_uuid_len, };
> +
> +       if (!sb->s_uuid_len)
> +               return -ENOIOCTLCMD;
> +
> +       memcpy(&u.uuid[0], &sb->s_uuid, sb->s_uuid_len);
> +
> +       return copy_to_user(argp, &u, sizeof(u)) ? -EFAULT : 0;
> +}
> +
>  /*
>   * do_vfs_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d.
>   * It's just a simple helper for sys_ioctl and compat_sys_ioctl.
> @@ -845,6 +858,9 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
>         case FS_IOC_FSSETXATTR:
>                 return ioctl_fssetxattr(filp, argp);
>
> +       case FS_IOC_GETFSUUID:
> +               return ioctl_getfsuuid(filp, argp);
> +
>         default:
>                 if (S_ISREG(inode->i_mode))
>                         return file_ioctl(filp, cmd, argp);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 48ad69f7722e..d459f816cd50 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,19 @@ struct fstrim_range {
>         __u64 minlen;
>  };
>
> +/*
> + * We include a length field because some filesystems (vfat) have an identifier
> + * that we do want to expose as a UUID, but doesn't have the standard length.
> + *
> + * We use a fixed size buffer beacuse this interface will, by fiat, never
> + * support "UUIDs" longer than 16 bytes; we don't want to force all downstream
> + * users to have to deal with that.
> + */
> +struct fsuuid2 {
> +       __u8    len;
> +       __u8    uuid[16];
> +};
> +
>  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
>  #define FILE_DEDUPE_RANGE_SAME         0
>  #define FILE_DEDUPE_RANGE_DIFFERS      1
> @@ -190,6 +203,9 @@ struct fsxattr {
>   * (see uapi/linux/blkzoned.h)
>   */
>
> +/* Returns the external filesystem UUID, the same one blkid returns */
> +#define FS_IOC_GETFSUUID               _IOR(0x15, 0, struct fsuuid2)

Please move that to the end of the FS_IOC_* ioctls block.
The fact that it started a new vfs ioctl namespace does not justify
starting a different list, IMO.

uapi readers don't care about the value of the ioctl;
locality to FS_IOC_GETFSLABEL is more important IMO.

Thanks,
Amir.


> +
>  #define BMAP_IOCTL 1           /* obsolete - kept for compatibility */
>  #define FIBMAP    _IO(0x00,1)  /* bmap access */
>  #define FIGETBSZ   _IO(0x00,2) /* get the block size used for bmap */
> --
> 2.43.0
>
>

^ permalink raw reply	[relevance 0%]

* [PATCH v3 4/7] fat: Hook up sb->s_uuid
    2024-02-07  2:56  4% ` [PATCH v3 3/7] fs: FS_IOC_GETUUID Kent Overstreet
@ 2024-02-07  2:56  5% ` Kent Overstreet
  1 sibling, 0 replies; 200+ results
From: Kent Overstreet @ 2024-02-07  2:56 UTC (permalink / raw)
  To: linux-fsdevel, brauner; +Cc: Kent Overstreet, linux-btrfs

Now that we have a standard ioctl for querying the filesystem UUID,
initialize sb->s_uuid so that it works.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/fat/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 1fac3dabf130..5c813696d1ff 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -1762,6 +1762,9 @@ int fat_fill_super(struct super_block *sb, void *data, int silent, int isvfat,
 	else /* fat 16 or 12 */
 		sbi->vol_id = bpb.fat16_vol_id;
 
+	__le32 vol_id_le = cpu_to_le32(sbi->vol_id);
+	super_set_uuid(sb, (void *) &vol_id_le, sizeof(vol_id_le));
+
 	sbi->dir_per_block = sb->s_blocksize / sizeof(struct msdos_dir_entry);
 	sbi->dir_per_block_bits = ffs(sbi->dir_per_block) - 1;
 
-- 
2.43.0


^ permalink raw reply related	[relevance 5%]

* [PATCH v3 3/7] fs: FS_IOC_GETUUID
  @ 2024-02-07  2:56  4% ` Kent Overstreet
  2024-02-07  6:41  0%   ` Amir Goldstein
  2024-02-07  2:56  5% ` [PATCH v3 4/7] fat: Hook up sb->s_uuid Kent Overstreet
  1 sibling, 1 reply; 200+ results
From: Kent Overstreet @ 2024-02-07  2:56 UTC (permalink / raw)
  To: linux-fsdevel, brauner
  Cc: Kent Overstreet, linux-btrfs, Jan Kara, Dave Chinner,
	Darrick J. Wong, Theodore Ts'o, linux-fsdevel

Add new generic ioctls for querying the filesystem UUID.

These are lifted versions of the ext4 ioctls, with one change: we're not
using a flexible array member, because UUIDs will never be more than 16
bytes.

This patch adds a generic implementation of FS_IOC_GETFSUUID, which
reads from super_block->s_uuid. We're not lifting SETFSUUID from ext4 -
that can be done on offline filesystems by the people who need it,
trying to do it online is just asking for too much trouble.

Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 .../userspace-api/ioctl/ioctl-number.rst         |  3 ++-
 fs/ioctl.c                                       | 16 ++++++++++++++++
 include/uapi/linux/fs.h                          | 16 ++++++++++++++++
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 457e16f06e04..3731ecf1e437 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -82,8 +82,9 @@ Code  Seq#    Include File                                           Comments
 0x10  00-0F  drivers/char/s390/vmcp.h
 0x10  10-1F  arch/s390/include/uapi/sclp_ctl.h
 0x10  20-2F  arch/s390/include/uapi/asm/hypfs.h
-0x12  all    linux/fs.h
+0x12  all    linux/fs.h                                              BLK* ioctls
              linux/blkpg.h
+0x15  all    linux/fs.h                                              FS_IOC_* ioctls
 0x1b  all                                                            InfiniBand Subsystem
                                                                      <http://infiniband.sourceforge.net/>
 0x20  all    drivers/cdrom/cm206.h
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 76cf22ac97d7..74eab9549383 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -763,6 +763,19 @@ static int ioctl_fssetxattr(struct file *file, void __user *argp)
 	return err;
 }
 
+static int ioctl_getfsuuid(struct file *file, void __user *argp)
+{
+	struct super_block *sb = file_inode(file)->i_sb;
+	struct fsuuid2 u = { .len = sb->s_uuid_len, };
+
+	if (!sb->s_uuid_len)
+		return -ENOIOCTLCMD;
+
+	memcpy(&u.uuid[0], &sb->s_uuid, sb->s_uuid_len);
+
+	return copy_to_user(argp, &u, sizeof(u)) ? -EFAULT : 0;
+}
+
 /*
  * do_vfs_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d.
  * It's just a simple helper for sys_ioctl and compat_sys_ioctl.
@@ -845,6 +858,9 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
 	case FS_IOC_FSSETXATTR:
 		return ioctl_fssetxattr(filp, argp);
 
+	case FS_IOC_GETFSUUID:
+		return ioctl_getfsuuid(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			return file_ioctl(filp, cmd, argp);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 48ad69f7722e..d459f816cd50 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,19 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+/*
+ * We include a length field because some filesystems (vfat) have an identifier
+ * that we do want to expose as a UUID, but doesn't have the standard length.
+ *
+ * We use a fixed size buffer beacuse this interface will, by fiat, never
+ * support "UUIDs" longer than 16 bytes; we don't want to force all downstream
+ * users to have to deal with that.
+ */
+struct fsuuid2 {
+	__u8	len;
+	__u8	uuid[16];
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
@@ -190,6 +203,9 @@ struct fsxattr {
  * (see uapi/linux/blkzoned.h)
  */
 
+/* Returns the external filesystem UUID, the same one blkid returns */
+#define FS_IOC_GETFSUUID		_IOR(0x15, 0, struct fsuuid2)
+
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
-- 
2.43.0


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH v2 3/7] fs: FS_IOC_GETUUID
  2024-02-06 20:18  4% ` [PATCH v2 3/7] fs: FS_IOC_GETUUID Kent Overstreet
  2024-02-06 20:29  0%   ` Randy Dunlap
@ 2024-02-06 22:01  0%   ` Dave Chinner
  1 sibling, 0 replies; 200+ results
From: Dave Chinner @ 2024-02-06 22:01 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: brauner, linux-fsdevel, linux-kernel, Jan Kara, Dave Chinner,
	Darrick J. Wong, Theodore Ts'o, linux-fsdevel

On Tue, Feb 06, 2024 at 03:18:51PM -0500, Kent Overstreet wrote:
> Add a new generic ioctls for querying the filesystem UUID.
> 
> These are lifted versions of the ext4 ioctls, with one change: we're not
> using a flexible array member, because UUIDs will never be more than 16
> bytes.
> 
> This patch adds a generic implementation of FS_IOC_GETFSUUID, which
> reads from super_block->s_uuid. We're not lifting SETFSUUID from ext4 -
> that can be done on offline filesystems by the people who need it,
> trying to do it online is just asking for too much trouble.
> 
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.or
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  fs/ioctl.c              | 16 ++++++++++++++++
>  include/uapi/linux/fs.h | 17 +++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 76cf22ac97d7..046c30294a82 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -763,6 +763,19 @@ static int ioctl_fssetxattr(struct file *file, void __user *argp)
>  	return err;
>  }
>  
> +static int ioctl_getfsuuid(struct file *file, void __user *argp)
> +{
> +	struct super_block *sb = file_inode(file)->i_sb;
> +
> +	if (!sb->s_uuid_len)
> +		return -ENOIOCTLCMD;
> +
> +	struct fsuuid2 u = { .len = sb->s_uuid_len, };
> +	memcpy(&u.uuid[0], &sb->s_uuid, sb->s_uuid_len);
> +
> +	return copy_to_user(argp, &u, sizeof(u)) ? -EFAULT : 0;
> +}

Can we please keep the declarations separate from the code? I always
find this sort of implicit scoping of variables both difficult to
read (especially in larger functions) and a landmine waiting to be
tripped over. This could easily just be:

static int ioctl_getfsuuid(struct file *file, void __user *argp)
{
	struct super_block *sb = file_inode(file)->i_sb;
	struct fsuuid2 u = { .len = sb->s_uuid_len, };

	....

and then it's consistent with all the rest of the code...

> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 48ad69f7722e..16a6ecadfd8d 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,19 @@ struct fstrim_range {
>  	__u64 minlen;
>  };
>  
> +/*
> + * We include a length field because some filesystems (vfat) have an identifier
> + * that we do want to expose as a UUID, but doesn't have the standard length.
> + *
> + * We use a fixed size buffer beacuse this interface will, by fiat, never
> + * support "UUIDs" longer than 16 bytes; we don't want to force all downstream
> + * users to have to deal with that.
> + */
> +struct fsuuid2 {
> +	__u8	len;
> +	__u8	uuid[16];
> +};
> +
>  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
>  #define FILE_DEDUPE_RANGE_SAME		0
>  #define FILE_DEDUPE_RANGE_DIFFERS	1
> @@ -190,6 +203,9 @@ struct fsxattr {
>   * (see uapi/linux/blkzoned.h)
>   */
>  
> +/* Returns the external filesystem UUID, the same one blkid returns */
> +#define FS_IOC_GETFSUUID		_IOR(0x12, 142, struct fsuuid2)
> +

Can you add a comment somewhere in the file saying that new VFS
ioctls should use the "0x12" namespace in the range 142-255, and
mention that BLK ioctls should be kept within the 0x12 {0-141}
range?

Probably also document this clearly in
Documentation/userspace-api/ioctl/ioctl-number.rst, too?

-Dave.


-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2 3/7] fs: FS_IOC_GETUUID
  2024-02-06 20:18  4% ` [PATCH v2 3/7] fs: FS_IOC_GETUUID Kent Overstreet
@ 2024-02-06 20:29  0%   ` Randy Dunlap
  2024-02-06 22:01  0%   ` Dave Chinner
  1 sibling, 0 replies; 200+ results
From: Randy Dunlap @ 2024-02-06 20:29 UTC (permalink / raw)
  To: Kent Overstreet, brauner, linux-fsdevel, linux-kernel
  Cc: Jan Kara, Dave Chinner, Darrick J. Wong, Theodore Ts'o,
	linux-fsdevel



On 2/6/24 12:18, Kent Overstreet wrote:
> Add a new generic ioctls for querying the filesystem UUID.
> 
> These are lifted versions of the ext4 ioctls, with one change: we're not
> using a flexible array member, because UUIDs will never be more than 16
> bytes.
> 
> This patch adds a generic implementation of FS_IOC_GETFSUUID, which
> reads from super_block->s_uuid. We're not lifting SETFSUUID from ext4 -
> that can be done on offline filesystems by the people who need it,
> trying to do it online is just asking for too much trouble.
> 
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: linux-fsdevel@vger.kernel.or
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  fs/ioctl.c              | 16 ++++++++++++++++
>  include/uapi/linux/fs.h | 17 +++++++++++++++++
>  2 files changed, 33 insertions(+)
> 


> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 48ad69f7722e..16a6ecadfd8d 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,19 @@ struct fstrim_range {
>  	__u64 minlen;
>  };
>  
> +/*
> + * We include a length field because some filesystems (vfat) have an identifier
> + * that we do want to expose as a UUID, but doesn't have the standard length.
> + *
> + * We use a fixed size buffer beacuse this interface will, by fiat, never

                                 because

> + * support "UUIDs" longer than 16 bytes; we don't want to force all downstream
> + * users to have to deal with that.
> + */
> +struct fsuuid2 {
> +	__u8	len;
> +	__u8	uuid[16];
> +};
> +
>  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
>  #define FILE_DEDUPE_RANGE_SAME		0
>  #define FILE_DEDUPE_RANGE_DIFFERS	1
> @@ -190,6 +203,9 @@ struct fsxattr {
>   * (see uapi/linux/blkzoned.h)
>   */
>  
> +/* Returns the external filesystem UUID, the same one blkid returns */
> +#define FS_IOC_GETFSUUID		_IOR(0x12, 142, struct fsuuid2)
> +
>  #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
>  #define FIBMAP	   _IO(0x00,1)	/* bmap access */
>  #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
> @@ -198,6 +214,7 @@ struct fsxattr {
>  #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
>  #define FICLONE		_IOW(0x94, 9, int)
>  #define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
> +
>  #define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)

Why the additional blank line? (nit)

>  
>  #define FSLABEL_MAX 256	/* Max chars for the interface; each fs may differ */

-- 
#Randy

^ permalink raw reply	[relevance 0%]

* [PATCH v2 4/7] fat: Hook up sb->s_uuid
    2024-02-06 20:18  4% ` [PATCH v2 3/7] fs: FS_IOC_GETUUID Kent Overstreet
@ 2024-02-06 20:18  5% ` Kent Overstreet
  1 sibling, 0 replies; 200+ results
From: Kent Overstreet @ 2024-02-06 20:18 UTC (permalink / raw)
  To: brauner, linux-fsdevel, linux-kernel; +Cc: Kent Overstreet

Now that we have a standard ioctl for querying the filesystem UUID,
initialize sb->s_uuid so that it works.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/fat/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 1fac3dabf130..5c813696d1ff 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -1762,6 +1762,9 @@ int fat_fill_super(struct super_block *sb, void *data, int silent, int isvfat,
 	else /* fat 16 or 12 */
 		sbi->vol_id = bpb.fat16_vol_id;
 
+	__le32 vol_id_le = cpu_to_le32(sbi->vol_id);
+	super_set_uuid(sb, (void *) &vol_id_le, sizeof(vol_id_le));
+
 	sbi->dir_per_block = sb->s_blocksize / sizeof(struct msdos_dir_entry);
 	sbi->dir_per_block_bits = ffs(sbi->dir_per_block) - 1;
 
-- 
2.43.0


^ permalink raw reply related	[relevance 5%]

* [PATCH v2 3/7] fs: FS_IOC_GETUUID
  @ 2024-02-06 20:18  4% ` Kent Overstreet
  2024-02-06 20:29  0%   ` Randy Dunlap
  2024-02-06 22:01  0%   ` Dave Chinner
  2024-02-06 20:18  5% ` [PATCH v2 4/7] fat: Hook up sb->s_uuid Kent Overstreet
  1 sibling, 2 replies; 200+ results
From: Kent Overstreet @ 2024-02-06 20:18 UTC (permalink / raw)
  To: brauner, linux-fsdevel, linux-kernel
  Cc: Kent Overstreet, Jan Kara, Dave Chinner, Darrick J. Wong,
	Theodore Ts'o, linux-fsdevel

Add new generic ioctls for querying the filesystem UUID.

These are lifted versions of the ext4 ioctls, with one change: we're not
using a flexible array member, because UUIDs will never be more than 16
bytes.

This patch adds a generic implementation of FS_IOC_GETFSUUID, which
reads from super_block->s_uuid. We're not lifting SETFSUUID from ext4 -
that can be done on offline filesystems by the people who need it,
trying to do it online is just asking for too much trouble.

Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/ioctl.c              | 16 ++++++++++++++++
 include/uapi/linux/fs.h | 17 +++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 76cf22ac97d7..046c30294a82 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -763,6 +763,19 @@ static int ioctl_fssetxattr(struct file *file, void __user *argp)
 	return err;
 }
 
+static int ioctl_getfsuuid(struct file *file, void __user *argp)
+{
+	struct super_block *sb = file_inode(file)->i_sb;
+
+	if (!sb->s_uuid_len)
+		return -ENOIOCTLCMD;
+
+	struct fsuuid2 u = { .len = sb->s_uuid_len, };
+	memcpy(&u.uuid[0], &sb->s_uuid, sb->s_uuid_len);
+
+	return copy_to_user(argp, &u, sizeof(u)) ? -EFAULT : 0;
+}
+
 /*
  * do_vfs_ioctl() is not for drivers and not intended to be EXPORT_SYMBOL()'d.
  * It's just a simple helper for sys_ioctl and compat_sys_ioctl.
@@ -845,6 +858,9 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
 	case FS_IOC_FSSETXATTR:
 		return ioctl_fssetxattr(filp, argp);
 
+	case FS_IOC_GETFSUUID:
+		return ioctl_getfsuuid(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			return file_ioctl(filp, cmd, argp);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 48ad69f7722e..16a6ecadfd8d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,19 @@ struct fstrim_range {
 	__u64 minlen;
 };
 
+/*
+ * We include a length field because some filesystems (vfat) have an identifier
+ * that we do want to expose as a UUID, but doesn't have the standard length.
+ *
+ * We use a fixed size buffer beacuse this interface will, by fiat, never
+ * support "UUIDs" longer than 16 bytes; we don't want to force all downstream
+ * users to have to deal with that.
+ */
+struct fsuuid2 {
+	__u8	len;
+	__u8	uuid[16];
+};
+
 /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
 #define FILE_DEDUPE_RANGE_SAME		0
 #define FILE_DEDUPE_RANGE_DIFFERS	1
@@ -190,6 +203,9 @@ struct fsxattr {
  * (see uapi/linux/blkzoned.h)
  */
 
+/* Returns the external filesystem UUID, the same one blkid returns */
+#define FS_IOC_GETFSUUID		_IOR(0x12, 142, struct fsuuid2)
+
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
@@ -198,6 +214,7 @@ struct fsxattr {
 #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
 #define FICLONE		_IOW(0x94, 9, int)
 #define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
+
 #define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)
 
 #define FSLABEL_MAX 256	/* Max chars for the interface; each fs may differ */
-- 
2.43.0


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH 2/6] fs: FS_IOC_GETUUID
  2024-02-06  8:24  0%       ` Amir Goldstein
@ 2024-02-06  9:00  0%         ` Kent Overstreet
  0 siblings, 0 replies; 200+ results
From: Kent Overstreet @ 2024-02-06  9:00 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-btrfs,
	linux-xfs, linux-ext4, Christian Brauner, Jan Kara, Dave Chinner,
	Darrick J. Wong, Theodore Ts'o, linux-fsdevel,
	Miklos Szeredi

On Tue, Feb 06, 2024 at 10:24:45AM +0200, Amir Goldstein wrote:
> On Tue, Feb 6, 2024 at 12:49 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Tue, Feb 06, 2024 at 09:17:58AM +1100, Dave Chinner wrote:
> > > On Mon, Feb 05, 2024 at 03:05:13PM -0500, Kent Overstreet wrote:
> > > > Add a new generic ioctls for querying the filesystem UUID.
> > > >
> > > > These are lifted versions of the ext4 ioctls, with one change: we're not
> > > > using a flexible array member, because UUIDs will never be more than 16
> > > > bytes.
> > > >
> > > > This patch adds a generic implementation of FS_IOC_GETFSUUID, which
> > > > reads from super_block->s_uuid; FS_IOC_SETFSUUID is left for individual
> > > > filesystems to implement.
> > > >
> 
> It's fine to have a generic implementation, but the filesystem should
> have the option to opt-in for a specific implementation.
> 
> There are several examples, even with xfs and btrfs where ->s_uuid
> does not contain the filesystem's UUID or there is more than one
> uuid and ->s_uuid is not the correct one to expose to the user.

Yeah, some of you were smoking some good stuff from the stories I've
been hearing...

> A model like ioctl_[gs]etflags() looks much more appropriate
> and could be useful for network filesystems/FUSE as well.

A filesystem needs to store two UUIDs (that identify the filesystem as a
whole).

 - Your internal UUID, which can never change because it's referenced in
   various other on disk data structures
 - Your external UUID, which identifies the filesystem to the outside
   world. Users want to be able to change this - which is why it has to
   be distinct from the internal UUID.

The internal UUID must never be exposed to the outside world, and that
includes the VFS; storing your private UUID in sb->s_uuid is wrong -
separation of concerns.

yes, I am aware of fscrypt, and yes, someone's going to have to fix
that.

This interface is only for the external/public UUID.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/6] fs: FS_IOC_GETUUID
  @ 2024-02-06  8:24  0%       ` Amir Goldstein
  2024-02-06  9:00  0%         ` Kent Overstreet
  0 siblings, 1 reply; 200+ results
From: Amir Goldstein @ 2024-02-06  8:24 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-btrfs,
	linux-xfs, linux-ext4, Christian Brauner, Jan Kara, Dave Chinner,
	Darrick J. Wong, Theodore Ts'o, linux-fsdevel,
	Miklos Szeredi

On Tue, Feb 6, 2024 at 12:49 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Tue, Feb 06, 2024 at 09:17:58AM +1100, Dave Chinner wrote:
> > On Mon, Feb 05, 2024 at 03:05:13PM -0500, Kent Overstreet wrote:
> > > Add a new generic ioctls for querying the filesystem UUID.
> > >
> > > These are lifted versions of the ext4 ioctls, with one change: we're not
> > > using a flexible array member, because UUIDs will never be more than 16
> > > bytes.
> > >
> > > This patch adds a generic implementation of FS_IOC_GETFSUUID, which
> > > reads from super_block->s_uuid; FS_IOC_SETFSUUID is left for individual
> > > filesystems to implement.
> > >

It's fine to have a generic implementation, but the filesystem should
have the option to opt-in for a specific implementation.

There are several examples, even with xfs and btrfs where ->s_uuid
does not contain the filesystem's UUID or there is more than one
uuid and ->s_uuid is not the correct one to expose to the user.

A model like ioctl_[gs]etflags() looks much more appropriate
and could be useful for network filesystems/FUSE as well.

> > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Dave Chinner <dchinner@redhat.com>
> > > Cc: "Darrick J. Wong" <djwong@kernel.org>
> > > Cc: Theodore Ts'o <tytso@mit.edu>
> > > Cc: linux-fsdevel@vger.kernel.or
> > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > > ---
> > >  fs/ioctl.c              | 16 ++++++++++++++++
> > >  include/uapi/linux/fs.h | 16 ++++++++++++++++
> > >  2 files changed, 32 insertions(+)
> > >
> > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > index 76cf22ac97d7..858801060408 100644
> > > --- a/fs/ioctl.c
> > > +++ b/fs/ioctl.c
> > > @@ -763,6 +763,19 @@ static int ioctl_fssetxattr(struct file *file, void __user *argp)
> > >     return err;
> > >  }
> > >
> > > +static int ioctl_getfsuuid(struct file *file, void __user *argp)
> > > +{
> > > +   struct super_block *sb = file_inode(file)->i_sb;
> > > +
> > > +   if (WARN_ON(sb->s_uuid_len > sizeof(sb->s_uuid)))
> > > +           sb->s_uuid_len = sizeof(sb->s_uuid);
> >
> > A "get"/read only ioctl should not be change superblock fields -
> > this is not the place for enforcing superblock filed constraints.
> > Make a helper function super_set_uuid(sb, uuid, uuid_len) for the
> > filesystems to call that does all the validity checking and then
> > sets the superblock fields appropriately.
>
> *nod* good thought...
>
> > > +struct fsuuid2 {
> > > +   __u32       fsu_len;
> > > +   __u32       fsu_flags;
> > > +   __u8        fsu_uuid[16];
> > > +};
> >
> > Nobody in userspace will care that this is "version 2" of the ext4
> > ioctl. I'd just name it "fs_uuid" as though the ext4 version didn't
> > ever exist.
>
> I considered that - but I decided I wanted the explicit versioning,
> because too often we live with unfixed mistakes because versioning is
> ugly, or something?
>
> Doing a new revision of an API should be a normal, frequent thing, and I
> want to start making it a convention.
>
> >
> > > +
> > >  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
> > >  #define FILE_DEDUPE_RANGE_SAME             0
> > >  #define FILE_DEDUPE_RANGE_DIFFERS  1
> > > @@ -215,6 +229,8 @@ struct fsxattr {
> > >  #define FS_IOC_FSSETXATTR          _IOW('X', 32, struct fsxattr)
> > >  #define FS_IOC_GETFSLABEL          _IOR(0x94, 49, char[FSLABEL_MAX])
> > >  #define FS_IOC_SETFSLABEL          _IOW(0x94, 50, char[FSLABEL_MAX])
> > > +#define FS_IOC_GETFSUUID           _IOR(0x94, 51, struct fsuuid2)
> > > +#define FS_IOC_SETFSUUID           _IOW(0x94, 52, struct fsuuid2)
> >
> > 0x94 is the btrfs ioctl space, not the VFS space - why did you
> > choose that? That said, what is the VFS ioctl space identifier? 'v',
> > perhaps?
>
> "Promoting ioctls from fs to vfs without revising and renaming
> considered harmful"... this is a mess that could have been avoided if we
> weren't taking the lazy route.
>
> And 'v' doesn't look like it to me, I really have no idea what to use
> here. Does anyone?
>

All the other hoisted FS_IOC_* use the original fs ioctl namespace they
came from. Although it is not an actual hoist, I'd use:

struct fsuuid128 {
       __u32       fsu_len;
       __u32       fsu_flags;
       __u8        fsu_uuid[16];
};

#define FS_IOC_GETFSUUID              _IOR('f', 45, struct fsuuid128)
#define FS_IOC_SETFSUUID              _IOW('f', 46, struct fsuuid128)

Technically, could also overload EXT4_IOC_[GS]ETFSUUID numbers
because of the different type:

#define FS_IOC_GETFSUUID              _IOR('f', 44, struct fsuuid128)
#define FS_IOC_SETFSUUID              _IOW('f', 44, struct fsuuid128)

and then ext4 can follow up with this patch, because as far as I can tell,
the ext4 implementation is already compatible with the new ioctls.

Thanks,
Amir.

--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -1613,8 +1613,10 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		return ext4_ioctl_setlabel(filp,
 					   (const void __user *)arg);
 
+	case FS_IOC_GETFSUUID:
 	case EXT4_IOC_GETFSUUID:
 		return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg);
+	case FS_IOC_SETFSUUID:
 	case EXT4_IOC_SETFSUUID:
 		return ext4_ioctl_setuuid(filp, (const void __user *)arg);

^ permalink raw reply	[relevance 0%]

2023-10-16 11:50     [PATCH RFC gmem v1 0/8] KVM: gmem hooks/changes needed for x86 (other archs?) Michael Roth
2023-10-16 11:50     ` [PATCH RFC gmem v1 4/8] KVM: x86: Add gmem hook for invalidating memory Michael Roth
2024-02-09 10:11       ` Steven Price
2024-02-09 14:28         ` Sean Christopherson
2024-02-09 15:02           ` Steven Price
2024-02-09 15:13             ` Sean Christopherson
2024-03-11 17:24  5%           ` Michael Roth
2024-03-12 20:26  0%             ` Sean Christopherson
2024-03-13 17:11  0%               ` Steven Price
2024-01-24 11:38     [PATCH v3 10/15] block: Add fops atomic write support John Garry
2024-02-13  9:36     ` Nilay Shroff
2024-02-13  9:58       ` [PATCH " John Garry
2024-02-13 11:08         ` Nilay Shroff
2024-02-13 11:52           ` John Garry
2024-02-14  9:38  5%         ` Nilay Shroff
2024-02-14 11:29  0%           ` John Garry
2024-01-24 14:26     [PATCH 0/6] block atomic writes for XFS John Garry
2024-01-24 14:26     ` [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
2024-02-02 17:57       ` Darrick J. Wong
2024-02-05 12:58         ` John Garry
2024-02-13 17:08  0%       ` Darrick J. Wong
2024-01-24 14:26     ` [PATCH RFC 5/6] fs: xfs: iomap atomic write support John Garry
2024-02-02 18:47       ` Darrick J. Wong
2024-02-05 13:36         ` John Garry
2024-02-06  1:15           ` Dave Chinner
2024-02-06  9:53             ` John Garry
2024-02-07  0:06               ` Dave Chinner
2024-02-07 14:13                 ` John Garry
2024-02-09  1:40                   ` Dave Chinner
2024-02-09 12:47  4%                 ` John Garry
2024-02-13 23:41  0%                   ` Dave Chinner
2024-02-14 11:06  0%                     ` John Garry
2024-01-27  1:57     [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
2024-01-27  1:58     ` [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks() Zhang Yi
2024-05-09 15:16  9%   ` Luis Henriques
2024-05-09 16:39  7%     ` Theodore Ts'o
2024-05-09 17:23  0%       ` Luis Henriques
2024-05-10  3:39  0%         ` Zhang Yi
2024-05-10  9:41  0%           ` Luis Henriques
2024-05-10 11:40  0%             ` Zhang Yi
2024-02-12  6:18  0% ` [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large folio Darrick J. Wong
2024-02-05 20:05     [PATCH 0/6] filesystem visibility ioctls Kent Overstreet
2024-02-05 20:05     ` [PATCH 2/6] fs: FS_IOC_GETUUID Kent Overstreet
2024-02-05 22:17       ` Dave Chinner
2024-02-05 22:49         ` Kent Overstreet
2024-02-06  8:24  0%       ` Amir Goldstein
2024-02-06  9:00  0%         ` Kent Overstreet
2024-02-05 23:23     Fanotify: concurrent work and handling files being executed Sargun Dhillon
2024-02-06  7:44     ` Amir Goldstein
2024-02-06 13:50       ` Jan Kara
2024-02-06 16:29         ` Sargun Dhillon
2024-02-06 16:44           ` Amir Goldstein
2024-02-07 10:44             ` Jan Kara
2024-02-07 11:15  5%           ` Amir Goldstein
2024-02-06 14:24     [PATCH v15 0/9] FUSE passthrough for file io Amir Goldstein
2024-02-06 14:24     ` [PATCH v15 3/9] fuse: implement ioctls to manage backing files Amir Goldstein
2024-02-28 10:50       ` Jingbo Xu
2024-02-28 11:07         ` Amir Goldstein
2024-02-28 11:14           ` Miklos Szeredi
2024-02-28 11:28             ` Amir Goldstein
2024-02-28 14:32               ` Jens Axboe
2024-02-28 15:01                 ` Miklos Szeredi
2024-02-29 10:15                   ` Christian Brauner
2024-02-29 10:17                     ` Christian Brauner
2024-03-05 10:57  5%                   ` Miklos Szeredi
2024-02-06 14:24     ` [PATCH v15 9/9] fuse: auto-invalidate inode attributes in passthrough mode Amir Goldstein
2024-04-02 20:13       ` Sweet Tea Dorminy
2024-04-02 21:18         ` Bernd Schubert
2024-04-03  8:18           ` Amir Goldstein
2024-04-04 14:07  5%         ` Sweet Tea Dorminy
2024-02-06 20:18     [PATCH v2 0/7] filesystem visibility ioctls Kent Overstreet
2024-02-06 20:18  4% ` [PATCH v2 3/7] fs: FS_IOC_GETUUID Kent Overstreet
2024-02-06 20:29  0%   ` Randy Dunlap
2024-02-06 22:01  0%   ` Dave Chinner
2024-02-06 20:18  5% ` [PATCH v2 4/7] fat: Hook up sb->s_uuid Kent Overstreet
2024-02-07  2:56     [PATCH v3 0/7] filesystem visibility ioctls Kent Overstreet
2024-02-07  2:56  4% ` [PATCH v3 3/7] fs: FS_IOC_GETUUID Kent Overstreet
2024-02-07  6:41  0%   ` Amir Goldstein
2024-02-07  6:46  0%     ` Amir Goldstein
2024-02-07  2:56  5% ` [PATCH v3 4/7] fat: Hook up sb->s_uuid Kent Overstreet
2024-02-07  9:54  5% [LSF/MM/BPF TOPIC] tracing the source of errors Miklos Szeredi
2024-02-07 11:00  0% ` [Lsf-pc] " Jan Kara
2024-02-08 20:39  0%   ` Gabriel Krisman Bertazi
2024-02-07 17:16  0% ` Darrick J. Wong
2024-02-08 18:49     [PATCH v4 00/12] Introduce cpu_dcache_is_aliasing() to fix DAX regression Mathieu Desnoyers
2024-02-08 18:49  8% ` [PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures Mathieu Desnoyers
2024-02-08 21:52  0%   ` Dan Williams
2024-02-09 17:06     [PATCH v9 0/8] Landlock: IOCTL support Günther Noack
2024-02-09 17:06  5% ` [PATCH v9 1/8] landlock: Add IOCTL access right Günther Noack
2024-02-16 17:19  0%   ` Mickaël Salaün
2024-02-19 18:34  0%   ` Mickaël Salaün
2024-02-28 12:57  0%     ` Günther Noack
2024-03-01 12:59  0%       ` Mickaël Salaün
2024-02-12 16:30     [PATCH v5 0/8] Introduce cpu_dcache_is_aliasing() to fix DAX regression Mathieu Desnoyers
2024-02-12 16:31  8% ` [PATCH v5 7/8] Introduce cpu_dcache_is_aliasing() across all architectures Mathieu Desnoyers
2024-02-15 14:46     [PATCH v6 0/9] Introduce cpu_dcache_is_aliasing() to fix DAX regression Mathieu Desnoyers
2024-02-15 14:46  8% ` [PATCH v6 8/9] Introduce cpu_dcache_is_aliasing() across all architectures Mathieu Desnoyers
2024-02-16 18:18  5% [PATCH fstests 0/3] few enhancements Luis Chamberlain
2024-02-16 18:18  2% ` [PATCH 1/3] tests: augment soak test group Luis Chamberlain
2024-02-16 18:18  4% ` [PATCH 2/3] check: add support for --list-group-tests Luis Chamberlain
2024-02-19  3:38  0%   ` Dave Chinner
2024-02-21 16:45  0%     ` Luis Chamberlain
2024-02-25 16:08  0%       ` Zorro Lang
2024-02-16 19:43  5% [PATCH] test_xarray: fix soft lockup for advanced-api tests Luis Chamberlain
2024-02-20  2:28  0% ` Andrew Morton
2024-02-20 17:45  0%   ` Luis Chamberlain
2024-02-21  0:51     [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more Kent Overstreet
2024-02-21 15:06     ` Miklos Szeredi
2024-02-21 21:04  5%   ` Kent Overstreet
2024-02-21 21:08       ` Josef Bacik
2024-02-22  9:14         ` Miklos Szeredi
2024-02-22  9:42           ` Kent Overstreet
2024-02-22 10:25  5%         ` Miklos Szeredi
2024-02-22 11:19  0%           ` Kent Overstreet
2024-02-22 11:01             ` [Lsf-pc] " Jan Kara
2024-02-22 12:48               ` Miklos Szeredi
2024-02-22 16:08  5%             ` Jan Kara
2024-02-22 12:45     [RFC v4 linux-next 00/19] fs & block: remove bdev->bd_inode Yu Kuai
2024-02-22 12:45  1% ` [RFC v4 linux-next 19/19] " Yu Kuai
2024-02-25 21:14     [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Matthew Wilcox
2024-02-25 23:45     ` Linus Torvalds
2024-02-26  1:02       ` Kent Overstreet
2024-02-26  1:32         ` Linus Torvalds
2024-02-26  2:50           ` Al Viro
2024-02-26 17:17             ` Linus Torvalds
2024-02-26 21:07               ` Matthew Wilcox
2024-02-26 21:17                 ` Kent Overstreet
2024-02-26 21:19                   ` Kent Overstreet
2024-02-26 21:55                     ` Paul E. McKenney
2024-02-26 23:29  5%                   ` Kent Overstreet
2024-02-27  0:05  0%                     ` Paul E. McKenney
2024-02-27  0:29  0%                       ` Kent Overstreet
2024-02-27  0:55  5%                         ` Paul E. McKenney
2024-02-27  1:08  5%                           ` Kent Overstreet
2024-02-27  5:17  0%                             ` Paul E. McKenney
2024-02-27  6:21  0%                               ` Kent Overstreet
2024-02-27 15:32  0%                                 ` Paul E. McKenney
2024-02-27  0:43  0%                     ` Dave Chinner
2024-02-28  4:28  6% [PATCH] xfs: stop advertising SB_I_VERSION Dave Chinner
2024-02-28 16:08  0% ` Darrick J. Wong
2024-03-01 13:42  0% ` Jeff Layton
2024-02-28  6:12     [LSF/MM/BPF TOPIC] untorn buffered writes Theodore Ts'o
2024-02-28 14:11     ` Matthew Wilcox
2024-02-28 23:33       ` Theodore Ts'o
2024-02-29  1:07  5%     ` Dave Chinner
2024-02-29  0:20  3% [LSF/MM/BPF TOPIC] Famfs: shared memory file system for disaggregated memory [LSF/MM/BPF ATTEND] John Groves
2024-04-23 13:30  0% ` [Lsf-pc] " Amir Goldstein
2024-04-24 12:22  0%   ` John Groves
2024-03-02  7:41     [RFC 1/8] fs: Add FS_XFLAG_ATOMICWRITES flag Ritesh Harjani (IBM)
2024-03-02  7:42  6% ` [RFC 4/8] ext4: Add statx and other atomic write helper routines Ritesh Harjani (IBM)
2024-03-04 13:04     [PATCH v2 00/14] block atomic writes for XFS John Garry
2024-03-04 13:04  5% ` [PATCH v2 09/14] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
2024-03-05 12:27  5% [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size Christian Brauner
2024-03-05 14:33  0% ` Seth Forshee
2024-03-05 15:17  0% ` Serge E. Hallyn
2024-03-05 16:21  0% ` Christian Brauner
2024-03-07 20:01  0% ` Jarkko Sakkinen
2024-03-07 20:03  0%   ` Jarkko Sakkinen
2024-03-07  5:36     [LSF/MM/BPF TOPIC] statx attributes Steve French
2024-03-07  8:54  5% ` Amir Goldstein
2024-03-07 16:37  0%   ` Steve French
2024-03-07 17:45  0%     ` Kent Overstreet
2024-03-07 20:03  0%       ` Steve French
2024-03-07 20:22  0%         ` Andrew Walker
2024-03-08  2:29     [PATCH v2] statx: stx_subvol Kent Overstreet
2024-03-08 11:42     ` Neal Gompa
2024-03-08 16:34       ` Kent Overstreet
2024-03-08 16:44         ` Neal Gompa
2024-03-08 16:48           ` Kent Overstreet
2024-03-08 16:56             ` Darrick J. Wong
2024-03-11  2:17               ` Dave Chinner
2024-03-11  5:30  5%             ` Miklos Szeredi
2024-03-11  5:49  0%               ` Kent Overstreet
2024-03-08 10:19     [GIT PULL] vfs uuid Christian Brauner
2024-03-14 19:55     ` Fwd: " Steve French
2024-03-15  3:06  5%   ` Kent Overstreet
2024-03-14 17:07     [PATCH v2] xfs: allow cross-linking special files without project quota Andrey Albershteyn
2024-03-15  2:48  5% ` Darrick J. Wong
2024-03-15  9:35  0%   ` Andrey Albershteyn
2024-04-05 22:22  0%   ` Andrey Albershteyn
2024-03-22 15:09  2% [PATCH v11 0/9] Landlock: IOCTL support Günther Noack
2024-03-24  5:00     kernel crash in mknod Steve French
2024-03-24  5:46     ` Al Viro
2024-03-24 16:50       ` Roberto Sassu
2024-03-25 16:06         ` Christian Brauner
     [not found]           ` <CAH2r5muL4NEwLxq_qnPOCTHunLB_vmDA-1jJ152POwBv+aTcXg@mail.gmail.com>
2024-03-25 19:54             ` Al Viro
2024-03-25 20:47  5%           ` Paulo Alcantara
2024-03-25 21:13  0%             ` Al Viro
2024-03-25 21:31  0%               ` Paulo Alcantara
2024-03-25 13:39  2% [PATCH v12 0/9] Landlock: IOCTL support Günther Noack
2024-03-26 13:38     [PATCH v6 00/10] block atomic writes John Garry
2024-03-26 13:38     ` [PATCH v6 10/10] nvme: Atomic write support John Garry
2024-04-11  0:29       ` Luis Chamberlain
2024-04-11  8:59         ` John Garry
     [not found]           ` <CGME20240411162308uscas1p2a0a08f3fb19af69de911961b03965257@uscas1p2.samsung.com>
2024-04-11 16:22             ` Luis Chamberlain
2024-04-11 23:32  4%           ` Dan Helmick
2024-03-27 13:10  2% [PATCH v13 00/10] Landlock: IOCTL support Günther Noack
2024-03-27 13:10  6% ` [PATCH v13 01/10] landlock: Add IOCTL access right for character and block devices Günther Noack
2024-03-27 16:57  0%   ` Mickaël Salaün
2024-03-28 12:01  0%     ` Mickaël Salaün
2024-04-02 18:28  0%     ` Günther Noack
2024-04-03 11:15  0%       ` Mickaël Salaün
2024-04-05 16:17  0%         ` Günther Noack
2024-04-05 18:01  0%           ` Mickaël Salaün
2024-03-27 15:10  1% [ANNOUNCE] util-linux v2.40 Karel Zak
2024-03-28 16:57     [PATCH v6 00/15] netfs, cifs: Delegate high-level I/O to netfslib David Howells
2024-03-28 16:58  4% ` [PATCH v6 11/15] cifs: When caching, try to open O_WRONLY file rdwr on server David Howells
2024-03-29  9:58  0%   ` Naveen Mamindlapalli
2024-03-28 16:58  2% ` [PATCH v6 13/15] cifs: Remove some code that's no longer used, part 1 David Howells
2024-03-30  0:32     [PATCHSET v5.5 1/2] fs-verity: support merkle tree access by blocks Darrick J. Wong
2024-03-30  0:34     ` [PATCH 08/13] fsverity: expose merkle tree geometry to callers Darrick J. Wong
2024-04-05  2:50       ` Eric Biggers
2024-04-25  0:45  5%     ` Darrick J. Wong
2024-04-25  0:49  0%       ` Eric Biggers
2024-04-25  1:01  0%         ` Darrick J. Wong
2024-04-25  1:04  0%           ` Eric Biggers
2024-03-30  0:32     [PATCHSET v5.5 2/2] xfs: fs-verity support Darrick J. Wong
2024-03-30  0:43  5% ` [PATCH 28/29] xfs: allow verity files to be opened even if the fsverity metadata is damaged Darrick J. Wong
2024-04-02 18:04  0%   ` Andrey Albershteyn
2024-04-02 20:00  7%   ` Colin Walters
2024-04-02 22:52  0%     ` Darrick J. Wong
2024-04-02 23:45  0%       ` Eric Biggers
2024-04-03  1:34  0%         ` Darrick J. Wong
2024-04-03  0:10  0%       ` Colin Walters
2024-04-03  1:39  0%         ` Darrick J. Wong
2024-04-03  8:35  0%         ` Alexander Larsson
2024-04-02  9:11  4% [PATCH] cifs: Fix caching to try to do open O_WRONLY as rdwr on server David Howells
2024-04-05 21:40  2% [PATCH v14 00/12] Landlock: IOCTL support Günther Noack
2024-04-05 21:40  6% ` [PATCH v14 02/12] landlock: Add IOCTL access right for character and block devices Günther Noack
2024-04-12 15:16  0%   ` Mickaël Salaün
2024-04-06  9:09     [PATCH vfs.all 00/26] fs & block: remove bdev->bd_inode Yu Kuai
2024-04-06  9:09  3% ` [PATCH vfs.all 25/26] buffer: add helpers to get and set bdev Yu Kuai
     [not found]     <20240408125309.280181634@linuxfoundation.org>
2024-04-08 12:57  4% ` [PATCH 6.8 179/273] cifs: Fix caching to try to do open O_WRONLY as rdwr on server Greg Kroah-Hartman
     [not found]     <20240408125306.643546457@linuxfoundation.org>
2024-04-08 12:58  4% ` [PATCH 6.6 188/252] " Greg Kroah-Hartman
     [not found]     <20240408125256.218368873@linuxfoundation.org>
2024-04-08 12:58  4% ` [PATCH 6.1 100/138] " Greg Kroah-Hartman
2024-04-09 19:22     [PATCH v1 00/18] mm: mapcount for large folios + page_mapcount() cleanups David Hildenbrand
2024-04-09 19:22  5% ` [PATCH v1 10/18] mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range() David Hildenbrand
2024-04-10  3:41  5% [PATCH v2 0/9] ext4: support adding multi-delalloc blocks Zhang Yi
2024-04-10  3:41 11% ` [PATCH v2 1/9] ext4: factor out a common helper to query extent map Zhang Yi
2024-04-24 20:05  7%   ` Jan Kara
2024-04-10 13:27  2% [RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
2024-04-10 13:27 11% ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
2024-04-10 13:27  4% ` [RFC PATCH v4 15/34] ext4: update delalloc data reserve space in ext4_es_insert_extent() Zhang Yi
2024-04-10 13:28  5% ` [RFC PATCH v4 17/34] ext4: use ext4_map_query_blocks() in ext4_map_blocks() Zhang Yi
2024-04-10 13:28  5% ` [RFC PATCH v4 23/34] ext4: implement buffered read iomap path Zhang Yi
2024-04-10 14:29  2% [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
2024-04-10 14:29 11% ` [PATCH v4 01/34] ext4: factor out a common helper to query extent map Zhang Yi
2024-04-26 11:55  7%   ` Ritesh Harjani
2024-04-10 14:29  4% ` [RFC PATCH v4 15/34] ext4: update delalloc data reserve space in ext4_es_insert_extent() Zhang Yi
2024-04-10 14:29  5% ` [RFC PATCH v4 17/34] ext4: use ext4_map_query_blocks() in ext4_map_blocks() Zhang Yi
2024-04-10 14:29  5% ` [RFC PATCH v4 23/34] ext4: implement buffered read iomap path Zhang Yi
2024-04-11  1:12  0% ` [RESEND RFC PATCH v4 00/34] ext4: use iomap for regular file's buffered IO path and enable large folio Zhang Yi
2024-04-24  8:12  0% ` Zhang Yi
2024-04-11 20:31     commit e57bf9cda9cd ("timerfd: convert to ->read_iter()") breaks booting on debian stable (bookworm, 12.5) Bert Karwatzki
2024-04-11 22:01  1% ` Bert Karwatzki
2024-04-15 11:28     [PATCH 03/26] netfs: Update i_blocks when write committed to pagecache Jeff Layton
2024-03-28 16:33     ` [PATCH 00/26] netfs, afs, 9p, cifs: Rework netfs to use ->writepages() to copy to cache David Howells
2024-03-28 16:33       ` [PATCH 03/26] netfs: Update i_blocks when write committed to pagecache David Howells
2024-04-16 22:47  5%     ` David Howells
2024-04-19 16:11  2% [PATCH v15 00/11] Landlock: IOCTL support Günther Noack
2024-04-19 16:11  6% ` [PATCH v15 01/11] landlock: Add IOCTL access right for character and block devices Günther Noack
2024-05-08 10:40  0% ` [PATCH v15 00/11] Landlock: IOCTL support Mickaël Salaün
2024-04-29 17:47     [PATCH v3 00/21] block atomic writes for XFS John Garry
2024-04-29 17:47  5% ` [PATCH v3 16/21] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
2024-04-30  3:18     [PATCHSET v5.6 1/2] fs-verity: support merkle tree access by blocks Darrick J. Wong
2024-04-30  3:21  6% ` [PATCH 09/18] fsverity: expose merkle tree geometry to callers Darrick J. Wong
2024-04-30 14:09     [PATCH v7 00/16] netfs, cifs: Delegate high-level I/O to netfslib David Howells
2024-04-30 14:09  2% ` [PATCH v7 13/16] cifs: Remove some code that's no longer used, part 1 David Howells
2024-05-04  0:30 13% [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
2024-05-04  0:30  9% ` [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
2024-05-04 15:28  6%   ` Greg KH
2024-05-04 21:50  9%     ` Andrii Nakryiko
2024-05-06 13:58  6%       ` Arnaldo Carvalho de Melo
2024-05-06 18:05  6%         ` Namhyung Kim
2024-05-06 18:51  6%           ` Andrii Nakryiko
2024-05-06 18:53  6%           ` Arnaldo Carvalho de Melo
2024-05-06 19:16  7%             ` Arnaldo Carvalho de Melo
2024-05-07 21:55  7%               ` Namhyung Kim
2024-05-06 18:41  6%         ` Andrii Nakryiko
2024-05-06 20:35  6%           ` Arnaldo Carvalho de Melo
2024-05-07 16:36 11%             ` Andrii Nakryiko
2024-05-04 23:36  9%   ` kernel test robot
2024-05-07 18:10  7%   ` Liam R. Howlett
2024-05-07 18:52  6%     ` Andrii Nakryiko
2024-05-04  0:30  8% ` [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko
2024-05-04 15:32  0%   ` Greg KH
2024-05-04 22:13  0%     ` Andrii Nakryiko
2024-05-07 15:48  0%       ` Liam R. Howlett
2024-05-07 16:27  0%         ` Andrii Nakryiko
2024-05-07 18:06               ` Liam R. Howlett
2024-05-07 19:00  6%             ` Andrii Nakryiko
2024-05-08  1:20  0%               ` Liam R. Howlett
2024-05-04 11:24  7% ` [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Christian Brauner
2024-05-04 15:33  7%   ` Greg KH
2024-05-04 21:50  7%     ` Andrii Nakryiko
2024-05-04 21:50 14%   ` Andrii Nakryiko
2024-05-05  5:26  6% ` Ian Rogers
2024-05-06 18:58 11%   ` Andrii Nakryiko
2024-05-04 18:37  6% [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Alexey Dobriyan
2024-05-08  6:12  5% [PATCH v3 00/10] ext4: support adding multi-delalloc blocks Zhang Yi
2024-05-08  6:12 11% ` [PATCH v3 01/10] ext4: factor out a common helper to query extent map Zhang Yi
2024-05-11 11:26  4% [PATCH v4 00/10] ext4: support adding multi-delalloc blocks Zhang Yi
2024-05-11 11:26 11% ` [PATCH v4 01/10] ext4: factor out a common helper to query extent map Zhang Yi
2024-05-16  9:22     [PATCH 0/3] Introduce user namespace capabilities Jonathan Calmels
2024-05-16 19:07     ` Casey Schaufler
2024-05-16 19:29       ` Jarkko Sakkinen
2024-05-16 19:31         ` Jarkko Sakkinen
2024-05-16 20:00           ` Jarkko Sakkinen
2024-05-17 11:42             ` Jonathan Calmels
2024-05-17 17:53               ` Casey Schaufler
2024-05-18 12:20                 ` Serge Hallyn
2024-05-19 17:03                   ` Casey Schaufler
2024-05-20  0:54  5%                 ` Jonathan Calmels
2024-05-17 12:39  4% [PATCH v5 00/10] ext4: support adding multi-delalloc blocks Zhang Yi
2024-05-17 12:39 11% ` [PATCH v5 01/10] ext4: factor out a common helper to query extent map Zhang Yi
2024-05-17 16:19  7%   ` Markus Elfring
2024-05-24  4:10 12% [PATCH v2 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
2024-05-24  4:10  4% ` [PATCH v2 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock Andrii Nakryiko
2024-05-24  4:10 17% ` [PATCH v2 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
2024-05-24  4:10  4% ` [PATCH v2 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API Andrii Nakryiko
2024-05-24  4:10  8% ` [PATCH v2 6/9] docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence Andrii Nakryiko
2024-05-24  4:10  8% ` [PATCH v2 7/9] tools: sync uapi/linux/fs.h header into tools subdir Andrii Nakryiko
2024-05-24  4:10  9% ` [PATCH v2 9/9] selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).