linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 00/20] Introduce the famfs shared-memory file system
@ 2024-02-23 17:41 John Groves
  2024-02-23 17:41 ` [RFC PATCH 01/20] famfs: Documentation John Groves
                   ` (21 more replies)
  0 siblings, 22 replies; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This patch set introduces famfs[1] - a special-purpose fs-dax file system
for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
CXL-specific in any way.

* Famfs creates a simple access method for storing and sharing data in
  sharable memory. The memory is exposed and accessed as memory-mappable
  dax files.
* Famfs supports multiple hosts mounting the same file system from the
  same memory (something existing fs-dax file systems don't do).
* A famfs file system can be created on either a /dev/pmem device in fs-dax
  mode, or a /dev/dax device in devdax mode (the latter depending on
  patches 2-6 of this series).

The famfs kernel file system is part of the famfs framework; additional
components in user space[2] handle metadata and direct the famfs kernel
module to instantiate files that map to specific memory. The famfs user
space has documentation and a reasonably thorough test suite.

The famfs kernel module never accesses the shared memory directly (either
data or metadata). Because of this, shared memory managed by the famfs
framework does not create a RAS "blast radius" problem that could crash or
destabilize the kernel. Poison or timeouts in famfs memory
can be expected to kill apps via SIGBUS and cause mounts to be disabled
due to memory failure notifications.

Famfs does not attempt to solve concurrency or coherency problems for apps,
although it does solve these problems in regard to its own data structures.
Apps may encounter hard concurrency problems, but there are use cases that
are eminently useful and uncomplicated from a concurrency perspective:
serial sharing is one (only one host at a time has access), and read-only
concurrent sharing is another (all hosts can read-cache without worry).
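
As a rough illustration of the read-only sharing case from an application's
point of view, here is a minimal user-space sketch. It is ordinary POSIX
code and not part of this series; map_readonly() is a hypothetical helper
name, and the assumption is that the file already exists on a mounted famfs
instance (on any other file system the same code gives a normal page-cache
mapping rather than dax access).

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map an existing file read-only.
 * Returns the base address and stores the length in *len_out,
 * or returns NULL on any failure.
 */
static const void *map_readonly(const char *path, size_t *len_out)
{
	struct stat st;
	void *addr;
	int fd;

	assert(path && len_out);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;

	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return NULL;
	}

	/* PROT_READ only: stores through this mapping will fault */
	addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping remains valid after close */
	if (addr == MAP_FAILED)
		return NULL;

	*len_out = (size_t)st.st_size;
	return addr;
}
```

Every host that maps the file this way can read it without coordinating
with the others; the caller releases the mapping with munmap().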

Contents:

* famfs kernel documentation [patch 1]. Note that evolving famfs user
  documentation is at [2]
* dev_dax_iomap patchset [patches 2-6] - This enables fs-dax to use the
  iomap interface via a character /dev/dax device (e.g. /dev/dax0.0). For
  historical reasons the iomap infrastructure was enabled only for
  /dev/pmem devices (which are dax block devices). As famfs is the first
  fs-dax file system that works on /dev/dax, this patch series fills in
  the bare minimum infrastructure to enable iomap api usage with /dev/dax.
* famfs patchset [patches 7-20] - this introduces the kernel component of
  famfs.

IMPORTANT NOTE: There is a developing consensus that /dev/dax requires
some fundamental re-factoring (e.g. [3]) that is related but outside the
scope of this series.

Some observations about using sharable memory

* It does not make sense to online sharable memory as system-ram.
  System-ram gets zeroed when it is onlined, so sharing is basically
  nonsense.
* It does not make sense to put struct page's in sharable memory, because
  those can't be shared. However, separately providing non-sharable
  capacity to be used for struct page's might be a sensible approach if the
  size of struct page array for sharable memory is too large to put in
  conventional system-ram (albeit with possible RAS implications).
* Sharable memory is pmem-like, in that a host is likely to connect in
  order to gain access to data that is already in the memory. Moreover
  the power domain for shared memory is separate from that of the server.
  Having observed that, famfs is not intended for persistent storage. It is
  intended for sharing data sets in memory during a time frame where the
  memory and the compute nodes are expected to remain operational - such
  as during a clustered data analytics job.

Could we do this with FUSE?

The key performance requirement for famfs is efficient handling of VMA
faults. This requires caching the complete dax extent lists for all active
files so faults can be handled without upcalls, which FUSE does not do.
It would probably be possible to put this capability into FUSE, but we think
that keeping famfs separate from FUSE is the simpler approach.
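
To make the "no upcalls" point concrete, the following is a simplified
user-space model of resolving a file offset against a cached extent list.
It is only a sketch: struct extent_sketch and file_off_to_dev_off() are
hypothetical names, and the real fault path in the patches goes through
iomap rather than a helper like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached extent: one contiguous run on the dax device */
struct extent_sketch {
	uint64_t dev_offset;	/* byte offset into the dax device */
	uint64_t len;		/* length of the run in bytes */
};

/*
 * Translate a file offset to a device offset by walking the cached
 * extent list. Returns 0 and fills *dev_off on success, or -1 if
 * file_off lies beyond the last extent.
 */
static int file_off_to_dev_off(const struct extent_sketch *ext, size_t n,
			       uint64_t file_off, uint64_t *dev_off)
{
	uint64_t start = 0;	/* file offset at which ext[i] begins */
	size_t i;

	assert(ext && dev_off);

	for (i = 0; i < n; i++) {
		/* file_off >= start always holds here, so no underflow */
		if (file_off - start < ext[i].len) {
			*dev_off = ext[i].dev_offset + (file_off - start);
			return 0;
		}
		start += ext[i].len;
	}
	return -1;
}
```

Because the whole list is resident on the host, a lookup like this needs
no round trip to user space at fault time.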

This patch set is available as a branch at [5]

References

[1] https://lpc.events/event/17/contributions/1455/
[2] https://github.com/cxl-micron-reskit/famfs
[3] https://lore.kernel.org/all/166630293549.1017198.3833687373550679565.stgit@dwillia2-xfh.jf.intel.com/
[4] https://www.computeexpresslink.org/download-the-specification
[5] https://github.com/cxl-micron-reskit/famfs-linux

John Groves (20):
  famfs: Documentation
  dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since
    both need it now
  dev_dax_iomap: Save the kva from memremap
  dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
  famfs: Add include/linux/famfs_ioctl.h
  famfs: Add famfs_internal.h
  famfs: Add super_operations
  famfs: famfs_open_device() & dax_holder_operations
  famfs: Add fs_context_operations
  famfs: Add inode_operations and file_system_type
  famfs: Add iomap_ops
  famfs: Add struct file_operations
  famfs: Add ioctl to file_operations
  famfs: Add fault counters
  famfs: Add module stuff
  famfs: Support character dax via the dev_dax_iomap patch
  famfs: Update MAINTAINERS file
  famfs: Add Kconfig and Makefile plumbing

 Documentation/filesystems/famfs.rst | 124 +++++
 MAINTAINERS                         |  11 +
 drivers/dax/Kconfig                 |   6 +
 drivers/dax/bus.c                   | 131 ++++++
 drivers/dax/dax-private.h           |   1 +
 drivers/dax/device.c                |  38 +-
 drivers/dax/super.c                 |  38 ++
 fs/Kconfig                          |   2 +
 fs/Makefile                         |   1 +
 fs/famfs/Kconfig                    |  10 +
 fs/famfs/Makefile                   |   5 +
 fs/famfs/famfs_file.c               | 704 ++++++++++++++++++++++++++++
 fs/famfs/famfs_inode.c              | 586 +++++++++++++++++++++++
 fs/famfs/famfs_internal.h           | 126 +++++
 include/linux/dax.h                 |   5 +
 include/uapi/linux/famfs_ioctl.h    |  56 +++
 16 files changed, 1821 insertions(+), 23 deletions(-)
 create mode 100644 Documentation/filesystems/famfs.rst
 create mode 100644 fs/famfs/Kconfig
 create mode 100644 fs/famfs/Makefile
 create mode 100644 fs/famfs/famfs_file.c
 create mode 100644 fs/famfs/famfs_inode.c
 create mode 100644 fs/famfs/famfs_internal.h
 create mode 100644 include/uapi/linux/famfs_ioctl.h


base-commit: 841c35169323cd833294798e58b9bf63fa4fa1de
-- 
2.43.0



* [RFC PATCH 01/20] famfs: Documentation
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-23 17:41 ` [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

Introduce Documentation/filesystems/famfs.rst into the Documentation
tree.

Signed-off-by: John Groves <john@groves.net>
---
 Documentation/filesystems/famfs.rst | 124 ++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
 create mode 100644 Documentation/filesystems/famfs.rst

diff --git a/Documentation/filesystems/famfs.rst b/Documentation/filesystems/famfs.rst
new file mode 100644
index 000000000000..c2cc50c10d03
--- /dev/null
+++ b/Documentation/filesystems/famfs.rst
@@ -0,0 +1,124 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _famfs_index:
+
+==================================================================
+famfs: The kernel component of the famfs shared memory file system
+==================================================================
+
+- Copyright (C) 2024 Micron Technology, Inc.
+
+Introduction
+============
+Compute Express Link (CXL) provides a mechanism for disaggregated or
+fabric-attached memory (FAM). This creates opportunities for data sharing;
+clustered apps that would otherwise have to shard or replicate data can
+share one copy in disaggregated memory.
+
+Famfs, which is not CXL-specific in any way, provides a mechanism for
+multiple hosts to use data in shared memory, by giving it a file system
+interface. With famfs, any app that understands files (which is all of
+them, right?) can access data sets in shared memory. Although famfs
+supports read and write calls, the real point is to support mmap, which
+provides direct (dax) access to the memory - either writable or read-only.
+
+Shared memory can pose complex coherency and synchronization issues, but
+there are also simple cases. Two simple and eminently useful patterns that
+occur frequently in data analytics and AI are:
+
+* Serial Sharing - Only one host or process at a time has access to a file
+* Read-only Sharing - Multiple hosts or processes share read-only access
+  to a file
+
+The famfs kernel file system is part of the famfs framework; User space
+components [1] handle metadata allocation and distribution, and direct the
+famfs kernel module to instantiate files that map to specific memory.
+
+The famfs framework manages coherency of its own metadata and structures,
+but does not attempt to manage coherency for applications.
+
+Famfs also provides data isolation between files. That is, even though
+the host has access to an entire memory "device" (as a dax device), apps
+cannot write to memory for which the file is read-only, and mapping one
+file provides isolation from the memory of all other files. This is pretty
+basic, but some experimental shared memory usage patterns provide no such
+isolation.
+
+Principles of Operation
+=======================
+
+Without its user space components, the famfs kernel module is just a
+semi-functional clone of ramfs with latent fs-dax support. The user space
+components maintain superblocks and metadata logs, and use the famfs kernel
+component to provide a file system view of shared memory across multiple
+hosts.
+
+Each host has an independent instance of the famfs kernel module. After
+mount, files are not visible until the user space component instantiates
+them (normally by playing the famfs metadata log).
+
+Once instantiated, files on each host can point to the same shared memory,
+but in-memory metadata (inodes, etc.) is ephemeral on each host that has a
+famfs instance mounted. Like ramfs, the famfs in-kernel file system has no
+backing store for metadata modifications. If metadata is ever persisted,
+that must be done by the user space components. However, mutations to file
+data are saved to the shared memory - subject to write permission and
+processor cache behavior.
+
+
+Famfs is Not a Conventional File System
+---------------------------------------
+
+Famfs files can be accessed by conventional means, but there are
+limitations. The kernel component of famfs is not involved in the
+allocation of backing memory for files at all; the famfs user space
+creates files and passes the allocation extent lists into the kernel via
+the per-file FAMFSIOC_MAP_CREATE ioctl. A file that lacks this metadata is
+treated as invalid by the famfs kernel module. As a practical matter files
+must be created via the famfs library or cli, but they can be consumed as
+if they were conventional files.
+
+Famfs differs in some important ways from conventional file systems:
+
+* Files must be pre-allocated by the famfs framework; Allocation is never
+  performed on write.
+* Any operation that changes a file's size is considered to put the file
+  in an invalid state, disabling access to the data. It may be possible to
+  revisit this in the future.
+* (Typically the famfs user space can restore files to a valid state by
+  replaying the famfs metadata log.)
+
+Famfs exists to apply the existing file system abstractions on top of
+shared memory so applications and workflows can more easily consume it.
+
+Key Requirements
+================
+
+The primary requirements for famfs are:
+
+1. Must support a file system abstraction backed by sharable dax memory
+2. Files must efficiently handle VMA faults
+3. Must support metadata distribution in a sharable way
+4. Must handle clients with a stale copy of metadata
+
+The famfs kernel component takes care of 1-2 above.
+
+Requirements 3 and 4 are handled by the user space components, and are
+largely orthogonal to the functionality of the famfs kernel module.
+
+Requirements 3 and 4 cannot be met by conventional fs-dax file systems
+(e.g. xfs and ext4) because they use write-back metadata; it is not valid
+to mount such a file system on two hosts from the same in-memory image.
+
+
+Famfs Usage
+===========
+
+Famfs usage is documented at [1].
+
+
+References
+==========
+
+- [1] Famfs user space repository and documentation
+      https://github.com/cxl-micron-reskit/famfs
-- 
2.43.0



* [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
  2024-02-23 17:41 ` [RFC PATCH 01/20] famfs: Documentation John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 12:05   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now John Groves
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This function should be called by fs-dax file systems after opening the
devdax device. This adds holder_operations.

This function serves the same role as fs_dax_get_by_bdev(), which dax
file systems call after opening the pmem block device.
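
The exclusive-holder idea can be modeled in user space with a C11
compare-and-exchange. This is a sketch only: struct dev_sketch,
claim_holder() and release_holder() are hypothetical names, and the
kernel code in this patch uses cmpxchg() on dax_dev->holder_data instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the device's holder slot */
struct dev_sketch {
	_Atomic(void *) holder;	/* NULL while the device is unclaimed */
};

/* Claim dev for holder; 0 on success, -1 if it is already held */
static int claim_holder(struct dev_sketch *dev, void *holder)
{
	void *expected = NULL;

	assert(dev && holder);

	/* Succeeds only if the slot still contains NULL */
	if (!atomic_compare_exchange_strong(&dev->holder, &expected, holder))
		return -1;
	return 0;
}

/* Release dev, but only if holder is the current owner */
static void release_holder(struct dev_sketch *dev, void *holder)
{
	void *expected = holder;

	atomic_compare_exchange_strong(&dev->holder, &expected, NULL);
}
```

A second file system that tries to claim the same device gets -1 until
the first holder releases it.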

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/super.c | 38 ++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |  5 +++++
 2 files changed, 43 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index f4b635526345..fc96362de237 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -121,6 +121,44 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
 EXPORT_SYMBOL_GPL(fs_put_dax);
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+/**
+ * fs_dax_get()
+ *
+ * fs-dax file systems call this function to prepare to use a devdax device for fsdax.
+ * This is like fs_dax_get_by_bdev(), but the caller already has struct dev_dax (and there
+ * is no bdev). The holder makes this exclusive.
+ *
+ * @dax_dev: dev to be prepared for fs-dax usage
+ * @holder: filesystem or mapped device inside the dax_device
+ * @hops: operations for the inner holder
+ *
+ * Returns: 0 on success, -1 on failure
+ */
+int fs_dax_get(
+	struct dax_device *dax_dev,
+	void *holder,
+	const struct dax_holder_operations *hops)
+{
+	/* dax_dev->ops should have been populated by devm_create_dev_dax() */
+	if (!dax_dev || WARN_ON(!dax_dev->ops))
+		return -1;
+
+	if (!dax_alive(dax_dev) || !igrab(&dax_dev->inode))
+		return -1;
+
+	if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
+		pr_warn("%s: holder_data already set\n", __func__);
+		return -1;
+	}
+	dax_dev->holder_ops = hops;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(fs_dax_get);
+#endif /* DEV_DAX_IOMAP */
+
 enum dax_device_flags {
 	/* !alive + rcu grace period == no new operations / mappings */
 	DAXDEV_ALIVE,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..e973289bfde3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,7 +57,12 @@ struct dax_holder_operations {
 
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
+#endif
 void *dax_holder(struct dax_device *dax_dev);
+struct dax_device *inode_dax(struct inode *inode);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
-- 
2.43.0



* [RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
  2024-02-23 17:41 ` [RFC PATCH 01/20] famfs: Documentation John Groves
  2024-02-23 17:41 ` [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 12:10   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap John Groves
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

bus.c can't call functions in device.c - that creates a circular linkage
dependency.
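
For readers who have not looked at the helper being moved, here is a
user-space model of what it computes: the physical address for a page
offset, provided the requested size fits inside the range that contains
it. The names below (struct range_sketch, pgoff_to_phys_sketch(),
SKETCH_PAGE_SHIFT, SKETCH_NO_PHYS) are hypothetical, and a 4 KiB page
size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12		/* assumes 4 KiB pages */
#define SKETCH_NO_PHYS    UINT64_MAX	/* "no translation" sentinel */

/* Hypothetical model of one device range */
struct range_sketch {
	uint64_t pgoff;	/* first page offset covered by this range */
	uint64_t start;	/* first physical address (inclusive) */
	uint64_t end;	/* last physical address (inclusive) */
};

static uint64_t pgoff_to_phys_sketch(const struct range_sketch *r, size_t n,
				     uint64_t pgoff, uint64_t size)
{
	size_t i;

	assert(r && size > 0);

	for (i = 0; i < n; i++) {
		uint64_t npages = (r[i].end - r[i].start + 1) >> SKETCH_PAGE_SHIFT;
		uint64_t pgoff_end = r[i].pgoff + npages - 1;
		uint64_t phys;

		if (pgoff < r[i].pgoff || pgoff > pgoff_end)
			continue;	/* not in this range, try the next */

		phys = ((pgoff - r[i].pgoff) << SKETCH_PAGE_SHIFT) + r[i].start;
		if (phys + size - 1 <= r[i].end)
			return phys;
		break;	/* request would run past the end of the range */
	}
	return SKETCH_NO_PHYS;
}
```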

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/bus.c    | 24 ++++++++++++++++++++++++
 drivers/dax/device.c | 23 -----------------------
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1ff1ab5fa105..664e8c1b9930 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1325,6 +1325,30 @@ static const struct device_type dev_dax_type = {
 	.groups = dax_attribute_groups,
 };
 
+/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c  */
+__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+			      unsigned long size)
+{
+	int i;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+		struct range *range = &dax_range->range;
+		unsigned long long pgoff_end;
+		phys_addr_t phys;
+
+		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+			continue;
+		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+		if (phys + size - 1 <= range->end)
+			return phys;
+		break;
+	}
+	return -1;
+}
+EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dax_region *dax_region = data->dax_region;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 93ebedc5ec8c..40ba660013cf 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -50,29 +50,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 	return 0;
 }
 
-/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
-__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
-		unsigned long size)
-{
-	int i;
-
-	for (i = 0; i < dev_dax->nr_range; i++) {
-		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
-		struct range *range = &dax_range->range;
-		unsigned long long pgoff_end;
-		phys_addr_t phys;
-
-		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
-		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
-			continue;
-		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
-		if (phys + size - 1 <= range->end)
-			return phys;
-		break;
-	}
-	return -1;
-}
-
 static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
 			      unsigned long fault_size)
 {
-- 
2.43.0



* [RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (2 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 12:21   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

Save the kva from memremap because we need it for iomap rw support

Prior to famfs, there were no iomap users of /dev/dax - so the virtual
address from memremap was not needed.

Also: in some cases dev_dax_probe() is called with the first
dev_dax->range offset past pgmap[0].range. In those cases we need to
add the difference to virt_addr in order to have the physaddr's in
dev_dax->ranges match dev_dax->virt_addr.

Dragons...
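
The offset adjustment described above can be modeled in user space as
follows. This is a sketch with hypothetical helper names
(data_offset_sketch(), virt_for_data_sketch()). One deliberate
difference: the patch warns and continues with an offset of 0 when the
pgmap start is above the range start, whereas this model reports an
error.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute how far the first data range starts past the start of the
 * remapped region. Returns 0 and fills *off on success, -1 if the
 * region would start after the data (which should not happen).
 */
static int data_offset_sketch(uint64_t pgmap_phys, uint64_t range_phys,
			      uint64_t *off)
{
	assert(off);

	if (pgmap_phys > range_phys)
		return -1;

	*off = range_phys - pgmap_phys;
	return 0;
}

/* Virtual address that corresponds to the first byte of data */
static uint64_t virt_for_data_sketch(uint64_t mapped_virt, uint64_t off)
{
	return mapped_virt + off;
}
```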

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/dax-private.h |  1 +
 drivers/dax/device.c      | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 446617b73aea..894eb1c66b4a 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -63,6 +63,7 @@ struct dax_mapping {
 struct dev_dax {
 	struct dax_region *region;
 	struct dax_device *dax_dev;
+	u64 virt_addr;
 	unsigned int align;
 	int target_node;
 	bool dyn_id;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 40ba660013cf..6cd79d00fe1b 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -372,6 +372,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct device *dev = &dev_dax->dev;
 	struct dev_pagemap *pgmap;
+	u64 data_offset = 0;
 	struct inode *inode;
 	struct cdev *cdev;
 	void *addr;
@@ -426,6 +427,20 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
+	/* Detect whether the data is at a non-zero offset into the memory */
+	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
+		u64 phys = (u64)dev_dax->ranges[0].range.start;
+		u64 pgmap_phys = (u64)dev_dax->pgmap[0].range.start;
+		u64 vmemmap_shift = (u64)dev_dax->pgmap[0].vmemmap_shift;
+
+		if (!WARN_ON(pgmap_phys > phys))
+			data_offset = phys - pgmap_phys;
+
+		pr_notice("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx shift=%llx\n",
+		       __func__, phys, pgmap_phys, data_offset, vmemmap_shift);
+	}
+	dev_dax->virt_addr = (u64)addr + data_offset;
+
 	inode = dax_inode(dax_dev);
 	cdev = inode->i_cdev;
 	cdev_init(cdev, &dax_fops);
-- 
2.43.0



* [RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (3 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 12:32   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter John Groves
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

Notes about this commit:

* These methods are based somewhat loosely on pmem_dax_ops from
  drivers/nvdimm/pmem.c

* dev_dax_direct_access() returns the hpa, pfn and kva. The kva was
  newly stored as dev_dax->virt_addr by dev_dax_probe().

* The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
  for read/write (dax_iomap_rw())

* dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
  tested yet. I'm looking for suggestions as to how to test those.
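
As a reading aid for the return value of direct_access, here is a
user-space model of the "valid pages at this offset" calculation. It is
a sketch: valid_pages_sketch() and SKETCH_PAGE_SHIFT are hypothetical
names, 4 KiB pages are assumed, and the explicit out-of-range check is
an addition that the function in this patch does not perform.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/*
 * Return how many of the nr_pages requested at pgoff actually lie
 * inside a device of dev_size bytes, or -1 if pgoff is past the end.
 */
static long valid_pages_sketch(uint64_t dev_size, uint64_t pgoff,
			       uint64_t nr_pages)
{
	uint64_t offset = pgoff << SKETCH_PAGE_SHIFT;
	uint64_t size = nr_pages << SKETCH_PAGE_SHIFT;
	uint64_t avail;

	assert(nr_pages > 0);

	if (offset >= dev_size)
		return -1;	/* extra check, not in the patch */

	/* Clamp the request to what remains after offset */
	avail = dev_size - offset;
	if (size > avail)
		size = avail;

	return (long)(size >> SKETCH_PAGE_SHIFT);
}
```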

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/bus.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 664e8c1b9930..06fcda810674 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -10,6 +10,12 @@
 #include "dax-private.h"
 #include "bus.h"
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+#include <linux/backing-dev.h>
+#include <linux/pfn_t.h>
+#include <linux/range.h>
+#endif
+
 static DEFINE_MUTEX(dax_bus_lock);
 
 #define DAX_NAME_LEN 30
@@ -1349,6 +1355,101 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+static void write_dax(void *pmem_addr, struct page *page,
+		unsigned int off, unsigned int len)
+{
+	unsigned int chunk;
+	void *mem;
+
+	while (len) {
+		mem = kmap_local_page(page);
+		chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+		memcpy_flushcache(pmem_addr, mem + off, chunk);
+		kunmap_local(mem);
+		len -= chunk;
+		off = 0;
+		page++;
+		pmem_addr += chunk;
+	}
+}
+
+static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+			     long nr_pages, enum dax_access_mode mode, void **kaddr,
+			     pfn_t *pfn)
+{
+	struct dev_dax *dev_dax = dax_get_private(dax_dev);
+	size_t dax_size = dev_dax_size(dev_dax);
+	size_t size = nr_pages << PAGE_SHIFT;
+	size_t offset = pgoff << PAGE_SHIFT;
+	phys_addr_t phys;
+	u64 virt_addr = dev_dax->virt_addr + offset;
+	pfn_t local_pfn;
+	u64 flags = PFN_DEV|PFN_MAP;
+
+	WARN_ON(!dev_dax->virt_addr); /* virt_addr must be saved for direct_access */
+
+	phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+	if (kaddr)
+		*kaddr = (void *)virt_addr;
+
+	local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */
+	if (pfn)
+		*pfn = local_pfn;
+
+	/* This is the valid size at the specified address */
+	return PHYS_PFN(min_t(size_t, size, dax_size - offset));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+				    size_t nr_pages)
+{
+	long resid = nr_pages << PAGE_SHIFT;
+	long offset = pgoff << PAGE_SHIFT;
+
+	/* Break into one write per dax region */
+	while (resid > 0) {
+		void *kaddr;
+		pgoff_t poff = offset >> PAGE_SHIFT;
+		long len = __dev_dax_direct_access(dax_dev, poff,
+						   nr_pages, DAX_ACCESS, &kaddr, NULL);
+		len = min_t(long, len, PAGE_SIZE);
+		write_dax(kaddr, ZERO_PAGE(0), offset, len);
+
+		offset += len;
+		resid  -= len;
+	}
+	return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev,
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, pfn_t *pfn)
+{
+	return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+		void *addr, size_t bytes, struct iov_iter *i)
+{
+	size_t len, off;
+
+	off = offset_in_page(addr);
+	len = PFN_PHYS(PFN_UP(off + bytes));
+
+	return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+	.direct_access = dev_dax_direct_access,
+	.zero_page_range = dev_dax_zero_page_range,
+	.recovery_write = dev_dax_recovery_write,
+};
+
+#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dax_region *dax_region = data->dax_region;
@@ -1404,11 +1505,17 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 		}
 	}
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+	/* holder_ops currently populated separately in a slightly hacky way */
+	dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
+#else
 	/*
 	 * No dax_operations since there is no access to this device outside of
 	 * mmap of the resulting character device.
 	 */
 	dax_dev = alloc_dax(dev_dax, NULL);
+#endif
+
 	if (IS_ERR(dax_dev)) {
 		rc = PTR_ERR(dax_dev);
 		goto err_alloc_dax;
-- 
2.43.0



* [RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (4 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 12:34   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h John Groves
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

Add the CONFIG_DEV_DAX_IOMAP kernel config parameter to control building
of the iomap functionality to support fsdax on devdax.

Signed-off-by: John Groves <john@groves.net>
---
 drivers/dax/Kconfig | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a88744244149..b1ebcc77120b 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,10 @@ config DEV_DAX_KMEM
 
 	  Say N if unsure.
 
+config DEV_DAX_IOMAP
+       depends on DEV_DAX && DAX
+       def_bool y
+       help
+         Support iomap mapping of devdax devices (for FS-DAX file
+         systems that reside on character /dev/dax devices)
 endif
-- 
2.43.0



* [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (5 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-24  1:39   ` Randy Dunlap
  2024-02-26 12:39   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 08/20] famfs: Add famfs_internal.h John Groves
                   ` (14 subsequent siblings)
  21 siblings, 2 replies; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

Add uapi include file for famfs. The famfs user space uses ioctl on
individual files to pass in mapping information and file size. This
would be hard to do via sysfs or other means, since it's
file-specific.
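
As a rough illustration of what user space hands to the kernel, the
sketch below fills in a single-extent map for a regular file. The
struct definitions are local mirrors of the ones in this patch (with
stdint types standing in for __u64), and famfs_fill_simple_map() is a
hypothetical helper, not part of the famfs user space. A real caller
would include the uapi header and pass the result to the
FAMFSIOC_MAP_CREATE ioctl on the open file.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FAMFS_MAX_EXTENTS 2

/* Local mirrors of the uapi types in this patch (assumption: same layout) */
enum extent_type { SIMPLE_DAX_EXTENT = 13, INVALID_EXTENT_TYPE };
enum famfs_file_type { FAMFS_REG, FAMFS_SUPERBLOCK, FAMFS_LOG };

struct famfs_extent {
	uint64_t offset;
	uint64_t len;
};

struct famfs_ioc_map {
	enum extent_type     extent_type;
	enum famfs_file_type file_type;
	uint64_t             file_size;
	uint64_t             ext_list_count;
	struct famfs_extent  ext_list[FAMFS_MAX_EXTENTS];
};

/*
 * Hypothetical helper: describe a regular file backed by one extent.
 * Returns 0 on success, -1 if the extent cannot hold file_size bytes.
 */
static int famfs_fill_simple_map(struct famfs_ioc_map *map,
				 uint64_t dax_offset, uint64_t ext_len,
				 uint64_t file_size)
{
	if (file_size > ext_len)
		return -1;

	memset(map, 0, sizeof(*map));
	map->extent_type        = SIMPLE_DAX_EXTENT;
	map->file_type          = FAMFS_REG;
	map->file_size          = file_size;
	map->ext_list_count     = 1;
	map->ext_list[0].offset = dax_offset; /* offset within the dax device */
	map->ext_list[0].len    = ext_len;
	return 0;
}
```

The size check is an assumption of this sketch; the patch itself does
not say where such validation happens.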

Signed-off-by: John Groves <john@groves.net>
---
 include/uapi/linux/famfs_ioctl.h | 56 ++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)
 create mode 100644 include/uapi/linux/famfs_ioctl.h

diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
new file mode 100644
index 000000000000..6b3e6452d02f
--- /dev/null
+++ b/include/uapi/linux/famfs_ioctl.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs and the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+#ifndef FAMFS_IOCTL_H
+#define FAMFS_IOCTL_H
+
+#include <linux/ioctl.h>
+#include <linux/uuid.h>
+
+#define FAMFS_MAX_EXTENTS 2
+
+enum extent_type {
+	SIMPLE_DAX_EXTENT = 13,
+	INVALID_EXTENT_TYPE,
+};
+
+struct famfs_extent {
+	__u64              offset;
+	__u64              len;
+};
+
+enum famfs_file_type {
+	FAMFS_REG,
+	FAMFS_SUPERBLOCK,
+	FAMFS_LOG,
+};
+
+/**
+ * struct famfs_ioc_map
+ *
+ * This is the metadata that indicates where the memory is for a famfs file
+ */
+struct famfs_ioc_map {
+	enum extent_type          extent_type;
+	enum famfs_file_type      file_type;
+	__u64                     file_size;
+	__u64                     ext_list_count;
+	struct famfs_extent       ext_list[FAMFS_MAX_EXTENTS];
+};
+
+#define FAMFSIOC_MAGIC 'u'
+
+/* famfs file ioctl opcodes */
+#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
+#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
+
+#endif /* FAMFS_IOCTL_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [RFC PATCH 08/20] famfs: Add famfs_internal.h
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (6 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 12:48   ` Jonathan Cameron
  2024-02-27 13:38   ` Christian Brauner
  2024-02-23 17:41 ` [RFC PATCH 09/20] famfs: Add super_operations John Groves
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

Add the famfs_internal.h include file. This contains internal data
structures such as the per-file metadata structure (famfs_file_meta)
and extent formats.

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_internal.h | 53 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 fs/famfs/famfs_internal.h

diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
new file mode 100644
index 000000000000..af3990d43305
--- /dev/null
+++ b/fs/famfs/famfs_internal.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs and the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+#ifndef FAMFS_INTERNAL_H
+#define FAMFS_INTERNAL_H
+
+#include <linux/atomic.h>
+#include <linux/famfs_ioctl.h>
+
+#define FAMFS_MAGIC 0x87b282ff
+
+#define FAMFS_BLKDEV_MODE (FMODE_READ|FMODE_WRITE)
+
+extern const struct file_operations      famfs_file_operations;
+
+/*
+ * Each famfs dax file has this hanging from its inode->i_private.
+ */
+struct famfs_file_meta {
+	int                   error;
+	enum famfs_file_type  file_type;
+	size_t                file_size;
+	enum extent_type      tfs_extent_type;
+	size_t                tfs_extent_ct;
+	struct famfs_extent   tfs_extents[];  /* flexible array */
+};
+
+struct famfs_mount_opts {
+	umode_t mode;
+};
+
+extern const struct iomap_ops             famfs_iomap_ops;
+extern const struct vm_operations_struct  famfs_file_vm_ops;
+
+#define ROOTDEV_STRLEN 80
+
+struct famfs_fs_info {
+	struct famfs_mount_opts  mount_opts;
+	struct file             *dax_filp;
+	struct dax_device       *dax_devp;
+	struct bdev_handle      *bdev_handle;
+	struct list_head         fsi_list;
+	char                    *rootdev;
+};
+
+#endif /* FAMFS_INTERNAL_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [RFC PATCH 09/20] famfs: Add super_operations
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (7 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 08/20] famfs: Add famfs_internal.h John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 12:51   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations John Groves
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

Introduce the famfs superblock operations.

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 fs/famfs/famfs_inode.c

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
new file mode 100644
index 000000000000..3329aff000d1
--- /dev/null
+++ b/fs/famfs/famfs_inode.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs and the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/backing-dev.h>
+#include <linux/sched.h>
+#include <linux/parser.h>
+#include <linux/magic.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
+#include <linux/seq_file.h>
+#include <linux/dax.h>
+#include <linux/hugetlb.h>
+#include <linux/uio.h>
+#include <linux/iomap.h>
+#include <linux/path.h>
+#include <linux/namei.h>
+#include <linux/pfn_t.h>
+#include <linux/blkdev.h>
+
+#include "famfs_internal.h"
+
+#define FAMFS_DEFAULT_MODE	0755
+
+static const struct super_operations famfs_ops;
+static const struct inode_operations famfs_file_inode_operations;
+static const struct inode_operations famfs_dir_inode_operations;
+
+/**********************************************************************************
+ * famfs super_operations
+ *
+ * TODO: implement a famfs_statfs() that shows size, free and available space, etc.
+ */
+
+/**
+ * famfs_show_options() - Display the mount options in /proc/mounts.
+ */
+static int famfs_show_options(
+	struct seq_file *m,
+	struct dentry   *root)
+{
+	struct famfs_fs_info *fsi = root->d_sb->s_fs_info;
+
+	if (fsi->mount_opts.mode != FAMFS_DEFAULT_MODE)
+		seq_printf(m, ",mode=%o", fsi->mount_opts.mode);
+
+	return 0;
+}
+
+static const struct super_operations famfs_ops = {
+	.statfs		= simple_statfs,
+	.drop_inode	= generic_delete_inode,
+	.show_options	= famfs_show_options,
+};
+
+
+MODULE_LICENSE("GPL");
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (8 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 09/20] famfs: Add super_operations John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 12:56   ` Jonathan Cameron
  2024-02-27 13:39   ` Christian Brauner
  2024-02-23 17:41 ` [RFC PATCH 11/20] famfs: Add fs_context_operations John Groves
                   ` (11 subsequent siblings)
  21 siblings, 2 replies; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

Famfs works on both /dev/pmem and /dev/dax devices. This commit introduces
the function that opens a block (pmem) device and the struct
dax_holder_operations that are needed for that ABI.

In this commit, support for opening character /dev/dax is stubbed. A
later commit introduces this capability.
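
The name-based dispatch that famfs_open_device() performs can be
sketched in user-space C as below. The enum and classify_source() are
hypothetical names used only for illustration. The substring test
mirrors the strstr() checks in this patch, which the code itself notes
is probably not the best way to tell the device types apart.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical classification of the mount source (illustration only) */
enum famfs_dev_kind {
	FAMFS_DEV_CHAR_DAX,	/* /dev/dax*: stubbed in this commit */
	FAMFS_DEV_BLOCK_PMEM,	/* /dev/pmem*: opened as a block device */
	FAMFS_DEV_UNSUPPORTED,	/* anything else is rejected */
};

/* Same order of checks as the patch: dax first, then pmem */
static enum famfs_dev_kind classify_source(const char *source)
{
	if (strstr(source, "/dev/dax"))
		return FAMFS_DEV_CHAR_DAX;
	if (strstr(source, "/dev/pmem"))
		return FAMFS_DEV_BLOCK_PMEM;
	return FAMFS_DEV_UNSUPPORTED;
}
```

With these checks, "/dev/dax0.0" is treated as character dax,
"/dev/pmem0" as block pmem, and "/dev/sda1" is rejected.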

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_inode.c | 83 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index 3329aff000d1..82c861998093 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -68,5 +68,88 @@ static const struct super_operations famfs_ops = {
 	.show_options	= famfs_show_options,
 };
 
+/***************************************************************************************
+ * dax_holder_operations for block dax
+ */
+
+static int
+famfs_blk_dax_notify_failure(
+	struct dax_device	*dax_devp,
+	u64			offset,
+	u64			len,
+	int			mf_flags)
+{
+
+	pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n",
+	       __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags);
+	return -EOPNOTSUPP;
+}
+
+const struct dax_holder_operations famfs_blk_dax_holder_ops = {
+	.notify_failure		= famfs_blk_dax_notify_failure,
+};
+
+static int
+famfs_open_char_device(
+	struct super_block *sb,
+	struct fs_context  *fc)
+{
+	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
+	       __func__, fc->source);
+	return -ENODEV;
+}
+
+/**
+ * famfs_open_device()
+ *
+ * Open the memory device. If it looks like /dev/dax, call famfs_open_char_device().
+ * Otherwise try to open it as a block/pmem device.
+ */
+static int
+famfs_open_device(
+	struct super_block *sb,
+	struct fs_context  *fc)
+{
+	struct famfs_fs_info *fsi = sb->s_fs_info;
+	struct dax_device    *dax_devp;
+	u64 start_off = 0;
+	struct bdev_handle   *handlep;
+
+	if (fsi->dax_devp) {
+		pr_err("%s: already mounted\n", __func__);
+		return -EALREADY;
+	}
+
+	if (strstr(fc->source, "/dev/dax")) /* There is probably a better way to check this */
+		return famfs_open_char_device(sb, fc);
+
+	if (!strstr(fc->source, "/dev/pmem")) { /* There is probably a better way to check this */
+		pr_err("%s: primary backing dev (%s) is not pmem\n",
+		       __func__, fc->source);
+		return -EINVAL;
+	}
+
+	handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);
+	if (IS_ERR(handlep)) {
+		pr_err("%s: failed bdev_open_by_path(%s)\n", __func__, fc->source);
+		return PTR_ERR(handlep);
+	}
+
+	dax_devp = fs_dax_get_by_bdev(handlep->bdev, &start_off,
+				      fsi  /* holder */,
+				      &famfs_blk_dax_holder_ops);
+	if (IS_ERR_OR_NULL(dax_devp)) {
+		pr_err("%s: unable to get daxdev from handlep->bdev\n", __func__);
+		bdev_release(handlep);
+		return -ENODEV;
+	}
+	fsi->bdev_handle = handlep;
+	fsi->dax_devp    = dax_devp;
+
+	pr_notice("%s: root device is block dax (%s)\n", __func__, fc->source);
+	return 0;
+}
+
+
 
 MODULE_LICENSE("GPL");
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [RFC PATCH 11/20] famfs: Add fs_context_operations
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (9 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 13:20   ` Jonathan Cameron
  2024-02-27 13:41   ` Christian Brauner
  2024-02-23 17:41 ` [RFC PATCH 12/20] famfs: Add inode_operations and file_system_type John Groves
                   ` (10 subsequent siblings)
  21 siblings, 2 replies; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This commit introduces the famfs fs_context_operations and
famfs_get_inode() which is used by the context operations.

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_inode.c | 178 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 178 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index 82c861998093..f98f82962d7b 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -41,6 +41,50 @@ static const struct super_operations famfs_ops;
 static const struct inode_operations famfs_file_inode_operations;
 static const struct inode_operations famfs_dir_inode_operations;
 
+static struct inode *famfs_get_inode(
+	struct super_block *sb,
+	const struct inode *dir,
+	umode_t             mode,
+	dev_t               dev)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (inode) {
+		struct timespec64       tv;
+
+		inode->i_ino = get_next_ino();
+		inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
+		inode->i_mapping->a_ops = &ram_aops;
+		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		mapping_set_unevictable(inode->i_mapping);
+		tv = inode_set_ctime_current(inode);
+		inode_set_mtime_to_ts(inode, tv);
+		inode_set_atime_to_ts(inode, tv);
+
+		switch (mode & S_IFMT) {
+		default:
+			init_special_inode(inode, mode, dev);
+			break;
+		case S_IFREG:
+			inode->i_op = &famfs_file_inode_operations;
+			inode->i_fop = &famfs_file_operations;
+			break;
+		case S_IFDIR:
+			inode->i_op = &famfs_dir_inode_operations;
+			inode->i_fop = &simple_dir_operations;
+
+			/* Directory inodes start off with i_nlink == 2 (for "." entry) */
+			inc_nlink(inode);
+			break;
+		case S_IFLNK:
+			inode->i_op = &page_symlink_inode_operations;
+			inode_nohighmem(inode);
+			break;
+		}
+	}
+	return inode;
+}
+
 /**********************************************************************************
  * famfs super_operations
  *
@@ -150,6 +194,140 @@ famfs_open_device(
 	return 0;
 }
 
+/*****************************************************************************************
+ * fs_context_operations
+ */
+static int
+famfs_fill_super(
+	struct super_block *sb,
+	struct fs_context  *fc)
+{
+	struct famfs_fs_info *fsi = sb->s_fs_info;
+	struct inode *inode;
+	int rc = 0;
+
+	sb->s_maxbytes		= MAX_LFS_FILESIZE;
+	sb->s_blocksize		= PAGE_SIZE;
+	sb->s_blocksize_bits	= PAGE_SHIFT;
+	sb->s_magic		= FAMFS_MAGIC;
+	sb->s_op		= &famfs_ops;
+	sb->s_time_gran		= 1;
+
+	rc = famfs_open_device(sb, fc);
+	if (rc)
+		goto out;
+
+	inode = famfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		rc = -ENOMEM;
+
+out:
+	return rc;
+}
+
+enum famfs_param {
+	Opt_mode,
+	Opt_dax,
+};
+
+const struct fs_parameter_spec famfs_fs_parameters[] = {
+	fsparam_u32oct("mode",	  Opt_mode),
+	fsparam_string("dax",     Opt_dax),
+	{}
+};
+
+static int famfs_parse_param(
+	struct fs_context   *fc,
+	struct fs_parameter *param)
+{
+	struct famfs_fs_info *fsi = fc->s_fs_info;
+	struct fs_parse_result result;
+	int opt;
+
+	opt = fs_parse(fc, famfs_fs_parameters, param, &result);
+	if (opt == -ENOPARAM) {
+		opt = vfs_parse_fs_param_source(fc, param);
+		if (opt != -ENOPARAM)
+			return opt;
+
+		return 0;
+	}
+	if (opt < 0)
+		return opt;
+
+	switch (opt) {
+	case Opt_mode:
+		fsi->mount_opts.mode = result.uint_32 & S_IALLUGO;
+		break;
+	case Opt_dax:
+		if (strcmp(param->string, "always"))
+			pr_notice("%s: invalid dax mode %s\n",
+				  __func__, param->string);
+		break;
+	}
+
+	return 0;
+}
+
+static DEFINE_MUTEX(famfs_context_mutex);
+static LIST_HEAD(famfs_context_list);
+
+static int famfs_get_tree(struct fs_context *fc)
+{
+	struct famfs_fs_info *fsi_entry;
+	struct famfs_fs_info *fsi = fc->s_fs_info;
+
+	fsi->rootdev = kstrdup(fc->source, GFP_KERNEL);
+	if (!fsi->rootdev)
+		return -ENOMEM;
+
+	/* Fail if famfs is already mounted from the same device */
+	mutex_lock(&famfs_context_mutex);
+	list_for_each_entry(fsi_entry, &famfs_context_list, fsi_list) {
+		if (strcmp(fsi_entry->rootdev, fc->source) == 0) {
+			mutex_unlock(&famfs_context_mutex);
+			pr_err("%s: already mounted from rootdev %s\n", __func__, fc->source);
+			return -EALREADY;
+		}
+	}
+
+	list_add(&fsi->fsi_list, &famfs_context_list);
+	mutex_unlock(&famfs_context_mutex);
+
+	return get_tree_nodev(fc, famfs_fill_super);
+
+}
+
+static void famfs_free_fc(struct fs_context *fc)
+{
+	struct famfs_fs_info *fsi = fc->s_fs_info;
+
+	if (fsi && fsi->rootdev)
+		kfree(fsi->rootdev);
+
+	kfree(fsi);
+}
+
+static const struct fs_context_operations famfs_context_ops = {
+	.free		= famfs_free_fc,
+	.parse_param	= famfs_parse_param,
+	.get_tree	= famfs_get_tree,
+};
+
+static int famfs_init_fs_context(struct fs_context *fc)
+{
+	struct famfs_fs_info *fsi;
+
+	fsi = kzalloc(sizeof(*fsi), GFP_KERNEL);
+	if (!fsi)
+		return -ENOMEM;
+
+	fsi->mount_opts.mode = FAMFS_DEFAULT_MODE;
+	fc->s_fs_info        = fsi;
+	fc->ops              = &famfs_context_ops;
+	return 0;
+}
 
 
 MODULE_LICENSE("GPL");
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [RFC PATCH 12/20] famfs: Add inode_operations and file_system_type
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (10 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 11/20] famfs: Add fs_context_operations John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 13:25   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 13/20] famfs: Add iomap_ops John Groves
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This commit introduces the famfs inode_operations. Nothing in these
inode_operations is really unique to famfs.

This commit also introduces the famfs_file_system_type struct and the
famfs_kill_sb() function.

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_inode.c | 132 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index f98f82962d7b..ab46ec50b70d 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -85,6 +85,109 @@ static struct inode *famfs_get_inode(
 	return inode;
 }
 
+/***************************************************************************
+ * famfs inode_operations: these are currently pretty much boilerplate
+ */
+
+static const struct inode_operations famfs_file_inode_operations = {
+	/* All generic */
+	.setattr	   = simple_setattr,
+	.getattr	   = simple_getattr,
+};
+
+
+/*
+ * File creation. Allocate an inode, and we're done.
+ */
+/* SMP-safe */
+static int
+famfs_mknod(
+	struct mnt_idmap *idmap,
+	struct inode     *dir,
+	struct dentry    *dentry,
+	umode_t           mode,
+	dev_t             dev)
+{
+	struct inode *inode = famfs_get_inode(dir->i_sb, dir, mode, dev);
+	int error           = -ENOSPC;
+
+	if (inode) {
+		struct timespec64       tv;
+
+		d_instantiate(dentry, inode);
+		dget(dentry);	/* Extra count - pin the dentry in core */
+		error = 0;
+		tv = inode_set_ctime_current(inode);
+		inode_set_mtime_to_ts(inode, tv);
+		inode_set_atime_to_ts(inode, tv);
+	}
+	return error;
+}
+
+static int famfs_mkdir(
+	struct mnt_idmap *idmap,
+	struct inode     *dir,
+	struct dentry    *dentry,
+	umode_t           mode)
+{
+	int retval = famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFDIR, 0);
+
+	if (!retval)
+		inc_nlink(dir);
+
+	return retval;
+}
+
+static int famfs_create(
+	struct mnt_idmap *idmap,
+	struct inode     *dir,
+	struct dentry    *dentry,
+	umode_t           mode,
+	bool              excl)
+{
+	return famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0);
+}
+
+static int famfs_symlink(
+	struct mnt_idmap *idmap,
+	struct inode     *dir,
+	struct dentry    *dentry,
+	const char       *symname)
+{
+	struct inode *inode;
+	int error = -ENOSPC;
+
+	inode = famfs_get_inode(dir->i_sb, dir, S_IFLNK | 0777, 0);
+	if (inode) {
+		int l = strlen(symname)+1;
+
+		error = page_symlink(inode, symname, l);
+		if (!error) {
+			struct timespec64       tv;
+
+			d_instantiate(dentry, inode);
+			dget(dentry);
+			tv = inode_set_ctime_current(inode);
+			inode_set_mtime_to_ts(inode, tv);
+			inode_set_atime_to_ts(inode, tv);
+		} else
+			iput(inode);
+	}
+	return error;
+}
+
+static const struct inode_operations famfs_dir_inode_operations = {
+	.create		= famfs_create,
+	.lookup		= simple_lookup,
+	.link		= simple_link,
+	.unlink		= simple_unlink,
+	.symlink	= famfs_symlink,
+	.mkdir		= famfs_mkdir,
+	.rmdir		= simple_rmdir,
+	.mknod		= famfs_mknod,
+	.rename		= simple_rename,
+};
+
 /**********************************************************************************
  * famfs super_operations
  *
@@ -329,5 +432,34 @@ static int famfs_init_fs_context(struct fs_context *fc)
 	return 0;
 }
 
+static void famfs_kill_sb(struct super_block *sb)
+{
+	struct famfs_fs_info *fsi = sb->s_fs_info;
+
+	mutex_lock(&famfs_context_mutex);
+	list_del(&fsi->fsi_list);
+	mutex_unlock(&famfs_context_mutex);
+
+	if (fsi->bdev_handle)
+		bdev_release(fsi->bdev_handle);
+	if (fsi->dax_devp)
+		fs_put_dax(fsi->dax_devp, fsi);
+	if (fsi->dax_filp) /* This only happens if it's char dax */
+		filp_close(fsi->dax_filp, NULL);
+
+	if (fsi && fsi->rootdev)
+		kfree(fsi->rootdev);
+	kfree(fsi);
+	kill_litter_super(sb);
+}
+
+#define MODULE_NAME "famfs"
+static struct file_system_type famfs_fs_type = {
+	.name		  = MODULE_NAME,
+	.init_fs_context  = famfs_init_fs_context,
+	.parameters	  = famfs_fs_parameters,
+	.kill_sb	  = famfs_kill_sb,
+	.fs_flags	  = FS_USERNS_MOUNT,
+};
 
 MODULE_LICENSE("GPL");
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [RFC PATCH 13/20] famfs: Add iomap_ops
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (11 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 12/20] famfs: Add inode_operations and file_system_type John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 13:30   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 14/20] famfs: Add struct file_operations John Groves
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This commit introduces the famfs iomap_ops. When either
dax_iomap_fault() or dax_iomap_rw() is called, we get a callback
via our iomap_begin() handler. The question being asked is
"please resolve (file, offset) to (daxdev, offset)". The function
famfs_meta_to_dax_offset() does this.

The per-file metadata is just an extent list into the backing dax
device. Resolution is O(N) for N extents. Note that with the current
user space, files usually have only one extent.
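
The extent walk can be sketched in user-space C as follows. This is a
simplified model of what famfs_meta_to_dax_offset() does, not the
kernel code: struct ext, struct mapping and resolve_offset() are
hypothetical names, and a zero length stands for "past the end of the
file", as in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ext {
	uint64_t offset;	/* start of the extent within the dax device */
	uint64_t len;		/* extent length in bytes */
};

struct mapping {
	uint64_t dax_offset;	/* where the data starts in the dax device */
	uint64_t length;	/* usable length; 0 means past EOF */
};

/*
 * Resolve (file offset, len) to (dax offset, trimmed len) by walking the
 * extent list in order. O(N) in the number of extents.
 */
static struct mapping resolve_offset(const struct ext *exts, size_t n,
				     uint64_t file_off, uint64_t len)
{
	struct mapping m = { .dax_offset = 0, .length = 0 };
	uint64_t local = file_off; /* offset minus the extents skipped so far */

	for (size_t i = 0; i < n; i++) {
		if (local < exts[i].len) {
			uint64_t remainder = exts[i].len - local;

			m.dax_offset = exts[i].offset + local;
			/* Trim the request so it does not cross the extent end */
			m.length = len < remainder ? len : remainder;
			return m;
		}
		local -= exts[i].len; /* get ready for the next extent */
	}
	return m; /* offset is beyond the last extent */
}
```

For a file with extents (0x200000, 0x1000) and (0x800000, 0x3000), a
request at file offset 0x1800 lands 0x800 bytes into the second extent,
so it resolves to dax offset 0x800800 and is trimmed to at most 0x2800
bytes.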

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_file.c | 245 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 245 insertions(+)
 create mode 100644 fs/famfs/famfs_file.c

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
new file mode 100644
index 000000000000..fc667d5f7be8
--- /dev/null
+++ b/fs/famfs/famfs_file.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * famfs - dax file system for shared fabric-attached memory
+ *
+ * Copyright 2023-2024 Micron Technology, Inc.
+ *
+ * This file system, originally based on ramfs and the dax support from xfs,
+ * is intended to allow multiple host systems to mount a common file system
+ * view of dax files that map to shared memory.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/dax.h>
+#include <linux/uio.h>
+#include <linux/iomap.h>
+#include <uapi/linux/famfs_ioctl.h>
+#include "famfs_internal.h"
+
+/*********************************************************************
+ * iomap_operations
+ *
+ * This stuff uses the iomap (dax-related) helpers to resolve file offsets to
+ * offsets within a dax device.
+ */
+
+/**
+ * famfs_meta_to_dax_offset()
+ *
+ * This function is called by famfs_iomap_begin() to resolve an offset in a file to
+ * an offset in a dax device. This is upcalled from dax from calls to both
+ * dax_iomap_fault() and dax_iomap_rw(). Dax finishes the job resolving a fault to
+ * a specific physical page (the fault case) or doing a memcpy variant (the rw case)
+ *
+ * Pages can be PTE (4k), PMD (2MiB) or (theoretically) PUD (1GiB)
+ * (these sizes are for x86; they may vary on other CPU architectures).
+ *
+ * @inode  - the file where the fault occurred
+ * @iomap  - struct iomap to be filled in to indicate where to find the right memory, relative
+ *           to a dax device.
+ * @offset - the offset within the file where the fault occurred (will be page boundary)
+ * @len    - the length of the faulted mapping (will be a page multiple)
+ *           (will be trimmed in *iomap if it's disjoint in the extent list)
+ * @flags
+ */
+static int
+famfs_meta_to_dax_offset(
+	struct inode *inode,
+	struct iomap *iomap,
+	loff_t        offset,
+	loff_t        len,
+	unsigned int  flags)
+{
+	struct famfs_file_meta *meta = (struct famfs_file_meta *)inode->i_private;
+	int i;
+	loff_t local_offset = offset;
+	struct famfs_fs_info  *fsi = inode->i_sb->s_fs_info;
+
+	iomap->offset = offset; /* file offset */
+
+	for (i = 0; i < meta->tfs_extent_ct; i++) {
+		loff_t dax_ext_offset = meta->tfs_extents[i].offset;
+		loff_t dax_ext_len    = meta->tfs_extents[i].len;
+
+		if ((dax_ext_offset == 0) && (meta->file_type != FAMFS_SUPERBLOCK))
+			pr_err("%s: zero offset on non-superblock file!!\n", __func__);
+
+		/* local_offset is the offset minus the size of extents skipped so far;
+		 * If local_offset < dax_ext_len, the data of interest starts in this extent
+		 */
+		if (local_offset < dax_ext_len) {
+			loff_t ext_len_remainder = dax_ext_len - local_offset;
+
+			/*
+			 * OK, we found the file metadata extent where this data begins
+			 * @local_offset      - The offset within the current extent
+			 * @ext_len_remainder - Remaining length of ext after skipping local_offset
+			 *
+			 * iomap->addr is the offset within the dax device where that data
+			 * starts
+			 */
+			iomap->addr    = dax_ext_offset + local_offset; /* dax dev offset */
+			iomap->offset  = offset; /* file offset */
+			iomap->length  = min_t(loff_t, len, ext_len_remainder);
+			iomap->dax_dev = fsi->dax_devp;
+			iomap->type    = IOMAP_MAPPED;
+			iomap->flags   = flags;
+
+			return 0;
+		}
+		local_offset -= dax_ext_len; /* Get ready for the next extent */
+	}
+
+	/* Set iomap to zero length in this case, and return 0
+	 * This just means that the r/w is past EOF
+	 */
+	iomap->addr    = offset;
+	iomap->offset  = offset; /* file offset */
+	iomap->length  = 0; /* this had better result in no access to dax mem */
+	iomap->dax_dev = fsi->dax_devp;
+	iomap->type    = IOMAP_MAPPED;
+	iomap->flags   = flags;
+
+	return 0;
+}
+
+/**
+ * famfs_iomap_begin()
+ *
+ * This function is pretty simple because files are
+ * * never partially allocated
+ * * never have holes (never sparse)
+ * * never "allocate on write"
+ */
+static int
+famfs_iomap_begin(
+	struct inode	       *inode,
+	loff_t			offset,
+	loff_t			length,
+	unsigned int		flags,
+	struct iomap	       *iomap,
+	struct iomap	       *srcmap)
+{
+	struct famfs_file_meta *meta = inode->i_private;
+	size_t size;
+	int rc;
+
+	size = i_size_read(inode);
+
+	WARN_ON(size != meta->file_size);
+
+	rc = famfs_meta_to_dax_offset(inode, iomap, offset, length, flags);
+
+	return rc;
+}
+
+/* Note: We never need a special set of write_iomap_ops because famfs never
+ * performs allocation on write.
+ */
+const struct iomap_ops famfs_iomap_ops = {
+	.iomap_begin		= famfs_iomap_begin,
+};
+
+/*********************************************************************
+ * vm_operations
+ */
+static vm_fault_t
+__famfs_filemap_fault(
+	struct vm_fault		*vmf,
+	unsigned int		pe_size,
+	bool			write_fault)
+{
+	struct inode		*inode = file_inode(vmf->vma->vm_file);
+	vm_fault_t		ret;
+
+	if (write_fault) {
+		sb_start_pagefault(inode->i_sb);
+		file_update_time(vmf->vma->vm_file);
+	}
+
+	if (IS_DAX(inode)) {
+		pfn_t pfn;
+
+		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
+		if (ret & VM_FAULT_NEEDDSYNC)
+			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+	} else {
+		/* All famfs faults will be dax... */
+		pr_err("%s: oops, non-dax fault\n", __func__);
+		ret = VM_FAULT_SIGBUS;
+	}
+
+	if (write_fault)
+		sb_end_pagefault(inode->i_sb);
+
+	return ret;
+}
+
+static inline bool
+famfs_is_write_fault(
+	struct vm_fault		*vmf)
+{
+	return (vmf->flags & FAULT_FLAG_WRITE) &&
+	       (vmf->vma->vm_flags & VM_SHARED);
+}
+
+static vm_fault_t
+famfs_filemap_fault(
+	struct vm_fault		*vmf)
+{
+	/* DAX can shortcut the normal fault path on write faults! */
+	return __famfs_filemap_fault(vmf, 0,
+			IS_DAX(file_inode(vmf->vma->vm_file)) && famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_huge_fault(
+	struct vm_fault	*vmf,
+	unsigned int	 pe_size)
+{
+	if (!IS_DAX(file_inode(vmf->vma->vm_file))) {
+		pr_err("%s: file not marked IS_DAX!!\n", __func__);
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* DAX can shortcut the normal fault path on write faults! */
+	return __famfs_filemap_fault(vmf, pe_size, famfs_is_write_fault(vmf));
+}
+
+static vm_fault_t
+famfs_filemap_page_mkwrite(
+	struct vm_fault		*vmf)
+{
+	return __famfs_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_pfn_mkwrite(
+	struct vm_fault		*vmf)
+{
+	return __famfs_filemap_fault(vmf, 0, true);
+}
+
+static vm_fault_t
+famfs_filemap_map_pages(
+	struct vm_fault	       *vmf,
+	pgoff_t			start_pgoff,
+	pgoff_t			end_pgoff)
+{
+	vm_fault_t ret;
+
+	ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
+	return ret;
+}
+
+const struct vm_operations_struct famfs_file_vm_ops = {
+	.fault		= famfs_filemap_fault,
+	.huge_fault	= famfs_filemap_huge_fault,
+	.map_pages	= famfs_filemap_map_pages,
+	.page_mkwrite	= famfs_filemap_page_mkwrite,
+	.pfn_mkwrite	= famfs_filemap_pfn_mkwrite,
+};
+
-- 
2.43.0



* [RFC PATCH 14/20] famfs: Add struct file_operations
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (12 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 13/20] famfs: Add iomap_ops John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 13:32   ` Jonathan Cameron
  2024-02-23 17:41 ` [RFC PATCH 15/20] famfs: Add ioctl to file_operations John Groves
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This commit introduces the famfs file_operations. We call
thp_get_unmapped_area() to force PMD page alignment. Our read and
write handlers (famfs_dax_read_iter() and famfs_dax_write_iter())
call dax_iomap_rw() to do the work.
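
Because famfs never allocates on write, both handlers clamp a request so
that it ends at EOF before passing it to dax_iomap_rw(). The following is
a minimal user-space sketch of that clamp, not the kernel code itself;
clamp_to_eof() and its parameters are hypothetical names chosen for
illustration. It checks the position before subtracting, so an offset at
or past EOF cannot wrap around in unsigned arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): return how many of
 * `count` bytes may be transferred starting at `pos` in a file of
 * `file_size` bytes without going past EOF.
 */
static size_t clamp_to_eof(uint64_t file_size, int64_t pos, size_t count)
{
	uint64_t remaining;

	/* A negative position, or one at/after EOF, leaves nothing to do. */
	if (pos < 0 || (uint64_t)pos >= file_size)
		return 0;

	remaining = file_size - (uint64_t)pos;
	assert(remaining > 0);	/* holds because pos < file_size */

	/* The cast is safe: remaining <= count here, so it fits in size_t. */
	return count < remaining ? count : (size_t)remaining;
}
```

A caller would shorten its iterator to the returned length and return 0
early when that length is 0.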

famfs_file_invalid() checks for various ways a famfs file can be
in an invalid state so we can fail I/O or fault resolution in those
cases. Those cases include the following:

* No famfs metadata
* file i_size does not match the originally allocated size
* file is not flagged as DAX
* errors were detected previously on the file

An invalid file can often be fixed by replaying the log, or by
umount/mount/log replay - all of which are user space operations.
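
The order of those checks can be sketched in user space as follows. This
is an illustration under assumptions, not the kernel implementation: the
structures and the name file_invalid() are hypothetical stand-ins, and
only the sequence of checks and the errno values follow the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures. */
struct file_meta {
	uint64_t file_size;	/* size recorded when the file was mapped */
	bool     error;		/* sticky flag: a problem was seen earlier */
};

struct file_state {
	struct file_meta *meta;	/* NULL until the file has been mapped */
	uint64_t          i_size;	/* current inode size */
	bool              is_dax;	/* file is flagged as DAX */
};

/* Return 0 if I/O may proceed, or a negative errno saying why not. */
static int file_invalid(struct file_state *f)
{
	assert(f != NULL);

	if (!f->meta)
		return -EIO;		/* no famfs metadata */
	if (f->i_size != f->meta->file_size) {
		f->meta->error = true;	/* remember the failure */
		return -ENXIO;
	}
	if (!f->is_dax) {
		f->meta->error = true;
		return -ENXIO;
	}
	if (f->meta->error)
		return -EIO;		/* errors were detected previously */
	return 0;
}
```

Note that the error flag is sticky in this sketch: once a size mismatch
has been seen, later calls keep failing even if the size is restored.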

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_file.c | 136 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index fc667d5f7be8..5228e9de1e3b 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,142 @@
 #include <uapi/linux/famfs_ioctl.h>
 #include "famfs_internal.h"
 
+/*********************************************************************
+ * file_operations
+ */
+
+/* Reject I/O to files that aren't in a valid state */
+static ssize_t
+famfs_file_invalid(struct inode *inode)
+{
+	size_t i_size       = i_size_read(inode);
+	struct famfs_file_meta *meta = inode->i_private;
+
+	if (!meta) {
+		pr_err("%s: un-initialized famfs file\n", __func__);
+		return -EIO;
+	}
+	if (i_size != meta->file_size) {
+		pr_err("%s: something changed the size from  %ld to %ld\n",
+		       __func__, meta->file_size, i_size);
+		meta->error = 1;
+		return -ENXIO;
+	}
+	if (!IS_DAX(inode)) {
+		pr_err("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
+		meta->error = 1;
+		return -ENXIO;
+	}
+	if (meta->error) {
+		pr_err("%s: previously detected metadata errors\n", __func__);
+		meta->error = 1;
+		return -EIO;
+	}
+	return 0;
+}
+
+static ssize_t
+famfs_dax_read_iter(
+	struct kiocb		*iocb,
+	struct iov_iter		*to)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	size_t i_size       = i_size_read(inode);
+	size_t count        = iov_iter_count(to);
+	size_t max_count;
+	ssize_t rc;
+
+	rc = famfs_file_invalid(inode);
+	if (rc)
+		return rc;
+
+	max_count = (iocb->ki_pos < (loff_t)i_size) ? i_size - iocb->ki_pos : 0;
+
+	if (count > max_count)
+		iov_iter_truncate(to, max_count);
+
+	if (!iov_iter_count(to))
+		return 0;
+
+	rc = dax_iomap_rw(iocb, to, &famfs_iomap_ops);
+
+	file_accessed(iocb->ki_filp);
+	return rc;
+}
+
+/**
+ * famfs_dax_write_iter()
+ *
+ * We need our own write_iter in order to prevent append
+ */
+static ssize_t
+famfs_dax_write_iter(
+	struct kiocb    *iocb,
+	struct iov_iter *from)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	size_t i_size       = i_size_read(inode);
+	size_t count        = iov_iter_count(from);
+	size_t max_count;
+	ssize_t rc;
+
+	rc = famfs_file_invalid(inode);
+	if (rc)
+		return rc;
+
+	/* Starting offset of write is: iocb->ki_pos
+	 * length is iov_iter_count(from)
+	 */
+	max_count = (iocb->ki_pos < (loff_t)i_size) ? i_size - iocb->ki_pos : 0;
+
+	/* If write would go past EOF, truncate it to end at EOF since famfs does not
+	 * alloc-on-write
+	 */
+	if (count > max_count)
+		iov_iter_truncate(from, max_count);
+
+	if (!iov_iter_count(from))
+		return 0;
+
+	return dax_iomap_rw(iocb, from, &famfs_iomap_ops);
+}
+
+static int
+famfs_file_mmap(
+	struct file		*file,
+	struct vm_area_struct	*vma)
+{
+	struct inode		*inode = file_inode(file);
+	ssize_t rc;
+
+	rc = famfs_file_invalid(inode);
+	if (rc)
+		return (int)rc;
+
+	file_accessed(file);
+	vma->vm_ops = &famfs_file_vm_ops;
+	vm_flags_set(vma, VM_HUGEPAGE);
+	return 0;
+}
+
+const struct file_operations famfs_file_operations = {
+	.owner             = THIS_MODULE,
+
+	/* Custom famfs operations */
+	.write_iter	   = famfs_dax_write_iter,
+	.read_iter	   = famfs_dax_read_iter,
+	.mmap		   = famfs_file_mmap,
+
+	/* Force PMD alignment for mmap */
+	.get_unmapped_area = thp_get_unmapped_area,
+
+	/* Generic Operations */
+	.fsync		   = noop_fsync,
+	.splice_read	   = filemap_splice_read,
+	.splice_write	   = iter_file_splice_write,
+	.llseek		   = generic_file_llseek,
+};
+
 /*********************************************************************
  * iomap_operations
  *
-- 
2.43.0



* [RFC PATCH 15/20] famfs: Add ioctl to file_operations
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (13 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 14/20] famfs: Add struct file_operations John Groves
@ 2024-02-23 17:41 ` John Groves
  2024-02-26 13:44   ` Jonathan Cameron
  2024-02-23 17:42 ` [RFC PATCH 16/20] famfs: Add fault counters John Groves
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:41 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This commit introduces the per-file ioctl function famfs_file_ioctl()
into struct file_operations, and introduces the famfs_file_init_dax()
function (which is called by famfs_file_ioctl()).

famfs_file_init_dax() associates a dax extent list with a file, making
it into a proper famfs file. It is called from the FAMFSIOC_MAP_CREATE
ioctl. Starting with an empty file (which is basically a ramfs file),
this turns the file into a DAX file backed by the specified extent list.

The other ioctls are:

FAMFSIOC_NOP - A convenient way for user space to verify it's a famfs file
FAMFSIOC_MAP_GET - Get the header of the metadata for a file
FAMFSIOC_MAP_GETEXT - Get the extents for a file

The latter two, together, are comparable to xfs_bmap. Our user space tools
use them primarily in testing.
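
The extent-list rules that FAMFSIOC_MAP_CREATE enforces (every offset
2MiB aligned, every length but the last a 2MiB multiple, and file size no
larger than the extent total) can be sketched in user space as below. The
names struct extent, EXTENT_ALIGN and validate_extents() are hypothetical,
the 2MiB value is an assumption standing in for PMD_SIZE, and the
superblock-specific offset-zero rule and the maximum-extent-count check
are left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EXTENT_ALIGN (2ULL * 1024 * 1024)	/* assumed stand-in for PMD_SIZE */

/* Hypothetical user-space view of one extent. */
struct extent {
	uint64_t offset;
	uint64_t len;
};

/* Return 0 if the list is acceptable, or a negative errno otherwise. */
static int validate_extents(const struct extent *ext, size_t count,
			    uint64_t file_size)
{
	uint64_t total = 0;
	size_t i;

	assert(ext != NULL || count == 0);

	if (count < 1)
		return -ENOSPC;	/* same errno the patch uses for an empty list */

	for (i = 0; i < count; i++) {
		/* Every extent must start on a 2MiB boundary. */
		if (ext[i].offset % EXTENT_ALIGN)
			return -EINVAL;
		/* All but the last extent must be a 2MiB multiple long. */
		if (i < count - 1 && ext[i].len % EXTENT_ALIGN)
			return -EINVAL;
		total += ext[i].len;
	}

	/* The file may be smaller than its extents, never larger. */
	if (file_size > total)
		return -EINVAL;

	return 0;
}
```

Unlike the patch, which counts all alignment errors before failing, this
sketch returns on the first one.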

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_file.c | 226 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 226 insertions(+)

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index 5228e9de1e3b..fd42d5966982 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,231 @@
 #include <uapi/linux/famfs_ioctl.h>
 #include "famfs_internal.h"
 
+/**
+ * famfs_map_meta_alloc() - Allocate famfs file metadata
+ * @mapp:       Pointer to an mcache_map_meta pointer
+ * @ext_count:  The number of extents needed
+ */
+static int
+famfs_meta_alloc(
+	struct famfs_file_meta  **metap,
+	size_t                    ext_count)
+{
+	struct famfs_file_meta *meta;
+	size_t                  metasz;
+
+	*metap = NULL;
+
+	metasz = sizeof(*meta) + sizeof(*(meta->tfs_extents)) * ext_count;
+
+	meta = kzalloc(metasz, GFP_KERNEL);
+	if (!meta)
+		return -ENOMEM;
+
+	meta->tfs_extent_ct = ext_count;
+	*metap = meta;
+
+	return 0;
+}
+
+static void
+famfs_meta_free(
+	struct famfs_file_meta *map)
+{
+	kfree(map);
+}
+
+/**
+ * famfs_file_init_dax() - FAMFSIOC_MAP_CREATE ioctl handler
+ * @file:
+ * @arg:        ptr to struct famfs_ioc_map in user space
+ *
+ * Set up the dax mapping for a file. Files are created empty, and then this function is
+ * called (by famfs_file_ioctl()) to set up the mapping and set the file size.
+ */
+static int
+famfs_file_init_dax(
+	struct file    *file,
+	void __user    *arg)
+{
+	struct famfs_extent    *tfs_extents = NULL;
+	struct famfs_file_meta *meta = NULL;
+	struct inode           *inode;
+	struct famfs_ioc_map    imap;
+	struct famfs_fs_info   *fsi;
+	struct super_block     *sb;
+	int    alignment_errs = 0;
+	size_t extent_total = 0;
+	size_t ext_count;
+	int    rc = 0;
+	int    i;
+
+	rc = copy_from_user(&imap, arg, sizeof(imap));
+	if (rc)
+		return -EFAULT;
+
+	ext_count = imap.ext_list_count;
+	if (ext_count < 1) {
+		rc = -ENOSPC;
+		goto errout;
+	}
+
+	if (ext_count > FAMFS_MAX_EXTENTS) {
+		rc = -E2BIG;
+		goto errout;
+	}
+
+	inode = file_inode(file);
+	if (!inode) {
+		rc = -EBADF;
+		goto errout;
+	}
+	sb  = inode->i_sb;
+	fsi = inode->i_sb->s_fs_info;
+
+	tfs_extents = &imap.ext_list[0];
+
+	rc = famfs_meta_alloc(&meta, ext_count);
+	if (rc)
+		goto errout;
+
+	meta->file_type = imap.file_type;
+	meta->file_size = imap.file_size;
+
+	/* Fill in the internal file metadata structure */
+	for (i = 0; i < imap.ext_list_count; i++) {
+		size_t len;
+		off_t  offset;
+
+		offset = imap.ext_list[i].offset;
+		len    = imap.ext_list[i].len;
+
+		extent_total += len;
+
+		if (WARN_ON(offset == 0 && meta->file_type != FAMFS_SUPERBLOCK)) {
+			rc = -EINVAL;
+			goto errout;
+		}
+
+		meta->tfs_extents[i].offset = offset;
+		meta->tfs_extents[i].len    = len;
+
+		/* All extent addresses/offsets must be 2MiB aligned,
+		 * and all but the last length must be a 2MiB multiple.
+		 */
+		if (!IS_ALIGNED(offset, PMD_SIZE)) {
+			pr_err("%s: error ext %d hpa %lx not aligned\n",
+			       __func__, i, offset);
+			alignment_errs++;
+		}
+		if (i < (imap.ext_list_count - 1) && !IS_ALIGNED(len, PMD_SIZE)) {
+			pr_err("%s: error ext %d length %ld not aligned\n",
+			       __func__, i, len);
+			alignment_errs++;
+		}
+	}
+
+	/*
+	 * File size can be <= ext list size, since extent sizes are constrained
+	 * to PMD multiples
+	 */
+	if (imap.file_size > extent_total) {
+		pr_err("%s: file size %lld larger than ext list size %lld\n",
+		       __func__, (u64)imap.file_size, (u64)extent_total);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	if (alignment_errs > 0) {
+		pr_err("%s: there were %d alignment errors in the extent list\n",
+		       __func__, alignment_errs);
+		rc = -EINVAL;
+		goto errout;
+	}
+
+	/* Publish the famfs metadata on inode->i_private */
+	inode_lock(inode);
+	if (inode->i_private) {
+		rc = -EEXIST; /* file already has famfs metadata */
+	} else {
+		inode->i_private = meta;
+		i_size_write(inode, imap.file_size);
+		inode->i_flags |= S_DAX;
+	}
+	inode_unlock(inode);
+
+ errout:
+	if (rc)
+		famfs_meta_free(meta);
+
+	return rc;
+}
+
+/**
+ * famfs_file_ioctl() -  top-level famfs file ioctl handler
+ * @file:
+ * @cmd:
+ * @arg:
+ */
+static
+long
+famfs_file_ioctl(
+	struct file    *file,
+	unsigned int    cmd,
+	unsigned long   arg)
+{
+	long rc;
+
+	switch (cmd) {
+	case FAMFSIOC_NOP:
+		rc = 0;
+		break;
+
+	case FAMFSIOC_MAP_CREATE:
+		rc = famfs_file_init_dax(file, (void *)arg);
+		break;
+
+	case FAMFSIOC_MAP_GET: {
+		struct inode *inode = file_inode(file);
+		struct famfs_file_meta *meta = inode->i_private;
+		struct famfs_ioc_map umeta;
+
+		memset(&umeta, 0, sizeof(umeta));
+
+		if (meta) {
+			/* TODO: do more to harmonize these structures */
+			umeta.extent_type    = meta->tfs_extent_type;
+			umeta.file_size      = i_size_read(inode);
+			umeta.ext_list_count = meta->tfs_extent_ct;
+
+			rc = copy_to_user((void __user *)arg, &umeta, sizeof(umeta));
+			if (rc)
+				rc = -EFAULT; /* copy_to_user() returns bytes not copied */
+
+		} else {
+			rc = -EINVAL;
+		}
+	}
+		break;
+	case FAMFSIOC_MAP_GETEXT: {
+		struct inode *inode = file_inode(file);
+		struct famfs_file_meta *meta = inode->i_private;
+
+		if (meta)
+			rc = copy_to_user((void __user *)arg, meta->tfs_extents,
+					  meta->tfs_extent_ct * sizeof(struct famfs_extent));
+		else
+			rc = -EINVAL;
+	}
+		break;
+	default:
+		rc = -ENOTTY;
+		break;
+	}
+
+	return rc;
+}
+
 /*********************************************************************
  * file_operations
  */
@@ -143,6 +368,7 @@ const struct file_operations famfs_file_operations = {
 	/* Custom famfs operations */
 	.write_iter	   = famfs_dax_write_iter,
 	.read_iter	   = famfs_dax_read_iter,
+	.unlocked_ioctl    = famfs_file_ioctl,
 	.mmap		   = famfs_file_mmap,
 
 	/* Force PMD alignment for mmap */
-- 
2.43.0



* [RFC PATCH 16/20] famfs: Add fault counters
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (14 preceding siblings ...)
  2024-02-23 17:41 ` [RFC PATCH 15/20] famfs: Add ioctl to file_operations John Groves
@ 2024-02-23 17:42 ` John Groves
  2024-02-23 18:23   ` Dave Hansen
  2024-02-23 17:42 ` [RFC PATCH 17/20] famfs: Add module stuff John Groves
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:42 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

One of the key requirements for famfs is that it service vma faults
efficiently. Our metadata helps - the search order is n for n extents,
and n is usually 1. But we can still observe gnarly lock contention
in mm if PTE faults are happening. This commit introduces fault counters
that can be enabled and read via /sys/fs/famfs/...

These counters have proved useful in troubleshooting situations where
PTE faults were happening instead of PMD. No performance impact when
disabled.
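
The counting scheme is small enough to sketch in user space with C11
atomics. This is only an illustration: the type and function names are
hypothetical, and the PMD and PUD orders are assumed values for x86-64
with 4KiB base pages rather than the kernel's PMD_ORDER and PUD_ORDER.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum fault_type { FAULT_PTE = 0, FAULT_PMD, FAULT_PUD, FAULT_NUM_TYPES };

/* Assumed orders for x86-64 with 4KiB base pages. */
#define SKETCH_PMD_ORDER 9
#define SKETCH_PUD_ORDER 18

struct fault_counters {
	atomic_uint_fast64_t ct[FAULT_NUM_TYPES];
};

static void clear_fault_counters(struct fault_counters *fc)
{
	assert(fc != NULL);
	for (int i = 0; i < FAULT_NUM_TYPES; i++)
		atomic_store(&fc->ct[i], 0);
}

/* Bump the counter matching the fault order; other orders are ignored. */
static void count_fault_by_order(struct fault_counters *fc, unsigned int order)
{
	assert(fc != NULL);
	switch (order) {
	case 0:
		atomic_fetch_add(&fc->ct[FAULT_PTE], 1);
		break;
	case SKETCH_PMD_ORDER:
		atomic_fetch_add(&fc->ct[FAULT_PMD], 1);
		break;
	case SKETCH_PUD_ORDER:
		atomic_fetch_add(&fc->ct[FAULT_PUD], 1);
		break;
	default:
		break;
	}
}

static uint64_t fault_count(struct fault_counters *fc, enum fault_type type)
{
	assert(fc != NULL && type < FAULT_NUM_TYPES);
	return atomic_load(&fc->ct[type]);
}
```

A PTE count that keeps growing while the PMD count stays flat is the
symptom these counters are meant to expose.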

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_file.c     | 97 +++++++++++++++++++++++++++++++++++++++
 fs/famfs/famfs_internal.h | 73 +++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)

diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index fd42d5966982..a626f8a89790 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,100 @@
 #include <uapi/linux/famfs_ioctl.h>
 #include "famfs_internal.h"
 
+/***************************************************************************************
+ * filemap_fault counters
+ *
+ * The counters and the fault_count_enable file live at
+ * /sys/fs/famfs/
+ */
+struct famfs_fault_counters ffc;
+static int fault_count_enable;
+
+static ssize_t
+fault_count_enable_show(struct kobject *kobj,
+			struct kobj_attribute *attr,
+			char *buf)
+{
+	return sprintf(buf, "%d\n", fault_count_enable);
+}
+
+static ssize_t
+fault_count_enable_store(struct kobject        *kobj,
+			 struct kobj_attribute *attr,
+			 const char            *buf,
+			 size_t                 count)
+{
+	int value;
+	int rc;
+
+	rc = sscanf(buf, "%d", &value);
+	if (rc != 1)
+		return -EINVAL;
+
+	if (value > 0) /* clear fault counters when enabling, but not when disabling */
+		famfs_clear_fault_counters(&ffc);
+
+	fault_count_enable = value;
+	return count;
+}
+
+/* Individual fault counters are read-only */
+static ssize_t
+fault_count_pte_show(struct kobject *kobj,
+		     struct kobj_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "%llu", famfs_pte_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pmd_show(struct kobject *kobj,
+		     struct kobj_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "%llu", famfs_pmd_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pud_show(struct kobject *kobj,
+		     struct kobj_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "%llu", famfs_pud_fault_ct(&ffc));
+}
+
+static struct kobj_attribute fault_count_enable_attribute = __ATTR(fault_count_enable,
+								   0660,
+								   fault_count_enable_show,
+								   fault_count_enable_store);
+static struct kobj_attribute fault_count_pte_attribute = __ATTR(pte_fault_ct,
+								0440,
+								fault_count_pte_show,
+								NULL);
+static struct kobj_attribute fault_count_pmd_attribute = __ATTR(pmd_fault_ct,
+								0440,
+								fault_count_pmd_show,
+								NULL);
+static struct kobj_attribute fault_count_pud_attribute = __ATTR(pud_fault_ct,
+								0440,
+								fault_count_pud_show,
+								NULL);
+
+
+static struct attribute *attrs[] = {
+	&fault_count_enable_attribute.attr,
+	&fault_count_pte_attribute.attr,
+	&fault_count_pmd_attribute.attr,
+	&fault_count_pud_attribute.attr,
+	NULL,
+};
+
+struct attribute_group famfs_attr_group = {
+	.attrs = attrs,
+};
+
+/* End fault counters */
+
 /**
  * famfs_map_meta_alloc() - Allocate famfs file metadata
  * @mapp:       Pointer to an mcache_map_meta pointer
@@ -525,6 +619,9 @@ __famfs_filemap_fault(
 	if (IS_DAX(inode)) {
 		pfn_t pfn;
 
+		if (fault_count_enable)
+			famfs_inc_fault_counter_by_order(&ffc, pe_size);
+
 		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
index af3990d43305..987cb172a149 100644
--- a/fs/famfs/famfs_internal.h
+++ b/fs/famfs/famfs_internal.h
@@ -50,4 +50,77 @@ struct famfs_fs_info {
 	char                    *rootdev;
 };
 
+/*
+ * filemap_fault counters
+ */
+extern struct attribute_group famfs_attr_group;
+
+enum famfs_fault {
+	FAMFS_PTE = 0,
+	FAMFS_PMD,
+	FAMFS_PUD,
+	FAMFS_NUM_FAULT_TYPES,
+};
+
+static inline int valid_fault_type(int type)
+{
+	if (unlikely(type < 0 || type > FAMFS_PUD))
+		return 0;
+	return 1;
+}
+
+struct famfs_fault_counters {
+	atomic64_t fault_ct[FAMFS_NUM_FAULT_TYPES];
+};
+
+extern struct famfs_fault_counters ffc;
+
+static inline void famfs_clear_fault_counters(struct famfs_fault_counters *fc)
+{
+	int i;
+
+	for (i = 0; i < FAMFS_NUM_FAULT_TYPES; i++)
+		atomic64_set(&fc->fault_ct[i], 0);
+}
+
+static inline void famfs_inc_fault_counter(struct famfs_fault_counters *fc,
+					   enum famfs_fault type)
+{
+	if (valid_fault_type(type))
+		atomic64_inc(&fc->fault_ct[type]);
+}
+
+static inline void famfs_inc_fault_counter_by_order(struct famfs_fault_counters *fc, int order)
+{
+	int pgf = -1;
+
+	switch (order) {
+	case 0:
+		pgf = FAMFS_PTE;
+		break;
+	case PMD_ORDER:
+		pgf = FAMFS_PMD;
+		break;
+	case PUD_ORDER:
+		pgf = FAMFS_PUD;
+		break;
+	}
+	famfs_inc_fault_counter(fc, pgf);
+}
+
+static inline u64 famfs_pte_fault_ct(struct famfs_fault_counters *fc)
+{
+	return atomic64_read(&fc->fault_ct[FAMFS_PTE]);
+}
+
+static inline u64 famfs_pmd_fault_ct(struct famfs_fault_counters *fc)
+{
+	return atomic64_read(&fc->fault_ct[FAMFS_PMD]);
+}
+
+static inline u64 famfs_pud_fault_ct(struct famfs_fault_counters *fc)
+{
+	return atomic64_read(&fc->fault_ct[FAMFS_PUD]);
+}
+
 #endif /* FAMFS_INTERNAL_H */
-- 
2.43.0



* [RFC PATCH 17/20] famfs: Add module stuff
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (15 preceding siblings ...)
  2024-02-23 17:42 ` [RFC PATCH 16/20] famfs: Add fault counters John Groves
@ 2024-02-23 17:42 ` John Groves
  2024-02-26 13:47   ` Jonathan Cameron
  2024-02-23 17:42 ` [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch John Groves
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:42 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This commit introduces the module init and exit machinery for famfs.

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index ab46ec50b70d..0d659820e8ff 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -462,4 +462,48 @@ static struct file_system_type famfs_fs_type = {
 	.fs_flags	  = FS_USERNS_MOUNT,
 };
 
+/*****************************************************************************************
+ * Module stuff
+ */
+static struct kobject *famfs_kobj;
+
+static int __init init_famfs_fs(void)
+{
+	int rc;
+
+#if defined(CONFIG_DEV_DAX_IOMAP)
+	pr_notice("%s: Your kernel supports famfs on /dev/dax\n", __func__);
+#else
+	pr_notice("%s: Your kernel does not support famfs on /dev/dax\n", __func__);
+#endif
+	famfs_kobj = kobject_create_and_add(MODULE_NAME, fs_kobj);
+	if (!famfs_kobj) {
+		pr_warn("Failed to create kobject\n");
+		return -ENOMEM;
+	}
+
+	rc = sysfs_create_group(famfs_kobj, &famfs_attr_group);
+	if (rc) {
+		kobject_put(famfs_kobj);
+		pr_warn("%s: Failed to create sysfs group\n", __func__);
+		return rc;
+	}
+
+	return register_filesystem(&famfs_fs_type);
+}
+
+static void
+__exit famfs_exit(void)
+{
+	sysfs_remove_group(famfs_kobj,  &famfs_attr_group);
+	kobject_put(famfs_kobj);
+	unregister_filesystem(&famfs_fs_type);
+	pr_info("%s: unregistered\n", __func__);
+}
+
+
+fs_initcall(init_famfs_fs);
+module_exit(famfs_exit);
+
+MODULE_AUTHOR("John Groves, Micron Technology");
 MODULE_LICENSE("GPL");
-- 
2.43.0



* [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (16 preceding siblings ...)
  2024-02-23 17:42 ` [RFC PATCH 17/20] famfs: Add module stuff John Groves
@ 2024-02-23 17:42 ` John Groves
  2024-02-26 13:52   ` Jonathan Cameron
  2024-02-23 17:42 ` [RFC PATCH 19/20] famfs: Update MAINTAINERS file John Groves
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:42 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This commit introduces the ability to open a character /dev/dax device
instead of a block /dev/pmem device. This rests on the dev_dax_iomap
patches earlier in this series.

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_inode.c | 97 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 10 deletions(-)

diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
index 0d659820e8ff..7d65ac497147 100644
--- a/fs/famfs/famfs_inode.c
+++ b/fs/famfs/famfs_inode.c
@@ -215,6 +215,93 @@ static const struct super_operations famfs_ops = {
 	.show_options	= famfs_show_options,
 };
 
+/*****************************************************************************/
+
+#if defined(CONFIG_DEV_DAX_IOMAP)
+
+/*
+ * famfs dax_operations  (for char dax)
+ */
+static int
+famfs_dax_notify_failure(struct dax_device *dax_dev, u64 offset,
+			u64 len, int mf_flags)
+{
+	pr_err("%s: offset %lld len %llu flags %x\n", __func__,
+	       offset, len, mf_flags);
+	return -EOPNOTSUPP;
+}
+
+static const struct dax_holder_operations famfs_dax_holder_ops = {
+	.notify_failure		= famfs_dax_notify_failure,
+};
+
+/*****************************************************************************/
+
+/**
+ * famfs_open_char_device()
+ *
+ * Open a /dev/dax device. This only works in kernels with the dev_dax_iomap patch
+ */
+static int
+famfs_open_char_device(
+	struct super_block *sb,
+	struct fs_context  *fc)
+{
+	struct famfs_fs_info *fsi = sb->s_fs_info;
+	struct dax_device    *dax_devp;
+	struct inode         *daxdev_inode;
+
+	int rc = 0;
+
+	pr_notice("%s: Opening character dax device %s\n", __func__, fc->source);
+
+	fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);
+	if (IS_ERR(fsi->dax_filp)) {
+		rc = PTR_ERR(fsi->dax_filp); /* save the error before clearing the pointer */
+		pr_err("%s: failed to open dax device %s\n", __func__, fc->source);
+		fsi->dax_filp = NULL;
+		return rc;
+	}
+
+	daxdev_inode = file_inode(fsi->dax_filp);
+	dax_devp     = inode_dax(daxdev_inode);
+	if (IS_ERR(dax_devp)) {
+		pr_err("%s: unable to get daxdev from inode for %s\n",
+		       __func__, fc->source);
+		rc = -ENODEV;
+		goto char_err;
+	}
+
+	rc = fs_dax_get(dax_devp, fsi, &famfs_dax_holder_ops);
+	if (rc) {
+		pr_info("%s: err attaching famfs_dax_holder_ops\n", __func__);
+		goto char_err;
+	}
+
+	fsi->bdev_handle = NULL;
+	fsi->dax_devp = dax_devp;
+
+	return 0;
+
+char_err:
+	filp_close(fsi->dax_filp, NULL);
+	return rc;
+}
+
+#else /* CONFIG_DEV_DAX_IOMAP */
+static int
+famfs_open_char_device(
+	struct super_block *sb,
+	struct fs_context  *fc)
+{
+	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
+	       __func__, fc->source);
+	return -ENODEV;
+}
+
+
+#endif /* CONFIG_DEV_DAX_IOMAP */
+
 /***************************************************************************************
  * dax_holder_operations for block dax
  */
@@ -236,16 +323,6 @@ const struct dax_holder_operations famfs_blk_dax_holder_ops = {
 	.notify_failure		= famfs_blk_dax_notify_failure,
 };
 
-static int
-famfs_open_char_device(
-	struct super_block *sb,
-	struct fs_context  *fc)
-{
-	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
-	       __func__, fc->source);
-	return -ENODEV;
-}
-
 /**
  * famfs_open_device()
  *
-- 
2.43.0



* [RFC PATCH 19/20] famfs: Update MAINTAINERS file
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (17 preceding siblings ...)
  2024-02-23 17:42 ` [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch John Groves
@ 2024-02-23 17:42 ` John Groves
  2024-02-23 17:42 ` [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing John Groves
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-23 17:42 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

This patch adds famfs to the MAINTAINERS file.

Signed-off-by: John Groves <john@groves.net>
---
 MAINTAINERS | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 73d898383e51..e4e8bf3602bb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8097,6 +8097,17 @@ F:	Documentation/networking/failover.rst
 F:	include/net/failover.h
 F:	net/core/failover.c
 
+FAMFS
+M:	John Groves <jgroves@micron.com>
+M:	John Groves <John@Groves.net>
+M:	John Groves <john@jagalactic.com>
+L:	linux-cxl@vger.kernel.org
+L:	linux-fsdevel@vger.kernel.org
+S:	Supported
+F:	Documentation/filesystems/famfs.rst
+F:	fs/famfs
+F:	include/uapi/linux/famfs_ioctl.h
+
 FANOTIFY
 M:	Jan Kara <jack@suse.cz>
 R:	Amir Goldstein <amir73il@gmail.com>
-- 
2.43.0



* [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (18 preceding siblings ...)
  2024-02-23 17:42 ` [RFC PATCH 19/20] famfs: Update MAINTAINERS file John Groves
@ 2024-02-23 17:42 ` John Groves
  2024-02-24  1:50   ` Randy Dunlap
  2024-02-24  0:07 ` [RFC PATCH 00/20] Introduce the famfs shared-memory file system Luis Chamberlain
  2024-02-29  6:52 ` Amir Goldstein
  21 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 17:42 UTC (permalink / raw)
  To: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: John, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price, John Groves

Add famfs Kconfig and Makefile, and hook them into fs/Kconfig and fs/Makefile.

Signed-off-by: John Groves <john@groves.net>
---
 fs/Kconfig        |  2 ++
 fs/Makefile       |  1 +
 fs/famfs/Kconfig  | 10 ++++++++++
 fs/famfs/Makefile |  5 +++++
 4 files changed, 18 insertions(+)
 create mode 100644 fs/famfs/Kconfig
 create mode 100644 fs/famfs/Makefile

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..8a11625a54a2 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -141,6 +141,8 @@ source "fs/autofs/Kconfig"
 source "fs/fuse/Kconfig"
 source "fs/overlayfs/Kconfig"
 
+source "fs/famfs/Kconfig"
+
 menu "Caches"
 
 source "fs/netfs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index c09016257f05..382c1ea4f4c3 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-$(CONFIG_FAMFS)             += famfs/
diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig
new file mode 100644
index 000000000000..e450928d8912
--- /dev/null
+++ b/fs/famfs/Kconfig
@@ -0,0 +1,10 @@
+
+
+config FAMFS
+       tristate "famfs: shared memory file system"
+       depends on DEV_DAX && FS_DAX
+       help
+         Support for the famfs file system. Famfs is a dax file system that
+	 can support scale-out shared access to fabric-attached memory
+	 (e.g. CXL shared memory). Famfs is not a general purpose file system;
+	 it is an enabler for data sets in shared memory.
diff --git a/fs/famfs/Makefile b/fs/famfs/Makefile
new file mode 100644
index 000000000000..8cac90c090a4
--- /dev/null
+++ b/fs/famfs/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_FAMFS) += famfs.o
+
+famfs-y := famfs_inode.o famfs_file.o
-- 
2.43.0



* Re: [RFC PATCH 16/20] famfs: Add fault counters
  2024-02-23 17:42 ` [RFC PATCH 16/20] famfs: Add fault counters John Groves
@ 2024-02-23 18:23   ` Dave Hansen
  2024-02-23 19:56     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Dave Hansen @ 2024-02-23 18:23 UTC (permalink / raw)
  To: John Groves, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: john, Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

On 2/23/24 09:42, John Groves wrote:
> One of the key requirements for famfs is that it service vma faults
> efficiently. Our metadata helps - the search order is n for n extents,
> and n is usually 1. But we can still observe gnarly lock contention
> in mm if PTE faults are happening. This commit introduces fault counters
> that can be enabled and read via /sys/fs/famfs/...
> 
> These counters have proved useful in troubleshooting situations where
> PTE faults were happening instead of PMD. No performance impact when
> disabled.

This seems kinda wonky.  Why does _this_ specific filesystem need its
own fault counters?  Seems like something we'd want to do much more
generically, if it is needed at all.

Was the issue here just that vm_ops->fault() was getting called instead
of ->huge_fault()?  Or something more subtle?

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 16/20] famfs: Add fault counters
  2024-02-23 18:23   ` Dave Hansen
@ 2024-02-23 19:56     ` John Groves
  2024-02-23 20:04       ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 19:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/23 10:23AM, Dave Hansen wrote:
> On 2/23/24 09:42, John Groves wrote:
> > One of the key requirements for famfs is that it service vma faults
> > efficiently. Our metadata helps - the search order is n for n extents,
> > and n is usually 1. But we can still observe gnarly lock contention
> > in mm if PTE faults are happening. This commit introduces fault counters
> > that can be enabled and read via /sys/fs/famfs/...
> > 
> > These counters have proved useful in troubleshooting situations where
> > PTE faults were happening instead of PMD. No performance impact when
> > disabled.
> 
> This seems kinda wonky.  Why does _this_ specific filesystem need its
> own fault counters.  Seems like something we'd want to do much more
> generically, if it is needed at all.
> 
> Was the issue here just that vm_ops->fault() was getting called instead
> of ->huge_fault()?  Or something more subtle?

Thanks for your reply Dave!

First, I'm willing to pull the fault counters out if the brain trust doesn't
like them.

I put them in because we were running benchmarks of computational data
analytics and noted that jobs took 3x as long on famfs as raw dax -
which indicated I was doing something wrong, because it should be equivalent
or very close.

The solution was to call thp_get_unmapped_area() in
famfs_file_operations, and performance doesn't vary significantly from raw
dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
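
A minimal user-space sketch of the alignment condition behind that fix,
not the famfs or kernel code itself. It assumes a 2 MiB PMD size (x86-64
with 4 KiB base pages); the names example_pick_addr() and
example_huge_mappable() are hypothetical. The check is only a necessary
condition for a PMD fault: the virtual address and the file offset must be
congruent modulo the huge-page size. The adjustment is in the spirit of
what an alignment-aware get_unmapped_area helper does, and it assumes the
caller reserved at least `size` extra bytes so the shifted address still
fits.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD mappings (x86-64, 4 KiB base pages). */
#define EXAMPLE_PMD_SIZE (UINT64_C(2) << 20)

/*
 * Shift base upward (by less than size) so that the returned address and
 * file_off are congruent modulo size. size must be a power of two.
 */
uint64_t example_pick_addr(uint64_t base, uint64_t file_off, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return base + ((file_off - base) & (size - 1));
}

/*
 * Return 1 if a mapping that places file_off at vaddr could be served by
 * huge mappings of the given size, 0 otherwise. Necessary, not sufficient.
 */
int example_huge_mappable(uint64_t vaddr, uint64_t file_off, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return ((vaddr - file_off) & (size - 1)) == 0;
}
```

An address that fails example_huge_mappable() can only be populated with
smaller entries, which matches the PTE-fault behavior described above.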

After that I wanted a way to be double-secret-certain that it was servicing
PMD faults as intended. Which it basically always is, so far. (The smoke
tests in user space check this.)

John

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 16/20] famfs: Add fault counters
  2024-02-23 19:56     ` John Groves
@ 2024-02-23 20:04       ` Dan Williams
  2024-02-23 20:39         ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Dan Williams @ 2024-02-23 20:04 UTC (permalink / raw)
  To: John Groves, Dave Hansen
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

John Groves wrote:
> On 24/02/23 10:23AM, Dave Hansen wrote:
> > On 2/23/24 09:42, John Groves wrote:
> > > One of the key requirements for famfs is that it service vma faults
> > > efficiently. Our metadata helps - the search order is n for n extents,
> > > and n is usually 1. But we can still observe gnarly lock contention
> > > in mm if PTE faults are happening. This commit introduces fault counters
> > > that can be enabled and read via /sys/fs/famfs/...
> > > 
> > > These counters have proved useful in troubleshooting situations where
> > > PTE faults were happening instead of PMD. No performance impact when
> > > disabled.
> > 
> > This seems kinda wonky.  Why does _this_ specific filesystem need its
> > own fault counters.  Seems like something we'd want to do much more
> > generically, if it is needed at all.
> > 
> > Was the issue here just that vm_ops->fault() was getting called instead
> > of ->huge_fault()?  Or something more subtle?
> 
> Thanks for your reply Dave!
> 
> First, I'm willing to pull the fault counters out if the brain trust doesn't
> like them.
> 
> > I put them in because we were running benchmarks of computational data
> > analytics and noted that jobs took 3x as long on famfs as raw dax -
> > which indicated I was doing something wrong, because it should be equivalent
> > or very close.
> > 
> > The solution was to call thp_get_unmapped_area() in
> > famfs_file_operations, and performance doesn't vary significantly from raw
> > dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
> 
> After that I wanted a way to be double-secret-certain that it was servicing
> PMD faults as intended. Which it basically always is, so far. (The smoke
> tests in user space check this.)

We had similar unit test regression concerns with fsdax where some
upstream change silently broke PMD faults. The solution there was trace
points in the fault handlers and a basic test that knows a priori that it
*should* be triggering a certain number of huge faults:

https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 16/20] famfs: Add fault counters
  2024-02-23 20:04       ` Dan Williams
@ 2024-02-23 20:39         ` John Groves
  2024-02-23 21:19           ` Dave Hansen
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-23 20:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, John Groves, Jonathan Corbet, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/23 12:04PM, Dan Williams wrote:
> John Groves wrote:
> > On 24/02/23 10:23AM, Dave Hansen wrote:
> > > On 2/23/24 09:42, John Groves wrote:
> > > > One of the key requirements for famfs is that it service vma faults
> > > > efficiently. Our metadata helps - the search order is n for n extents,
> > > > and n is usually 1. But we can still observe gnarly lock contention
> > > > in mm if PTE faults are happening. This commit introduces fault counters
> > > > that can be enabled and read via /sys/fs/famfs/...
> > > > 
> > > > These counters have proved useful in troubleshooting situations where
> > > > PTE faults were happening instead of PMD. No performance impact when
> > > > disabled.
> > > 
> > > This seems kinda wonky.  Why does _this_ specific filesystem need its
> > > own fault counters.  Seems like something we'd want to do much more
> > > generically, if it is needed at all.
> > > 
> > > Was the issue here just that vm_ops->fault() was getting called instead
> > > of ->huge_fault()?  Or something more subtle?
> > 
> > Thanks for your reply Dave!
> > 
> > First, I'm willing to pull the fault counters out if the brain trust doesn't
> > like them.
> > 
> > > I put them in because we were running benchmarks of computational data
> > > analytics and noted that jobs took 3x as long on famfs as raw dax -
> > > which indicated I was doing something wrong, because it should be equivalent
> > > or very close.
> > > 
> > > The solution was to call thp_get_unmapped_area() in
> > > famfs_file_operations, and performance doesn't vary significantly from raw
> > > dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
> > 
> > After that I wanted a way to be double-secret-certain that it was servicing
> > PMD faults as intended. Which it basically always is, so far. (The smoke
> > tests in user space check this.)
> 
> We had similar unit test regression concerns with fsdax where some
> upstream change silently broke PMD faults. The solution there was trace
> points in the fault handlers and a basic test that knows a priori that it
> *should* be triggering a certain number of huge faults:
> 
> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31

Good approach, thanks Dan! My working assumption is that we'll be able to make
that approach work in the famfs tests. So the fault counters should go away
in the next version.

John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 16/20] famfs: Add fault counters
  2024-02-23 20:39         ` John Groves
@ 2024-02-23 21:19           ` Dave Hansen
  2024-02-23 23:50             ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: Dave Hansen @ 2024-02-23 21:19 UTC (permalink / raw)
  To: John Groves, Dan Williams
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	linux-cxl, linux-fsdevel, linux-doc, linux-kernel, nvdimm, john,
	Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

On 2/23/24 12:39, John Groves wrote:
>> We had similar unit test regression concerns with fsdax where some
>> upstream change silently broke PMD faults. The solution there was trace
>> points in the fault handlers and a basic test that knows a priori that it
>> *should* be triggering a certain number of huge faults:
>>
>> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> Good approach, thanks Dan! My working assumption is that we'll be able to make
> that approach work in the famfs tests. So the fault counters should go away
> in the next version.

I do really suspect there's something more generic that should be done
here.  Maybe we need a generic 'huge_faults' perf event to pair up with
the good ol' faults that we already have:

# perf stat -e faults /bin/ls

 Performance counter stats for '/bin/ls':

               104      faults


       0.001499862 seconds time elapsed

       0.001490000 seconds user
       0.000000000 seconds sys
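
For comparison, a minimal sketch of counting faults around a workload from
inside a process, using the per-process counters that getrusage(2)
reports. The function names are hypothetical. Like the `faults` event
above, these counters say nothing about the size of the mapping each fault
installed, which is the gap a 'huge_faults' event would fill; the number
returned depends on the system's THP settings, so no expected value is
given.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Return this process's minor + major fault count so far, or -1 on error. */
long example_fault_count(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_minflt + ru.ru_majflt;
}

/*
 * Map len bytes of fresh anonymous memory, write one byte every step
 * bytes, and return how many faults were accounted meanwhile (-1 on error).
 */
long example_faults_for_touch(size_t len, size_t step)
{
	long before, after;
	size_t off;
	char *p;

	assert(len != 0 && step != 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	before = example_fault_count();
	for (off = 0; off < len; off += step)
		((volatile char *)p)[off] = 1;
	after = example_fault_count();

	munmap(p, len);
	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

A regression test in this style would compare the returned count against
the number it expects for a known mapping size, which only works once the
count can be split by fault size.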




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 16/20] famfs: Add fault counters
  2024-02-23 21:19           ` Dave Hansen
@ 2024-02-23 23:50             ` Dan Williams
  2024-02-24  3:59               ` Matthew Wilcox
  0 siblings, 1 reply; 105+ messages in thread
From: Dan Williams @ 2024-02-23 23:50 UTC (permalink / raw)
  To: Dave Hansen, John Groves, Dan Williams
  Cc: John Groves, Jonathan Corbet, Vishal Verma, Dave Jiang,
	Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox,
	linux-cxl, linux-fsdevel, linux-doc, linux-kernel, nvdimm, john,
	Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

Dave Hansen wrote:
> On 2/23/24 12:39, John Groves wrote:
> >> We had similar unit test regression concerns with fsdax where some
> >> upstream change silently broke PMD faults. The solution there was trace
> >> points in the fault handlers and a basic test that knows a priori that it
> >> *should* be triggering a certain number of huge faults:
> >>
> >> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> > Good approach, thanks Dan! My working assumption is that we'll be able to make
> > that approach work in the famfs tests. So the fault counters should go away
> > in the next version.
> 
> I do really suspect there's something more generic that should be done
> here.  Maybe we need a generic 'huge_faults' perf event to pair up with
> the good ol' faults that we already have:
> 
> # perf stat -e faults /bin/ls
> 
>  Performance counter stats for '/bin/ls':
> 
>                104      faults
> 
> 
>        0.001499862 seconds time elapsed
> 
>        0.001490000 seconds user
>        0.000000000 seconds sys

Certainly something like that would have satisfied this sanity test use
case. I will note that mm_account_fault() would need some help to figure
out the size of the page table entry that got installed. Maybe
extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
from PUD to PMD, or all the way back to PTE.

Then use cases like this could just add a dynamic probe in
mm_account_fault(). No real need for a new tracepoint unless there was a
use case for this outside of regression testing fault handlers, right?

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (19 preceding siblings ...)
  2024-02-23 17:42 ` [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing John Groves
@ 2024-02-24  0:07 ` Luis Chamberlain
  2024-02-26 13:27   ` John Groves
  2024-02-29  6:52 ` Amir Goldstein
  21 siblings, 1 reply; 105+ messages in thread
From: Luis Chamberlain @ 2024-02-24  0:07 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote:
> This patch set introduces famfs[1] - a special-purpose fs-dax file system
> for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> CXL-specific in any way.
> 
> * Famfs creates a simple access method for storing and sharing data in
>   sharable memory. The memory is exposed and accessed as memory-mappable
>   dax files.
> * Famfs supports multiple hosts mounting the same file system from the
>   same memory (something existing fs-dax file systems don't do).
> * A famfs file system can be created on either a /dev/pmem device in fs-dax
>   mode, or a /dev/dax device in devdax mode (the latter depending on
>   patches 2-6 of this series).
> 
> The famfs kernel file system is part of the famfs framework; additional
> components in user space[2] handle metadata and direct the famfs kernel
> module to instantiate files that map to specific memory. The famfs user
> space has documentation and a reasonably thorough test suite.
> 
> The famfs kernel module never accesses the shared memory directly (either
> data or metadata). Because of this, shared memory managed by the famfs
> framework does not create a RAS "blast radius" problem that should be able
> to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
> can be expected to kill apps via SIGBUS and cause mounts to be disabled
> due to memory failure notifications.
> 
> Famfs does not attempt to solve concurrency or coherency problems for apps,
> although it does solve these problems in regard to its own data structures.
> Apps may encounter hard concurrency problems, but there are use cases that
> are eminently useful and uncomplicated from a concurrency perspective:
> serial sharing is one (only one host at a time has access), and read-only
> concurrent sharing is another (all hosts can read-cache without worry).

Can you do me a favor, curious if you can run a test like this:

fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring \
    -direct=1 \
    --group_reporting=1 --alloc-size=1048576 --filesize=1GiB \
    --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1 \
    --directory=/mnt

What do you get for throughput?

The larger the system and its capacity, the better.

  Luis

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-23 17:41 ` [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h John Groves
@ 2024-02-24  1:39   ` Randy Dunlap
  2024-02-24  2:23     ` John Groves
  2024-02-26 12:39   ` Jonathan Cameron
  1 sibling, 1 reply; 105+ messages in thread
From: Randy Dunlap @ 2024-02-24  1:39 UTC (permalink / raw)
  To: John Groves, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: john, Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

Hi--

On 2/23/24 09:41, John Groves wrote:
> Add uapi include file for famfs. The famfs user space uses ioctl on
> individual files to pass in mapping information and file size. This
> would be hard to do via sysfs or other means, since it's
> file-specific.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  include/uapi/linux/famfs_ioctl.h | 56 ++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
>  create mode 100644 include/uapi/linux/famfs_ioctl.h
> 
> diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> new file mode 100644
> index 000000000000..6b3e6452d02f
> --- /dev/null
> +++ b/include/uapi/linux/famfs_ioctl.h
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2024 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs the dax support from xfs,

      This is confusing to me. Is it just me? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +#ifndef FAMFS_IOCTL_H
> +#define FAMFS_IOCTL_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/uuid.h>
> +
> +#define FAMFS_MAX_EXTENTS 2
> +
> +enum extent_type {
> +	SIMPLE_DAX_EXTENT = 13,
> +	INVALID_EXTENT_TYPE,
> +};
> +
> +struct famfs_extent {
> +	__u64              offset;
> +	__u64              len;
> +};
> +
> +enum famfs_file_type {
> +	FAMFS_REG,
> +	FAMFS_SUPERBLOCK,
> +	FAMFS_LOG,
> +};
> +
> +/**

"/**" is used to begin kernel-doc comments, but this comment block is missing
a few entries to make it be kernel-doc compatible. Please either add them
or just use "/*" to begin the comment.
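
A minimal sketch of what a kernel-doc-compatible block for a struct looks
like: a "struct name - summary" line, one "@member:" line per member, then
the free-form description. The struct below is a trimmed, hypothetical
stand-in (example_ioc_map), and the member descriptions are assumptions
about the intended meaning, not text from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/types.h>

#define EXAMPLE_MAX_EXTENTS 2

/* Stand-in for struct famfs_extent from the patch. */
struct example_extent {
	__u64 offset;
	__u64 len;
};

/**
 * struct example_ioc_map - location of the memory backing one file
 * @file_size: logical size of the file, in bytes (assumed meaning)
 * @ext_list_count: number of valid entries in @ext_list (assumed meaning)
 * @ext_list: extents that back the file
 *
 * This is the metadata that indicates where the memory is for a file.
 */
struct example_ioc_map {
	__u64 file_size;
	__u64 ext_list_count;
	struct example_extent ext_list[EXAMPLE_MAX_EXTENTS];
};

/* Sum the lengths of the valid extents; 0 if the count is out of range. */
__u64 example_extent_bytes(const struct example_ioc_map *map)
{
	__u64 total = 0;
	__u64 i;

	assert(map != NULL);
	if (map->ext_list_count > EXAMPLE_MAX_EXTENTS)
		return 0;
	for (i = 0; i < map->ext_list_count; i++)
		total += map->ext_list[i].len;
	return total;
}
```

The helper is there only to make the sketch self-checking; the point is
the comment layout above the struct.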

> + * struct famfs_ioc_map
> + *
> + * This is the metadata that indicates where the memory is for a famfs file
> + */
> +struct famfs_ioc_map {
> +	enum extent_type          extent_type;
> +	enum famfs_file_type      file_type;
> +	__u64                     file_size;
> +	__u64                     ext_list_count;
> +	struct famfs_extent       ext_list[FAMFS_MAX_EXTENTS];
> +};
> +
> +#define FAMFSIOC_MAGIC 'u'

This 'u' value should be documented in
Documentation/userspace-api/ioctl/ioctl-number.rst.

and if possible, you might want to use values like 0x5x or 0x8x
that don't conflict with the ioctl numbers that are already used
in the 'u' space.

> +
> +/* famfs file ioctl opcodes */
> +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
> +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
> +
> +#endif /* FAMFS_IOCTL_H */

-- 
#Randy

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing
  2024-02-23 17:42 ` [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing John Groves
@ 2024-02-24  1:50   ` Randy Dunlap
  2024-02-24  2:24     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Randy Dunlap @ 2024-02-24  1:50 UTC (permalink / raw)
  To: John Groves, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm
  Cc: john, Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

Hi,

On 2/23/24 09:42, John Groves wrote:
> Add famfs Kconfig and Makefile, and hook into fs/Kconfig and fs/Makefile
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/Kconfig        |  2 ++
>  fs/Makefile       |  1 +
>  fs/famfs/Kconfig  | 10 ++++++++++
>  fs/famfs/Makefile |  5 +++++
>  4 files changed, 18 insertions(+)
>  create mode 100644 fs/famfs/Kconfig
>  create mode 100644 fs/famfs/Makefile
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 89fdbefd1075..8a11625a54a2 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -141,6 +141,8 @@ source "fs/autofs/Kconfig"
>  source "fs/fuse/Kconfig"
>  source "fs/overlayfs/Kconfig"
>  
> +source "fs/famfs/Kconfig"
> +
>  menu "Caches"
>  
>  source "fs/netfs/Kconfig"
> diff --git a/fs/Makefile b/fs/Makefile
> index c09016257f05..382c1ea4f4c3 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
>  obj-$(CONFIG_EROFS_FS)		+= erofs/
>  obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
>  obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
> +obj-$(CONFIG_FAMFS)             += famfs/
> diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig
> new file mode 100644
> index 000000000000..e450928d8912
> --- /dev/null
> +++ b/fs/famfs/Kconfig
> @@ -0,0 +1,10 @@
> +
> +
> +config FAMFS
> +       tristate "famfs: shared memory file system"
> +       depends on DEV_DAX && FS_DAX
> +       help
> +         Support for the famfs file system. Famfs is a dax file system that
> +	 can support scale-out shared access to fabric-attached memory
> +	 (e.g. CXL shared memory). Famfs is not a general purpose file system;
> +	 it is an enabler for data sets in shared memory.

Please use one tab + 2 spaces to indent help text (below the "help" keyword)
as documented in Documentation/process/coding-style.rst.

> diff --git a/fs/famfs/Makefile b/fs/famfs/Makefile
> new file mode 100644
> index 000000000000..8cac90c090a4
> --- /dev/null
> +++ b/fs/famfs/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +obj-$(CONFIG_FAMFS) += famfs.o
> +
> +famfs-y := famfs_inode.o famfs_file.o

-- 
#Randy

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-24  1:39   ` Randy Dunlap
@ 2024-02-24  2:23     ` John Groves
  2024-02-24  3:27       ` Randy Dunlap
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-24  2:23 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/23 05:39PM, Randy Dunlap wrote:
> Hi--
> 
> On 2/23/24 09:41, John Groves wrote:
> > Add uapi include file for famfs. The famfs user space uses ioctl on
> > individual files to pass in mapping information and file size. This
> > would be hard to do via sysfs or other means, since it's
> > file-specific.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  include/uapi/linux/famfs_ioctl.h | 56 ++++++++++++++++++++++++++++++++
> >  1 file changed, 56 insertions(+)
> >  create mode 100644 include/uapi/linux/famfs_ioctl.h
> > 
> > diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> > new file mode 100644
> > index 000000000000..6b3e6452d02f
> > --- /dev/null
> > +++ b/include/uapi/linux/famfs_ioctl.h
> > @@ -0,0 +1,56 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2024 Micron Technology, Inc.
> > + *
> > + * This file system, originally based on ramfs the dax support from xfs,
> 
>       This is confusing to me. Is it just me? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Thanks Randy. I think I was trying to say "based on ramfs *plus* the dax
support from xfs". But I'll try to come up with something clearer than
that...

> 
> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +#ifndef FAMFS_IOCTL_H
> > +#define FAMFS_IOCTL_H
> > +
> > +#include <linux/ioctl.h>
> > +#include <linux/uuid.h>
> > +
> > +#define FAMFS_MAX_EXTENTS 2
> > +
> > +enum extent_type {
> > +	SIMPLE_DAX_EXTENT = 13,
> > +	INVALID_EXTENT_TYPE,
> > +};
> > +
> > +struct famfs_extent {
> > +	__u64              offset;
> > +	__u64              len;
> > +};
> > +
> > +enum famfs_file_type {
> > +	FAMFS_REG,
> > +	FAMFS_SUPERBLOCK,
> > +	FAMFS_LOG,
> > +};
> > +
> > +/**
> 
> "/**" is used to begin kernel-doc comments, but this comment block is missing
> a few entries to make it be kernel-doc compatible. Please either add them
> or just use "/*" to begin the comment.

Will do, thanks. And I'll check the whole code base for other instances;
I won't be surprised if I was sloppy about that in more than one place.

> 
> > + * struct famfs_ioc_map
> > + *
> > + * This is the metadata that indicates where the memory is for a famfs file
> > + */
> > +struct famfs_ioc_map {
> > +	enum extent_type          extent_type;
> > +	enum famfs_file_type      file_type;
> > +	__u64                     file_size;
> > +	__u64                     ext_list_count;
> > +	struct famfs_extent       ext_list[FAMFS_MAX_EXTENTS];
> > +};
> > +
> > +#define FAMFSIOC_MAGIC 'u'
> 
> This 'u' value should be documented in
> Documentation/userspace-api/ioctl/ioctl-number.rst.
> 
> and if possible, you might want to use values like 0x5x or 0x8x
> that don't conflict with the ioctl numbers that are already used
> in the 'u' space.

Will do. I was trying to be too clever there, invoking "mu" for
micron. 

> 
> > +
> > +/* famfs file ioctl opcodes */
> > +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
> > +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
> > +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
> > +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
> > +
> > +#endif /* FAMFS_IOCTL_H */
> 
> -- 
> #Randy

Thank you for taking the time to look it over, Randy.

John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing
  2024-02-24  1:50   ` Randy Dunlap
@ 2024-02-24  2:24     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-24  2:24 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/23 05:50PM, Randy Dunlap wrote:
> Hi,
> 
> On 2/23/24 09:42, John Groves wrote:
> > Add famfs Kconfig and Makefile, and hook into fs/Kconfig and fs/Makefile
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/Kconfig        |  2 ++
> >  fs/Makefile       |  1 +
> >  fs/famfs/Kconfig  | 10 ++++++++++
> >  fs/famfs/Makefile |  5 +++++
> >  4 files changed, 18 insertions(+)
> >  create mode 100644 fs/famfs/Kconfig
> >  create mode 100644 fs/famfs/Makefile
> > 
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 89fdbefd1075..8a11625a54a2 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -141,6 +141,8 @@ source "fs/autofs/Kconfig"
> >  source "fs/fuse/Kconfig"
> >  source "fs/overlayfs/Kconfig"
> >  
> > +source "fs/famfs/Kconfig"
> > +
> >  menu "Caches"
> >  
> >  source "fs/netfs/Kconfig"
> > diff --git a/fs/Makefile b/fs/Makefile
> > index c09016257f05..382c1ea4f4c3 100644
> > --- a/fs/Makefile
> > +++ b/fs/Makefile
> > @@ -130,3 +130,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
> >  obj-$(CONFIG_EROFS_FS)		+= erofs/
> >  obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
> >  obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
> > +obj-$(CONFIG_FAMFS)             += famfs/
> > diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig
> > new file mode 100644
> > index 000000000000..e450928d8912
> > --- /dev/null
> > +++ b/fs/famfs/Kconfig
> > @@ -0,0 +1,10 @@
> > +
> > +
> > +config FAMFS
> > +       tristate "famfs: shared memory file system"
> > +       depends on DEV_DAX && FS_DAX
> > +       help
> > +         Support for the famfs file system. Famfs is a dax file system that
> > +	 can support scale-out shared access to fabric-attached memory
> > +	 (e.g. CXL shared memory). Famfs is not a general purpose file system;
> > +	 it is an enabler for data sets in shared memory.
> 
> Please use one tab + 2 spaces to indent help text (below the "help" keyword)
> as documented in Documentation/process/coding-style.rst.

Will do, thank you!

John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-24  2:23     ` John Groves
@ 2024-02-24  3:27       ` Randy Dunlap
  2024-02-24 23:32         ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Randy Dunlap @ 2024-02-24  3:27 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

Hi John,

On 2/23/24 18:23, John Groves wrote:
>>> +
>>> +#define FAMFSIOC_MAGIC 'u'
>> This 'u' value should be documented in
>> Documentation/userspace-api/ioctl/ioctl-number.rst.
>>
>> and if possible, you might want to use values like 0x5x or 0x8x
>> that don't conflict with the ioctl numbers that are already used
>> in the 'u' space.
> Will do. I was trying to be too clever there, invoking "mu" for
> micron. 

I might have been unclear about this one.
It's OK to use 'u' but the values 1-4 below conflict in the 'u' space:

'u'   00-1F  linux/smb_fs.h                                          gone
'u'   20-3F  linux/uvcvideo.h                                        USB video class host driver
'u'   40-4f  linux/udmabuf.h

so if you could use
'u'   50-5f
or
'u'   80-8f

then those conflicts wouldn't be there.
HTH.
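
A minimal sketch of what that renumbering could look like, and of how the
_IO* macros pack the magic and sequence number into the command value. The
EXAMPLE_* names, the stand-in payload struct, and the specific numbers
0x50-0x52 are hypothetical; only the 'u' magic and the suggested 50-5f
range come from this thread.

```c
#include <assert.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Stand-in payload; the real struct famfs_ioc_map is in the patch. */
struct example_map {
	__u64 file_size;
	__u64 ext_list_count;
};

#define EXAMPLE_MAGIC 'u'

/* Hypothetical sequence numbers inside the suggested 0x50-0x5f block. */
#define EXAMPLE_MAP_CREATE _IOW(EXAMPLE_MAGIC, 0x50, struct example_map)
#define EXAMPLE_MAP_GET    _IOR(EXAMPLE_MAGIC, 0x51, struct example_map)
#define EXAMPLE_NOP        _IO(EXAMPLE_MAGIC, 0x52)

/* The magic and number are recoverable from the encoded command. */
static_assert(_IOC_TYPE(EXAMPLE_MAP_CREATE) == 'u', "magic is 'u'");
static_assert(_IOC_NR(EXAMPLE_MAP_CREATE) == 0x50, "nr is 0x50");

/* Return 1 if cmd uses magic 'u' with a sequence number in 0x50-0x5f. */
int example_in_suggested_range(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == EXAMPLE_MAGIC &&
	       _IOC_NR(cmd) >= 0x50 && _IOC_NR(cmd) <= 0x5f;
}
```

With the numbers 1-4 from the patch, the same check fails, which is the
overlap with the 00-1F block listed above.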

>>> +
>>> +/* famfs file ioctl opcodes */
>>> +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
>>> +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
>>> +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
>>> +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)

-- 
#Randy

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 16/20] famfs: Add fault counters
  2024-02-23 23:50             ` Dan Williams
@ 2024-02-24  3:59               ` Matthew Wilcox
  2024-02-24  4:30                 ` Dan Williams
  0 siblings, 1 reply; 105+ messages in thread
From: Matthew Wilcox @ 2024-02-24  3:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, John Groves, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, linux-cxl, linux-fsdevel, linux-doc, linux-kernel,
	nvdimm, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price

On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote:
> Certainly something like that would have satisfied this sanity test use
> case. I will note that mm_account_fault() would need some help to figure
> out the size of the page table entry that got installed. Maybe
> extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
> VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
> from PUD to PMD, or all the way back to PTE.

ugh, no, it's more complicated than that.  look at the recent changes to
set_ptes().  we can now install PTEs of many different sizes, depending
on the architecture.  someday i look forward to supporting all the page
sizes on parisc (4k, 16k, 64k, 256k, ... 4G)

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 16/20] famfs: Add fault counters
  2024-02-24  3:59               ` Matthew Wilcox
@ 2024-02-24  4:30                 ` Dan Williams
  0 siblings, 0 replies; 105+ messages in thread
From: Dan Williams @ 2024-02-24  4:30 UTC (permalink / raw)
  To: Matthew Wilcox, Dan Williams
  Cc: Dave Hansen, John Groves, John Groves, Jonathan Corbet,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, linux-cxl, linux-fsdevel, linux-doc, linux-kernel,
	nvdimm, john, Dave Chinner, Christoph Hellwig, dave.hansen,
	gregory.price

Matthew Wilcox wrote:
> On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote:
> > Certainly something like that would have satisfied this sanity test use
> > case. I will note that mm_account_fault() would need some help to figure
> > out the size of the page table entry that got installed. Maybe
> > extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
> > VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
> > from PUD to PMD, or all the way back to PTE.
> 
> ugh, no, it's more complicated than that.  look at the recent changes to
> set_ptes().  we can now install PTEs of many different sizes, depending
> on the architecture.  someday i look forward to supporting all the page
> sizes on parisc (4k, 16k, 64k, 256k, ... 4G)

Nice!

There are enough bits in vm_fault_t to represent many page sizes instead
of the entry type as I suggested, but I would defer to you or Dave on
how to make "installed pte size" generically traceable per Dave's
suggestion.
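
Purely as an illustration of "enough bits to represent many page sizes", the following user-space sketch packs a page order (log2 of the mapping size in base pages) into five bits of a 32-bit return word. Everything here is hypothetical: the DEMO_* names, the bit position, and the 4 KiB base page are assumptions, and none of it reflects how vm_fault_t is actually laid out in the kernel.

```c
#include <stdint.h>

/* Hypothetical layout: five bits, placed arbitrarily, hold the order. */
#define DEMO_FAULT_ORDER_SHIFT	24
#define DEMO_FAULT_ORDER_MASK	(0x1fu << DEMO_FAULT_ORDER_SHIFT)
#define DEMO_BASE_PAGE_SHIFT	12	/* 4 KiB base page assumed */

/* Store the order in the reserved field, leaving other bits untouched. */
static uint32_t demo_fault_set_order(uint32_t ret, unsigned int order)
{
	ret &= ~DEMO_FAULT_ORDER_MASK;
	return ret | ((order << DEMO_FAULT_ORDER_SHIFT) & DEMO_FAULT_ORDER_MASK);
}

/* Read the order back out of the reserved field. */
static unsigned int demo_fault_get_order(uint32_t ret)
{
	return (ret & DEMO_FAULT_ORDER_MASK) >> DEMO_FAULT_ORDER_SHIFT;
}

/* Size in bytes of the mapping that the encoded order describes. */
static uint64_t demo_fault_mapping_size(uint32_t ret)
{
	return (uint64_t)1 << (DEMO_BASE_PAGE_SHIFT + demo_fault_get_order(ret));
}
```

Five bits cover orders 0 through 31, so a 4G mapping on a 4k base page (order 20) fits, while the low bits stay free for whatever reason flags are already there.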

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-24  3:27       ` Randy Dunlap
@ 2024-02-24 23:32         ` John Groves
  2024-02-24 23:40           ` Randy Dunlap
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-24 23:32 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/23 07:27PM, Randy Dunlap wrote:
> Hi John,
> 
> On 2/23/24 18:23, John Groves wrote:
> >>> +
> >>> +#define FAMFSIOC_MAGIC 'u'
> >> This 'u' value should be documented in
> >> Documentation/userspace-api/ioctl/ioctl-number.rst.
> >>
> >> and if possible, you might want to use values like 0x5x or 0x8x
> >> that don't conflict with the ioctl numbers that are already used
> >> in the 'u' space.
> > Will do. I was trying to be too clever there, invoking "mu" for
> > micron. 
> 
> I might have been unclear about this one.
> It's OK to use 'u' but the values 1-4 below conflict in the 'u' space:
> 
> 'u'   00-1F  linux/smb_fs.h                                          gone
> 'u'   20-3F  linux/uvcvideo.h                                        USB video class host driver
> 'u'   40-4f  linux/udmabuf.h
> 
> so if you could use
> 'u'   50-5f
> or
> 'u'   80-8f
> 
> then those conflicts wouldn't be there.
> HTH.
> 
> >>> +
> >>> +/* famfs file ioctl opcodes */
> >>> +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
> >>> +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
> >>> +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
> >>> +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
> 
> -- 
> #Randy

Thanks Randy; I think I'm the one that didn't read carefully enough.

Does this look right?

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 457e16f06e04..44a44809657b 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -288,6 +288,7 @@ Code  Seq#    Include File                                           Comments
 'u'   00-1F  linux/smb_fs.h                                          gone
 'u'   20-3F  linux/uvcvideo.h                                        USB video class host driver
 'u'   40-4f  linux/udmabuf.h                                         userspace dma-buf misc device
+'u'   50-5F  linux/famfs_ioctl.h                                     famfs shared memory file system
 'v'   00-1F  linux/ext2_fs.h                                         conflict!
 'v'   00-1F  linux/fs.h                                              conflict!
 'v'   00-0F  linux/sonypi.h                                          conflict!
diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
index 6b3e6452d02f..57521898ed57 100644
--- a/include/uapi/linux/famfs_ioctl.h
+++ b/include/uapi/linux/famfs_ioctl.h
@@ -48,9 +48,9 @@ struct famfs_ioc_map {
 #define FAMFSIOC_MAGIC 'u'

 /* famfs file ioctl opcodes */
-#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
-#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
-#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
-#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
+#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 0x50, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 0x51, struct famfs_ioc_map)
+#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 0x52, struct famfs_extent)
+#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  0x53)

Thank you!
John


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-24 23:32         ` John Groves
@ 2024-02-24 23:40           ` Randy Dunlap
  0 siblings, 0 replies; 105+ messages in thread
From: Randy Dunlap @ 2024-02-24 23:40 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price



On 2/24/24 15:32, John Groves wrote:
> On 24/02/23 07:27PM, Randy Dunlap wrote:
>> Hi John,
>>
>> On 2/23/24 18:23, John Groves wrote:
>>>>> +
>>>>> +#define FAMFSIOC_MAGIC 'u'
>>>> This 'u' value should be documented in
>>>> Documentation/userspace-api/ioctl/ioctl-number.rst.
>>>>
>>>> and if possible, you might want to use values like 0x5x or 0x8x
>>>> that don't conflict with the ioctl numbers that are already used
>>>> in the 'u' space.
>>> Will do. I was trying to be too clever there, invoking "mu" for
>>> micron. 
>>
>> I might have been unclear about this one.
>> It's OK to use 'u' but the values 1-4 below conflict in the 'u' space:
>>
>> 'u'   00-1F  linux/smb_fs.h                                          gone
>> 'u'   20-3F  linux/uvcvideo.h                                        USB video class host driver
>> 'u'   40-4f  linux/udmabuf.h
>>
>> so if you could use
>> 'u'   50-5f
>> or
>> 'u'   80-8f
>>
>> then those conflicts wouldn't be there.
>> HTH.
>>
>>>>> +
>>>>> +/* famfs file ioctl opcodes */
>>>>> +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
>>>>> +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
>>>>> +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
>>>>> +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
>>
>> -- 
>> #Randy
> 
> Thanks Randy; I think I'm the one that didn't read carefully enough.
> 
> Does this look right?
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 457e16f06e04..44a44809657b 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -288,6 +288,7 @@ Code  Seq#    Include File                                           Comments
>  'u'   00-1F  linux/smb_fs.h                                          gone
>  'u'   20-3F  linux/uvcvideo.h                                        USB video class host driver
>  'u'   40-4f  linux/udmabuf.h                                         userspace dma-buf misc device
> +'u'   50-5F  linux/famfs_ioctl.h                                     famfs shared memory file system
>  'v'   00-1F  linux/ext2_fs.h                                         conflict!
>  'v'   00-1F  linux/fs.h                                              conflict!
>  'v'   00-0F  linux/sonypi.h                                          conflict!
> diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> index 6b3e6452d02f..57521898ed57 100644
> --- a/include/uapi/linux/famfs_ioctl.h
> +++ b/include/uapi/linux/famfs_ioctl.h
> @@ -48,9 +48,9 @@ struct famfs_ioc_map {
>  #define FAMFSIOC_MAGIC 'u'
> 
>  /* famfs file ioctl opcodes */
> -#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
> -#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
> -#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
> -#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
> +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 0x50, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 0x51, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 0x52, struct famfs_extent)
> +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  0x53)
> 
> Thank you!
> John
> 

Yes, that looks good.
Thanks.

-- 
#Randy

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  2024-02-23 17:41 ` [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
@ 2024-02-26 12:05   ` Jonathan Cameron
  2024-02-26 15:00     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 12:05 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:46 -0600
John Groves <John@Groves.net> wrote:

> This function should be called by fs-dax file systems after opening the
> devdax device. This adds holder_operations.
> 
> This function serves the same role as fs_dax_get_by_bdev(), which dax
> file systems call after opening the pmem block device.
> 
> Signed-off-by: John Groves <john@groves.net>

A few trivial comments from a first read to get my head around this.

Yeah, it is only an RFC, but who doesn't like tidy code? :)


> ---
>  drivers/dax/super.c | 38 ++++++++++++++++++++++++++++++++++++++
>  include/linux/dax.h |  5 +++++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index f4b635526345..fc96362de237 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -121,6 +121,44 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
>  EXPORT_SYMBOL_GPL(fs_put_dax);
>  #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
>  
> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +
> +/**
> + * fs_dax_get()

Smells like kernel-doc, but I'm fairly sure it needs a short description.
Have you sanity checked for warnings when running scripts/kernel-doc on it?

> + *
> + * fs-dax file systems call this function to prepare to use a devdax device for fsdax.
Trivial, but the lines are too long. Keep under 80 chars unless there is a
strong readability argument for not doing so.


> + * This is like fs_dax_get_by_bdev(), but the caller already has struct dev_dax (and there
> + * is no bdev). The holder makes this exclusive.

Not familiar with this area: what does exclusive mean here?

> + *
> + * @dax_dev: dev to be prepared for fs-dax usage
> + * @holder: filesystem or mapped device inside the dax_device
> + * @hops: operations for the inner holder
> + *
> + * Returns: 0 on success, -1 on failure

Why not return < 0 and use somewhat useful return values?

> + */
> +int fs_dax_get(
> +	struct dax_device *dax_dev,
> +	void *holder,
> +	const struct dax_holder_operations *hops)

Match local style for indents - it's a bit inconsistent but probably...

int fs_dax_get(struct dax_device *dax_dev, void *holder,
	       const struct dax_holder_operations *hops)

> +{
> +	/* dax_dev->ops should have been populated by devm_create_dev_dax() */
> +	if (WARN_ON(!dax_dev->ops))
> +		return -1;
> +
> +	if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))

You dereferenced dax_dev on the line above so check is too late or
unnecessary

> +		return -1;
> +
> +	if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
> +		pr_warn("%s: holder_data already set\n", __func__);

Perhaps nicer to use a pr_fmt() to deal with the func name if you need it,
or make it pr_debug and let dynamic debug control formatting if anyone
wants the function name.

> +		return -1;
> +	}
> +	dax_dev->holder_ops = hops;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_get);
> +#endif /* DEV_DAX_IOMAP */
> +
>  enum dax_device_flags {
>  	/* !alive + rcu grace period == no new operations / mappings */
>  	DAXDEV_ALIVE,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b463502b16e1..e973289bfde3 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -57,7 +57,12 @@ struct dax_holder_operations {
>  
>  #if IS_ENABLED(CONFIG_DAX)
>  struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
> +
> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
line wrap < 80 chars

> +#endif
>  void *dax_holder(struct dax_device *dax_dev);
> +struct dax_device *inode_dax(struct inode *inode);

Unrelated change?

>  void put_dax(struct dax_device *dax_dev);
>  void kill_dax(struct dax_device *dax_dev);
>  void dax_write_cache(struct dax_device *dax_dev, bool wc);
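
Pulling the comments above together, here is a user-space model of the reordered checks, with a kernel-doc style header that has a short description and distinct negative return codes. It is a sketch under assumptions: struct demo_dax_device, the demo_ names, and the specific errno choices are hypothetical, a C11 atomic compare-exchange stands in for cmpxchg(), and the igrab() step is left out. It also shows what "exclusive" appears to mean in the quoted code: only the first caller can install itself as holder.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_holder_ops { int unused; };

/* Stand-in for struct dax_device; only the fields the checks touch. */
struct demo_dax_device {
	const void *ops;
	bool alive;
	_Atomic(void *) holder_data;
	const struct demo_holder_ops *holder_ops;
};

/**
 * demo_fs_dax_get() - claim a dax device for exclusive use by one holder
 * @dax_dev: device to be prepared for fs-dax usage
 * @holder: opaque cookie identifying the claiming file system
 * @hops: operations for the holder
 *
 * Return: 0 on success, -EINVAL for a missing device or ops, -ENODEV if
 * the device is no longer alive, -EBUSY if another holder got there first.
 */
int demo_fs_dax_get(struct demo_dax_device *dax_dev, void *holder,
		    const struct demo_holder_ops *hops)
{
	void *expected = NULL;

	/* Check the pointer before the first dereference. */
	if (!dax_dev || !dax_dev->ops)
		return -EINVAL;
	if (!dax_dev->alive)
		return -ENODEV;

	/* Only succeeds while holder_data is still NULL. */
	if (!atomic_compare_exchange_strong(&dax_dev->holder_data,
					    &expected, holder))
		return -EBUSY;

	dax_dev->holder_ops = hops;
	return 0;
}
```

A second call on the same device returns -EBUSY until the holder is cleared, so callers can tell "already claimed" apart from "bad argument".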


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now
  2024-02-23 17:41 ` [RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now John Groves
@ 2024-02-26 12:10   ` Jonathan Cameron
  2024-02-26 15:13     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 12:10 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:47 -0600
John Groves <John@Groves.net> wrote:

> bus.c can't call functions in device.c - that creates a circular linkage
> dependency.
> 
> Signed-off-by: John Groves <john@groves.net>

This also adds the export which you should mention!

Do they need it already? Seems like tense of patch title
may be wrong.

> ---
>  drivers/dax/bus.c    | 24 ++++++++++++++++++++++++
>  drivers/dax/device.c | 23 -----------------------
>  2 files changed, 24 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1ff1ab5fa105..664e8c1b9930 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1325,6 +1325,30 @@ static const struct device_type dev_dax_type = {
>  	.groups = dax_attribute_groups,
>  };
>  
> +/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c  */
> +__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
> +			      unsigned long size)
> +{
> +	int i;
> +
> +	for (i = 0; i < dev_dax->nr_range; i++) {
> +		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
> +		struct range *range = &dax_range->range;
> +		unsigned long long pgoff_end;
> +		phys_addr_t phys;
> +
> +		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
> +		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
> +			continue;
> +		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
> +		if (phys + size - 1 <= range->end)
> +			return phys;
> +		break;
> +	}
> +	return -1;

Not related to your patch but returning -1 in a phys_addr_t isn't ideal.
I assume aim is all bits set as a marker, in which case
PHYS_ADDR_MAX from limits.h would make things clearer.

> +}
> +EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
> +
>  struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
>  {
>  	struct dax_region *dax_region = data->dax_region;
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 93ebedc5ec8c..40ba660013cf 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -50,29 +50,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> -/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
> -__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
> -		unsigned long size)
> -{
> -	int i;
> -
> -	for (i = 0; i < dev_dax->nr_range; i++) {
> -		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
> -		struct range *range = &dax_range->range;
> -		unsigned long long pgoff_end;
> -		phys_addr_t phys;
> -
> -		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
> -		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
> -			continue;
> -		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
> -		if (phys + size - 1 <= range->end)
> -			return phys;
> -		break;
> -	}
> -	return -1;
> -}
> -
>  static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
>  			      unsigned long fault_size)
>  {
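
For readers following the -1 versus PHYS_ADDR_MAX remark, this is a user-space model of the same range walk with an explicit all-bits-set sentinel. The demo_ types and DEMO_ macros are hypothetical stand-ins for the kernel's struct dev_dax_range, PHYS_PFN()/PFN_PHYS() and PHYS_ADDR_MAX, and a 4 KiB page size is assumed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_phys_addr_t;

#define DEMO_PHYS_ADDR_MAX	(~(demo_phys_addr_t)0)	/* "no translation" */
#define DEMO_PAGE_SHIFT		12

/* One physical range and the page offset at which it starts in the device. */
struct demo_range {
	uint64_t start;		/* first byte */
	uint64_t end;		/* last byte, inclusive */
	unsigned long pgoff;
};

/* Translate a device page offset to a physical address, or the sentinel. */
demo_phys_addr_t demo_pgoff_to_phys(const struct demo_range *ranges, int nr,
				    unsigned long pgoff, unsigned long size)
{
	int i;

	for (i = 0; i < nr; i++) {
		const struct demo_range *r = &ranges[i];
		uint64_t len = r->end - r->start + 1;
		uint64_t pgoff_end = r->pgoff + (len >> DEMO_PAGE_SHIFT) - 1;
		demo_phys_addr_t phys;

		if (pgoff < r->pgoff || pgoff > pgoff_end)
			continue;
		phys = ((uint64_t)(pgoff - r->pgoff) << DEMO_PAGE_SHIFT) + r->start;
		/* The whole request must fit inside this one range. */
		if (phys + size - 1 <= r->end)
			return phys;
		break;
	}
	return DEMO_PHYS_ADDR_MAX;
}
```

Callers then compare against the named sentinel instead of relying on -1 converting to all ones in an unsigned type.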


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap
  2024-02-23 17:41 ` [RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap John Groves
@ 2024-02-26 12:21   ` Jonathan Cameron
  2024-02-26 15:48     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 12:21 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:48 -0600
John Groves <John@Groves.net> wrote:

> Save the kva from memremap because we need it for iomap rw support
> 
> Prior to famfs, there were no iomap users of /dev/dax - so the virtual
> address from memremap was not needed.
> 
> Also: in some cases dev_dax_probe() is called with the first
> dev_dax->range offset past pgmap[0].range. In those cases we need to
> add the difference to virt_addr in order to have the physaddr's in
> dev_dax->ranges match dev_dax->virt_addr.

Probably good to have info on when this happens and preferably why
this dragon is there.

> 
> Dragons...
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/dax-private.h |  1 +
>  drivers/dax/device.c      | 15 +++++++++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> index 446617b73aea..894eb1c66b4a 100644
> --- a/drivers/dax/dax-private.h
> +++ b/drivers/dax/dax-private.h
> @@ -63,6 +63,7 @@ struct dax_mapping {
>  struct dev_dax {
>  	struct dax_region *region;
>  	struct dax_device *dax_dev;
> +	u64 virt_addr;

Why as a u64? If it's a virt address why not just void *?

>  	unsigned int align;
>  	int target_node;
>  	bool dyn_id;
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 40ba660013cf..6cd79d00fe1b 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -372,6 +372,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
>  	struct dax_device *dax_dev = dev_dax->dax_dev;
>  	struct device *dev = &dev_dax->dev;
>  	struct dev_pagemap *pgmap;
> +	u64 data_offset = 0;
>  	struct inode *inode;
>  	struct cdev *cdev;
>  	void *addr;
> @@ -426,6 +427,20 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
>  	if (IS_ERR(addr))
>  		return PTR_ERR(addr);
>  
> +	/* Detect whether the data is at a non-zero offset into the memory */
> +	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
> +		u64 phys = (u64)dev_dax->ranges[0].range.start;

Why the cast? Ranges use u64s internally.

> +		u64 pgmap_phys = (u64)dev_dax->pgmap[0].range.start;
> +		u64 vmemmap_shift = (u64)dev_dax->pgmap[0].vmemmap_shift;
> +
> +		if (!WARN_ON(pgmap_phys > phys))
> +			data_offset = phys - pgmap_phys;
> +
> +		pr_notice("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx shift=%llx\n",
> +		       __func__, phys, pgmap_phys, data_offset, vmemmap_shift);

pr_debug() + dynamic debug will then deal with __func__ for you.

> +	}
> +	dev_dax->virt_addr = (u64)addr + data_offset;
> +
>  	inode = dax_inode(dax_dev);
>  	cdev = inode->i_cdev;
>  	cdev_init(cdev, &dax_fops);
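
A rough user-space sketch of the void * suggestion: keep the saved address as a pointer and add the data offset with byte-pointer arithmetic, so no integer casts are needed. The struct and function names are hypothetical, and unlike the quoted patch (which WARNs and carries on with a zero offset) this model returns -EINVAL when the data range starts before the mapping. That choice is an assumption for the example, not something the thread settled.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct dev_dax; only the saved address matters here. */
struct demo_dev_dax {
	void *virt_addr;	/* address of the first data byte */
};

/*
 * addr is what the mapping call returned for pgmap_start; the data may
 * begin later, at range_start. Returns 0 on success or -EINVAL.
 */
static int demo_save_kva(struct demo_dev_dax *dev_dax, void *addr,
			 uint64_t pgmap_start, uint64_t range_start)
{
	uint64_t data_offset;

	if (range_start < pgmap_start)
		return -EINVAL;

	data_offset = range_start - pgmap_start;
	dev_dax->virt_addr = (char *)addr + data_offset;
	return 0;
}
```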


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  2024-02-23 17:41 ` [RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
@ 2024-02-26 12:32   ` Jonathan Cameron
  2024-02-26 16:09     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 12:32 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:49 -0600
John Groves <John@Groves.net> wrote:

> Notes about this commit:
> 
> * These methods are based somewhat loosely on pmem_dax_ops from
>   drivers/nvdimm/pmem.c
> 
> * dev_dax_direct_access() returns the hpa, pfn and kva. The kva was
>   newly stored as dev_dax->virt_addr by dev_dax_probe().
> 
> * The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
>   for read/write (dax_iomap_rw())
> 
> * dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
>   tested yet. I'm looking for suggestions as to how to test those.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/bus.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 107 insertions(+)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 664e8c1b9930..06fcda810674 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -10,6 +10,12 @@
>  #include "dax-private.h"
>  #include "bus.h"
>  
> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +#include <linux/backing-dev.h>
> +#include <linux/pfn_t.h>
> +#include <linux/range.h>
> +#endif
> +

Is it worth avoiding includes based on config? Probably not.

>  static DEFINE_MUTEX(dax_bus_lock);
>  
>  #define DAX_NAME_LEN 30
> @@ -1349,6 +1355,101 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
>  }
>  EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
>  
> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +

> +
> +static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> +			     long nr_pages, enum dax_access_mode mode, void **kaddr,
> +			     pfn_t *pfn)
> +{
> +	struct dev_dax *dev_dax = dax_get_private(dax_dev);
> +	size_t dax_size = dev_dax_size(dev_dax);
> +	size_t size = nr_pages << PAGE_SHIFT;
> +	size_t offset = pgoff << PAGE_SHIFT;
> +	phys_addr_t phys;
> +	u64 virt_addr = dev_dax->virt_addr + offset;
> +	pfn_t local_pfn;
> +	u64 flags = PFN_DEV|PFN_MAP;
> +
> +	WARN_ON(!dev_dax->virt_addr); /* virt_addr must be saved for direct_access */
Fair enough, but from a local code point of view, does it make sense to check
this if !kaddr, as we won't use it?
> +
> +	phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
> +
> +	if (kaddr)
> +		*kaddr = (void *)virt_addr;

Back to earlier comment on virt_addr as a void *. Definitely looking like
that would be more accurate and simpler!  Also not much point in computing
virt_addr unless kaddr is good.

> +
> +	local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */
If you aren't going to do anything with it for !pfn, move it under the if (pfn).

> +	if (pfn)
> +		*pfn = local_pfn;
> +
> +	/* This is the valid size at the specified address */
> +	return PHYS_PFN(min_t(size_t, size, dax_size - offset));
> +}
> +

> +
> +static const struct dax_operations dev_dax_ops = {
> +	.direct_access = dev_dax_direct_access,
> +	.zero_page_range = dev_dax_zero_page_range,
> +	.recovery_write = dev_dax_recovery_write,
> +};
> +
> +#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
> +
>  struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
>  {
>  	struct dax_region *dax_region = data->dax_region;
> @@ -1404,11 +1505,17 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
>  		}
>  	}
>  

If we were to make this 

	if (IS_ENABLED(CONFIG_DEV_DAX_IOMAP))

etc can we avoid the ifdef stuff above and let dead code removal deal with it?
Might need a few stubs - I haven't tried.

> +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> +	/* holder_ops currently populated separately in a slightly hacky way */
> +	dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
> +#else
>  	/*
>  	 * No dax_operations since there is no access to this device outside of
>  	 * mmap of the resulting character device.
>  	 */
>  	dax_dev = alloc_dax(dev_dax, NULL);
> +#endif
> +
>  	if (IS_ERR(dax_dev)) {
>  		rc = PTR_ERR(dax_dev);
>  		goto err_alloc_dax;
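
The direct_access comments above can be sketched in user space as follows: each output is computed only when the caller asked for it, the kva is a plain pointer, and the return value is the number of pages valid from the requested offset. This is a simplified model, not the driver code. The demo_ names are hypothetical, a single contiguous physical range and 4 KiB pages are assumed, and the up-front bounds check returning -EINVAL is an addition for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12

/* Stand-in device: one contiguous range with a saved virtual address. */
struct demo_dev_dax {
	void *virt_addr;
	uint64_t phys_base;
	size_t size;		/* bytes */
};

/* Returns the number of valid pages at pgoff, or -EINVAL if out of range. */
static long demo_direct_access(const struct demo_dev_dax *dev_dax,
			       unsigned long pgoff, long nr_pages,
			       void **kaddr, uint64_t *pfn)
{
	size_t offset = (size_t)pgoff << DEMO_PAGE_SHIFT;
	size_t size = (size_t)nr_pages << DEMO_PAGE_SHIFT;
	size_t avail;

	if (offset >= dev_dax->size)
		return -EINVAL;

	/* Only do the work for outputs the caller actually wants. */
	if (kaddr)
		*kaddr = (char *)dev_dax->virt_addr + offset;
	if (pfn)
		*pfn = (dev_dax->phys_base + offset) >> DEMO_PAGE_SHIFT;

	avail = dev_dax->size - offset;
	return (long)((size < avail ? size : avail) >> DEMO_PAGE_SHIFT);
}
```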


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
  2024-02-23 17:41 ` [RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter John Groves
@ 2024-02-26 12:34   ` Jonathan Cameron
  2024-02-26 16:12     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 12:34 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:50 -0600
John Groves <John@Groves.net> wrote:

> Add the CONFIG_DEV_DAX_IOMAP kernel config parameter to control building
> of the iomap functionality to support fsdax on devdax.

I would squash with previous patch.

Only reason I ever see for separate Kconfig patches is when there is something
complex in the dependencies and you want to talk about it in depth in the
patch description. That's not true here so no need for separate patch.

> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  drivers/dax/Kconfig | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index a88744244149..b1ebcc77120b 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -78,4 +78,10 @@ config DEV_DAX_KMEM
>  
>  	  Say N if unsure.
>  
> +config DEV_DAX_IOMAP
> +       depends on DEV_DAX && DAX
> +       def_bool y
> +       help
> +         Support iomap mapping of devdax devices (for FS-DAX file
> +         systems that reside on character /dev/dax devices)
>  endif


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-23 17:41 ` [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h John Groves
  2024-02-24  1:39   ` Randy Dunlap
@ 2024-02-26 12:39   ` Jonathan Cameron
  2024-02-26 16:44     ` John Groves
  1 sibling, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 12:39 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:51 -0600
John Groves <John@Groves.net> wrote:

> Add uapi include file for famfs. The famfs user space uses ioctl on
> individual files to pass in mapping information and file size. This
> would be hard to do via sysfs or other means, since it's
> file-specific.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  include/uapi/linux/famfs_ioctl.h | 56 ++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
>  create mode 100644 include/uapi/linux/famfs_ioctl.h
> 
> diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> new file mode 100644
> index 000000000000..6b3e6452d02f
> --- /dev/null
> +++ b/include/uapi/linux/famfs_ioctl.h
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2024 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs plus the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +#ifndef FAMFS_IOCTL_H
> +#define FAMFS_IOCTL_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/uuid.h>
> +
> +#define FAMFS_MAX_EXTENTS 2
Why 2?
> +
> +enum extent_type {
> +	SIMPLE_DAX_EXTENT = 13,

Comment on this would be good to have

> +	INVALID_EXTENT_TYPE,
> +};
> +
> +struct famfs_extent {
> +	__u64              offset;
> +	__u64              len;
> +};
> +
> +enum famfs_file_type {
> +	FAMFS_REG,
> +	FAMFS_SUPERBLOCK,
> +	FAMFS_LOG,
> +};
> +
> +/**
> + * struct famfs_ioc_map
> + *
> + * This is the metadata that indicates where the memory is for a famfs file
> + */
> +struct famfs_ioc_map {
> +	enum extent_type          extent_type;
> +	enum famfs_file_type      file_type;

These are going to be potentially varying in size depending on arch, compiler
settings etc.  Been a while, but I thought best practice for uapi was always
fixed size elements even though we lose the typing.


> +	__u64                     file_size;
> +	__u64                     ext_list_count;
> +	struct famfs_extent       ext_list[FAMFS_MAX_EXTENTS];
> +};
> +
> +#define FAMFSIOC_MAGIC 'u'
> +
> +/* famfs file ioctl opcodes */
> +#define FAMFSIOC_MAP_CREATE    _IOW(FAMFSIOC_MAGIC, 1, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GET       _IOR(FAMFSIOC_MAGIC, 2, struct famfs_ioc_map)
> +#define FAMFSIOC_MAP_GETEXT    _IOR(FAMFSIOC_MAGIC, 3, struct famfs_extent)
> +#define FAMFSIOC_NOP           _IO(FAMFSIOC_MAGIC,  4)
> +
> +#endif /* FAMFS_IOCTL_H */
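
One possible shape for the fixed-size suggestion above, as a user-space sketch: the two enum members become 32-bit integers, and compile-time assertions pin the layout. The demo_ names are hypothetical, and uint32_t/uint64_t stand in for the __u32/__u64 types a real uapi header would take from <linux/types.h>.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_EXTENTS 2

struct demo_extent {
	uint64_t offset;
	uint64_t len;
};

/* Every member has an explicit width, so the layout does not depend on
 * how a given compiler sizes an enum. */
struct demo_ioc_map {
	uint32_t extent_type;	/* carries an extent type value */
	uint32_t file_type;	/* carries a file type value */
	uint64_t file_size;
	uint64_t ext_list_count;
	struct demo_extent ext_list[DEMO_MAX_EXTENTS];
};

/* Fail the build if the layout ever drifts. */
_Static_assert(sizeof(struct demo_ioc_map) == 56, "unexpected struct size");
_Static_assert(offsetof(struct demo_ioc_map, file_size) == 8,
	       "unexpected padding before file_size");
```

The two 32-bit members sit back to back, so file_size lands at offset 8 with no implicit padding and the struct is 56 bytes on both 32-bit and 64-bit builds.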


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 08/20] famfs: Add famfs_internal.h
  2024-02-23 17:41 ` [RFC PATCH 08/20] famfs: Add famfs_internal.h John Groves
@ 2024-02-26 12:48   ` Jonathan Cameron
  2024-02-26 17:35     ` John Groves
  2024-02-27 13:38   ` Christian Brauner
  1 sibling, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 12:48 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:52 -0600
John Groves <John@Groves.net> wrote:

> Add the famfs_internal.h include file. This contains internal data
> structures such as the per-file metadata structure (famfs_file_meta)
> and extent formats.
> 
> Signed-off-by: John Groves <john@groves.net>
Hi John,

Build this up as you add the definitions in later patches.

Separate header patches just make people jump back and forth when trying
to review.  Obviously more work to build this stuff up cleanly but
it's worth doing to save review time.

Generally I'd plumb up Kconfig and Makefile at the beginning, as it means
that the set is bisectable and we can check the logic of building each stage.
That is harder to do but tends to bring benefits by forcing a clear stepwise
approach on a patch set. Feel free to ignore this one though, as it
can slow things down.

A few trivial comments inline.

> ---
>  fs/famfs/famfs_internal.h | 53 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>  create mode 100644 fs/famfs/famfs_internal.h
> 
> diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
> new file mode 100644
> index 000000000000..af3990d43305
> --- /dev/null
> +++ b/fs/famfs/famfs_internal.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2024 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +#ifndef FAMFS_INTERNAL_H
> +#define FAMFS_INTERNAL_H
> +
> +#include <linux/atomic.h>

Why?

> +#include <linux/famfs_ioctl.h>
> +
> +#define FAMFS_MAGIC 0x87b282ff
> +
> +#define FAMFS_BLKDEV_MODE (FMODE_READ|FMODE_WRITE)

Spaces around | 

> +
> +extern const struct file_operations      famfs_file_operations;

I wouldn't force alignment. It rots too often as new stuff gets added
and doesn't really help readability much.

> +
> +/*
> + * Each famfs dax file has this hanging from its inode->i_private.
> + */
> +struct famfs_file_meta {
> +	int                   error;
> +	enum famfs_file_type  file_type;
> +	size_t                file_size;
> +	enum extent_type      tfs_extent_type;
> +	size_t                tfs_extent_ct;
> +	struct famfs_extent   tfs_extents[];  /* flexible array */

Comment is kind of obvious ;) I'd drop it.  Though we do have the
__counted_by annotation for flexible arrays, which would be good to use
from the start.
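
For reference, a minimal userspace sketch of the annotation, under the
assumption that the count field is declared before the array. COUNTED_BY,
famfs_file_meta_sketch and meta_alloc are names made up for this example;
in the kernel the macro is __counted_by() and the allocation would normally
go through struct_size() with kzalloc(), which also checks for overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Use the attribute when the compiler has it, otherwise expand to nothing. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

struct famfs_extent {
	uint64_t offset;
	uint64_t len;
};

struct famfs_file_meta_sketch {
	size_t extent_ct;
	/* Bounds of extents[] are tied to extent_ct for the sanitizers. */
	struct famfs_extent extents[] COUNTED_BY(extent_ct);
};

static struct famfs_file_meta_sketch *meta_alloc(size_t nr_extents)
{
	struct famfs_file_meta_sketch *meta;

	meta = calloc(1, sizeof(*meta) +
			 nr_extents * sizeof(struct famfs_extent));
	if (!meta)
		return NULL;

	/* Set the count before touching the array. */
	meta->extent_ct = nr_extents;
	return meta;
}
```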



> +};
> +
> +struct famfs_mount_opts {
> +	umode_t mode;
> +};
> +
> +extern const struct iomap_ops             famfs_iomap_ops;
> +extern const struct vm_operations_struct  famfs_file_vm_ops;
> +
> +#define ROOTDEV_STRLEN 80

Why?  You aren't creating an array of this size here so I can't
immediately see what the define is for.

> +
> +struct famfs_fs_info {
> +	struct famfs_mount_opts  mount_opts;
> +	struct file             *dax_filp;
> +	struct dax_device       *dax_devp;
> +	struct bdev_handle      *bdev_handle;
> +	struct list_head         fsi_list;
> +	char                    *rootdev;
> +};
> +
> +#endif /* FAMFS_INTERNAL_H */


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 09/20] famfs: Add super_operations
  2024-02-23 17:41 ` [RFC PATCH 09/20] famfs: Add super_operations John Groves
@ 2024-02-26 12:51   ` Jonathan Cameron
  2024-02-26 21:47     ` John Groves
  2024-02-27 17:48     ` John Groves
  0 siblings, 2 replies; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 12:51 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:53 -0600
John Groves <John@Groves.net> wrote:

> Introduce the famfs superblock operations
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/famfs/famfs_inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 72 insertions(+)
>  create mode 100644 fs/famfs/famfs_inode.c
> 
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> new file mode 100644
> index 000000000000..3329aff000d1
> --- /dev/null
> +++ b/fs/famfs/famfs_inode.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2024 Micron Technology, inc
> + *
> + * This file system, originally based on ramfs the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/pagemap.h>
> +#include <linux/highmem.h>
> +#include <linux/time.h>
> +#include <linux/init.h>
> +#include <linux/string.h>
> +#include <linux/backing-dev.h>
> +#include <linux/sched.h>
> +#include <linux/parser.h>
> +#include <linux/magic.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/fs_context.h>
> +#include <linux/fs_parser.h>
> +#include <linux/seq_file.h>
> +#include <linux/dax.h>
> +#include <linux/hugetlb.h>
> +#include <linux/uio.h>
> +#include <linux/iomap.h>
> +#include <linux/path.h>
> +#include <linux/namei.h>
> +#include <linux/pfn_t.h>
> +#include <linux/blkdev.h>

That's a lot of headers for such a small patch. I'm going to guess
they aren't all used - bring them in as you need them - I hope
you never need some of these!


> +
> +#include "famfs_internal.h"
> +
> +#define FAMFS_DEFAULT_MODE	0755
> +
> +static const struct super_operations famfs_ops;
> +static const struct inode_operations famfs_file_inode_operations;
> +static const struct inode_operations famfs_dir_inode_operations;

Why are these all up here?

> +
> +/**********************************************************************************
> + * famfs super_operations
> + *
> + * TODO: implement a famfs_statfs() that shows size, free and available space, etc.
> + */
> +
> +/**
> + * famfs_show_options() - Display the mount options in /proc/mounts.
Run kernel doc script + fix all warnings.

> + */
> +static int famfs_show_options(
> +	struct seq_file *m,
> +	struct dentry   *root)
Not that familiar with fs code, but this is unusual kernel style. I'd go
with something more common:

static int famfs_show_options(struct seq_file *m, struct dentry *root)

> +{
> +	struct famfs_fs_info *fsi = root->d_sb->s_fs_info;
> +
> +	if (fsi->mount_opts.mode != FAMFS_DEFAULT_MODE)
> +		seq_printf(m, ",mode=%o", fsi->mount_opts.mode);
> +
> +	return 0;
> +}
> +
> +static const struct super_operations famfs_ops = {
> +	.statfs		= simple_statfs,
> +	.drop_inode	= generic_delete_inode,
> +	.show_options	= famfs_show_options,
> +};
> +
> +
One blank line probably fine.


Add the rest of the stuff a module normally has, author etc in this
patch.

> +MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations
  2024-02-23 17:41 ` [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations John Groves
@ 2024-02-26 12:56   ` Jonathan Cameron
  2024-02-26 22:22     ` John Groves
  2024-02-27 13:39   ` Christian Brauner
  1 sibling, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 12:56 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:54 -0600
John Groves <John@Groves.net> wrote:

> Famfs works on both /dev/pmem and /dev/dax devices. This commit introduces
> the function that opens a block (pmem) device and the struct
> dax_holder_operations that are needed for that ABI.
> 
> In this commit, support for opening character /dev/dax is stubbed. A
> later commit introduces this capability.
> 
> Signed-off-by: John Groves <john@groves.net>

Formatting comments mostly same as previous patches, so I'll stop repeating them.

> ---
>  fs/famfs/famfs_inode.c | 83 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 83 insertions(+)
> 
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> index 3329aff000d1..82c861998093 100644
> --- a/fs/famfs/famfs_inode.c
> +++ b/fs/famfs/famfs_inode.c
> @@ -68,5 +68,88 @@ static const struct super_operations famfs_ops = {
>  	.show_options	= famfs_show_options,
>  };
>  
> +/***************************************************************************************
> + * dax_holder_operations for block dax
> + */
> +
> +static int
> +famfs_blk_dax_notify_failure(
> +	struct dax_device	*dax_devp,
> +	u64			offset,
> +	u64			len,
> +	int			mf_flags)
> +{
> +
> +	pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n",
> +	       __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags);
> +	return -EOPNOTSUPP;
> +}
> +
> +const struct dax_holder_operations famfs_blk_dax_holder_ops = {
> +	.notify_failure		= famfs_blk_dax_notify_failure,
> +};
> +
> +static int
> +famfs_open_char_device(
> +	struct super_block *sb,
> +	struct fs_context  *fc)
> +{
> +	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> +	       __func__, fc->source);
> +	return -ENODEV;
> +}
> +
> +/**
> + * famfs_open_device()
> + *
> + * Open the memory device. If it looks like /dev/dax, call famfs_open_char_device().
> + * Otherwise try to open it as a block/pmem device.
> + */
> +static int
> +famfs_open_device(
> +	struct super_block *sb,
> +	struct fs_context  *fc)
> +{
> +	struct famfs_fs_info *fsi = sb->s_fs_info;
> +	struct dax_device    *dax_devp;
> +	u64 start_off = 0;
> +	struct bdev_handle   *handlep;
Definitely don't force alignment in local parameter definitions.
Always goes wrong and makes for unreadable mess in patches!

> +
> +	if (fsi->dax_devp) {
> +		pr_err("%s: already mounted\n", __func__);
Fine to fail, but is it worth an error message? Not sure of the convention
here, but it seems noisy, and it may be triggerable from userspace, which
isn't good.
> +		return -EALREADY;
> +	}
> +
> +	if (strstr(fc->source, "/dev/dax")) /* There is probably a better way to check this */
> +		return famfs_open_char_device(sb, fc);
> +
> +	if (!strstr(fc->source, "/dev/pmem")) { /* There is probably a better way to check this */
> +		pr_err("%s: primary backing dev (%s) is not pmem\n",
> +		       __func__, fc->source);
> +		return -EINVAL;
> +	}
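
On the "better way to check this" comments above: a substring match depends
on the name of the path rather than on what the node actually is, so a
renamed node or a symlink is misclassified. One alternative, sketched here
in userspace purely as an illustration, is to ask for the node type. The
enum and helper names are invented for this example; an in-kernel version
would have to resolve the path and inspect the inode mode rather than call
stat().

```c
#include <sys/stat.h>

enum backing_kind {
	BACKING_CHAR,	/* e.g. a devdax character device */
	BACKING_BLOCK,	/* e.g. a pmem block device */
	BACKING_OTHER,	/* exists, but is not a device node */
	BACKING_ERROR,	/* stat() failed */
};

/* Classify a path by node type instead of by its name. */
static enum backing_kind classify_backing_dev(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return BACKING_ERROR;
	if (S_ISCHR(st.st_mode))
		return BACKING_CHAR;
	if (S_ISBLK(st.st_mode))
		return BACKING_BLOCK;
	return BACKING_OTHER;
}
```

Note that this only says "character" or "block"; confirming that the device
is really dax or pmem would still need a further check.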
> +
> +	handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);
> +	if (IS_ERR(handlep->bdev)) {
> +		pr_err("%s: failed blkdev_get_by_path(%s)\n", __func__, fc->source);
> +		return PTR_ERR(handlep->bdev);
> +	}
> +
> +	dax_devp = fs_dax_get_by_bdev(handlep->bdev, &start_off,
> +				      fsi  /* holder */,
> +				      &famfs_blk_dax_holder_ops);
> +	if (IS_ERR(dax_devp)) {
> +		pr_err("%s: unable to get daxdev from handlep->bdev\n", __func__);
> +		bdev_release(handlep);
> +		return -ENODEV;
> +	}
> +	fsi->bdev_handle = handlep;
> +	fsi->dax_devp    = dax_devp;
> +
> +	pr_notice("%s: root device is block dax (%s)\n", __func__, fc->source);

pr_debug()  Kernel log is too noisy anyway! + I'd assume we can tell this succeeded
in lots of other ways.


> +	return 0;
> +}
> +
> +
>  
>  MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 11/20] famfs: Add fs_context_operations
  2024-02-23 17:41 ` [RFC PATCH 11/20] famfs: Add fs_context_operations John Groves
@ 2024-02-26 13:20   ` Jonathan Cameron
  2024-02-26 22:43     ` John Groves
  2024-02-27 13:41   ` Christian Brauner
  1 sibling, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 13:20 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:55 -0600
John Groves <John@Groves.net> wrote:

> This commit introduces the famfs fs_context_operations and
> famfs_get_inode() which is used by the context operations.
> 
> Signed-off-by: John Groves <john@groves.net>
Trivial comments inline.

> ---
>  fs/famfs/famfs_inode.c | 178 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 178 insertions(+)
> 
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> index 82c861998093..f98f82962d7b 100644
> --- a/fs/famfs/famfs_inode.c
> +++ b/fs/famfs/famfs_inode.c
> @@ -41,6 +41,50 @@ static const struct super_operations famfs_ops;
>  static const struct inode_operations famfs_file_inode_operations;
>  static const struct inode_operations famfs_dir_inode_operations;
>  
> +static struct inode *famfs_get_inode(
> +	struct super_block *sb,
> +	const struct inode *dir,
> +	umode_t             mode,
> +	dev_t               dev)
> +{
> +	struct inode *inode = new_inode(sb);
> +
> +	if (inode) {
reverse logic would be simpler and reduce indent.

	if (!inode)
		return NULL;


> +		struct timespec64       tv;
> +
> +		inode->i_ino = get_next_ino();
> +		inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
> +		inode->i_mapping->a_ops = &ram_aops;
> +		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +		mapping_set_unevictable(inode->i_mapping);
> +		tv = inode_set_ctime_current(inode);
> +		inode_set_mtime_to_ts(inode, tv);
> +		inode_set_atime_to_ts(inode, tv);
> +
> +		switch (mode & S_IFMT) {
> +		default:
> +			init_special_inode(inode, mode, dev);
> +			break;
> +		case S_IFREG:
> +			inode->i_op = &famfs_file_inode_operations;
> +			inode->i_fop = &famfs_file_operations;
> +			break;
> +		case S_IFDIR:
> +			inode->i_op = &famfs_dir_inode_operations;
> +			inode->i_fop = &simple_dir_operations;
> +
> +			/* Directory inodes start off with i_nlink == 2 (for "." entry) */
> +			inc_nlink(inode);
> +			break;
> +		case S_IFLNK:
> +			inode->i_op = &page_symlink_inode_operations;
> +			inode_nohighmem(inode);
> +			break;
> +		}
> +	}
> +	return inode;
> +}
> +
>  /**********************************************************************************
>   * famfs super_operations
>   *
> @@ -150,6 +194,140 @@ famfs_open_device(
>  	return 0;
>  }
>  
> +/*****************************************************************************************
> + * fs_context_operations
> + */
> +static int
> +famfs_fill_super(
> +	struct super_block *sb,
> +	struct fs_context  *fc)
> +{
> +	struct famfs_fs_info *fsi = sb->s_fs_info;
> +	struct inode *inode;
> +	int rc = 0;
rc is always assigned before it is read, so no need to initialize it here.

> +
> +	sb->s_maxbytes		= MAX_LFS_FILESIZE;
> +	sb->s_blocksize		= PAGE_SIZE;
> +	sb->s_blocksize_bits	= PAGE_SHIFT;
> +	sb->s_magic		= FAMFS_MAGIC;
> +	sb->s_op		= &famfs_ops;
> +	sb->s_time_gran		= 1;
> +
> +	rc = famfs_open_device(sb, fc);
> +	if (rc)
> +		goto out;
		return rc; //unless you need to do more in out in later patch..

> +
> +	inode = famfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);
> +	sb->s_root = d_make_root(inode);
> +	if (!sb->s_root)
> +		rc = -ENOMEM;
		return -ENOMEM;

	return 0;

> +
> +out:
> +	return rc;
> +}
> +
> +enum famfs_param {
> +	Opt_mode,
> +	Opt_dax,
Why capital O?

> +};
> +

...

> +
> +static DEFINE_MUTEX(famfs_context_mutex);
> +static LIST_HEAD(famfs_context_list);
> +
> +static int famfs_get_tree(struct fs_context *fc)
> +{
> +	struct famfs_fs_info *fsi_entry;
> +	struct famfs_fs_info *fsi = fc->s_fs_info;
> +
> +	fsi->rootdev = kstrdup(fc->source, GFP_KERNEL);
> +	if (!fsi->rootdev)
> +		return -ENOMEM;
> +
> +	/* Fail if famfs is already mounted from the same device */
> +	mutex_lock(&famfs_context_mutex);

New toys might be good to use from start to avoid need for explicit
unlocks in error paths.

	scoped_guard(mutex, &famfs_context_mutex) {
		list_for_each_entry(fsi_entry, &famfs_context_list, fsi_list) {
			if (strcmp(fsi_entry->rootdev, fc->source) == 0) {
			//could invert with a continue to reduce indent
			// or factor this out as a little helper.
			// famfs_check_not_mounted()
				pr_err();
				return -EALREADY;
			}
		}
		list_add(&fsi->fsi_list, &famfs_context_list);
	}

	return get_tree_nodev(...
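
For anyone who hasn't used the guards yet: they are built on the compiler's
cleanup attribute, so the unlock runs whenever the guard variable goes out
of scope, including on an early return. Below is a userspace approximation
with a pthread mutex. GUARD_MUTEX, unlock_cleanup and register_rootdev are
names invented for this sketch, and the small fixed-size table stands in
for famfs_context_list.

```c
#include <errno.h>
#include <pthread.h>
#include <string.h>

#define MAX_ROOTDEVS 8

static pthread_mutex_t ctx_mutex = PTHREAD_MUTEX_INITIALIZER;
static const char *rootdevs[MAX_ROOTDEVS];
static int nr_rootdevs;

/* Called automatically when the guard variable leaves scope. */
static void unlock_cleanup(pthread_mutex_t **m)
{
	pthread_mutex_unlock(*m);
}

/* Lock now, unlock at end of the enclosing scope (GCC/Clang extension). */
#define GUARD_MUTEX(m) \
	pthread_mutex_t *guard_ __attribute__((cleanup(unlock_cleanup))) = \
		(pthread_mutex_lock(m), (m))

static int register_rootdev(const char *source)
{
	GUARD_MUTEX(&ctx_mutex);

	for (int i = 0; i < nr_rootdevs; i++) {
		if (strcmp(rootdevs[i], source) == 0)
			return -EALREADY;	/* no explicit unlock needed */
	}

	if (nr_rootdevs == MAX_ROOTDEVS)
		return -ENOSPC;

	rootdevs[nr_rootdevs++] = source;
	return 0;
}
```

The point is the same as with scoped_guard(): every return path drops the
lock, so the error branch cannot forget to.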

> +	list_for_each_entry(fsi_entry, &famfs_context_list, fsi_list) {
> +		if (strcmp(fsi_entry->rootdev, fc->source) == 0) {
> +			mutex_unlock(&famfs_context_mutex);
> +			pr_err("%s: already mounted from rootdev %s\n", __func__, fc->source);
> +			return -EALREADY;
> +		}
> +	}
> +
> +	list_add(&fsi->fsi_list, &famfs_context_list);
> +	mutex_unlock(&famfs_context_mutex);
> +
> +	return get_tree_nodev(fc, famfs_fill_super);
> +
> +}

>  
>  MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 12/20] famfs: Add inode_operations and file_system_type
  2024-02-23 17:41 ` [RFC PATCH 12/20] famfs: Add inode_operations and file_system_type John Groves
@ 2024-02-26 13:25   ` Jonathan Cameron
  2024-02-26 22:53     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 13:25 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:56 -0600
John Groves <John@Groves.net> wrote:

> This commit introduces the famfs inode_operations. There is nothing really
> unique to famfs here in the inode_operations..
> 
> This commit also introduces the famfs_file_system_type struct and the
> famfs_kill_sb() function.
> 
> Signed-off-by: John Groves <john@groves.net>

Trivial comments only.

> +
> +/*
> + * File creation. Allocate an inode, and we're done..
> + */
> +/* SMP-safe */
> +static int
> +famfs_mknod(
> +	struct mnt_idmap *idmap,
> +	struct inode     *dir,
> +	struct dentry    *dentry,
> +	umode_t           mode,
> +	dev_t             dev)
> +{
> +	struct inode *inode = famfs_get_inode(dir->i_sb, dir, mode, dev);
> +	int error           = -ENOSPC;
> +
> +	if (inode) {

As below. I would flip it for cleaner code/ shorter indent etc.

> +		struct timespec64       tv;
> +
> +		d_instantiate(dentry, inode);
> +		dget(dentry);	/* Extra count - pin the dentry in core */
> +		error = 0;
> +		tv = inode_set_ctime_current(inode);
> +		inode_set_mtime_to_ts(inode, tv);
> +		inode_set_atime_to_ts(inode, tv);
> +	}
> +	return error;
> +}
> +
> +static int famfs_mkdir(
> +	struct mnt_idmap *idmap,
> +	struct inode     *dir,
> +	struct dentry    *dentry,
> +	umode_t           mode)
> +{
> +	int retval = famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFDIR, 0);
> +
> +	if (!retval)
> +		inc_nlink(dir);

Copy the local style, so this is fine if it is a common pattern; otherwise
I'd keep error handling consistently out of line, as that is easier for us
sleepy, caffeine-deprived reviewers.


	if (retval)
		return retval;

	inc_nlink(dir);

	return 0;
> +
> +	return retval;
> +}
> +
> +static int famfs_create(
> +	struct mnt_idmap *idmap,
> +	struct inode     *dir,
> +	struct dentry    *dentry,
> +	umode_t           mode,
> +	bool              excl)
> +{
> +	return famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0);
> +}
> +
> +static int famfs_symlink(
> +	struct mnt_idmap *idmap,
> +	struct inode     *dir,
> +	struct dentry    *dentry,
> +	const char       *symname)
> +{
> +	struct inode *inode;
> +	int error = -ENOSPC;
> +
> +	inode = famfs_get_inode(dir->i_sb, dir, S_IFLNK | 0777, 0);
	if (!inode)
		return -ENOSPC;

> +	if (inode) {
> +		int l = strlen(symname)+1;
> +
> +		error = page_symlink(inode, symname, l);
	if (error) {
		iput(inode);
		return error;
	}
	
	...

> +		if (!error) {
> +			struct timespec64       tv;
> +
> +			d_instantiate(dentry, inode);
> +			dget(dentry);
> +			tv = inode_set_ctime_current(inode);
> +			inode_set_mtime_to_ts(inode, tv);
> +			inode_set_atime_to_ts(inode, tv);
> +		} else
> +			iput(inode);
> +	}
> +	return error;
> +}



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-24  0:07 ` [RFC PATCH 00/20] Introduce the famfs shared-memory file system Luis Chamberlain
@ 2024-02-26 13:27   ` John Groves
  2024-02-26 15:53     ` Luis Chamberlain
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-26 13:27 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/23 04:07PM, Luis Chamberlain wrote:
> On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote:
> > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > CXL-specific in anyway way.
> > 
> > * Famfs creates a simple access method for storing and sharing data in
> >   sharable memory. The memory is exposed and accessed as memory-mappable
> >   dax files.
> > * Famfs supports multiple hosts mounting the same file system from the
> >   same memory (something existing fs-dax file systems don't do).
> > * A famfs file system can be created on either a /dev/pmem device in fs-dax
> >   mode, or a /dev/dax device in devdax mode (the latter depending on
> >   patches 2-6 of this series).
> > 
> > The famfs kernel file system is part the famfs framework; additional
> > components in user space[2] handle metadata and direct the famfs kernel
> > module to instantiate files that map to specific memory. The famfs user
> > space has documentation and a reasonably thorough test suite.
> > 
> > The famfs kernel module never accesses the shared memory directly (either
> > data or metadata). Because of this, shared memory managed by the famfs
> > framework does not create a RAS "blast radius" problem that should be able
> > to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
> > can be expected to kill apps via SIGBUS and cause mounts to be disabled
> > due to memory failure notifications.
> > 
> > Famfs does not attempt to solve concurrency or coherency problems for apps,
> > although it does solve these problems in regard to its own data structures.
> > Apps may encounter hard concurrency problems, but there are use cases that
> > are imminently useful and uncomplicated from a concurrency perspective:
> > serial sharing is one (only one host at a time has access), and read-only
> > concurrent sharing is another (all hosts can read-cache without worry).
> 
> Can you do me a favor, curious if you can run a test like this:
> 
> fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring
> -direct=1
> --group_reporting=1 --alloc-size=1048576 --filesize=1GiB
> --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1
> --directory=/mnt
> 
> What do you get for throughput?
> 
> The larger the system and capacity, the better.
> 
>   Luis

Luis,

First, thanks for paying attention. I think I need to clarify a few things
about famfs and then check how that modifies your ask; apologies if some
are obvious. You should tell me whether this is still interesting given
these clarifications and limitations, or if there is something else you'd
like to see tested instead. But read on, I have run the closest tests I
can.

Famfs files just map to dax memory; they don't have a backing store. So the
io_uring and direct=1 options don't work. The coolness is that the files &
memory can be shared, and that apps can deal with files rather than having
to learn new abstractions.

Famfs files are never allocate-on-write, so --fallocate=none is ok, but
"actual" fallocate doesn't work, and neither does --create_on_open. But fio
seems to be happy if I preallocate the files for the test.

I don't currently have custody of a really beefy system (can get one, just
need to plan ahead). My primary dev system is a 48 HT core E5-2690 v3 @
2.60G (around 10 years old).

I have a 128GB dax device that is backed by ddr4 via efi_fake_mem. So I
can't do 48 x 10 x 1G, but I can do 48 x 10 x 256M. I ran this on
ddr4-backed famfs, and xfs backed by a sata ssd. Probably not fair, but
it's what I have on a Sunday evening.

I can get access to a beefy system with real cxl memory, though I can't
promise that I can report performance from it; I will check into that. But
think about what you're looking for in light of the fact that famfs is just
a shared-memory file system, so no O_DIRECT or io_uring. Basically just
(hopefully efficient) vma fault handling and metadata distribution.
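
To make the access model concrete, here is a minimal sketch of how an app
would touch such a file: map it and use plain loads and stores. It assumes
the file has already been created at its final size (nothing here extends
it), and fill_mapped_file is a name made up for this example. Nothing in it
is famfs-specific, so it runs against any preallocated file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Fill an existing, preallocated file with the byte 'val' through a
 * shared mapping. Returns the number of bytes written, or -1 on error.
 */
static long fill_mapped_file(const char *path, unsigned char val)
{
	struct stat st;
	long ret = -1;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	if (fstat(fd, &st) == 0 && st.st_size > 0) {
		void *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		if (p != MAP_FAILED) {
			/* Plain stores; no read()/write() or O_DIRECT involved. */
			memset(p, val, st.st_size);
			ret = (long)st.st_size;
			munmap(p, st.st_size);
		}
	}

	close(fd);
	return ret;
}
```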

###

Here is famfs. I had to drop the io_uring and script up alloc/creation
of the files (sudo famfs creat -s 256M /mnt/famfs/foo)

$ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=100MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --directory=/mnt/famfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1
...
fio-3.33
Starting 48 processes
Jobs: 40 (f=400)
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=201738: Mon Feb 26 06:48:21 2024
  write: IOPS=15.2k, BW=29.6GiB/s (31.8GB/s)(44.7GiB/1511msec); 0 zone resets
    clat (usec): min=156, max=54645, avg=2077.40, stdev=1730.77
     lat (usec): min=171, max=54686, avg=2404.87, stdev=2056.50
    clat percentiles (usec):
     |  1.00th=[  196],  5.00th=[  243], 10.00th=[  367], 20.00th=[  644],
     | 30.00th=[  857], 40.00th=[ 1352], 50.00th=[ 1876], 60.00th=[ 2442],
     | 70.00th=[ 2868], 80.00th=[ 3228], 90.00th=[ 3884], 95.00th=[ 4555],
     | 99.00th=[ 6390], 99.50th=[ 7439], 99.90th=[16450], 99.95th=[23987],
     | 99.99th=[46924]
   bw (  MiB/s): min=21544, max=28034, per=81.80%, avg=24789.35, stdev=130.16, samples=81
   iops        : min=10756, max=14000, avg=12378.00, stdev=65.06, samples=81
  lat (usec)   : 250=5.42%, 500=9.67%, 750=8.07%, 1000=11.77%
  lat (msec)   : 2=16.87%, 4=39.59%, 10=8.37%, 20=0.17%, 50=0.07%
  lat (msec)   : 100=0.01%
  cpu          : usr=13.26%, sys=81.62%, ctx=2075, majf=0, minf=18159
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,22896,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
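
As a sanity check on the summary line, the reported bandwidth and IOPS can
be recomputed from the "issued rwts" count (22896 writes), the 2 MiB block
size and the 1511 msec runtime. This assumes every issued write was a full
block; the helper names are made up for this example.

```c
/* Total data moved in GiB, given a write count and block size in MiB. */
static double total_gib(unsigned long writes, double bs_mib)
{
	return (double)writes * bs_mib / 1024.0;
}

/* Bandwidth in GiB/s over a runtime given in milliseconds. */
static double gib_per_sec(unsigned long writes, double bs_mib,
			  double runtime_ms)
{
	return total_gib(writes, bs_mib) / (runtime_ms / 1000.0);
}

/* Write operations per second over a runtime given in milliseconds. */
static double write_iops(unsigned long writes, double runtime_ms)
{
	return (double)writes / (runtime_ms / 1000.0);
}
```

With the numbers above this gives about 44.7 GiB moved, about 29.6 GiB/s
and about 15.2k IOPS, which matches what fio printed.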

$ sudo famfs fsck -h /mnt/famfs
Famfs Superblock:
  Filesystem UUID: 591f3f62-0a79-4543-9ab5-e02dc807c76c
  System UUID:     00000000-0000-0000-0000-0cc47aaaa734
  sizeof superblock: 168
  num_daxdevs:              1
  primary: /dev/dax1.0   137438953472

Log stats:
  # of log entriesi in use: 480 of 25575
  Log size in use:          157488
  No allocation errors found

Capacity:
  Device capacity:        128.00G
  Bitmap capacity:        127.99G
  Sum of file sizes:      120.00G
  Allocated space:        120.00G
  Free space:             7.99G
  Space amplification:     1.00
  Percent used:            93.8%

Famfs log:
  480 of 25575 entries used
  480 files
  0 directories

###

Here is the same fio command, plus --ioengine=io_uring and --direct=1. It's
apples and oranges, since famfs is a memory interface and not a storage
interface. This is run on an xfs file system on a SATA ssd.

Note units are msec here, usec above.

fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/home/jmg/t1
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
...
fio-3.33
Starting 48 processes
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
Jobs: 37 (f=370): [W(1),_(2),W(2),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(13),_(1),W(5),_(1),W(5)][72.1%][w=454MiB/s][w=227 IOPS][eta 01m:32s
Jobs: 37 (f=370): [W(1),_(2),W(2),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(13),_(1),W(5),_(1),W(5)][72.4%][w=456MiB/s][w=228 IOPS][eta 01m:31s
Jobs: 36 (f=360): [W(1),_(2),W(2),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(1),_(3),W(13),_(1),W(5),_(1),W(5)][72.9%][w=454MiB/s][w=227 IOPS][eta 01m:29s]
Jobs: 33 (f=330): [_(3),W(2),_(1),W(1),_(1),W(1),_(1),W(4),_(1),W(1),_(1),W(1),_(1),W(1),_(3),W(13),_(1),W(5),_(1),W(2),_(1),W(2)][73.0%][w=458MiB/s][w=229 IOPS][eta 01
Jobs: 30 (f=300): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(3),_(1),W(1),_(3),W(1),_(3),W(7),_(1),W(5),_(1),W(5),_(1),W(2),_(1),W(2)][73.6%][w=462MiB/s][w=231 IOPS][eta 01m
Jobs: 28 (f=280): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(3),_(5),W(1),_(3),W(7),_(1),W(5),_(1),W(5),_(1),W(2),_(2),W(1)][74.1%][w=456MiB/s][w=228 IOPS][eta 01m:25s]
Jobs: 25 (f=250): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(1),W(1),_(5),W(1),_(3),W(2),_(1),W(4),_(1),W(5),_(1),W(5),_(2),W(1),_(2),W(1)][75.1%][w=458MiB/s][w=229 IOP
Jobs: 24 (f=240): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(1),W(1),_(5),W(1),_(3),W(2),_(1),W(3),_(2),W(5),_(1),W(5),_(2),W(1),_(2),W(1)][75.6%][w=456MiB/s][w=228 IOP
Jobs: 23 (f=230): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(1),W(1),_(5),E(1),_(3),W(2),_(1),W(3),_(2),W(5),_(1),W(5),_(2),W(1),_(2),W(1)][76.2%][w=452MiB/s][w=226 IOP
Jobs: 20 (f=200): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(11),W(2),_(1),W(3),_(2),W(5),_(1),W(3),_(1),W(1),_(2),W(1),_(3)][76.7%][w=448MiB/s][w=224 IOPS][eta 01m:15s
Jobs: 19 (f=190): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(11),W(2),_(1),W(3),_(2),W(5),_(2),W(2),_(1),W(1),_(2),W(1),_(3)][77.5%][w=464MiB/s][w=232 IOPS][eta 01m:12s
Jobs: 18 (f=180): [_(3),W(2),_(3),W(1),_(2),W(1),_(11),W(2),_(1),W(3),_(2),W(5),_(2),W(2),_(1),W(1),_(2),W(1),_(3)][78.8%][w=478MiB/s][w=239 IOPS][eta 01m:07s]
Jobs: 4 (f=40): [_(3),W(1),_(22),W(1),_(12),W(1),_(4),W(1),_(3)][92.4%][w=462MiB/s][w=231 IOPS][eta 00m:21s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=210709: Mon Feb 26 07:20:51 2024
  write: IOPS=228, BW=458MiB/s (480MB/s)(114GiB/255942msec); 0 zone resets
    slat (usec): min=39, max=776, avg=186.65, stdev=49.13
    clat (msec): min=4, max=6718, avg=199.27, stdev=324.82
     lat (msec): min=4, max=6718, avg=199.45, stdev=324.82
    clat percentiles (msec):
     |  1.00th=[   30],  5.00th=[   47], 10.00th=[   60], 20.00th=[   69],
     | 30.00th=[   78], 40.00th=[   85], 50.00th=[   95], 60.00th=[  114],
     | 70.00th=[  142], 80.00th=[  194], 90.00th=[  409], 95.00th=[  810],
     | 99.00th=[ 1703], 99.50th=[ 2140], 99.90th=[ 3037], 99.95th=[ 3440],
     | 99.99th=[ 4665]
   bw (  KiB/s): min=195570, max=2422953, per=100.00%, avg=653513.53, stdev=8137.30, samples=17556
   iops        : min=   60, max= 1180, avg=314.22, stdev= 3.98, samples=17556
  lat (msec)   : 10=0.11%, 20=0.37%, 50=5.35%, 100=47.30%, 250=32.22%
  lat (msec)   : 500=6.11%, 750=2.98%, 1000=1.98%, 2000=2.97%, >=2000=0.60%
  cpu          : usr=0.10%, sys=0.01%, ctx=58709, majf=0, minf=669
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=458MiB/s (480MB/s), 458MiB/s-458MiB/s (480MB/s-480MB/s), io=114GiB (123GB), run=255942-255942msec

Disk stats (read/write):
    dm-2: ios=11/82263, merge=0/0, ticks=270/13403617, in_queue=13403887, util=97.10%, aggrios=11/152359, aggrmerge=0/5087, aggrticks=271/11493029, aggrin_queue=11494994, aggrutil=100.00%
  sdb: ios=11/152359, merge=0/5087, ticks=271/11493029, in_queue=11494994, util=100.00%

###

Let me know what else you'd like to see tried.

Regards,
John



* Re: [RFC PATCH 13/20] famfs: Add iomap_ops
  2024-02-23 17:41 ` [RFC PATCH 13/20] famfs: Add iomap_ops John Groves
@ 2024-02-26 13:30   ` Jonathan Cameron
  2024-02-26 23:00     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 13:30 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:57 -0600
John Groves <John@Groves.net> wrote:

> This commit introduces the famfs iomap_ops. When either
> dax_iomap_fault() or dax_iomap_rw() is called, we get a callback
> via our iomap_begin() handler. The question being asked is
> "please resolve (file, offset) to (daxdev, offset)". The function
> famfs_meta_to_dax_offset() does this.
> 
> The per-file metadata is just an extent list to the
> backing dax dev.  The order of this resolution is O(N) for N
> extents. Note with the current user space, files usually have
> only one extent.
> 
> Signed-off-by: John Groves <john@groves.net>
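
For readers following along, the "(file, offset) to (daxdev, offset)" resolution described above can be sketched in user space as a linear walk over an extent list. This is only an illustration of the O(N) lookup, not the code from the patch: struct demo_extent and demo_resolve() are hypothetical names, and the sketch assumes the extents are laid out back to back in file order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one extent of a file on the backing dax device. */
struct demo_extent {
	uint64_t dev_offset;	/* start of the extent on the device */
	uint64_t len;		/* length of the extent in bytes */
};

/*
 * Resolve a file offset to a device offset by walking the extent list.
 * On success, returns 0, stores the device offset in *dev_off and the
 * number of contiguous bytes available from there in *avail.
 * Returns -1 if the offset lies beyond the last extent.
 */
int demo_resolve(const struct demo_extent *ext, size_t n,
		 uint64_t file_off, uint64_t *dev_off, uint64_t *avail)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (file_off < ext[i].len) {
			*dev_off = ext[i].dev_offset + file_off;
			*avail = ext[i].len - file_off;
			return 0;
		}
		file_off -= ext[i].len;	/* skip past this extent */
	}
	return -1;
}
```

With a single extent per file, which the commit message says is the usual case, the loop exits on its first iteration.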

> ---
>  fs/famfs/famfs_file.c | 245 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 245 insertions(+)
>  create mode 100644 fs/famfs/famfs_file.c
> 
> diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
> new file mode 100644
> index 000000000000..fc667d5f7be8
> --- /dev/null
> +++ b/fs/famfs/famfs_file.c
> @@ -0,0 +1,245 @@

> +static int
> +famfs_meta_to_dax_offset(
> +	struct inode *inode,
> +	struct iomap *iomap,
> +	loff_t        offset,
> +	loff_t        len,
> +	unsigned int  flags)
> +{
> +	struct famfs_file_meta *meta = (struct famfs_file_meta *)inode->i_private;

i_private is a void *, so no explicit cast is needed (C converts void * to
any other object pointer type implicitly).


> +
> +/**
> + * famfs_iomap_begin()
> + *
> + * This function is pretty simple because files are
> + * * never partially allocated
> + * * never have holes (never sparse)
> + * * never "allocate on write"
> + */
> +static int
> +famfs_iomap_begin(
> +	struct inode	       *inode,
> +	loff_t			offset,
> +	loff_t			length,
> +	unsigned int		flags,
> +	struct iomap	       *iomap,
> +	struct iomap	       *srcmap)
> +{
> +	struct famfs_file_meta *meta = inode->i_private;
> +	size_t size;
> +	int rc;
> +
> +	size = i_size_read(inode);
> +
> +	WARN_ON(size != meta->file_size);
> +
> +	rc = famfs_meta_to_dax_offset(inode, iomap, offset, length, flags);
> +
> +	return rc;
	return famfs_meta_...

> +}


> +static vm_fault_t
> +famfs_filemap_map_pages(
> +	struct vm_fault	       *vmf,
> +	pgoff_t			start_pgoff,
> +	pgoff_t			end_pgoff)
> +{
> +	vm_fault_t ret;
> +
> +	ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
> +	return ret;
	return filemap_map_pages()....

> +}
> +
>



* Re: [RFC PATCH 14/20] famfs: Add struct file_operations
  2024-02-23 17:41 ` [RFC PATCH 14/20] famfs: Add struct file_operations John Groves
@ 2024-02-26 13:32   ` Jonathan Cameron
  2024-02-26 23:09     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 13:32 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:58 -0600
John Groves <John@Groves.net> wrote:

> This commit introduces the famfs file_operations. We call
> thp_get_unmapped_area() to force PMD page alignment. Our read and
> write handlers (famfs_dax_read_iter() and famfs_dax_write_iter())
> call dax_iomap_rw() to do the work.
> 
> famfs_file_invalid() checks for various ways a famfs file can be
> in an invalid state so we can fail I/O or fault resolution in those
> cases. Those cases include the following:
> 
> * No famfs metadata
> * file i_size does not match the originally allocated size
> * file is not flagged as DAX
> * errors were detected previously on the file
> 
> An invalid file can often be fixed by replaying the log, or by
> umount/mount/log replay - all of which are user space operations.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/famfs/famfs_file.c | 136 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 136 insertions(+)
> 
> diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
> index fc667d5f7be8..5228e9de1e3b 100644
> --- a/fs/famfs/famfs_file.c
> +++ b/fs/famfs/famfs_file.c
> @@ -19,6 +19,142 @@
>  #include <uapi/linux/famfs_ioctl.h>
>  #include "famfs_internal.h"
>  
> +/*********************************************************************
> + * file_operations
> + */
> +
> +/* Reject I/O to files that aren't in a valid state */
> +static ssize_t
> +famfs_file_invalid(struct inode *inode)
> +{
> +	size_t i_size       = i_size_read(inode);
> +	struct famfs_file_meta *meta = inode->i_private;
> +
> +	if (!meta) {
> +		pr_err("%s: un-initialized famfs file\n", __func__);
> +		return -EIO;
> +	}
> +	if (i_size != meta->file_size) {
> +		pr_err("%s: something changed the size from  %ld to %ld\n",
> +		       __func__, meta->file_size, i_size);
> +		meta->error = 1;
> +		return -ENXIO;
> +	}
> +	if (!IS_DAX(inode)) {
> +		pr_err("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
> +		meta->error = 1;
> +		return -ENXIO;
> +	}
> +	if (meta->error) {
> +		pr_err("%s: previously detected metadata errors\n", __func__);
> +		meta->error = 1;

Already set?  If treating it as only a boolean, maybe make it one?

> +		return -EIO;
> +	}
> +	return 0;
> +}



* Re: [RFC PATCH 15/20] famfs: Add ioctl to file_operations
  2024-02-23 17:41 ` [RFC PATCH 15/20] famfs: Add ioctl to file_operations John Groves
@ 2024-02-26 13:44   ` Jonathan Cameron
  0 siblings, 0 replies; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 13:44 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:41:59 -0600
John Groves <John@Groves.net> wrote:

> This commit introduces the per-file ioctl function famfs_file_ioctl()
> into struct file_operations, and introduces the famfs_file_init_dax()
> function (which is called by famfs_file_ioctl())
> 
> famfs_file_init_dax() associates a dax extent list with a file, making
> it into a proper famfs file. It is called from the FAMFSIOC_MAP_CREATE
> ioctl. Starting with an empty file (which is basically a ramfs file),
> this turns the file into a DAX file backed by the specified extent list.
> 
> The other ioctls are:
> 
> FAMFSIOC_NOP - A convenient way for user space to verify it's a famfs file
> FAMFSIOC_MAP_GET - Get the header of the metadata for a file
> FAMFSIOC_MAP_GETEXT - Get the extents for a file
> 
> The latter two, together, are comparable to xfs_bmap. Our user space tools
> use them primarily in testing.
> 
> Signed-off-by: John Groves <john@groves.net>
A few more comments inline. Nothing fundamental, just nice-to-have
simplifications of the code.

> ---
>  fs/famfs/famfs_file.c | 226 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 226 insertions(+)
> 
> diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
> index 5228e9de1e3b..fd42d5966982 100644
> --- a/fs/famfs/famfs_file.c
> +++ b/fs/famfs/famfs_file.c
> @@ -19,6 +19,231 @@
>  #include <uapi/linux/famfs_ioctl.h>
>  #include "famfs_internal.h"
>  
> +/**
> + * famfs_map_meta_alloc() - Allocate famfs file metadata
> + * @mapp:       Pointer to an mcache_map_meta pointer
> + * @ext_count:  The number of extents needed
> + */
> +static int
> +famfs_meta_alloc(
> +	struct famfs_file_meta  **metap,
> +	size_t                    ext_count)
> +{
> +	struct famfs_file_meta *meta;
> +	size_t                  metasz;
> +
> +	*metap = NULL;

Not responsibility of caller?

> +
> +	metasz = sizeof(*meta) + sizeof(*(meta->tfs_extents)) * ext_count;

Looks like struct_size() would be appropriate.
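
For anyone unfamiliar with it: struct_size() computes the size of a structure with a trailing flexible array and saturates instead of wrapping if the multiplication or addition overflows. A rough user-space sketch of the same idea is below. The names demo_meta, demo_extent and demo_meta_alloc() are hypothetical, and the overflow checks use the GCC/Clang __builtin_*_overflow builtins as a stand-in for the kernel macro.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_extent {
	uint64_t offset;
	uint64_t len;
};

/* Hypothetical metadata header followed by a flexible array of extents. */
struct demo_meta {
	size_t ext_count;
	struct demo_extent extents[];
};

/*
 * Allocate a zeroed demo_meta with room for ext_count extents.
 * Returns NULL on allocation failure or if the size computation
 * would overflow size_t (the case struct_size() protects against).
 */
struct demo_meta *demo_meta_alloc(size_t ext_count)
{
	struct demo_meta *meta;
	size_t bytes;

	if (__builtin_mul_overflow(ext_count, sizeof(meta->extents[0]), &bytes) ||
	    __builtin_add_overflow(bytes, sizeof(*meta), &bytes))
		return NULL;

	meta = calloc(1, bytes);
	if (!meta)
		return NULL;

	meta->ext_count = ext_count;
	return meta;
}
```

Returning the pointer directly (rather than through an out parameter plus an int) also sidesteps the question of who clears *metap.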


> +
> +	meta = kzalloc(metasz, GFP_KERNEL);
> +	if (!meta)
> +		return -ENOMEM;
> +
> +	meta->tfs_extent_ct = ext_count;
> +	*metap = meta;
> +
> +	return 0;
> +}
> +
> +static void
> +famfs_meta_free(
> +	struct famfs_file_meta *map)
> +{
> +	kfree(map);
Given this is just kfree you can use __free magic to simplify things below.

> +}
> +
> +/**
> + * famfs_file_init_dax() - FAMFSIOC_MAP_CREATE ioctl handler
> + * @file:
> + * @arg:        ptr to struct mcioc_map in user space
> + *
> + * Setup the dax mapping for a file. Files are created empty, and then function is called
> + * (by famfs_file_ioctl()) to setup the mapping and set the file size.
> + */
> +static int
> +famfs_file_init_dax(
> +	struct file    *file,
> +	void __user    *arg)
> +{
> +	struct famfs_extent    *tfs_extents = NULL;
> +	struct famfs_file_meta *meta = NULL;
> +	struct inode           *inode;
> +	struct famfs_ioc_map    imap;
> +	struct famfs_fs_info   *fsi;
> +	struct super_block     *sb;
> +	int    alignment_errs = 0;
> +	size_t extent_total = 0;
> +	size_t ext_count;
> +	int    rc = 0;
> +	int    i;
> +
> +	rc = copy_from_user(&imap, arg, sizeof(imap));
> +	if (rc)
> +		return -EFAULT;
> +
> +	ext_count = imap.ext_list_count;
> +	if (ext_count < 1) {
> +		rc = -ENOSPC;
> +		goto errout;
		meta data not yet allocated.
		return -ENOSPC;

> +	}
> +
> +	if (ext_count > FAMFS_MAX_EXTENTS) {
> +		rc = -E2BIG;
> +		goto errout;	
		return 

> +	}
> +
> +	inode = file_inode(file);
> +	if (!inode) {
> +		rc = -EBADF;
> +		goto errout;
		return;

> +	}
> +	sb  = inode->i_sb;
> +	fsi = inode->i_sb->s_fs_info;
> +
> +	tfs_extents = &imap.ext_list[0];
> +
> +	rc = famfs_meta_alloc(&meta, ext_count);
> +	if (rc)
> +		goto errout;
	return ...

	only after this point should there be any
	meta data to free on exit?

> +
> +	meta->file_type = imap.file_type;
> +	meta->file_size = imap.file_size;
> +
> +	/* Fill in the internal file metadata structure */
> +	for (i = 0; i < imap.ext_list_count; i++) {
> +		size_t len;
> +		off_t  offset;
> +
> +		offset = imap.ext_list[i].offset;
> +		len    = imap.ext_list[i].len;
> +
> +		extent_total += len;
> +
> +		if (WARN_ON(offset == 0 && meta->file_type != FAMFS_SUPERBLOCK)) {
> +			rc = -EINVAL;
> +			goto errout;
> +		}
> +
> +		meta->tfs_extents[i].offset = offset;
> +		meta->tfs_extents[i].len    = len;
> +
> +		/* All extent addresses/offsets must be 2MiB aligned,
> +		 * and all but the last length must be a 2MiB multiple.
> +		 */
> +		if (!IS_ALIGNED(offset, PMD_SIZE)) {
> +			pr_err("%s: error ext %d hpa %lx not aligned\n",
> +			       __func__, i, offset);
> +			alignment_errs++;
> +		}
> +		if (i < (imap.ext_list_count - 1) && !IS_ALIGNED(len, PMD_SIZE)) {
> +			pr_err("%s: error ext %d length %ld not aligned\n",
> +			       __func__, i, len);
> +			alignment_errs++;
> +		}
> +	}
> +
> +	/*
> +	 * File size can be <= ext list size, since extent sizes are constrained
> +	 * to PMD multiples
> +	 */
> +	if (imap.file_size > extent_total) {
> +		pr_err("%s: file size %lld larger than ext list size %lld\n",
> +		       __func__, (u64)imap.file_size, (u64)extent_total);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	if (alignment_errs > 0) {
> +		pr_err("%s: there were %d alignment errors in the extent list\n",
> +		       __func__, alignment_errs);
> +		rc = -EINVAL;
> +		goto errout;
> +	}
> +
> +	/* Publish the famfs metadata on inode->i_private */
> +	inode_lock(inode);

It's easy to add a guard definition, and it may be useful enough to bother:
you could then do the following, which makes the error handling align with
the other cases.

	scoped_guard(inode_sem, inode) {
		if (inode->i_private) {
			rc = -EEXIST;
			goto errout;
		}
		inode->...

	}
> +	if (inode->i_private) {
> +		rc = -EEXIST; /* file already has famfs metadata */
> +	} else {
> +		inode->i_private = meta;

You could use __free on the meta data and
		inode->i_private = no_free_ptr(meta);
here. Then all your earlier error paths become direct returns.
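
To illustrate the pattern outside the kernel: the scope-based cleanup helpers in linux/cleanup.h are built on the compiler's cleanup attribute. The sketch below imitates them in user space. The demo_* names and macros are made up for this example and are not the kernel definitions; the free counter exists only so the behaviour can be observed.

```c
#include <stdlib.h>

/* Number of times the cleanup handler actually freed something (demo only). */
int demo_frees;

/* Cleanup handler: receives a pointer to the annotated variable. */
void demo_free_ptr(void *p)
{
	void *ptr = *(void **)p;

	if (ptr) {
		free(ptr);
		demo_frees++;
	}
}

/* Rough analogue of the kernel's __free(kfree) annotation. */
#define demo_auto_free __attribute__((cleanup(demo_free_ptr)))

/* Rough analogue of no_free_ptr(): hand the pointer over, disarm cleanup. */
#define demo_no_free_ptr(p) \
	({ __typeof__(p) tmp_ = (p); (p) = NULL; tmp_; })

struct demo_inode {
	void *i_private;
};

/* Every error path is a plain return; the buffer is freed automatically. */
int demo_publish(struct demo_inode *inode, int fail_validation)
{
	demo_auto_free int *meta = calloc(1, sizeof(*meta));

	if (!meta)
		return -1;
	if (fail_validation)
		return -2;		/* meta freed by the cleanup handler */
	if (inode->i_private)
		return -3;		/* already published; meta freed */

	inode->i_private = demo_no_free_ptr(meta);	/* ownership moves */
	return 0;
}
```

Only the success path transfers ownership, so no errout label or "if (rc)" check is needed to decide whether to free.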

> +		i_size_write(inode, imap.file_size);
> +		inode->i_flags |= S_DAX;
> +	}
> +	inode_unlock(inode);
> +
> + errout:
> +	if (rc)
> +		famfs_meta_free(meta);
A separate error path would be easier to follow, as it needs no if (rc).

> +
> +	return rc;
> +}
> +
> +/**
> + * famfs_file_ioctl() -  top-level famfs file ioctl handler
> + * @file:
> + * @cmd:
> + * @arg:
> + */
> +static
> +long
> +famfs_file_ioctl(
> +	struct file    *file,
> +	unsigned int    cmd,
> +	unsigned long   arg)
> +{
> +	long rc;
> +
> +	switch (cmd) {
> +	case FAMFSIOC_NOP:
> +		rc = 0;
		return 0;
> +		break;
> +
> +	case FAMFSIOC_MAP_CREATE:
> +		rc = famfs_file_init_dax(file, (void *)arg);
		return famfs_file_init_dax()

> +		break;
> +
> +	case FAMFSIOC_MAP_GET: {
> +		struct inode *inode = file_inode(file);
> +		struct famfs_file_meta *meta = inode->i_private;
> +		struct famfs_ioc_map umeta;
> +
> +		memset(&umeta, 0, sizeof(umeta));
> +
> +		if (meta) {
> +			/* TODO: do more to harmonize these structures */
> +			umeta.extent_type    = meta->tfs_extent_type;
> +			umeta.file_size      = i_size_read(inode);
> +			umeta.ext_list_count = meta->tfs_extent_ct;
> +
> +			rc = copy_to_user((void __user *)arg, &umeta, sizeof(umeta));
> +			if (rc)
> +				pr_err("%s: copy_to_user returned %ld\n", __func__, rc);
> +
> +		} else {
> +			rc = -EINVAL;
> +		}
Flip logic.

		if (!meta)
			return -EINVAL;

		umeta ...
		return 0;

> +	}
> +		break;
> +	case FAMFSIOC_MAP_GETEXT: {
> +		struct inode *inode = file_inode(file);
> +		struct famfs_file_meta *meta = inode->i_private;
> +
> +		if (meta)
> +			rc = copy_to_user((void __user *)arg, meta->tfs_extents,
> +					  meta->tfs_extent_ct * sizeof(struct famfs_extent));
> +		else
> +			rc = -EINVAL;
		if (!meta)
			return -EINVAL;

		return copy_to_user

> +	}
> +		break;
> +	default:
> +		rc = -ENOTTY;
return -ENOTTY;

> +		break;
> +	}
> +
> +	return rc;
Early returns will simplify the flow for anyone reading this.
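
As a rough picture of the resulting shape, here is a small user-space sketch of a command dispatcher in which every case returns directly. The command numbers, struct demo_file and demo_ioctl() are hypothetical and only mirror the structure of the handler above.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical command numbers; the real ones come from famfs_ioctl.h. */
enum demo_cmd { DEMO_NOP = 1, DEMO_MAP_GET = 2 };

struct demo_file {
	const long *meta;	/* NULL until metadata has been attached */
};

/* Each case returns directly, so there is no rc variable to track. */
long demo_ioctl(const struct demo_file *file, unsigned int cmd, long *out)
{
	switch (cmd) {
	case DEMO_NOP:
		return 0;

	case DEMO_MAP_GET:
		if (!file->meta)
			return -EINVAL;

		*out = *file->meta;
		return 0;

	default:
		return -ENOTTY;
	}
}
```

One caveat for the real handler: copy_to_user() returns the number of bytes that could not be copied rather than an errno, so a nonzero result would normally be mapped to -EFAULT instead of being returned as-is.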

> +}



* Re: [RFC PATCH 17/20] famfs: Add module stuff
  2024-02-23 17:42 ` [RFC PATCH 17/20] famfs: Add module stuff John Groves
@ 2024-02-26 13:47   ` Jonathan Cameron
  2024-02-27 22:15     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 13:47 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:42:01 -0600
John Groves <John@Groves.net> wrote:

> This commit introduces the module init and exit machinery for famfs.
> 
> Signed-off-by: John Groves <john@groves.net>
I'd prefer to see this from the start, with the functionality of the module
built up as you go + build logic in place.  Makes it easy to spot places
where the patches aren't appropriately self-contained.
> ---
>  fs/famfs/famfs_inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> index ab46ec50b70d..0d659820e8ff 100644
> --- a/fs/famfs/famfs_inode.c
> +++ b/fs/famfs/famfs_inode.c
> @@ -462,4 +462,48 @@ static struct file_system_type famfs_fs_type = {
>  	.fs_flags	  = FS_USERNS_MOUNT,
>  };
>  
> +/*****************************************************************************************
> + * Module stuff

I'd drop these driver structure comments. They add little beyond
a high possibility of being wrong after the code has evolved a bit.

> + */
> +static struct kobject *famfs_kobj;
> +
> +static int __init init_famfs_fs(void)
> +{
> +	int rc;
> +
> +#if defined(CONFIG_DEV_DAX_IOMAP)
> +	pr_notice("%s: Your kernel supports famfs on /dev/dax\n", __func__);
> +#else
> +	pr_notice("%s: Your kernel does not support famfs on /dev/dax\n", __func__);
> +#endif
> +	famfs_kobj = kobject_create_and_add(MODULE_NAME, fs_kobj);
> +	if (!famfs_kobj) {
> +		pr_warn("Failed to create kobject\n");
> +		return -ENOMEM;
> +	}
> +
> +	rc = sysfs_create_group(famfs_kobj, &famfs_attr_group);
> +	if (rc) {
> +		kobject_put(famfs_kobj);
> +		pr_warn("%s: Failed to create sysfs group\n", __func__);
> +		return rc;
> +	}
> +
> +	return register_filesystem(&famfs_fs_type);

If this fails, do we not leak the kobj and sysfs groups?
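
The usual way to avoid that kind of leak is to unwind in reverse order through goto labels. A self-contained user-space sketch follows; the demo_* helpers are hypothetical stand-ins for kobject_create_and_add(), sysfs_create_group() and register_filesystem(), and the bitmask exists only to make leaks observable.

```c
#include <errno.h>

/* Which demo resources are currently held (bit 0: kobject, bit 1: group). */
unsigned int demo_held;

int demo_create_kobj(int fail)
{
	if (fail)
		return -ENOMEM;
	demo_held |= 1u;
	return 0;
}

int demo_create_group(int fail)
{
	if (fail)
		return -EIO;
	demo_held |= 2u;
	return 0;
}

int demo_register_fs(int fail)
{
	return fail ? -EBUSY : 0;
}

void demo_remove_group(void)
{
	demo_held &= ~2u;
}

void demo_put_kobj(void)
{
	demo_held &= ~1u;
}

/* fail_step selects which step fails: 0 = none, 1..3 = that step. */
int demo_init(int fail_step)
{
	int rc;

	rc = demo_create_kobj(fail_step == 1);
	if (rc)
		return rc;

	rc = demo_create_group(fail_step == 2);
	if (rc)
		goto err_put_kobj;

	rc = demo_register_fs(fail_step == 3);
	if (rc)
		goto err_remove_group;

	return 0;

err_remove_group:
	demo_remove_group();
err_put_kobj:
	demo_put_kobj();
	return rc;
}
```

Whichever step fails, everything acquired before it is released, and the exit function can then tear down in the same reverse order.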

> +}
> +
> +static void
> +__exit famfs_exit(void)
> +{
> +	sysfs_remove_group(famfs_kobj,  &famfs_attr_group);
> +	kobject_put(famfs_kobj);
> +	unregister_filesystem(&famfs_fs_type);
> +	pr_info("%s: unregistered\n", __func__);
> +}
> +
> +
> +fs_initcall(init_famfs_fs);
> +module_exit(famfs_exit);
> +
> +MODULE_AUTHOR("John Groves, Micron Technology");
>  MODULE_LICENSE("GPL");



* Re: [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch
  2024-02-23 17:42 ` [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch John Groves
@ 2024-02-26 13:52   ` Jonathan Cameron
  2024-02-27 22:27     ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 13:52 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, 23 Feb 2024 11:42:02 -0600
John Groves <John@Groves.net> wrote:

> This commit introduces the ability to open a character /dev/dax device
> instead of a block /dev/pmem device. This rests on the dev_dax_iomap
> patches earlier in this series.

Not sure the back reference is needed given it's in the series.

> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/famfs/famfs_inode.c | 97 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 87 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> index 0d659820e8ff..7d65ac497147 100644
> --- a/fs/famfs/famfs_inode.c
> +++ b/fs/famfs/famfs_inode.c
> @@ -215,6 +215,93 @@ static const struct super_operations famfs_ops = {
>  	.show_options	= famfs_show_options,
>  };
>  
> +/*****************************************************************************/
> +
> +#if defined(CONFIG_DEV_DAX_IOMAP)
> +
> +/*
> + * famfs dax_operations  (for char dax)
> + */
> +static int
> +famfs_dax_notify_failure(struct dax_device *dax_dev, u64 offset,
> +			u64 len, int mf_flags)
> +{
> +	pr_err("%s: offset %lld len %llu flags %x\n", __func__,
> +	       offset, len, mf_flags);
> +	return -EOPNOTSUPP;
> +}
> +
> +static const struct dax_holder_operations famfs_dax_holder_ops = {
> +	.notify_failure		= famfs_dax_notify_failure,
> +};
> +
> +/*****************************************************************************/
> +
> +/**
> + * famfs_open_char_device()
> + *
> + * Open a /dev/dax device. This only works in kernels with the dev_dax_iomap patch

That comment you definitely don't need as this won't get merged without
that patch being in place.


> + */
> +static int
> +famfs_open_char_device(
> +	struct super_block *sb,
> +	struct fs_context  *fc)
> +{
> +	struct famfs_fs_info *fsi = sb->s_fs_info;
> +	struct dax_device    *dax_devp;
> +	struct inode         *daxdev_inode;
> +
> +	int rc = 0;
No need to initialize; rc is set in all paths where it's used.

> +
> +	pr_notice("%s: Opening character dax device %s\n", __func__, fc->source);

pr_debug

> +
> +	fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);
> +	if (IS_ERR(fsi->dax_filp)) {
> +		pr_err("%s: failed to open dax device %s\n",
> +		       __func__, fc->source);
> +		fsi->dax_filp = NULL;
Better to use a local variable

	fp = filp_open(fc->source, O_RDWR, 0);
	if (IS_ERR(fp)) {
		pr_err.
		return;
	}
	fsi->dax_filp = fp;
or similar.
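
There is a second reason to prefer the local: as quoted, the error path clears fsi->dax_filp and then calls PTR_ERR() on it, which would appear to yield 0 and report success. The user-space sketch below imitates the kernel's error-pointer helpers to show the local-variable version; every demo_* name is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal user-space imitation of the kernel's ERR_PTR helpers. */
#define DEMO_MAX_ERRNO 4095

void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_fsi {
	void *dax_filp;
};

/* Hypothetical opener: returns an error pointer when asked to fail. */
void *demo_open(int fail)
{
	static int dummy_file;

	return fail ? demo_err_ptr(-ENOENT) : (void *)&dummy_file;
}

/* The error code is read from the local before anything is overwritten. */
int demo_open_char_device(struct demo_fsi *fsi, int fail)
{
	void *fp = demo_open(fail);

	if (demo_is_err(fp))
		return (int)demo_ptr_err(fp);

	fsi->dax_filp = fp;
	return 0;
}
```

On failure the fsi field is never touched, so it never holds an error pointer and never needs resetting.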

> +		return PTR_ERR(fsi->dax_filp);
> +	}
> +
> +	daxdev_inode = file_inode(fsi->dax_filp);
> +	dax_devp     = inode_dax(daxdev_inode);
> +	if (IS_ERR(dax_devp)) {
> +		pr_err("%s: unable to get daxdev from inode for %s\n",
> +		       __func__, fc->source);
> +		rc = -ENODEV;
> +		goto char_err;
> +	}
> +
> +	rc = fs_dax_get(dax_devp, fsi, &famfs_dax_holder_ops);
> +	if (rc) {
> +		pr_info("%s: err attaching famfs_dax_holder_ops\n", __func__);
> +		goto char_err;
> +	}
> +
> +	fsi->bdev_handle = NULL;
> +	fsi->dax_devp = dax_devp;
> +
> +	return 0;
> +
> +char_err:
> +	filp_close(fsi->dax_filp, NULL);

You carefully set fsi->dax_filp to NULL in the other error paths.
Why there and not here?

> +	return rc;
> +}
> +
> +#else /* CONFIG_DEV_DAX_IOMAP */
> +static int
> +famfs_open_char_device(
> +	struct super_block *sb,
> +	struct fs_context  *fc)
> +{
> +	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> +	       __func__, fc->source);
> +	return -ENODEV;
> +}
> +
> +
> +#endif /* CONFIG_DEV_DAX_IOMAP */
> +
>  /***************************************************************************************
>   * dax_holder_operations for block dax
>   */
> @@ -236,16 +323,6 @@ const struct dax_holder_operations famfs_blk_dax_holder_ops = {
>  	.notify_failure		= famfs_blk_dax_notify_failure,
>  };
>  

Put it in the right place earlier! Makes this less noisy.

> -static int
> -famfs_open_char_device(
> -	struct super_block *sb,
> -	struct fs_context  *fc)
> -{
> -	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> -	       __func__, fc->source);
> -	return -ENODEV;
> -}
> -
>  /**
>   * famfs_open_device()
>   *



* Re: [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
  2024-02-26 12:05   ` Jonathan Cameron
@ 2024-02-26 15:00     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 15:00 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:05PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:46 -0600
> John Groves <John@Groves.net> wrote:
> 
> > This function should be called by fs-dax file systems after opening the
> > devdax device. This adds holder_operations.
> > 
> > This function serves the same role as fs_dax_get_by_bdev(), which dax
> > file systems call after opening the pmem block device.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> 
> A few trivial comments from a first read to get my head around this.
> 
> Yeah, it is only an RFC, but who doesn't like tidy code? :)

Hope your eyes don't burn too much ;)
> 
> 
> > ---
> >  drivers/dax/super.c | 38 ++++++++++++++++++++++++++++++++++++++
> >  include/linux/dax.h |  5 +++++
> >  2 files changed, 43 insertions(+)
> > 
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index f4b635526345..fc96362de237 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -121,6 +121,44 @@ void fs_put_dax(struct dax_device *dax_dev, void *holder)
> >  EXPORT_SYMBOL_GPL(fs_put_dax);
> >  #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
> >  
> > +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> > +
> > +/**
> > + * fs_dax_get()
> 
> Smells like kernel-doc, but I'm fairly sure it needs a short description.
> Have you sanity checked for warnings when running scripts/kernel-doc on it?

Right, and there were other cases. Randy pointed one out, and I've already
gone through and "fixed" them.

> 
> > + *
> > + * fs-dax file systems call this function to prepare to use a devdax device for fsdax.
> Trivial, but the lines are too long. Keep under 80 chars unless there is
> a strong readability argument for not doing so.

I was under the impression the "kids these days" have a 100 column standard.
But I will go through and limit lines to 80 except where it gets too awkward.

> 
> 
> > + * This is like fs_dax_get_by_bdev(), but the caller already has struct dev_dax (and there
> > + * is no bdev). The holder makes this exclusive.
> 
> Not familiar with this area: what does exclusive mean here?

The holder_ops are set via cmpxchg, in such a way that if there are already
holder_ops, the call to fs_dax_get() will fail. (as it should)
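
For readers who want to see the exclusivity rule in isolation, here is a user-space sketch using C11 atomics in place of the kernel's cmpxchg(). The struct and function names are hypothetical, and returning -EBUSY for an already-claimed device is an assumption of this sketch; the patch as posted returns -1.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct demo_dax_dev {
	_Atomic(void *) holder_data;	/* NULL means "no holder yet" */
};

/*
 * Claim the device for holder. Only the first caller can swap NULL for
 * its own pointer; later callers see a non-NULL value and get -EBUSY.
 */
int demo_claim(struct demo_dax_dev *dev, void *holder)
{
	void *expected = NULL;

	if (!atomic_compare_exchange_strong(&dev->holder_data, &expected, holder))
		return -EBUSY;
	return 0;
}

/* Release the claim, but only if holder is the current owner. */
int demo_release(struct demo_dax_dev *dev, void *holder)
{
	void *expected = holder;

	if (!atomic_compare_exchange_strong(&dev->holder_data, &expected, NULL))
		return -EINVAL;
	return 0;
}
```

Because the test and the store happen as one atomic step, two file systems racing to claim the same device cannot both succeed.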

> 
> > + *
> > + * @dax_dev: dev to be prepared for fs-dax usage
> > + * @holder: filesystem or mapped device inside the dax_device
> > + * @hops: operations for the inner holder
> > + *
> > + * Returns: 0 on success, -1 on failure
> 
> Why not return < 0 and use somewhat useful return values?

Good idea, will do.

> 
> > + */
> > +int fs_dax_get(
> > +	struct dax_device *dax_dev,
> > +	void *holder,
> > +	const struct dax_holder_operations *hops)
> 
> Match local style for indents - it's a bit inconsistent but probably...
> 
> > int fs_dax_get(struct dax_device *dax_dev, void *holder,
> 	       const struct dax_holder_operations *hops)

Done

> 
> > +{
> > +	/* dax_dev->ops should have been populated by devm_create_dev_dax() */
> > +	if (WARN_ON(!dax_dev->ops))
> > +		return -1;
> > +
> > +	if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
> 
> You dereferenced dax_dev on the line above so check is too late or
> unnecessary

Good catch, thank you!

> 
> > +		return -1;
> > +
> > +	if (cmpxchg(&dax_dev->holder_data, NULL, holder)) {
> > +		pr_warn("%s: holder_data already set\n", __func__);
> 
> Perhaps nicer to use a pr_fmt() deal with the func name if you need it.
> or make it pr_debug and let dynamic debug control formatting if anyone
> wants the function name.

Sounds good.

> 
> > +		return -1;
> > +	}
> > +	dax_dev->holder_ops = hops;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(fs_dax_get);
> > +#endif /* DEV_DAX_IOMAP */
> > +
> >  enum dax_device_flags {
> >  	/* !alive + rcu grace period == no new operations / mappings */
> >  	DAXDEV_ALIVE,
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index b463502b16e1..e973289bfde3 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -57,7 +57,12 @@ struct dax_holder_operations {
> >  
> >  #if IS_ENABLED(CONFIG_DAX)
> >  struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
> > +
> > +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> > +int fs_dax_get(struct dax_device *dax_dev, void *holder, const struct dax_holder_operations *hops);
> line wrap < 80 chars

Roger that

> 
> > +#endif
> >  void *dax_holder(struct dax_device *dax_dev);
> > +struct dax_device *inode_dax(struct inode *inode);
> 
> Unrelated change?

Kinda, but I'm not sure there is a better home for this one. Patch 18,
which is a famfs patch, calls inode_dax(). It was already exported but not
prototyped in dax.h.

Mixing it in with other dev_dax_iomap content seems better than mixing it
with famfs content. I could make it a separate patch, but I was trying to
follow some old docs that said to keep patch sets <= 15 patches, which I
deemed impossible here.

What say others?

> 
> >  void put_dax(struct dax_device *dax_dev);
> >  void kill_dax(struct dax_device *dax_dev);
> >  void dax_write_cache(struct dax_device *dax_dev, bool wc);
> 

Thanks Jonathan!



* Re: [RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now
  2024-02-26 12:10   ` Jonathan Cameron
@ 2024-02-26 15:13     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 15:13 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:10PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:47 -0600
> John Groves <John@Groves.net> wrote:
> 
> > bus.c can't call functions in device.c - that creates a circular linkage
> > dependency.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> 
> This also adds the export which you should mention!
> 
> Do they need it already? Seems like tense of patch title
> may be wrong.

I added "Also exports dax_pgoff_to_phys() since both bus.c and
device.c now call it."

The export is necessary because bus.c and device.c are not in the same .ko

Let me know if it seems like I'm misunderstanding...

> 
> > ---
> >  drivers/dax/bus.c    | 24 ++++++++++++++++++++++++
> >  drivers/dax/device.c | 23 -----------------------
> >  2 files changed, 24 insertions(+), 23 deletions(-)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 1ff1ab5fa105..664e8c1b9930 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -1325,6 +1325,30 @@ static const struct device_type dev_dax_type = {
> >  	.groups = dax_attribute_groups,
> >  };
> >  
> > +/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c  */
> > +__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
> > +			      unsigned long size)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < dev_dax->nr_range; i++) {
> > +		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
> > +		struct range *range = &dax_range->range;
> > +		unsigned long long pgoff_end;
> > +		phys_addr_t phys;
> > +
> > +		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
> > +		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
> > +			continue;
> > +		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
> > +		if (phys + size - 1 <= range->end)
> > +			return phys;
> > +		break;
> > +	}
> > +	return -1;
> 
> Not related to your patch but returning -1 in a phys_addr_t isn't ideal.
> I assume aim is all bits set as a marker, in which case
> PHYS_ADDR_MAX from limits.h would make things clearer.

Perhaps Dan or the other dax people can comment on this? I just moved the
function verbatim, but Jonathan makes a good point!
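
To make the PHYS_ADDR_MAX point concrete, here is a userspace sketch of the same lookup with the sentinel spelled out. The struct range_sketch type, the 4 KiB page size, and the 64-bit phys_addr_t are assumptions made for the example; the PHYS_ADDR_MAX definition mirrors the kernel's. It is not a proposed replacement for the moved function.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;		/* assumption: 64-bit physical addresses */
#define PHYS_ADDR_MAX (~(phys_addr_t)0)	/* all bits set, as in the kernel */
#define PAGE_SHIFT 12			/* assumption: 4 KiB pages */

/* Hypothetical flattened form of struct dev_dax_range. */
struct range_sketch {
	uint64_t pgoff;		/* first page offset covered by this range */
	phys_addr_t start;	/* first byte */
	phys_addr_t end;	/* last byte, inclusive */
};

/* Translate a page offset to a physical address, or PHYS_ADDR_MAX. */
phys_addr_t pgoff_to_phys_sketch(const struct range_sketch *r, int nr,
				 uint64_t pgoff, uint64_t size)
{
	for (int i = 0; i < nr; i++) {
		uint64_t npages = (r[i].end - r[i].start + 1) >> PAGE_SHIFT;
		uint64_t pgoff_end = r[i].pgoff + npages - 1;
		phys_addr_t phys;

		if (pgoff < r[i].pgoff || pgoff > pgoff_end)
			continue;
		phys = ((pgoff - r[i].pgoff) << PAGE_SHIFT) + r[i].start;
		if (phys + size - 1 <= r[i].end)
			return phys;
		break;	/* right range, but the request runs past its end */
	}
	return PHYS_ADDR_MAX;	/* explicit sentinel instead of -1 */
}

/* Small usage check with two discontiguous ranges. */
void pgoff_to_phys_sketch_demo(void)
{
	const struct range_sketch r[] = {
		{ 0, 0x100000, 0x101fff },	/* pages 0-1 */
		{ 2, 0x200000, 0x200fff },	/* page 2 */
	};

	assert(pgoff_to_phys_sketch(r, 2, 1, 4096) == 0x101000);
	assert(pgoff_to_phys_sketch(r, 2, 5, 4096) == PHYS_ADDR_MAX);
}
```

Callers would then compare against PHYS_ADDR_MAX rather than -1, which makes the intent visible without changing the value returned.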

Thanks,
John



* Re: [RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap
  2024-02-26 12:21   ` Jonathan Cameron
@ 2024-02-26 15:48     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 15:48 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:21PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:48 -0600
> John Groves <John@Groves.net> wrote:
> 
> > Save the kva from memremap because we need it for iomap rw support
> > 
> > Prior to famfs, there were no iomap users of /dev/dax - so the virtual
> > address from memremap was not needed.
> > 
> > Also: in some cases dev_dax_probe() is called with the first
> > dev_dax->range offset past pgmap[0].range. In those cases we need to
> > add the difference to virt_addr in order to have the physaddr's in
> > dev_dax->ranges match dev_dax->virt_addr.
> 
> Probably good to have info on when this happens and preferably why
> this dragon is there.

I added this paragraph:

  This happens with devdax devices that started as pmem and got converted
  to devdax. I'm not sure whether the offset is due to label storage, or
  page tables. Dan?

...which is also insufficient, but perhaps Dan or somebody else from the
dax side can correct this.
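
The offset arithmetic described in the commit message can be sketched in userspace as follows. The helper name dax_virt_addr_sketch() and its parameters are hypothetical; it also uses void * for the result, as suggested later in this review. One deliberate difference: where the patch does WARN_ON() and falls back to an offset of 0, the sketch returns NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the kernel virtual address that memremap
 * returned for the start of the pgmap, return the address corresponding
 * to the start of the first dev_dax range.
 */
void *dax_virt_addr_sketch(void *pgmap_kva, uint64_t pgmap_phys,
			   uint64_t range_phys)
{
	/* The data must not start before the mapped region. */
	if (pgmap_phys > range_phys)
		return NULL;	/* the patch WARNs and uses offset 0 instead */

	/* Shift the virtual address by the same distance as the physical one. */
	return (char *)pgmap_kva + (range_phys - pgmap_phys);
}

/* Small usage check: an 8-byte data offset into the mapping. */
void dax_virt_addr_sketch_demo(void)
{
	char mapping[16];

	assert(dax_virt_addr_sketch(mapping, 0x1000, 0x1008) == mapping + 8);
	assert(dax_virt_addr_sketch(mapping, 0x1000, 0x1000) == mapping);
}
```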

> 
> > 
> > Dragons...
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  drivers/dax/dax-private.h |  1 +
> >  drivers/dax/device.c      | 15 +++++++++++++++
> >  2 files changed, 16 insertions(+)
> > 
> > diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
> > index 446617b73aea..894eb1c66b4a 100644
> > --- a/drivers/dax/dax-private.h
> > +++ b/drivers/dax/dax-private.h
> > @@ -63,6 +63,7 @@ struct dax_mapping {
> >  struct dev_dax {
> >  	struct dax_region *region;
> >  	struct dax_device *dax_dev;
> > +	u64 virt_addr;
> 
> Why as a u64? If it's a virt address why not just void *?

Changed to void * - thanks

> 
> >  	unsigned int align;
> >  	int target_node;
> >  	bool dyn_id;
> > diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> > index 40ba660013cf..6cd79d00fe1b 100644
> > --- a/drivers/dax/device.c
> > +++ b/drivers/dax/device.c
> > @@ -372,6 +372,7 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
> >  	struct dax_device *dax_dev = dev_dax->dax_dev;
> >  	struct device *dev = &dev_dax->dev;
> >  	struct dev_pagemap *pgmap;
> > +	u64 data_offset = 0;
> >  	struct inode *inode;
> >  	struct cdev *cdev;
> >  	void *addr;
> > @@ -426,6 +427,20 @@ static int dev_dax_probe(struct dev_dax *dev_dax)
> >  	if (IS_ERR(addr))
> >  		return PTR_ERR(addr);
> >  
> > +	/* Detect whether the data is at a non-zero offset into the memory */
> > +	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
> > +		u64 phys = (u64)dev_dax->ranges[0].range.start;
> 
> Why the cast? Ranges use u64s internally.

I've removed all the unnecessary casts in this function - thanks
for the catch

> 
> > +		u64 pgmap_phys = (u64)dev_dax->pgmap[0].range.start;
> > +		u64 vmemmap_shift = (u64)dev_dax->pgmap[0].vmemmap_shift;
> > +
> > +		if (!WARN_ON(pgmap_phys > phys))
> > +			data_offset = phys - pgmap_phys;
> > +
> > +		pr_notice("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx shift=%llx\n",
> > +		       __func__, phys, pgmap_phys, data_offset, vmemmap_shift);
> 
> pr_debug() + dynamic debug will then deal with __func__ for you.

Thanks - yeah that would be better than just taking it out...

> 
> > +	}
> > +	dev_dax->virt_addr = (u64)addr + data_offset;
> > +
> >  	inode = dax_inode(dax_dev);
> >  	cdev = inode->i_cdev;
> >  	cdev_init(cdev, &dax_fops);
> 

Thanks,
John



* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-26 13:27   ` John Groves
@ 2024-02-26 15:53     ` Luis Chamberlain
  2024-02-26 21:16       ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Luis Chamberlain @ 2024-02-26 15:53 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> Run status group 0 (all jobs):
>   WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec

> This is run on an xfs file system on a SATA ssd.

For a closer apples-to-apples comparison, wouldn't it make more sense
to try this with XFS on pmem (with fio -direct=1)?

  Luis


* Re: [RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
  2024-02-26 12:32   ` Jonathan Cameron
@ 2024-02-26 16:09     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 16:09 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:32PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:49 -0600
> John Groves <John@Groves.net> wrote:
> 
> > Notes about this commit:
> > 
> > * These methods are based somewhat loosely on pmem_dax_ops from
> >   drivers/nvdimm/pmem.c
> > 
> > * dev_dax_direct_access() returns the hpa, pfn and kva. The kva was
> >   newly stored as dev_dax->virt_addr by dev_dax_probe().
> > 
> > * The hpa/pfn are used for mmap (dax_iomap_fault()), and the kva is used
> >   for read/write (dax_iomap_rw())
> > 
> > * dev_dax_recovery_write() and dev_dax_zero_page_range() have not been
> >   tested yet. I'm looking for suggestions as to how to test those.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  drivers/dax/bus.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 107 insertions(+)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 664e8c1b9930..06fcda810674 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -10,6 +10,12 @@
> >  #include "dax-private.h"
> >  #include "bus.h"
> >  
> > +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> > +#include <linux/backing-dev.h>
> > +#include <linux/pfn_t.h>
> > +#include <linux/range.h>
> > +#endif
> > +
> 
> Is it worth avoiding includes based on config? Probably not.

Just trying to demonstrate that I can be tedious :D
I'll drop the #if unless somebody disagrees.

> 
> >  static DEFINE_MUTEX(dax_bus_lock);
> >  
> >  #define DAX_NAME_LEN 30
> > @@ -1349,6 +1355,101 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_pgoff_to_phys);
> >  
> > +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> > +
> 
> > +
> > +static long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> > +			     long nr_pages, enum dax_access_mode mode, void **kaddr,
> > +			     pfn_t *pfn)
> > +{
> > +	struct dev_dax *dev_dax = dax_get_private(dax_dev);
> > +	size_t dax_size = dev_dax_size(dev_dax);
> > +	size_t size = nr_pages << PAGE_SHIFT;
> > +	size_t offset = pgoff << PAGE_SHIFT;
> > +	phys_addr_t phys;
> > +	u64 virt_addr = dev_dax->virt_addr + offset;
> > +	pfn_t local_pfn;
> > +	u64 flags = PFN_DEV|PFN_MAP;
> > +
> > +	WARN_ON(!dev_dax->virt_addr); /* virt_addr must be saved for direct_access */
> Fair enough, but from a local code point of view, does it make sense to check this
> if !kaddr, as we won't use it?

Hmm. This gets called with kaddr=NULL for mmap faults, and with non-NULL
kaddr for read/write (which need the virt_addr to do a memcpy variant).
If dev_dax->virt_addr is NULL, mmap will work but read/write will hork.

I lean toward keeping the warning. With these updates, it's broken if
read/write are broken.

> > +
> > +	phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
> > +
> > +	if (kaddr)
> > +		*kaddr = (void *)virt_addr;
> 
> Back to earlier comment on virt_addr as a void *. Definitely looking like
> that would be more accurate and simpler!  Also not much point in computing
> virt_addr unless kaddr is good.

Yes, done (the void *).

The computation is copied directly from __pmem_direct_access() in
drivers/nvdimm/pmem.c, which does not warn if virt_addr is null. Actually I
suppose this code should just trust that dev_dax_probe() sets virt_addr, and
not warn?

So I'm now contradicting myself above...
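
One way to read the review comments on this function is sketched below in userspace: each output is computed only when the caller asked for it, and the return value is the number of valid pages at that offset. The struct dev_dax_sketch type, the single contiguous range, the 4 KiB page size, and the explicit bounds check are all assumptions of this example; the patch itself obtains the physical address from dax_pgoff_to_phys().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical, single-range stand-in for struct dev_dax. */
struct dev_dax_sketch {
	void *virt_addr;	/* saved at probe time */
	uint64_t phys_base;	/* physical address of byte 0 */
	size_t size;		/* device size in bytes */
};

/* Returns the number of pages available at pgoff, or -1 if out of range. */
long direct_access_sketch(const struct dev_dax_sketch *d, uint64_t pgoff,
			  long nr_pages, void **kaddr, uint64_t *pfn)
{
	size_t offset = (size_t)pgoff << PAGE_SHIFT;
	size_t size = (size_t)nr_pages << PAGE_SHIFT;
	size_t avail;

	if (offset >= d->size)
		return -1;	/* bounds check added for the sketch */

	/* mmap faults pass kaddr == NULL; read/write pass a real pointer. */
	if (kaddr)
		*kaddr = (char *)d->virt_addr + offset;

	/* Likewise, only compute the pfn when the caller wants it. */
	if (pfn)
		*pfn = (d->phys_base + offset) >> PAGE_SHIFT;

	/* Valid size at the specified offset, in pages. */
	avail = d->size - offset;
	return (long)((size < avail ? size : avail) >> PAGE_SHIFT);
}

/* Small usage check: a request longer than the device is clamped. */
void direct_access_sketch_demo(void)
{
	static char mem[3 << PAGE_SHIFT];
	struct dev_dax_sketch d = { mem, 0x100000, sizeof(mem) };
	void *kaddr = NULL;

	assert(direct_access_sketch(&d, 1, 5, &kaddr, NULL) == 2);
	assert(kaddr == mem + (1 << PAGE_SHIFT));
}
```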

> 
> > +
> > +	local_pfn = phys_to_pfn_t(phys, flags); /* are flags correct? */
> If you aren't going to do anything with it for !pfn, move it under the if (pfn).
> 
> > +	if (pfn)
> > +		*pfn = local_pfn;
> > +
> > +	/* This the valid size at the specified address */
> > +	return PHYS_PFN(min_t(size_t, size, dax_size - offset));
> > +}
> > +
> 
> > +
> > +static const struct dax_operations dev_dax_ops = {
> > +	.direct_access = dev_dax_direct_access,
> > +	.zero_page_range = dev_dax_zero_page_range,
> > +	.recovery_write = dev_dax_recovery_write,
> > +};
> > +
> > +#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
> > +
> >  struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
> >  {
> >  	struct dax_region *dax_region = data->dax_region;
> > @@ -1404,11 +1505,17 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
> >  		}
> >  	}
> >  
> 
> If we were to make this 
> 
> 	if (IS_ENABLED(CONFIG_DEV_DAX_IOMAP))
> 
> etc can we avoid the ifdef stuff above and let dead code removal deal with it?
> Might need a few stubs - I haven't tried.

Better, thanks. No stubs needed.
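
The shape of that change, sketched in userspace under stated assumptions: the IS_ENABLED() below is a simplified stand-in (the real macro also evaluates to 0 for undefined symbols), CONFIG_DEV_DAX_IOMAP is defined by hand, and pick_dax_ops_sketch() is a hypothetical name. The point is that both branches are always parsed and type-checked, and the compiler drops the dead one.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the kernel's IS_ENABLED() handles undefined symbols. */
#define CONFIG_DEV_DAX_IOMAP 1
#define IS_ENABLED(option) (option)

/* Hypothetical placeholder for the real ops table. */
struct dax_operations { int unused; };
static const struct dax_operations dev_dax_ops = { 0 };

/* Select the ops with a plain C conditional instead of #if/#else. */
const struct dax_operations *pick_dax_ops_sketch(void)
{
	if (IS_ENABLED(CONFIG_DEV_DAX_IOMAP))
		return &dev_dax_ops;

	/* No dax_operations: access is only via mmap of the char device. */
	return NULL;
}

/* Small usage check for the configuration chosen above. */
void pick_dax_ops_sketch_demo(void)
{
	assert(pick_dax_ops_sketch() == &dev_dax_ops);
}
```

Because the disabled branch still compiles, breakage in rarely built configurations tends to show up earlier than with preprocessor conditionals.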

> 
> > +#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
> > +	/* holder_ops currently populated separately in a slightly hacky way */
> > +	dax_dev = alloc_dax(dev_dax, &dev_dax_ops);
> > +#else
> >  	/*
> >  	 * No dax_operations since there is no access to this device outside of
> >  	 * mmap of the resulting character device.
> >  	 */
> >  	dax_dev = alloc_dax(dev_dax, NULL);
> > +#endif

Thanks!
John



* Re: [RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter
  2024-02-26 12:34   ` Jonathan Cameron
@ 2024-02-26 16:12     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 16:12 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:34PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:50 -0600
> John Groves <John@Groves.net> wrote:
> 
> > Add the CONFIG_DEV_DAX_IOMAP kernel config parameter to control building
> > of the iomap functionality to support fsdax on devdax.
> 
> I would squash with previous patch.
> 
> Only reason I ever see for separate Kconfig patches is when there is something
> complex in the dependencies and you want to talk about it in depth in the
> patch description. That's not true here so no need for separate patch.

Done

Thanks,
John



* Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-26 12:39   ` Jonathan Cameron
@ 2024-02-26 16:44     ` John Groves
  2024-02-26 16:56       ` Jonathan Cameron
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-26 16:44 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:39PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:51 -0600
> John Groves <John@Groves.net> wrote:
> 
> > Add uapi include file for famfs. The famfs user space uses ioctl on
> > individual files to pass in mapping information and file size. This
> > would be hard to do via sysfs or other means, since it's
> > file-specific.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  include/uapi/linux/famfs_ioctl.h | 56 ++++++++++++++++++++++++++++++++
> >  1 file changed, 56 insertions(+)
> >  create mode 100644 include/uapi/linux/famfs_ioctl.h
> > 
> > diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> > new file mode 100644
> > index 000000000000..6b3e6452d02f
> > --- /dev/null
> > +++ b/include/uapi/linux/famfs_ioctl.h
> > @@ -0,0 +1,56 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2024 Micron Technology, Inc.
> > + *
> > + * This file system, originally based on ramfs the dax support from xfs,
> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +#ifndef FAMFS_IOCTL_H
> > +#define FAMFS_IOCTL_H
> > +
> > +#include <linux/ioctl.h>
> > +#include <linux/uuid.h>
> > +
> > +#define FAMFS_MAX_EXTENTS 2
> Why 2?

You catch everything! 

This limit is in place to avoid supporting something we're not testing. It
will probably be raised later.

Currently user space doesn't support deleting files, which makes it easy
to ignore whether any clients have a stale view of metadata. If there is
no delete, there's actually no reason to have more than 1 extent.

> > +
> > +enum extent_type {
> > +	SIMPLE_DAX_EXTENT = 13,
> 
> Comment on this would be good to have

Done. Basically we anticipate there being other types of extents in the
future.

> 
> > +	INVALID_EXTENT_TYPE,
> > +};
> > +
> > +struct famfs_extent {
> > +	__u64              offset;
> > +	__u64              len;
> > +};
> > +
> > +enum famfs_file_type {
> > +	FAMFS_REG,
> > +	FAMFS_SUPERBLOCK,
> > +	FAMFS_LOG,
> > +};
> > +
> > +/**
> > + * struct famfs_ioc_map
> > + *
> > + * This is the metadata that indicates where the memory is for a famfs file
> > + */
> > +struct famfs_ioc_map {
> > +	enum extent_type          extent_type;
> > +	enum famfs_file_type      file_type;
> 
> These are going to be potentially varying in size depending on arch, compiler
> settings etc.  Been a while, but I thought best practice for uapi was always
> fixed size elements even though we lose the typing.

I might not be following you fully here. User space is running the same
arch as kernel, so an enum can't be a different size, right? It could be
a different size on different arches, but this is just between user/kernel.

I initially thought of XDR for on-media-format, which file systems need
to do with on-media structs (superblocks, logs, inodes, etc. etc.). But
this struct is not used in that way.

In fact, famfs' on-media/in-memory metadata (superblock, log, log entries)
is only ever read and written by user space - so it's the user space
code that needs XDR on-media-format handling.

So to clarify - do you think those enums should be u32 or the like?

Thanks!
John



* Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-26 16:44     ` John Groves
@ 2024-02-26 16:56       ` Jonathan Cameron
  2024-02-26 18:04         ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-26 16:56 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Mon, 26 Feb 2024 10:44:43 -0600
John Groves <John@groves.net> wrote:

> On 24/02/26 12:39PM, Jonathan Cameron wrote:
> > On Fri, 23 Feb 2024 11:41:51 -0600
> > John Groves <John@Groves.net> wrote:
> >   
> > > Add uapi include file for famfs. The famfs user space uses ioctl on
> > > individual files to pass in mapping information and file size. This
> > > would be hard to do via sysfs or other means, since it's
> > > file-specific.
> > > 
> > > Signed-off-by: John Groves <john@groves.net>
> > > ---
> > >  include/uapi/linux/famfs_ioctl.h | 56 ++++++++++++++++++++++++++++++++
> > >  1 file changed, 56 insertions(+)
> > >  create mode 100644 include/uapi/linux/famfs_ioctl.h
> > > 
> > > diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> > > new file mode 100644
> > > index 000000000000..6b3e6452d02f
> > > --- /dev/null
> > > +++ b/include/uapi/linux/famfs_ioctl.h
> > > @@ -0,0 +1,56 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > > +/*
> > > + * famfs - dax file system for shared fabric-attached memory
> > > + *
> > > + * Copyright 2023-2024 Micron Technology, Inc.
> > > + *
> > > + * This file system, originally based on ramfs the dax support from xfs,
> > > + * is intended to allow multiple host systems to mount a common file system
> > > + * view of dax files that map to shared memory.
> > > + */
> > > +#ifndef FAMFS_IOCTL_H
> > > +#define FAMFS_IOCTL_H
> > > +
> > > +#include <linux/ioctl.h>
> > > +#include <linux/uuid.h>
> > > +
> > > +#define FAMFS_MAX_EXTENTS 2  
> > Why 2?  
> 
> You catch everything! 
> 
> This limit is in place to avoid supporting something we're not testing. It
> will probably be raised later.
> 
> Currently user space doesn't support deleting files, which makes it easy
> to ignore whether any clients have a stale view of metadata. If there is
> no delete, there's actually no reason to have more than 1 extent.
Then have 1. + a Comment on why it is 1.
> 
> > > +
> > > +enum extent_type {
> > > +	SIMPLE_DAX_EXTENT = 13,  
> > 
> > Comment on this would be good to have  
> 
> Done. Basically we anticipate there being other types of extents in the
> future.

I was more curious about the 13!

> 
> >   
> > > +	INVALID_EXTENT_TYPE,
> > > +};
> > > +
> > > +struct famfs_extent {
> > > +	__u64              offset;
> > > +	__u64              len;
> > > +};
> > > +
> > > +enum famfs_file_type {
> > > +	FAMFS_REG,
> > > +	FAMFS_SUPERBLOCK,
> > > +	FAMFS_LOG,
> > > +};
> > > +
> > > +/**
> > > + * struct famfs_ioc_map
> > > + *
> > > + * This is the metadata that indicates where the memory is for a famfs file
> > > + */
> > > +struct famfs_ioc_map {
> > > +	enum extent_type          extent_type;
> > > +	enum famfs_file_type      file_type;  
> > 
> > These are going to be potentially varying in size depending on arch, compiler
> > settings etc.  Been a while, but I thought best practice for uapi was always
> > fixed size elements even though we lose the typing.  
> 
> I might not be following you fully here. User space is running the same
> arch as kernel, so an enum can't be a different size, right? It could be
> a different size on different arches, but this is just between user/kernel.

I can't remember why, but this has bitten me in the past.
Ah, should have known Daniel would have written something on it ;)
https://www.kernel.org/doc/html/next/process/botching-up-ioctls.html

It's the fun of needing compat ioctls for 32-bit userspace on 64-bit kernels.

The alignment one is key as well. That bit me more than once due to
32bit x86 aligning 64 bit integers at 32 bits.

We could just not support these cases, but it's easy to get right, so why
bother with the complexity of ruling them out?

> 
> I initially thought of XDR for on-media-format, which file systems need
> to do with on-media structs (superblocks, logs, inodes, etc. etc.). But
> this struct is not used in that way.
> 
> In fact, famfs' on-media/in-memory metadata (superblock, log, log entries)
> is only ever read and written by user space - so it's the user space
> code that needs XDR on-media-format handling.
> 
> So to clarify - do you think those enums should be u32 or the like?

Yes. As it's userspace, uint32_t maybe or __u32. I 'think'
both are acceptable in uapi headers these days.
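
A sketch of what fixed-size members buy, written for userspace with <stdint.h> types. The struct ioc_hdr_sketch layout is hypothetical (file_size in particular is an invented member, not taken from the patch); it only illustrates the pattern of replacing enums with 32-bit fields and ordering members so that 64-bit fields land on 8-byte offsets without implicit padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical ioctl argument header. Two 32-bit fields come first, so the
 * 64-bit field sits at offset 8 on both i386 (4-byte alignment for 64-bit
 * integers) and x86_64 (8-byte alignment).
 */
struct ioc_hdr_sketch {
	uint32_t extent_type;	/* was: enum extent_type */
	uint32_t file_type;	/* was: enum famfs_file_type */
	uint64_t file_size;	/* hypothetical 64-bit member */
};

/* Compile-time checks that the layout does not depend on the ABI. */
static_assert(sizeof(struct ioc_hdr_sketch) == 16,
	      "no hidden padding expected");
static_assert(offsetof(struct ioc_hdr_sketch, file_size) == 8,
	      "64-bit member must be naturally aligned");

/* Run-time view of the same facts, for use from a test. */
size_t ioc_hdr_sketch_size(void)
{
	return sizeof(struct ioc_hdr_sketch);
}
```

Had the struct held a single 32-bit field followed by a 64-bit one, a 32-bit x86 build would place the second member at offset 4 and a 64-bit build at offset 8, which is the kind of mismatch a compat ioctl handler then has to paper over.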

> 
> Thanks!
> John
> 



* Re: [RFC PATCH 08/20] famfs: Add famfs_internal.h
  2024-02-26 12:48   ` Jonathan Cameron
@ 2024-02-26 17:35     ` John Groves
  2024-02-27 10:28       ` Jonathan Cameron
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-26 17:35 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:48PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:52 -0600
> John Groves <John@Groves.net> wrote:
> 
> > Add the famfs_internal.h include file. This contains internal data
> > structures such as the per-file metadata structure (famfs_file_meta)
> > and extent formats.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> Hi John,
> 
> Build this up as you add the definitions in later patches.
> 
> Separate header patches just make people jump back and forth when trying
> to review.  Obviously more work to build this stuff up cleanly but
> it's worth doing to save review time.
> 

Ohhhhkaaaaay. I think you're right, just not looking forward to
all that rebasing.

> Generally I'd plumb up Kconfig and Makefile at the beginning as it means
> that the set is bisectable and we can check the logic of building each stage.
> That is harder to do but tends to bring benefits in forcing clear step
> wise approach on a patch set. Feel free to ignore this one though as it
> can slow things down.

I'm not sure that's practical. A file system needs a bunch of different
kinds of operations
- super_operations
- fs_context_operations
- inode_operations
- file_operations
- dax holder_operations, iomap_ops
- etc.

Will think about the dependency graph of these entities, but I'm not sure
it's tractable...

> 
> A few trivial comments inline.
> 
> > ---
> >  fs/famfs/famfs_internal.h | 53 +++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 53 insertions(+)
> >  create mode 100644 fs/famfs/famfs_internal.h
> > 
> > diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
> > new file mode 100644
> > index 000000000000..af3990d43305
> > --- /dev/null
> > +++ b/fs/famfs/famfs_internal.h
> > @@ -0,0 +1,53 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2024 Micron Technology, Inc.
> > + *
> > + * This file system, originally based on ramfs the dax support from xfs,
> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +#ifndef FAMFS_INTERNAL_H
> > +#define FAMFS_INTERNAL_H
> > +
> > +#include <linux/atomic.h>
> 
> Why?

Because fault counters are the one phased change to this file, and this
should have been with that. That may go away, but either way I'll do the
phased thing with this file.

> 
> > +#include <linux/famfs_ioctl.h>
> > +
> > +#define FAMFS_MAGIC 0x87b282ff
> > +
> > +#define FAMFS_BLKDEV_MODE (FMODE_READ|FMODE_WRITE)
> 
> Spaces around | 

Done

> 
> > +
> > +extern const struct file_operations      famfs_file_operations;
> 
> I wouldn't force alignment. It rots too often as new stuff gets added
> and doesn't really help readability much.

OK

> 
> > +
> > +/*
> > + * Each famfs dax file has this hanging from its inode->i_private.
> > + */
> > +struct famfs_file_meta {
> > +	int                   error;
> > +	enum famfs_file_type  file_type;
> > +	size_t                file_size;
> > +	enum extent_type      tfs_extent_type;
> > +	size_t                tfs_extent_ct;
> > +	struct famfs_extent   tfs_extents[];  /* flexible array */
> 
> Comment kind of obvious ;) I'd drop it.  Though we have
> magic markings for __counted_by which would be good to use from the start.

Done
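
For reference, a userspace sketch of a flexible array annotated with a count member. The fallback definition of __counted_by() is modeled on the kernel's compiler-attribute wrapper and expands to nothing on compilers without the attribute; struct file_meta_sketch, struct extent_sketch, and file_meta_alloc_sketch() are hypothetical names, and calloc() stands in for whatever allocator the real code uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Use the attribute where the compiler has it; otherwise expand to nothing. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif
#ifndef __counted_by
#if __has_attribute(__counted_by__)
#define __counted_by(member) __attribute__((__counted_by__(member)))
#else
#define __counted_by(member)
#endif
#endif

struct extent_sketch {
	uint64_t offset;
	uint64_t len;
};

/* The count member must be declared before the array it describes. */
struct file_meta_sketch {
	size_t extent_ct;
	struct extent_sketch extents[] __counted_by(extent_ct);
};

/* Allocate a zeroed header plus n extents; returns NULL on failure. */
struct file_meta_sketch *file_meta_alloc_sketch(size_t n)
{
	struct file_meta_sketch *m;

	/* Reject sizes that would overflow the allocation arithmetic. */
	if (n > (SIZE_MAX - sizeof(*m)) / sizeof(struct extent_sketch))
		return NULL;

	m = calloc(1, sizeof(*m) + n * sizeof(struct extent_sketch));
	if (m)
		m->extent_ct = n;	/* set the count before touching extents[] */
	return m;
}

/* Small usage check: allocate two extents and write the last one. */
void file_meta_alloc_sketch_demo(void)
{
	struct file_meta_sketch *m = file_meta_alloc_sketch(2);

	assert(m && m->extent_ct == 2);
	m->extents[1].len = 7;
	assert(m->extents[1].len == 7);
	free(m);
}
```

With the annotation in place, bounds-checking instrumentation can compare array indices against extent_ct instead of treating the array as unbounded.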

> 
> 
> 
> > +};
> > +
> > +struct famfs_mount_opts {
> > +	umode_t mode;
> > +};
> > +
> > +extern const struct iomap_ops             famfs_iomap_ops;
> > +extern const struct vm_operations_struct  famfs_file_vm_ops;
> > +
> > +#define ROOTDEV_STRLEN 80
> 
> Why?  You aren't creating an array of this size here so I can't
> immediately see what the define is for.

Oversight. It was a char array but I switched to strdup() and 
failed to delete this. Gone now, thanks.

Thanks,
John



* Re: [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h
  2024-02-26 16:56       ` Jonathan Cameron
@ 2024-02-26 18:04         ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 18:04 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 04:56PM, Jonathan Cameron wrote:
> On Mon, 26 Feb 2024 10:44:43 -0600
> John Groves <John@groves.net> wrote:
> 
> > On 24/02/26 12:39PM, Jonathan Cameron wrote:
> > > On Fri, 23 Feb 2024 11:41:51 -0600
> > > John Groves <John@Groves.net> wrote:
> > >   
> > > > Add uapi include file for famfs. The famfs user space uses ioctl on
> > > > individual files to pass in mapping information and file size. This
> > > > would be hard to do via sysfs or other means, since it's
> > > > file-specific.
> > > > 
> > > > Signed-off-by: John Groves <john@groves.net>
> > > > ---
> > > >  include/uapi/linux/famfs_ioctl.h | 56 ++++++++++++++++++++++++++++++++
> > > >  1 file changed, 56 insertions(+)
> > > >  create mode 100644 include/uapi/linux/famfs_ioctl.h
> > > > 
> > > > diff --git a/include/uapi/linux/famfs_ioctl.h b/include/uapi/linux/famfs_ioctl.h
> > > > new file mode 100644
> > > > index 000000000000..6b3e6452d02f
> > > > --- /dev/null
> > > > +++ b/include/uapi/linux/famfs_ioctl.h
> > > > @@ -0,0 +1,56 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > > > +/*
> > > > + * famfs - dax file system for shared fabric-attached memory
> > > > + *
> > > > + * Copyright 2023-2024 Micron Technology, Inc.
> > > > + *
> > > > + * This file system, originally based on ramfs the dax support from xfs,
> > > > + * is intended to allow multiple host systems to mount a common file system
> > > > + * view of dax files that map to shared memory.
> > > > + */
> > > > +#ifndef FAMFS_IOCTL_H
> > > > +#define FAMFS_IOCTL_H
> > > > +
> > > > +#include <linux/ioctl.h>
> > > > +#include <linux/uuid.h>
> > > > +
> > > > +#define FAMFS_MAX_EXTENTS 2  
> > > Why 2?  
> > 
> > You catch everything! 
> > 
> > This limit is in place to avoid supporting something we're not testing. It
> > will probably be raised later.
> > 
> > Currently user space doesn't support deleting files, which makes it easy
> > to ignore whether any clients have a stale view of metadata. If there is
> > no delete, there's actually no reason to have more than 1 extent.
> Then have 1. + a Comment on why it is 1.

Actually we test the 2 case. That seemed important to testing ioctl and
famfs_meta_to_dax_offset(). It just doesn't yet happen in the wild. Will
clarify with a comment.

> > 
> > > > +
> > > > +enum extent_type {
> > > > +	SIMPLE_DAX_EXTENT = 13,  
> > > 
> > > Comment on this would be good to have  
> > 
> > Done. Basically we anticipate there being other types of extents in the
> > future.
> 
> I was more curious about the 13!

I think I was just being feisty that day. Will drop that...

> 
> > 
> > >   
> > > > +	INVALID_EXTENT_TYPE,
> > > > +};
> > > > +
> > > > +struct famfs_extent {
> > > > +	__u64              offset;
> > > > +	__u64              len;
> > > > +};
> > > > +
> > > > +enum famfs_file_type {
> > > > +	FAMFS_REG,
> > > > +	FAMFS_SUPERBLOCK,
> > > > +	FAMFS_LOG,
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct famfs_ioc_map
> > > > + *
> > > > + * This is the metadata that indicates where the memory is for a famfs file
> > > > + */
> > > > +struct famfs_ioc_map {
> > > > +	enum extent_type          extent_type;
> > > > +	enum famfs_file_type      file_type;  
> > > 
> > > These are going to be potentially varying in size depending on arch, compiler
> > > > settings etc.  Been a while, but I thought best practice for uapi was always
> > > fixed size elements even though we lose the typing.  
> > 
> > I might not be following you fully here. User space is running the same
> > arch as kernel, so an enum can't be a different size, right? It could be
> > a different size on different arches, but this is just between user/kernel.
> 
> I can't remember why, but this has bitten me in the past.
> Ah, should have known Daniel would have written something on it ;)
> https://www.kernel.org/doc/html/next/process/botching-up-ioctls.html
> 
> It's the fun of need for compat ioctls with 32bit userspace on 64bit kernels.
> 
> The alignment one is key as well. That bit me more than once due to
> 32bit x86 aligning 64 bit integers at 32 bits.
> 
> We could just not support these cases but it's easy to get right so why
> bother with complexity of ruling them out.

Makes sense. Will do.

> 
> > 
> > I initially thought of XDR for on-media-format, which file systems need
> > to do with on-media structs (superblocks, logs, inodes, etc. etc.). But
> > this struct is not used in that way.
> > 
> > In fact, famfs' on-media/in-memory metadata (superblock, log, log entries)
> > is only ever read and written by user space - so it's the user space
> > code that needs XDR on-media-format handling.
> > 
> > So to clarify - do you think those enums should be u32 or the like?
> 
> Yes. As it's userspace, uint32_t maybe or __u32. I 'think'
> both are acceptable in uapi headers these days.

Roger that.
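
For concreteness, a rough user-space sketch of a fixed-width layout follows.
It is an illustration only: the struct names, the file_size and
ext_list_count fields, and the explicit pad member are assumptions and are
not taken from the patch. A real uapi header would use __u32/__u64 from
<linux/types.h> rather than the <stdint.h> types used here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width extent, mirroring struct famfs_extent. */
struct famfs_extent_sketch {
	uint64_t offset;
	uint64_t len;
};

/* Hypothetical fixed-width ioctl argument (sketch, not the real ABI). */
struct famfs_ioc_map_sketch {
	uint32_t extent_type;     /* replaces enum extent_type */
	uint32_t file_type;       /* replaces enum famfs_file_type */
	uint64_t file_size;       /* assumed field; lands on an 8-byte boundary */
	uint32_t ext_list_count;  /* assumed field */
	uint32_t pad;             /* explicit padding, so no ABI inserts its own */
	struct famfs_extent_sketch ext_list[2];
};

/* The layout is the same whether 64-bit members align to 4 or 8 bytes. */
_Static_assert(sizeof(struct famfs_ioc_map_sketch) == 56,
	       "ioctl argument size must not depend on the ABI");
_Static_assert(offsetof(struct famfs_ioc_map_sketch, ext_list) == 24,
	       "extent list offset must not depend on the ABI");
```

With every member fixed-width and every 64-bit member naturally aligned,
32-bit user space on a 64-bit kernel sees the same layout, so no compat
ioctl translation should be needed for a struct shaped like this.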

Thanks,
John

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-26 15:53     ` Luis Chamberlain
@ 2024-02-26 21:16       ` John Groves
  2024-02-27  0:58         ` Luis Chamberlain
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-26 21:16 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 07:53AM, Luis Chamberlain wrote:
> On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > Run status group 0 (all jobs):
> >   WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> 
> > This is run on an xfs file system on a SATA ssd.
> 
> For a closer apples-to-apples comparison, wouldn't it make more sense
> to try this with XFS on pmem (with fio -direct=1)?
> 
>   Luis

Makes sense. Here is the same command line I used with xfs before, but
now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
because xfs requires that).

fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
...
fio-3.33
Starting 48 processes
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
Jobs: 36 (f=360): [W(3),_(1),W(3),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(7),_(1),W(3),_(1),W(2),_(2),W(4),_(1),W(5),_(1)][77.8%][w=15.1GiB/s][w=7750 IOPS][eta 00m:02s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=8798: Mon Feb 26 15:10:30 2024
  write: IOPS=7582, BW=14.8GiB/s (15.9GB/s)(114GiB/7723msec); 0 zone resets
    slat (usec): min=23, max=7352, avg=131.80, stdev=151.63
    clat (usec): min=385, max=22638, avg=5789.74, stdev=3124.93
     lat (usec): min=432, max=22724, avg=5921.54, stdev=3133.18
    clat percentiles (usec):
     |  1.00th=[  799],  5.00th=[ 1467], 10.00th=[ 2073], 20.00th=[ 3097],
     | 30.00th=[ 3949], 40.00th=[ 4752], 50.00th=[ 5473], 60.00th=[ 6194],
     | 70.00th=[ 7046], 80.00th=[ 8029], 90.00th=[ 9634], 95.00th=[11338],
     | 99.00th=[16319], 99.50th=[17957], 99.90th=[20055], 99.95th=[20579],
     | 99.99th=[21365]
   bw (  MiB/s): min=10852, max=26980, per=100.00%, avg=15940.43, stdev=88.61, samples=665
   iops        : min= 5419, max=13477, avg=7963.08, stdev=44.28, samples=665
  lat (usec)   : 500=0.15%, 750=0.47%, 1000=1.34%
  lat (msec)   : 2=7.40%, 4=21.46%, 10=60.57%, 20=8.50%, 50=0.11%
  cpu          : usr=2.33%, sys=0.32%, ctx=58806, majf=0, minf=36301
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=14.8GiB/s (15.9GB/s), 14.8GiB/s-14.8GiB/s (15.9GB/s-15.9GB/s), io=114GiB (123GB), run=7723-7723msec

Disk stats (read/write):
  pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


I only have some educated guesses as to why famfs is faster. Since files 
are preallocated, they're always contiguous. And famfs is vastly simpler
because it isn't aimed at general-purpose use cases (and indeed can't
handle them).

Regards,
John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 09/20] famfs: Add super_operations
  2024-02-26 12:51   ` Jonathan Cameron
@ 2024-02-26 21:47     ` John Groves
  2024-02-27 10:34       ` Jonathan Cameron
  2024-02-27 17:48     ` John Groves
  1 sibling, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-26 21:47 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:51PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:53 -0600
> John Groves <John@Groves.net> wrote:
> 
> > Introduce the famfs superblock operations
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/famfs/famfs_inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 72 insertions(+)
> >  create mode 100644 fs/famfs/famfs_inode.c
> > 
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > new file mode 100644
> > index 000000000000..3329aff000d1
> > --- /dev/null
> > +++ b/fs/famfs/famfs_inode.c
> > @@ -0,0 +1,72 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2024 Micron Technology, inc
> > + *
> > + * This file system, originally based on ramfs the dax support from xfs,
> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +
> > +#include <linux/fs.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/highmem.h>
> > +#include <linux/time.h>
> > +#include <linux/init.h>
> > +#include <linux/string.h>
> > +#include <linux/backing-dev.h>
> > +#include <linux/sched.h>
> > +#include <linux/parser.h>
> > +#include <linux/magic.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/fs_context.h>
> > +#include <linux/fs_parser.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/dax.h>
> > +#include <linux/hugetlb.h>
> > +#include <linux/uio.h>
> > +#include <linux/iomap.h>
> > +#include <linux/path.h>
> > +#include <linux/namei.h>
> > +#include <linux/pfn_t.h>
> > +#include <linux/blkdev.h>
> 
> That's a lot of header for such a small patch.. I'm going to guess
> they aren't all used - bring them in as you need them - I hope
> you never need some of these!

I didn't phase in headers in this series. Based on these recommendations,
the next version of this series is gonna have to be 100% constructed from
scratch, but okay. My head hurts just thinking about it. I need a nap...

I've been rebasing for 3 weeks to get this series out, and it occurs to
me that maybe there are tools I'm not aware of that make it easier? I'm
just typing "rebase -i..." 200 times a day. Is there a less soul-crushing way?

> 
> 
> > +
> > +#include "famfs_internal.h"
> > +
> > +#define FAMFS_DEFAULT_MODE	0755
> > +
> > +static const struct super_operations famfs_ops;
> > +static const struct inode_operations famfs_file_inode_operations;
> > +static const struct inode_operations famfs_dir_inode_operations;
> 
> Why are these all up here?

These forward declarations are needed by a later patch in the series.
They were in famfs_internal.h, but they are only used in this file, so
I moved them here.

For all answers such as this, I will hereafter reply "rebase fu", with
further clarification only if necessary.

> 
> > +
> > +/**********************************************************************************
> > + * famfs super_operations
> > + *
> > + * TODO: implement a famfs_statfs() that shows size, free and available space, etc.
> > + */
> > +
> > +/**
> > + * famfs_show_options() - Display the mount options in /proc/mounts.
> Run kernel doc script + fix all warnings.

Will do; I actually think I have already fixed those...

> 
> > + */
> > +static int famfs_show_options(
> > +	struct seq_file *m,
> > +	struct dentry   *root)
> Not that familiar with fs code, but this is unusual kernel style. I'd go with
> something more common
> 
> static int famfs_show_options(struct seq_file *m, struct dentry *root)

Done. To all functions...

> 
> > +{
> > +	struct famfs_fs_info *fsi = root->d_sb->s_fs_info;
> > +
> > +	if (fsi->mount_opts.mode != FAMFS_DEFAULT_MODE)
> > +		seq_printf(m, ",mode=%o", fsi->mount_opts.mode);
> > +
> > +	return 0;
> > +}
> > +
> > +static const struct super_operations famfs_ops = {
> > +	.statfs		= simple_statfs,
> > +	.drop_inode	= generic_delete_inode,
> > +	.show_options	= famfs_show_options,
> > +};
> > +
> > +
> One blank line probably fine.

Done

> 
> 
> Add the rest of the stuff a module normally has, author etc in this
> patch.

Because of "rebase fu", I'm not sure the order will remain the same. Will
try not to make anybody tell me this again though...

John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations
  2024-02-26 12:56   ` Jonathan Cameron
@ 2024-02-26 22:22     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 22:22 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:56PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:54 -0600
> John Groves <John@Groves.net> wrote:
> 
> > Famfs works on both /dev/pmem and /dev/dax devices. This commit introduces
> > the function that opens a block (pmem) device and the struct
> > dax_holder_operations that are needed for that ABI.
> > 
> > In this commit, support for opening character /dev/dax is stubbed. A
> > later commit introduces this capability.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> 
> Formatting comments mostly same as previous patches, so I'll stop repeating them.

I tried to bulk apply those recommendations.

> 
> > ---
> >  fs/famfs/famfs_inode.c | 83 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 83 insertions(+)
> > 
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > index 3329aff000d1..82c861998093 100644
> > --- a/fs/famfs/famfs_inode.c
> > +++ b/fs/famfs/famfs_inode.c
> > @@ -68,5 +68,88 @@ static const struct super_operations famfs_ops = {
> >  	.show_options	= famfs_show_options,
> >  };
> >  
> > +/***************************************************************************************
> > + * dax_holder_operations for block dax
> > + */
> > +
> > +static int
> > +famfs_blk_dax_notify_failure(
> > +	struct dax_device	*dax_devp,
> > +	u64			offset,
> > +	u64			len,
> > +	int			mf_flags)
> > +{
> > +
> > +	pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n",
> > +	       __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags);
> > +	return -EOPNOTSUPP;
> > +}
> > +
> > +const struct dax_holder_operations famfs_blk_dax_holder_ops = {
> > +	.notify_failure		= famfs_blk_dax_notify_failure,
> > +};
> > +
> > +static int
> > +famfs_open_char_device(
> > +	struct super_block *sb,
> > +	struct fs_context  *fc)
> > +{
> > +	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> > +	       __func__, fc->source);
> > +	return -ENODEV;
> > +}
> > +
> > +/**
> > + * famfs_open_device()
> > + *
> > + * Open the memory device. If it looks like /dev/dax, call famfs_open_char_device().
> > + * Otherwise try to open it as a block/pmem device.
> > + */
> > +static int
> > +famfs_open_device(
> > +	struct super_block *sb,
> > +	struct fs_context  *fc)
> > +{
> > +	struct famfs_fs_info *fsi = sb->s_fs_info;
> > +	struct dax_device    *dax_devp;
> > +	u64 start_off = 0;
> > +	struct bdev_handle   *handlep;
> Definitely don't force alignment in local parameter definitions.
> Always goes wrong and makes for unreadable mess in patches!

Okay, undone. Everywhere.

> 
> > +
> > +	if (fsi->dax_devp) {
> > +		pr_err("%s: already mounted\n", __func__);
> Fine to fail but worth an error message? Not sure on convention on this but seems noisy
> and maybe in userspace control which isn't good.

Changing to pr_debug. It would still be good to have access to the message that way.

> > +		return -EALREADY;
> > +	}
> > +
> > +	if (strstr(fc->source, "/dev/dax")) /* There is probably a better way to check this */
> > +		return famfs_open_char_device(sb, fc);
> > +
> > +	if (!strstr(fc->source, "/dev/pmem")) { /* There is probably a better way to check this */
> > +		pr_err("%s: primary backing dev (%s) is not pmem\n",
> > +		       __func__, fc->source);
> > +		return -EINVAL;
> > +	}
> > +
> > +	handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);
> > +	if (IS_ERR(handlep->bdev)) {
> > +		pr_err("%s: failed blkdev_get_by_path(%s)\n", __func__, fc->source);
> > +		return PTR_ERR(handlep->bdev);
> > +	}
> > +
> > +	dax_devp = fs_dax_get_by_bdev(handlep->bdev, &start_off,
> > +				      fsi  /* holder */,
> > +				      &famfs_blk_dax_holder_ops);
> > +	if (IS_ERR(dax_devp)) {
> > +		pr_err("%s: unable to get daxdev from handlep->bdev\n", __func__);
> > +		bdev_release(handlep);
> > +		return -ENODEV;
> > +	}
> > +	fsi->bdev_handle = handlep;
> > +	fsi->dax_devp    = dax_devp;
> > +
> > +	pr_notice("%s: root device is block dax (%s)\n", __func__, fc->source);
> 
> pr_debug()  Kernel log is too noisy anyway! + I'd assume we can tell this succeeded
> in lots of other ways.

Done

> 
> 
> > +	return 0;
> > +}
> > +
> > +
> >  
> >  MODULE_LICENSE("GPL");

Thanks,
John

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 11/20] famfs: Add fs_context_operations
  2024-02-26 13:20   ` Jonathan Cameron
@ 2024-02-26 22:43     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 22:43 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 01:20PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:55 -0600
> John Groves <John@Groves.net> wrote:
> 
> > This commit introduces the famfs fs_context_operations and
> > famfs_get_inode() which is used by the context operations.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> Trivial comments inline.
> 
> > ---
> >  fs/famfs/famfs_inode.c | 178 +++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 178 insertions(+)
> > 
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > index 82c861998093..f98f82962d7b 100644
> > --- a/fs/famfs/famfs_inode.c
> > +++ b/fs/famfs/famfs_inode.c
> > @@ -41,6 +41,50 @@ static const struct super_operations famfs_ops;
> >  static const struct inode_operations famfs_file_inode_operations;
> >  static const struct inode_operations famfs_dir_inode_operations;
> >  
> > +static struct inode *famfs_get_inode(
> > +	struct super_block *sb,
> > +	const struct inode *dir,
> > +	umode_t             mode,
> > +	dev_t               dev)
> > +{
> > +	struct inode *inode = new_inode(sb);
> > +
> > +	if (inode) {
> reverse logic would be simpler and reduce indent.
> 
> 	if (!inode)
> 		return NULL;
> 

Good one - I can be derpy this way. Although I'd bet I just copied that
from ramfs...

> 
> > +		struct timespec64       tv;
> > +
> > +		inode->i_ino = get_next_ino();
> > +		inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
> > +		inode->i_mapping->a_ops = &ram_aops;
> > +		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > +		mapping_set_unevictable(inode->i_mapping);
> > +		tv = inode_set_ctime_current(inode);
> > +		inode_set_mtime_to_ts(inode, tv);
> > +		inode_set_atime_to_ts(inode, tv);
> > +
> > +		switch (mode & S_IFMT) {
> > +		default:
> > +			init_special_inode(inode, mode, dev);
> > +			break;
> > +		case S_IFREG:
> > +			inode->i_op = &famfs_file_inode_operations;
> > +			inode->i_fop = &famfs_file_operations;
> > +			break;
> > +		case S_IFDIR:
> > +			inode->i_op = &famfs_dir_inode_operations;
> > +			inode->i_fop = &simple_dir_operations;
> > +
> > +			/* Directory inodes start off with i_nlink == 2 (for "." entry) */
> > +			inc_nlink(inode);
> > +			break;
> > +		case S_IFLNK:
> > +			inode->i_op = &page_symlink_inode_operations;
> > +			inode_nohighmem(inode);
> > +			break;
> > +		}
> > +	}
> > +	return inode;
> > +}
> > +
> >  /**********************************************************************************
> >   * famfs super_operations
> >   *
> > @@ -150,6 +194,140 @@ famfs_open_device(
> >  	return 0;
> >  }
> >  
> > +/*****************************************************************************************
> > + * fs_context_operations
> > + */
> > +static int
> > +famfs_fill_super(
> > +	struct super_block *sb,
> > +	struct fs_context  *fc)
> > +{
> > +	struct famfs_fs_info *fsi = sb->s_fs_info;
> > +	struct inode *inode;
> > +	int rc = 0;
> Always initialized so no need to do it here.

Fixed in more than one place.

> 
> > +
> > +	sb->s_maxbytes		= MAX_LFS_FILESIZE;
> > +	sb->s_blocksize		= PAGE_SIZE;
> > +	sb->s_blocksize_bits	= PAGE_SHIFT;
> > +	sb->s_magic		= FAMFS_MAGIC;
> > +	sb->s_op		= &famfs_ops;
> > +	sb->s_time_gran		= 1;
> > +
> > +	rc = famfs_open_device(sb, fc);
> > +	if (rc)
> > +		goto out;
> 		return rc; //unless you need to do more in out in later patch..

Done

> 
> > +
> > +	inode = famfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);
> > +	sb->s_root = d_make_root(inode);
> > +	if (!sb->s_root)
> > +		rc = -ENOMEM;
> 		return -ENOMEM;

Done

> 
> 	return 0;

Done

> 
> > +
> > +out:
> > +	return rc;
> > +}
> > +
> > +enum famfs_param {
> > +	Opt_mode,
> > +	Opt_dax,
> Why capital O?

Direct copy from ramfs

> 
> > +};
> > +
> 
> ...
> 
> > +
> > +static DEFINE_MUTEX(famfs_context_mutex);
> > +static LIST_HEAD(famfs_context_list);
> > +
> > +static int famfs_get_tree(struct fs_context *fc)
> > +{
> > +	struct famfs_fs_info *fsi_entry;
> > +	struct famfs_fs_info *fsi = fc->s_fs_info;
> > +
> > +	fsi->rootdev = kstrdup(fc->source, GFP_KERNEL);
> > +	if (!fsi->rootdev)
> > +		return -ENOMEM;
> > +
> > +	/* Fail if famfs is already mounted from the same device */
> > +	mutex_lock(&famfs_context_mutex);
> 
> New toys might be good to use from start to avoid need for explicit
> unlocks in error paths.
> 
> 	scoped_guard(mutex, &famfs_context_mutex) {
> 		list_for_each_entry(fsi_entry, &famfs_context_list, fsi_list) {
> 			if (strcmp(fsi_entry->rootdev, cs_source) == 0) {
> 			//could invert with a continue to reduce indent
> 			// or factor this out as a little helper.
> 			// famfs_check_not_mounted()
> 				pr_err();
> 				return -EALREADY;
> 			}
> 		}	
> 		list_add(&fsi->fs_list, &famfs_context_list);
> 	}
> 
> 	return get_tree_nodev(...

Hey, I like this one. Thanks!
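
For anyone following along without a kernel tree, the idea behind
scoped_guard() can be approximated in user space with the GCC/Clang cleanup
attribute. This is only an analogy, not the kernel implementation: the
fixed-size table, the names register_rootdev() and unlock_guard(), and the
use of pthreads are all assumptions made for the sketch.

```c
#include <errno.h>
#include <pthread.h>
#include <string.h>

#define MAX_MOUNTS 8

static pthread_mutex_t ctx_mutex = PTHREAD_MUTEX_INITIALIZER;
static const char *mounted[MAX_MOUNTS];	/* stand-in for famfs_context_list */
static int nr_mounted;

/* Cleanup handler: called automatically when the guard leaves scope. */
static void unlock_guard(pthread_mutex_t **m)
{
	pthread_mutex_unlock(*m);
}

/* Record a root device, failing if it is already registered. */
static int register_rootdev(const char *source)
{
	pthread_mutex_t *guard __attribute__((cleanup(unlock_guard))) = &ctx_mutex;
	int i;

	pthread_mutex_lock(guard);

	for (i = 0; i < nr_mounted; i++)
		if (strcmp(mounted[i], source) == 0)
			return -EALREADY;	/* mutex released by the cleanup handler */

	if (nr_mounted == MAX_MOUNTS)
		return -ENOSPC;

	mounted[nr_mounted++] = source;
	return 0;
}
```

The point is the same as in the suggestion above: every return path drops
the lock without an explicit unlock, so an early error return cannot leak
the mutex.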

John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 12/20] famfs: Add inode_operations and file_system_type
  2024-02-26 13:25   ` Jonathan Cameron
@ 2024-02-26 22:53     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 22:53 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 01:25PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:56 -0600
> John Groves <John@Groves.net> wrote:
> 
> > This commit introduces the famfs inode_operations. There is nothing really
> > unique to famfs here in the inode_operations..
> > 
> > This commit also introduces the famfs_file_system_type struct and the
> > famfs_kill_sb() function.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> 
> Trivial comments only.
> 
> > +
> > +/*
> > + * File creation. Allocate an inode, and we're done..
> > + */
> > +/* SMP-safe */
> > +static int
> > +famfs_mknod(
> > +	struct mnt_idmap *idmap,
> > +	struct inode     *dir,
> > +	struct dentry    *dentry,
> > +	umode_t           mode,
> > +	dev_t             dev)
> > +{
> > +	struct inode *inode = famfs_get_inode(dir->i_sb, dir, mode, dev);
> > +	int error           = -ENOSPC;
> > +
> > +	if (inode) {
> 
> As below. I would flip it for cleaner code/ shorter indent etc.

Nice - done.

> 
> > +		struct timespec64       tv;
> > +
> > +		d_instantiate(dentry, inode);
> > +		dget(dentry);	/* Extra count - pin the dentry in core */
> > +		error = 0;
> > +		tv = inode_set_ctime_current(inode);
> > +		inode_set_mtime_to_ts(inode, tv);
> > +		inode_set_atime_to_ts(inode, tv);
> > +	}
> > +	return error;
> > +}
> > +
> > +static int famfs_mkdir(
> > +	struct mnt_idmap *idmap,
> > +	struct inode     *dir,
> > +	struct dentry    *dentry,
> > +	umode_t           mode)
> > +{
> > +	int retval = famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFDIR, 0);
> > +
> > +	if (!retval)
> > +		inc_nlink(dir);
> 
> Copy local style, so fine if this is common pattern, otherwise I'd go for
> consistent error cases out of line as easier for us sleepy caffeine 
> deprived reviewers.
> 
> 
> 	if (retval)
> 		return retval;
> 
> 	inc_nlink(dir);
> 
> 	return 0;

Agree, done.

> > +
> > +	return retval;
> > +}
> > +
> > +static int famfs_create(
> > +	struct mnt_idmap *idmap,
> > +	struct inode     *dir,
> > +	struct dentry    *dentry,
> > +	umode_t           mode,
> > +	bool              excl)
> > +{
> > +	return famfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0);
> > +}
> > +
> > +static int famfs_symlink(
> > +	struct mnt_idmap *idmap,
> > +	struct inode     *dir,
> > +	struct dentry    *dentry,
> > +	const char       *symname)
> > +{
> > +	struct inode *inode;
> > +	int error = -ENOSPC;
> > +
> > +	inode = famfs_get_inode(dir->i_sb, dir, S_IFLNK | 0777, 0);
> 	if (!inode)
> 		return -ENOSPC;
> 
> > +	if (inode) {
> > +		int l = strlen(symname)+1;
> > +
> > +		error = page_symlink(inode, symname, l);
> 	if (error) {
> 		iput(inode);
> 		return error;
> 	}
> 	
> 	...
> 

Right, I like it. This was some tortured conditioning, which came from
somewhere. I deny responsibility :D

Thanks,
John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 13/20] famfs: Add iomap_ops
  2024-02-26 13:30   ` Jonathan Cameron
@ 2024-02-26 23:00     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 23:00 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 01:30PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:57 -0600
> John Groves <John@Groves.net> wrote:
> 
> > This commit introduces the famfs iomap_ops. When either
> > dax_iomap_fault() or dax_iomap_rw() is called, we get a callback
> > via our iomap_begin() handler. The question being asked is
> > "please resolve (file, offset) to (daxdev, offset)". The function
> > famfs_meta_to_dax_offset() does this.
> > 
> > The per-file metadata is just an extent list to the
> > backing dax dev.  The order of this resolution is O(N) for N
> > extents. Note with the current user space, files usually have
> > only one extent.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> 
> > ---
> >  fs/famfs/famfs_file.c | 245 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 245 insertions(+)
> >  create mode 100644 fs/famfs/famfs_file.c
> > 
> > diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
> > new file mode 100644
> > index 000000000000..fc667d5f7be8
> > --- /dev/null
> > +++ b/fs/famfs/famfs_file.c
> > @@ -0,0 +1,245 @@
> 
> > +static int
> > +famfs_meta_to_dax_offset(
> > +	struct inode *inode,
> > +	struct iomap *iomap,
> > +	loff_t        offset,
> > +	loff_t        len,
> > +	unsigned int  flags)
> > +{
> > +	struct famfs_file_meta *meta = (struct famfs_file_meta *)inode->i_private;
> 
> i_private is void * so no need for explicit cast (C spec says this is always fine without)

Yessir.

> 
> 
> > +
> > +/**
> > + * famfs_iomap_begin()
> > + *
> > + * This function is pretty simple because files are
> > + * * never partially allocated
> > + * * never have holes (never sparse)
> > + * * never "allocate on write"
> > + */
> > +static int
> > +famfs_iomap_begin(
> > +	struct inode	       *inode,
> > +	loff_t			offset,
> > +	loff_t			length,
> > +	unsigned int		flags,
> > +	struct iomap	       *iomap,
> > +	struct iomap	       *srcmap)
> > +{
> > +	struct famfs_file_meta *meta = inode->i_private;
> > +	size_t size;
> > +	int rc;
> > +
> > +	size = i_size_read(inode);
> > +
> > +	WARN_ON(size != meta->file_size);
> > +
> > +	rc = famfs_meta_to_dax_offset(inode, iomap, offset, length, flags);
> > +
> > +	return rc;
> 	return famfs_meta_...

Done

> 
> > +}
> 
> 
> > +static vm_fault_t
> > +famfs_filemap_map_pages(
> > +	struct vm_fault	       *vmf,
> > +	pgoff_t			start_pgoff,
> > +	pgoff_t			end_pgoff)
> > +{
> > +	vm_fault_t ret;
> > +
> > +	ret = filemap_map_pages(vmf, start_pgoff, end_pgoff);
> > +	return ret;
> 	return filename_map_pages()....

Done, thanks
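
Since the body of famfs_meta_to_dax_offset() is trimmed from the quote
above, here is a stand-alone user-space sketch of the O(N) walk the commit
message describes: resolve a file offset to a dax device offset through an
extent list. The names (struct ext, resolve_offset()) and the return
convention are assumptions for illustration, not the code in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* One extent: a contiguous range on the backing dax device. */
struct ext {
	uint64_t offset;	/* start offset on the dax device */
	uint64_t len;		/* length in bytes */
};

/*
 * Walk the extent list until file_off falls inside an extent.
 * On success, store the device offset in *dax_off and the number of
 * contiguous bytes remaining in that extent in *contig, and return 0.
 * Return -1 if file_off lies beyond the last extent.
 */
static int resolve_offset(const struct ext *exts, size_t n, uint64_t file_off,
			  uint64_t *dax_off, uint64_t *contig)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (file_off < exts[i].len) {
			*dax_off = exts[i].offset + file_off;
			*contig = exts[i].len - file_off;
			return 0;
		}
		file_off -= exts[i].len;	/* skip past this extent */
	}
	return -1;
}
```

With one extent per file, which is the usual case today, the loop exits on
its first iteration, so the linear cost only matters once multi-extent
files become common.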

John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 14/20] famfs: Add struct file_operations
  2024-02-26 13:32   ` Jonathan Cameron
@ 2024-02-26 23:09     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-26 23:09 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 01:32PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:58 -0600
> John Groves <John@Groves.net> wrote:
> 
> > This commit introduces the famfs file_operations. We call
> > thp_get_unmapped_area() to force PMD page alignment. Our read and
> > write handlers (famfs_dax_read_iter() and famfs_dax_write_iter())
> > call dax_iomap_rw() to do the work.
> > 
> > famfs_file_invalid() checks for various ways a famfs file can be
> > in an invalid state so we can fail I/O or fault resolution in those
> > cases. Those cases include the following:
> > 
> > * No famfs metadata
> > * file i_size does not match the originally allocated size
> > * file is not flagged as DAX
> > * errors were detected previously on the file
> > 
> > An invalid file can often be fixed by replaying the log, or by
> > umount/mount/log replay - all of which are user space operations.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/famfs/famfs_file.c | 136 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 136 insertions(+)
> > 
> > diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
> > index fc667d5f7be8..5228e9de1e3b 100644
> > --- a/fs/famfs/famfs_file.c
> > +++ b/fs/famfs/famfs_file.c
> > @@ -19,6 +19,142 @@
> >  #include <uapi/linux/famfs_ioctl.h>
> >  #include "famfs_internal.h"
> >  
> > +/*********************************************************************
> > + * file_operations
> > + */
> > +
> > +/* Reject I/O to files that aren't in a valid state */
> > +static ssize_t
> > +famfs_file_invalid(struct inode *inode)
> > +{
> > +	size_t i_size       = i_size_read(inode);
> > +	struct famfs_file_meta *meta = inode->i_private;
> > +
> > +	if (!meta) {
> > +		pr_err("%s: un-initialized famfs file\n", __func__);
> > +		return -EIO;
> > +	}
> > +	if (i_size != meta->file_size) {
> > +		pr_err("%s: something changed the size from  %ld to %ld\n",
> > +		       __func__, meta->file_size, i_size);
> > +		meta->error = 1;
> > +		return -ENXIO;
> > +	}
> > +	if (!IS_DAX(inode)) {
> > +		pr_err("%s: inode %llx IS_DAX is false\n", __func__, (u64)inode);
> > +		meta->error = 1;
> > +		return -ENXIO;
> > +	}
> > +	if (meta->error) {
> > +		pr_err("%s: previously detected metadata errors\n", __func__);
> > +		meta->error = 1;
> 
> Already set?  If treating it as only a boolean, maybe make it one?

Done, thanks

John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-26 21:16       ` John Groves
@ 2024-02-27  0:58         ` Luis Chamberlain
  2024-02-27  2:05           ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Luis Chamberlain @ 2024-02-27  0:58 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@groves.net> wrote:
>
> On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > Run status group 0 (all jobs):
> > >   WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> >
> > > This is run on an xfs file system on a SATA ssd.
> >
> > For a closer apples-to-apples comparison, wouldn't it make more sense
> > to try this with XFS on pmem (with fio -direct=1)?
> >
> >   Luis
>
> Makes sense. Here is the same command line I used with xfs before, but
> now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> because xfs requires that).
>
> fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs

Could you try with mkfs.xfs -d agcount=1024

 Luis

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-27  0:58         ` Luis Chamberlain
@ 2024-02-27  2:05           ` John Groves
  2024-02-29  2:15             ` Dave Chinner
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-27  2:05 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 04:58PM, Luis Chamberlain wrote:
> On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@groves.net> wrote:
> >
> > On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > > Run status group 0 (all jobs):
> > > >   WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> > >
> > > > This is run on an xfs file system on a SATA ssd.
> > >
> > > For a closer apples-to-apples comparison, wouldn't it make more sense
> > > to try this with XFS on pmem (with fio -direct=1)?
> > >
> > >   Luis
> >
> > Makes sense. Here is the same command line I used with xfs before, but
> > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> > because xfs requires that).
> >
> > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
> 
> Could you try with mkfs.xfs -d agcount=1024
> 
>  Luis

$ luis/fio-xfsdax.sh 
+ sudo mkfs.xfs -d agcount=1024 -m reflink=0 -f /dev/pmem0
meta-data=/dev/pmem0             isize=512    agcount=1024, agsize=32768 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=33554432, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
+ sudo mount -o dax /dev/pmem0 /mnt/xfs
+ sudo chown jmg:jmg /mnt/xfs
+ ls -al /mnt/xfs
total 0
drwxr-xr-x  2 jmg  jmg   6 Feb 26 19:56 .
drwxr-xr-x. 4 root root 30 Feb 26 14:58 ..
++ nproc
+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
...
fio-3.33
Starting 48 processes
ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB)
[47 further identical "Laying out IO files" lines omitted, one per remaining job]
Jobs: 1 (f=10): [_(47),W(1)][100.0%][w=8022MiB/s][w=4011 IOPS][eta 00m:00s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=141563: Mon Feb 26 19:56:28 2024
  write: IOPS=6578, BW=12.8GiB/s (13.8GB/s)(114GiB/8902msec); 0 zone resets
    slat (usec): min=18, max=60593, avg=1230.85, stdev=1799.97
    clat (usec): min=2, max=98969, avg=5133.25, stdev=5141.07
     lat (usec): min=294, max=99725, avg=6364.09, stdev=5440.30
    clat percentiles (usec):
     |  1.00th=[   11],  5.00th=[   46], 10.00th=[  217], 20.00th=[ 2376],
     | 30.00th=[ 2999], 40.00th=[ 3556], 50.00th=[ 3785], 60.00th=[ 3982],
     | 70.00th=[ 4228], 80.00th=[ 7504], 90.00th=[13173], 95.00th=[14091],
     | 99.00th=[21890], 99.50th=[27919], 99.90th=[45351], 99.95th=[57934],
     | 99.99th=[82314]
   bw (  MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
   iops        : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
  lat (usec)   : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
  lat (usec)   : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
  lat (msec)   : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
  lat (msec)   : 100=0.08%
  cpu          : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=12.8GiB/s (13.8GB/s), 12.8GiB/s-12.8GiB/s (13.8GB/s-13.8GB/s), io=114GiB (123GB), run=8902-8902msec

Disk stats (read/write):
  pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


I ran it several times with similar results.
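
As a rough cross-check of the summary above, the reported bandwidth follows from the issued write count (58560), the 2 MiB block size, and the 8902 msec runtime. The helper below is only an illustrative sketch; its name is made up here and is not part of fio or famfs.

```c
/*
 * Bandwidth in GiB/s from a count of fixed-size writes and a runtime.
 * Hypothetical helper for checking a fio summary by hand.
 */
static double bw_gib_per_s(unsigned long writes, double block_mib,
			   double run_msec)
{
	/* Total data moved, converted from MiB to GiB. */
	double total_gib = (double)writes * block_mib / 1024.0;

	/* Runtime converted from milliseconds to seconds. */
	return total_gib / (run_msec / 1000.0);
}
```

Calling bw_gib_per_s(58560, 2.0, 8902.0) gives roughly 12.85 GiB/s from about 114.4 GiB written, which agrees with the "12.8GiB/s" and "io=114GiB" figures fio printed.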

Regards,
John


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 08/20] famfs: Add famfs_internal.h
  2024-02-26 17:35     ` John Groves
@ 2024-02-27 10:28       ` Jonathan Cameron
  2024-02-28  1:06         ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-27 10:28 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Mon, 26 Feb 2024 11:35:17 -0600
John Groves <John@groves.net> wrote:

> On 24/02/26 12:48PM, Jonathan Cameron wrote:
> > On Fri, 23 Feb 2024 11:41:52 -0600
> > John Groves <John@Groves.net> wrote:
> >   
> > > Add the famfs_internal.h include file. This contains internal data
> > > structures such as the per-file metadata structure (famfs_file_meta)
> > > and extent formats.
> > > 
> > > Signed-off-by: John Groves <john@groves.net>  
> > Hi John,
> > 
> > Build this up as you add the definitions in later patches.
> > 
> > Separate header patches just make people jump back and forth when trying
> > to review.  Obviously more work to build this stuff up cleanly but
> > it's worth doing to save review time.
> >   
> 
> Ohhhhkaaaaay. I think you're right, just not looking forward to
> all that rebasing.

:)  Patch mangling is half the fun of upstream development :)

> 
> > > Generally I'd plumb up Kconfig and Makefile at the beginning as it means
> > > that the set is bisectable and we can check the logic of building each stage.
> > > That is harder to do but tends to bring benefits in forcing a clear
> > > stepwise approach on a patch set. Feel free to ignore this one though as it
> > can slow things down.  
> 
> I'm not sure that's practical. A file system needs a bunch of different
> kinds of operations
> - super_operations
> - fs_context_operations
> - inode_operations
> - file_operations
> - dax holder_operations, iomap_ops
> - etc.
> 
> Will think about the dependency graph of these entities, but I'm not sure
> it's tractable...

Sure.  There's a difference though between doing something useful (or
even successfully loading) and being able to build it at intermediate steps.
I'm only looking for buildability.

If not possible, even with a few stubs, empty ops structures etc
then fair enough.

Jonathan

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 09/20] famfs: Add super_operations
  2024-02-26 21:47     ` John Groves
@ 2024-02-27 10:34       ` Jonathan Cameron
  0 siblings, 0 replies; 105+ messages in thread
From: Jonathan Cameron @ 2024-02-27 10:34 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Mon, 26 Feb 2024 15:47:53 -0600
John Groves <John@groves.net> wrote:

> On 24/02/26 12:51PM, Jonathan Cameron wrote:
> > On Fri, 23 Feb 2024 11:41:53 -0600
> > John Groves <John@Groves.net> wrote:
> >   
> > > Introduce the famfs superblock operations
> > > 
> > > Signed-off-by: John Groves <john@groves.net>
> > > ---
> > >  fs/famfs/famfs_inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 72 insertions(+)
> > >  create mode 100644 fs/famfs/famfs_inode.c
> > > 
> > > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > > new file mode 100644
> > > index 000000000000..3329aff000d1
> > > --- /dev/null
> > > +++ b/fs/famfs/famfs_inode.c
> > > @@ -0,0 +1,72 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * famfs - dax file system for shared fabric-attached memory
> > > + *
> > > + * Copyright 2023-2024 Micron Technology, inc
> > > + *
> > > + * This file system, originally based on ramfs the dax support from xfs,
> > > + * is intended to allow multiple host systems to mount a common file system
> > > + * view of dax files that map to shared memory.
> > > + */
> > > +
> > > +#include <linux/fs.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/highmem.h>
> > > +#include <linux/time.h>
> > > +#include <linux/init.h>
> > > +#include <linux/string.h>
> > > +#include <linux/backing-dev.h>
> > > +#include <linux/sched.h>
> > > +#include <linux/parser.h>
> > > +#include <linux/magic.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/uaccess.h>
> > > +#include <linux/fs_context.h>
> > > +#include <linux/fs_parser.h>
> > > +#include <linux/seq_file.h>
> > > +#include <linux/dax.h>
> > > +#include <linux/hugetlb.h>
> > > +#include <linux/uio.h>
> > > +#include <linux/iomap.h>
> > > +#include <linux/path.h>
> > > +#include <linux/namei.h>
> > > +#include <linux/pfn_t.h>
> > > +#include <linux/blkdev.h>  
> > 
> > That's a lot of header for such a small patch.. I'm going to guess
> > they aren't all used - bring them in as you need them - I hope
> > you never need some of these!  
> 
> I didn't phase in headers in this series. Based on these recommendations,
> the next version of this series is gonna have to be 100% constructed from
> scratch, but okay. My head hurts just thinking about it. I need a nap...
> 
> I've been rebasing for 3 weeks to get this series out, and it occurs to
> > me that maybe there are tools I'm not aware of that make it easier? I'm
> just typing "rebase -i..." 200 times a day. Is there a less soul-crushing way?

Hmm. There are things that make it easier to pick and choose parts of a
big diff for different patches.  Some combination of 
git reset HEAD~1
and one of the 'graphical' tools like tig that let you pick lines.

That lets you quickly break up a patch where you want to move things, then
you can reorder the patches to put them next to where you want to move
changes to and rely on git rebase -i with f or s to squash them.

Figuring out the optimum path to the eventual break up you want is
a skill though.  When doing this sort of mangling I tend to get it wrong
and shout at my computer a few times a day ;)
Then git rebase --abort and try again.

End result is that you end up with a coherent series and it looks like
you wrote perfect code in nice steps from the start!

Jonathan



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 08/20] famfs: Add famfs_internal.h
  2024-02-23 17:41 ` [RFC PATCH 08/20] famfs: Add famfs_internal.h John Groves
  2024-02-26 12:48   ` Jonathan Cameron
@ 2024-02-27 13:38   ` Christian Brauner
  2024-02-27 14:12     ` John Groves
  1 sibling, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2024-02-27 13:38 UTC (permalink / raw)
  To: John Groves
  Cc: Christian Brauner, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, Feb 23, 2024 at 11:41:52AM -0600, John Groves wrote:
> Add the famfs_internal.h include file. This contains internal data
> structures such as the per-file metadata structure (famfs_file_meta)
> and extent formats.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/famfs/famfs_internal.h | 53 +++++++++++++++++++++++++++++++++++++++

Already mentioned in another reply here but adding a bunch of types such
as famfs_file_operations that aren't even defined is pretty odd. So you
should reorder this.

>  1 file changed, 53 insertions(+)
>  create mode 100644 fs/famfs/famfs_internal.h
> 
> diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
> new file mode 100644
> index 000000000000..af3990d43305
> --- /dev/null
> +++ b/fs/famfs/famfs_internal.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2024 Micron Technology, Inc.
> + *
> + * This file system, originally based on ramfs the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +#ifndef FAMFS_INTERNAL_H
> +#define FAMFS_INTERNAL_H
> +
> +#include <linux/atomic.h>
> +#include <linux/famfs_ioctl.h>
> +
> +#define FAMFS_MAGIC 0x87b282ff

That needs to go into include/uapi/linux/magic.h.

> +
> +#define FAMFS_BLKDEV_MODE (FMODE_READ|FMODE_WRITE)
> +
> +extern const struct file_operations      famfs_file_operations;
> +
> +/*
> + * Each famfs dax file has this hanging from its inode->i_private.
> + */
> +struct famfs_file_meta {
> +	int                   error;
> +	enum famfs_file_type  file_type;
> +	size_t                file_size;
> +	enum extent_type      tfs_extent_type;
> +	size_t                tfs_extent_ct;
> +	struct famfs_extent   tfs_extents[];  /* flexible array */
> +};
> +
> +struct famfs_mount_opts {
> +	umode_t mode;
> +};
> +
> +extern const struct iomap_ops             famfs_iomap_ops;
> +extern const struct vm_operations_struct  famfs_file_vm_ops;
> +
> +#define ROOTDEV_STRLEN 80
> +
> +struct famfs_fs_info {
> +	struct famfs_mount_opts  mount_opts;
> +	struct file             *dax_filp;
> +	struct dax_device       *dax_devp;
> +	struct bdev_handle      *bdev_handle;
> +	struct list_head         fsi_list;
> +	char                    *rootdev;
> +};
> +
> +#endif /* FAMFS_INTERNAL_H */
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations
  2024-02-23 17:41 ` [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations John Groves
  2024-02-26 12:56   ` Jonathan Cameron
@ 2024-02-27 13:39   ` Christian Brauner
  2024-02-27 18:38     ` John Groves
  1 sibling, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2024-02-27 13:39 UTC (permalink / raw)
  To: John Groves
  Cc: Christian Brauner, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, Feb 23, 2024 at 11:41:54AM -0600, John Groves wrote:
> Famfs works on both /dev/pmem and /dev/dax devices. This commit introduces
> the function that opens a block (pmem) device and the struct
> dax_holder_operations that are needed for that ABI.
> 
> In this commit, support for opening character /dev/dax is stubbed. A
> later commit introduces this capability.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/famfs/famfs_inode.c | 83 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 83 insertions(+)
> 
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> index 3329aff000d1..82c861998093 100644
> --- a/fs/famfs/famfs_inode.c
> +++ b/fs/famfs/famfs_inode.c
> @@ -68,5 +68,88 @@ static const struct super_operations famfs_ops = {
>  	.show_options	= famfs_show_options,
>  };
>  
> +/***************************************************************************************
> + * dax_holder_operations for block dax
> + */
> +
> +static int
> +famfs_blk_dax_notify_failure(
> +	struct dax_device	*dax_devp,
> +	u64			offset,
> +	u64			len,
> +	int			mf_flags)
> +{
> +
> +	pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n",
> +	       __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags);
> +	return -EOPNOTSUPP;
> +}
> +
> +const struct dax_holder_operations famfs_blk_dax_holder_ops = {
> +	.notify_failure		= famfs_blk_dax_notify_failure,
> +};
> +
> +static int
> +famfs_open_char_device(
> +	struct super_block *sb,
> +	struct fs_context  *fc)
> +{
> +	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> +	       __func__, fc->source);
> +	return -ENODEV;
> +}
> +
> +/**
> + * famfs_open_device()
> + *
> + * Open the memory device. If it looks like /dev/dax, call famfs_open_char_device().
> + * Otherwise try to open it as a block/pmem device.
> + */
> +static int
> +famfs_open_device(

I'm confused why that function is added here but it's completely unclear
in what wider context it's called. This is really hard to follow.

> +	struct super_block *sb,
> +	struct fs_context  *fc)
> +{
> +	struct famfs_fs_info *fsi = sb->s_fs_info;
> +	struct dax_device    *dax_devp;
> +	u64 start_off = 0;
> +	struct bdev_handle   *handlep;
> +
> +	if (fsi->dax_devp) {
> +		pr_err("%s: already mounted\n", __func__);
> +		return -EALREADY;
> +	}
> +
> +	if (strstr(fc->source, "/dev/dax")) /* There is probably a better way to check this */
> +		return famfs_open_char_device(sb, fc);
> +
> +	if (!strstr(fc->source, "/dev/pmem")) { /* There is probably a better way to check this */

Yeah, this is not just a bit ugly but also likely wrong because:

sudo mount --bind /dev/pmem /opt/muhaha

fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/opt/muhaha", [...])

or a simple mknod to create that device somewhere else. You likely want:

lookup_bdev(fc->source, &dev);

if (!DEVICE_NUMBER_SOMETHING_SOMETHING_SANE(dev))
	return invalfc(fc, "SOMETHING SOMETHING...

bdev_open_by_dev(dev, ....)

(This reminds me that I should get back to making it possible to specify
"source" as a file descriptor instead of a mere string with the new
mount api...)
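
The point about path strings can be illustrated from user space. The sketch below is only an analogy, not kernel code: it classifies a candidate backing device by what the path resolves to (inode type and device number from stat()) instead of by how the path is spelled. In the kernel the corresponding lookup would be the lookup_bdev() call mentioned above. The names classify_backing and backing_kind are invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical classification of a backing device candidate. */
enum backing_kind { BACKING_INVALID, BACKING_BLOCK, BACKING_CHAR };

/*
 * Classify a path by the object it resolves to. stat() follows symlinks,
 * and a device node created elsewhere (mknod, bind mount) still reports
 * its real type and device number, so a strstr() on the path string is
 * neither necessary nor sufficient.
 */
static enum backing_kind classify_backing(const char *path, dev_t *devno)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return BACKING_INVALID;

	if (S_ISBLK(st.st_mode)) {
		*devno = st.st_rdev;	/* device number of the block node */
		return BACKING_BLOCK;
	}
	if (S_ISCHR(st.st_mode)) {
		*devno = st.st_rdev;	/* device number of the char node */
		return BACKING_CHAR;
	}
	return BACKING_INVALID;		/* regular file, directory, ... */
}
```

With this approach a path such as /opt/muhaha is accepted or rejected based on the node behind it, so renaming or bind-mounting the device node does not change the outcome.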

> +		pr_err("%s: primary backing dev (%s) is not pmem\n",
> +		       __func__, fc->source);
> +		return -EINVAL;
> +	}
> +
> +	handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);

Hm, I suspected that FAMFS_BLKDEV_MODE would be wrong based on:
https://lore.kernel.org/r/13556dbbd8d0f51bc31e3bdec796283fe85c6baf.1708709155.git.john@groves.net

It's defined as FMODE_READ | FMODE_WRITE which is wrong. But these
helpers want BLOCK_OPEN_READ | BLOCK_OPEN_WRITE.

> +	if (IS_ERR(handlep->bdev)) {

@bdev_handle will be gone as of v6.9 so you might want to wait until
then to resend.

> +		pr_err("%s: failed blkdev_get_by_path(%s)\n", __func__, fc->source);
> +		return PTR_ERR(handlep->bdev);
> +	}
> +
> +	dax_devp = fs_dax_get_by_bdev(handlep->bdev, &start_off,
> +				      fsi  /* holder */,
> +				      &famfs_blk_dax_holder_ops);
> +	if (IS_ERR(dax_devp)) {
> +		pr_err("%s: unable to get daxdev from handlep->bdev\n", __func__);
> +		bdev_release(handlep);
> +		return -ENODEV;
> +	}
> +	fsi->bdev_handle = handlep;
> +	fsi->dax_devp    = dax_devp;
> +
> +	pr_notice("%s: root device is block dax (%s)\n", __func__, fc->source);
> +	return 0;
> +}
> +
> +
>  
>  MODULE_LICENSE("GPL");
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 11/20] famfs: Add fs_context_operations
  2024-02-23 17:41 ` [RFC PATCH 11/20] famfs: Add fs_context_operations John Groves
  2024-02-26 13:20   ` Jonathan Cameron
@ 2024-02-27 13:41   ` Christian Brauner
  2024-02-28  0:59     ` John Groves
  1 sibling, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2024-02-27 13:41 UTC (permalink / raw)
  To: John Groves
  Cc: Christian Brauner, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On Fri, Feb 23, 2024 at 11:41:55AM -0600, John Groves wrote:
> This commit introduces the famfs fs_context_operations and
> famfs_get_inode() which is used by the context operations.
> 
> Signed-off-by: John Groves <john@groves.net>
> ---
>  fs/famfs/famfs_inode.c | 178 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 178 insertions(+)
> 
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> index 82c861998093..f98f82962d7b 100644
> --- a/fs/famfs/famfs_inode.c
> +++ b/fs/famfs/famfs_inode.c
> @@ -41,6 +41,50 @@ static const struct super_operations famfs_ops;
>  static const struct inode_operations famfs_file_inode_operations;
>  static const struct inode_operations famfs_dir_inode_operations;
>  
> +static struct inode *famfs_get_inode(
> +	struct super_block *sb,
> +	const struct inode *dir,
> +	umode_t             mode,
> +	dev_t               dev)
> +{
> +	struct inode *inode = new_inode(sb);
> +
> +	if (inode) {
> +		struct timespec64       tv;
> +
> +		inode->i_ino = get_next_ino();
> +		inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
> +		inode->i_mapping->a_ops = &ram_aops;
> +		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +		mapping_set_unevictable(inode->i_mapping);
> +		tv = inode_set_ctime_current(inode);
> +		inode_set_mtime_to_ts(inode, tv);
> +		inode_set_atime_to_ts(inode, tv);
> +
> +		switch (mode & S_IFMT) {
> +		default:
> +			init_special_inode(inode, mode, dev);
> +			break;
> +		case S_IFREG:
> +			inode->i_op = &famfs_file_inode_operations;
> +			inode->i_fop = &famfs_file_operations;
> +			break;
> +		case S_IFDIR:
> +			inode->i_op = &famfs_dir_inode_operations;
> +			inode->i_fop = &simple_dir_operations;
> +
> +			/* Directory inodes start off with i_nlink == 2 (for "." entry) */
> +			inc_nlink(inode);
> +			break;
> +		case S_IFLNK:
> +			inode->i_op = &page_symlink_inode_operations;
> +			inode_nohighmem(inode);
> +			break;
> +		}
> +	}
> +	return inode;
> +}
> +
>  /**********************************************************************************
>   * famfs super_operations
>   *
> @@ -150,6 +194,140 @@ famfs_open_device(
>  	return 0;
>  }
>  
> +/*****************************************************************************************
> + * fs_context_operations
> + */
> +static int
> +famfs_fill_super(
> +	struct super_block *sb,
> +	struct fs_context  *fc)
> +{
> +	struct famfs_fs_info *fsi = sb->s_fs_info;
> +	struct inode *inode;
> +	int rc = 0;
> +
> +	sb->s_maxbytes		= MAX_LFS_FILESIZE;
> +	sb->s_blocksize		= PAGE_SIZE;
> +	sb->s_blocksize_bits	= PAGE_SHIFT;
> +	sb->s_magic		= FAMFS_MAGIC;
> +	sb->s_op		= &famfs_ops;
> +	sb->s_time_gran		= 1;
> +
> +	rc = famfs_open_device(sb, fc);
> +	if (rc)
> +		goto out;
> +
> +	inode = famfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);
> +	sb->s_root = d_make_root(inode);
> +	if (!sb->s_root)
> +		rc = -ENOMEM;
> +
> +out:
> +	return rc;
> +}
> +
> +enum famfs_param {
> +	Opt_mode,
> +	Opt_dax,
> +};
> +
> +const struct fs_parameter_spec famfs_fs_parameters[] = {
> +	fsparam_u32oct("mode",	  Opt_mode),
> +	fsparam_string("dax",     Opt_dax),
> +	{}
> +};
> +
> +static int famfs_parse_param(
> +	struct fs_context   *fc,
> +	struct fs_parameter *param)
> +{
> +	struct famfs_fs_info *fsi = fc->s_fs_info;
> +	struct fs_parse_result result;
> +	int opt;
> +
> +	opt = fs_parse(fc, famfs_fs_parameters, param, &result);
> +	if (opt == -ENOPARAM) {
> +		opt = vfs_parse_fs_param_source(fc, param);
> +		if (opt != -ENOPARAM)
> +			return opt;

I'm not sure I understand this. But in any case, you should add
Opt_source to enum famfs_param and then add

        fsparam_string("source",        Opt_source),

to famfs_fs_parameters. Then you can add:

famfs_parse_source(fc, param);

You might want to consider validating your devices right away. So think
about:

fd_fs = fsopen("famfs", ...);
ret = fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/definitely/not/valid/device", ...) // succeeds
ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "OPTION_1", ...) // succeeds
ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "OPTION_2", ...) // succeeds 
ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "OPTION_3", ...) // succeeds 
ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "OPTION_N", ...) // succeeds 
ret = fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) // superblock creation failed

So what failed exactly? Yes, you can log into the fscontext and dmesg
that it's @source that's the issue but it's annoying for userspace to
setup a whole mount context only to figure out that some option was
wrong at the end of it.

So validating

famfs_parse_source(...)
{
	if (fc->source)
		return invalfc(fc, "Uhm, we already have a source....

	lookup_bdev(param->string, &dev)
	// validate it's a device you're actually happy to use

	fc->source = param->string;
	param->string = NULL;
}

Your ->get_tree implementation that actually creates/finds the
superblock will validate fc->source again and yes, there's a race here
in so far as the path that fc->source points to could change in between
validating this in famfs_parse_source() and ->get_tree() superblock
creation. This is fixable even right now but then you couldn't reuse
common infrastructure so I would just accept that race for now and we
should provide a nicer mechanism on the vfs layer.

> +
> +		return 0;
> +	}
> +	if (opt < 0)
> +		return opt;
> +
> +	switch (opt) {
> +	case Opt_mode:
> +		fsi->mount_opts.mode = result.uint_32 & S_IALLUGO;
> +		break;
> +	case Opt_dax:
> +		if (strcmp(param->string, "always"))
> +			pr_notice("%s: invalid dax mode %s\n",
> +				  __func__, param->string);
> +		break;
> +	}
> +
> +	return 0;
> +}
> +
> +static DEFINE_MUTEX(famfs_context_mutex);
> +static LIST_HEAD(famfs_context_list);
> +
> +static int famfs_get_tree(struct fs_context *fc)
> +{
> +	struct famfs_fs_info *fsi_entry;
> +	struct famfs_fs_info *fsi = fc->s_fs_info;
> +
> +	fsi->rootdev = kstrdup(fc->source, GFP_KERNEL);
> +	if (!fsi->rootdev)
> +		return -ENOMEM;
> +
> +	/* Fail if famfs is already mounted from the same device */
> +	mutex_lock(&famfs_context_mutex);
> +	list_for_each_entry(fsi_entry, &famfs_context_list, fsi_list) {
> +		if (strcmp(fsi_entry->rootdev, fc->source) == 0) {
> +			mutex_unlock(&famfs_context_mutex);
> +			pr_err("%s: already mounted from rootdev %s\n", __func__, fc->source);
> +			return -EALREADY;

What errno is EALREADY? Isn't that socket stuff? In any case, it seems
you want EBUSY?

But bigger picture I'm lost. And why do you keep that list based on
strings? What if I do:

mount -t famfs /dev/pmem1234 /mnt # succeeds

mount -t famfs /dev/pmem1234 /opt # ah, fsck me, this fails.. But wait a minute....

mount --bind /dev/pmem1234 /evil-masterplan

mount -t famfs /evil-masterplan /opt # succeeds. YAY

I believe that would trivially defeat your check.

> +		}
> +	}
> +
> +	list_add(&fsi->fsi_list, &famfs_context_list);
> +	mutex_unlock(&famfs_context_mutex);
> +
> +	return get_tree_nodev(fc, famfs_fill_super);

So why isn't this using get_tree_bdev()? Note that a while ago I
added FSCONFIG_CMD_CREATE_EXCL which prevents silent superblock reuse. To
implement that I added fs_context->exclusive. If you unconditionally set
fc->exclusive = 1 in your famfs_init_fs_context() and use
get_tree_bdev() it will give you EBUSY if fc->source is already in use -
including other famfs instances.

I also fail to yet understand how that function which actually opens the block
device and gets the dax device figures into this. It's a bit hard to follow
what's going on since you add all those unused functions and types so there's
never a wider context to see that stuff in.

> +
> +}
> +
> +static void famfs_free_fc(struct fs_context *fc)
> +{
> +	struct famfs_fs_info *fsi = fc->s_fs_info;
> +
> +	if (fsi && fsi->rootdev)
> +		kfree(fsi->rootdev);
> +
> +	kfree(fsi);
> +}
> +
> +static const struct fs_context_operations famfs_context_ops = {
> +	.free		= famfs_free_fc,
> +	.parse_param	= famfs_parse_param,
> +	.get_tree	= famfs_get_tree,
> +};
> +
> +static int famfs_init_fs_context(struct fs_context *fc)
> +{
> +	struct famfs_fs_info *fsi;
> +
> +	fsi = kzalloc(sizeof(*fsi), GFP_KERNEL);
> +	if (!fsi)
> +		return -ENOMEM;
> +
> +	fsi->mount_opts.mode = FAMFS_DEFAULT_MODE;
> +	fc->s_fs_info        = fsi;
> +	fc->ops              = &famfs_context_ops;
> +	return 0;
> +}
>  
>  
>  MODULE_LICENSE("GPL");
> -- 
> 2.43.0
> 

* Re: [RFC PATCH 08/20] famfs: Add famfs_internal.h
  2024-02-27 13:38   ` Christian Brauner
@ 2024-02-27 14:12     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-27 14:12 UTC (permalink / raw)
  To: Christian Brauner
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Jan Kara, Matthew Wilcox, linux-cxl,
	linux-fsdevel, linux-doc, linux-kernel, nvdimm, john,
	Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

On 24/02/27 02:38PM, Christian Brauner wrote:
> On Fri, Feb 23, 2024 at 11:41:52AM -0600, John Groves wrote:
> > Add the famfs_internal.h include file. This contains internal data
> > structures such as the per-file metadata structure (famfs_file_meta)
> > and extent formats.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/famfs/famfs_internal.h | 53 +++++++++++++++++++++++++++++++++++++++
> 
> Already mentioned in another reply here but adding a bunch of types such
> as famfs_file_operations that aren't even defined is pretty odd. So you
> should reorder this.

Acknowledged, thanks. V2 will phase in only what is needed by the
code in each patch.

> 
> >  1 file changed, 53 insertions(+)
> >  create mode 100644 fs/famfs/famfs_internal.h
> > 
> > diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
> > new file mode 100644
> > index 000000000000..af3990d43305
> > --- /dev/null
> > +++ b/fs/famfs/famfs_internal.h
> > @@ -0,0 +1,53 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * famfs - dax file system for shared fabric-attached memory
> > + *
> > + * Copyright 2023-2024 Micron Technology, Inc.
> > + *
> > + * This file system, originally based on ramfs the dax support from xfs,
> > + * is intended to allow multiple host systems to mount a common file system
> > + * view of dax files that map to shared memory.
> > + */
> > +#ifndef FAMFS_INTERNAL_H
> > +#define FAMFS_INTERNAL_H
> > +
> > +#include <linux/atomic.h>
> > +#include <linux/famfs_ioctl.h>
> > +
> > +#define FAMFS_MAGIC 0x87b282ff
> 
> That needs to go into include/uapi/linux/magic.h.

Done for v2.

Thank you,
John


* Re: [RFC PATCH 09/20] famfs: Add super_operations
  2024-02-26 12:51   ` Jonathan Cameron
  2024-02-26 21:47     ` John Groves
@ 2024-02-27 17:48     ` John Groves
  1 sibling, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-27 17:48 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 12:51PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:41:53 -0600
> John Groves <John@Groves.net> wrote:
> > + */
> > +static int famfs_show_options(
> > +	struct seq_file *m,
> > +	struct dentry   *root)
> Not that familiar with fs code, but this is unusual kernel style. I'd go with
> something more common
> 
> static int famfs_show_options(struct seq_file *m, struct dentry *root)

Actually, xfs does function declarations and prototypes this way, not sure if
it's everywhere. But I like this format because changing one argument usually
doesn't put unchanged args into the diff.

So I may keep this style after all.

John

* Re: [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations
  2024-02-27 13:39   ` Christian Brauner
@ 2024-02-27 18:38     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-27 18:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Jan Kara, Matthew Wilcox, linux-cxl,
	linux-fsdevel, linux-doc, linux-kernel, nvdimm, john,
	Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

On 24/02/27 02:39PM, Christian Brauner wrote:
> On Fri, Feb 23, 2024 at 11:41:54AM -0600, John Groves wrote:
> > Famfs works on both /dev/pmem and /dev/dax devices. This commit introduces
> > the function that opens a block (pmem) device and the struct
> > dax_holder_operations that are needed for that ABI.
> > 
> > In this commit, support for opening character /dev/dax is stubbed. A
> > later commit introduces this capability.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/famfs/famfs_inode.c | 83 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 83 insertions(+)
> > 
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > index 3329aff000d1..82c861998093 100644
> > --- a/fs/famfs/famfs_inode.c
> > +++ b/fs/famfs/famfs_inode.c
> > @@ -68,5 +68,88 @@ static const struct super_operations famfs_ops = {
> >  	.show_options	= famfs_show_options,
> >  };
> >  
> > +/***************************************************************************************
> > + * dax_holder_operations for block dax
> > + */
> > +
> > +static int
> > +famfs_blk_dax_notify_failure(
> > +	struct dax_device	*dax_devp,
> > +	u64			offset,
> > +	u64			len,
> > +	int			mf_flags)
> > +{
> > +
> > +	pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n",
> > +	       __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags);
> > +	return -EOPNOTSUPP;
> > +}
> > +
> > +const struct dax_holder_operations famfs_blk_dax_holder_ops = {
> > +	.notify_failure		= famfs_blk_dax_notify_failure,
> > +};
> > +
> > +static int
> > +famfs_open_char_device(
> > +	struct super_block *sb,
> > +	struct fs_context  *fc)
> > +{
> > +	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> > +	       __func__, fc->source);
> > +	return -ENODEV;
> > +}
> > +
> > +/**
> > + * famfs_open_device()
> > + *
> > + * Open the memory device. If it looks like /dev/dax, call famfs_open_char_device().
> > + * Otherwise try to open it as a block/pmem device.
> > + */
> > +static int
> > +famfs_open_device(
> 
> I'm confused why that function is added here but it's completely unclear
> in what wider context it's called. This is really hard to follow.

First, thank you for taking the time to do a thoughtful review.

I didn't factor this series correctly. The next one will be
"module-operations-up" unless you or somebody suggests a more sensible
approach.

Some background that might be useful: this work is really targeted for 
/dev/dax, but it started on /dev/pmem because the iomap interface wasn't 
working on /dev/dax. This patch addresses that (the dev_dax_iomap commits), 
although it's likely that code will evolve.

The current famfs code base tries to support both pmem (block) and /dev/dax 
(char), but I'm now thinking it should move to /dev/dax-only (no block 
support).

/dev/pmem devices can be converted to /dev/dax mode anyway, so I'm not sure
there is a reason to support both interfaces. (Need to think a bit more on 
that...).

> 
> > +	struct super_block *sb,
> > +	struct fs_context  *fc)
> > +{
> > +	struct famfs_fs_info *fsi = sb->s_fs_info;
> > +	struct dax_device    *dax_devp;
> > +	u64 start_off = 0;
> > +	struct bdev_handle   *handlep;
> > +
> > +	if (fsi->dax_devp) {
> > +		pr_err("%s: already mounted\n", __func__);
> > +		return -EALREADY;
> > +	}
> > +
> > +	if (strstr(fc->source, "/dev/dax")) /* There is probably a better way to check this */
> > +		return famfs_open_char_device(sb, fc);
> > +
> > +	if (!strstr(fc->source, "/dev/pmem")) { /* There is probably a better way to check this */
> 
> Yeah, this is not just a bit ugly but also likely wrong because:
> 
> sudo mount --bind /dev/pmem /opt/muhaha
> 
> fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/opt/muhaha", [...])
> 
> or a simple mknod to create that device somewhere else. You likely want:
> 
> lookup_bdev(fc->source, &dev);
> 
> if (!DEVICE_NUMBER_SOMETHING_SOMETHING_SANE(dev))
> 	return invalfc(fc, "SOMETHING SOMETHING...
> 
> bdev_open_by_dev(dev, ....)
> 
> (This reminds me that I should get back to making it possible to specify
> "source" as a file descriptor instead of a mere string with the new
> mount api...)

All good points - sorry for the flakiness here.

I think the solution is to stop trying to support both pmem and dax. Then 
I don't need to distinguish between different device types.
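
A userspace sketch of the underlying idea, assuming nothing about what the
eventual famfs code will look like: decide what a path is from its inode
type and device number rather than from how the path is spelled.
classify_device() and enum dev_kind are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>	/* major(), minor() */

enum dev_kind { DEV_NONE, DEV_BLOCK, DEV_CHAR };

/*
 * Classify a path by inode type, not by name. On success for a device
 * node, store its major/minor numbers through maj/min (if non-NULL).
 */
enum dev_kind classify_device(const char *path,
			      unsigned int *maj, unsigned int *min)
{
	struct stat st;

	assert(path);
	if (stat(path, &st) != 0)
		return DEV_NONE;
	if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
		return DEV_NONE;

	if (maj != NULL)
		*maj = major(st.st_rdev);
	if (min != NULL)
		*min = minor(st.st_rdev);
	return S_ISBLK(st.st_mode) ? DEV_BLOCK : DEV_CHAR;
}
```

A check like this gives the same answer for /dev/pmem0, a bind mount of it,
or a node created elsewhere with mknod, which a strstr() on the path does
not.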

> 
> > +		pr_err("%s: primary backing dev (%s) is not pmem\n",
> > +		       __func__, fc->source);
> > +		return -EINVAL;
> > +	}
> > +
> > +	handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);
> 
> Hm, I suspected that FAMFS_BLKDEV_MODE would be wrong based on:
> https://lore.kernel.org/r/13556dbbd8d0f51bc31e3bdec796283fe85c6baf.1708709155.git.john@groves.net
> 
> It's defined as FMODE_READ | FMODE_WRITE which is wrong. But these
> helpers want BLOCK_OPEN_READ | BLOCK_OPEN_WRITE.

Dropping pmem/block support will also make this go away.

> 
> > +	if (IS_ERR(handlep->bdev)) {
> 
> @bdev_handle will be gone as of v6.9 so you might want to wait until
> then to resend.

And this dependency will also disappear...

Thank you!!
John


* Re: [RFC PATCH 17/20] famfs: Add module stuff
  2024-02-26 13:47   ` Jonathan Cameron
@ 2024-02-27 22:15     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-27 22:15 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 01:47PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:42:01 -0600
> John Groves <John@Groves.net> wrote:
> 
> > This commit introduces the module init and exit machinery for famfs.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> I'd prefer to see this from the start with the functionality of the module
> built up as you go + build logic in place.  Makes it easy to spot places
> where the patches aren't appropriately self constrained. 
> > ---
> >  fs/famfs/famfs_inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> > 
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > index ab46ec50b70d..0d659820e8ff 100644
> > --- a/fs/famfs/famfs_inode.c
> > +++ b/fs/famfs/famfs_inode.c
> > @@ -462,4 +462,48 @@ static struct file_system_type famfs_fs_type = {
> >  	.fs_flags	  = FS_USERNS_MOUNT,
> >  };
> >  
> > +/*****************************************************************************************
> > + * Module stuff
> 
> I'd drop these drivers structure comments. They add little beyond
> a high possibility of being wrong after the code has evolved a bit.

Probably will do with the module-ops-up refactor for v2.

> 
> > + */
> > +static struct kobject *famfs_kobj;
> > +
> > +static int __init init_famfs_fs(void)
> > +{
> > +	int rc;
> > +
> > +#if defined(CONFIG_DEV_DAX_IOMAP)
> > +	pr_notice("%s: Your kernel supports famfs on /dev/dax\n", __func__);
> > +#else
> > +	pr_notice("%s: Your kernel does not support famfs on /dev/dax\n", __func__);
> > +#endif
> > +	famfs_kobj = kobject_create_and_add(MODULE_NAME, fs_kobj);
> > +	if (!famfs_kobj) {
> > +		pr_warn("Failed to create kobject\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	rc = sysfs_create_group(famfs_kobj, &famfs_attr_group);
> > +	if (rc) {
> > +		kobject_put(famfs_kobj);
> > +		pr_warn("%s: Failed to create sysfs group\n", __func__);
> > +		return rc;
> > +	}
> > +
> > +	return register_filesystem(&famfs_fs_type);
> 
> If this fails, do we not leak the kobj and sysfs groups?

Good catch, thanks! Fixed for now, but the kobj is also likely to go away. Will
endeavor to get it right...
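
The usual shape of such a fix is goto-based unwinding in reverse order of
setup. What follows is a userspace model of that pattern, not the actual
patch: step(), undo() and held_resources() are hypothetical stand-ins, and
the comments only name the kernel calls they are meant to mirror.

```c
#include <assert.h>
#include <errno.h>

static int fail_step;	/* 0 = nothing fails, 1..3 = that step fails */
static int held;	/* number of setup steps currently held */

/* Stand-in for one setup call; fails on demand, otherwise holds a resource. */
static int step(int n)
{
	if (fail_step == n)
		return -ENOMEM;
	held++;
	return 0;
}

/* Stand-in for the matching teardown call. */
static void undo(void)
{
	held--;
}

int held_resources(void)
{
	return held;
}

int init_sketch(int fail_at)
{
	int rc;

	assert(fail_at >= 0 && fail_at <= 3);
	fail_step = fail_at;
	held = 0;

	rc = step(1);		/* cf. kobject_create_and_add() */
	if (rc)
		return rc;
	rc = step(2);		/* cf. sysfs_create_group() */
	if (rc)
		goto err_kobj;
	rc = step(3);		/* cf. register_filesystem() */
	if (rc)
		goto err_group;
	return 0;

err_group:
	undo();			/* cf. sysfs_remove_group() */
err_kobj:
	undo();			/* cf. kobject_put() */
	return rc;
}
```

With this layout a failure in the last step releases both earlier
resources, which is the case the review comment points at.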

Thanks,
John


* Re: [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch
  2024-02-26 13:52   ` Jonathan Cameron
@ 2024-02-27 22:27     ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-27 22:27 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/26 01:52PM, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 11:42:02 -0600
> John Groves <John@Groves.net> wrote:
> 
> > This commit introduces the ability to open a character /dev/dax device
> > instead of a block /dev/pmem device. This rests on the dev_dax_iomap
> > patches earlier in this series.
> 
> Not sure the back reference is needed given it's in the series.

Roger that

> 
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/famfs/famfs_inode.c | 97 +++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 87 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > index 0d659820e8ff..7d65ac497147 100644
> > --- a/fs/famfs/famfs_inode.c
> > +++ b/fs/famfs/famfs_inode.c
> > @@ -215,6 +215,93 @@ static const struct super_operations famfs_ops = {
> >  	.show_options	= famfs_show_options,
> >  };
> >  
> > +/*****************************************************************************/
> > +
> > +#if defined(CONFIG_DEV_DAX_IOMAP)
> > +
> > +/*
> > + * famfs dax_operations  (for char dax)
> > + */
> > +static int
> > +famfs_dax_notify_failure(struct dax_device *dax_dev, u64 offset,
> > +			u64 len, int mf_flags)
> > +{
> > +	pr_err("%s: offset %lld len %llu flags %x\n", __func__,
> > +	       offset, len, mf_flags);
> > +	return -EOPNOTSUPP;
> > +}
> > +
> > +static const struct dax_holder_operations famfs_dax_holder_ops = {
> > +	.notify_failure		= famfs_dax_notify_failure,
> > +};
> > +
> > +/*****************************************************************************/
> > +
> > +/**
> > + * famfs_open_char_device()
> > + *
> > + * Open a /dev/dax device. This only works in kernels with the dev_dax_iomap patch
> 
> That comment you definitely don't need as this won't get merged without
> that patch being in place.

This will be gone from v2. I'm 90% sure there is no reason to keep the block
device backing support (pmem), since devdax is the point AND pmem can be
converted to devdax mode. So famfs will become devdax only...etc.

This was under development for quite a few months, and was actually working,
before I got dev_dax_iomap right (er, "right enough" for it to work :D). But now
that dev_dax_iomap looks basically stable, pmem/block support can come out.

> 
> 
> > + */
> > +static int
> > +famfs_open_char_device(
> > +	struct super_block *sb,
> > +	struct fs_context  *fc)
> > +{
> > +	struct famfs_fs_info *fsi = sb->s_fs_info;
> > +	struct dax_device    *dax_devp;
> > +	struct inode         *daxdev_inode;
> > +
> > +	int rc = 0;
> set in all paths where it's used.
> 
> > +
> > +	pr_notice("%s: Opening character dax device %s\n", __func__, fc->source);
> 
> pr_debug

Done

> 
> > +
> > +	fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);
> > +	if (IS_ERR(fsi->dax_filp)) {
> > +		pr_err("%s: failed to open dax device %s\n",
> > +		       __func__, fc->source);
> > +		fsi->dax_filp = NULL;
> Better to use a local variable
> 
> 	fp = filp_open(fc->source, O_RDWR, 0);
> 	if (IS_ERR(fp)) {
> 		pr_err.
> 		return;
> 	}
> 	fsi->dax_filp = fp;
> or similar.

Done, thanks.
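
One reading of why the local variable matters: in the hunk as posted, the
field is set to NULL before PTR_ERR() is applied to it, so the error code is
lost and the function returns 0. The sketch below models that in userspace.
ERR_PTR(), PTR_ERR() and IS_ERR() here are simplified stand-ins for the
kernel helpers, and fake_open(), open_via_field() and open_via_local() are
hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (unsigned long)p >= (unsigned long)-MAX_ERRNO;
}

struct fs_info {
	void *dax_filp;
};

/* Always fails; stands in for a failing filp_open(). */
static void *fake_open(void)
{
	return ERR_PTR(-ENOENT);
}

/* Pattern from the posted hunk: clear the field, then read the error from it. */
int open_via_field(struct fs_info *fsi)
{
	assert(fsi);
	fsi->dax_filp = fake_open();
	if (IS_ERR(fsi->dax_filp)) {
		fsi->dax_filp = NULL;
		return (int)PTR_ERR(fsi->dax_filp);	/* PTR_ERR(NULL) is 0 */
	}
	return 0;
}

/* Suggested pattern: keep the result in a local until it is known good. */
int open_via_local(struct fs_info *fsi)
{
	void *fp = fake_open();

	assert(fsi);
	if (IS_ERR(fp))
		return (int)PTR_ERR(fp);
	fsi->dax_filp = fp;
	return 0;
}
```

In this model open_via_field() reports success on a failed open, while
open_via_local() propagates -ENOENT and never stores an error pointer in
the structure.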

> 
> > +		return PTR_ERR(fsi->dax_filp);
> > +	}
> > +
> > +	daxdev_inode = file_inode(fsi->dax_filp);
> > +	dax_devp     = inode_dax(daxdev_inode);
> > +	if (IS_ERR(dax_devp)) {
> > +		pr_err("%s: unable to get daxdev from inode for %s\n",
> > +		       __func__, fc->source);
> > +		rc = -ENODEV;
> > +		goto char_err;
> > +	}
> > +
> > +	rc = fs_dax_get(dax_devp, fsi, &famfs_dax_holder_ops);
> > +	if (rc) {
> > +		pr_info("%s: err attaching famfs_dax_holder_ops\n", __func__);
> > +		goto char_err;
> > +	}
> > +
> > +	fsi->bdev_handle = NULL;
> > +	fsi->dax_devp = dax_devp;
> > +
> > +	return 0;
> > +
> > +char_err:
> > +	filp_close(fsi->dax_filp, NULL);
> 
> You carefully set fsi->dax_filp to null in the other error paths.
> Why there and not here?

Why indeed - done now.

> 
> > +	return rc;
> > +}
> > +
> > +#else /* CONFIG_DEV_DAX_IOMAP */
> > +static int
> > +famfs_open_char_device(
> > +	struct super_block *sb,
> > +	struct fs_context  *fc)
> > +{
> > +	pr_err("%s: Root device is %s, but your kernel does not support famfs on /dev/dax\n",
> > +	       __func__, fc->source);
> > +	return -ENODEV;
> > +}
> > +
> > +
> > +#endif /* CONFIG_DEV_DAX_IOMAP */
> > +
> >  /***************************************************************************************
> >   * dax_holder_operations for block dax
> >   */
> > @@ -236,16 +323,6 @@ const struct dax_holder_operations famfs_blk_dax_holder_ops = {
> >  	.notify_failure		= famfs_blk_dax_notify_failure,
> >  };
> >  
> 
> Put it in right place earlier! Makes this less noisy.

This will be eliminated by the move to /dev/dax-only

Thanks,
John


* Re: [RFC PATCH 11/20] famfs: Add fs_context_operations
  2024-02-27 13:41   ` Christian Brauner
@ 2024-02-28  0:59     ` John Groves
  2024-02-28  1:49       ` Randy Dunlap
  2024-02-28 10:07       ` Christian Brauner
  0 siblings, 2 replies; 105+ messages in thread
From: John Groves @ 2024-02-28  0:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Jan Kara, Matthew Wilcox, linux-cxl,
	linux-fsdevel, linux-doc, linux-kernel, nvdimm, john,
	Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

On 24/02/27 02:41PM, Christian Brauner wrote:
> On Fri, Feb 23, 2024 at 11:41:55AM -0600, John Groves wrote:
> > This commit introduces the famfs fs_context_operations and
> > famfs_get_inode() which is used by the context operations.
> > 
> > Signed-off-by: John Groves <john@groves.net>
> > ---
> >  fs/famfs/famfs_inode.c | 178 +++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 178 insertions(+)
> > 
> > diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> > index 82c861998093..f98f82962d7b 100644
> > --- a/fs/famfs/famfs_inode.c
> > +++ b/fs/famfs/famfs_inode.c

<snip>

> > +enum famfs_param {
> > +	Opt_mode,
> > +	Opt_dax,
> > +};
> > +
> > +const struct fs_parameter_spec famfs_fs_parameters[] = {
> > +	fsparam_u32oct("mode",	  Opt_mode),
> > +	fsparam_string("dax",     Opt_dax),
> > +	{}
> > +};
> > +
> > +static int famfs_parse_param(
> > +	struct fs_context   *fc,
> > +	struct fs_parameter *param)
> > +{
> > +	struct famfs_fs_info *fsi = fc->s_fs_info;
> > +	struct fs_parse_result result;
> > +	int opt;
> > +
> > +	opt = fs_parse(fc, famfs_fs_parameters, param, &result);
> > +	if (opt == -ENOPARAM) {
> > +		opt = vfs_parse_fs_param_source(fc, param);
> > +		if (opt != -ENOPARAM)
> > +			return opt;
> 
> > I'm not sure I understand this. But in any case, you should add
> Opt_source to enum famfs_param and then add
> 
>         fsparam_string("source",        Opt_source),
> 
> to famfs_fs_parameters. Then you can add:
> 
> famfs_parse_source(fc, param);
> 
> You might want to consider validating your devices right away. So think
> about:
> 
> fd_fs = fsopen("famfs", ...);
> ret = fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/definitely/not/valid/device", ...) // succeeds
> ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "OPTION_1", ...) // succeeds
> ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "OPTION_2", ...) // succeeds 
> ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "OPTION_3", ...) // succeeds 
> ret = fsconfig(fd_fs, FSCONFIG_SET_FLAG, "OPTION_N", ...) // succeeds 
> ret = fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) // superblock creation failed
> 
> So what failed exactly? Yes, you can log into the fscontext and dmesg
> that it's @source that's the issue but it's annoying for userspace to
> setup a whole mount context only to figure out that some option was
> wrong at the end of it.
> 
> So validating
> 
> > famfs_parse_source(...)
> > {
> > 	if (fc->source)
> > 		return invalfc(fc, "Uhm, we already have a source....");
> > 
> > 	lookup_bdev(param->string, &dev);
> > 	// validate it's a device you're actually happy to use
> > 
> > 	fc->source = param->string;
> > 	param->string = NULL;
> > }
> 
> Your ->get_tree implementation that actually creates/finds the
> superblock will validate fc->source again and yes, there's a race here
> in so far as the path that fc->source points to could change in between
> validating this in famfs_parse_source() and ->get_tree() superblock
> creation. This is fixable even right now but then you couldn't reuse
> > common infrastructure so I would just accept that race for now and we
> should provide a nicer mechanism on the vfs layer.

I wasn't aware of the new fsconfig interface. Is there documentation or a
file system that already uses it that I should refer to? I didn't find an
obvious candidate, but it might be me. If it should be obvious from the
example above, tell me and I'll try harder.

My famfs code above was copied from ramfs. If you point me to 
documentation I might send you a ramfs fsconfig patch too :D.
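
For reference, a minimal userspace sketch of the first two steps of that
interface, under these assumptions: the calls go through syscall(2) because
older C libraries lack wrappers, the installed kernel headers provide
SYS_fsopen, SYS_fsconfig and <linux/mount.h>, and fs_context_with_source()
is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/mount.h>	/* FSOPEN_CLOEXEC, FSCONFIG_SET_STRING */
#include <sys/syscall.h>	/* SYS_fsopen, SYS_fsconfig */
#include <unistd.h>

/*
 * Open a filesystem context for fstype and set its "source" parameter.
 * Returns the context fd on success, or -errno from the step that failed.
 */
int fs_context_with_source(const char *fstype, const char *source)
{
	int fd, err;

	assert(fstype && source);

	fd = (int)syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* This is the call that reaches the filesystem's ->parse_param(). */
	if (syscall(SYS_fsconfig, fd, FSCONFIG_SET_STRING,
		    "source", source, 0) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}
```

A caller would then issue FSCONFIG_CMD_CREATE on the returned fd, followed
by fsmount() and move_mount(). If "source" is validated when it is set, a
bad device is reported by the second call above instead of at superblock
creation.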

> 
> > +
> > +		return 0;
> > +	}
> > +	if (opt < 0)
> > +		return opt;
> > +
> > +	switch (opt) {
> > +	case Opt_mode:
> > +		fsi->mount_opts.mode = result.uint_32 & S_IALLUGO;
> > +		break;
> > +	case Opt_dax:
> > +		if (strcmp(param->string, "always"))
> > +			pr_notice("%s: invalid dax mode %s\n",
> > +				  __func__, param->string);
> > +		break;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static DEFINE_MUTEX(famfs_context_mutex);
> > +static LIST_HEAD(famfs_context_list);
> > +
> > +static int famfs_get_tree(struct fs_context *fc)
> > +{
> > +	struct famfs_fs_info *fsi_entry;
> > +	struct famfs_fs_info *fsi = fc->s_fs_info;
> > +
> > +	fsi->rootdev = kstrdup(fc->source, GFP_KERNEL);
> > +	if (!fsi->rootdev)
> > +		return -ENOMEM;
> > +
> > +	/* Fail if famfs is already mounted from the same device */
> > +	mutex_lock(&famfs_context_mutex);
> > +	list_for_each_entry(fsi_entry, &famfs_context_list, fsi_list) {
> > +		if (strcmp(fsi_entry->rootdev, fc->source) == 0) {
> > +			mutex_unlock(&famfs_context_mutex);
> > +			pr_err("%s: already mounted from rootdev %s\n", __func__, fc->source);
> > +			return -EALREADY;
> 
> > What errno is EALREADY? Isn't that socket stuff? In any case, it seems
> you want EBUSY?

Thanks... That should probably be EBUSY. But the whole famfs_context_list
should probably also be removed. More below...

> 
> But bigger picture I'm lost. And why do you keep that list based on
> strings? What if I do:
> 
> mount -t famfs /dev/pmem1234 /mnt # succeeds
> 
> mount -t famfs /dev/pmem1234 /opt # ah, fsck me, this fails.. But wait a minute....
> 
> mount --bind /dev/pmem1234 /evil-masterplan
> 
> mount -t famfs /evil-masterplan /opt # succeeds. YAY
> 
> I believe that would trivially defeat your check.
> 

And I suspect this is related to the get_tree issue you noticed below.

This famfs code was working in 6.5 without keeping the linked list of devices,
but in 6.6/6.7/6.8 it works provided you don't try to repeat a mount command
that has already succeeded. I'm not sure why 6.5 protected me from that,
but the later versions don't. In 6.6+ that hits a BUG_ON (have specifics on
that but not handy right now).

So for a while we just removed repeated mount requests from the famfs smoke
tests, but eventually I implemented the list above, which (though you're right
that it would be easy to circumvent and is therefore not right) did solve the
problem that we were testing for.

I suspect that correctly handling get_tree might solve this problem.

Please assume that linked list will be removed - it was not the right solution.

More below...

> > +		}
> > +	}
> > +
> > +	list_add(&fsi->fsi_list, &famfs_context_list);
> > +	mutex_unlock(&famfs_context_mutex);
> > +
> > +	return get_tree_nodev(fc, famfs_fill_super);
> 
> So why isn't this using get_tree_bdev()? Note that a while ago I
> > added FSCONFIG_CMD_CREATE_EXCL which prevents silent superblock reuse. To
> implement that I added fs_context->exclusive. If you unconditionally set
> fc->exclusive = 1 in your famfs_init_fs_context() and use
> get_tree_bdev() it will give you EBUSY if fc->source is already in use -
> including other famfs instances.
> 
> I also fail to yet understand how that function which actually opens the block
> device and gets the dax device figures into this. It's a bit hard to follow
> what's going on since you add all those unused functions and types so there's
> never a wider context to see that stuff in.

Clearly that's a bug in my code. That get_tree_nodev() is from ramfs, which
was the starting point for famfs.

I'm wondering if doing this correctly (get_tree_bdev() when it's pmem) would
have solved my double mount problem on 6.6 onward.

However, there's another wrinkle: I'm concluding
(see https://lore.kernel.org/linux-fsdevel/ups6cvjw6bx5m3hotn452brbbcgemnarsasre6ep2lbe4tpjsy@ezp6oh5c72ur/)
that famfs should drop block support and just work with /dev/dax. So famfs 
may be the first file system to be hosted on a character device? Certainly 
first on character dax. 

Given that, what variant of get_tree() should it call? Should it add 
get_tree_dax()? I'm not yet familiar enough with that code to have a worthy 
opinion on this.

Please let me know what you think.

Thank you for the serious review!
John



* Re: [RFC PATCH 08/20] famfs: Add famfs_internal.h
  2024-02-27 10:28       ` Jonathan Cameron
@ 2024-02-28  1:06         ` John Groves
  0 siblings, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-28  1:06 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

On 24/02/27 10:28AM, Jonathan Cameron wrote:
> On Mon, 26 Feb 2024 11:35:17 -0600
> John Groves <John@groves.net> wrote:
> 
> > On 24/02/26 12:48PM, Jonathan Cameron wrote:
> > > On Fri, 23 Feb 2024 11:41:52 -0600
> > > John Groves <John@Groves.net> wrote:
> > >   
> > > > Add the famfs_internal.h include file. This contains internal data
> > > > structures such as the per-file metadata structure (famfs_file_meta)
> > > > and extent formats.
> > > > 
> > > > Signed-off-by: John Groves <john@groves.net>  
> > > Hi John,
> > > 
> > > Build this up as you add the definitions in later patches.
> > > 
> > > Separate header patches just make people jump back and forth when trying
> > > to review.  Obviously more work to build this stuff up cleanly but
> > > it's worth doing to save review time.
> > >   
> > 
> > Ohhhhkaaaaay. I think you're right, just not looking forward to
> > all that rebasing.
> 
> :)  Patch mangling is half the fun of upstream development :)
> 
> > 
> > > > Generally I'd plumb up Kconfig and Makefile at the beginning as it means
> > > that the set is bisectable and we can check the logic of building each stage.
> > > That is harder to do but tends to bring benefits in forcing clear step
> > > wise approach on a patch set. Feel free to ignore this one though as it
> > > can slow things down.  
> > 
> > I'm not sure that's practical. A file system needs a bunch of different
> > kinds of operations
> > - super_operations
> > - fs_context_operations
> > - inode_operations
> > - file_operations
> > - dax holder_operations, iomap_ops
> > - etc.
> > 
> > Will think about the dependency graph of these entities, but I'm not sure
> > it's tractable...
> 
> Sure.  There's a difference though between doing something useful (or
> even successfully loading) and being able to build it at intermediate steps.
> I'm only looking for buildability.
> 
> If not possible, even with a few stubs, empty ops structures etc
> then fair enough.
> 
> Jonathan

I'm through at least the first stage of grief on this. By the time we're
through this I'll be able to reconstitute the whole bloody thing from memory,
backwards :D

John


* Re: [RFC PATCH 11/20] famfs: Add fs_context_operations
  2024-02-28  0:59     ` John Groves
@ 2024-02-28  1:49       ` Randy Dunlap
  2024-02-28  8:17         ` Christian Brauner
  2024-02-28 10:07       ` Christian Brauner
  1 sibling, 1 reply; 105+ messages in thread
From: Randy Dunlap @ 2024-02-28  1:49 UTC (permalink / raw)
  To: John Groves, Christian Brauner
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Jan Kara, Matthew Wilcox, linux-cxl,
	linux-fsdevel, linux-doc, linux-kernel, nvdimm, john,
	Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price



On 2/27/24 16:59, John Groves wrote:
> On 24/02/27 02:41PM, Christian Brauner wrote:
>> On Fri, Feb 23, 2024 at 11:41:55AM -0600, John Groves wrote:
>>> This commit introduces the famfs fs_context_operations and
>>> famfs_get_inode() which is used by the context operations.
>>>
>>> Signed-off-by: John Groves <john@groves.net>
>>> ---
>>>  fs/famfs/famfs_inode.c | 178 +++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 178 insertions(+)
>>>
>>> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
>>> index 82c861998093..f98f82962d7b 100644
>>> --- a/fs/famfs/famfs_inode.c
>>> +++ b/fs/famfs/famfs_inode.c
> 
> <snip>
> 

> 
> I wasn't aware of the new fsconfig interface. Is there documentation or a
> file system that already uses it that I should refer to? I didn't find an
> obvious candidate, but it might be me. If it should be obvious from the
> example above, tell me and I'll try harder.

> My famfs code above was copied from ramfs. If you point me to 
> documentation I might send you a ramfs fsconfig patch too :D.

All that I found was the commit to add fsconfig to the kernel tree:

commit ecdab150fddb
Author: David Howells <dhowells@redhat.com>
Date:   Thu Nov 1 23:36:09 2018 +0000

    vfs: syscall: Add fsconfig() for configuring and managing a context

and the lore archive for its discussion:
https://lore.kernel.org/all/153313723557.13253.9055982745313603422.stgit@warthog.procyon.org.uk/


plus David's userspace man page addition for it:
https://lore.kernel.org/linux-fsdevel/159680897140.29015.15318866561972877762.stgit@warthog.procyon.org.uk/


-- 
#Randy

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 11/20] famfs: Add fs_context_operations
  2024-02-28  1:49       ` Randy Dunlap
@ 2024-02-28  8:17         ` Christian Brauner
  0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2024-02-28  8:17 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: John Groves, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price

> plus David's userspace man page addition for it:
> https://lore.kernel.org/linux-fsdevel/159680897140.29015.15318866561972877762.stgit@warthog.procyon.org.uk/

Up to date manpages are
https://github.com/brauner/man-pages-md

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 11/20] famfs: Add fs_context_operations
  2024-02-28  0:59     ` John Groves
  2024-02-28  1:49       ` Randy Dunlap
@ 2024-02-28 10:07       ` Christian Brauner
  2024-02-28 12:01         ` Christian Brauner
  1 sibling, 1 reply; 105+ messages in thread
From: Christian Brauner @ 2024-02-28 10:07 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Jan Kara, Matthew Wilcox, linux-cxl,
	linux-fsdevel, linux-doc, linux-kernel, nvdimm, john,
	Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

> I wasn't aware of the new fsconfig interface. Is there documentation or a
> file system that already uses it that I should refer to? I didn't find an
> obvious candidate, but it might be me. If it should be obvious from the
> example above, tell me and I'll try harder.
> 
> My famfs code above was copied from ramfs. If you point me to 

Ok, but that's the wrong filesystem to use as a model imho. Because it
really doesn't deal with devices at all. That's why it uses
get_tree_nodev() with "nodev" as in "no device" kinda. So ramfs doesn't
have any of these issues. Whereas your filesystem is dealing with
dax (or pmem) devices.

> documentation I might send you a ramfs fsconfig patch too :D.

So the manpages are at:

https://github.com/brauner/man-pages-md

But really, there shouldn't be anything that needs to change for ramfs.

> > What errno is EALREADY? Isn't that socket stuff. In any case, it seems
> > you want EBUSY?
> 
> Thanks... That should probably be EBUSY. But the whole famfs_context_list
> should probably also be removed. More below...
> 
> > 
> > But bigger picture I'm lost. And why do you keep that list based on
> > strings? What if I do:
> > 
> > mount -t famfs /dev/pmem1234 /mnt # succeeds
> > 
> > mount -t famfs /dev/pmem1234 /opt # ah, fsck me, this fails.. But wait a minute....
> > 
> > mount --bind /dev/pmem1234 /evil-masterplan
> > 
> > mount -t famfs /evil-masterplan /opt # succeeds. YAY
> > 
> > I believe that would trivially defeat your check.
> > 
> 
> And I suspect this is related to the get_tree issue you noticed below.
> 
> This famfs code was working in 6.5 without keeping the linked list of devices,
> but in 6.6/6.7/6.8 it works provided you don't try to repeat a mount command
> that has already succeeded. I'm not sure why 6.5 protected me from that,
> but the later versions don't. In 6.6+ That hits a BUG_ON (have specifics on 
> that but not handy right now).

get_tree_nodev() by default will always allocate a new superblock. This
is how tmpfs and ramfs work. If you do:

mount -t tmpfs tmpfs /mnt
mount -t tmpfs tmpfs /opt

You get two new, independent superblocks. This is what you want for
these multi-instance filesystems: each new mount creates a new instance.

If famfs doesn't want to allow reusing devices - which I very much think
it wants to prevent - then it cannot use get_tree_nodev() directly
without having a hack like you did. Because you'll get a new superblock
no problem. So the fact that it did work somehow likely was a bug in
your code.

The reason your code causes crashes is very likely this:

struct famfs_fs_info *fsi = sb->s_fs_info;
handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);

If you look at Documentation/filesystems/porting.rst you should see that
if you use @fs_holder_ops then your holder should be the struct
super_block, not your personal fsinfo.

> So for a while we just removed repeated mount requests from the famfs smoke
> tests, but eventually I implemented the list above, which - though you're right
> it would be easy to circumvent and therefore is not right - it did solve the
> problem that we were testing for.
> 
> I suspect that correctly handling get_tree might solve this problem.
> 
> Please assume that linked list will be removed - it was not the right solution.
> 
> More below...
> 
> > > +		}
> > > +	}
> > > +
> > > +	list_add(&fsi->fsi_list, &famfs_context_list);
> > > +	mutex_unlock(&famfs_context_mutex);
> > > +
> > > +	return get_tree_nodev(fc, famfs_fill_super);
> > 
> > So why isn't this using get_tree_bdev()? Note that a while ago I
> > added FSCONFIG_CMD_CREATE_EXCL which prevents silent superblock reuse. To
> > implement that I added fs_context->exclusive. If you unconditionally set
> > fc->exclusive = 1 in your famfs_init_fs_context() and use
> > get_tree_bdev() it will give you EBUSY if fc->source is already in use -
> > including other famfs instances.
> > 
> > I also fail to yet understand how that function which actually opens the block
> > device and gets the dax device figures into this. It's a bit hard to follow
> > what's going on since you add all those unused functions and types so there's
> > never a wider context to see that stuff in.
> 
> Clearly that's a bug in my code. That get_tree_nodev() is from ramfs, which
> was the starting point for famfs.
> 
> I'm wondering if doing this correctly (get_tree_bdev() when it's pmem) would
> have solved my double mount problem on 6.6 onward.
> 
> However, there's another wrinkle: I'm concluding
> (see https://lore.kernel.org/linux-fsdevel/ups6cvjw6bx5m3hotn452brbbcgemnarsasre6ep2lbe4tpjsy@ezp6oh5c72ur/)
> that famfs should drop block support and just work with /dev/dax. So famfs 
> may be the first file system to be hosted on a character device? Certainly 
> first on character dax. 

Ugh, ok. I defer to others whether that makes sense or not. It would be
a lot easier for you if you used pmem block devices, I guess because it
would be easy to detect reuse in common infrastructure.

But also, I'm looking at your code a bit closer. There's a bit of a
wrinkle the way it's currently written...

Say someone went a bit weird and did:

mount -t xfs /dev/sda /my/xfs-filesystem
mknod DAX_DEVICE /my/xfs-filesystem/dax1234

and then did:

mount -t famfs /my/xfs-filesystem/dax1234 /mnt

Internally in famfs you do:

fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);

and you stash that file... Which means that you are pinning that xfs
filesystem implicitly. IOW, if someone does:

umount /my/xfs-filesystem

they get EBUSY for completely opaque reasons. And if they did:

umount -l /my/xfs-filesystem

followed by mounting that xfs filesystem again they'd get the same
superblock for that xfs filesystem.

What I'm trying to say is that I think you cannot pin another filesystem
like this when you open that device.

IOW, you either need to stash the plain dax device or dax needs to
become its own tiny internal pseudo fs such that we can open dax
devices internally just like files. Which might actually also be worth
doing. But I'm not the maintainer of that.

> 
> Given that, what variant of get_tree() should it call? Should it add 
> get_tree_dax()? I'm not yet familiar enough with that code to have a worthy 
> opinion on this.

I don't think we need a common helper if famfs would be the only user of this.
But maybe I'm wrong. But roughly you'd need something similar to what we
do for block devices, I'd reckon. So lookup_daxdev() which is similar to
lookup_bdev() and allows you to translate from path to dax device
number maybe.

lookup_daxdev(const char *name, struct dax_dev? *daxdev)
{
	/* Don't actually open the dax device pointlessly */
	kern_path(name, LOOKUP_FOLLOW, &path);
	inode = d_backing_inode(path.dentry);
	if (!S_ISCHR(inode->i_mode))
		// fail
	if (!may_open_dev(&path))
		// fail

	// check dax device and pin

	// get rid of path references
	path_put(&path);
}

famfs_get_tree(/* broken broken broken */)
{

	lookup_daxdev(fc->source, &ddev);

	sb = sget_fc(fc, famfs_test_super, set_anon_super_fc);
	if (IS_ERR(sb))
		// Error here may mean (aside from memory):
		// * superblock incompatible bc of read-write vs read-only
		// * non-matching user namespace
		// * FSCONFIG_CMD_CREATE_EXCL requested by mounter

	if (!sb->s_root) {
		// fill_super; new sb; dax device currently not used.
	} else {
		// A superblock for that dax device already exists and
		// may be reused. Any additional rejection reasons for
		// such an sb are up to the filesystem.
	}

	// Now really open or claim the dax device.
	// If you fail get rid of the superblock
	// (deactivate_locked_super()).
}

All handwavy, I know and probably I forgot details. But for you to fill
that in. ;)
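
The find-or-create decision in the sketch above can be modelled in plain
userspace C. This is a toy under stated assumptions: toy_sb, toy_sbs and
toy_get_sb() are made-up names, the table stands in for the superblock
list, and "dev" stands in for whatever identifies the dax device. It only
illustrates the three outcomes (new instance, reuse, EBUSY when exclusive
creation was requested), not locking, refcounting or the real sget_fc()
contract.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace toy model only; none of these names are kernel APIs. */
#define MAX_INSTANCES 8

struct toy_sb {
	unsigned long dev;	/* stands in for the dax device identity */
	bool in_use;
};

static struct toy_sb toy_sbs[MAX_INSTANCES];

/*
 * Find-or-create keyed on the device. If an instance for 'dev' already
 * exists it is reused, unless 'exclusive' is set, in which case the
 * caller gets -EBUSY. Otherwise a free slot becomes the new instance.
 */
static int toy_get_sb(unsigned long dev, bool exclusive, struct toy_sb **out)
{
	struct toy_sb *free_slot = NULL;

	for (size_t i = 0; i < MAX_INSTANCES; i++) {
		struct toy_sb *sb = &toy_sbs[i];

		if (sb->in_use && sb->dev == dev) {
			if (exclusive)
				return -EBUSY;
			*out = sb;	/* existing instance, reused */
			return 0;
		}
		if (!sb->in_use && !free_slot)
			free_slot = sb;
	}
	if (!free_slot)
		return -ENOMEM;

	free_slot->dev = dev;
	free_slot->in_use = true;
	*out = free_slot;	/* new instance; caller would fill it in */
	return 0;
}
```

A get_tree_nodev()-style path corresponds to skipping the lookup loop
entirely and always taking a free slot, which is why a second mount of
the same device is not detected there.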

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 11/20] famfs: Add fs_context_operations
  2024-02-28 10:07       ` Christian Brauner
@ 2024-02-28 12:01         ` Christian Brauner
  0 siblings, 0 replies; 105+ messages in thread
From: Christian Brauner @ 2024-02-28 12:01 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Jan Kara, Matthew Wilcox, linux-cxl,
	linux-fsdevel, linux-doc, linux-kernel, nvdimm, john,
	Dave Chinner, Christoph Hellwig, dave.hansen, gregory.price

On Wed, Feb 28, 2024 at 11:07:20AM +0100, Christian Brauner wrote:
> > I wasn't aware of the new fsconfig interface. Is there documentation or a
> > file system that already uses it that I should refer to? I didn't find an
> > obvious candidate, but it might be me. If it should be obvious from the
> > example above, tell me and I'll try harder.
> > 
> > My famfs code above was copied from ramfs. If you point me to 
> 
> Ok, but that's the wrong filesystem to use as a model imho. Because it
> really doesn't deal with devices at all. That's why it uses
> get_tree_nodev() with "nodev" as in "no device" kinda. So ramfs doesn't
> have any of these issues. Whereas your filesystem is dealing with
> dax (or pmem) devices.
> 
> > documentation I might send you a ramfs fsconfig patch too :D.
> 
> So the manpages are at:
> 
> https://github.com/brauner/man-pages-md
> 
> But really, there shouldn't be anything that needs to change for ramfs.
> 
> > > What errno is EALREADY? Isn't that socket stuff. In any case, it seems
> > > you want EBUSY?
> > 
> > Thanks... That should probably be EBUSY. But the whole famfs_context_list
> > should probably also be removed. More below...
> > 
> > > 
> > > But bigger picture I'm lost. And why do you keep that list based on
> > > strings? What if I do:
> > > 
> > > mount -t famfs /dev/pmem1234 /mnt # succeeds
> > > 
> > > mount -t famfs /dev/pmem1234 /opt # ah, fsck me, this fails.. But wait a minute....
> > > 
> > > mount --bind /dev/pmem1234 /evil-masterplan
> > > 
> > > mount -t famfs /evil-masterplan /opt # succeeds. YAY
> > > 
> > > I believe that would trivially defeat your check.
> > > 
> > 
> > And I suspect this is related to the get_tree issue you noticed below.
> > 
> > This famfs code was working in 6.5 without keeping the linked list of devices,
> > but in 6.6/6.7/6.8 it works provided you don't try to repeat a mount command
> > that has already succeeded. I'm not sure why 6.5 protected me from that,
> > but the later versions don't. In 6.6+ That hits a BUG_ON (have specifics on 
> > that but not handy right now).
> 
> get_tree_nodev() by default will always allocate a new superblock. This
> is how tmpfs and ramfs work. If you do:
> 
> mount -t tmpfs tmpfs /mnt
> mount -t tmpfs tmpfs /opt
> 
> You get two new, independent superblocks. This is what you want for
> these multi-instance filesystems: each new mount creates a new instance.
> 
> If famfs doesn't want to allow reusing devices - which I very much think
> it wants to prevent - then it cannot use get_tree_nodev() directly
> without having a hack like you did. Because you'll get a new superblock
> no problem. So the fact that it did work somehow likely was a bug in
> your code.
> 
> The reason your code causes crashes is very likely this:
> 
> struct famfs_fs_info *fsi = sb->s_fs_info;
> handlep = bdev_open_by_path(fc->source, FAMFS_BLKDEV_MODE, fsi, &fs_holder_ops);
> 
> If you look at Documentation/filesystems/porting.rst you should see that
> if you use @fs_holder_ops then your holder should be the struct
> super_block, not your personal fsinfo.
> 
> > So for a while we just removed repeated mount requests from the famfs smoke
> > tests, but eventually I implemented the list above, which - though you're right
> > it would be easy to circumvent and therefore is not right - it did solve the
> > problem that we were testing for.
> > 
> > I suspect that correctly handling get_tree might solve this problem.
> > 
> > Please assume that linked list will be removed - it was not the right solution.
> > 
> > More below...
> > 
> > > > +		}
> > > > +	}
> > > > +
> > > > +	list_add(&fsi->fsi_list, &famfs_context_list);
> > > > +	mutex_unlock(&famfs_context_mutex);
> > > > +
> > > > +	return get_tree_nodev(fc, famfs_fill_super);
> > > 
> > > So why isn't this using get_tree_bdev()? Note that a while ago I
> > > added FSCONFIG_CMD_CREATE_EXCL which prevents silent superblock reuse. To
> > > implement that I added fs_context->exclusive. If you unconditionally set
> > > fc->exclusive = 1 in your famfs_init_fs_context() and use
> > > get_tree_bdev() it will give you EBUSY if fc->source is already in use -
> > > including other famfs instances.
> > > 
> > > I also fail to yet understand how that function which actually opens the block
> > > device and gets the dax device figures into this. It's a bit hard to follow
> > > what's going on since you add all those unused functions and types so there's
> > > never a wider context to see that stuff in.
> > 
> > Clearly that's a bug in my code. That get_tree_nodev() is from ramfs, which
> > was the starting point for famfs.
> > 
> > I'm wondering if doing this correctly (get_tree_bdev() when it's pmem) would
> > have solved my double mount problem on 6.6 onward.
> > 
> > However, there's another wrinkle: I'm concluding
> > (see https://lore.kernel.org/linux-fsdevel/ups6cvjw6bx5m3hotn452brbbcgemnarsasre6ep2lbe4tpjsy@ezp6oh5c72ur/)
> > that famfs should drop block support and just work with /dev/dax. So famfs 
> > may be the first file system to be hosted on a character device? Certainly 
> > first on character dax. 
> 
> Ugh, ok. I defer to others whether that makes sense or not. It would be
> a lot easier for you if you used pmem block devices, I guess because it
> would be easy to detect reuse in common infrastructure.
> 
> But also, I'm looking at your code a bit closer. There's a bit of a
> wrinkle the way it's currently written...
> 
> Say someone went a bit weird and did:
> 
> mount -t xfs /dev/sda /my/xfs-filesystem
> mknod DAX_DEVICE /my/xfs-filesystem/dax1234
> 
> and then did:
> 
> mount -t famfs /my/xfs-filesystem/dax1234 /mnt
> 
> Internally in famfs you do:
> 
> fsi->dax_filp = filp_open(fc->source, O_RDWR, 0);
> 
> and you stash that file... Which means that you are pinning that xfs
> filesystem implicitly. IOW, if someone does:
> 
> umount /my/xfs-filesystem
> 
> they get EBUSY for completely opaque reasons. And if they did:
> 
> umount -l /my/xfs-filesystem
> 
> followed by mounting that xfs filesystem again they'd get the same
> superblock for that xfs filesystem.
> 
> What I'm trying to say is that I think you cannot pin another filesystem
> like this when you open that device.
> 
> IOW, you either need to stash the plain dax device or dax needs to
> become its own tiny internal pseudo fs such that we can open dax
> devices internally just like files. Which might actually also be worth
> doing. But I'm not the maintainer of that.

Ah, I see it's already like that and I was looking at the wrong file.
Great! So in that case you could add helper to open dax devices as
files:

struct file *dax_file_open(struct dax_device *dev, int flags, /* other stuff */)
{
	/* open that thing */
        dax_file = alloc_file_pseudo(dax_inode, dax_vfsmnt, "", flags | O_LARGEFILE, &something_fops);
}

and then you can treat them as regular files without running into the
issues I pointed out.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-27  2:05           ` John Groves
@ 2024-02-29  2:15             ` Dave Chinner
  2024-02-29 14:52               ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Dave Chinner @ 2024-02-29  2:15 UTC (permalink / raw)
  To: John Groves
  Cc: Luis Chamberlain, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Christoph Hellwig, dave.hansen,
	gregory.price

On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> On 24/02/26 04:58PM, Luis Chamberlain wrote:
> > On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@groves.net> wrote:
> > >
> > > On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > > > Run status group 0 (all jobs):
> > > > >   WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> > > >
> > > > > This is run on an xfs file system on a SATA ssd.
> > > >
> > > > To compare more closer apples to apples, wouldn't it make more sense
> > > > to try this with XFS on pmem (with fio -direct=1)?
> > > >
> > > >   Luis
> > >
> > > Makes sense. Here is the same command line I used with xfs before, but
> > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> > > because xfs requires that).
> > >
> > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
> > 
> > Could you try with mkfs.xfs -d agcount=1024

Won't change anything for the better, may make things worse.

>    bw (  MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
>    iops        : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
>   lat (usec)   : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
>   lat (usec)   : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
>   lat (msec)   : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%

Most of the IO latencies are up round the 4-20ms marks. That seems
kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
should only take a couple of hundred microseconds. For Famfs, the
latencies appear to be around 1-4ms.

So where's all that extra time coming from?
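
The back-of-envelope number above can be checked directly. A minimal
sketch, assuming "2MB" means fio's 2 MiB block size and taking the
10GB/s memcpy rate as given (copy_time_usec() is just an illustrative
helper):

```c
/*
 * Time in microseconds to copy 'bytes' at a sustained rate of
 * 'bytes_per_sec'. Pure arithmetic; no I/O is performed.
 */
static double copy_time_usec(double bytes, double bytes_per_sec)
{
	return bytes / bytes_per_sec * 1e6;
}
```

copy_time_usec(2.0 * 1024 * 1024, 10e9) comes out at roughly 210
microseconds, so latencies clustered at 4-20ms are well over an order of
magnitude above the raw copy cost.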


>   lat (msec)   : 100=0.08%
>   cpu          : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511

And why is system time reporting at almost zero instead of almost
all the remaining cpu time (i.e. up at 80-90%)?

Can you run call-graph kernel profiles for XFS and famfs whilst
running this workload so we have some insight into what is behaving
differently here?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
                   ` (20 preceding siblings ...)
  2024-02-24  0:07 ` [RFC PATCH 00/20] Introduce the famfs shared-memory file system Luis Chamberlain
@ 2024-02-29  6:52 ` Amir Goldstein
  2024-02-29 22:16   ` John Groves
  2024-05-17  9:55   ` Miklos Szeredi
  21 siblings, 2 replies; 105+ messages in thread
From: Amir Goldstein @ 2024-02-29  6:52 UTC (permalink / raw)
  To: John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Miklos Szeredi, Vivek Goyal

On Fri, Feb 23, 2024 at 7:42 PM John Groves <John@groves.net> wrote:
>
> This patch set introduces famfs[1] - a special-purpose fs-dax file system
> for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> CXL-specific in anyway way.
>
> * Famfs creates a simple access method for storing and sharing data in
>   sharable memory. The memory is exposed and accessed as memory-mappable
>   dax files.
> * Famfs supports multiple hosts mounting the same file system from the
>   same memory (something existing fs-dax file systems don't do).
> * A famfs file system can be created on either a /dev/pmem device in fs-dax
>   mode, or a /dev/dax device in devdax mode (the latter depending on
>   patches 2-6 of this series).
>
> The famfs kernel file system is part the famfs framework; additional
> components in user space[2] handle metadata and direct the famfs kernel
> module to instantiate files that map to specific memory. The famfs user
> space has documentation and a reasonably thorough test suite.
>

So can we say that Famfs is Fuse specialized for DAX?

I am asking because you seem to have asked it first:
https://lore.kernel.org/linux-fsdevel/0100018b2439ebf3-a442db6f-f685-4bc4-b4b0-28dc333f6712-000000@email.amazonses.com/
I guess that you did not get your answers to your questions before or at LPC?

I did not see your question back in October.
Let me try to answer your questions and we can discuss later if a new dedicated
kernel driver + userspace API is really needed, or if FUSE could be used as is
extended for your needs.

You wrote:
"...My naive reading of the existence of some sort of fuse/dax support
for virtiofs
suggested that there might be a way of doing this - but I may be wrong
about that."

I'm not virtiofs expert, but I don't think that you are wrong about this.
IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
by virtiofs client.

So what are the gaps between virtiofs and famfs that justify a new filesystem
driver and new userspace API?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-29  2:15             ` Dave Chinner
@ 2024-02-29 14:52               ` John Groves
  2024-03-11  1:29                 ` Dave Chinner
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-02-29 14:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Luis Chamberlain, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Christoph Hellwig, dave.hansen,
	gregory.price


Hi Dave!

On 24/02/29 01:15PM, Dave Chinner wrote:
> On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> > On 24/02/26 04:58PM, Luis Chamberlain wrote:
> > > On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@groves.net> wrote:
> > > >
> > > > On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > > > > Run status group 0 (all jobs):
> > > > > >   WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> > > > >
> > > > > > This is run on an xfs file system on a SATA ssd.
> > > > >
> > > > > To compare more closer apples to apples, wouldn't it make more sense
> > > > > to try this with XFS on pmem (with fio -direct=1)?
> > > > >
> > > > >   Luis
> > > >
> > > > Makes sense. Here is the same command line I used with xfs before, but
> > > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> > > > because xfs requires that).
> > > >
> > > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
> > > 
> > > Could you try with mkfs.xfs -d agcount=1024
> 
> Won't change anything for the better, may make things worse.

I dropped that arg, though performance looked about the same either way.

> 
> >    bw (  MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
> >    iops        : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
> >   lat (usec)   : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
> >   lat (usec)   : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
> >   lat (msec)   : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
> 
> Most of the IO latencies are up round the 4-20ms marks. That seems
> kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
> should only take a couple of hundred microseconds. For Famfs, the
> latencies appear to be around 1-4ms.
> 
> So where's all that extra time coming from?

Below, you will see two runs with performance and latency distribution
about the same as famfs (the answer for that was --fallocate=native).

> 
> 
> >   lat (msec)   : 100=0.08%
> >   cpu          : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
> 
> And why is system time reporting at almost zero instead of almost
> all the remaining cpu time (i.e. up at 80-90%)?

Something weird is going on with the cpu reporting. Sometimes sys=~0, but other times
it's about what you would expect. I suspect some sort of measurement error,
like maybe the method doesn't work with my cpu model? (I'm grasping, but with
a somewhat rational basis...)

I pasted two xfs runs below. The first has the wonky cpu sys value, and
the second looks about like what one would expect.

> 
> Can you run call-graph kernel profiles for XFS and famfs whilst
> running this workload so we have some insight into what is behaving
> differently here?

Can you point me to an example of how to do that?

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


I'd been thinking about the ~2x gap for a few days, and the most obvious
difference is famfs files must be preallocated (like fallocate, but works
a bit differently since allocation happens in user space). I just checked 
one of the xfs files, and it had maybe 80 extents (whereas the famfs 
files always have 1 extent here).
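
For comparison, preallocating a file on the xfs side can be sketched
with posix_fallocate(3). This is only an illustration of up-front
allocation on a conventional file system; famfs allocation happens in
user space by a different mechanism, and prealloc_file() is a
hypothetical helper name:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Preallocate 'len' bytes starting at offset 0 for 'fd' and return the
 * resulting file size, or -1 on failure. Note that posix_fallocate()
 * returns an error number directly rather than setting errno.
 */
static off_t prealloc_file(int fd, off_t len)
{
	struct stat st;

	if (posix_fallocate(fd, 0, len) != 0)
		return -1;
	if (fstat(fd, &st) != 0)
		return -1;
	return st.st_size;
}
```

Allocating the whole range before the first write gives the allocator
the chance to hand out a few large extents instead of growing the file
piecemeal during the benchmark.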

FWIW I ran xfs with and without io_uring, and there was no apparent
difference (which makes sense to me because it's not block I/O).

The prior ~2x gap still seems like a lot of overhead for extent list 
mapping to memory, but adding --fallocate=native to the xfs test brought 
it into line with famfs:


+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
...
fio-3.33
Starting 48 processes
Jobs: 38 (f=380): [W(5),_(1),W(12),_(1),W(3),_(1),W(2),_(1),W(2),_(1),W(1),_(1),W(1),_(1),W(6),_(1),W(6),_(2)][57.1%][w=28.0GiB/s][w=14.3k IOPS][eta 00m:03s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1452590: Thu Feb 29 07:46:06 2024
  write: IOPS=15.3k, BW=29.8GiB/s (32.0GB/s)(114GiB/3838msec); 0 zone resets
    slat (usec): min=17, max=55364, avg=668.20, stdev=1120.41
    clat (nsec): min=1368, max=99619k, avg=1982477.32, stdev=2198309.32
     lat (usec): min=179, max=99813, avg=2650.68, stdev=2485.15
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[   14], 10.00th=[  172], 20.00th=[  420],
     | 30.00th=[  644], 40.00th=[ 1057], 50.00th=[ 1582], 60.00th=[ 2008],
     | 70.00th=[ 2343], 80.00th=[ 3097], 90.00th=[ 4555], 95.00th=[ 5473],
     | 99.00th=[ 8717], 99.50th=[11863], 99.90th=[20055], 99.95th=[27657],
     | 99.99th=[49546]
   bw (  MiB/s): min=20095, max=59216, per=100.00%, avg=35985.47, stdev=318.61, samples=280
   iops        : min=10031, max=29587, avg=17970.76, stdev=159.29, samples=280
  lat (usec)   : 2=0.06%, 4=1.02%, 10=2.33%, 20=4.29%, 50=1.85%
  lat (usec)   : 100=0.20%, 250=3.26%, 500=11.23%, 750=8.87%, 1000=5.82%
  lat (msec)   : 2=20.95%, 4=26.74%, 10=12.60%, 20=0.66%, 50=0.09%
  lat (msec)   : 100=0.01%
  cpu          : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=29.8GiB/s (32.0GB/s), 29.8GiB/s-29.8GiB/s (32.0GB/s-32.0GB/s), io=114GiB (123GB), run=3838-3838msec

Disk stats (read/write):
  pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


## Here is a run where the cpu looks "normal"

+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1
...
fio-3.33
Starting 48 processes
Jobs: 19 (f=190): [W(2),_(1),W(2),_(8),W(1),_(3),W(1),_(1),W(2),_(2),W(1),_(1),W(3),_(2),W(1),_(1),W(1),_(2),W(2),_(7),W(3),_(1)][55.6%][w=26.7GiB/s][w=13.6k IOPS][eta 00m:04s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1463615: Thu Feb 29 08:19:53 2024
  write: IOPS=12.4k, BW=24.1GiB/s (25.9GB/s)(114GiB/4736msec); 0 zone resets
    clat (usec): min=138, max=117903, avg=2581.99, stdev=2704.61
     lat (usec): min=152, max=120405, avg=3019.04, stdev=2964.47
    clat percentiles (usec):
     |  1.00th=[  161],  5.00th=[  249], 10.00th=[  627], 20.00th=[ 1270],
     | 30.00th=[ 1631], 40.00th=[ 1942], 50.00th=[ 2089], 60.00th=[ 2212],
     | 70.00th=[ 2343], 80.00th=[ 2704], 90.00th=[ 5866], 95.00th=[ 6849],
     | 99.00th=[12387], 99.50th=[14353], 99.90th=[26084], 99.95th=[38536],
     | 99.99th=[78119]
   bw (  MiB/s): min=21204, max=47040, per=100.00%, avg=29005.40, stdev=237.31, samples=329
   iops        : min=10577, max=23497, avg=14479.74, stdev=118.65, samples=329
  lat (usec)   : 250=5.04%, 500=4.03%, 750=2.37%, 1000=3.13%
  lat (msec)   : 2=29.39%, 4=41.05%, 10=13.37%, 20=1.45%, 50=0.15%
  lat (msec)   : 100=0.03%, 250=0.01%
  cpu          : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=24.1GiB/s (25.9GB/s), 24.1GiB/s-24.1GiB/s (25.9GB/s-25.9GB/s), io=114GiB (123GB), run=4736-4736msec

Disk stats (read/write):
  pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


Cheers,
John


* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-29  6:52 ` Amir Goldstein
@ 2024-02-29 22:16   ` John Groves
  2024-05-17  9:55   ` Miklos Szeredi
  1 sibling, 0 replies; 105+ messages in thread
From: John Groves @ 2024-02-29 22:16 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Miklos Szeredi, Vivek Goyal

On 24/02/29 08:52AM, Amir Goldstein wrote:
> On Fri, Feb 23, 2024 at 7:42 PM John Groves <John@groves.net> wrote:
> >
> > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > CXL-specific in any way.
> >
> > * Famfs creates a simple access method for storing and sharing data in
> >   sharable memory. The memory is exposed and accessed as memory-mappable
> >   dax files.
> > * Famfs supports multiple hosts mounting the same file system from the
> >   same memory (something existing fs-dax file systems don't do).
> > * A famfs file system can be created on either a /dev/pmem device in fs-dax
> >   mode, or a /dev/dax device in devdax mode (the latter depending on
> >   patches 2-6 of this series).
> >
> > The famfs kernel file system is part of the famfs framework; additional
> > components in user space[2] handle metadata and direct the famfs kernel
> > module to instantiate files that map to specific memory. The famfs user
> > space has documentation and a reasonably thorough test suite.
> >
> 
> So can we say that Famfs is Fuse specialized for DAX?
> 
> I am asking because you seem to have asked it first:
> https://lore.kernel.org/linux-fsdevel/0100018b2439ebf3-a442db6f-f685-4bc4-b4b0-28dc333f6712-000000@email.amazonses.com/
> I guess that you did not get your answers to your questions before or at LPC?

Thanks for paying attention Amir. I think there is some validity to thinking
of famfs as Fuse for DAX. Administration / metadata originating in user space
is similar (but doing it this way also helps reduce RAS exposure to memory 
that might have a more complex connection path).

One way it differs from fuse is that famfs is very much aimed at use
cases that require performance. *Accessing* files must run at full
memory speeds.

> 
> I did not see your question back in October.
> Let me try to answer your questions and we can discuss later if a new dedicated
> kernel driver + userspace API is really needed, or if FUSE could be used as is
> or extended for your needs.
> 
> You wrote:
> "...My naive reading of the existence of some sort of fuse/dax support
> for virtiofs
> suggested that there might be a way of doing this - but I may be wrong
> about that."
> 
> I'm not a virtiofs expert, but I don't think that you are wrong about this.
> IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> by virtiofs client.
> 
> So what are the gaps between virtiofs and famfs that justify a new filesystem
> driver and new userspace API?

I have a lot of thoughts here, and an actual conversation might be good
sooner rather than later. I hope to be at LSFMM to discuss this - if you agree,
put in a vote for my topic ;). But if you want to talk sooner than that, I'm
interested.

I think one piece of evidence that this isn't possible with Fuse today is that
I had to plumb the iomap interface for /dev/dax in this patch set. That is the
way that fs-dax file systems communicate with the dax layer for fault 
resolution. If fuse/virtiofs handles dax somehow without the iomap interface,
I suspect it's doing something somehow simpler, /and/ that might need to get 
reconciled with the fs-dax methodology. Or maybe I don't know what I'm talking
about (in which case, please help :D).

I think one thing that might make sense would be to bring up this functionality
as a standalone file system, and then consider merging it into fuse when &
if the time seems right. 

Famfs doesn't currently have any up-calls. User space plays the log and tells
the kmod to instantiate files with extent lists to dax. Access happens with
zero user space involvement.

The important thing, the thing I'm currently paid for, is making it
practical to use disaggregated shared memory - it's ultimately not important 
which mechanism is used to enable a filesystem access method for memory.

But caching metadata in the kernel for efficient fault handling is the
only way to get it to perform at "memory speeds" so that appears critical.

One final observation: famfs has significantly more code in user space than
in kernel space, and it's the user side that is likely to grow over time.
That logic is at least theoretically independent of the kernel ABI.

> 
> Thanks,
> Amir.

Thanks!
John


* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-29 14:52               ` John Groves
@ 2024-03-11  1:29                 ` Dave Chinner
  0 siblings, 0 replies; 105+ messages in thread
From: Dave Chinner @ 2024-03-11  1:29 UTC (permalink / raw)
  To: John Groves
  Cc: Luis Chamberlain, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Christoph Hellwig, dave.hansen,
	gregory.price

On Thu, Feb 29, 2024 at 08:52:48AM -0600, John Groves wrote:
> On 24/02/29 01:15PM, Dave Chinner wrote:
> > On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> > >    bw (  MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
> > >    iops        : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
> > >   lat (usec)   : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
> > >   lat (usec)   : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
> > >   lat (msec)   : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
> > 
> > Most of the IO latencies are up round the 4-20ms marks. That seems
> > kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
> > should only take a couple of hundred microseconds. For Famfs, the
> > latencies appear to be around 1-4ms.
> > 
> > So where's all that extra time coming from?
> 
> Below, you will see two runs with performance and latency distribution
> about the same as famfs (the answer for that was --fallocate=native).

Ah, that is exactly what I suspected, and was wanting profiles
because that will show up in them clearly.

> > >   lat (msec)   : 100=0.08%
> > >   cpu          : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
> > 
> > And why is system time reporting at almost zero instead of almost
> > all the remaining cpu time (i.e. up at 80-90%)?
> 
> Something weird is going on with the cpu reporting. Sometimes sys=~0, but other times
> it's about what you would expect. I suspect some sort of measurement error,
> like maybe the method doesn't work with my cpu model? (I'm grasping, but with
> a somewhat rational basis...)
> 
> I pasted two xfs runs below. The first has the wonky cpu sys value, and
> the second looks about like what one would expect.
> 
> > 
> > Can you run call-graph kernel profiles for XFS and famfs whilst
> > running this workload so we have some insight into what is behaving
> > differently here?
> 
> Can you point me to an example of how to do that?

perf record --call-graph ...
perf report --call-graph ...


> I'd been thinking about the ~2x gap for a few days, and the most obvious
> difference is famfs files must be preallocated (like fallocate, but works
> a bit differently since allocation happens in user space). I just checked 
> one of the xfs files, and it had maybe 80 extents (whereas the famfs 
> files always have 1 extent here).

Which is about 4MB per extent. Extent size is not the problem for
zero-seek-latency storage hardware, though.

Essentially what you are seeing is interleaving extent allocation
between all the files because they are located in the same
directory. The locality algorithm is trying to place the data
extents close to the owner inode, but the inodes are also all close
together because they are located in the same AG as the parent
directory inode. Allocation concurrency is created by placing new
directories in different allocation groups, so we end up with
workloads in different directories being largely isolated from each
other.

However, that means when you are trying to write to many files in
the same directory at the same time, they are largely all competing
for the same AG lock to do block allocation during IO submission.
That creates interleaving of write() sized extents between different
files. We use speculative preallocation for buffered IO to avoid
this, and for direct IO the application needs to use extent size hints
or preallocation to avoid this contention based interleaving.

IOWs, by using fallocate() to preallocate all the space there will
be no allocation during IO submission and so the serialisation that
occurs due to competing allocations just goes away...
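
For illustration, a minimal sketch of that preallocation step, assuming
the writer knows the final file size before it starts issuing direct IO.
The helper name preallocate_file() is made up for this example, and it
uses posix_fallocate() rather than calling fallocate() directly, so glibc
may emulate the allocation by writing to the file on filesystems without
native support:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or open) `path` and reserve `size` bytes
 * up front, so that later writes into that range need no block allocation.
 * Returns 0 on success, or an errno-style error code on failure.
 */
int preallocate_file(const char *path, off_t size)
{
	int fd, err;

	assert(path != NULL);

	fd = open(path, O_CREAT | O_WRONLY, 0644);
	if (fd < 0)
		return errno;

	/* posix_fallocate() returns the error number directly, not -1. */
	err = posix_fallocate(fd, 0, size);

	close(fd);
	return err;
}
```

Calling preallocate_file(path, 256 << 20) on each file before the write
phase is meant to approximate what the --fallocate=native runs do through
fio.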

> FWIW I ran xfs with and without io_uring, and there was no apparent
> difference (which makes sense to me because it's not block I/O).
> 
> The prior ~2x gap still seems like a lot of overhead for extent list 
> mapping to memory, but adding --fallocate=native to the xfs test brought 
> it into line with famfs:

As I suspected. :)

As for CPU usage accounting, the number of context switches says it
all.

"Bad":

>   cpu          : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801

"good":

>   cpu          : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708

I'd say that in the "bad" case most of the kernel work is being
shuffled off to kernel threads to do the work and so it doesn't get
accounted to the submission task.  In comparison, in the "good" case
the work is being done in the submission thread and hence there's a
lot fewer context switches and the system time is correctly
accounted to the submission task.

Perhaps an io_uring task accounting problem?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-02-29  6:52 ` Amir Goldstein
  2024-02-29 22:16   ` John Groves
@ 2024-05-17  9:55   ` Miklos Szeredi
  2024-05-19  5:59     ` Amir Goldstein
  1 sibling, 1 reply; 105+ messages in thread
From: Miklos Szeredi @ 2024-05-17  9:55 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: John Groves, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Vivek Goyal

On Thu, 29 Feb 2024 at 07:52, Amir Goldstein <amir73il@gmail.com> wrote:

> I'm not a virtiofs expert, but I don't think that you are wrong about this.
> IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> by virtiofs client.
>
> So what are the gaps between virtiofs and famfs that justify a new filesystem
> driver and new userspace API?

Let me try to fill in some gaps.  I've looked at the famfs driver
(even tried to set it up in a VM, but got stuck with the EFI stuff).

- famfs has an extent list per file that indicates how each page
within the file should be mapped onto the dax device, IOW it has the
following mapping:

  [famfs file, offset] -> [offset, length]

- fuse can currently map a fuse file onto a backing file:

  [fuse file] -> [backing file]

The interface for the latter is

   backing_id = ioctl(dev_fuse_fd, FUSE_DEV_IOC_BACKING_OPEN, backing_map);
...
   fuse_open_out.flags |= FOPEN_PASSTHROUGH;
   fuse_open_out.backing_id = backing_id;

This looks suitable for doing the famfs file -> dax device mapping as
well.  I wouldn't extend the ioctl with extent information, since
famfs can just use FUSE_DEV_IOC_BACKING_OPEN once to register the dax
device.  The flags field could be used to tell the kernel to treat
this fd as a dax device instead of a regular file.

Later, when the file is opened, the extent list could be sent in the
open reply together with the backing id.  The fuse_ext_header
mechanism seems suitable for this.

And I think that's it as far as API's are concerned.

Note: this is already more generic than the current famfs prototype,
since multiple dax devices could be used as backing for famfs files,
with the constraint that a single file can only map data from a single
dax device.

As for implementing dax passthrough, I think that needs a separate
source file, the one used by virtiofs (fs/fuse/dax.c) does not appear
to have many commonalities with this one.  That could be renamed to
virtiofs_dax.c as it's pretty much virtiofs specific, AFAICT.

Comments?  Am I missing something significant?

Thanks,
Miklos

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-17  9:55   ` Miklos Szeredi
@ 2024-05-19  5:59     ` Amir Goldstein
  2024-05-22  2:05       ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Amir Goldstein @ 2024-05-19  5:59 UTC (permalink / raw)
  To: Miklos Szeredi, John Groves
  Cc: John Groves, Jonathan Corbet, Dan Williams, Vishal Verma,
	Dave Jiang, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Vivek Goyal

On Fri, May 17, 2024 at 12:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Thu, 29 Feb 2024 at 07:52, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I'm not a virtiofs expert, but I don't think that you are wrong about this.
> > IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> > by virtiofs client.
> >
> > So what are the gaps between virtiofs and famfs that justify a new filesystem
> > driver and new userspace API?
>
> Let me try to fill in some gaps.  I've looked at the famfs driver
> (even tried to set it up in a VM, but got stuck with the EFI stuff).
>
> - famfs has an extent list per file that indicates how each page
> within the file should be mapped onto the dax device, IOW it has the
> following mapping:
>
>   [famfs file, offset] -> [offset, length]
>
> - fuse can currently map a fuse file onto a backing file:
>
>   [fuse file] -> [backing file]
>
> The interface for the latter is
>
>    backing_id = ioctl(dev_fuse_fd, FUSE_DEV_IOC_BACKING_OPEN, backing_map);
> ...
>    fuse_open_out.flags |= FOPEN_PASSTHROUGH;
>    fuse_open_out.backing_id = backing_id;

FYI, library and example code was recently merged to libfuse:
https://github.com/libfuse/libfuse/pull/919

>
> This looks suitable for doing the famfs file -> dax device mapping as
> well.  I wouldn't extend the ioctl with extent information, since
> famfs can just use FUSE_DEV_IOC_BACKING_OPEN once to register the dax
> device.  The flags field could be used to tell the kernel to treat
> this fd as a dax device instead of a regular file.
>
> Later, when the file is opened, the extent list could be sent in the
> open reply together with the backing id.  The fuse_ext_header
> mechanism seems suitable for this.
>
> And I think that's it as far as API's are concerned.
>
> Note: this is already more generic than the current famfs prototype,
> since multiple dax devices could be used as backing for famfs files,
> with the constraint that a single file can only map data from a single
> dax device.
>
> As for implementing dax passthrough, I think that needs a separate
> source file, the one used by virtiofs (fs/fuse/dax.c) does not appear
> to have many commonalities with this one.  That could be renamed to
> virtiofs_dax.c as it's pretty much virtiofs specific, AFAICT.
>
> Comments?

Would probably also need to decouple CONFIG_FUSE_DAX
from CONFIG_FUSE_VIRTIO_DAX.

What about fc->dax_mode (i.e. dax= mount option)?

What about FUSE_IS_DAX()? does it apply to both dax implementations?

Sounds like a decent plan.
John, let us know if you need help understanding the details.

> Am I missing something significant?

Would we need to set IS_DAX() on inode init time or can we set it
later on first file open?

Currently, iomodes enforces that all opens are either
mapped to same backing file or none mapped to backing file:

fuse_inode_uncached_io_start()
{
...
        /* deny conflicting backing files on same fuse inode */

The iomodes rules will need to be amended to verify that:
- IS_DAX() inode open is always mapped to backing dax device
- All files of the same fuse inode are mapped to the same range
  of backing file/dax device.

Thanks,
Amir.

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-19  5:59     ` Amir Goldstein
@ 2024-05-22  2:05       ` John Groves
  2024-05-22  8:58         ` Miklos Szeredi
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-05-22  2:05 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Vivek Goyal

Initial reply to both Amir and Miklos. Sorry for the delay - I took a few
days off after LSFMM and I'm just re-engaging now.

First an observation: these messages are on the famfs v1 patch set thread.
The v2 patch set is at [1]. That is also the default branch now if you clone
the famfs kernel from [2].

Among the biggest changes at v2 is dropping /dev/pmem support and only 
supporting /dev/dax (character) devices as backing devs for famfs.

On 24/05/19 08:59AM, Amir Goldstein wrote:
> On Fri, May 17, 2024 at 12:55 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Thu, 29 Feb 2024 at 07:52, Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > I'm not a virtiofs expert, but I don't think that you are wrong about this.
> > > IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> > > by virtiofs client.
> > >
> > > So what are the gaps between virtiofs and famfs that justify a new filesystem
> > > driver and new userspace API?
> >
> > Let me try to fill in some gaps.  I've looked at the famfs driver
> > (even tried to set it up in a VM, but got stuck with the EFI stuff).

I'm happy to help with that if you care - ping me if so; getting a VM running 
in EFI mode is not necessary if you reserve the dax memory via memmap=, or
via libvirt xml.

> >
> > - famfs has an extent list per file that indicates how each page
> > within the file should be mapped onto the dax device, IOW it has the
> > following mapping:
> >
> >   [famfs file, offset] -> [offset, length]

More generally, a famfs file extent is [daxdev, offset, len]; there may
be multiple extents per file, and in the future this definitely needs to
generalize to multiple daxdev's.
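
To make the extent model concrete, here is a rough sketch of how a file
offset could be resolved through such a list. The struct and function
names (demo_extent, demo_resolve) are hypothetical and are not the famfs
kernel or on-media structures; the sketch assumes extents are stored in
file order and laid out back to back:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: one [daxdev, offset, len] tuple. */
struct demo_extent {
	uint32_t daxdev;	/* index of the backing dax device */
	uint64_t offset;	/* byte offset within that device */
	uint64_t len;		/* extent length in bytes */
};

/*
 * Resolve a byte offset within a file to (device, device offset) by
 * walking the file's extent list in order.
 * Returns 0 on success, -1 if file_off lies beyond the last extent.
 */
int demo_resolve(const struct demo_extent *ext, size_t nr,
		 uint64_t file_off, uint32_t *daxdev, uint64_t *dev_off)
{
	assert(ext != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (file_off < ext[i].len) {
			*daxdev = ext[i].daxdev;
			*dev_off = ext[i].offset + file_off;
			return 0;
		}
		/* Skip this extent and keep looking in the next one. */
		file_off -= ext[i].len;
	}
	return -1;
}
```

With a single extent per file (the common case today) the loop runs once;
the daxdev field is what would let one file span several devices later.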

Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly, 
I think)...

A single backing device (daxdev) will contain extents of many famfs
files (plus metadata - currently a superblock and a log). I'm not sure
it's realistic to have a backing daxdev "open" per famfs file. 

In addition there is:

- struct dax_holder_operations - to allow a notify_failure() upcall
  from dax. This provides the critical capability to shut down famfs
  if there are memory errors. This is filesystem- (or technically daxdev-
  wide)

- The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
  and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
  I strongly suspect that famfs_fuse can't be correct unless it uses
  this path rather than just the idea of a single backing file.
  This interface explicitly supports files that map to disjoint ranges
  of one or more dax devices.

- the dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
  character devdax.

- Note that dax devices, unlike files, don't support read/write - only
  mmap(). I suspect (though I'm still pretty ignorant) that this means
  we can't just treat the dax device as an extent-based backing file.


> >
> > - fuse can currently map a fuse file onto a backing file:
> >
> >   [fuse file] -> [backing file]
> >
> > The interface for the latter is
> >
> >    backing_id = ioctl(dev_fuse_fd, FUSE_DEV_IOC_BACKING_OPEN, backing_map);
> > ...
> >    fuse_open_out.flags |= FOPEN_PASSTHROUGH;
> >    fuse_open_out.backing_id = backing_id;
> 
> FYI, library and example code was recently merged to libfuse:
> https://github.com/libfuse/libfuse/pull/919
> 
> >
> > This looks suitable for doing the famfs file -> dax device mapping as
> > well.  I wouldn't extend the ioctl with extent information, since
> > famfs can just use FUSE_DEV_IOC_BACKING_OPEN once to register the dax
> > device.  The flags field could be used to tell the kernel to treat
> > this fd as a dax device instead of a regular file.

A dax device to famfs is a lot more like a backing device for a "filesystem"
than a backing file for another file. And, as previously mentioned, there
is the iomap_ops interface and the holder_ops interface that deal with
multiple file tenants on a dax device (plus error notification, 
respectively).

Probably doable, but important distinctions...

> >
> > Later, when the file is opened, the extent list could be sent in the
> > open reply together with the backing id.  The fuse_ext_header
> > mechanism seems suitable for this.
> >
> > And I think that's it as far as API's are concerned.
> >
> > Note: this is already more generic than the current famfs prototype,
> > since multiple dax devices could be used as backing for famfs files,
> > with the constraint that a single file can only map data from a single
> > dax device.
> >
> > As for implementing dax passthrough, I think that needs a separate
> > source file, the one used by virtiofs (fs/fuse/dax.c) does not appear
> > to have many commonalities with this one.  That could be renamed to
> > virtiofs_dax.c as it's pretty much virtiofs specific, AFAICT.
> >
> > Comments?
> 
> Would probably also need to decouple CONFIG_FUSE_DAX
> from CONFIG_FUSE_VIRTIO_DAX.
> 
> What about fc->dax_mode (i.e. dax= mount option)?
> 
> What about FUSE_IS_DAX()? does it apply to both dax implementations?
> 
> Sounds like a decent plan.
> John, let us know if you need help understanding the details.

I'm certain I will need some help, but I'll try to do my part. 

First question: can you suggest an example fuse file pass-through
file system that I might use as a jumping-off point? Something that
gets the basic pass-through capability from which to start hacking
in famfs/dax capabilities?

When I started on famfs, I used ramfs because it got me all the basic
file system functionality minus a backing store. Then I built the dax
functionality by referring to xfs. 

> 
> > Am I missing something significant?
> 
> Would we need to set IS_DAX() on inode init time or can we set it
> later on first file open?
> 
> Currently, iomodes enforces that all opens are either
> mapped to same backing file or none mapped to backing file:
> 
> fuse_inode_uncached_io_start()
> {
> ...
>         /* deny conflicting backing files on same fuse inode */
> 
> The iomodes rules will need to be amended to verify that:
> - IS_DAX() inode open is always mapped to backing dax device
> - All files of the same fuse inode are mapped to the same range
>   of backing file/dax device.

I'm confused by the last item. I would think there would be a fuse
inode per famfs file, and that multiple of those would map to separate
extent lists of one or more backing dax devices.

Or maybe I misunderstand the meaning of "fuse inode". Feel free to
assign reading...

> 
> Thanks,
> Amir.

Thanks Miklos and Amir,
John

[1] https://lore.kernel.org/linux-fsdevel/cover.1714409084.git.john@groves.net/T/#m3b11e8d311eca80763c7d6f27d43efd1cdba628b
[2] https://github.com/cxl-micron-reskit/famfs-linux



* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-22  2:05       ` John Groves
@ 2024-05-22  8:58         ` Miklos Szeredi
  2024-05-22 10:16           ` Amir Goldstein
  2024-05-23  2:49           ` John Groves
  0 siblings, 2 replies; 105+ messages in thread
From: Miklos Szeredi @ 2024-05-22  8:58 UTC (permalink / raw)
  To: John Groves
  Cc: Amir Goldstein, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Vivek Goyal

On Wed, 22 May 2024 at 04:05, John Groves <John@groves.net> wrote:
> I'm happy to help with that if you care - ping me if so; getting a VM running
> in EFI mode is not necessary if you reserve the dax memory via memmap=, or
> via libvirt xml.

Could you please give an example?

I use a raw qemu command line with a -kernel option and a root fs
image (not a disk image with a bootloader).


> More generally, a famfs file extent is [daxdev, offset, len]; there may
> be multiple extents per file, and in the future this definitely needs to
> generalize to multiple daxdev's.
>
> Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly,
> I think)...
>
> A single backing device (daxdev) will contain extents of many famfs
> files (plus metadata - currently a superblock and a log). I'm not sure
> it's realistic to have a backing daxdev "open" per famfs file.

That's exactly what I was saying.

The passthrough interface was deliberately done in a way to separate
the mapping into two steps:

 1) registering the backing file (which could be a device)

 2) mapping from a fuse file to a registered backing file

Step 1 can happen at any time, while step 2 currently happens at open,
but for various other purposes like metadata passthrough it makes
sense to allow the mapping to happen at lookup time and be cached for
the lifetime of the inode.

> In addition there is:
>
> - struct dax_holder_operations - to allow a notify_failure() upcall
>   from dax. This provides the critical capability to shut down famfs
>   if there are memory errors. This is filesystem- (or technically daxdev-
>   wide)

This can be hooked into fuse_is_bad().

> - The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
>   and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
>   I strongly suspect that famfs_fuse can't be correct unless it uses
>   this path rather than just the idea of a single backing file.

Agreed.

> - the dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
>   character devdax.

You'll need to channel those patches through the respective
maintainers, preferably before the fuse parts are merged.

> - Note that dax devices, unlike files, don't support read/write - only
>   mmap(). I suspect (though I'm still pretty ignorant) that this means
>   we can't just treat the dax device as an extent-based backing file.

Doesn't matter, it'll use the iomap infrastructure instead of the
passthrough infrastructure.

But the interfaces for regular passthrough and fsdax could be shared.
Conceptually they are very similar:  there's a backing store indexable
with byte offsets.

What's currently missing from the API is an extent list in
fuse_open_out.   The format could be:

  [ {backing_id, offset, length}, ... ]

allowing each extent to map to a different backing device.
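
A rough sketch of what such an entry could look like, and of the check
implied by the two-step scheme (every extent must name a backing device
that was registered earlier and must fit inside it). None of this exists
in the fuse ABI today: the struct layouts, field names and the
demo_extents_valid() helper are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire format for one extent in the open reply. */
struct demo_fuse_extent {
	uint32_t backing_id;	/* id handed out when the device was registered */
	uint32_t padding;
	uint64_t offset;	/* byte offset within the backing device */
	uint64_t length;	/* extent length in bytes */
};

/* Hypothetical record of a device registered in step 1. */
struct demo_backing {
	uint32_t backing_id;
	uint64_t size;		/* device size in bytes */
};

/*
 * Return 1 if every extent is non-empty, names a registered backing
 * device and lies entirely within it; return 0 otherwise.
 */
int demo_extents_valid(const struct demo_fuse_extent *ext, size_t nr_ext,
		       const struct demo_backing *dev, size_t nr_dev)
{
	assert(ext != NULL || nr_ext == 0);
	assert(dev != NULL || nr_dev == 0);

	for (size_t i = 0; i < nr_ext; i++) {
		int found = 0;

		if (ext[i].length == 0)
			return 0;

		for (size_t j = 0; j < nr_dev; j++) {
			if (dev[j].backing_id != ext[i].backing_id)
				continue;
			/* Range check written to avoid offset+length overflow. */
			if (ext[i].offset <= dev[j].size &&
			    ext[i].length <= dev[j].size - ext[i].offset)
				found = 1;
			break;
		}
		if (!found)
			return 0;
	}
	return 1;
}
```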

> A dax device to famfs is a lot more like a backing device for a "filesystem"
> than a backing file for another file. And, as previously mentioned, there
> is the iomap_ops interface and the holder_ops interface that deal with
> multiple file tenants on a dax device (plus error notification,
> respectively)
>
> Probably doable, but important distinctions...

Yeah, that's why I suggested to create a new source file for this
within fs/fuse.  Alternatively we could try splitting up fuse into
modules (core, virtiofs, cuse, fsdax) but I think that can be left as
a cleanup step.

> First question: can you suggest an example fuse file pass-through
> file system that I might use as a jumping-off point? Something that
> gets the basic pass-through capability from which to start hacking
> in famfs/dax capabilities?

An example is in Amir's libfuse repo at

   https://github.com/libfuse/libfuse

> I'm confused by the last item. I would think there would be a fuse
> inode per famfs file, and that multiple of those would map to separate
> extent lists of one or more backing dax devices.

Yeah.

> Or maybe I misunderstand the meaning of "fuse inode". Feel free to
> assign reading...

I think Amir meant that each open file could in theory have a
different mapping.  This is allowed by the fuse interface, but is
disallowed in practice.

I'm in favor of caching the extent map so it only has to be given on
the first open (or lookup).

Thanks,
Miklos

* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-22  8:58         ` Miklos Szeredi
@ 2024-05-22 10:16           ` Amir Goldstein
  2024-05-22 11:28             ` Miklos Szeredi
  2024-05-23  2:49           ` John Groves
  1 sibling, 1 reply; 105+ messages in thread
From: Amir Goldstein @ 2024-05-22 10:16 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: John Groves, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Vivek Goyal, Bernd Schubert

On Wed, May 22, 2024 at 11:58 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, 22 May 2024 at 04:05, John Groves <John@groves.net> wrote:
> > I'm happy to help with that if you care - ping me if so; getting a VM running
> > in EFI mode is not necessary if you reserve the dax memory via memmap=, or
> > via libvirt xml.
>
> Could you please give an example?
>
> I use a raw qemu command line with a -kernel option and a root fs
> image (not a disk image with a bootloader).
>
>
> > More generally, a famfs file extent is [daxdev, offset, len]; there may
> > be multiple extents per file, and in the future this definitely needs to
> > generalize to multiple daxdev's.
> >
> > Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly,
> > I think)...
> >
> > A single backing device (daxdev) will contain extents of many famfs
> > files (plus metadata - currently a superblock and a log). I'm not sure
> > it's realistic to have a backing daxdev "open" per famfs file.
>
> That's exactly what I was saying.
>
> The passthrough interface was deliberately done in a way to separate
> the mapping into two steps:
>
>  1) registering the backing file (which could be a device)
>
>  2) mapping from a fuse file to a registered backing file
>
> Step 1 can happen at any time, while step 2 currently happens at open,
> but for various other purposes like metadata passthrough it makes
> sense to allow the mapping to happen at lookup time and be cached for
> the lifetime of the inode.
>
> > In addition there is:
> >
> > - struct dax_holder_operations - to allow a notify_failure() upcall
> >   from dax. This provides the critical capability to shut down famfs
> >   if there are memory errors. This is filesystem- (or technically daxdev-
> >   wide)
>
> This can be hooked into fuse_is_bad().
>
> > - The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
> >   and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
> >   I strongly suspect that famfs_fuse can't be correct unless it uses
> >   this path rather than just the idea of a single backing file.
>
> Agreed.
>
> > - the dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
> >   character devdax.
>
> You'll need to channel those patches through the respective
> maintainers, preferably before the fuse parts are merged.
>
> > - Note that dax devices, unlike files, don't support read/write - only
> >   mmap(). I suspect (though I'm still pretty ignorant) that this means
> >   we can't just treat the dax device as an extent-based backing file.
>
> Doesn't matter, it'll use the iomap infrastructure instead of the
> passthrough infrastructure.
>
> But the interfaces for regular passthrough and fsdax could be shared.
> Conceptually they are very similar:  there's a backing store indexable
> with byte offsets.
>
> What's currently missing from the API is an extent list in
> fuse_open_out.   The format could be:
>
>   [ {backing_id, offset, length}, ... ]
>
> allowing each extent to map to a different backing device.
>
> > A dax device to famfs is a lot more like a backing device for a "filesystem"
> > than a backing file for another file. And, as previously mentioned, there
> > is the iomap_ops interface and the holder_ops interface that deal with
> > multiple file tenants on a dax device (plus error notification,
> > respectively)
> >
> > Probably doable, but important distinctions...
>
> Yeah, that's why I suggested to create a new source file for this
> within fs/fuse.  Alternatively we could try splitting up fuse into
> modules (core, virtiofs, cuse, fsdax) but I think that can be left as
> a cleanup step.
>
> > First question: can you suggest an example fuse file pass-through
> > file system that I might use as a jumping-off point? Something that
> > gets the basic pass-through capability from which to start hacking
> > in famfs/dax capabilities?
>
> An example is in Amir's libfuse repo at
>
>    https://github.com/libfuse/libfuse
>

That's not my repo, it's the official one ;-)
but yeh, my passthrough example got merged last week:
https://github.com/libfuse/libfuse/pull/919

> > I'm confused by the last item. I would think there would be a fuse
> > inode per famfs file, and that multiple of those would map to separate
> > extent lists of one or more backing dax devices.
>
> Yeah.
>
> > Or maybe I misunderstand the meaning of "fuse inode". Feel free to
> > assign reading...
>
> I think Amir meant that each open file could in theory have a
> different mapping.  This is allowed by the fuse interface, but is
> disallowed in practice.
>
> I'm in favor of caching the extent map so it only has to be given on
> the first open (or lookup).

Yeh, sorry, that was a bit confusing.
The statement is that because the simplest plan, as Miklos
suggested, is to pass the extent list in reply to open,
two different opens of the same inode are not allowed to
pass in different extent lists.

The new iomode.c code does something similar.
Currently fuse_inode has a reference to fuse_backing, which
stores the backing file (that can be the dax device), and it also
has a reference to fuse_inode_dax with an rbtree of fuse_dax_mapping.
Can we reuse fuse_inode_dax for the needs of famfs?

The first open would cache the extent list in fuse_inode and
second open would verify that the extent list matches.

Last file close could clean the cache extent list or not - that
is an API decision.
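
A minimal userspace sketch of the cache-and-verify idea above, not kernel
code. The names sketch_extent, sketch_inode, sketch_open_extents and
sketch_drop_extents are hypothetical, the record layout just follows the
{backing_id, offset, length} format proposed earlier in the thread, and
returning -EBUSY on a mismatch is an assumption, not a decided API:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical extent record: {backing_id, offset, length}. */
struct sketch_extent {
	uint32_t backing_id;
	uint64_t offset;
	uint64_t length;
};

/* Hypothetical per-inode cache; stands in for state hung off fuse_inode. */
struct sketch_inode {
	struct sketch_extent *ext;
	size_t nr_ext;
};

/* Compare field by field so struct padding never affects the result. */
static int extent_equal(const struct sketch_extent *a,
			const struct sketch_extent *b)
{
	return a->backing_id == b->backing_id &&
	       a->offset == b->offset &&
	       a->length == b->length;
}

/*
 * First open: cache a private copy of the extent list.
 * Later opens: succeed only if the list is identical to the cached one.
 * Returns 0 on success or a negative errno value.
 */
int sketch_open_extents(struct sketch_inode *si,
			const struct sketch_extent *ext, size_t nr)
{
	size_t i;

	if (!ext || nr == 0)
		return -EINVAL;

	if (!si->ext) {
		struct sketch_extent *copy = calloc(nr, sizeof(*copy));

		if (!copy)
			return -ENOMEM;
		memcpy(copy, ext, nr * sizeof(*copy));
		si->ext = copy;
		si->nr_ext = nr;
		return 0;
	}

	if (nr != si->nr_ext)
		return -EBUSY;
	for (i = 0; i < nr; i++)
		if (!extent_equal(&si->ext[i], &ext[i]))
			return -EBUSY;
	return 0;
}

/* Drop the cached list, e.g. on last close or on inode evict. */
void sketch_drop_extents(struct sketch_inode *si)
{
	free(si->ext);
	si->ext = NULL;
	si->nr_ext = 0;
}
```

Whether sketch_drop_extents() runs on last close or only at evict is
exactly the API decision mentioned above.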

Thanks,
Amir.


* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-22 10:16           ` Amir Goldstein
@ 2024-05-22 11:28             ` Miklos Szeredi
  2024-05-22 13:41               ` Amir Goldstein
  0 siblings, 1 reply; 105+ messages in thread
From: Miklos Szeredi @ 2024-05-22 11:28 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: John Groves, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Vivek Goyal, Bernd Schubert

On Wed, 22 May 2024 at 12:16, Amir Goldstein <amir73il@gmail.com> wrote:

> The first open would cache the extent list in fuse_inode and
> second open would verify that the extent list matches.
>
> Last file close could clean the cache extent list or not - that
> is an API decision.

Well, the current API clears the mapping, and I would treat fi->fb as
just a special case of the extent list.  So by default I'd keep this
behavior, but perhaps it would make sense to optionally allow the
mapping to remain after the last close.  For now this is probably not
relevant...
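
The two lifetime policies can be sketched in userspace C as below. This
is only an illustration: struct sketch_mapping, its keep_after_close
flag and both functions are hypothetical names, not existing fuse code.

```c
#include <stdbool.h>

/* Hypothetical per-inode mapping state. */
struct sketch_mapping {
	unsigned int open_count;   /* number of currently open files */
	bool keep_after_close;     /* assumed opt-in: survive last close */
	bool mapped;               /* is a mapping currently attached? */
};

/* Every open takes a reference; the first one attaches the mapping. */
void sketch_open(struct sketch_mapping *m)
{
	m->open_count++;
	m->mapped = true;
}

/*
 * Drop one reference. By default the mapping is cleared on the last
 * close; with keep_after_close set it stays attached.
 * Returns true only if this call cleared the mapping.
 */
bool sketch_release(struct sketch_mapping *m)
{
	if (m->open_count == 0)
		return false;          /* unbalanced release, ignore */
	if (--m->open_count > 0)
		return false;          /* other opens still hold it */
	if (m->keep_after_close)
		return false;          /* opted in to keeping it */
	m->mapped = false;
	return true;
}
```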

Thanks,
Miklos


* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-22 11:28             ` Miklos Szeredi
@ 2024-05-22 13:41               ` Amir Goldstein
  0 siblings, 0 replies; 105+ messages in thread
From: Amir Goldstein @ 2024-05-22 13:41 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: John Groves, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Vivek Goyal, Bernd Schubert

On Wed, May 22, 2024 at 2:28 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, 22 May 2024 at 12:16, Amir Goldstein <amir73il@gmail.com> wrote:
>
> > The first open would cache the extent list in fuse_inode and
> > second open would verify that the extent list matches.
> >
> > Last file close could clean the cache extent list or not - that
> > is an API decision.
>
> Well, the current API clears the mapping, and I would treat fi->fb as
> just a special case of the extent list.  So by default I'd keep this
> behavior, but perhaps it would make sense to optionally allow the
> mapping to remain after the last close.  For now this is probably not
> relevant...

Already in the works ;)

Not tested - probably not working POC:
https://github.com/amir73il/linux/commits/fuse-backing-inode-wip

I am trying an API to opt into inode operation passthrough, which
has a by-product of keeping fi->fb around after last close.

This is designed to be set up on lookup, but could also be set up on
first open.

I have some ideas for how to return the backing id with the lookup
(and readdirplus) response, but haven't tried them yet.
But a backing file set up from the lookup response will surely
stick around until inode evict.

Thanks,
Amir.


* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-22  8:58         ` Miklos Szeredi
  2024-05-22 10:16           ` Amir Goldstein
@ 2024-05-23  2:49           ` John Groves
  2024-05-23 13:57             ` Miklos Szeredi
  1 sibling, 1 reply; 105+ messages in thread
From: John Groves @ 2024-05-23  2:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Amir Goldstein, John Groves, Jonathan Corbet, Dan Williams,
	Vishal Verma, Dave Jiang, Alexander Viro, Christian Brauner,
	Jan Kara, Matthew Wilcox, linux-cxl, linux-fsdevel, linux-doc,
	linux-kernel, nvdimm, john, Dave Chinner, Christoph Hellwig,
	dave.hansen, gregory.price, Vivek Goyal

On 24/05/22 10:58AM, Miklos Szeredi wrote:
> On Wed, 22 May 2024 at 04:05, John Groves <John@groves.net> wrote:
> > I'm happy to help with that if you care - ping me if so; getting a VM running
> > in EFI mode is not necessary if you reserve the dax memory via memmap=, or
> > via libvirt xml.
> 
> Could you please give an example?
> 
> I use a raw qemu command line with a -kernel option and a root fs
> image (not a disk image with a bootloader).

That's not the way I'm running VMs, but... I presume you know how to add
kernel command line arguments to VMs that you run this way?

- memmap=<size>!<hpa_offset> will reserve a pretend pmem device at <hpa_offset>
- memmap=<size>$<hpa_offset> will reserve a pretend dax device at <hpa_offset>

Both of the above will work regardless of whether the VM is in EFI mode.
The '$' is harder to escape through grub; and the pmem device can be converted
to devdax via 'ndctl reconfigure-device --mode=devdax...'. A dax device would
likely also need to be put in devdax mode (as the default seems to be 
system-ram mode).  

Incomplete documentation (that you have probably already seen) is at [1]

I can dig deeper if needed.

Otherwise the feedback in this thread makes sense to me and I'm planning to 
start hacking on famfs patches Thursday. Watch this space ;)

Regards,
John

[1] https://github.com/cxl-micron-reskit/famfs/blob/master/markdown/vm-configuration.md



* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-23  2:49           ` John Groves
@ 2024-05-23 13:57             ` Miklos Szeredi
  2024-05-24  0:47               ` John Groves
  0 siblings, 1 reply; 105+ messages in thread
From: Miklos Szeredi @ 2024-05-23 13:57 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl, linux-fsdevel, linux-kernel, nvdimm

[trimming CC list]

On Thu, 23 May 2024 at 04:49, John Groves <John@groves.net> wrote:

> - memmap=<size>!<hpa_offset> will reserve a pretend pmem device at <hpa_offset>
> - memmap=<size>$<hpa_offset> will reserve a pretend dax device at <hpa_offset>

Doesn't get me a /dev/dax or /dev/pmem

Complete qemu command line:

qemu-kvm -s -serial none -parallel none -kernel
/home/mszeredi/git/linux/arch/x86/boot/bzImage -drive
format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive
format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev
stdio,id=virtiocon0,signal=off -device virtio-serial -device
virtconsole,chardev=virtiocon0 -cpu host -m 8G -net user -net
nic,model=virtio -fsdev local,security_model=none,id=fsdev0,path=/home
-device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -device
virtio-rng-pci -smp 4 -append 'root=/dev/vda console=hvc0
memmap=4G$4G'

root@kvm:~/famfs# scripts/chk_efi.sh
This system is neither Ubuntu nor Fedora. It is identified as debian.
/sys/firmware/efi not found; probably not efi
 not found; probably nof efi
/boot/efi/EFI not found; probably not efi
/boot/efi/EFI/BOOT not found; probably not efi
/boot/efi/EFI/ not found; probably not efi
/boot/efi/EFI//grub.cfg not found; probably nof efi
Probably not efi; errs=6

Thanks,
Miklos


* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-23 13:57             ` Miklos Szeredi
@ 2024-05-24  0:47               ` John Groves
  2024-05-24  7:55                 ` Miklos Szeredi
  0 siblings, 1 reply; 105+ messages in thread
From: John Groves @ 2024-05-24  0:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-cxl, linux-fsdevel, linux-kernel, nvdimm

On 24/05/23 03:57PM, Miklos Szeredi wrote:
> [trimming CC list]
> 
> On Thu, 23 May 2024 at 04:49, John Groves <John@groves.net> wrote:
> 
> > - memmap=<size>!<hpa_offset> will reserve a pretend pmem device at <hpa_offset>
> > - memmap=<size>$<hpa_offset> will reserve a pretend dax device at <hpa_offset>
> 
> Doesn't get me a /dev/dax or /dev/pmem
> 
> Complete qemu command line:
> 
> qemu-kvm -s -serial none -parallel none -kernel
> /home/mszeredi/git/linux/arch/x86/boot/bzImage -drive
> format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive
> format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev
> stdio,id=virtiocon0,signal=off -device virtio-serial -device
> virtconsole,chardev=virtiocon0 -cpu host -m 8G -net user -net
> nic,model=virtio -fsdev local,security_model=none,id=fsdev0,path=/home
> -device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -device
> virtio-rng-pci -smp 4 -append 'root=/dev/vda console=hvc0
> memmap=4G$4G'
> 
> root@kvm:~/famfs# scripts/chk_efi.sh
> This system is neither Ubuntu nor Fedora. It is identified as debian.
> /sys/firmware/efi not found; probably not efi
>  not found; probably nof efi
> /boot/efi/EFI not found; probably not efi
> /boot/efi/EFI/BOOT not found; probably not efi
> /boot/efi/EFI/ not found; probably not efi
> /boot/efi/EFI//grub.cfg not found; probably nof efi
> Probably not efi; errs=6
> 
> Thanks,
> Miklos


Apologies, but I'm short on time at the moment - going into a long holiday
weekend in the US with family plans. I should be focused again by middle of
next week.

But can you check /proc/cmdline to see if the memmap arg got through without
getting mangled? The '$' tends to get fubar'd. You might need \$, or I've seen
the need for \\\$. If it's un-mangled, there should be a dax device.

If that doesn't work, it's worth trying '!' instead, which I think would give
you a pmem device - if the arg gets through (but ! is less likely to get
horked). That pmem device can be converted to devdax...
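
A small C sketch of that check, in case it is handy: it looks for a
memmap= token in a command-line string and reports whether the size,
separator and offset all survived. The names memmap_arg and find_memmap
are made up for illustration, the suffix handling is a simplified
stand-in for the kernel's own parsing, and only the '$' and '!'
separators discussed here are accepted (the other memmap= forms, such
as '@' and '#', are rejected on purpose).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical result record for one memmap=<size><sep><offset> token. */
struct memmap_arg {
	unsigned long long size;
	char sep;                  /* '$' or '!' */
	unsigned long long start;
};

/* Parse a number with an optional K/M/G suffix; advance *p past it. */
static int parse_size(const char **p, unsigned long long *out)
{
	char *end;
	unsigned long long v = strtoull(*p, &end, 0);

	if (end == *p)
		return -1;         /* no digits at all */
	switch (*end) {
	case 'G': case 'g': v <<= 30; end++; break;
	case 'M': case 'm': v <<= 20; end++; break;
	case 'K': case 'k': v <<= 10; end++; break;
	}
	*p = end;
	*out = v;
	return 0;
}

/*
 * Find the first memmap= token in cmdline and parse it.
 * Returns 0 if the token is complete, -1 if it is absent or mangled
 * (for example "memmap=4G" after a shell swallowed "$4G").
 */
int find_memmap(const char *cmdline, struct memmap_arg *out)
{
	const char *p = cmdline;

	while ((p = strstr(p, "memmap=")) != NULL) {
		/* Only accept a match at the start of a token. */
		if (p != cmdline && p[-1] != ' ') {
			p += 7;
			continue;
		}
		p += 7;
		if (parse_size(&p, &out->size))
			return -1;
		if (*p != '$' && *p != '!')
			return -1;
		out->sep = *p++;
		if (parse_size(&p, &out->start))
			return -1;
		/* The token must end here. */
		if (*p != '\0' && *p != ' ' && *p != '\n')
			return -1;
		return 0;
	}
	return -1;
}
```

To use it, read the single line from /proc/cmdline (fopen() plus
fgets()) and pass it to find_memmap(); a return of -1 suggests the
argument was lost or mangled on the way to the kernel.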

Regards,
John



* Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
  2024-05-24  0:47               ` John Groves
@ 2024-05-24  7:55                 ` Miklos Szeredi
  0 siblings, 0 replies; 105+ messages in thread
From: Miklos Szeredi @ 2024-05-24  7:55 UTC (permalink / raw)
  To: John Groves; +Cc: linux-cxl, linux-fsdevel, linux-kernel, nvdimm

[-- Attachment #1: Type: text/plain, Size: 1115 bytes --]

On Fri, 24 May 2024 at 02:47, John Groves <John@groves.net> wrote:

> Apologies, but I'm short on time at the moment - going into a long holiday
> weekend in the US with family plans. I should be focused again by middle of
> next week.

NP.

Obviously I'll need to test it before anything is merged, other than
that this is not urgent at all...

> But can you check /proc/cmdline to see if the memmap arg got through without
> getting mangled? The '$' tends to get fubar'd. You might need \$, or I've seen
> the need for \\\$. If it's un-mangled, there should be a dax device.

/proc/cmdline shows the option correctly:

root@kvm:~# cat /proc/cmdline
root=/dev/vda console=hvc0 memmap=4G$4G

> If that doesn't work, it's worth trying '!' instead, which I think would give
> you a pmem device - if the arg gets through (but ! is less likely to get
> horked). That pmem device can be converted to devdax...

That doesn't work either.  No device created in /dev  (dax or pmem).

free(1) does show that the reserved memory is gone in both cases, so
something does happen.

Attaching my .config as well.

Thanks,
Miklos

[-- Attachment #2: .config --]
[-- Type: application/octet-stream, Size: 89618 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 6.9.0 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (GCC) 13.2.1 20240316 (Red Hat 13.2.1-7)"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=130201
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=24000
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=24000
CONFIG_LLD_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
CONFIG_GCC_ASM_GOTO_OUTPUT_WORKAROUND=y
CONFIG_TOOLS_SUPPORT_RELR=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=0
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_WERROR=y
# CONFIG_UAPI_HEADER_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="kvm"
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_IDLE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=100
# end of Timers subsystem

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

#
# BPF subsystem
#
CONFIG_BPF_SYSCALL=y
# CONFIG_BPF_JIT is not set
# CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set
# CONFIG_BPF_PRELOAD is not set
# end of BPF subsystem

CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
# CONFIG_PREEMPT_DYNAMIC is not set
# CONFIG_SCHED_CORE is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

# CONFIG_IKCONFIG is not set
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=21
CONFIG_LOG_CPU_MAX_BUF_SHIFT=21
# CONFIG_PRINTK_INDEX is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough=5"
CONFIG_GCC10_NO_ARRAY_BOUNDS=y
CONFIG_CC_NO_ARRAY_BOUNDS=y
CONFIG_GCC_NO_STRINGOP_OVERFLOW=y
CONFIG_CC_NO_STRINGOP_OVERFLOW=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_FAVOR_DYNMODS is not set
# CONFIG_MEMCG is not set
# CONFIG_BLK_CGROUP is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_SCHED_MM_CID=y
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CGROUP_PERF is not set
# CONFIG_CGROUP_BPF is not set
# CONFIG_CGROUP_MISC is not set
# CONFIG_CGROUP_DEBUG is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_TIME_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
# CONFIG_CHECKPOINT_RESTORE is not set
# CONFIG_SCHED_AUTOGROUP is not set
CONFIG_RELAY=y
# CONFIG_BLK_DEV_INITRD is not set
# CONFIG_BOOT_CONFIG is not set
CONFIG_INITRAMFS_PRESERVE_MTIME=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_LD_ORPHAN_WARN=y
CONFIG_LD_ORPHAN_WARN_LEVEL="error"
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
# CONFIG_EXPERT is not set
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_RSEQ=y
CONFIG_CACHESTAT_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_SELFTEST is not set
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
# end of Kernel Performance Events And Counters

# CONFIG_PROFILING is not set
CONFIG_TRACEPOINTS=y

#
# Kexec and crash features
#
CONFIG_VMCORE_INFO=y
# CONFIG_KEXEC is not set
# CONFIG_KEXEC_FILE is not set
# end of Kexec and crash features
# end of General setup

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_AUDIT_ARCH=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

#
# Processor type and features
#
CONFIG_SMP=y
# CONFIG_X86_X2APIC is not set
CONFIG_X86_MPPARSE=y
# CONFIG_X86_CPU_RESCTRL is not set
# CONFIG_X86_FRED is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_VSMP is not set
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_MID is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
CONFIG_IOSF_MBI=y
# CONFIG_IOSF_MBI_DEBUG is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_DEBUG is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_X86_HV_CALLBACK_VECTOR=y
# CONFIG_XEN is not set
CONFIG_KVM_GUEST=y
CONFIG_ARCH_CPUIDLE_HALTPOLL=y
# CONFIG_PVH is not set
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_JAILHOUSE_GUEST is not set
# CONFIG_ACRN_GUEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_HAVE_PAE=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_IA32_FEAT_CTL=y
CONFIG_X86_VMX_FEATURE_NAMES=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_HYGON=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_ZHAOXIN=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_GART_IOMMU is not set
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS_RANGE_BEGIN=2
CONFIG_NR_CPUS_RANGE_END=512
CONFIG_NR_CPUS_DEFAULT=64
CONFIG_NR_CPUS=64
CONFIG_SCHED_CLUSTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_SCHED_MC_PRIO is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCELOG_LEGACY is not set
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
CONFIG_PERF_EVENTS_AMD_UNCORE=y
# CONFIG_PERF_EVENTS_AMD_BRS is not set
# end of Performance monitoring

CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
CONFIG_X86_IOPL_IOPERM=y
CONFIG_MICROCODE=y
# CONFIG_MICROCODE_LATE_LOADING is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
# CONFIG_X86_5LEVEL is not set
CONFIG_X86_DIRECT_GBPAGES=y
# CONFIG_X86_CPA_STATISTICS is not set
# CONFIG_AMD_MEM_ENCRYPT is not set
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
# CONFIG_ARCH_MEMORY_PROBE is not set
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
# CONFIG_X86_PMEM_LEGACY is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_X86_UMIP=y
CONFIG_CC_HAS_IBT=y
# CONFIG_X86_KERNEL_IBT is not set
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
CONFIG_X86_INTEL_TSX_MODE_OFF=y
# CONFIG_X86_INTEL_TSX_MODE_ON is not set
# CONFIG_X86_INTEL_TSX_MODE_AUTO is not set
# CONFIG_X86_USER_SHADOW_STACK is not set
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_EFI_HANDOVER_PROTOCOL=y
# CONFIG_EFI_MIXED is not set
# CONFIG_EFI_FAKE_MEMMAP is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_ARCH_SUPPORTS_KEXEC=y
CONFIG_ARCH_SUPPORTS_KEXEC_FILE=y
CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_SIG_FORCE=y
CONFIG_ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG=y
CONFIG_ARCH_SUPPORTS_KEXEC_JUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_DUMP=y
CONFIG_ARCH_SUPPORTS_CRASH_HOTPLUG=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
# CONFIG_RANDOMIZE_BASE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
# CONFIG_ADDRESS_MASKING is not set
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_LEGACY_VSYSCALL_XONLY is not set
CONFIG_LEGACY_VSYSCALL_NONE=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_MODIFY_LDT_SYSCALL=y
# CONFIG_STRICT_SIGALTSTACK_SIZE is not set
CONFIG_HAVE_LIVEPATCH=y
# CONFIG_LIVEPATCH is not set
# end of Processor type and features

CONFIG_CC_HAS_NAMED_AS=y
CONFIG_USE_X86_SEG_SUPPORT=y
CONFIG_CC_HAS_SLS=y
CONFIG_CC_HAS_RETURN_THUNK=y
CONFIG_CC_HAS_ENTRY_PADDING=y
CONFIG_FUNCTION_PADDING_CFI=11
CONFIG_FUNCTION_PADDING_BYTES=16
CONFIG_CALL_PADDING=y
CONFIG_HAVE_CALL_THUNKS=y
CONFIG_CALL_THUNKS=y
CONFIG_PREFIX_SYMBOLS=y
CONFIG_CPU_MITIGATIONS=y
CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y
CONFIG_MITIGATION_RETPOLINE=y
CONFIG_MITIGATION_RETHUNK=y
CONFIG_MITIGATION_UNRET_ENTRY=y
CONFIG_MITIGATION_CALL_DEPTH_TRACKING=y
# CONFIG_CALL_THUNKS_DEBUG is not set
CONFIG_MITIGATION_IBPB_ENTRY=y
CONFIG_MITIGATION_IBRS_ENTRY=y
CONFIG_MITIGATION_SRSO=y
# CONFIG_MITIGATION_SLS is not set
# CONFIG_MITIGATION_GDS_FORCE is not set
CONFIG_MITIGATION_RFDS=y
CONFIG_MITIGATION_SPECTRE_BHI=y
CONFIG_ARCH_HAS_ADD_PAGES=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
# CONFIG_HIBERNATION_SNAPSHOT_DEV is not set
CONFIG_HIBERNATION_COMP_LZO=y
CONFIG_HIBERNATION_DEF_COMP="lzo"
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_USERSPACE_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_CLK=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SPCR_TABLE=y
# CONFIG_ACPI_FPDT is not set
CONFIG_ACPI_LPIT=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
# CONFIG_ACPI_AC is not set
# CONFIG_ACPI_BATTERY is not set
# CONFIG_ACPI_BUTTON is not set
# CONFIG_ACPI_TINY_POWER_BUTTON is not set
# CONFIG_ACPI_TAD is not set
# CONFIG_ACPI_DOCK is not set
# CONFIG_ACPI_PROCESSOR is not set
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_HOTPLUG_MEMORY is not set
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
# CONFIG_ACPI_HED is not set
# CONFIG_ACPI_BGRT is not set
# CONFIG_ACPI_NFIT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
# CONFIG_ACPI_APEI is not set
# CONFIG_ACPI_DPTF is not set
# CONFIG_ACPI_EXTLOG is not set
# CONFIG_ACPI_CONFIGFS is not set
# CONFIG_ACPI_PFRUT is not set
CONFIG_ACPI_PCC=y
# CONFIG_ACPI_FFH is not set
# CONFIG_PMIC_OPREGION is not set
CONFIG_ACPI_PRMT=y
CONFIG_X86_PM_TIMER=y

#
# CPU Frequency scaling
#
# CONFIG_CPU_FREQ is not set
# end of CPU Frequency scaling

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_CPU_IDLE_GOV_TEO is not set
CONFIG_CPU_IDLE_GOV_HALTPOLL=y
CONFIG_HALTPOLL_CPUIDLE=y
# end of CPU Idle

# CONFIG_INTEL_IDLE is not set
# end of Power management and ACPI options

#
# Bus options (PCI etc.)
#
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_MMCONF_FAM10H=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
# end of Bus options (PCI etc.)

#
# Binary Emulations
#
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_EMULATION_DEFAULT_DISABLED is not set
# CONFIG_X86_X32_ABI is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
# end of Binary Emulations

CONFIG_VIRTUALIZATION=y
# CONFIG_KVM is not set
CONFIG_AS_AVX512=y
CONFIG_AS_SHA1_NI=y
CONFIG_AS_SHA256_NI=y
CONFIG_AS_TPAUSE=y
CONFIG_AS_GFNI=y
CONFIG_AS_WRUSS=y
CONFIG_ARCH_CONFIGURES_CPU_MITIGATIONS=y

#
# General architecture-dependent options
#
CONFIG_HOTPLUG_SMT=y
CONFIG_HOTPLUG_CORE_SYNC=y
CONFIG_HOTPLUG_CORE_SYNC_DEAD=y
CONFIG_HOTPLUG_CORE_SYNC_FULL=y
CONFIG_HOTPLUG_SPLIT_STARTUP=y
CONFIG_HOTPLUG_PARALLEL=y
CONFIG_GENERIC_ENTRY=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
# CONFIG_STATIC_KEYS_SELFTEST is not set
# CONFIG_STATIC_CALL_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_KRETPROBE_ON_RETHOOK=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_HAS_SET_DIRECT_MAP=y
CONFIG_ARCH_HAS_CPU_FINALIZE_INIT=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_ARCH_WANTS_NO_INSTR=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_RUST=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_TABLE_FREE=y
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
CONFIG_MMU_GATHER_MERGE_VMAS=y
CONFIG_MMU_LAZY_TLB_REFCOUNT=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
# CONFIG_SECCOMP_CACHE_DEBUG is not set
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_LTO_NONE=y
CONFIG_ARCH_SUPPORTS_CFI_CLANG=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING_USER=y
CONFIG_HAVE_CONTEXT_TRACKING_USER_OFFSTACK=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PUD=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_PMD_MKWRITE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_SOFTIRQ_ON_OWN_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_HAVE_PAGE_SIZE_4KB=y
CONFIG_PAGE_SIZE_4KB=y
CONFIG_PAGE_SIZE_LESS_THAN_64KB=y
CONFIG_PAGE_SIZE_LESS_THAN_256KB=y
CONFIG_PAGE_SHIFT=12
CONFIG_HAVE_OBJTOOL=y
CONFIG_HAVE_JUMP_LABEL_HACK=y
CONFIG_HAVE_NOINSTR_HACK=y
CONFIG_HAVE_NOINSTR_VALIDATION=y
CONFIG_HAVE_UACCESS_VALIDATION=y
CONFIG_HAVE_STACK_VALIDATION=y
CONFIG_HAVE_RELIABLE_STACKTRACE=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y
CONFIG_ARCH_USE_MEMREMAP_PROT=y
# CONFIG_LOCK_EVENT_COUNTS is not set
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
CONFIG_HAVE_STATIC_CALL=y
CONFIG_HAVE_STATIC_CALL_INLINE=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_PAGE_TABLE_CHECK=y
CONFIG_ARCH_HAS_ELFCORE_COMPAT=y
CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH=y
CONFIG_DYNAMIC_SIGFRAME=y
CONFIG_ARCH_HAS_HW_PTE_YOUNG=y
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# end of GCOV-based kernel profiling

CONFIG_HAVE_GCC_PLUGINS=y
CONFIG_FUNCTION_ALIGNMENT_4B=y
CONFIG_FUNCTION_ALIGNMENT_16B=y
CONFIG_FUNCTION_ALIGNMENT=16
# end of General architecture-dependent options

CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_DEBUG is not set
# CONFIG_MODULE_FORCE_LOAD is not set
# CONFIG_MODULE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
CONFIG_MODULE_COMPRESS_NONE=y
# CONFIG_MODULE_COMPRESS_GZIP is not set
# CONFIG_MODULE_COMPRESS_XZ is not set
# CONFIG_MODULE_COMPRESS_ZSTD is not set
# CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is not set
CONFIG_MODPROBE_PATH="/sbin/modprobe"
# CONFIG_TRIM_UNUSED_KSYMS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
# CONFIG_BLOCK_LEGACY_AUTOLOAD is not set
CONFIG_BLK_CGROUP_PUNT_BIO=y
# CONFIG_BLK_DEV_BSGLIB is not set
# CONFIG_BLK_DEV_INTEGRITY is not set
CONFIG_BLK_DEV_WRITE_MOUNTED=y
# CONFIG_BLK_DEV_ZONED is not set
# CONFIG_BLK_WBT is not set
CONFIG_BLK_DEBUG_FS=y
# CONFIG_BLK_SED_OPAL is not set
# CONFIG_BLK_INLINE_ENCRYPTION is not set

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_EFI_PARTITION=y
# end of Partition Types

CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_BLK_PM=y

#
# IO Schedulers
#
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
# end of IO Schedulers

CONFIG_PADATA=y
CONFIG_ASN1=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y
# end of Executable file formats

#
# Memory Management options
#
CONFIG_SWAP=y
# CONFIG_ZSWAP is not set

#
# Slab allocator options
#
CONFIG_SLUB=y
# CONFIG_SLAB_MERGE_DEFAULT is not set
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set
# CONFIG_SLUB_STATS is not set
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_RANDOM_KMALLOC_CACHES is not set
# end of Slab allocator options

# CONFIG_SHUFFLE_PAGE_ALLOCATOR is not set
# CONFIG_COMPAT_BRK is not set
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=y
CONFIG_ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_HAVE_BOOTMEM_INFO_NODE=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_MEMORY_HOTPLUG=y
# CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is not set
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_MHP_MEMMAP_ON_MEMORY=y
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_COMPACTION=y
CONFIG_COMPACT_UNEVICTABLE_DEFAULT=1
# CONFIG_PAGE_REPORTING is not set
CONFIG_MIGRATION=y
CONFIG_DEVICE_MIGRATION=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y
CONFIG_CONTIG_ALLOC=y
CONFIG_PCP_BATCH_SCALE_MAX=5
CONFIG_PHYS_ADDR_T_64BIT=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_MEMORY_FAILURE is not set
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
# CONFIG_TRANSPARENT_HUGEPAGE_NEVER is not set
CONFIG_THP_SWAP=y
# CONFIG_READ_ONLY_THP_FOR_FS is not set
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
# CONFIG_CMA is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
# CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CURRENT_STACK_POINTER=y
CONFIG_ARCH_HAS_PTE_DEVMAP=y
CONFIG_ZONE_DMA=y
CONFIG_ZONE_DMA32=y
CONFIG_ZONE_DEVICE=y
# CONFIG_DEVICE_PRIVATE is not set
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
CONFIG_VM_EVENT_COUNTERS=y
# CONFIG_PERCPU_STATS is not set
# CONFIG_GUP_TEST is not set
# CONFIG_DMAPOOL_TEST is not set
CONFIG_ARCH_HAS_PTE_SPECIAL=y
CONFIG_MEMFD_CREATE=y
CONFIG_SECRETMEM=y
# CONFIG_ANON_VMA_NAME is not set
# CONFIG_USERFAULTFD is not set
# CONFIG_LRU_GEN is not set
CONFIG_ARCH_SUPPORTS_PER_VMA_LOCK=y
CONFIG_PER_VMA_LOCK=y
CONFIG_LOCK_MM_AND_FIND_VMA=y

#
# Data Access Monitoring
#
# CONFIG_DAMON is not set
# end of Data Access Monitoring
# end of Memory Management options

CONFIG_NET=y
CONFIG_NET_INGRESS=y
CONFIG_NET_EGRESS=y
CONFIG_NET_XGRESS=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
CONFIG_AF_UNIX_OOB=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_TLS is not set
# CONFIG_XFRM_USER is not set
# CONFIG_NET_KEY is not set
# CONFIG_XDP_SOCKETS is not set
CONFIG_NET_HANDSHAKE=y
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_NET_IPVTI is not set
# CONFIG_NET_FOU is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
CONFIG_INET_TABLE_PERTURB_ORDER=16
# CONFIG_INET_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_AO is not set
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IPV6 is not set
# CONFIG_NETLABEL is not set
# CONFIG_MPTCP is not set
CONFIG_NETWORK_SECMARK=y
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
# CONFIG_BRIDGE is not set
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
# CONFIG_NET_SCHED is not set
# CONFIG_DCB is not set
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
# CONFIG_NET_L3_MASTER_DEV is not set
# CONFIG_QRTR is not set
# CONFIG_NET_NCSI is not set
CONFIG_PCPU_DEV_REFCNT=y
CONFIG_MAX_SKB_FRAGS=17
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_SOCK_RX_QUEUE_MAPPING=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_NET_CLASSID is not set
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_DROP_MONITOR is not set
# end of Network testing
# end of Networking options

# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
# CONFIG_MCTP is not set
# CONFIG_WIRELESS is not set
# CONFIG_RFKILL is not set
CONFIG_NET_9P=y
CONFIG_NET_9P_FD=y
CONFIG_NET_9P_VIRTIO=y
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
CONFIG_CEPH_LIB=y
# CONFIG_CEPH_LIB_PRETTYDEBUG is not set
# CONFIG_CEPH_LIB_USE_DNS_RESOLVER is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
CONFIG_NET_SOCK_MSG=y
CONFIG_PAGE_POOL=y
# CONFIG_PAGE_POOL_STATS is not set
CONFIG_FAILOVER=y
CONFIG_ETHTOOL_NETLINK=y

#
# Device Drivers
#
CONFIG_HAVE_EISA=y
# CONFIG_EISA is not set
CONFIG_HAVE_PCI=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
# CONFIG_HOTPLUG_PCI_PCIE is not set
CONFIG_PCIEAER=y
# CONFIG_PCIEAER_INJECT is not set
# CONFIG_PCIE_ECRC is not set
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
# CONFIG_PCIE_DPC is not set
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_MSI=y
CONFIG_PCI_QUIRKS=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
# CONFIG_PCI_IOV is not set
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
# CONFIG_PCI_P2PDMA is not set
CONFIG_PCI_LABEL=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_ACPI is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# PCI controller drivers
#
# CONFIG_VMD is not set

#
# Cadence-based PCIe controllers
#
# end of Cadence-based PCIe controllers

#
# DesignWare-based PCIe controllers
#
# CONFIG_PCI_MESON is not set
# CONFIG_PCIE_DW_PLAT_HOST is not set
# end of DesignWare-based PCIe controllers

#
# Mobiveil-based PCIe controllers
#
# end of Mobiveil-based PCIe controllers
# end of PCI controller drivers

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set
# end of PCI Endpoint

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# end of PCI switch controller drivers

# CONFIG_CXL_BUS is not set
# CONFIG_PCCARD is not set
# CONFIG_RAPIDIO is not set

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
# CONFIG_DEVTMPFS_SAFE is not set
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y

#
# Firmware loader
#
CONFIG_FW_LOADER=y
CONFIG_FW_LOADER_DEBUG=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_FW_LOADER_USER_HELPER is not set
# CONFIG_FW_LOADER_COMPRESS is not set
# CONFIG_FW_CACHE is not set
# CONFIG_FW_UPLOAD is not set
# end of Firmware loader

CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
CONFIG_GENERIC_CPU_DEVICES=y
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_DMA_FENCE_TRACE is not set
# CONFIG_FW_DEVLINK_SYNC_STATE_TIMEOUT is not set
# end of Generic Driver Options

#
# Bus devices
#
# CONFIG_MHI_BUS is not set
# CONFIG_MHI_BUS_EP is not set
# end of Bus devices

#
# Cache Drivers
#
# end of Cache Drivers

CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y

#
# Firmware Drivers
#

#
# ARM System Control and Management Interface Protocol
#
# end of ARM System Control and Management Interface Protocol

# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# CONFIG_FW_CFG_SYSFS is not set
# CONFIG_SYSFB_SIMPLEFB is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_ESRT=y
CONFIG_EFI_DXE_MEM_ATTRIBUTES=y
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_APPLE_PROPERTIES is not set
# CONFIG_RESET_ATTACK_MITIGATION is not set
# CONFIG_EFI_RCI2_TABLE is not set
# CONFIG_EFI_DISABLE_PCI_DMA is not set
CONFIG_EFI_EARLYCON=y
# CONFIG_EFI_CUSTOM_SSDT_OVERLAYS is not set
# CONFIG_EFI_DISABLE_RUNTIME is not set
# CONFIG_EFI_COCO_SECRET is not set
# end of EFI (Extensible Firmware Interface) Support

#
# Qualcomm firmware drivers
#
# end of Qualcomm firmware drivers

#
# Tegra firmware driver
#
# end of Tegra firmware driver
# end of Firmware Drivers

# CONFIG_GNSS is not set
CONFIG_MTD=y
# CONFIG_MTD_TESTS is not set

#
# Partition parsers
#
# CONFIG_MTD_CMDLINE_PARTS is not set
# CONFIG_MTD_REDBOOT_PARTS is not set
# end of Partition parsers

#
# User Modules And Translation Layers
#
# CONFIG_MTD_BLOCK is not set
# CONFIG_MTD_BLOCK_RO is not set
# CONFIG_FTL is not set
# CONFIG_NFTL is not set
# CONFIG_INFTL is not set
# CONFIG_RFD_FTL is not set
# CONFIG_SSFDC is not set
# CONFIG_SM_FTL is not set
# CONFIG_MTD_OOPS is not set
# CONFIG_MTD_SWAP is not set
# CONFIG_MTD_PARTITIONED_MASTER is not set

#
# RAM/ROM/Flash chip drivers
#
# CONFIG_MTD_CFI is not set
# CONFIG_MTD_JEDECPROBE is not set
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
# CONFIG_MTD_RAM is not set
# CONFIG_MTD_ROM is not set
# CONFIG_MTD_ABSENT is not set
# end of RAM/ROM/Flash chip drivers

#
# Mapping drivers for chip access
#
# CONFIG_MTD_COMPLEX_MAPPINGS is not set
# CONFIG_MTD_PLATRAM is not set
# end of Mapping drivers for chip access

#
# Self-contained MTD device drivers
#
# CONFIG_MTD_PMC551 is not set
# CONFIG_MTD_SLRAM is not set
# CONFIG_MTD_PHRAM is not set
# CONFIG_MTD_MTDRAM is not set
# CONFIG_MTD_BLOCK2MTD is not set

#
# Disk-On-Chip Device Drivers
#
# CONFIG_MTD_DOCG3 is not set
# end of Self-contained MTD device drivers

#
# NAND
#
# CONFIG_MTD_ONENAND is not set
# CONFIG_MTD_RAW_NAND is not set

#
# ECC engine support
#
# CONFIG_MTD_NAND_ECC_SW_HAMMING is not set
# CONFIG_MTD_NAND_ECC_SW_BCH is not set
# CONFIG_MTD_NAND_ECC_MXIC is not set
# end of ECC engine support
# end of NAND

#
# LPDDR & LPDDR2 PCM memory drivers
#
# CONFIG_MTD_LPDDR is not set
# end of LPDDR & LPDDR2 PCM memory drivers

CONFIG_MTD_UBI=y
CONFIG_MTD_UBI_WL_THRESHOLD=4096
CONFIG_MTD_UBI_BEB_LIMIT=20
# CONFIG_MTD_UBI_FASTMAP is not set
# CONFIG_MTD_UBI_GLUEBI is not set
# CONFIG_MTD_UBI_BLOCK is not set
# CONFIG_MTD_UBI_NVMEM is not set
# CONFIG_MTD_HYPERBUS is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_ZRAM is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_ATA_OVER_ETH is not set
CONFIG_VIRTIO_BLK=y
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_UBLK is not set

#
# NVME Support
#
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TCP is not set
# CONFIG_NVME_TARGET is not set
# end of NVME Support

#
# Misc devices
#
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_SRAM is not set
# CONFIG_DW_XDATA_PCIE is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_XILINX_SDFEC is not set
# CONFIG_NSM is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_93CX6 is not set
# end of EEPROM support

# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# end of Texas Instruments shared transport line discipline

#
# Altera FPGA firmware download module (requires I2C)
#
# CONFIG_INTEL_MEI is not set
# CONFIG_VMWARE_VMCI is not set
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_BCM_VK is not set
# CONFIG_MISC_ALCOR_PCI is not set
# CONFIG_MISC_RTSX_PCI is not set
# CONFIG_UACCE is not set
# CONFIG_PVPANIC is not set
# end of Misc devices

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
# CONFIG_SCSI is not set
# end of SCSI device support

# CONFIG_ATA is not set
# CONFIG_MD is not set
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# end of IEEE 1394 (FireWire) support

# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_WIREGUARD is not set
# CONFIG_EQUALIZER is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_IPVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_GENEVE is not set
# CONFIG_BAREUDP is not set
# CONFIG_GTP is not set
# CONFIG_MACSEC is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_TUN is not set
# CONFIG_TUN_VNET_CROSS_LE is not set
# CONFIG_VETH is not set
CONFIG_VIRTIO_NET=y
# CONFIG_NLMON is not set
# CONFIG_NETKIT is not set
# CONFIG_ARCNET is not set
# CONFIG_ETHERNET is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_PHYLIB is not set
# CONFIG_PSE_CONTROLLER is not set
# CONFIG_MDIO_DEVICE is not set

#
# PCS device drivers
#
# end of PCS device drivers

# CONFIG_PPP is not set
# CONFIG_SLIP is not set

#
# Host-side USB support is needed for USB Network Adapter support
#
# CONFIG_WLAN is not set
# CONFIG_WAN is not set

#
# Wireless WAN
#
# CONFIG_WWAN is not set
# end of Wireless WAN

# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
# CONFIG_NETDEVSIM is not set
CONFIG_NET_FAILOVER=y
# CONFIG_ISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_SPARSEKMAP=y
# CONFIG_INPUT_MATRIXKMAP is not set
CONFIG_INPUT_VIVALDIFMAP=y

#
# Userland interfaces
#
# CONFIG_INPUT_MOUSEDEV is not set
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_SAMSUNG is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_INPUT_MOUSE is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
# CONFIG_INPUT_MISC is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set
# end of Hardware I/O ports
# end of Input device support

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_LEGACY_TIOCSTI=y
CONFIG_LDISC_AUTOLOAD=y

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_16550A_VARIANTS is not set
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_DMA=y
CONFIG_SERIAL_8250_PCILIB=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_PCI1XXXX=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_DWLIB=y
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y
CONFIG_SERIAL_8250_PERICOM=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_LANTIQ is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_FSL_LINFLEXUART is not set
# CONFIG_SERIAL_SPRD is not set
# end of Serial drivers

CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_N_HDLC is not set
# CONFIG_N_GSM is not set
# CONFIG_NOZOMI is not set
# CONFIG_NULL_TTY is not set
CONFIG_HVC_DRIVER=y
# CONFIG_SERIAL_DEV_BUS is not set
CONFIG_VIRTIO_CONSOLE=y
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_BA431 is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_HW_RANDOM_VIRTIO=y
# CONFIG_HW_RANDOM_XIPHERA is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
CONFIG_DEVMEM=y
# CONFIG_NVRAM is not set
CONFIG_DEVPORT=y
# CONFIG_HPET is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
# CONFIG_XILLYBUS is not set
# end of Character devices

#
# I2C support
#
# CONFIG_I2C is not set
# end of I2C support

# CONFIG_I3C is not set
# CONFIG_SPI is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y
CONFIG_PTP_1588_CLOCK_OPTIONAL=y

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
CONFIG_PTP_1588_CLOCK_KVM=y
# CONFIG_PTP_1588_CLOCK_MOCK is not set
# CONFIG_PTP_1588_CLOCK_VMW is not set
# end of PTP clock support

# CONFIG_PINCTRL is not set
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
# CONFIG_POWER_RESET is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
CONFIG_POWER_SUPPLY_HWMON=y
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_SAMSUNG_SDI is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_BATTERY_GOLDFISH is not set
CONFIG_HWMON=y
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AS370 is not set
# CONFIG_SENSORS_ASUS_ROG_RYUJIN is not set
# CONFIG_SENSORS_AXI_FAN_CONTROL is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_CORSAIR_CPRO is not set
# CONFIG_SENSORS_CORSAIR_PSU is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_I5500 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MR75203 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NPCM7XX is not set
# CONFIG_SENSORS_OXP is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_XGENE is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_ASUS_EC is not set
# CONFIG_THERMAL is not set
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_MADERA is not set
# CONFIG_MFD_INTEL_QUARK_I2C_GPIO is not set
# CONFIG_LPC_ICH is not set
# CONFIG_LPC_SCH is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_INTEL_PMC_BXT is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TQMX86 is not set
# CONFIG_MFD_VX855 is not set
# end of Multifunction device drivers

# CONFIG_REGULATOR is not set
# CONFIG_RC_CORE is not set

#
# CEC support
#
# CONFIG_MEDIA_CEC_SUPPORT is not set
# end of CEC support

# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
# CONFIG_AUXDISPLAY is not set
# CONFIG_AGP is not set
# CONFIG_VGA_SWITCHEROO is not set
# CONFIG_DRM is not set

#
# Frame buffer Devices
#
# CONFIG_FB is not set
# end of Frame buffer Devices

#
# Backlight & LCD device support
#
# CONFIG_LCD_CLASS_DEVICE is not set
# CONFIG_BACKLIGHT_CLASS_DEVICE is not set
# end of Backlight & LCD device support

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
# end of Console display driver support
# end of Graphics support

# CONFIG_SOUND is not set
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
# CONFIG_HID_A4TECH is not set
# CONFIG_HID_ACRUX is not set
# CONFIG_HID_AUREAL is not set
# CONFIG_HID_BELKIN is not set
# CONFIG_HID_CHERRY is not set
# CONFIG_HID_COUGAR is not set
# CONFIG_HID_MACALLY is not set
# CONFIG_HID_CMEDIA is not set
# CONFIG_HID_CYPRESS is not set
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_EVISION is not set
# CONFIG_HID_EZKEY is not set
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_GLORIOUS is not set
# CONFIG_HID_GOOGLE_STADIA_FF is not set
# CONFIG_HID_VIVALDI is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_VIEWSONIC is not set
# CONFIG_HID_VRC2 is not set
# CONFIG_HID_XIAOMI is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
# CONFIG_HID_ITE is not set
# CONFIG_HID_JABRA is not set
# CONFIG_HID_TWINHAN is not set
# CONFIG_HID_KENSINGTON is not set
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LENOVO is not set
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MALTRON is not set
# CONFIG_HID_MAYFLASH is not set
CONFIG_HID_REDRAGON=y
# CONFIG_HID_MICROSOFT is not set
# CONFIG_HID_MONTEREY is not set
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTI is not set
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PLANTRONICS is not set
# CONFIG_HID_PXRC is not set
# CONFIG_HID_RAZER is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SEMITEK is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEAM is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_TOPRE is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set
# end of Special HID drivers

#
# HID-BPF support
#
# CONFIG_HID_BPF is not set
# end of HID-BPF support

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
# end of Intel ISH HID support

#
# AMD SFH HID Support
#
# CONFIG_AMD_SFH_HID is not set
# end of AMD SFH HID Support

CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
# CONFIG_USB_ULPI_BUS is not set
CONFIG_USB_ARCH_HAS_HCD=y
# CONFIG_USB is not set
CONFIG_USB_PCI=y
CONFIG_USB_PCI_AMD=y

#
# USB dual-mode controller drivers
#

#
# USB port drivers
#

#
# USB Physical Layer drivers
#
# CONFIG_NOP_USB_XCEIV is not set
# end of USB Physical Layer drivers

# CONFIG_USB_GADGET is not set
# CONFIG_TYPEC is not set
# CONFIG_USB_ROLE_SWITCH is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=y
# CONFIG_EDAC_AMD64 is not set
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_IE31200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I7CORE is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_I7300 is not set
# CONFIG_EDAC_SBRIDGE is not set
# CONFIG_EDAC_SKX is not set
# CONFIG_EDAC_I10NM is not set
# CONFIG_EDAC_PND2 is not set
# CONFIG_EDAC_IGEN6 is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
# CONFIG_RTC_HCTOSYS is not set
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set
CONFIG_RTC_NVMEM=y

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#

#
# SPI RTC drivers
#

#
# SPI and I2C RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_RP5C01 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_FTRTC010 is not set

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_GOLDFISH is not set
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_ACPI=y
# CONFIG_ALTERA_MSGDMA is not set
# CONFIG_INTEL_IDMA64 is not set
# CONFIG_INTEL_IDXD is not set
# CONFIG_INTEL_IDXD_COMPAT is not set
# CONFIG_INTEL_IOATDMA is not set
# CONFIG_PLX_DMA is not set
# CONFIG_XILINX_DMA is not set
# CONFIG_XILINX_XDMA is not set
# CONFIG_AMD_PTDMA is not set
# CONFIG_QCOM_HIDMA_MGMT is not set
# CONFIG_QCOM_HIDMA is not set
CONFIG_DW_DMAC_CORE=y
# CONFIG_DW_DMAC is not set
# CONFIG_DW_DMAC_PCI is not set
# CONFIG_DW_EDMA is not set
CONFIG_HSU_DMA=y
# CONFIG_SF_PDMA is not set
# CONFIG_INTEL_LDMA is not set

#
# DMA Clients
#
# CONFIG_ASYNC_TX_DMA is not set
# CONFIG_DMATEST is not set

#
# DMABUF options
#
CONFIG_SYNC_FILE=y
# CONFIG_SW_SYNC is not set
# CONFIG_UDMABUF is not set
# CONFIG_DMABUF_MOVE_NOTIFY is not set
# CONFIG_DMABUF_DEBUG is not set
# CONFIG_DMABUF_SELFTESTS is not set
# CONFIG_DMABUF_HEAPS is not set
# CONFIG_DMABUF_SYSFS_STATS is not set
# end of DMABUF options

# CONFIG_UIO is not set
# CONFIG_VFIO is not set
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO_ANCHOR=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_PCI_LIB=y
CONFIG_VIRTIO_PCI_LIB_LEGACY=y
CONFIG_VIRTIO_MENU=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_PCI_ADMIN_LEGACY=y
CONFIG_VIRTIO_PCI_LEGACY=y
# CONFIG_VIRTIO_PMEM is not set
# CONFIG_VIRTIO_BALLOON is not set
CONFIG_VIRTIO_INPUT=y
# CONFIG_VIRTIO_MMIO is not set
# CONFIG_VDPA is not set
CONFIG_VHOST_MENU=y
# CONFIG_VHOST_NET is not set
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV is not set
# end of Microsoft Hyper-V guest support

# CONFIG_GREYBUS is not set
# CONFIG_COMEDI is not set
# CONFIG_STAGING is not set
# CONFIG_GOLDFISH is not set
# CONFIG_CHROME_PLATFORMS is not set
# CONFIG_MELLANOX_PLATFORM is not set
CONFIG_SURFACE_PLATFORMS=y
# CONFIG_SURFACE_GPE is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACPI_WMI is not set
# CONFIG_ACER_WIRELESS is not set
# CONFIG_AMD_PMC is not set
# CONFIG_AMD_HSMP is not set
# CONFIG_AMD_WBRF is not set
# CONFIG_ADV_SWBUTTON is not set
# CONFIG_ASUS_WIRELESS is not set
# CONFIG_X86_PLATFORM_DRIVERS_DELL is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_X86_PLATFORM_DRIVERS_HP is not set
# CONFIG_WIRELESS_HOTKEY is not set
# CONFIG_IBM_RTL is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_INTEL_ATOMISP2_PM is not set
# CONFIG_INTEL_IFS is not set
# CONFIG_INTEL_SAR_INT1092 is not set

#
# Intel Speed Select Technology interface support
#
# CONFIG_INTEL_SPEED_SELECT_INTERFACE is not set
# end of Intel Speed Select Technology interface support

#
# Intel Uncore Frequency Control
#
# CONFIG_INTEL_UNCORE_FREQ_CONTROL is not set
# end of Intel Uncore Frequency Control

# CONFIG_INTEL_PUNIT_IPC is not set
# CONFIG_INTEL_RST is not set
# CONFIG_INTEL_SMARTCONNECT is not set
# CONFIG_INTEL_VSEC is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_TOSHIBA_HAPS is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_INTEL_SCU_PCI is not set
# CONFIG_INTEL_SCU_PLATFORM is not set
# CONFIG_SIEMENS_SIMATIC_IPC is not set
# CONFIG_WINMATE_FM07_KEYS is not set
CONFIG_HAVE_CLK=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y
# CONFIG_XILINX_VCU is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# end of Clock Source drivers

CONFIG_MAILBOX=y
CONFIG_PCC=y
# CONFIG_ALTERA_MBOX is not set
CONFIG_IOMMU_IOVA=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
CONFIG_IOMMU_IO_PGTABLE=y
# end of Generic IOMMU Pagetable Support

# CONFIG_IOMMU_DEBUGFS is not set
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_IOMMU_DMA=y
CONFIG_AMD_IOMMU=y
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_PERF_EVENTS=y
# CONFIG_IOMMUFD is not set
# CONFIG_IRQ_REMAP is not set
# CONFIG_VIRTIO_IOMMU is not set

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set
# end of Remoteproc drivers

#
# Rpmsg drivers
#
# CONFIG_RPMSG_QCOM_GLINK_RPM is not set
# CONFIG_RPMSG_VIRTIO is not set
# end of Rpmsg drivers

# CONFIG_SOUNDWIRE is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#
# end of Amlogic SoC drivers

#
# Broadcom SoC drivers
#
# end of Broadcom SoC drivers

#
# NXP/Freescale QorIQ SoC drivers
#
# end of NXP/Freescale QorIQ SoC drivers

#
# fujitsu SoC drivers
#
# end of fujitsu SoC drivers

#
# i.MX SoC drivers
#
# end of i.MX SoC drivers

#
# Enable LiteX SoC Builder specific drivers
#
# end of Enable LiteX SoC Builder specific drivers

# CONFIG_WPCM450_SOC is not set

#
# Qualcomm SoC drivers
#
# end of Qualcomm SoC drivers

# CONFIG_SOC_TI is not set

#
# Xilinx SoC drivers
#
# end of Xilinx SoC drivers
# end of SOC (System On Chip) specific Drivers

#
# PM Domains
#

#
# Amlogic PM Domains
#
# end of Amlogic PM Domains

#
# Broadcom PM Domains
#
# end of Broadcom PM Domains

#
# i.MX PM Domains
#
# end of i.MX PM Domains

#
# Qualcomm PM Domains
#
# end of Qualcomm PM Domains
# end of PM Domains

# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_PWM is not set

#
# IRQ chip support
#
# end of IRQ chip support

# CONFIG_IPACK_BUS is not set
# CONFIG_RESET_CONTROLLER is not set

#
# PHY Subsystem
#
# CONFIG_GENERIC_PHY is not set
# CONFIG_USB_LGM_PHY is not set
# CONFIG_PHY_CAN_TRANSCEIVER is not set

#
# PHY drivers for Broadcom platforms
#
# CONFIG_BCM_KONA_USB2_PHY is not set
# end of PHY drivers for Broadcom platforms

# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_PHY_INTEL_LGM_EMMC is not set
# end of PHY Subsystem

# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
# CONFIG_DWC_PCIE_PMU is not set
# end of Performance monitor support

CONFIG_RAS=y
# CONFIG_USB4 is not set

#
# Android
#
# CONFIG_ANDROID_BINDER_IPC is not set
# end of Android

CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=y
CONFIG_ND_CLAIM=y
CONFIG_ND_BTT=y
CONFIG_BTT=y
CONFIG_ND_PFN=y
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_DAX=y
CONFIG_DEV_DAX=y
CONFIG_DEV_DAX_PMEM=y
CONFIG_DEV_DAX_KMEM=y
CONFIG_DEV_DAX_IOMAP=y
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y
# CONFIG_NVMEM_LAYOUTS is not set
# CONFIG_NVMEM_RMEM is not set

#
# HW tracing support
#
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# end of HW tracing support

# CONFIG_FPGA is not set
# CONFIG_TEE is not set
# CONFIG_SIOX is not set
# CONFIG_SLIMBUS is not set
# CONFIG_INTERCONNECT is not set
# CONFIG_COUNTER is not set
# CONFIG_MOST is not set
# CONFIG_PECI is not set
# CONFIG_HTE is not set
# end of Device Drivers

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
CONFIG_VALIDATE_FS_PARSER=y
CONFIG_FS_IOMAP=y
CONFIG_FS_STACK=y
CONFIG_BUFFER_HEAD=y
CONFIG_LEGACY_DIRECT_IO=y
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
# CONFIG_REISERFS_FS_SECURITY is not set
CONFIG_JFS_FS=y
CONFIG_JFS_POSIX_ACL=y
# CONFIG_JFS_SECURITY is not set
# CONFIG_JFS_DEBUG is not set
# CONFIG_JFS_STATISTICS is not set
CONFIG_XFS_FS=y
CONFIG_XFS_SUPPORT_V4=y
CONFIG_XFS_SUPPORT_ASCII_CI=y
# CONFIG_XFS_QUOTA is not set
# CONFIG_XFS_POSIX_ACL is not set
# CONFIG_XFS_RT is not set
# CONFIG_XFS_ONLINE_SCRUB is not set
# CONFIG_XFS_WARN is not set
# CONFIG_XFS_DEBUG is not set
CONFIG_GFS2_FS=y
CONFIG_OCFS2_FS=y
CONFIG_OCFS2_FS_O2CB=y
CONFIG_OCFS2_FS_STATS=y
CONFIG_OCFS2_DEBUG_MASKLOG=y
# CONFIG_OCFS2_DEBUG_FS is not set
CONFIG_BTRFS_FS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set
# CONFIG_BTRFS_FS_REF_VERIFY is not set
CONFIG_NILFS2_FS=y
CONFIG_F2FS_FS=y
CONFIG_F2FS_STAT_FS=y
CONFIG_F2FS_FS_XATTR=y
CONFIG_F2FS_FS_POSIX_ACL=y
# CONFIG_F2FS_FS_SECURITY is not set
# CONFIG_F2FS_CHECK_FS is not set
# CONFIG_F2FS_FAULT_INJECTION is not set
# CONFIG_F2FS_FS_COMPRESSION is not set
CONFIG_F2FS_IOSTAT=y
CONFIG_BCACHEFS_FS=y
# CONFIG_BCACHEFS_QUOTA is not set
# CONFIG_BCACHEFS_ERASURE_CODING is not set
CONFIG_BCACHEFS_POSIX_ACL=y
# CONFIG_BCACHEFS_DEBUG is not set
# CONFIG_BCACHEFS_TESTS is not set
# CONFIG_BCACHEFS_LOCK_TIME_STATS is not set
# CONFIG_BCACHEFS_NO_LATENCY_ACCT is not set
CONFIG_BCACHEFS_SIX_OPTIMISTIC_SPIN=y
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
CONFIG_FS_ENCRYPTION=y
CONFIG_FS_ENCRYPTION_ALGS=y
# CONFIG_FS_VERITY is not set
CONFIG_FSNOTIFY=y
# CONFIG_DNOTIFY is not set
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
# CONFIG_FANOTIFY_ACCESS_PERMISSIONS is not set
CONFIG_QUOTA=y
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
# CONFIG_QFMT_V2 is not set
CONFIG_QUOTACTL=y
# CONFIG_AUTOFS_FS is not set
CONFIG_FUSE_FS=y
CONFIG_CUSE=y
CONFIG_VIRTIO_FS=y
CONFIG_FUSE_DAX=y
CONFIG_FUSE_PASSTHROUGH=y
CONFIG_OVERLAY_FS=y
CONFIG_OVERLAY_FS_REDIRECT_DIR=y
# CONFIG_OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW is not set
CONFIG_OVERLAY_FS_INDEX=y
# CONFIG_OVERLAY_FS_NFS_EXPORT is not set
CONFIG_OVERLAY_FS_XINO_AUTO=y
# CONFIG_OVERLAY_FS_METACOPY is not set
# CONFIG_OVERLAY_FS_DEBUG is not set
CONFIG_FAMFS=y

#
# Caches
#
CONFIG_NETFS_SUPPORT=y
# CONFIG_NETFS_STATS is not set
CONFIG_FSCACHE=y
# CONFIG_FSCACHE_STATS is not set
# CONFIG_FSCACHE_DEBUG is not set
CONFIG_CACHEFILES=y
# CONFIG_CACHEFILES_DEBUG is not set
# CONFIG_CACHEFILES_ERROR_INJECTION is not set
# CONFIG_CACHEFILES_ONDEMAND is not set
# end of Caches

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
# CONFIG_UDF_FS is not set
# end of CD-ROM/DVD Filesystems

#
# DOS/FAT/EXFAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_EXFAT_FS is not set
# CONFIG_NTFS3_FS is not set
# CONFIG_NTFS_FS is not set
# end of DOS/FAT/EXFAT/NT Filesystems

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
# CONFIG_PROC_CHILDREN is not set
CONFIG_PROC_PID_ARCH_STATUS=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
# CONFIG_TMPFS_INODE64 is not set
# CONFIG_TMPFS_QUOTA is not set
CONFIG_HUGETLBFS=y
# CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON is not set
CONFIG_HUGETLB_PAGE=y
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_EFIVAR_FS=y
# end of Pseudo filesystems

CONFIG_MISC_FILESYSTEMS=y
CONFIG_ORANGEFS_FS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
CONFIG_ECRYPT_FS=y
# CONFIG_ECRYPT_FS_MESSAGING is not set
# CONFIG_HFS_FS is not set
CONFIG_HFSPLUS_FS=y
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_JFFS2_FS=y
CONFIG_JFFS2_FS_DEBUG=0
CONFIG_JFFS2_FS_WRITEBUFFER=y
# CONFIG_JFFS2_FS_WBUF_VERIFY is not set
# CONFIG_JFFS2_SUMMARY is not set
CONFIG_JFFS2_FS_XATTR=y
CONFIG_JFFS2_FS_POSIX_ACL=y
CONFIG_JFFS2_FS_SECURITY=y
# CONFIG_JFFS2_COMPRESSION_OPTIONS is not set
CONFIG_JFFS2_ZLIB=y
CONFIG_JFFS2_RTIME=y
CONFIG_UBIFS_FS=y
# CONFIG_UBIFS_FS_ADVANCED_COMPR is not set
CONFIG_UBIFS_FS_LZO=y
CONFIG_UBIFS_FS_ZLIB=y
CONFIG_UBIFS_FS_ZSTD=y
# CONFIG_UBIFS_ATIME_SUPPORT is not set
CONFIG_UBIFS_FS_XATTR=y
CONFIG_UBIFS_FS_SECURITY=y
# CONFIG_UBIFS_FS_AUTHENTICATION is not set
# CONFIG_CRAMFS is not set
CONFIG_SQUASHFS=y
CONFIG_SQUASHFS_FILE_CACHE=y
# CONFIG_SQUASHFS_FILE_DIRECT is not set
CONFIG_SQUASHFS_DECOMP_SINGLE=y
# CONFIG_SQUASHFS_CHOICE_DECOMP_BY_MOUNT is not set
CONFIG_SQUASHFS_COMPILE_DECOMP_SINGLE=y
# CONFIG_SQUASHFS_COMPILE_DECOMP_MULTI is not set
# CONFIG_SQUASHFS_COMPILE_DECOMP_MULTI_PERCPU is not set
CONFIG_SQUASHFS_XATTR=y
CONFIG_SQUASHFS_ZLIB=y
# CONFIG_SQUASHFS_LZ4 is not set
# CONFIG_SQUASHFS_LZO is not set
# CONFIG_SQUASHFS_XZ is not set
# CONFIG_SQUASHFS_ZSTD is not set
# CONFIG_SQUASHFS_4K_DEVBLK_SIZE is not set
# CONFIG_SQUASHFS_EMBEDDED is not set
CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=3
# CONFIG_VXFS_FS is not set
CONFIG_MINIX_FS=y
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_PSTORE is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_EROFS_FS=y
# CONFIG_EROFS_FS_DEBUG is not set
CONFIG_EROFS_FS_XATTR=y
CONFIG_EROFS_FS_POSIX_ACL=y
CONFIG_EROFS_FS_SECURITY=y
CONFIG_EROFS_FS_ZIP=y
# CONFIG_EROFS_FS_ZIP_LZMA is not set
# CONFIG_EROFS_FS_ZIP_DEFLATE is not set
# CONFIG_EROFS_FS_ONDEMAND is not set
# CONFIG_EROFS_FS_PCPU_KTHREAD is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
# CONFIG_NFS_V2 is not set
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_SWAP is not set
# CONFIG_NFS_V4_1 is not set
# CONFIG_NFS_FSCACHE is not set
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFS_DISABLE_UDP_SUPPORT=y
CONFIG_NFSD=y
# CONFIG_NFSD_V2 is not set
# CONFIG_NFSD_V3_ACL is not set
CONFIG_NFSD_V4=y
# CONFIG_NFSD_BLOCKLAYOUT is not set
# CONFIG_NFSD_SCSILAYOUT is not set
# CONFIG_NFSD_FLEXFILELAYOUT is not set
# CONFIG_NFSD_V4_SECURITY_LABEL is not set
# CONFIG_NFSD_LEGACY_CLIENT_TRACKING is not set
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_RPCSEC_GSS_KRB5=y
# CONFIG_RPCSEC_GSS_KRB5_ENCTYPES_AES_SHA2 is not set
# CONFIG_SUNRPC_DEBUG is not set
CONFIG_CEPH_FS=y
# CONFIG_CEPH_FSCACHE is not set
CONFIG_CEPH_FS_POSIX_ACL=y
# CONFIG_CEPH_FS_SECURITY_LABEL is not set
# CONFIG_CIFS is not set
# CONFIG_SMB_SERVER is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=y
# CONFIG_9P_FSCACHE is not set
CONFIG_9P_FS_POSIX_ACL=y
# CONFIG_9P_FS_SECURITY is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=y
CONFIG_NLS_UCS2_UTILS=y
# CONFIG_DLM is not set
# CONFIG_UNICODE is not set
CONFIG_IO_WQ=y
# end of File systems

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_REQUEST_CACHE is not set
# CONFIG_PERSISTENT_KEYRINGS is not set
# CONFIG_TRUSTED_KEYS is not set
# CONFIG_ENCRYPTED_KEYS is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
# CONFIG_SECURITYFS is not set
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_PATH is not set
# CONFIG_INTEL_TXT is not set
CONFIG_LSM_MMAP_MIN_ADDR=65536
# CONFIG_HARDENED_USERCOPY is not set
# CONFIG_FORTIFY_SOURCE is not set
# CONFIG_STATIC_USERMODEHELPER is not set
CONFIG_SECURITY_SELINUX=y
# CONFIG_SECURITY_SELINUX_BOOTPARAM is not set
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_SIDTAB_HASH_BITS=9
CONFIG_SECURITY_SELINUX_SID2STR_CACHE_SIZE=256
# CONFIG_SECURITY_SELINUX_DEBUG is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_LOADPIN is not set
# CONFIG_SECURITY_YAMA is not set
# CONFIG_SECURITY_SAFESETID is not set
# CONFIG_SECURITY_LOCKDOWN_LSM is not set
# CONFIG_SECURITY_LANDLOCK is not set
# CONFIG_INTEGRITY is not set
# CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_LSM="yama,loadpin,safesetid,integrity,selinux,smack,tomoyo,apparmor"

#
# Kernel hardening options
#

#
# Memory initialization
#
CONFIG_CC_HAS_AUTO_VAR_INIT_PATTERN=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO_BARE=y
CONFIG_CC_HAS_AUTO_VAR_INIT_ZERO=y
CONFIG_INIT_STACK_NONE=y
# CONFIG_INIT_STACK_ALL_PATTERN is not set
# CONFIG_INIT_STACK_ALL_ZERO is not set
# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
CONFIG_CC_HAS_ZERO_CALL_USED_REGS=y
# CONFIG_ZERO_CALL_USED_REGS is not set
# end of Memory initialization

#
# Hardening of kernel data structures
#
# CONFIG_LIST_HARDENED is not set
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# end of Hardening of kernel data structures

CONFIG_RANDSTRUCT_NONE=y
# end of Kernel hardening options
# end of Security options

CONFIG_XOR_BLOCKS=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SIG2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_ENGINE=y
# end of Crypto core or helper

#
# Public-key cryptography
#
CONFIG_CRYPTO_RSA=y
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
# CONFIG_CRYPTO_ECDSA is not set
# CONFIG_CRYPTO_ECRDSA is not set
# CONFIG_CRYPTO_SM2 is not set
# CONFIG_CRYPTO_CURVE25519 is not set
# end of Public-key cryptography

#
# Block ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
# CONFIG_CRYPTO_ARIA is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SM4_GENERIC is not set
# CONFIG_CRYPTO_TWOFISH is not set
# end of Block ciphers

#
# Length-preserving ciphers and modes
#
# CONFIG_CRYPTO_ADIANTUM is not set
CONFIG_CRYPTO_CHACHA20=y
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_HCTR2 is not set
# CONFIG_CRYPTO_KEYWRAP is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
CONFIG_CRYPTO_XTS=y
# end of Length-preserving ciphers and modes

#
# AEAD (authenticated encryption with associated data) ciphers
#
# CONFIG_CRYPTO_AEGIS128 is not set
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
# CONFIG_CRYPTO_CCM is not set
CONFIG_CRYPTO_GCM=y
CONFIG_CRYPTO_GENIV=y
CONFIG_CRYPTO_SEQIV=y
# CONFIG_CRYPTO_ECHAINIV is not set
# CONFIG_CRYPTO_ESSIV is not set
# end of AEAD (authenticated encryption with associated data) ciphers

#
# Hashes, digests, and MACs
#
CONFIG_CRYPTO_BLAKE2B=y
# CONFIG_CRYPTO_CMAC is not set
CONFIG_CRYPTO_GHASH=y
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
CONFIG_CRYPTO_POLY1305=y
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_SHA1 is not set
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
CONFIG_CRYPTO_SHA3=y
# CONFIG_CRYPTO_SM3_GENERIC is not set
# CONFIG_CRYPTO_STREEBOG is not set
# CONFIG_CRYPTO_VMAC is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_XCBC is not set
CONFIG_CRYPTO_XXHASH=y
# end of Hashes, digests, and MACs

#
# CRCs (cyclic redundancy checks)
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32=y
# CONFIG_CRYPTO_CRCT10DIF is not set
# CONFIG_CRYPTO_CRC64_ROCKSOFT is not set
# end of CRCs (cyclic redundancy checks)

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_LZO=y
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set
CONFIG_CRYPTO_ZSTD=y
# end of Compression

#
# Random number generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
# CONFIG_CRYPTO_DRBG_HASH is not set
# CONFIG_CRYPTO_DRBG_CTR is not set
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
CONFIG_CRYPTO_JITTERENTROPY_MEMORY_BLOCKS=64
CONFIG_CRYPTO_JITTERENTROPY_MEMORY_BLOCKSIZE=32
CONFIG_CRYPTO_JITTERENTROPY_OSR=1
# end of Random number generation

#
# Userspace interface
#
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
# end of Userspace interface

CONFIG_CRYPTO_HASH_INFO=y

#
# Accelerated Cryptographic Algorithms for CPU (x86)
#
# CONFIG_CRYPTO_CURVE25519_X86 is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set
# CONFIG_CRYPTO_ARIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_ARIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_ARIA_GFNI_AVX512_X86_64 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_AEGIS128_AESNI_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_AVX2 is not set
# CONFIG_CRYPTO_BLAKE2S_X86 is not set
# CONFIG_CRYPTO_POLYVAL_CLMUL_NI is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
# CONFIG_CRYPTO_SM3_AVX_X86_64 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
# end of Accelerated Cryptographic Algorithms for CPU (x86)

CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_CCP is not set
# CONFIG_CRYPTO_DEV_NITROX_CNN55XX is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_4XXX is not set
# CONFIG_CRYPTO_DEV_QAT_420XX is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
CONFIG_CRYPTO_DEV_VIRTIO=y
# CONFIG_CRYPTO_DEV_SAFEXCEL is not set
# CONFIG_CRYPTO_DEV_AMLOGIC_GXL is not set
# CONFIG_ASYMMETRIC_KEY_TYPE is not set

#
# Certificates for signature checking
#
# CONFIG_SYSTEM_BLACKLIST_KEYRING is not set
# end of Certificates for signature checking

CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=y
# CONFIG_RAID6_PQ_BENCHMARK is not set
# CONFIG_PACKING is not set
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
# CONFIG_CORDIC is not set
# CONFIG_PRIME_NUMBERS is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_ARCH_USE_SYM_ANNOTATIONS=y

#
# Crypto library routines
#
CONFIG_CRYPTO_LIB_UTILS=y
CONFIG_CRYPTO_LIB_AES=y
CONFIG_CRYPTO_LIB_GF128MUL=y
CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y
CONFIG_CRYPTO_LIB_CHACHA_GENERIC=y
# CONFIG_CRYPTO_LIB_CHACHA is not set
# CONFIG_CRYPTO_LIB_CURVE25519 is not set
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11
CONFIG_CRYPTO_LIB_POLY1305_GENERIC=y
# CONFIG_CRYPTO_LIB_POLY1305 is not set
# CONFIG_CRYPTO_LIB_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_LIB_SHA1=y
CONFIG_CRYPTO_LIB_SHA256=y
# end of Crypto library routines

# CONFIG_CRC_CCITT is not set
CONFIG_CRC16=y
# CONFIG_CRC_T10DIF is not set
# CONFIG_CRC64_ROCKSOFT is not set
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
CONFIG_CRC64=y
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
# CONFIG_CRC8 is not set
CONFIG_XXHASH=y
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_COMPRESS=y
CONFIG_LZ4HC_COMPRESS=y
CONFIG_LZ4_DECOMPRESS=y
CONFIG_ZSTD_COMMON=y
CONFIG_ZSTD_COMPRESS=y
CONFIG_ZSTD_DECOMPRESS=y
# CONFIG_XZ_DEC is not set
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_INTERVAL_TREE=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_CLOSURES=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_DMA_OPS=y
CONFIG_NEED_SG_DMA_FLAGS=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_SWIOTLB=y
# CONFIG_SWIOTLB_DYNAMIC is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_DMA_MAP_BENCHMARK is not set
CONFIG_SGL_ALLOC=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
CONFIG_CLZ_TAB=y
# CONFIG_IRQ_POLL is not set
CONFIG_MPILIB=y
CONFIG_DIMLIB=y
CONFIG_OID_REGISTRY=y
CONFIG_UCS2_STRING=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_FONT_SUPPORT=y
CONFIG_FONT_8x16=y
CONFIG_FONT_AUTOSELECT=y
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_MEMREGION=y
CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_ARCH_HAS_COPY_MC=y
CONFIG_ARCH_STACKWALK=y
CONFIG_STACKDEPOT=y
CONFIG_STACKDEPOT_MAX_FRAMES=64
CONFIG_SBITMAP=y
# CONFIG_LWQ_TEST is not set
# end of Library routines

CONFIG_FIRMWARE_TABLE=y

#
# Kernel hacking
#

#
# printk and dmesg options
#
# CONFIG_PRINTK_TIME is not set
# CONFIG_PRINTK_CALLER is not set
# CONFIG_STACKTRACE_BUILD_ID is not set
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=4
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
# CONFIG_BOOT_PRINTK_DELAY is not set
CONFIG_DYNAMIC_DEBUG=y
CONFIG_DYNAMIC_DEBUG_CORE=y
CONFIG_SYMBOLIC_ERRNAME=y
CONFIG_DEBUG_BUGVERBOSE=y
# end of printk and dmesg options

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
CONFIG_AS_HAS_NON_CONST_ULEB128=y
# CONFIG_DEBUG_INFO_NONE is not set
# CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT is not set
# CONFIG_DEBUG_INFO_DWARF4 is not set
CONFIG_DEBUG_INFO_DWARF5=y
# CONFIG_DEBUG_INFO_REDUCED is not set
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
# CONFIG_DEBUG_INFO_COMPRESSED_ZLIB is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
CONFIG_GDB_SCRIPTS=y
CONFIG_FRAME_WARN=2048
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
CONFIG_HEADERS_INSTALL=y
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
CONFIG_OBJTOOL=y
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# end of Compile-time checks and compiler options

#
# Generic Kernel Debugging Instruments
#
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
# CONFIG_DEBUG_FS_DISALLOW_MOUNT is not set
# CONFIG_DEBUG_FS_ALLOW_NONE is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_HONOUR_BLOCKLIST=y
CONFIG_KGDB_SERIAL_CONSOLE=y
# CONFIG_KGDB_TESTS is not set
# CONFIG_KGDB_LOW_LEVEL_TRAP is not set
# CONFIG_KGDB_KDB is not set
CONFIG_ARCH_HAS_EARLY_DEBUG=y
CONFIG_ARCH_HAS_UBSAN=y
# CONFIG_UBSAN is not set
CONFIG_HAVE_ARCH_KCSAN=y
CONFIG_HAVE_KCSAN_COMPILER=y
# CONFIG_KCSAN is not set
# end of Generic Kernel Debugging Instruments

#
# Networking Debugging
#
# CONFIG_NET_DEV_REFCNT_TRACKER is not set
# CONFIG_NET_NS_REFCNT_TRACKER is not set
# CONFIG_DEBUG_NET is not set
# end of Networking Debugging

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_PAGE_OWNER is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_PAGE_REF is not set
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_ARCH_HAS_DEBUG_WX=y
# CONFIG_DEBUG_WX is not set
CONFIG_GENERIC_PTDUMP=y
# CONFIG_PTDUMP_DEBUGFS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_PER_VMA_LOCK_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SHRINKER_DEBUG is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_SCHED_STACK_END_CHECK is not set
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM_IRQSOFF=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_MAPLE_TREE is not set
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VM_PGFLAGS is not set
# CONFIG_DEBUG_VM_PGTABLE is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP=y
# CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is not set
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
# CONFIG_KASAN is not set
CONFIG_HAVE_ARCH_KFENCE=y
# CONFIG_KFENCE is not set
CONFIG_HAVE_ARCH_KMSAN=y
# end of Memory Debugging

# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Oops, Lockups and Hangs
#
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
# CONFIG_SOFTLOCKUP_DETECTOR is not set
CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
# CONFIG_HARDLOCKUP_DETECTOR is not set
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_WQ_WATCHDOG is not set
# CONFIG_WQ_CPU_INTENSIVE_REPORT is not set
# CONFIG_TEST_LOCKUP is not set
# end of Debug Oops, Lockups and Hangs

#
# Scheduler Debugging
#
# CONFIG_SCHED_DEBUG is not set
CONFIG_SCHED_INFO=y
# CONFIG_SCHEDSTATS is not set
# end of Scheduler Debugging

# CONFIG_DEBUG_TIMEKEEPING is not set

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
# CONFIG_PROVE_RAW_LOCK_NESTING is not set
# CONFIG_LOCK_STAT is not set
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

CONFIG_TRACE_IRQFLAGS=y
CONFIG_TRACE_IRQFLAGS_NMI=y
# CONFIG_NMI_CHECK_CPU is not set
# CONFIG_DEBUG_IRQFLAGS is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set

#
# Debug kernel data structures
#
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_PLIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CLOSURES is not set
# CONFIG_DEBUG_MAPLE_TREE is not set
# end of Debug kernel data structures

#
# RCU Debugging
#
CONFIG_PROVE_RCU=y
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
# CONFIG_RCU_CPU_STALL_CPUTIME is not set
CONFIG_RCU_TRACE=y
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
# CONFIG_LATENCYTOP is not set
# CONFIG_DEBUG_CGROUP_REF is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_RETHOOK=y
CONFIG_RETHOOK=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_RETVAL=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_NO_PATCHABLE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_OBJTOOL_MCOUNT=y
CONFIG_HAVE_OBJTOOL_NOP_MCOUNT=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_HAVE_BUILDTIME_MCOUNT_SORT=y
CONFIG_BUILDTIME_MCOUNT_SORT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_BOOTTIME_TRACING is not set
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_FUNCTION_GRAPH_RETVAL=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
# CONFIG_FPROBE is not set
CONFIG_FUNCTION_PROFILER=y
# CONFIG_STACK_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_HWLAT_TRACER is not set
# CONFIG_OSNOISE_TRACER is not set
# CONFIG_TIMERLAT_TRACER is not set
# CONFIG_MMIOTRACE is not set
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
# CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENTS=y
# CONFIG_KPROBE_EVENTS_ON_NOTRACE is not set
CONFIG_UPROBE_EVENTS=y
CONFIG_BPF_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
# CONFIG_BPF_KPROBE_OVERRIDE is not set
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_MCOUNT_USE_CC=y
# CONFIG_SYNTH_EVENTS is not set
# CONFIG_USER_EVENTS is not set
# CONFIG_HIST_TRIGGERS is not set
# CONFIG_TRACE_EVENT_INJECT is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set
# CONFIG_FTRACE_RECORD_RECURSION is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_FTRACE_SORT_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
# CONFIG_KPROBE_EVENT_GEN_TEST is not set
# CONFIG_RV is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
CONFIG_SAMPLES=y
# CONFIG_SAMPLE_AUXDISPLAY is not set
# CONFIG_SAMPLE_TRACE_EVENTS is not set
# CONFIG_SAMPLE_TRACE_CUSTOM_EVENTS is not set
# CONFIG_SAMPLE_TRACE_PRINTK is not set
# CONFIG_SAMPLE_FTRACE_DIRECT is not set
# CONFIG_SAMPLE_FTRACE_DIRECT_MULTI is not set
# CONFIG_SAMPLE_FTRACE_OPS is not set
# CONFIG_SAMPLE_TRACE_ARRAY is not set
# CONFIG_SAMPLE_KOBJECT is not set
# CONFIG_SAMPLE_KPROBES is not set
# CONFIG_SAMPLE_HW_BREAKPOINT is not set
# CONFIG_SAMPLE_KFIFO is not set
# CONFIG_SAMPLE_CONFIGFS is not set
# CONFIG_SAMPLE_CONNECTOR is not set
# CONFIG_SAMPLE_FANOTIFY_ERROR is not set
# CONFIG_SAMPLE_HIDRAW is not set
# CONFIG_SAMPLE_LANDLOCK is not set
# CONFIG_SAMPLE_PIDFD is not set
# CONFIG_SAMPLE_SECCOMP is not set
# CONFIG_SAMPLE_TIMER is not set
# CONFIG_SAMPLE_UHID is not set
# CONFIG_SAMPLE_ANDROID_BINDERFS is not set
CONFIG_SAMPLE_VFS=y
# CONFIG_SAMPLE_TPS6594_PFSM is not set
# CONFIG_SAMPLE_WATCHDOG is not set
# CONFIG_SAMPLE_WATCH_QUEUE is not set
# CONFIG_SAMPLE_CGROUP is not set
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT=y
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT_MULTI=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
# CONFIG_STRICT_DEVMEM is not set

#
# x86 Debugging
#
CONFIG_EARLY_PRINTK_USB=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_EARLY_PRINTK_USB_XDBC is not set
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_TLBFLUSH is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_PUNIT_ATOM_DEBUG is not set
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set
# end of x86 Debugging

#
# Kernel Testing and Coverage
#
# CONFIG_KUNIT is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
CONFIG_FUNCTION_ERROR_INJECTION=y
# CONFIG_FAULT_INJECTION is not set
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
# CONFIG_KCOV is not set
CONFIG_RUNTIME_TESTING_MENU=y
# CONFIG_TEST_DHRY is not set
# CONFIG_LKDTM is not set
# CONFIG_TEST_MIN_HEAP is not set
# CONFIG_TEST_DIV64 is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_TEST_REF_TRACKER is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_REED_SOLOMON_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_TEST_KSTRTOX is not set
# CONFIG_TEST_PRINTF is not set
# CONFIG_TEST_SCANF is not set
# CONFIG_TEST_BITMAP is not set
# CONFIG_TEST_UUID is not set
# CONFIG_TEST_XARRAY is not set
# CONFIG_TEST_MAPLE_TREE is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_IDA is not set
# CONFIG_TEST_LKM is not set
# CONFIG_TEST_BITOPS is not set
# CONFIG_TEST_VMALLOC is not set
# CONFIG_TEST_USER_COPY is not set
# CONFIG_TEST_BPF is not set
# CONFIG_TEST_BLACKHOLE_DEV is not set
# CONFIG_FIND_BIT_BENCHMARK is not set
# CONFIG_TEST_FIRMWARE is not set
# CONFIG_TEST_SYSCTL is not set
# CONFIG_TEST_UDELAY is not set
# CONFIG_TEST_STATIC_KEYS is not set
# CONFIG_TEST_DYNAMIC_DEBUG is not set
# CONFIG_TEST_KMOD is not set
# CONFIG_TEST_MEMCAT_P is not set
# CONFIG_TEST_MEMINIT is not set
# CONFIG_TEST_FREE_PAGES is not set
# CONFIG_TEST_FPU is not set
# CONFIG_TEST_CLOCKSOURCE_WATCHDOG is not set
# CONFIG_TEST_OBJPOOL is not set
CONFIG_ARCH_USE_MEMTEST=y
# CONFIG_MEMTEST is not set
# end of Kernel Testing and Coverage

#
# Rust hacking
#
# end of Rust hacking
# end of Kernel hacking

end of thread, other threads:[~2024-05-24  7:56 UTC | newest]

Thread overview: 105+ messages
2024-02-23 17:41 [RFC PATCH 00/20] Introduce the famfs shared-memory file system John Groves
2024-02-23 17:41 ` [RFC PATCH 01/20] famfs: Documentation John Groves
2024-02-23 17:41 ` [RFC PATCH 02/20] dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage John Groves
2024-02-26 12:05   ` Jonathan Cameron
2024-02-26 15:00     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 03/20] dev_dax_iomap: Move dax_pgoff_to_phys from device.c to bus.c since both need it now John Groves
2024-02-26 12:10   ` Jonathan Cameron
2024-02-26 15:13     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 04/20] dev_dax_iomap: Save the kva from memremap John Groves
2024-02-26 12:21   ` Jonathan Cameron
2024-02-26 15:48     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 05/20] dev_dax_iomap: Add dax_operations for use by fs-dax on devdax John Groves
2024-02-26 12:32   ` Jonathan Cameron
2024-02-26 16:09     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 06/20] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter John Groves
2024-02-26 12:34   ` Jonathan Cameron
2024-02-26 16:12     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 07/20] famfs: Add include/linux/famfs_ioctl.h John Groves
2024-02-24  1:39   ` Randy Dunlap
2024-02-24  2:23     ` John Groves
2024-02-24  3:27       ` Randy Dunlap
2024-02-24 23:32         ` John Groves
2024-02-24 23:40           ` Randy Dunlap
2024-02-26 12:39   ` Jonathan Cameron
2024-02-26 16:44     ` John Groves
2024-02-26 16:56       ` Jonathan Cameron
2024-02-26 18:04         ` John Groves
2024-02-23 17:41 ` [RFC PATCH 08/20] famfs: Add famfs_internal.h John Groves
2024-02-26 12:48   ` Jonathan Cameron
2024-02-26 17:35     ` John Groves
2024-02-27 10:28       ` Jonathan Cameron
2024-02-28  1:06         ` John Groves
2024-02-27 13:38   ` Christian Brauner
2024-02-27 14:12     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 09/20] famfs: Add super_operations John Groves
2024-02-26 12:51   ` Jonathan Cameron
2024-02-26 21:47     ` John Groves
2024-02-27 10:34       ` Jonathan Cameron
2024-02-27 17:48     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 10/20] famfs: famfs_open_device() & dax_holder_operations John Groves
2024-02-26 12:56   ` Jonathan Cameron
2024-02-26 22:22     ` John Groves
2024-02-27 13:39   ` Christian Brauner
2024-02-27 18:38     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 11/20] famfs: Add fs_context_operations John Groves
2024-02-26 13:20   ` Jonathan Cameron
2024-02-26 22:43     ` John Groves
2024-02-27 13:41   ` Christian Brauner
2024-02-28  0:59     ` John Groves
2024-02-28  1:49       ` Randy Dunlap
2024-02-28  8:17         ` Christian Brauner
2024-02-28 10:07       ` Christian Brauner
2024-02-28 12:01         ` Christian Brauner
2024-02-23 17:41 ` [RFC PATCH 12/20] famfs: Add inode_operations and file_system_type John Groves
2024-02-26 13:25   ` Jonathan Cameron
2024-02-26 22:53     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 13/20] famfs: Add iomap_ops John Groves
2024-02-26 13:30   ` Jonathan Cameron
2024-02-26 23:00     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 14/20] famfs: Add struct file_operations John Groves
2024-02-26 13:32   ` Jonathan Cameron
2024-02-26 23:09     ` John Groves
2024-02-23 17:41 ` [RFC PATCH 15/20] famfs: Add ioctl to file_operations John Groves
2024-02-26 13:44   ` Jonathan Cameron
2024-02-23 17:42 ` [RFC PATCH 16/20] famfs: Add fault counters John Groves
2024-02-23 18:23   ` Dave Hansen
2024-02-23 19:56     ` John Groves
2024-02-23 20:04       ` Dan Williams
2024-02-23 20:39         ` John Groves
2024-02-23 21:19           ` Dave Hansen
2024-02-23 23:50             ` Dan Williams
2024-02-24  3:59               ` Matthew Wilcox
2024-02-24  4:30                 ` Dan Williams
2024-02-23 17:42 ` [RFC PATCH 17/20] famfs: Add module stuff John Groves
2024-02-26 13:47   ` Jonathan Cameron
2024-02-27 22:15     ` John Groves
2024-02-23 17:42 ` [RFC PATCH 18/20] famfs: Support character dax via the dev_dax_iomap patch John Groves
2024-02-26 13:52   ` Jonathan Cameron
2024-02-27 22:27     ` John Groves
2024-02-23 17:42 ` [RFC PATCH 19/20] famfs: Update MAINTAINERS file John Groves
2024-02-23 17:42 ` [RFC PATCH 20/20] famfs: Add Kconfig and Makefile plumbing John Groves
2024-02-24  1:50   ` Randy Dunlap
2024-02-24  2:24     ` John Groves
2024-02-24  0:07 ` [RFC PATCH 00/20] Introduce the famfs shared-memory file system Luis Chamberlain
2024-02-26 13:27   ` John Groves
2024-02-26 15:53     ` Luis Chamberlain
2024-02-26 21:16       ` John Groves
2024-02-27  0:58         ` Luis Chamberlain
2024-02-27  2:05           ` John Groves
2024-02-29  2:15             ` Dave Chinner
2024-02-29 14:52               ` John Groves
2024-03-11  1:29                 ` Dave Chinner
2024-02-29  6:52 ` Amir Goldstein
2024-02-29 22:16   ` John Groves
2024-05-17  9:55   ` Miklos Szeredi
2024-05-19  5:59     ` Amir Goldstein
2024-05-22  2:05       ` John Groves
2024-05-22  8:58         ` Miklos Szeredi
2024-05-22 10:16           ` Amir Goldstein
2024-05-22 11:28             ` Miklos Szeredi
2024-05-22 13:41               ` Amir Goldstein
2024-05-23  2:49           ` John Groves
2024-05-23 13:57             ` Miklos Szeredi
2024-05-24  0:47               ` John Groves
2024-05-24  7:55                 ` Miklos Szeredi
