linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)
       [not found] <1b1bc25eb87355b91fcde1de7c2f93f38abb2bf9>
@ 2023-10-16 23:32 ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 01/10] mm/prmem: Allocate memory during boot for storing persistent data madvenka
                     ` (10 more replies)
  0 siblings, 11 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Introduction
============

This feature can be used to persist kernel and user data across kexec reboots
in RAM for various uses. E.g., persisting:

	- cached data. E.g., database caches.
	- state. E.g., KVM guest states.
	- historical information since the last cold boot. E.g., events, logs
	  and journals.
	- measurements for integrity checks on the next boot.
	- driver data.
	- IOMMU mappings.
	- MMIO config information.

This is useful on systems that have no non-volatile storage, or where the
non-volatile storage is too small or too slow.

The following sections describe the implementation.

I have enhanced the ram disk block device driver to provide persistent ram
disks on which any filesystem can be created. This is for persisting user data.
I have also implemented DAX support for the persistent ram disks.

I am also working on making ZRAM persistent.

I have also briefly discussed the following use cases:

	- Persisting IOMMU mappings
	- Remembering DMA pages
	- Reserving pages that encounter memory errors
	- Remembering IMA measurements for integrity checks
	- Remembering MMIO config info
	- Implementing prmemfs (special filesystem tailored for persistence)

Allocate metadata
=================

Define a metadata structure to store all persistent memory related information.
The metadata fits into one page. On a cold boot, allocate and initialize the
metadata page.

Allocate data
=============

On a cold boot, allocate some memory for storing persistent data. Call it
persistent memory. Specify the size in a command line parameter:

	prmem=size[KMG][,max_size[KMG]]

	size		Initial amount of memory allocated to prmem during boot
	max_size	Maximum amount of memory that can be allocated to prmem

When the initial memory is exhausted via allocations, expand prmem dynamically
up to max_size. Expansion is done by allocating from the buddy allocator.
Record all allocations in the metadata.

Remember the metadata
=====================

On all (kexec) reboots, remember the metadata page address. This is done via
a new kernel command line parameter:

	prmem_meta=address

When a kexec image is loaded, the kexec command line is set up. Append the
above parameter to the command line automatically.

In early boot, extract the metadata page address from the command line and
reserve the metadata page. From the metadata, get the persistent memory that
has been allocated before and reserve it as well.

Manage persistent memory
========================

Manage persistent memory with the Gen Pool allocator (lib/genalloc.c). This
is so we don't have to implement a new allocator. Make the Gen Pool
persistent so allocations can be remembered across kexecs.

Provide functions for allocating and freeing persistent memory. These are
just wrappers around the Gen Pool functions:

  	prmem_alloc_pages()	(looks like alloc_pages())
	prmem_free_pages()	(looks like __free_pages())
	prmem_alloc()		(looks like kmalloc())
	prmem_free()		(looks like kfree())

Create persistent instances
===========================

Consumers store information in the form of data structures. To persist a data
structure across a kexec, a consumer has to do two things:

	1. Allocate persistent memory for the data structure.

	2. Remember the data structure in a named persistent instance.

A persistent instance has the following attributes:

	Subsystem name    Name of the subsystem/module/driver that created the
			  instance. E.g., "ramdisk" for the ramdisk driver.
	Instance name     Name of the instance within the subsystem. E.g.,
			  "pram0" for a persistent ram disk.
	Data		  Pointer to instance data.
	Size		  Size of instance data.

Provide functions to create and manage persistent instances:

	prmem_get()		Get/Create a persistent instance.
	prmem_set_data()	Record the instance data pointer and size.
	prmem_get_data()	Retrieve the instance data pointer and size.
	prmem_put()		Destroy a persistent instance.
	prmem_list()		Enumerate the instances of a subsystem.

Complex data structures
=======================

A persistent instance may have more than one data structure to remember across
kexec.

Data structures can be connected to other data structures using pointers,
arrays, linked lists, RB trees, etc. As long as each structure is placed in
persistent memory, the whole set of data structures can be remembered
across a kexec.

It is expected that a consumer will create a top level data structure for
an instance from which all other data structures belonging to the instance
can be reached. So, only the top level data structure needs to be registered
as instance data.

Linked list nodes and RB nodes are embedded in data structures. So, persisting
linked lists and RB trees is straightforward. But the XArray needs a little
more work. The XArray itself can be embedded in a persistent data structure.
But the XA nodes are currently allocated from normal memory using the kmem
allocator. Enhance XArrays to include a persistent option so that the XA nodes
as well can be allocated from persistent memory. Then, the whole XArray becomes
persistent.

Since Radix Trees are implemented with XArrays, we get persistent Radix
Trees as well.

The ram disk uses an XArray. Some other use cases can also use an XArray.

Persistent virtual addresses
============================

Apart from consumer data structures, Prmem metadata structures must be
persisted as well. In either case, data structures point to one another
using virtual addresses.

To keep the implementation simple, the virtual addresses used within persistent
memory must not change on a kexec. The alternative is to remap everything on
each kexec. This can be complex and cumbersome.

prmem uses direct map addresses for this reason. However, if PAGE_OFFSET is
randomized by KASLR, this will not work. Until I find an alternative for this,
prmem is currently not supported if kernel memory randomization is enabled.
prmem checks for this at runtime and disables itself. So, for now, include
"nokaslr" in the command line to allow prmem.

Note that kernel text randomization does not affect prmem. So, if an
architecture does not support randomization of PAGE_OFFSET, then there is
no need to include "nokaslr" in the command line.

Validation of metadata
======================

The metadata must be validated on a kexec before it can be used. To allow this,
compute a checksum on the metadata just before the kexec reboot and store it in
the metadata.

After kexec, in early boot, use the checksum to validate the metadata. If the
validation fails, discard the metadata. Treat it as a cold boot. That is,
allocate a new metadata page and initial region and start over.

This means that all persistent data will be lost on a validation failure.

Dynamic Expansion
=================

For some use cases, it may be hard to predict how much memory is actually
needed to store persistent data, as this may depend on the workload. Either
we would have to overcommit memory for persistent data, or we could allow
dynamic expansion of prmem memory.

Implement dynamic expansion of prmem. When there is no free persistent memory,
call alloc_pages() to allocate a max order page. Add it to prmem.

Choosing a max order page means that no fragmentation is created for
transparent huge pages or kmem slabs. But fragmentation may be created for
1GB pages. This is not a problem for 1GB pages that are reserved up front
during boot. This could be a problem for 1GB pages that are allocated at run
time dynamically.

As mentioned before, dynamic expansion is optional. If a max_size is not
specified in the command line, then dynamic expansion does not happen.

Persistent Ramdisks
===================

I have implemented one main use case in this patchset - persistent ram disks.
Any filesystem can be installed on a persistent ram disk. User data can be
persisted on the filesystem.

One problem with using a ramdisk is that the page cache will contain redundant
copies of ramdisk pages. To avoid this, I have implemented DAX support for
persistent ramdisks. To use it, install a filesystem with DAX support on the
ram disks.

Normal ramdisk devices are named "ram0", "ram1", "ram2", etc. Persistent
ramdisk devices will be named "pram0", "pram1", "pram2", etc.

For normal ramdisks, ramdisk pages are allocated using alloc_pages(). For
persistent ones, ramdisk pages are allocated using prmem_alloc_pages().

Each ram disk has a device structure (struct brd_device). This is allocated
from kmem for normal ram disks and from persistent memory for persistent ram
disks. This becomes the instance data. This structure contains an XArray
of pages allocated to the ram disk. A persistent XArray will be used.

The disk size for all normal ramdisks is specified via a module parameter
"rd_size". This forces all of the ramdisks to have the same size. For
persistent ram disks, take a different approach. Define a module parameter
called "prd_sizes" which specifies a comma-separated list of sizes. The
sizes are applied in the order in which they appear to "pram0", "pram1",
etc.

	Persistent Ram Disk Usage:

	sudo modprobe brd prd_sizes="1G,2G"

		This creates two persistent ram disks with the specified sizes.
		That is, /dev/pram0 will have a size of 1G. /dev/pram1 will
		have a size of 2G.

	sudo mkfs.ext4 /dev/pram0
	sudo mkfs.ext4 /dev/pram1

		Make filesystems on the persistent ram disks.

	sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
	sudo mount -t ext4 -o dax /dev/pram1 /path/to/mountpoint1

		Mount them somewhere. Note that the -o dax option enables
		DAX.

	sudo umount /path/to/mountpoint0
	sudo umount /path/to/mountpoint1

		Unmount the filesystems.

On subsequent kexecs, you can load the module with or without specifying the
sizes. The previous devices and sizes will be remembered. After that, simply
mount the filesystems and use them.

	sudo modprobe brd
	sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
	sudo mount -t ext4 -o dax /dev/pram1 /path/to/mountpoint1

The persistent ramdisk devices are destroyed when the module is explicitly
unloaded (rmmod). But if a reboot happens without the module unload, the
devices are persisted.

Other use cases
===============

I believe that it is possible to implement most use cases. I have listed some
examples below. I am not an expert in these areas. These are just suggestions.
Please let me know if there are any mistakes. Comments are most welcome.

- IOMMU mappings
	The IOVA to PFN mappings can be remembered using a persistent XArray.

- DMA pages
	Someone mentioned this use case. IIUC, DMA operations may be in flight
	when a kexec happens. Instead of waiting for the DMA operations to
	complete, drivers could remember the DMA pages in a persistent XArray.
	Then, in early boot, retrieve the XArray from prmem and reserve those
	individual pages early. Once the DMA operations complete, the pages can
	be removed from the XArray and freed into the buddy allocator.

- Pages that encounter memory errors
	These could be remembered in a persistent XArray. Then, in early boot,
	retrieve the XArray from prmem and reserve the pages so they are never
	used.

- IMA
	IMA tries to remember measurements across a kexec so that integrity
	checks can be performed on a kexec reboot. Currently, IIUC, IMA
	uses a kexec buffer to remember measurements. However, the buffer
	has to be allocated up front when the kexec image is loaded. If the gap
	between loading a kexec image and executing it is large, the
	measurements that come in during that time may not fit into the
	pre-allocated buffer.

	The solution could be to remember measurements using prmem. I am
	working on this. I will add this in a future version of this patchset.

- ZRAM
	The ZRAM block device is a candidate for persistence. This is still
	work in progress. I will add this in a future version of this patchset
	once I get it working.

- MMIO
	I am not familiar with what exactly needs to be persisted for this.
	I will state my understanding of the use case. Please correct me if
	I am wrong. IIUC, during PCI discovery, I/O devices are enumerated,
	memory space allocation is done and the I/O devices are configured.
	If the enumerated devices and their configuration can be remembered
	across kexec, then the discovery phase can be skipped after kexec.
	This will speed up PCI init.

	I believe the MMIO config info can be persisted using prmem.

- prmemfs
	It may be simpler and more efficient if we could implement a special
	filesystem that is tailored for persistence. We don't have to support
	anything that is not required for persistent data. E.g., FIFOs,
	special files, hard links, using the page cache, etc. When files are
	deleted, the memory can be freed back into prmem.

	The instance data for the filesystem would be the superblock. The
	following need to be allocated from persistent memory - the superblock,
	the inodes and the data pages. The data pages can be remembered in a
	persistent XArray.

	I am looking into this as well.

TBD
===

- Reservations.
	Consumers must be able to reserve persistent memory to guarantee
	sizes for their instances. E.g., for a persistent ramdisk.

- NUMA support.

- Memory Leak detection.
	Something similar to kmemleak may need to be implemented to detect
	memory leaks in persistent memory.

---

Madhavan T. Venkataraman (10):
  mm/prmem: Allocate memory during boot for storing persistent data
  mm/prmem: Reserve metadata and persistent regions in early boot after
    kexec
  mm/prmem: Manage persistent memory with the gen pool allocator.
  mm/prmem: Implement a page allocator for persistent memory
  mm/prmem: Implement a buffer allocator for persistent memory
  mm/prmem: Implement persistent XArray (and Radix Tree)
  mm/prmem: Implement named Persistent Instances.
  mm/prmem: Implement Persistent Ramdisk instances.
  mm/prmem: Implement DAX support for Persistent Ramdisks.
  mm/prmem: Implement dynamic expansion of prmem.

 arch/x86/kernel/kexec-bzimage64.c |   5 +-
 arch/x86/kernel/setup.c           |   4 +
 drivers/block/Kconfig             |  11 +
 drivers/block/brd.c               | 320 ++++++++++++++++++++++++++++--
 include/linux/genalloc.h          |   6 +
 include/linux/memblock.h          |   2 +
 include/linux/prmem.h             | 158 +++++++++++++++
 include/linux/radix-tree.h        |   4 +
 include/linux/xarray.h            |  15 ++
 kernel/Makefile                   |   1 +
 kernel/prmem/Makefile             |   4 +
 kernel/prmem/prmem_allocator.c    | 222 +++++++++++++++++++++
 kernel/prmem/prmem_init.c         |  48 +++++
 kernel/prmem/prmem_instance.c     | 139 +++++++++++++
 kernel/prmem/prmem_misc.c         |  86 ++++++++
 kernel/prmem/prmem_parse.c        |  80 ++++++++
 kernel/prmem/prmem_region.c       |  87 ++++++++
 kernel/prmem/prmem_reserve.c      | 125 ++++++++++++
 kernel/reboot.c                   |   2 +
 lib/genalloc.c                    |  45 +++--
 lib/radix-tree.c                  |  49 ++++-
 lib/xarray.c                      |  11 +-
 mm/memblock.c                     |  12 ++
 mm/mm_init.c                      |   2 +
 24 files changed, 1400 insertions(+), 38 deletions(-)
 create mode 100644 include/linux/prmem.h
 create mode 100644 kernel/prmem/Makefile
 create mode 100644 kernel/prmem/prmem_allocator.c
 create mode 100644 kernel/prmem/prmem_init.c
 create mode 100644 kernel/prmem/prmem_instance.c
 create mode 100644 kernel/prmem/prmem_misc.c
 create mode 100644 kernel/prmem/prmem_parse.c
 create mode 100644 kernel/prmem/prmem_region.c
 create mode 100644 kernel/prmem/prmem_reserve.c


base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
-- 
2.25.1



* [RFC PATCH v1 01/10] mm/prmem: Allocate memory during boot for storing persistent data
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 02/10] mm/prmem: Reserve metadata and persistent regions in early boot after kexec madvenka
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Introduce the "Persistent-Across-Kexec memory (prmem)" feature that allows
user and kernel data to be persisted across kexecs.

The first step is to set aside some memory for storing persistent data.
Introduce a new kernel command line parameter for this:

	prmem=size[KMG]

Allocate this memory from memblocks during boot. Make sure that the
allocation is done late enough so it does not interfere with any fixed
range allocations.

Define a "prmem_region" structure to store the range that is allocated. The
region structure will be used to manage the memory.

Define a "prmem" structure for storing persistence metadata.

Allocate a metadata page to contain the metadata structure. Initialize the
metadata. Add the initial region to a region list in the metadata.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/x86/kernel/setup.c      |  2 +
 include/linux/prmem.h        | 76 ++++++++++++++++++++++++++++++++++++
 kernel/Makefile              |  1 +
 kernel/prmem/Makefile        |  3 ++
 kernel/prmem/prmem_init.c    | 27 +++++++++++++
 kernel/prmem/prmem_parse.c   | 33 ++++++++++++++++
 kernel/prmem/prmem_region.c  | 21 ++++++++++
 kernel/prmem/prmem_reserve.c | 56 ++++++++++++++++++++++++++
 mm/mm_init.c                 |  2 +
 9 files changed, 221 insertions(+)
 create mode 100644 include/linux/prmem.h
 create mode 100644 kernel/prmem/Makefile
 create mode 100644 kernel/prmem/prmem_init.c
 create mode 100644 kernel/prmem/prmem_parse.c
 create mode 100644 kernel/prmem/prmem_region.c
 create mode 100644 kernel/prmem/prmem_reserve.c

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fd975a4a5200..f2b13b3d3ead 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -25,6 +25,7 @@
 #include <linux/static_call.h>
 #include <linux/swiotlb.h>
 #include <linux/random.h>
+#include <linux/prmem.h>
 
 #include <uapi/linux/mount.h>
 
@@ -1231,6 +1232,7 @@ void __init setup_arch(char **cmdline_p)
 	 * won't consume hotpluggable memory.
 	 */
 	reserve_crashkernel();
+	prmem_reserve();
 
 	memblock_find_dma_reserve();
 
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
new file mode 100644
index 000000000000..7f22016c4ad2
--- /dev/null
+++ b/include/linux/prmem.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Persistent-Across-Kexec memory (prmem) - Definitions.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#ifndef _LINUX_PRMEM_H
+#define _LINUX_PRMEM_H
+/*
+ * The prmem feature can be used to persist kernel and user data across kexec
+ * reboots in memory for various uses. E.g.,
+ *
+ *	- Saving cached data. E.g., database caches.
+ *	- Saving state. E.g., KVM guest states.
+ *	- Saving historical information since the last cold boot such as
+ *	  events, logs and journals.
+ *	- Saving measurements for integrity checks on the next boot.
+ *	- Saving driver data.
+ *	- Saving IOMMU mappings.
+ *	- Saving MMIO config information.
+ *
+ * This is useful on systems where there is no non-volatile storage or
+ * non-volatile storage is too slow.
+ */
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/memblock.h>
+#include <linux/printk.h>
+
+#include <asm-generic/errno.h>
+#include <asm/page.h>
+#include <asm/setup.h>
+/*
+ * A prmem region supplies the memory for storing persistent data.
+ *
+ * node		List node.
+ * pa		Physical address of the region.
+ * size		Size of the region in bytes.
+ */
+struct prmem_region {
+	struct list_head	node;
+	unsigned long		pa;
+	size_t			size;
+};
+
+/*
+ * PRMEM metadata.
+ *
+ * metadata	Physical address of the metadata page.
+ * size		Size of initial memory allocated to prmem.
+ *
+ * regions	List of memory regions.
+ */
+struct prmem {
+	unsigned long		metadata;
+	size_t			size;
+
+	/* Persistent Regions. */
+	struct list_head	regions;
+};
+
+extern struct prmem		*prmem;
+extern unsigned long		prmem_metadata;
+extern unsigned long		prmem_pa;
+extern size_t			prmem_size;
+
+/* Kernel API. */
+void prmem_reserve(void);
+void prmem_init(void);
+
+/* Internal functions. */
+struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
+
+#endif /* _LINUX_PRMEM_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 3947122d618b..43b485b0467a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@ obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += prmem/
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
new file mode 100644
index 000000000000..11a53d49312a
--- /dev/null
+++ b/kernel/prmem/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
new file mode 100644
index 000000000000..97b550252028
--- /dev/null
+++ b/kernel/prmem/prmem_init.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Persistent-Across-Kexec memory (prmem) - Initialization.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+bool			prmem_inited;
+
+void __init prmem_init(void)
+{
+	if (!prmem)
+		return;
+
+	if (!prmem->metadata) {
+		/* Cold boot. */
+		prmem->metadata = prmem_metadata;
+		prmem->size = prmem_size;
+		INIT_LIST_HEAD(&prmem->regions);
+
+		if (!prmem_add_region(prmem_pa, prmem_size))
+			return;
+	}
+	prmem_inited = true;
+}
diff --git a/kernel/prmem/prmem_parse.c b/kernel/prmem/prmem_parse.c
new file mode 100644
index 000000000000..191655b53545
--- /dev/null
+++ b/kernel/prmem/prmem_parse.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Persistent-Across-Kexec memory (prmem) - Process prmem cmdline parameter.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+/*
+ * Syntax: prmem=size[KMG]
+ *
+ *	Specifies the size of the initial memory to be allocated to prmem.
+ */
+static int __init prmem_size_parse(char *cmdline)
+{
+	char			*tmp, *cur = cmdline;
+	unsigned long		size;
+
+	if (!cur)
+		return -EINVAL;
+
+	/* Get initial size. */
+	size = memparse(cur, &tmp);
+	if (cur == tmp || !size || size & (PAGE_SIZE - 1)) {
+		pr_warn("%s: Incorrect size %lx\n", __func__, size);
+		return -EINVAL;
+	}
+
+	prmem_size = size;
+	return 0;
+}
+early_param("prmem", prmem_size_parse);
diff --git a/kernel/prmem/prmem_region.c b/kernel/prmem/prmem_region.c
new file mode 100644
index 000000000000..8254dafcee13
--- /dev/null
+++ b/kernel/prmem/prmem_region.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Persistent-Across-Kexec memory (prmem) - Regions.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
+{
+	struct prmem_region	*region;
+
+	/* Allocate region structure from the base of the region itself. */
+	region = __va(pa);
+	region->pa = pa;
+	region->size = size;
+
+	list_add_tail(&region->node, &prmem->regions);
+	return region;
+}
diff --git a/kernel/prmem/prmem_reserve.c b/kernel/prmem/prmem_reserve.c
new file mode 100644
index 000000000000..e20e31a61d12
--- /dev/null
+++ b/kernel/prmem/prmem_reserve.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Persistent-Across-Kexec memory (prmem) - Reserve memory.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+struct prmem		*prmem;
+unsigned long		prmem_metadata;
+unsigned long		prmem_pa;
+size_t			prmem_size;
+
+void __init prmem_reserve(void)
+{
+	BUILD_BUG_ON(sizeof(*prmem) > PAGE_SIZE);
+
+	if (!prmem_size)
+		return;
+
+	/*
+	 * prmem uses direct map addresses. If PAGE_OFFSET is randomized,
+	 * these addresses will change across kexecs. Persistence cannot
+	 * be supported.
+	 */
+	if (kaslr_memory_enabled()) {
+		pr_warn("%s: Cannot support persistence because of KASLR.\n",
+			__func__);
+		return;
+	}
+
+	/* Allocate a metadata page. */
+	prmem_metadata = memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+	if (!prmem_metadata) {
+		pr_warn("%s: Could not allocate metadata at %lx\n", __func__,
+			prmem_metadata);
+		return;
+	}
+
+	/* Allocate initial memory. */
+	prmem_pa = memblock_phys_alloc(prmem_size, PAGE_SIZE);
+	if (!prmem_pa) {
+		pr_warn("%s: Could not allocate initial memory\n", __func__);
+		goto free_metadata;
+	}
+
+	/* Clear metadata. */
+	prmem = __va(prmem_metadata);
+	memset(prmem, 0, sizeof(*prmem));
+	return;
+
+free_metadata:
+	memblock_phys_free(prmem_metadata, PAGE_SIZE);
+	prmem = NULL;
+}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index a1963c3322af..f12757829281 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -24,6 +24,7 @@
 #include <linux/page_ext.h>
 #include <linux/pti.h>
 #include <linux/pgtable.h>
+#include <linux/prmem.h>
 #include <linux/swap.h>
 #include <linux/cma.h>
 #include "internal.h"
@@ -2804,4 +2805,5 @@ void __init mm_core_init(void)
 	pti_init();
 	kmsan_init_runtime();
 	mm_cache_init();
+	prmem_init();
 }
-- 
2.25.1



* [RFC PATCH v1 02/10] mm/prmem: Reserve metadata and persistent regions in early boot after kexec
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 01/10] mm/prmem: Allocate memory during boot for storing persistent data madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 03/10] mm/prmem: Manage persistent memory with the gen pool allocator madvenka
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Currently, only one memory region is given to prmem to store persistent
data. In the future, regions may be added dynamically.

The prmem metadata and the regions need to be reserved during early boot
after a kexec. For this to happen, the kernel must know where the metadata
is. To allow this, introduce a kernel command line parameter:

	prmem_meta=metadata_address

When a kexec image is loaded into the kernel, add this parameter to the
kexec cmdline. Upon a kexec boot, get the metadata page from the cmdline
and reserve it. Then, walk the list of regions in the metadata and reserve
the regions.

Note that the cmdline modification is done automatically within the kernel.
Userland does not have to do anything.

The metadata needs to be validated before it can be used. To allow this,
compute a checksum on the metadata and store it in the metadata at the end
of shutdown. During early boot, validate the metadata with the checksum.

If the validation fails, discard the metadata. Treat it as a cold boot.
That is, allocate a new metadata page and initial region and start over.
Similarly, if the reservation of the regions fails, treat it as a cold
boot and start over.

This means that all persistent data will be lost on any of these failures.
Note that there will be no memory leak when this happens.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/x86/kernel/kexec-bzimage64.c |  5 +-
 arch/x86/kernel/setup.c           |  2 +
 include/linux/memblock.h          |  2 +
 include/linux/prmem.h             | 11 ++++
 kernel/prmem/Makefile             |  2 +-
 kernel/prmem/prmem_init.c         |  9 ++++
 kernel/prmem/prmem_misc.c         | 85 +++++++++++++++++++++++++++++++
 kernel/prmem/prmem_parse.c        | 29 +++++++++++
 kernel/prmem/prmem_reserve.c      | 70 ++++++++++++++++++++++++-
 kernel/reboot.c                   |  2 +
 mm/memblock.c                     | 12 +++++
 11 files changed, 226 insertions(+), 3 deletions(-)
 create mode 100644 kernel/prmem/prmem_misc.c

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index a61c12c01270..a19f172be410 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -18,6 +18,7 @@
 #include <linux/mm.h>
 #include <linux/efi.h>
 #include <linux/random.h>
+#include <linux/prmem.h>
 
 #include <asm/bootparam.h>
 #include <asm/setup.h>
@@ -82,6 +83,8 @@ static int setup_cmdline(struct kimage *image, struct boot_params *params,
 
 	cmdline_ptr[cmdline_len - 1] = '\0';
 
+	prmem_cmdline(cmdline_ptr);
+
 	pr_debug("Final command line is: %s\n", cmdline_ptr);
 	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
 	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
@@ -458,7 +461,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
 	 */
 	efi_map_sz = efi_get_runtime_map_size();
 	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
-				MAX_ELFCOREHDR_STR_LEN;
+				MAX_ELFCOREHDR_STR_LEN + prmem_cmdline_size();
 	params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
 	kbuf.bufsz = params_cmdline_sz + ALIGN(efi_map_sz, 16) +
 				sizeof(struct setup_data) +
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f2b13b3d3ead..22f5cd494291 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1137,6 +1137,8 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	efi_reserve_boot_services();
 
+	prmem_reserve_early();
+
 	/* preallocate 4k for mptable mpc */
 	e820__memblock_alloc_reserved_mpc_new();
 
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f71ff9f0ec81..584bbb884c8e 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -114,6 +114,8 @@ int memblock_add(phys_addr_t base, phys_addr_t size);
 int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_phys_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
+void memblock_unreserve(phys_addr_t base, phys_addr_t size);
+
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
 #endif
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 7f22016c4ad2..bc8054a86f49 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -48,12 +48,16 @@ struct prmem_region {
 /*
  * PRMEM metadata.
  *
+ * checksum	Just before reboot, a checksum is computed on the metadata. On
+ *		the next kexec reboot, the metadata is validated with the
+ *		checksum to make sure that the metadata has not been corrupted.
  * metadata	Physical address of the metadata page.
  * size		Size of initial memory allocated to prmem.
  *
  * regions	List of memory regions.
  */
 struct prmem {
+	unsigned long		checksum;
 	unsigned long		metadata;
 	size_t			size;
 
@@ -65,12 +69,19 @@ extern struct prmem		*prmem;
 extern unsigned long		prmem_metadata;
 extern unsigned long		prmem_pa;
 extern size_t			prmem_size;
+extern bool			prmem_inited;
 
 /* Kernel API. */
+void prmem_reserve_early(void);
 void prmem_reserve(void);
 void prmem_init(void);
+void prmem_fini(void);
+int  prmem_cmdline_size(void);
 
 /* Internal functions. */
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
+unsigned long prmem_checksum(void *start, size_t size);
+bool __init prmem_validate(void);
+void prmem_cmdline(char *cmdline);
 
 #endif /* _LINUX_PRMEM_H */
diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
index 11a53d49312a..9b0a693bfee1 100644
--- a/kernel/prmem/Makefile
+++ b/kernel/prmem/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o
+obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o prmem_misc.o
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 97b550252028..9cea1cd3b6a5 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -25,3 +25,12 @@ void __init prmem_init(void)
 	}
 	prmem_inited = true;
 }
+
+void prmem_fini(void)
+{
+	if (!prmem_inited)
+		return;
+
+	/* Compute checksum over the metadata. */
+	prmem->checksum = prmem_checksum(prmem, sizeof(*prmem));
+}
diff --git a/kernel/prmem/prmem_misc.c b/kernel/prmem/prmem_misc.c
new file mode 100644
index 000000000000..49b6a7232c1a
--- /dev/null
+++ b/kernel/prmem/prmem_misc.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Persistent-Across-Kexec memory (prmem) - Miscellaneous functions.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+#define MAX_META_LENGTH	31
+
+/*
+ * On a kexec, modify the kernel command line to include the boot parameter
+ * "prmem_meta=" so that the metadata can be found on the next boot. If the
+ * parameter is already present in cmdline, overwrite it. Else, add it.
+ */
+void prmem_cmdline(char *cmdline)
+{
+	char		meta[MAX_META_LENGTH], *str;
+	unsigned long	metadata;
+
+	metadata = prmem_inited ? prmem->metadata : 0;
+	snprintf(meta, MAX_META_LENGTH, " prmem_meta=0x%.16lx", metadata);
+
+	str = strstr(cmdline, " prmem_meta");
+	if (str) {
+		/*
+		 * Boot parameter already exists. Overwrite it. We deliberately
+		 * use strncpy() and rely on the fact that it will not
+		 * NUL-terminate the copy.
+		 */
+		strncpy(str, meta, MAX_META_LENGTH - 1);
+		return;
+	}
+	if (prmem_inited) {
+		/* Boot parameter does not exist. Add it. */
+		strcat(cmdline, meta);
+	}
+}
+
+/*
+ * Make sure that the kexec command line can accommodate the prmem_meta
+ * command line parameter.
+ */
+int prmem_cmdline_size(void)
+{
+	return MAX_META_LENGTH;
+}
+
+unsigned long prmem_checksum(void *start, size_t size)
+{
+	unsigned long	checksum = 0;
+	unsigned long	*ptr;
+	void		*end;
+
+	end = start + size;
+	for (ptr = start; (void *) ptr < end; ptr++)
+		checksum += *ptr;
+	return checksum;
+}
+
+/*
+ * Check if the metadata is sane. It would not be sane on a cold boot or if the
+ * metadata has been corrupted. In the latter case, we treat it as a cold boot.
+ */
+bool __init prmem_validate(void)
+{
+	unsigned long		checksum;
+
+	/* Sanity check the boot parameter. */
+	if (prmem_metadata != prmem->metadata || prmem_size != prmem->size) {
+		pr_warn("%s: Boot parameter mismatch\n", __func__);
+		return false;
+	}
+
+	/* Compute and check the checksum of the metadata. */
+	checksum = prmem->checksum;
+	prmem->checksum = 0;
+
+	if (checksum != prmem_checksum(prmem, sizeof(*prmem))) {
+		pr_warn("%s: Checksum mismatch\n", __func__);
+		return false;
+	}
+	return true;
+}
diff --git a/kernel/prmem/prmem_parse.c b/kernel/prmem/prmem_parse.c
index 191655b53545..6c1a23c6b84e 100644
--- a/kernel/prmem/prmem_parse.c
+++ b/kernel/prmem/prmem_parse.c
@@ -31,3 +31,32 @@ static int __init prmem_size_parse(char *cmdline)
 	return 0;
 }
 early_param("prmem", prmem_size_parse);
+
+/*
+ * Syntax: prmem_meta=metadata_address
+ *
+ *	Specifies the address of a single page where the prmem metadata resides.
+ *
+ * On a kexec, the following will be appended to the kernel command line -
+ * "prmem_meta=metadata_address". This is so that the metadata can be located
+ * easily on kexec reboots.
+ */
+static int __init prmem_meta_parse(char *cmdline)
+{
+	char			*tmp, *cur = cmdline;
+	unsigned long		addr;
+
+	if (!cur)
+		return -EINVAL;
+
+	/* Get metadata address. */
+	addr = memparse(cur, &tmp);
+	if (cur == tmp || addr & (PAGE_SIZE - 1)) {
+		pr_warn("%s: Incorrect address %lx\n", __func__, addr);
+		return -EINVAL;
+	}
+
+	prmem_metadata = addr;
+	return 0;
+}
+early_param("prmem_meta", prmem_meta_parse);
diff --git a/kernel/prmem/prmem_reserve.c b/kernel/prmem/prmem_reserve.c
index e20e31a61d12..8000fff05402 100644
--- a/kernel/prmem/prmem_reserve.c
+++ b/kernel/prmem/prmem_reserve.c
@@ -12,11 +12,79 @@ unsigned long		prmem_metadata;
 unsigned long		prmem_pa;
 unsigned long		prmem_size;
 
+void __init prmem_reserve_early(void)
+{
+	struct prmem_region	*region;
+	unsigned long		nregions;
+
+	/* Need to specify an initial size to enable prmem. */
+	if (!prmem_size)
+		return;
+
+	/* Nothing to be done if it is a cold boot. */
+	if (!prmem_metadata)
+		return;
+
+	/*
+	 * prmem uses direct map addresses. If PAGE_OFFSET is randomized,
+	 * these addresses will change across kexecs. Persistence cannot
+	 * be supported.
+	 */
+	if (kaslr_memory_enabled()) {
+		pr_warn("%s: Cannot support persistence because of KASLR.\n",
+			__func__);
+		return;
+	}
+
+	/*
+	 * This is a kexec reboot. If any step fails here, treat this like a
+	 * cold boot. That is, forget all persistent data and start over.
+	 */
+
+	/* Reserve metadata page. */
+	if (memblock_reserve(prmem_metadata, PAGE_SIZE)) {
+		pr_warn("%s: Unable to reserve metadata at %lx\n", __func__,
+			prmem_metadata);
+		return;
+	}
+	prmem = __va(prmem_metadata);
+
+	/* Make sure that the metadata is sane. */
+	if (!prmem_validate())
+		goto unreserve_metadata;
+
+	/* Reserve regions that were added to prmem. */
+	nregions = 0;
+	list_for_each_entry(region, &prmem->regions, node) {
+		if (memblock_reserve(region->pa, region->size)) {
+			pr_warn("%s: Unable to reserve %lx, %zx\n", __func__,
+				region->pa, region->size);
+			goto unreserve_regions;
+		}
+		nregions++;
+	}
+	return;
+
+unreserve_regions:
+	/* Unreserve regions. */
+	list_for_each_entry(region, &prmem->regions, node) {
+		if (!nregions)
+			break;
+		memblock_unreserve(region->pa, region->size);
+		nregions--;
+	}
+
+unreserve_metadata:
+	/* Unreserve the metadata page. */
+	memblock_unreserve(prmem_metadata, PAGE_SIZE);
+	prmem = NULL;
+}
+
 void __init prmem_reserve(void)
 {
 	BUILD_BUG_ON(sizeof(*prmem) > PAGE_SIZE);
 
-	if (!prmem_size)
+	if (!prmem_size || prmem)
 		return;
 
 	/*
diff --git a/kernel/reboot.c b/kernel/reboot.c
index 3bba88c7ffc6..b4595b7e77f3 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -13,6 +13,7 @@
 #include <linux/kexec.h>
 #include <linux/kmod.h>
 #include <linux/kmsg_dump.h>
+#include <linux/prmem.h>
 #include <linux/reboot.h>
 #include <linux/suspend.h>
 #include <linux/syscalls.h>
@@ -84,6 +85,7 @@ void kernel_restart_prepare(char *cmd)
 	system_state = SYSTEM_RESTART;
 	usermodehelper_disable();
 	device_shutdown();
+	prmem_fini();
 }
 
 /**
diff --git a/mm/memblock.c b/mm/memblock.c
index f9e61e565a53..1f5070f7b5bc 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -873,6 +873,18 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
 }
 
+void __init_memblock memblock_unreserve(phys_addr_t base, phys_addr_t size)
+{
+	phys_addr_t end = base + size - 1;
+
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
+		     &base, &end, (void *)_RET_IP_);
+
+	if (memblock_remove_range(&memblock.reserved, base, size))
+		return;
+	memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
+}
+
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
 {
-- 
2.25.1
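The checksum used by this series is a plain word-wise sum over the metadata
page: prmem_fini() stores it just before reboot, and prmem_validate() clears
the field and recomputes the sum on the next boot. A minimal userspace sketch
of that seal/validate cycle (the struct and helper names are illustrative, not
the kernel's; the sketch clears the checksum field explicitly before sealing,
which the kernel achieves implicitly because prmem_validate() leaves the field
zeroed):

```c
#include <assert.h>
#include <stddef.h>

/* Word-wise additive checksum, mirroring prmem_checksum(). */
static unsigned long checksum(const void *start, size_t size)
{
	const unsigned long *p = start;
	size_t n = size / sizeof(*p);
	unsigned long sum = 0;

	while (n--)
		sum += *p++;
	return sum;
}

/* Illustrative stand-in for struct prmem. */
struct meta {
	unsigned long checksum;	/* covers the rest of the struct */
	unsigned long metadata;
	size_t size;
};

/* Seal before "reboot": sum with the checksum field cleared. */
static void seal(struct meta *m)
{
	m->checksum = 0;
	m->checksum = checksum(m, sizeof(*m));
}

/* Validate after "reboot": clear, recompute, compare. */
static int validate(struct meta *m)
{
	unsigned long saved = m->checksum;
	int ok;

	m->checksum = 0;
	ok = (saved == checksum(m, sizeof(*m)));
	m->checksum = saved;	/* the kernel code leaves this zeroed */
	return ok;
}
```

Any single-bit flip in the page changes the sum, but colliding corruptions
are possible; a CRC or cryptographic hash would be a stronger, costlier
choice.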


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH v1 03/10] mm/prmem: Manage persistent memory with the gen pool allocator.
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 01/10] mm/prmem: Allocate memory during boot for storing persistent data madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 02/10] mm/prmem: Reserve metadata and persistent regions in early boot after kexec madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 04/10] mm/prmem: Implement a page allocator for persistent memory madvenka
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

The memory in a prmem region must be managed by an allocator. Use
the Gen Pool allocator (lib/genalloc.c) for that purpose. This is so we
don't have to write a new allocator.

Now, the Gen Pool allocator uses a "struct gen_pool_chunk" to manage a
contiguous range of memory. The chunk is normally allocated using the kmem
allocator. However, for prmem, the chunk must be persisted across a
kexec reboot so that the allocations can be "remembered". To allow this,
allocate the chunk from the region itself and initialize it. Then, pass
the chunk to the Gen Pool allocator. In other words, persist the chunk.

Inside the Gen Pool allocator, distinguish between a chunk that is
allocated internally from kmem and a chunk that is passed by the caller
and handle it properly when the pool is destroyed.

Provide wrapper functions around the Gen Pool allocator functions so that
the underlying allocator can be changed in the future if needed.

	prmem_create_pool()
	prmem_alloc_pool()
	prmem_free_pool()
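The central trick in this patch, keeping the allocator's bookkeeping inside
the memory it manages so that reserving the region after kexec also recovers
the allocation state, can be sketched in userspace. This is not the genalloc
API: the chunk below is a simplified byte-map where genalloc uses a bitmap,
and every name is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK 64	/* minimum allocation granule, like min_alloc_order */

/* Bookkeeping lives at the base of the region it manages, so preserving
 * the region across a "reboot" also preserves the allocator state. */
struct chunk {
	size_t nblocks;
	unsigned char used[];	/* one byte per block; genalloc uses a bitmap */
};

static struct chunk *chunk_init(void *region, size_t size)
{
	struct chunk *c = region;
	size_t nblocks = size / BLK;
	/* Blocks consumed by the header itself, rounded up. */
	size_t hdr = (sizeof(*c) + nblocks + BLK - 1) / BLK;

	memset(c, 0, sizeof(*c) + nblocks);
	c->nblocks = nblocks;
	for (size_t i = 0; i < hdr; i++)
		c->used[i] = 1;	/* self-reserve the bookkeeping, as prmem does */
	return c;
}

static void *chunk_alloc(struct chunk *c)
{
	for (size_t i = 0; i < c->nblocks; i++) {
		if (!c->used[i]) {
			c->used[i] = 1;
			return (char *)c + i * BLK;
		}
	}
	return NULL;
}

static void chunk_free(struct chunk *c, void *p)
{
	c->used[((char *)p - (char *)c) / BLK] = 0;
}
```

After a simulated reboot, casting the preserved region base back to
`struct chunk *` recovers all prior allocations, which is what
gen_pool_add_chunk() does with the persisted chunk.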

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 include/linux/genalloc.h    |  6 ++++
 include/linux/prmem.h       |  8 +++++
 kernel/prmem/prmem_init.c   |  8 +++++
 kernel/prmem/prmem_region.c | 67 ++++++++++++++++++++++++++++++++++++-
 lib/genalloc.c              | 45 ++++++++++++++++++-------
 5 files changed, 121 insertions(+), 13 deletions(-)

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 0bd581003cd5..186757b0aec7 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -73,6 +73,7 @@ struct gen_pool_chunk {
 	struct list_head next_chunk;	/* next chunk in pool */
 	atomic_long_t avail;
 	phys_addr_t phys_addr;		/* physical starting address of memory chunk */
+	bool external;			/* Chunk is passed by caller. */
 	void *owner;			/* private data to retrieve at alloc time */
 	unsigned long start_addr;	/* start address of memory chunk */
 	unsigned long end_addr;		/* end address of memory chunk (inclusive) */
@@ -121,6 +122,11 @@ static inline int gen_pool_add(struct gen_pool *pool, unsigned long addr,
 {
 	return gen_pool_add_virt(pool, addr, -1, size, nid);
 }
+extern size_t gen_pool_chunk_size(size_t size, int min_alloc_order);
+extern void gen_pool_init_chunk(struct gen_pool_chunk *chunk,
+				unsigned long addr, phys_addr_t phys,
+				size_t size, bool external, void *owner);
+void gen_pool_add_chunk(struct gen_pool *pool, struct gen_pool_chunk *chunk);
 extern void gen_pool_destroy(struct gen_pool *);
 unsigned long gen_pool_alloc_algo_owner(struct gen_pool *pool, size_t size,
 		genpool_algo_t algo, void *data, void **owner);
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index bc8054a86f49..f43f5b0d2b9c 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -24,6 +24,7 @@
  * non-volatile storage is too slow.
  */
 #include <linux/types.h>
+#include <linux/genalloc.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/memblock.h>
@@ -38,11 +39,15 @@
  * node		List node.
  * pa		Physical address of the region.
  * size		Size of the region in bytes.
+ * pool		Gen Pool to manage region memory.
+ * chunk	Persistent Gen Pool chunk.
  */
 struct prmem_region {
 	struct list_head	node;
 	unsigned long		pa;
 	size_t			size;
+	struct gen_pool		*pool;
+	struct gen_pool_chunk	*chunk;
 };
 
 /*
@@ -80,6 +85,9 @@ int  prmem_cmdline_size(void);
 
 /* Internal functions. */
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
+bool prmem_create_pool(struct prmem_region *region, bool new_region);
+void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align);
+void prmem_free_pool(struct prmem_region *region, void *va, size_t size);
 unsigned long prmem_checksum(void *start, size_t size);
 bool __init prmem_validate(void);
 void prmem_cmdline(char *cmdline);
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 9cea1cd3b6a5..56df1e6d3ebc 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -22,6 +22,14 @@ void __init prmem_init(void)
 
 		if (!prmem_add_region(prmem_pa, prmem_size))
 			return;
+	} else {
+		/* Warm boot. */
+		struct prmem_region	*region;
+
+		list_for_each_entry(region, &prmem->regions, node) {
+			if (!prmem_create_pool(region, false))
+				return;
+		}
 	}
 	prmem_inited = true;
 }
diff --git a/kernel/prmem/prmem_region.c b/kernel/prmem/prmem_region.c
index 8254dafcee13..6dc88c74d9c8 100644
--- a/kernel/prmem/prmem_region.c
+++ b/kernel/prmem/prmem_region.c
@@ -1,12 +1,74 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * Persistent-Across-Kexec memory (prmem) - Regions.
+ * Persistent-Across-Kexec memory (prmem) - Regions and Region Pools.
  *
  * Copyright (C) 2023 Microsoft Corporation
  * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
  */
 #include <linux/prmem.h>
 
+bool prmem_create_pool(struct prmem_region *region, bool new_region)
+{
+	size_t		chunk_size, total_size;
+
+	chunk_size = gen_pool_chunk_size(region->size, PAGE_SHIFT);
+	total_size = sizeof(*region) + chunk_size;
+	total_size = ALIGN(total_size, PAGE_SIZE);
+
+	if (new_region) {
+		/*
+		 * We place the region structure at the base of the region
+		 * itself. Part of the region is a genpool chunk that is used
+		 * to manage the region memory.
+		 *
+		 * Normally, the chunk is allocated from regular memory by
+		 * genpool. But in the case of prmem, the chunk must be
+		 * persisted across kexecs so allocations can be remembered.
+		 * That is why it is allocated from the region memory itself
+		 * and passed to genpool.
+		 *
+		 * Make sure there is enough space for the region and the chunk.
+		 */
+		if (total_size >= region->size) {
+			pr_warn("%s: region size too small\n", __func__);
+			return false;
+		}
+
+		/* Initialize the persistent genpool chunk. */
+		region->chunk = (void *) (region + 1);
+		memset(region->chunk, 0, chunk_size);
+		gen_pool_init_chunk(region->chunk, (unsigned long) region,
+				    region->pa, region->size, true, NULL);
+	}
+
+	region->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
+	if (!region->pool) {
+		pr_warn("%s: Could not create genpool\n", __func__);
+		return false;
+	}
+
+	gen_pool_add_chunk(region->pool, region->chunk);
+
+	if (new_region) {
+		/* Reserve the region and chunk. */
+		gen_pool_alloc(region->pool, total_size);
+	}
+	return true;
+}
+
+void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align)
+{
+	struct genpool_data_align	data = { .align = align, };
+
+	return (void *) gen_pool_alloc_algo(region->pool, size,
+					    gen_pool_first_fit_align, &data);
+}
+
+void prmem_free_pool(struct prmem_region *region, void *va, size_t size)
+{
+	gen_pool_free(region->pool, (unsigned long) va, size);
+}
+
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
 {
 	struct prmem_region	*region;
@@ -16,6 +78,9 @@ struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
 	region->pa = pa;
 	region->size = size;
 
+	if (!prmem_create_pool(region, true))
+		return NULL;
+
 	list_add_tail(&region->node, &prmem->regions);
 	return region;
 }
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 6c644f954bc5..655db7b47ea9 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -165,6 +165,33 @@ struct gen_pool *gen_pool_create(int min_alloc_order, int nid)
 }
 EXPORT_SYMBOL(gen_pool_create);
 
+size_t gen_pool_chunk_size(size_t size, int min_alloc_order)
+{
+	unsigned long nbits = size >> min_alloc_order;
+	unsigned long nbytes = sizeof(struct gen_pool_chunk) +
+				BITS_TO_LONGS(nbits) * sizeof(long);
+	return nbytes;
+}
+
+void gen_pool_init_chunk(struct gen_pool_chunk *chunk, unsigned long virt,
+			 phys_addr_t phys, size_t size, bool external,
+			 void *owner)
+{
+	chunk->phys_addr = phys;
+	chunk->start_addr = virt;
+	chunk->end_addr = virt + size - 1;
+	chunk->external = external;
+	chunk->owner = owner;
+	atomic_long_set(&chunk->avail, size);
+}
+
+void gen_pool_add_chunk(struct gen_pool *pool, struct gen_pool_chunk *chunk)
+{
+	spin_lock(&pool->lock);
+	list_add_rcu(&chunk->next_chunk, &pool->chunks);
+	spin_unlock(&pool->lock);
+}
+
 /**
  * gen_pool_add_owner- add a new chunk of special memory to the pool
  * @pool: pool to add new memory chunk to
@@ -183,23 +210,14 @@ int gen_pool_add_owner(struct gen_pool *pool, unsigned long virt, phys_addr_t ph
 		 size_t size, int nid, void *owner)
 {
 	struct gen_pool_chunk *chunk;
-	unsigned long nbits = size >> pool->min_alloc_order;
-	unsigned long nbytes = sizeof(struct gen_pool_chunk) +
-				BITS_TO_LONGS(nbits) * sizeof(long);
+	unsigned long nbytes = gen_pool_chunk_size(size, pool->min_alloc_order);
 
 	chunk = vzalloc_node(nbytes, nid);
 	if (unlikely(chunk == NULL))
 		return -ENOMEM;
 
-	chunk->phys_addr = phys;
-	chunk->start_addr = virt;
-	chunk->end_addr = virt + size - 1;
-	chunk->owner = owner;
-	atomic_long_set(&chunk->avail, size);
-
-	spin_lock(&pool->lock);
-	list_add_rcu(&chunk->next_chunk, &pool->chunks);
-	spin_unlock(&pool->lock);
+	gen_pool_init_chunk(chunk, virt, phys, size, false, owner);
+	gen_pool_add_chunk(pool, chunk);
 
 	return 0;
 }
@@ -248,6 +266,9 @@ void gen_pool_destroy(struct gen_pool *pool)
 		chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
 		list_del(&chunk->next_chunk);
 
+		if (chunk->external)
+			continue;
+
 		end_bit = chunk_size(chunk) >> order;
 		bit = find_first_bit(chunk->bits, end_bit);
 		BUG_ON(bit < end_bit);
-- 
2.25.1



* [RFC PATCH v1 04/10] mm/prmem: Implement a page allocator for persistent memory
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
                     ` (2 preceding siblings ...)
  2023-10-16 23:32   ` [RFC PATCH v1 03/10] mm/prmem: Manage persistent memory with the gen pool allocator madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 05/10] mm/prmem: Implement a buffer " madvenka
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Define the following convenience wrapper functions for allocating and
freeing pages:

	- prmem_alloc_pages()
	- prmem_free_pages()

The functions look similar to alloc_pages() and __free_pages(). However,
the only GFP flag that is processed is __GFP_ZERO to zero out the
allocated memory.
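Note that prmem_alloc_pages_locked() requests size-aligned blocks from the
pool (prmem_alloc_pool(region, size, size)), i.e. natural alignment, matching
what the page allocator guarantees for higher-order pages. A userspace sketch
of that allocation contract, using posix_memalign as a stand-in for the pool
(illustrative only, not the kernel API):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SHIFT 12	/* 4 KB pages, as on x86_64 */

/* Userspace analogue of prmem_alloc_pages(order, gfp): allocate
 * 2^order pages, naturally aligned (align == size), optionally
 * zeroed as __GFP_ZERO would request. */
static void *alloc_pages_like(unsigned int order, int zero)
{
	size_t size = (size_t)1 << (order + PAGE_SHIFT);
	void *va = NULL;

	if (posix_memalign(&va, size, size))
		return NULL;
	if (zero)
		memset(va, 0, size);
	return va;
}
```

Natural alignment keeps the start of each allocation on an order-sized
boundary, so converting between pages and virtual addresses stays trivial.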

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 include/linux/prmem.h          |  7 ++++
 kernel/prmem/Makefile          |  1 +
 kernel/prmem/prmem_allocator.c | 74 ++++++++++++++++++++++++++++++++++
 kernel/prmem/prmem_init.c      |  2 +
 4 files changed, 84 insertions(+)
 create mode 100644 kernel/prmem/prmem_allocator.c

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index f43f5b0d2b9c..108683933c82 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -75,6 +75,7 @@ extern unsigned long		prmem_metadata;
 extern unsigned long		prmem_pa;
 extern size_t			prmem_size;
 extern bool			prmem_inited;
+extern spinlock_t		prmem_lock;
 
 /* Kernel API. */
 void prmem_reserve_early(void);
@@ -83,11 +84,17 @@ void prmem_init(void);
 void prmem_fini(void);
 int  prmem_cmdline_size(void);
 
+/* Allocator API. */
+struct page *prmem_alloc_pages(unsigned int order, gfp_t gfp);
+void prmem_free_pages(struct page *pages, unsigned int order);
+
 /* Internal functions. */
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
 bool prmem_create_pool(struct prmem_region *region, bool new_region);
 void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align);
 void prmem_free_pool(struct prmem_region *region, void *va, size_t size);
+void *prmem_alloc_pages_locked(unsigned int order);
+void prmem_free_pages_locked(void *va, unsigned int order);
 unsigned long prmem_checksum(void *start, size_t size);
 bool __init prmem_validate(void);
 void prmem_cmdline(char *cmdline);
diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
index 9b0a693bfee1..99bb19f0afd3 100644
--- a/kernel/prmem/Makefile
+++ b/kernel/prmem/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 
 obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o prmem_misc.o
+obj-y += prmem_allocator.o
diff --git a/kernel/prmem/prmem_allocator.c b/kernel/prmem/prmem_allocator.c
new file mode 100644
index 000000000000..07a5a430630c
--- /dev/null
+++ b/kernel/prmem/prmem_allocator.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Persistent-Across-Kexec memory feature (prmem) - Allocator.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+/* Page Allocation functions. */
+
+void *prmem_alloc_pages_locked(unsigned int order)
+{
+	struct prmem_region	*region;
+	void			*va;
+	size_t			size = (1UL << order) << PAGE_SHIFT;
+
+	list_for_each_entry(region, &prmem->regions, node) {
+		va = prmem_alloc_pool(region, size, size);
+		if (va)
+			return va;
+	}
+	return NULL;
+}
+
+struct page *prmem_alloc_pages(unsigned int order, gfp_t gfp)
+{
+	void		*va;
+	size_t		size = (1UL << order) << PAGE_SHIFT;
+	bool		zero = !!(gfp & __GFP_ZERO);
+
+	if (!prmem_inited || order > MAX_ORDER)
+		return NULL;
+
+	spin_lock(&prmem_lock);
+	va = prmem_alloc_pages_locked(order);
+	spin_unlock(&prmem_lock);
+
+	if (va) {
+		if (zero)
+			memset(va, 0, size);
+		return virt_to_page(va);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(prmem_alloc_pages);
+
+void prmem_free_pages_locked(void *va, unsigned int order)
+{
+	struct prmem_region	*region;
+	size_t			size = (1UL << order) << PAGE_SHIFT;
+	void			*eva = va + size;
+	void			*region_va;
+
+	list_for_each_entry(region, &prmem->regions, node) {
+		/* The region structure is at the base of the region memory. */
+		region_va = region;
+		if (va >= region_va && eva <= (region_va + region->size)) {
+			prmem_free_pool(region, va, size);
+			return;
+		}
+	}
+}
+
+void prmem_free_pages(struct page *pages, unsigned int order)
+{
+	if (!prmem_inited || order > MAX_ORDER)
+		return;
+
+	spin_lock(&prmem_lock);
+	prmem_free_pages_locked(page_to_virt(pages), order);
+	spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_free_pages);
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 56df1e6d3ebc..d23833d296fe 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -9,6 +9,8 @@
 
 bool			prmem_inited;
 
+DEFINE_SPINLOCK(prmem_lock);
+
 void __init prmem_init(void)
 {
 	if (!prmem)
-- 
2.25.1



* [RFC PATCH v1 05/10] mm/prmem: Implement a buffer allocator for persistent memory
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
                     ` (3 preceding siblings ...)
  2023-10-16 23:32   ` [RFC PATCH v1 04/10] mm/prmem: Implement a page allocator for persistent memory madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 06/10] mm/prmem: Implement persistent XArray (and Radix Tree) madvenka
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Implement functions that can allocate and free objects smaller than a
page size.

	- prmem_alloc()
	- prmem_free()

These functions look like kmalloc() and kfree(). However, the only GFP flag
that is processed is __GFP_ZERO to zero out the allocated memory.

To make the implementation simpler, create allocation caches for different
object sizes:

	8, 16, 32, 64, ..., PAGE_SIZE

For a given size, allocate from the appropriate cache. This idea is
borrowed from the kmem allocator's size classes.

To fill the cache of a specific size, allocate a page, break it up into
equal sized objects and add the objects to the cache. This is just a very
simple allocator. It does not attempt to do sophisticated things like
cache coloring, coalescing objects that belong to the same page so the
page can be freed, etc.
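The size-class scheme can be exercised in userspace. The sketch below mirrors
prmem_cache_index() and the intrusive freelist that prmem_refill() and
prmem_free_locked() maintain, where a free object's first word links to the
next free object; the names and the slab used here are illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define NCACHES 14

static const size_t cache_sizes[NCACHES] = {
	8, 16, 32, 64, 128, 256, 512,
	1024, 2048, 4096, 8192, 16384, 32768, 65536,
};

/* Map a request to its size class, as prmem_cache_index() does. */
static int cache_index(size_t size)
{
	for (int i = 0; i < NCACHES; i++) {
		if (size <= cache_sizes[i])
			return i;
	}
	return -1;	/* the kernel version BUG()s instead */
}

/* Intrusive freelist: a free object's first word points at the next
 * free object, so the cache needs no metadata of its own. */
static void cache_push(void **cache, void *obj)
{
	*(void **)obj = *cache;
	*cache = obj;
}

static void *cache_pop(void **cache)
{
	void *obj = *cache;

	if (obj)
		*cache = *(void **)obj;
	return obj;
}

/* Refill by carving one slab into equal-sized objects, as
 * prmem_refill() does with a freshly allocated page. */
static void cache_refill(void **cache, char *slab, size_t slab_size,
			 size_t objsize)
{
	for (size_t off = 0; off + objsize <= slab_size; off += objsize)
		cache_push(cache, slab + off);
}
```

Because the link lives inside the free object itself, objects must be at
least pointer-sized, which is why the smallest class is 8 bytes.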

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 include/linux/prmem.h          |  12 ++++
 kernel/prmem/prmem_allocator.c | 112 ++++++++++++++++++++++++++++++++-
 2 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 108683933c82..1cb4660cf35e 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -50,6 +50,8 @@ struct prmem_region {
 	struct gen_pool_chunk	*chunk;
 };
 
+#define PRMEM_MAX_CACHES	14
+
 /*
  * PRMEM metadata.
  *
@@ -60,6 +62,9 @@ struct prmem_region {
  * size		Size of initial memory allocated to prmem.
  *
  * regions	List of memory regions.
+ *
+ * caches	Caches for different object sizes. For allocations smaller than
+ *		PAGE_SIZE, these caches are used.
  */
 struct prmem {
 	unsigned long		checksum;
@@ -68,6 +73,9 @@ struct prmem {
 
 	/* Persistent Regions. */
 	struct list_head	regions;
+
+	/* Allocation caches. */
+	void			*caches[PRMEM_MAX_CACHES];
 };
 
 extern struct prmem		*prmem;
@@ -87,6 +95,8 @@ int  prmem_cmdline_size(void);
 /* Allocator API. */
 struct page *prmem_alloc_pages(unsigned int order, gfp_t gfp);
 void prmem_free_pages(struct page *pages, unsigned int order);
+void *prmem_alloc(size_t size, gfp_t gfp);
+void prmem_free(void *va, size_t size);
 
 /* Internal functions. */
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
@@ -95,6 +105,8 @@ void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align);
 void prmem_free_pool(struct prmem_region *region, void *va, size_t size);
 void *prmem_alloc_pages_locked(unsigned int order);
 void prmem_free_pages_locked(void *va, unsigned int order);
+void *prmem_alloc_locked(size_t size);
+void prmem_free_locked(void *va, size_t size);
 unsigned long prmem_checksum(void *start, size_t size);
 bool __init prmem_validate(void);
 void prmem_cmdline(char *cmdline);
diff --git a/kernel/prmem/prmem_allocator.c b/kernel/prmem/prmem_allocator.c
index 07a5a430630c..f12975bc6777 100644
--- a/kernel/prmem/prmem_allocator.c
+++ b/kernel/prmem/prmem_allocator.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Persistent-Across-Kexec memory feature (prmem) - Allocator.
+ * Persistent-Across-Kexec memory (prmem) - Allocator.
  *
  * Copyright (C) 2023 Microsoft Corporation
  * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
@@ -72,3 +72,113 @@ void prmem_free_pages(struct page *pages, unsigned int order)
 	spin_unlock(&prmem_lock);
 }
 EXPORT_SYMBOL_GPL(prmem_free_pages);
+
+/* Buffer allocation functions. */
+
+#if PAGE_SIZE > 65536
+#error "Page size is too big"
+#endif
+
+static size_t	prmem_cache_sizes[PRMEM_MAX_CACHES] = {
+	8, 16, 32, 64, 128, 256, 512,
+	1024, 2048, 4096, 8192, 16384, 32768, 65536,
+};
+
+static int prmem_cache_index(size_t size)
+{
+	int	i;
+
+	for (i = 0; i < PRMEM_MAX_CACHES; i++) {
+		if (size <= prmem_cache_sizes[i])
+			return i;
+	}
+	BUG();
+}
+
+static void prmem_refill(void **cache, size_t size)
+{
+	void		*va;
+	int		i, n = PAGE_SIZE / size;
+
+	/* Allocate a page. */
+	va = prmem_alloc_pages_locked(0);
+	if (!va)
+		return;
+
+	/* Break up the page into pieces and put them in the cache. */
+	for (i = 0; i < n; i++, va += size) {
+		*((void **) va) = *cache;
+		*cache = va;
+	}
+}
+
+void *prmem_alloc_locked(size_t size)
+{
+	void		*va;
+	int		index;
+	void		**cache;
+
+	index = prmem_cache_index(size);
+	size = prmem_cache_sizes[index];
+
+	cache = &prmem->caches[index];
+	if (!*cache) {
+		/* Refill the cache. */
+		prmem_refill(cache, size);
+	}
+
+	/* Allocate one from the cache. */
+	va = *cache;
+	if (va)
+		*cache = *((void **) va);
+	return va;
+}
+
+void *prmem_alloc(size_t size, gfp_t gfp)
+{
+	void		*va;
+	bool		zero = !!(gfp & __GFP_ZERO);
+
+	if (!prmem_inited || !size)
+		return NULL;
+
+	/* This function is only for sizes up to a PAGE_SIZE. */
+	if (size > PAGE_SIZE)
+		return NULL;
+
+	spin_lock(&prmem_lock);
+	va = prmem_alloc_locked(size);
+	spin_unlock(&prmem_lock);
+
+	if (va && zero)
+		memset(va, 0, size);
+	return va;
+}
+EXPORT_SYMBOL_GPL(prmem_alloc);
+
+void prmem_free_locked(void *va, size_t size)
+{
+	int		index;
+	void		**cache;
+
+	/* Free the object into its cache. */
+	index = prmem_cache_index(size);
+	cache = &prmem->caches[index];
+	*((void **) va) = *cache;
+	*cache = va;
+}
+
+void prmem_free(void *va, size_t size)
+{
+	if (!prmem_inited || !va || !size)
+		return;
+
+	/* This function is only for sizes up to a PAGE_SIZE. */
+	if (size > PAGE_SIZE)
+		return;
+
+	spin_lock(&prmem_lock);
+	prmem_free_locked(va, size);
+	spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_free);
-- 
2.25.1



* [RFC PATCH v1 06/10] mm/prmem: Implement persistent XArray (and Radix Tree)
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
                     ` (4 preceding siblings ...)
  2023-10-16 23:32   ` [RFC PATCH v1 05/10] mm/prmem: Implement a buffer " madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 07/10] mm/prmem: Implement named Persistent Instances madvenka
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Consumers can persist their data structures by allocating persistent
memory for them.

Now, data structures are connected to one another using pointers, arrays,
linked lists, RB nodes, etc. These can all be persisted by allocating
memory for them from persistent memory. E.g., a linked list is persisted
if the data structures that embed the list head and the list nodes are
allocated from persistent memory. Ditto for RB trees.

One important exception is the XArray. The XArray itself can be embedded in
a persistent data structure. However, the XA nodes are allocated using the
kmem allocator.

Implement a persistent XArray. Introduce a new field, xa_persistent, in the
XArray. Implement an accessor function to set the field. If xa_persistent
is true, allocate XA nodes using the prmem allocator instead of the kmem
allocator. This makes the whole XArray persistent.

Since Radix Trees (lib/radix-tree.c) are implemented based on the XArray,
we also get persistent Radix Trees. The only difference is that pre-loading
is not supported for persistent Radix Tree nodes.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 include/linux/radix-tree.h |  4 ++++
 include/linux/xarray.h     | 15 ++++++++++++
 lib/radix-tree.c           | 49 +++++++++++++++++++++++++++++++-------
 lib/xarray.c               | 11 +++++----
 4 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index eae67015ce51..74f0bdc60bea 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -82,6 +82,7 @@ static inline bool radix_tree_is_internal_node(void *ptr)
 	struct radix_tree_root name = RADIX_TREE_INIT(name, mask)
 
 #define INIT_RADIX_TREE(root, mask) xa_init_flags(root, mask)
+#define PERSIST_RADIX_TREE(root) xa_persistent(root)
 
 static inline bool radix_tree_empty(const struct radix_tree_root *root)
 {
@@ -254,6 +255,9 @@ unsigned int radix_tree_gang_lookup_tag_slot(const struct radix_tree_root *,
 		void __rcu ***results, unsigned long first_index,
 		unsigned int max_items, unsigned int tag);
 int radix_tree_tagged(const struct radix_tree_root *, unsigned int tag);
+struct radix_tree_node *radix_node_alloc(struct radix_tree_root *root,
+		struct list_lru *lru, gfp_t gfp);
+void radix_node_free(struct radix_tree_node *node);
 
 static inline void radix_tree_preload_end(void)
 {
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 741703b45f61..3176a5f62caf 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -295,6 +295,7 @@ enum xa_lock_type {
  */
 struct xarray {
 	spinlock_t	xa_lock;
+	bool		xa_persistent;
 /* private: The rest of the data structure is not to be used directly. */
 	gfp_t		xa_flags;
 	void __rcu *	xa_head;
@@ -302,6 +303,7 @@ struct xarray {
 
 #define XARRAY_INIT(name, flags) {				\
 	.xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock),		\
+	.xa_persistent = false,					\
 	.xa_flags = flags,					\
 	.xa_head = NULL,					\
 }
@@ -378,6 +380,7 @@ void xa_destroy(struct xarray *);
 static inline void xa_init_flags(struct xarray *xa, gfp_t flags)
 {
 	spin_lock_init(&xa->xa_lock);
+	xa->xa_persistent = false;
 	xa->xa_flags = flags;
 	xa->xa_head = NULL;
 }
@@ -395,6 +398,17 @@ static inline void xa_init(struct xarray *xa)
 	xa_init_flags(xa, 0);
 }
 
+/**
+ * xa_persistent() - xa_root and xa_nodes are allocated from persistent memory.
+ * @xa: XArray.
+ *
+ * Context: Any context.
+ */
+static inline void xa_persistent(struct xarray *xa)
+{
+	xa->xa_persistent = true;
+}
+
 /**
  * xa_empty() - Determine if an array has any present entries.
  * @xa: XArray.
@@ -1142,6 +1156,7 @@ struct xa_node {
 	unsigned char	offset;		/* Slot offset in parent */
 	unsigned char	count;		/* Total entry count */
 	unsigned char	nr_values;	/* Value entry count */
+	bool		persistent;	/* Allocated from persistent memory. */
 	struct xa_node __rcu *parent;	/* NULL at top of tree */
 	struct xarray	*array;		/* The array we belong to */
 	union {
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 976b9bd02a1b..d3af6ff6c625 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -21,6 +21,7 @@
 #include <linux/kmemleak.h>
 #include <linux/percpu.h>
 #include <linux/preempt.h>		/* in_interrupt() */
+#include <linux/prmem.h>
 #include <linux/radix-tree.h>
 #include <linux/rcupdate.h>
 #include <linux/slab.h>
@@ -225,6 +226,36 @@ static unsigned long next_index(unsigned long index,
 	return (index & ~node_maxindex(node)) + (offset << node->shift);
 }
 
+static void radix_tree_node_ctor(void *arg);
+
+struct radix_tree_node *
+radix_node_alloc(struct radix_tree_root *root, struct list_lru *lru, gfp_t gfp)
+{
+	struct radix_tree_node *node;
+
+	if (root && root->xa_persistent) {
+		node = prmem_alloc(sizeof(struct radix_tree_node), gfp);
+		if (node) {
+			radix_tree_node_ctor(node);
+			node->persistent = true;
+		}
+	} else {
+		node = kmem_cache_alloc_lru(radix_tree_node_cachep, lru, gfp);
+		if (node)
+			node->persistent = false;
+	}
+	return node;
+}
+
+void radix_node_free(struct radix_tree_node *node)
+{
+	if (node->persistent) {
+		prmem_free(node, sizeof(*node));
+		return;
+	}
+	kmem_cache_free(radix_tree_node_cachep, node);
+}
+
 /*
  * This assumes that the caller has performed appropriate preallocation, and
  * that the caller has pinned this thread of control to the current CPU.
@@ -241,8 +272,11 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
 	 * Preload code isn't irq safe and it doesn't make sense to use
 	 * preloading during an interrupt anyway as all the allocations have
 	 * to be atomic. So just do normal allocation when in interrupt.
+	 *
+	 * Also, there is no preloading for persistent trees.
 	 */
-	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
+	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt() &&
+	    !root->xa_persistent) {
 		struct radix_tree_preload *rtp;
 
 		/*
@@ -250,8 +284,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
 		 * cache first for the new node to get accounted to the memory
 		 * cgroup.
 		 */
-		ret = kmem_cache_alloc(radix_tree_node_cachep,
-				       gfp_mask | __GFP_NOWARN);
+		ret = radix_node_alloc(root, NULL, gfp_mask | __GFP_NOWARN);
 		if (ret)
 			goto out;
 
@@ -273,7 +306,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
 		kmemleak_update_trace(ret);
 		goto out;
 	}
-	ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+	ret = radix_node_alloc(root, NULL, gfp_mask);
 out:
 	BUG_ON(radix_tree_is_internal_node(ret));
 	if (ret) {
@@ -301,7 +334,7 @@ void radix_tree_node_rcu_free(struct rcu_head *head)
 	memset(node->tags, 0, sizeof(node->tags));
 	INIT_LIST_HEAD(&node->private_list);
 
-	kmem_cache_free(radix_tree_node_cachep, node);
+	radix_node_free(node);
 }
 
 static inline void
@@ -335,7 +368,7 @@ static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 	rtp = this_cpu_ptr(&radix_tree_preloads);
 	while (rtp->nr < nr) {
 		local_unlock(&radix_tree_preloads.lock);
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = radix_node_alloc(NULL, NULL, gfp_mask);
 		if (node == NULL)
 			goto out;
 		local_lock(&radix_tree_preloads.lock);
@@ -345,7 +378,7 @@ static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
 			rtp->nodes = node;
 			rtp->nr++;
 		} else {
-			kmem_cache_free(radix_tree_node_cachep, node);
+			radix_node_free(node);
 		}
 	}
 	ret = 0;
@@ -1585,7 +1618,7 @@ static int radix_tree_cpu_dead(unsigned int cpu)
 	while (rtp->nr) {
 		node = rtp->nodes;
 		rtp->nodes = node->parent;
-		kmem_cache_free(radix_tree_node_cachep, node);
+		radix_node_free(node);
 		rtp->nr--;
 	}
 	return 0;
diff --git a/lib/xarray.c b/lib/xarray.c
index 2071a3718f4e..33a74b713e6a 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -9,6 +9,7 @@
 #include <linux/bitmap.h>
 #include <linux/export.h>
 #include <linux/list.h>
+#include <linux/prmem.h>
 #include <linux/slab.h>
 #include <linux/xarray.h>
 
@@ -303,7 +304,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
 	}
 	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
 		gfp |= __GFP_ACCOUNT;
-	xas->xa_alloc = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+	xas->xa_alloc = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
 	if (!xas->xa_alloc)
 		return false;
 	xas->xa_alloc->parent = NULL;
@@ -335,10 +336,10 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
 		gfp |= __GFP_ACCOUNT;
 	if (gfpflags_allow_blocking(gfp)) {
 		xas_unlock_type(xas, lock_type);
-		xas->xa_alloc = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+		xas->xa_alloc = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
 		xas_lock_type(xas, lock_type);
 	} else {
-		xas->xa_alloc = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+		xas->xa_alloc = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
 	}
 	if (!xas->xa_alloc)
 		return false;
@@ -372,7 +373,7 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift)
 		if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
 			gfp |= __GFP_ACCOUNT;
 
-		node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+		node = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
 		if (!node) {
 			xas_set_err(xas, -ENOMEM);
 			return NULL;
@@ -1017,7 +1018,7 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
 		void *sibling = NULL;
 		struct xa_node *node;
 
-		node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+		node = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
 		if (!node)
 			goto nomem;
 		node->array = xas->xa;
-- 
2.25.1



* [RFC PATCH v1 07/10] mm/prmem: Implement named Persistent Instances.
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
                     ` (5 preceding siblings ...)
  2023-10-16 23:32   ` [RFC PATCH v1 06/10] mm/prmem: Implement persistent XArray (and Radix Tree) madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 08/10] mm/prmem: Implement Persistent Ramdisk instances madvenka
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

To persist any data, a consumer needs to do the following:

	- Create a persistent instance for it. The instance gets recorded
	  in the metadata.

	- Name the instance.

	- Record the instance data in the instance.

	- Retrieve the instance by name after kexec.

	- Retrieve instance data.

Implement the following API for consumers:

prmem_get(subsystem, name, create)

	Get/Create a persistent instance. The consumer provides the name
	of the subsystem and the name of the instance within the subsystem.
	E.g., for a persistent ramdisk block device:
		subsystem = "ramdisk"
		instance  = "pram0"

prmem_set_data()

	Record a data pointer and a size for the instance. An instance may
	contain many data structures connected to each other using pointers,
	etc. A consumer is expected to record the top level data structure
	in the instance. All other data structures must be reachable from
	the top level data structure.

prmem_get_data()

	Retrieve the data pointer and the size for the instance.

prmem_put()

	Destroy a persistent instance. The instance data must be NULL at
	this point. So, the consumer is responsible for freeing the
	instance data and setting it to NULL in the instance prior to
	destroying.

prmem_list()

	Walk the instances of a subsystem and call a callback for each.
	This allows a consumer to enumerate all of the instances associated
	with a subsystem.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 include/linux/prmem.h         |  36 +++++++++
 kernel/prmem/Makefile         |   2 +-
 kernel/prmem/prmem_init.c     |   1 +
 kernel/prmem/prmem_instance.c | 139 ++++++++++++++++++++++++++++++++++
 4 files changed, 177 insertions(+), 1 deletion(-)
 create mode 100644 kernel/prmem/prmem_instance.c

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 1cb4660cf35e..c7034690f7cb 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -50,6 +50,28 @@ struct prmem_region {
 	struct gen_pool_chunk	*chunk;
 };
 
+#define PRMEM_MAX_NAME		32
+
+/*
+ * To persist any data, a persistent instance is created for it and the data is
+ * "remembered" in the instance.
+ *
+ * node		List node
+ * subsystem	Subsystem/driver/module that created the instance. E.g.,
+ *		"ramdisk" for the ramdisk driver.
+ * name		Instance name within the subsystem/driver/module. E.g., "pram0"
+ *		for a persistent ramdisk instance.
+ * data		Pointer to data. E.g., the radix tree of pages in a ram disk.
+ * size		Size of data.
+ */
+struct prmem_instance {
+	struct list_head	node;
+	char			subsystem[PRMEM_MAX_NAME];
+	char			name[PRMEM_MAX_NAME];
+	void			*data;
+	size_t			size;
+};
+
 #define PRMEM_MAX_CACHES	14
 
 /*
@@ -63,6 +85,8 @@ struct prmem_region {
  *
  * regions	List of memory regions.
  *
+ * instances	Persistent instances.
+ *
  * caches	Caches for different object sizes. For allocations smaller than
  *		PAGE_SIZE, these caches are used.
  */
@@ -74,6 +98,9 @@ struct prmem {
 	/* Persistent Regions. */
 	struct list_head	regions;
 
+	/* Persistent Instances. */
+	struct list_head	instances;
+
 	/* Allocation caches. */
 	void			*caches[PRMEM_MAX_CACHES];
 };
@@ -85,6 +112,8 @@ extern size_t			prmem_size;
 extern bool			prmem_inited;
 extern spinlock_t		prmem_lock;
 
+typedef int (*prmem_list_func_t)(struct prmem_instance *instance, void *arg);
+
 /* Kernel API. */
 void prmem_reserve_early(void);
 void prmem_reserve(void);
@@ -98,6 +127,13 @@ void prmem_free_pages(struct page *pages, unsigned int order);
 void *prmem_alloc(size_t size, gfp_t gfp);
 void prmem_free(void *va, size_t size);
 
+/* Persistent Instance API. */
+void *prmem_get(char *subsystem, char *name, bool create);
+void prmem_set_data(struct prmem_instance *instance, void *data, size_t size);
+void prmem_get_data(struct prmem_instance *instance, void **data, size_t *size);
+bool prmem_put(struct prmem_instance *instance);
+int prmem_list(char *subsystem, prmem_list_func_t func, void *arg);
+
 /* Internal functions. */
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
 bool prmem_create_pool(struct prmem_region *region, bool new_region);
diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
index 99bb19f0afd3..0ed7976580d6 100644
--- a/kernel/prmem/Makefile
+++ b/kernel/prmem/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 
 obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o prmem_misc.o
-obj-y += prmem_allocator.o
+obj-y += prmem_allocator.o prmem_instance.o
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index d23833d296fe..166fca688ab3 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -21,6 +21,7 @@ void __init prmem_init(void)
 		prmem->metadata = prmem_metadata;
 		prmem->size = prmem_size;
 		INIT_LIST_HEAD(&prmem->regions);
+		INIT_LIST_HEAD(&prmem->instances);
 
 		if (!prmem_add_region(prmem_pa, prmem_size))
 			return;
diff --git a/kernel/prmem/prmem_instance.c b/kernel/prmem/prmem_instance.c
new file mode 100644
index 000000000000..ee3554d0ab8b
--- /dev/null
+++ b/kernel/prmem/prmem_instance.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Persistent-Across-Kexec memory (prmem) - Persistent instances.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+static struct prmem_instance *prmem_find(char *subsystem, char *name)
+{
+	struct prmem_instance	*instance;
+
+	list_for_each_entry(instance, &prmem->instances, node) {
+		if (!strcmp(instance->subsystem, subsystem) &&
+		    !strcmp(instance->name, name)) {
+			return instance;
+		}
+	}
+	return NULL;
+}
+
+void *prmem_get(char *subsystem, char *name, bool create)
+{
+	int			subsystem_len = strlen(subsystem);
+	int			name_len = strlen(name);
+	struct prmem_instance	*instance;
+
+	/*
+	 * In early boot, an existing instance may be looked up. However, a
+	 * new instance cannot be created until prmem is fully initialized.
+	 */
+	if (!prmem || (!prmem_inited && create))
+		return NULL;
+
+	if (!subsystem_len || subsystem_len >= PRMEM_MAX_NAME ||
+	    !name_len || name_len >= PRMEM_MAX_NAME) {
+		return NULL;
+	}
+
+	spin_lock(&prmem_lock);
+
+	/* Check if it already exists. */
+	instance = prmem_find(subsystem, name);
+	if (instance || !create)
+		goto unlock;
+
+	instance = prmem_alloc_locked(sizeof(*instance));
+	if (!instance)
+		goto unlock;
+
+	strcpy(instance->subsystem, subsystem);
+	strcpy(instance->name, name);
+	instance->data = NULL;
+	instance->size = 0;
+
+	list_add_tail(&instance->node, &prmem->instances);
+unlock:
+	spin_unlock(&prmem_lock);
+	return instance;
+}
+EXPORT_SYMBOL_GPL(prmem_get);
+
+void prmem_set_data(struct prmem_instance *instance, void *data, size_t size)
+{
+	if (!prmem_inited)
+		return;
+
+	spin_lock(&prmem_lock);
+	instance->data = data;
+	instance->size = size;
+	spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_set_data);
+
+void prmem_get_data(struct prmem_instance *instance, void **data, size_t *size)
+{
+	if (!prmem)
+		return;
+
+	spin_lock(&prmem_lock);
+	*data = instance->data;
+	*size = instance->size;
+	spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_get_data);
+
+bool prmem_put(struct prmem_instance *instance)
+{
+	if (!prmem_inited)
+		return true;
+
+	spin_lock(&prmem_lock);
+
+	if (instance->data) {
+		/*
+		 * Caller is responsible for freeing instance data and setting
+		 * it to NULL.
+		 */
+		spin_unlock(&prmem_lock);
+		return false;
+	}
+
+	/* Free instance. */
+	list_del(&instance->node);
+	prmem_free_locked(instance, sizeof(*instance));
+
+	spin_unlock(&prmem_lock);
+	return true;
+}
+EXPORT_SYMBOL_GPL(prmem_put);
+
+int prmem_list(char *subsystem, prmem_list_func_t func, void *arg)
+{
+	int			subsystem_len = strlen(subsystem);
+	struct prmem_instance	*instance;
+	int			ret = 0;
+
+	if (!prmem)
+		return 0;
+
+	if (!subsystem_len || subsystem_len >= PRMEM_MAX_NAME)
+		return -EINVAL;
+
+	spin_lock(&prmem_lock);
+
+	list_for_each_entry(instance, &prmem->instances, node) {
+		if (strcmp(instance->subsystem, subsystem))
+			continue;
+
+		ret = func(instance, arg);
+		if (ret)
+			break;
+	}
+
+	spin_unlock(&prmem_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(prmem_list);
-- 
2.25.1



* [RFC PATCH v1 08/10] mm/prmem: Implement Persistent Ramdisk instances.
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
                     ` (6 preceding siblings ...)
  2023-10-16 23:32   ` [RFC PATCH v1 07/10] mm/prmem: Implement named Persistent Instances madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 09/10] mm/prmem: Implement DAX support for Persistent Ramdisks madvenka
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Using the prmem APIs, any kernel subsystem can persist its data. For
persisting user data, we need a filesystem.

Implement persistent ramdisk block device instances so that any filesystem
can be created on it.

Normal ramdisk devices are named "ram0", "ram1", "ram2", etc. Persistent
ramdisk devices will be named "pram0", "pram1", "pram2", etc.

For normal ramdisks, ramdisk pages are allocated using alloc_pages(). For
persistent ones, ramdisk pages are allocated using prmem_alloc_pages().

Each ram disk has a device structure - struct brd_device. For persistent
ram disks, allocate this from persistent memory and record it as the
instance data of the ram disk instance. The structure contains an XArray
of pages allocated to the ram disk. Make it a persistent XArray.

The disk size for all normal ramdisks is specified via a module parameter
"rd_size". This forces all of the ramdisks to have the same size.

For persistent ram disks, take a different approach. Define a module
parameter called "prd_sizes" which specifies a comma-separated list of
sizes. The sizes are applied in the order in which they are listed to
"pram0", "pram1", etc.

	Ram Disk Usage
	--------------

	sudo modprobe brd prd_sizes="1G,2G"

		This creates two ram disks with the specified sizes. That
		is, /dev/pram0 will have a size of 1G. /dev/pram1 will
		have a size of 2G.

	sudo mkfs.ext4 /dev/pram0
	sudo mkfs.ext4 /dev/pram1

		Make filesystems on the persistent ram disks.

	sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
	sudo mount -t ext4 /dev/pram1 /path/to/mountpoint1

		Mount them somewhere.

	sudo umount /path/to/mountpoint0
	sudo umount /path/to/mountpoint1

		Unmount the filesystems.

	After kexec
	-----------

	sudo modprobe brd	(you may omit "prd_sizes")

		This remembers the previously created persistent ram disks.

	sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
	sudo mount -t ext4 /dev/pram1 /path/to/mountpoint1

		Mount the same filesystems.

The maximum number of persistent ram disk instances is specified via
CONFIG_BLK_DEV_PRAM_MAX. By default, this is zero.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 drivers/block/Kconfig |  11 +++
 drivers/block/brd.c   | 214 +++++++++++++++++++++++++++++++++++++++---
 2 files changed, 213 insertions(+), 12 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 5b9d4aaebb81..08fa40f6e2de 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -256,6 +256,17 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
+config BLK_DEV_PRAM_MAX
+	int "Maximum number of Persistent RAM disks"
+	default "0"
+	depends on BLK_DEV_RAM
+	help
+	  This allows the creation of persistent RAM disks, which preserve
+	  their data across a kexec reboot. The default value is 0. Only
+	  change this if you know what you are doing. The sizes of the RAM
+	  disks are specified via the module parameter "prd_sizes" as a
+	  comma-separated list of sizes.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media (DEPRECATED)"
 	depends on !UML
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 970bd6ff38c4..3a05e56ca16f 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -24,9 +24,12 @@
 #include <linux/slab.h>
 #include <linux/backing-dev.h>
 #include <linux/debugfs.h>
+#include <linux/prmem.h>
 
 #include <linux/uaccess.h>
 
+enum brd_type { BRD_NORMAL = 0, BRD_PERSISTENT, };
+
 /*
  * Each block ramdisk device has a xarray brd_pages of pages that stores
  * the pages containing the block device's contents. A brd page's ->index is
@@ -36,6 +39,7 @@
  */
 struct brd_device {
 	int			brd_number;
+	enum brd_type		brd_type;
 	struct gendisk		*brd_disk;
 	struct list_head	brd_list;
 
@@ -46,6 +50,15 @@ struct brd_device {
 	u64			brd_nr_pages;
 };
 
+/* Each of these functions performs an action based on brd_type. */
+static struct brd_device *brd_alloc_device(int i, enum brd_type type);
+static void brd_free_device(struct brd_device *brd);
+static struct page *brd_alloc_page(struct brd_device *brd, gfp_t gfp);
+static void brd_free_page(struct brd_device *brd, struct page *page);
+static void brd_xa_init(struct brd_device *brd);
+static void brd_init_name(struct brd_device *brd, char *name);
+static void brd_set_capacity(struct brd_device *brd);
+
 /*
  * Look up and return a brd's page for a given sector.
  */
@@ -75,7 +88,7 @@ static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
 	if (page)
 		return 0;
 
-	page = alloc_page(gfp | __GFP_ZERO | __GFP_HIGHMEM);
+	page = brd_alloc_page(brd, gfp | __GFP_ZERO | __GFP_HIGHMEM);
 	if (!page)
 		return -ENOMEM;
 
@@ -87,7 +100,7 @@ static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
 	cur = __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, gfp);
 
 	if (unlikely(cur)) {
-		__free_page(page);
+		brd_free_page(brd, page);
 		ret = xa_err(cur);
 		if (!ret && (cur->index != idx))
 			ret = -EIO;
@@ -110,7 +123,7 @@ static void brd_free_pages(struct brd_device *brd)
 	pgoff_t idx;
 
 	xa_for_each(&brd->brd_pages, idx, page) {
-		__free_page(page);
+		brd_free_page(brd, page);
 		cond_resched();
 	}
 
@@ -287,6 +300,18 @@ unsigned long rd_size = CONFIG_BLK_DEV_RAM_SIZE;
 module_param(rd_size, ulong, 0444);
 MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
 
+/* Sizes of persistent ram disks are specified in a comma-separated list.  */
+static char *prd_sizes;
+module_param(prd_sizes, charp, 0444);
+MODULE_PARM_DESC(prd_sizes, "Sizes of persistent RAM disks.");
+
+/* Persistent ram disk specific data. */
+struct prd_data {
+	struct prmem_instance	*instance;
+	unsigned long		size;
+};
+static struct prd_data	prd_data[CONFIG_BLK_DEV_PRAM_MAX];
+
 static int max_part = 1;
 module_param(max_part, int, 0444);
 MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
@@ -295,6 +320,32 @@ MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
 MODULE_ALIAS("rd");
 
+void __init brd_parse(void)
+{
+	unsigned long		size;
+	char			*cur, *tmp;
+	int			i = 0;
+
+	if (!CONFIG_BLK_DEV_PRAM_MAX || !prd_sizes)
+		return;
+
+	/* Parse persistent ram disk sizes. */
+	cur = prd_sizes;
+	do {
+		/* Get the size of a ramdisk. Sanity check it. */
+		size = memparse(cur, &tmp);
+		if (cur == tmp || !size) {
+			pr_warn("%s: Memory value expected\n", __func__);
+			return;
+		}
+		cur = tmp;
+
+		/* Add the ramdisk size. */
+		prd_data[i++].size = size;
+
+	} while (*cur++ == ',' && i < CONFIG_BLK_DEV_PRAM_MAX);
+}
+
 #ifndef MODULE
 /* Legacy boot options - nonmodular */
 static int __init ramdisk_size(char *str)
@@ -314,23 +365,33 @@ static struct dentry *brd_debugfs_dir;
 
 static int brd_alloc(int i)
 {
+	int brd_number;
+	enum brd_type brd_type;
 	struct brd_device *brd;
 	struct gendisk *disk;
 	char buf[DISK_NAME_LEN];
 	int err = -ENOMEM;
 
+	if (i < rd_nr) {
+		brd_number = i;
+		brd_type = BRD_NORMAL;
+	} else {
+		brd_number = i - rd_nr;
+		brd_type = BRD_PERSISTENT;
+	}
+
 	list_for_each_entry(brd, &brd_devices, brd_list)
-		if (brd->brd_number == i)
+		if (brd->brd_number == i && brd->brd_type == brd_type)
 			return -EEXIST;
-	brd = kzalloc(sizeof(*brd), GFP_KERNEL);
+	brd = brd_alloc_device(brd_number, brd_type);
 	if (!brd)
 		return -ENOMEM;
-	brd->brd_number		= i;
+	brd->brd_number		= brd_number;
 	list_add_tail(&brd->brd_list, &brd_devices);
 
-	xa_init(&brd->brd_pages);
+	brd_xa_init(brd);
 
-	snprintf(buf, DISK_NAME_LEN, "ram%d", i);
+	brd_init_name(brd, buf);
 	if (!IS_ERR_OR_NULL(brd_debugfs_dir))
 		debugfs_create_u64(buf, 0444, brd_debugfs_dir,
 				&brd->brd_nr_pages);
@@ -345,7 +406,7 @@ static int brd_alloc(int i)
 	disk->fops		= &brd_fops;
 	disk->private_data	= brd;
 	strscpy(disk->disk_name, buf, DISK_NAME_LEN);
-	set_capacity(disk, rd_size * 2);
+	brd_set_capacity(brd);
 	
 	/*
 	 * This is so fdisk will align partitions on 4k, because of
@@ -370,7 +431,7 @@ static int brd_alloc(int i)
 	put_disk(disk);
 out_free_dev:
 	list_del(&brd->brd_list);
-	kfree(brd);
+	brd_free_device(brd);
 	return err;
 }
 
@@ -390,7 +451,7 @@ static void brd_cleanup(void)
 		put_disk(brd->brd_disk);
 		brd_free_pages(brd);
 		list_del(&brd->brd_list);
-		kfree(brd);
+		brd_free_device(brd);
 	}
 }
 
@@ -427,13 +488,21 @@ static int __init brd_init(void)
 			goto out_free;
 	}
 
+	/* Parse persistent ram disk sizes. */
+	brd_parse();
+
+	/* Create persistent ram disks. */
+	for (i = 0; i < CONFIG_BLK_DEV_PRAM_MAX; i++)
+		brd_alloc(i + rd_nr);
+
 	/*
 	 * brd module now has a feature to instantiate underlying device
 	 * structure on-demand, provided that there is an access dev node.
 	 *
 	 * (1) if rd_nr is specified, create that many upfront. else
 	 *     it defaults to CONFIG_BLK_DEV_RAM_COUNT
-	 * (2) User can further extend brd devices by create dev node themselves
+	 * (2) if prd_sizes is specified, create that many upfront.
+	 * (3) Users can further extend brd devices by creating dev nodes themselves
 	 *     and have kernel automatically instantiate actual device
 	 *     on-demand. Example:
 	 *		mknod /path/devnod_name b 1 X	# 1 is the rd major
@@ -469,3 +538,124 @@ static void __exit brd_exit(void)
 module_init(brd_init);
 module_exit(brd_exit);
 
+/* Each of these functions performs an action based on brd_type. */
+
+static struct brd_device *brd_alloc_device(int i, enum brd_type type)
+{
+	char name[PRMEM_MAX_NAME];
+	struct brd_device *brd;
+	struct prmem_instance *instance;
+	size_t size;
+	bool create;
+
+	if (type == BRD_NORMAL)
+		return kzalloc(sizeof(struct brd_device), GFP_KERNEL);
+
+	/*
+	 * Get the persistent ramdisk instance. If it does not exist, it is
+	 * created, provided a size has been specified.
+	 */
+	create = !!prd_data[i].size;
+	snprintf(name, PRMEM_MAX_NAME, "pram%d", i);
+	instance = prmem_get("ramdisk", name, create);
+	if (!instance)
+		return NULL;
+
+	prmem_get_data(instance, (void **) &brd, &size);
+	if (brd) {
+		/* Existing instance. Ignore the module parameter. */
+		prd_data[i].size = size;
+		prd_data[i].instance = instance;
+		return brd;
+	}
+
+	/*
+	 * New instance. Allocate brd from persistent memory and set it as
+	 * instance data.
+	 */
+	brd = prmem_alloc(sizeof(*brd), __GFP_ZERO);
+	if (!brd) {
+		prmem_put(instance);
+		return NULL;
+	}
+	brd->brd_type = BRD_PERSISTENT;
+	prmem_set_data(instance, brd, prd_data[i].size);
+
+	prd_data[i].instance = instance;
+	return brd;
+}
+
+static void brd_free_device(struct brd_device *brd)
+{
+	struct prmem_instance *instance;
+
+	if (brd->brd_type == BRD_NORMAL) {
+		kfree(brd);
+		return;
+	}
+
+	instance = prd_data[brd->brd_number].instance;
+	prmem_set_data(instance, NULL, 0);
+	prmem_free(brd, sizeof(*brd));
+	prmem_put(instance);
+}
+
+static struct page *brd_alloc_page(struct brd_device *brd, gfp_t gfp)
+{
+	if (brd->brd_type == BRD_NORMAL)
+		return alloc_page(gfp);
+	return prmem_alloc_pages(0, gfp);
+}
+
+static void brd_free_page(struct brd_device *brd, struct page *page)
+{
+	if (brd->brd_type == BRD_NORMAL)
+		__free_page(page);
+	else
+		prmem_free_pages(page, 0);
+}
+
+static void brd_xa_init(struct brd_device *brd)
+{
+	if (brd->brd_type == BRD_NORMAL) {
+		xa_init(&brd->brd_pages);
+		return;
+	}
+
+	if (brd->brd_nr_pages) {
+		/* Existing persistent instance. */
+		struct page *page;
+		pgoff_t idx;
+
+		/*
+		 * The xarray of pages is persistent. However, the page
+		 * indexes are not. Set them here.
+		 */
+		xa_for_each(&brd->brd_pages, idx, page) {
+			page->index = idx;
+		}
+	} else {
+		/* New persistent instance. */
+		xa_init(&brd->brd_pages);
+		xa_persistent(&brd->brd_pages);
+	}
+}
+
+static void brd_init_name(struct brd_device *brd, char *name)
+{
+	if (brd->brd_type == BRD_NORMAL)
+		snprintf(name, DISK_NAME_LEN, "ram%d", brd->brd_number);
+	else
+		snprintf(name, DISK_NAME_LEN, "pram%d", brd->brd_number);
+}
+
+static void brd_set_capacity(struct brd_device *brd)
+{
+	unsigned long disksize;
+
+	if (brd->brd_type == BRD_NORMAL)
+		disksize = rd_size;
+	else
+		disksize = prd_data[brd->brd_number].size;
+	set_capacity(brd->brd_disk, disksize * 2);
+}
-- 
2.25.1



* [RFC PATCH v1 09/10] mm/prmem: Implement DAX support for Persistent Ramdisks.
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
                     ` (7 preceding siblings ...)
  2023-10-16 23:32   ` [RFC PATCH v1 08/10] mm/prmem: Implement Persistent Ramdisk instances madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-16 23:32   ` [RFC PATCH v1 10/10] mm/prmem: Implement dynamic expansion of prmem madvenka
  2023-10-17  8:31   ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) Alexander Graf
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

One problem with using a ramdisk is that the page cache will contain
redundant copies of ramdisk data. To avoid this, implement DAX support
for persistent ramdisks.

To make use of this, the filesystem installed on the ramdisk must support
DAX (e.g., ext4). Mount the filesystem with the dax option. E.g.,

	sudo mount -t ext4 -o dax /dev/pram0 /path/to/mountpoint

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 drivers/block/brd.c | 106 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3a05e56ca16f..d4a42d3bd212 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -25,6 +25,9 @@
 #include <linux/backing-dev.h>
 #include <linux/debugfs.h>
 #include <linux/prmem.h>
+#include <linux/pfn_t.h>
+#include <linux/dax.h>
+#include <linux/uio.h>
 
 #include <linux/uaccess.h>
 
@@ -42,6 +45,7 @@ struct brd_device {
 	enum brd_type		brd_type;
 	struct gendisk		*brd_disk;
 	struct list_head	brd_list;
+	struct dax_device	*brd_dax;
 
 	/*
 	 * Backing store of pages. This is the contents of the block device.
@@ -58,6 +62,8 @@ static void brd_free_page(struct brd_device *brd, struct page *page);
 static void brd_xa_init(struct brd_device *brd);
 static void brd_init_name(struct brd_device *brd, char *name);
 static void brd_set_capacity(struct brd_device *brd);
+static int brd_dax_init(struct brd_device *brd);
+static void brd_dax_cleanup(struct brd_device *brd);
 
 /*
  * Look up and return a brd's page for a given sector.
@@ -408,6 +414,9 @@ static int brd_alloc(int i)
 	strscpy(disk->disk_name, buf, DISK_NAME_LEN);
 	brd_set_capacity(brd);
 	
+	if (brd_dax_init(brd))
+		goto out_clean_dax;
+
 	/*
 	 * This is so fdisk will align partitions on 4k, because of
 	 * direct_access API needing 4k alignment, returning a PFN
@@ -421,6 +430,8 @@ static int brd_alloc(int i)
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
 	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);
 	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, disk->queue);
+	if (brd->brd_dax)
+		blk_queue_flag_set(QUEUE_FLAG_DAX, disk->queue);
 	err = add_disk(disk);
 	if (err)
 		goto out_cleanup_disk;
@@ -429,6 +440,8 @@ static int brd_alloc(int i)
 
 out_cleanup_disk:
 	put_disk(disk);
+out_clean_dax:
+	brd_dax_cleanup(brd);
 out_free_dev:
 	list_del(&brd->brd_list);
 	brd_free_device(brd);
@@ -447,6 +460,7 @@ static void brd_cleanup(void)
 	debugfs_remove_recursive(brd_debugfs_dir);
 
 	list_for_each_entry_safe(brd, next, &brd_devices, brd_list) {
+		brd_dax_cleanup(brd);
 		del_gendisk(brd->brd_disk);
 		put_disk(brd->brd_disk);
 		brd_free_pages(brd);
@@ -659,3 +673,95 @@ static void brd_set_capacity(struct brd_device *brd)
 		disksize = prd_data[brd->brd_number].size;
 	set_capacity(brd->brd_disk, disksize * 2);
 }
+
+static bool		prd_dax_enabled = IS_ENABLED(CONFIG_FS_DAX);
+
+static long brd_dax_direct_access(struct dax_device *dax_dev,
+				  pgoff_t pgoff, long nr_pages,
+				  enum dax_access_mode mode,
+				  void **kaddr, pfn_t *pfn);
+static int brd_dax_zero_page_range(struct dax_device *dax_dev,
+				   pgoff_t pgoff, size_t nr_pages);
+
+static const struct dax_operations brd_dax_ops = {
+	.direct_access = brd_dax_direct_access,
+	.zero_page_range = brd_dax_zero_page_range,
+};
+
+static int brd_dax_init(struct brd_device *brd)
+{
+	if (!prd_dax_enabled || brd->brd_type == BRD_NORMAL)
+		return 0;
+
+	brd->brd_dax = alloc_dax(brd, &brd_dax_ops);
+	if (IS_ERR(brd->brd_dax)) {
+		pr_warn("%s: DAX failed\n", __func__);
+		brd->brd_dax = NULL;
+		return -ENOMEM;
+	}
+
+	if (dax_add_host(brd->brd_dax, brd->brd_disk)) {
+		pr_warn("%s: DAX add failed\n", __func__);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static void brd_dax_cleanup(struct brd_device *brd)
+{
+	if (!prd_dax_enabled || brd->brd_type == BRD_NORMAL)
+		return;
+
+	if (brd->brd_dax) {
+		dax_remove_host(brd->brd_disk);
+		kill_dax(brd->brd_dax);
+		put_dax(brd->brd_dax);
+	}
+}
+static int brd_dax_zero_page_range(struct dax_device *dax_dev,
+				   pgoff_t pgoff, size_t nr_pages)
+{
+	long rc;
+	void *kaddr;
+
+	rc = dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS,
+			&kaddr, NULL);
+	if (rc < 0)
+		return rc;
+	memset(kaddr, 0, nr_pages << PAGE_SHIFT);
+	return 0;
+}
+
+static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff,
+		long nr_pages, void **kaddr, pfn_t *pfn)
+{
+	struct page *page;
+	sector_t sector = (sector_t) pgoff << PAGE_SECTORS_SHIFT;
+	int ret;
+
+	if (!brd)
+		return -ENODEV;
+
+	ret = brd_insert_page(brd, sector, GFP_NOWAIT);
+	if (ret)
+		return ret;
+
+	page = brd_lookup_page(brd, sector);
+	if (!page)
+		return -ENOSPC;
+
+	*kaddr = page_address(page);
+	if (pfn)
+		*pfn = page_to_pfn_t(page);
+
+	return 1;
+}
+
+static long brd_dax_direct_access(struct dax_device *dax_dev,
+		pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+		void **kaddr, pfn_t *pfn)
+{
+	struct brd_device *brd = dax_get_private(dax_dev);
+
+	return __brd_direct_access(brd, pgoff, nr_pages, kaddr, pfn);
+}
-- 
2.25.1



* [RFC PATCH v1 10/10] mm/prmem: Implement dynamic expansion of prmem.
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
                     ` (8 preceding siblings ...)
  2023-10-16 23:32   ` [RFC PATCH v1 09/10] mm/prmem: Implement DAX support for Persistent Ramdisks madvenka
@ 2023-10-16 23:32   ` madvenka
  2023-10-17  8:31   ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) Alexander Graf
  10 siblings, 0 replies; 13+ messages in thread
From: madvenka @ 2023-10-16 23:32 UTC (permalink / raw)
  To: gregkh, pbonzini, rppt, jgowans, graf, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	madvenka, jamorris

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

For some use cases, it is hard to predict how much memory is actually
needed to store persistent data, as this depends on the workload. Either
we would have to overcommit memory for persistent data, or we could allow
prmem to expand dynamically.

Implement dynamic expansion of prmem. When the allocator runs out of memory,
it calls alloc_pages() to allocate a MAX_ORDER page, creates a region for
that memory and adds it to the list of regions. The allocator can then
satisfy allocations from the new region.

To allow this, extend the command line parameter:

	prmem=size[KMG][,max_size[KMG]]

Size is allocated up front as mentioned before. Between size and max_size,
prmem is expanded dynamically on demand as described above.

Expanding in MAX_ORDER units means that no fragmentation is created for
transparent huge pages and kmem slabs. However, fragmentation may be
created for 1GB pages. This is not a problem for 1GB pages that are
reserved up front, but it could be a problem for 1GB pages that are
allocated dynamically at run time.

If max_size is omitted from the command line parameter, no dynamic
expansion will happen.
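
As a concrete example (the sizes are arbitrary, chosen only for
illustration), a command line that reserves 512M of prmem up front and
allows it to grow to at most 2G would contain:

	prmem=512M,2G

With only "prmem=512M", max_size defaults to size and prmem never expands.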

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 include/linux/prmem.h          |  8 +++++++
 kernel/prmem/prmem_allocator.c | 38 ++++++++++++++++++++++++++++++++++
 kernel/prmem/prmem_init.c      |  1 +
 kernel/prmem/prmem_misc.c      |  3 ++-
 kernel/prmem/prmem_parse.c     | 20 +++++++++++++++++-
 kernel/prmem/prmem_region.c    |  1 +
 kernel/prmem/prmem_reserve.c   |  1 +
 7 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index c7034690f7cb..bb552946cb5b 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -83,6 +83,9 @@ struct prmem_instance {
  * metadata	Physical address of the metadata page.
  * size		Size of initial memory allocated to prmem.
  *
+ * cur_size	Current amount of memory allocated to prmem.
+ * max_size	Maximum amount of memory that can be allocated to prmem.
+ *
  * regions	List of memory regions.
  *
  * instances	Persistent instances.
@@ -95,6 +98,10 @@ struct prmem {
 	unsigned long		metadata;
 	size_t			size;
 
+	/* Dynamic expansion. */
+	size_t			cur_size;
+	size_t			max_size;
+
 	/* Persistent Regions. */
 	struct list_head	regions;
 
@@ -109,6 +116,7 @@ extern struct prmem		*prmem;
 extern unsigned long		prmem_metadata;
 extern unsigned long		prmem_pa;
 extern size_t			prmem_size;
+extern size_t			prmem_max_size;
 extern bool			prmem_inited;
 extern spinlock_t		prmem_lock;
 
diff --git a/kernel/prmem/prmem_allocator.c b/kernel/prmem/prmem_allocator.c
index f12975bc6777..1cb3eae8a3e7 100644
--- a/kernel/prmem/prmem_allocator.c
+++ b/kernel/prmem/prmem_allocator.c
@@ -9,17 +9,55 @@
 
 /* Page Allocation functions. */
 
+static void prmem_expand(void)
+{
+	struct prmem_region	*region;
+	struct page		*pages;
+	unsigned int		order = MAX_ORDER;
+	size_t			size = (1UL << order) << PAGE_SHIFT;
+
+	if (prmem->cur_size + size > prmem->max_size)
+		return;
+
+	spin_unlock(&prmem_lock);
+	pages = alloc_pages(GFP_NOWAIT, order);
+	spin_lock(&prmem_lock);
+
+	if (!pages)
+		return;
+
+	/* cur_size may have changed. Recheck. */
+	if (prmem->cur_size + size > prmem->max_size)
+		goto free;
+
+	region = prmem_add_region(page_to_phys(pages), size);
+	if (!region)
+		goto free;
+
+	pr_warn("%s: prmem expanded by %zu\n", __func__, size);
+	return;
+free:
+	__free_pages(pages, order);
+}
+
 void *prmem_alloc_pages_locked(unsigned int order)
 {
 	struct prmem_region	*region;
 	void			*va;
 	size_t			size = (1UL << order) << PAGE_SHIFT;
+	bool			expand = true;
 
+retry:
 	list_for_each_entry(region, &prmem->regions, node) {
 		va = prmem_alloc_pool(region, size, size);
 		if (va)
 			return va;
 	}
+	if (expand) {
+		expand = false;
+		prmem_expand();
+		goto retry;
+	}
 	return NULL;
 }
 
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 166fca688ab3..f4814cc88508 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -20,6 +20,7 @@ void __init prmem_init(void)
 		/* Cold boot. */
 		prmem->metadata = prmem_metadata;
 		prmem->size = prmem_size;
+		prmem->max_size = prmem_max_size;
 		INIT_LIST_HEAD(&prmem->regions);
 		INIT_LIST_HEAD(&prmem->instances);
 
diff --git a/kernel/prmem/prmem_misc.c b/kernel/prmem/prmem_misc.c
index 49b6a7232c1a..3100662d2cbe 100644
--- a/kernel/prmem/prmem_misc.c
+++ b/kernel/prmem/prmem_misc.c
@@ -68,7 +68,8 @@ bool __init prmem_validate(void)
 	unsigned long		checksum;
 
 	/* Sanity check the boot parameter. */
-	if (prmem_metadata != prmem->metadata || prmem_size != prmem->size) {
+	if (prmem_metadata != prmem->metadata || prmem_size != prmem->size ||
+	    prmem_max_size != prmem->max_size) {
 		pr_warn("%s: Boot parameter mismatch\n", __func__);
 		return false;
 	}
diff --git a/kernel/prmem/prmem_parse.c b/kernel/prmem/prmem_parse.c
index 6c1a23c6b84e..3a57b37fa191 100644
--- a/kernel/prmem/prmem_parse.c
+++ b/kernel/prmem/prmem_parse.c
@@ -8,9 +8,11 @@
 #include <linux/prmem.h>
 
 /*
- * Syntax: prmem=size[KMG]
+ * Syntax: prmem=size[KMG][,max_size[KMG]]
  *
  *	Specifies the size of the initial memory to be allocated to prmem.
+ *	Optionally, specifies the maximum amount of memory to be allocated to
+ *	prmem. prmem will expand dynamically between size and max_size.
  */
 static int __init prmem_size_parse(char *cmdline)
 {
@@ -28,6 +30,22 @@ static int __init prmem_size_parse(char *cmdline)
 	}
 
 	prmem_size = size;
+	prmem_max_size = size;
+
+	cur = tmp;
+	if (*cur++ == ',') {
+		/* Get max size. */
+		size = memparse(cur, &tmp);
+		if (cur == tmp || !size || size & (PAGE_SIZE - 1) ||
+		    size <= prmem_size) {
+			prmem_size = 0;
+			prmem_max_size = 0;
+			pr_warn("%s: Incorrect max size %lx\n", __func__, size);
+			return -EINVAL;
+		}
+		prmem_max_size = size;
+	}
+
 	return 0;
 }
 early_param("prmem", prmem_size_parse);
diff --git a/kernel/prmem/prmem_region.c b/kernel/prmem/prmem_region.c
index 6dc88c74d9c8..390329a34b74 100644
--- a/kernel/prmem/prmem_region.c
+++ b/kernel/prmem/prmem_region.c
@@ -82,5 +82,6 @@ struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
 		return NULL;
 
 	list_add_tail(&region->node, &prmem->regions);
+	prmem->cur_size += size;
 	return region;
 }
diff --git a/kernel/prmem/prmem_reserve.c b/kernel/prmem/prmem_reserve.c
index 8000fff05402..c5ae5d7d8f0a 100644
--- a/kernel/prmem/prmem_reserve.c
+++ b/kernel/prmem/prmem_reserve.c
@@ -11,6 +11,7 @@ struct prmem		*prmem;
 unsigned long		prmem_metadata;
 unsigned long		prmem_pa;
 unsigned long		prmem_size;
+unsigned long		prmem_max_size;
 
 void __init prmem_reserve_early(void)
 {
-- 
2.25.1



* Re: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)
  2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
                     ` (9 preceding siblings ...)
  2023-10-16 23:32   ` [RFC PATCH v1 10/10] mm/prmem: Implement dynamic expansion of prmem madvenka
@ 2023-10-17  8:31   ` Alexander Graf
  2023-10-17 18:08     ` Madhavan T. Venkataraman
  10 siblings, 1 reply; 13+ messages in thread
From: Alexander Graf @ 2023-10-17  8:31 UTC (permalink / raw)
  To: madvenka, gregkh, pbonzini, rppt, jgowans, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	jamorris, rostedt, kvm

Hey Madhavan!

This patch set looks super exciting - thanks a lot for putting it 
together. We've been poking at a very similar direction for a while as 
well and will discuss the fundamental problem of how to persist kernel 
metadata across kexec at LPC:

   https://lpc.events/event/17/contributions/1485/

It would be great to have you in the room as well then.

Some more comments inline.

On 17.10.23 01:32, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> Introduction
> ============
>
> This feature can be used to persist kernel and user data across kexec reboots
> in RAM for various uses. E.g., persisting:
>
>          - cached data. E.g., database caches.
>          - state. E.g., KVM guest states.
>          - historical information since the last cold boot. E.g., events, logs
>            and journals.
>          - measurements for integrity checks on the next boot.
>          - driver data.
>          - IOMMU mappings.
>          - MMIO config information.
>
> This is useful on systems where there is no non-volatile storage or
> non-volatile storage is too small or too slow.


This is useful in more situations. We, for example, need it to do a kexec
while a virtual machine is in a suspended state but still has its IOMMU
mappings intact (Live Update). For that, we need to ensure DMA can still
reach the VM memory and that everything gets reassembled identically and
without interruption on the receiving end.


> The following sections describe the implementation.
>
> I have enhanced the ram disk block device driver to provide persistent ram
> disks on which any filesystem can be created. This is for persisting user data.
> I have also implemented DAX support for the persistent ram disks.


This is probably the least interesting of the enablements, right? You 
can already today reserve RAM on boot as DAX block device and use it for 
that purpose.


> I am also working on making ZRAM persistent.
>
> I have also briefly discussed the following use cases:
>
>          - Persisting IOMMU mappings
>          - Remembering DMA pages
>          - Reserving pages that encounter memory errors
>          - Remembering IMA measurements for integrity checks
>          - Remembering MMIO config info
>          - Implementing prmemfs (special filesystem tailored for persistence)
>
> Allocate metadata
> =================
>
> Define a metadata structure to store all persistent memory related information.
> The metadata fits into one page. On a cold boot, allocate and initialize the
> metadata page.
>
> Allocate data
> =============
>
> On a cold boot, allocate some memory for storing persistent data. Call it
> persistent memory. Specify the size in a command line parameter:
>
>          prmem=size[KMG][,max_size[KMG]]
>
>          size            Initial amount of memory allocated to prmem during boot
>          max_size        Maximum amount of memory that can be allocated to prmem
>
> When the initial memory is exhausted via allocations, expand prmem dynamically
> up to max_size. Expansion is done by allocating from the buddy allocator.
> Record all allocations in the metadata.


I don't understand why we need a separate allocator. Why can't we just 
use normal Linux allocations and serialize their location for handover? 
We would obviously still need to find a large contiguous piece of memory 
for the target kernel to bootstrap itself into until it can read which 
pages it can and can not use, but we can do that allocation in the 
source environment using CMA, no?

What I'm trying to say is: I think we're better off separating the 
handover mechanism from the allocation mechanism. If we can implement 
handover without a new allocator, we can use it for simple things with a 
slight runtime penalty. To accelerate the handover then, we can later 
add a compacting allocator that can use the handover mechanism we 
already built to persist itself.



I have a WIP branch where I'm toying with such a handover mechanism that 
uses device tree to serialize/deserialize state. By standardizing the 
property naming, we can in the receiving kernel mark all persistent 
allocations as reserved and then slowly either free them again or mark 
them as in-use one by one:

https://github.com/agraf/linux/commit/fd5736a21d549a9a86c178c91acb29ed7f364f42

I used ftrace as example payload to persist: With the handover mechanism 
in place, we serialize/deserialize ftrace ring buffer metadata and are 
thus able to read traces of the previous system after kexec. This way, 
you can for example profile the kexec exit path.

It's not even in RFC state yet, there are a few things where I would 
need a couple days to think hard about data structures, layouts and 
other problems :). But I believe from the patch you get the idea.

One such user of kho could be a new allocator like prmem and each 
subsystem's serialization code could choose to rely on the prmem 
subsystem to persist data instead of doing it themselves. That way you 
get a very non-intrusive enablement path for kexec handover, easily 
amendable data structures that can change compatibly over time as well 
as the ability to recreate ephemeral data structure based on persistent 
information - which will be necessary to persist VFIO containers.


Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




* Re: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)
  2023-10-17  8:31   ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) Alexander Graf
@ 2023-10-17 18:08     ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 13+ messages in thread
From: Madhavan T. Venkataraman @ 2023-10-17 18:08 UTC (permalink / raw)
  To: Alexander Graf, gregkh, pbonzini, rppt, jgowans, arnd, keescook,
	stanislav.kinsburskii, anthony.yznaga, linux-mm, linux-kernel,
	jamorris, rostedt, kvm

Hey Alex,

Thanks a lot for your comments!

On 10/17/23 03:31, Alexander Graf wrote:
> Hey Madhavan!
> 
> This patch set looks super exciting - thanks a lot for putting it together. We've been poking at a very similar direction for a while as well and will discuss the fundamental problem of how to persist kernel metadata across kexec at LPC:
> 
>   https://lpc.events/event/17/contributions/1485/
> 
> It would be great to have you in the room as well then.
> 

Yes. I am planning to attend. But I am attending virtually as I am not able to travel.

> Some more comments inline.
> 
> On 17.10.23 01:32, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> Introduction
>> ============
>>
>> This feature can be used to persist kernel and user data across kexec reboots
>> in RAM for various uses. E.g., persisting:
>>
>>          - cached data. E.g., database caches.
>>          - state. E.g., KVM guest states.
>>          - historical information since the last cold boot. E.g., events, logs
>>            and journals.
>>          - measurements for integrity checks on the next boot.
>>          - driver data.
>>          - IOMMU mappings.
>>          - MMIO config information.
>>
>> This is useful on systems where there is no non-volatile storage or
>> non-volatile storage is too small or too slow.
> 
> 
> This is useful in more situations. We for example need it to do a kexec while a virtual machine is in suspended state, but has IOMMU mappings intact (Live Update). For that, we need to ensure DMA can still reach the VM memory and that everything gets reassembled identically and without interruptions on the receiving end.
> 
> 

I see.

>> The following sections describe the implementation.
>>
>> I have enhanced the ram disk block device driver to provide persistent ram
>> disks on which any filesystem can be created. This is for persisting user data.
>> I have also implemented DAX support for the persistent ram disks.
> 
> 
> This is probably the least interesting of the enablements, right? You can already today reserve RAM on boot as DAX block device and use it for that purpose.
> 

Yes. pmem provides that functionality.

There are a few differences, though. However, I don't have a good feel for how important these differences are to users. Maybe they are not very significant. E.g.,

	- pmem regions need some setup using the ndctl command.
	- IIUC, one needs to specify a starting address and a size for a pmem region. Having to specify a starting address may make it somewhat less flexible from a configuration point of view.
	- In the case of pmem, the entire range of memory is set aside. In the case of the prmem persistent ram disk, pages are allocated as needed. So, persistent memory is shared among multiple
	  consumers more flexibly.

Also, Greg H. wanted to see a filesystem-based use case presented for persistent memory so we can see how it all comes together. I am working on prmemfs (a special FS tailored for persistence), but that will take some time. So, I wanted to present this ram disk use case as a more flexible alternative to pmem.

But you are right. They are equivalent for all practical purposes.

> 
>> I am also working on making ZRAM persistent.
>>
>> I have also briefly discussed the following use cases:
>>
>>          - Persisting IOMMU mappings
>>          - Remembering DMA pages
>>          - Reserving pages that encounter memory errors
>>          - Remembering IMA measurements for integrity checks
>>          - Remembering MMIO config info
>>          - Implementing prmemfs (special filesystem tailored for persistence)
>>
>> Allocate metadata
>> =================
>>
>> Define a metadata structure to store all persistent memory related information.
>> The metadata fits into one page. On a cold boot, allocate and initialize the
>> metadata page.
>>
>> Allocate data
>> =============
>>
>> On a cold boot, allocate some memory for storing persistent data. Call it
>> persistent memory. Specify the size in a command line parameter:
>>
>>          prmem=size[KMG][,max_size[KMG]]
>>
>>          size            Initial amount of memory allocated to prmem during boot
>>          max_size        Maximum amount of memory that can be allocated to prmem
>>
>> When the initial memory is exhausted via allocations, expand prmem dynamically
>> up to max_size. Expansion is done by allocating from the buddy allocator.
>> Record all allocations in the metadata.
> 
> 
> I don't understand why we need a separate allocator. Why can't we just use normal Linux allocations and serialize their location for handover? We would obviously still need to find a large contiguous piece of memory for the target kernel to bootstrap itself into until it can read which pages it can and can not use, but we can do that allocation in the source environment using CMA, no?
> 
> What I'm trying to say is: I think we're better off separating the handover mechanism from the allocation mechanism. If we can implement handover without a new allocator, we can use it for simple things with a slight runtime penalty. To accelerate the handover then, we can later add a compacting allocator that can use the handover mechanism we already built to persist itself.
> 
> 
> 
> I have a WIP branch where I'm toying with such a handover mechanism that uses device tree to serialize/deserialize state. By standardizing the property naming, we can in the receiving kernel mark all persistent allocations as reserved and then slowly either free them again or mark them as in-use one by one:
> 
> https://github.com/agraf/linux/commit/fd5736a21d549a9a86c178c91acb29ed7f364f42
> 
> I used ftrace as example payload to persist: With the handover mechanism in place, we serialize/deserialize ftrace ring buffer metadata and are thus able to read traces of the previous system after kexec. This way, you can for example profile the kexec exit path.
> 
> It's not even in RFC state yet, there are a few things where I would need a couple days to think hard about data structures, layouts and other problems :). But I believe from the patch you get the idea.
> 
> One such user of kho could be a new allocator like prmem and each subsystem's serialization code could choose to rely on the prmem subsystem to persist data instead of doing it themselves. That way you get a very non-intrusive enablement path for kexec handover, easily amendable data structures that can change compatibly over time as well as the ability to recreate ephemeral data structure based on persistent information - which will be necessary to persist VFIO containers.
> 

OK. I will study your changes and your comments. I will send my feedback as well.

Thanks again!

Madhavan

> 
> Alex
> 
> 
> 
> 
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
> 
> 


end of thread, other threads:[~2023-10-17 18:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1b1bc25eb87355b91fcde1de7c2f93f38abb2bf9>
2023-10-16 23:32 ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) madvenka
2023-10-16 23:32   ` [RFC PATCH v1 01/10] mm/prmem: Allocate memory during boot for storing persistent data madvenka
2023-10-16 23:32   ` [RFC PATCH v1 02/10] mm/prmem: Reserve metadata and persistent regions in early boot after kexec madvenka
2023-10-16 23:32   ` [RFC PATCH v1 03/10] mm/prmem: Manage persistent memory with the gen pool allocator madvenka
2023-10-16 23:32   ` [RFC PATCH v1 04/10] mm/prmem: Implement a page allocator for persistent memory madvenka
2023-10-16 23:32   ` [RFC PATCH v1 05/10] mm/prmem: Implement a buffer " madvenka
2023-10-16 23:32   ` [RFC PATCH v1 06/10] mm/prmem: Implement persistent XArray (and Radix Tree) madvenka
2023-10-16 23:32   ` [RFC PATCH v1 07/10] mm/prmem: Implement named Persistent Instances madvenka
2023-10-16 23:32   ` [RFC PATCH v1 08/10] mm/prmem: Implement Persistent Ramdisk instances madvenka
2023-10-16 23:32   ` [RFC PATCH v1 09/10] mm/prmem: Implement DAX support for Persistent Ramdisks madvenka
2023-10-16 23:32   ` [RFC PATCH v1 10/10] mm/prmem: Implement dynamic expansion of prmem madvenka
2023-10-17  8:31   ` [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem) Alexander Graf
2023-10-17 18:08     ` Madhavan T. Venkataraman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).