* [RFC v3 00/21] Preserved-over-Kexec RAM
@ 2023-04-27  0:08 ` Anthony Yznaga
  0 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Sending out this RFC in part to gauge community interest.
This patchset implements preserved-over-kexec memory storage, or PKRAM, as a
method for saving memory pages of the currently executing kernel so that
they may be restored after kexec into a new kernel. The patches are adapted
from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
introduce the PKRAM kernel API.

One use case for PKRAM is preserving guest memory and/or auxiliary
supporting data (e.g. iommu data) across kexec to support reboot of the
host with minimal disruption to the guest. PKRAM provides a flexible way
of doing this without requiring that a fixed amount of memory be
reserved a priori.  Another use case is for databases to preserve their
block caches in shared memory across reboot.

Changes since RFC v2
  - Rebased onto 6.3
  - Updated API to save/load folios rather than file pages
  - Omitted previous patches for implementing and optimizing preservation
    and restoration of shmem files to reduce the number of patches and
    focus on core functionality.

Changes since RFC v1
  - Rebased onto 5.12-rc4
  - Refined the API to reduce the number of calls
    and better support multithreading.
  - Allow preserving byte data of arbitrary length
    (was previously limited to one page).
  - Build a new memblock reserved list with the
    preserved ranges and then substitute it for
    the existing one. (Mike Rapoport)
  - Use mem_avoid_overlap() to avoid kaslr stepping
    on preserved ranges. (Kees Cook)

-- Implementation details --

 * To aid in quickly finding contiguous ranges of memory containing
   preserved pages, a pseudo physical mapping pagetable is populated
   with pages as they are preserved.

 * If a page to be preserved is found to be in range of memory that was
   previously reserved during early boot or in range of memory where the
   kernel will be loaded to on kexec, the page will be copied to a page
   outside of those ranges and the new page will be preserved. A compound
   page will be copied to and preserved as individual base pages.
   Note that this means that a page that cannot be moved (e.g. pinned for
   DMA) currently cannot safely be preserved. This could be addressed by
   adding functionality to kexec to reconfigure the destination addresses
   for the sections of an already-loaded kexec kernel.

 * A single page is allocated for the PKRAM super block. For the next kernel
   kexec boot to find preserved memory metadata, the pfn of the PKRAM super
   block, which is exported via /sys/kernel/pkram, is passed in the 'pkram'
   boot option.

 * In the newly booted kernel, PKRAM adds all preserved pages to the memblock
   reserve list during early boot so that they will not be recycled (see the
   sketch after this list).

 * Since kexec may load the new kernel code to any memory region, it could
   destroy preserved memory. When the kernel selects the memory region
   (kexec_file_load syscall), kexec will avoid preserved pages.  When the
   user selects the kexec memory region to use (kexec_load syscall), the
   kexec load will fail if there is a conflict with preserved pages. Pages
   preserved after a kexec kernel is loaded will be relocated if they
   conflict with the selected memory region.

[1] https://lkml.org/lkml/2013/7/1/211

Anthony Yznaga (21):
  mm: add PKRAM API stubs and Kconfig
  mm: PKRAM: implement node load and save functions
  mm: PKRAM: implement object load and save functions
  mm: PKRAM: implement folio stream operations
  mm: PKRAM: implement byte stream operations
  mm: PKRAM: link nodes by pfn before reboot
  mm: PKRAM: introduce super block
  PKRAM: track preserved pages in a physical mapping pagetable
  PKRAM: pass a list of preserved ranges to the next kernel
  PKRAM: prepare for adding preserved ranges to memblock reserved
  mm: PKRAM: reserve preserved memory at boot
  PKRAM: free the preserved ranges list
  PKRAM: prevent inadvertent use of a stale superblock
  PKRAM: provide a way to ban pages from use by PKRAM
  kexec: PKRAM: prevent kexec clobbering preserved pages in some cases
  PKRAM: provide a way to check if a memory range has preserved pages
  kexec: PKRAM: avoid clobbering already preserved pages
  mm: PKRAM: allow preserved memory to be freed from userspace
  PKRAM: disable feature when running the kdump kernel
  x86/KASLR: PKRAM: support physical kaslr
  x86/boot/compressed/64: use 1GB pages for mappings

 arch/x86/boot/compressed/Makefile       |    3 +
 arch/x86/boot/compressed/ident_map_64.c |    9 +-
 arch/x86/boot/compressed/kaslr.c        |   10 +-
 arch/x86/boot/compressed/misc.h         |   10 +
 arch/x86/boot/compressed/pkram.c        |  110 ++
 arch/x86/kernel/setup.c                 |    3 +
 arch/x86/mm/init_64.c                   |    3 +
 include/linux/pkram.h                   |  116 ++
 kernel/kexec.c                          |    9 +
 kernel/kexec_core.c                     |    3 +
 kernel/kexec_file.c                     |   15 +
 mm/Kconfig                              |    9 +
 mm/Makefile                             |    2 +
 mm/pkram.c                              | 1753 +++++++++++++++++++++++++++++++
 mm/pkram_pagetable.c                    |  375 +++++++
 15 files changed, 2424 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/boot/compressed/pkram.c
 create mode 100644 include/linux/pkram.h
 create mode 100644 mm/pkram.c
 create mode 100644 mm/pkram_pagetable.c

-- 
1.9.4


* [RFC v3 01/21] mm: add PKRAM API stubs and Kconfig
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Preserved-across-kexec memory, or PKRAM, is a method for saving memory
pages of the currently executing kernel and restoring them after kexec
boot into a new one. This can be utilized for preserving guest VM state,
large in-memory databases, process memory, etc. across reboot. While
DRAM-as-PMEM or actual persistent memory could be used to accomplish
these things, PKRAM provides the latency of DRAM with the flexibility
of dynamically determining the amount of memory to preserve.

The proposed API:

 * Preserved memory is divided into nodes which can be saved or loaded
   independently of each other. The nodes are identified by unique name
   strings. A PKRAM node is created when save is initiated by calling
   pkram_prepare_save(). A PKRAM node is removed when load is initiated by
   calling pkram_prepare_load(). See below.

 * A node is further divided into objects. An object represents closely
   coupled data in the form of a grouping of folios and/or a stream of
   byte data.  For example, the folios and attributes of a file.
   After initiating an operation on a PKRAM node, PKRAM objects are
   initialized for saving or loading by calling pkram_prepare_save_obj()
   or pkram_prepare_load_obj().

 * For saving/loading data from a PKRAM node/object, instances of the
   pkram_stream and pkram_access structs are used.  pkram_stream tracks
   the node and object being operated on while pkram_access tracks the
   data type and position within an object.

   The pkram_stream struct is initialized by calling pkram_prepare_save()
   or pkram_prepare_load() and then pkram_prepare_save_obj() or
   pkram_prepare_load_obj().

   Once a pkram_stream is fully initialized, a pkram_access struct
   is initialized for each data type associated with the object.
   After save or load of a data type for the object is complete,
   pkram_finish_access() is called.

   After save or load is complete for the object, pkram_finish_save_obj()
   or pkram_finish_load_obj() must be called followed by pkram_finish_save()
   or pkram_finish_load() when save or load is completed for the node.
   If an error occurred during save, the saved data and the PKRAM node
   may be freed by calling pkram_discard_save() instead of
   pkram_finish_save().

 * Both folio data and byte data can separately be streamed to a PKRAM
   object.  pkram_save_folio() and pkram_load_folio() are used
   to stream folio data while pkram_write() and pkram_read() are used to
   stream byte data.

A sequence of operations for saving/loading data from PKRAM would
look like:

  * For saving data to PKRAM:

    /* create a PKRAM node and do initial stream setup */
    pkram_prepare_save()

    /* create a PKRAM object associated with the PKRAM node and complete stream initialization */
    pkram_prepare_save_obj()

    /* save data to the node/object */
    PKRAM_ACCESS(pa_folios,...)
    PKRAM_ACCESS(pa_bytes,...)
    pkram_save_folio(pa_folios,...)[,...]  /* for file folios */
    pkram_write(pa_bytes,...)[,...]        /* for a byte stream */
    pkram_finish_access(pa_folios)
    pkram_finish_access(pa_bytes)

    pkram_finish_save_obj()

    /* commit the save or discard and delete the node */
    pkram_finish_save()          /* on success, or
    pkram_discard_save()          * ... in case of error */

  * For loading data from PKRAM:

    /* remove a PKRAM node from the list and do initial stream setup */
    pkram_prepare_load()

    /* Remove a PKRAM object from the node and complete stream initialization for loading data from it. */
    pkram_prepare_load_obj()

    /* load data from the node/object */
    PKRAM_ACCESS(pa_folios,...)
    PKRAM_ACCESS(pa_bytes,...)
    pkram_load_folio(pa_folios,...)[,...] /* for file folios */
    pkram_read(pa_bytes,...)[,...]        /* for a byte stream */
    pkram_finish_access(pa_folios)
    pkram_finish_access(pa_bytes)

    /* free the object */
    pkram_finish_load_obj()

    /* free the node */
    pkram_finish_load()
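
Below is a minimal, self-contained sketch of a save with error handling
as it might look once the API is implemented. The node name, the
PKRAM_ACCESS() type argument ('bytes'), and the error handling policy are
illustrative assumptions; at this point in the series the calls are stubs:

    static int example_save(const void *buf, size_t count)
    {
            struct pkram_stream ps;
            int err;

            /* Create the "example" node and do initial stream setup. */
            err = pkram_prepare_save(&ps, "example", GFP_KERNEL);
            if (err)
                    return err;

            /* Create an object within the node to hold the data. */
            err = pkram_prepare_save_obj(&ps, PKRAM_DATA_none);
            if (err)
                    goto out;

            /* Declare an access; 'bytes' stands in for a type added later. */
            PKRAM_ACCESS(pa_bytes, &ps, bytes);

            if (pkram_write(&pa_bytes, buf, count) < 0)
                    err = -EIO;
            pkram_finish_access(&pa_bytes, err == 0);

            pkram_finish_save_obj(&ps);
    out:
            if (err)
                    pkram_discard_save(&ps);  /* frees the node and saved data */
            else
                    pkram_finish_save(&ps);   /* commits the save */
            return err;
    }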

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  47 +++++++++++++
 mm/Kconfig            |   9 +++
 mm/Makefile           |   1 +
 mm/pkram.c            | 179 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 236 insertions(+)
 create mode 100644 include/linux/pkram.h
 create mode 100644 mm/pkram.c

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
new file mode 100644
index 000000000000..57b8db4229a4
--- /dev/null
+++ b/include/linux/pkram.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PKRAM_H
+#define _LINUX_PKRAM_H
+
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/mm_types.h>
+
+/**
+ * enum pkram_data_flags - definition of data types contained in a pkram obj
+ * @PKRAM_DATA_none: No data types configured
+ */
+enum pkram_data_flags {
+	PKRAM_DATA_none		= 0x0,  /* No data types configured */
+};
+
+struct pkram_stream;
+struct pkram_access;
+
+#define PKRAM_NAME_MAX		256	/* including nul */
+
+int pkram_prepare_save(struct pkram_stream *ps, const char *name,
+		       gfp_t gfp_mask);
+int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags);
+
+void pkram_finish_save(struct pkram_stream *ps);
+void pkram_finish_save_obj(struct pkram_stream *ps);
+void pkram_discard_save(struct pkram_stream *ps);
+
+int pkram_prepare_load(struct pkram_stream *ps, const char *name);
+int pkram_prepare_load_obj(struct pkram_stream *ps);
+
+void pkram_finish_load(struct pkram_stream *ps);
+void pkram_finish_load_obj(struct pkram_stream *ps);
+
+#define PKRAM_ACCESS(name, stream, type)			\
+	struct pkram_access name
+
+void pkram_finish_access(struct pkram_access *pa, bool status_ok);
+
+int pkram_save_folio(struct pkram_access *pa, struct folio *folio);
+struct folio *pkram_load_folio(struct pkram_access *pa, unsigned long *index);
+
+ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count);
+size_t pkram_read(struct pkram_access *pa, void *buf, size_t count);
+
+#endif /* _LINUX_PKRAM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f05..10f089f4a181 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,15 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config PKRAM
+	bool "Preserved-over-kexec memory storage"
+	default n
+	help
+	  This option adds the kernel API that enables saving memory pages of
+	  the currently executing kernel and restoring them after a kexec in
+	  the newly booted one. This can be utilized for speeding up reboot by
+	  leaving process memory and/or FS caches in-place.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..7a8d5a286d48 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_PKRAM) += pkram.o
diff --git a/mm/pkram.c b/mm/pkram.c
new file mode 100644
index 000000000000..421de8211e05
--- /dev/null
+++ b/mm/pkram.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/pkram.h>
+#include <linux/types.h>
+
+/**
+ * Create a preserved memory node with name @name and initialize stream @ps
+ * for saving data to it.
+ *
+ * @gfp_mask specifies the memory allocation mask to be used when saving data.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the save has finished, pkram_finish_save() (or pkram_discard_save() in
+ * case of failure) is to be called.
+ */
+int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask)
+{
+	return -EINVAL;
+}
+
+/**
+ * Create a preserved memory object and initialize stream @ps for saving data
+ * to it.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the save has finished, pkram_finish_save_obj() (or pkram_discard_save()
+ * in case of failure) is to be called.
+ */
+int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
+{
+	return -EINVAL;
+}
+
+/**
+ * Commit the object started with pkram_prepare_save_obj() to preserved memory.
+ */
+void pkram_finish_save_obj(struct pkram_stream *ps)
+{
+	WARN_ON_ONCE(1);
+}
+
+/**
+ * Commit the save to preserved memory started with pkram_prepare_save().
+ * After the call, the stream may not be used any more.
+ */
+void pkram_finish_save(struct pkram_stream *ps)
+{
+	WARN_ON_ONCE(1);
+}
+
+/**
+ * Cancel the save to preserved memory started with pkram_prepare_save() and
+ * destroy the corresponding preserved memory node freeing any data already
+ * saved to it.
+ */
+void pkram_discard_save(struct pkram_stream *ps)
+{
+	WARN_ON_ONCE(1);
+}
+
+/**
+ * Remove the preserved memory node with name @name and initialize stream @ps
+ * for loading data from it.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the load has finished, pkram_finish_load() is to be called.
+ */
+int pkram_prepare_load(struct pkram_stream *ps, const char *name)
+{
+	return -EINVAL;
+}
+
+/**
+ * Remove the next preserved memory object from the stream @ps and
+ * initialize stream @ps for loading data from it.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the load has finished, pkram_finish_load_obj() is to be called.
+ */
+int pkram_prepare_load_obj(struct pkram_stream *ps)
+{
+	return -EINVAL;
+}
+
+/**
+ * Finish the load of a preserved memory object started with
+ * pkram_prepare_load_obj() freeing the object and any data that has not
+ * been loaded from it.
+ */
+void pkram_finish_load_obj(struct pkram_stream *ps)
+{
+	WARN_ON_ONCE(1);
+}
+
+/**
+ * Finish the load from preserved memory started with pkram_prepare_load()
+ * freeing the corresponding preserved memory node and any data that has
+ * not been loaded from it.
+ */
+void pkram_finish_load(struct pkram_stream *ps)
+{
+	WARN_ON_ONCE(1);
+}
+
+/**
+ * Finish the data access to or from the preserved memory node and object
+ * associated with pkram stream access @pa.  The access must have been
+ * initialized with PKRAM_ACCESS().
+ */
+void pkram_finish_access(struct pkram_access *pa, bool status_ok)
+{
+	WARN_ON_ONCE(1);
+}
+
+/**
+ * Save folio @folio to the preserved memory node and object associated
+ * with pkram stream access @pa. The stream must have been initialized with
+ * pkram_prepare_save() and pkram_prepare_save_obj() and access initialized
+ * with PKRAM_ACCESS().
+ *
+ * Returns 0 on success, -errno on failure.
+ */
+int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
+{
+	return -EINVAL;
+}
+
+/**
+ * Load the next folio from the preserved memory node and object associated
+ * with pkram stream access @pa. The stream must have been initialized with
+ * pkram_prepare_load() and pkram_prepare_load_obj() and access initialized
+ * with PKRAM_ACCESS().
+ *
+ * If not NULL, @index is initialized with the preserved mapping offset of the
+ * folio loaded.
+ *
+ * Returns the folio loaded or NULL if the node is empty.
+ *
+ * The folio loaded has its refcount incremented.
+ */
+struct folio *pkram_load_folio(struct pkram_access *pa, unsigned long *index)
+{
+	return NULL;
+}
+
+/**
+ * Copy @count bytes from @buf to the preserved memory node and object
+ * associated with pkram stream access @pa. The stream must have been
+ * initialized with pkram_prepare_save() and pkram_prepare_save_obj()
+ * and access initialized with PKRAM_ACCESS().
+ *
+ * On success, returns the number of bytes written, which is always equal to
+ * @count. On failure, -errno is returned.
+ */
+ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count)
+{
+	return -EINVAL;
+}
+
+/**
+ * Copy up to @count bytes from the preserved memory node and object
+ * associated with pkram stream access @pa to @buf. The stream must have been
+ * initialized with pkram_prepare_load() and pkram_prepare_load_obj() and
+ * access initialized with PKRAM_ACCESS().
+ *
+ * Returns the number of bytes read, which may be less than @count if the node
+ * has fewer bytes available.
+ */
+size_t pkram_read(struct pkram_access *pa, void *buf, size_t count)
+{
+	return 0;
+}
-- 
1.9.4


* [RFC v3 02/21] mm: PKRAM: implement node load and save functions
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Preserved memory is divided into nodes which can be saved and loaded
independently of each other. PKRAM nodes are kept on a list and
identified by unique names. Whenever a save operation is initiated by
calling pkram_prepare_save(), a new node is created and linked to the
list. When the save operation has been committed by calling
pkram_finish_save(), the node becomes loadable. A load operation can
then be initiated by calling pkram_prepare_load(), which deletes the node
from the list and prepares the corresponding stream for loading data
from it. After the load has been finished, the pkram_finish_load()
function must be called to free the node. Nodes are also deleted when a
save operation is discarded, i.e. pkram_discard_save() is called instead
of pkram_finish_save().
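
As a minimal sketch (the node name and the elided per-object loading are
illustrative assumptions), a consumer of the API would load and free a
previously saved node like this:

    struct pkram_stream ps;
    int err;

    err = pkram_prepare_load(&ps, "example");
    if (err)        /* -ENOENT: no such node; -EBUSY: save not finished */
            return err;

    /* ... load objects and their data from the stream ... */

    pkram_finish_load(&ps);        /* frees the node */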

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |   8 ++-
 mm/pkram.c            | 147 ++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 149 insertions(+), 6 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 57b8db4229a4..8def9017b16a 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -6,6 +6,8 @@
 #include <linux/types.h>
 #include <linux/mm_types.h>
 
+struct pkram_node;
+
 /**
  * enum pkram_data_flags - definition of data types contained in a pkram obj
  * @PKRAM_DATA_none: No data types configured
@@ -14,7 +16,11 @@ enum pkram_data_flags {
 	PKRAM_DATA_none		= 0x0,  /* No data types configured */
 };
 
-struct pkram_stream;
+struct pkram_stream {
+	gfp_t gfp_mask;
+	struct pkram_node *node;
+};
+
 struct pkram_access;
 
 #define PKRAM_NAME_MAX		256	/* including nul */
diff --git a/mm/pkram.c b/mm/pkram.c
index 421de8211e05..bbfd8df0874e 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -2,16 +2,85 @@
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/kernel.h>
+#include <linux/list.h>
 #include <linux/mm.h>
+#include <linux/mutex.h>
 #include <linux/pkram.h>
+#include <linux/string.h>
 #include <linux/types.h>
 
+/*
+ * Preserved memory is divided into nodes that can be saved or loaded
+ * independently of each other. The nodes are identified by unique name
+ * strings.
+ *
+ * The structure occupies a memory page.
+ */
+struct pkram_node {
+	__u32	flags;
+
+	__u8	name[PKRAM_NAME_MAX];
+};
+
+#define PKRAM_SAVE		1
+#define PKRAM_LOAD		2
+#define PKRAM_ACCMODE_MASK	3
+
+static LIST_HEAD(pkram_nodes);			/* linked through page::lru */
+static DEFINE_MUTEX(pkram_mutex);		/* serializes open/close */
+
+static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
+{
+	return alloc_page(gfp_mask);
+}
+
+static inline void pkram_free_page(void *addr)
+{
+	free_page((unsigned long)addr);
+}
+
+static inline void pkram_insert_node(struct pkram_node *node)
+{
+	list_add(&virt_to_page(node)->lru, &pkram_nodes);
+}
+
+static inline void pkram_delete_node(struct pkram_node *node)
+{
+	list_del(&virt_to_page(node)->lru);
+}
+
+static struct pkram_node *pkram_find_node(const char *name)
+{
+	struct page *page;
+	struct pkram_node *node;
+
+	list_for_each_entry(page, &pkram_nodes, lru) {
+		node = page_address(page);
+		if (strcmp(node->name, name) == 0)
+			return node;
+	}
+	return NULL;
+}
+
+static void pkram_stream_init(struct pkram_stream *ps,
+			     struct pkram_node *node, gfp_t gfp_mask)
+{
+	memset(ps, 0, sizeof(*ps));
+	ps->gfp_mask = gfp_mask;
+	ps->node = node;
+}
+
 /**
  * Create a preserved memory node with name @name and initialize stream @ps
  * for saving data to it.
  *
  * @gfp_mask specifies the memory allocation mask to be used when saving data.
  *
+ * Error values:
+ *	%ENAMETOOLONG: name len >= PKRAM_NAME_MAX
+ *	%ENOMEM: insufficient memory available
+ *	%EEXIST: node with specified name already exists
+ *
  * Returns 0 on success, -errno on failure.
  *
  * After the save has finished, pkram_finish_save() (or pkram_discard_save() in
@@ -19,7 +88,34 @@
  */
 int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask)
 {
-	return -EINVAL;
+	struct page *page;
+	struct pkram_node *node;
+	int err = 0;
+
+	if (strlen(name) >= PKRAM_NAME_MAX)
+		return -ENAMETOOLONG;
+
+	page = pkram_alloc_page(gfp_mask | __GFP_ZERO);
+	if (!page)
+		return -ENOMEM;
+	node = page_address(page);
+
+	node->flags = PKRAM_SAVE;
+	strcpy(node->name, name);
+
+	mutex_lock(&pkram_mutex);
+	if (!pkram_find_node(name))
+		pkram_insert_node(node);
+	else
+		err = -EEXIST;
+	mutex_unlock(&pkram_mutex);
+	if (err) {
+		pkram_free_page(node);
+		return err;
+	}
+
+	pkram_stream_init(ps, node, gfp_mask);
+	return 0;
 }
 
 /**
@@ -50,7 +146,11 @@ void pkram_finish_save_obj(struct pkram_stream *ps)
  */
 void pkram_finish_save(struct pkram_stream *ps)
 {
-	WARN_ON_ONCE(1);
+	struct pkram_node *node = ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	node->flags &= ~PKRAM_ACCMODE_MASK;
 }
 
 /**
@@ -60,7 +160,15 @@ void pkram_finish_save(struct pkram_stream *ps)
  */
 void pkram_discard_save(struct pkram_stream *ps)
 {
-	WARN_ON_ONCE(1);
+	struct pkram_node *node = ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	mutex_lock(&pkram_mutex);
+	pkram_delete_node(node);
+	mutex_unlock(&pkram_mutex);
+
+	pkram_free_page(node);
 }
 
 /**
@@ -69,11 +177,36 @@ void pkram_discard_save(struct pkram_stream *ps)
  *
  * Returns 0 on success, -errno on failure.
  *
+ * Error values:
+ *	%ENOENT: node with specified name does not exist
+ *	%EBUSY: save to the requested node has not finished yet
+ *
  * After the load has finished, pkram_finish_load() is to be called.
  */
 int pkram_prepare_load(struct pkram_stream *ps, const char *name)
 {
-	return -EINVAL;
+	struct pkram_node *node;
+	int err = 0;
+
+	mutex_lock(&pkram_mutex);
+	node = pkram_find_node(name);
+	if (!node) {
+		err = -ENOENT;
+		goto out_unlock;
+	}
+	if (node->flags & PKRAM_ACCMODE_MASK) {
+		err = -EBUSY;
+		goto out_unlock;
+	}
+	pkram_delete_node(node);
+out_unlock:
+	mutex_unlock(&pkram_mutex);
+	if (err)
+		return err;
+
+	node->flags |= PKRAM_LOAD;
+	pkram_stream_init(ps, node, 0);
+	return 0;
 }
 
 /**
@@ -106,7 +239,11 @@ void pkram_finish_load_obj(struct pkram_stream *ps)
  */
 void pkram_finish_load(struct pkram_stream *ps)
 {
-	WARN_ON_ONCE(1);
+	struct pkram_node *node = ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	pkram_free_page(node);
 }
 
 /**
-- 
1.9.4


* [RFC v3 03/21] mm: PKRAM: implement object load and save functions
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

PKRAM nodes are further divided into a list of objects. After a save
operation has been initiated for a node, a save operation for an object
associated with the node is initiated by calling pkram_prepare_save_obj().
A new object is created and linked to the node.  The save operation for
the object is committed by calling pkram_finish_save_obj().  After a load
operation has been initiated, pkram_prepare_load_obj() is called to
delete the next object from the node and prepare the corresponding
stream for loading data from it.  After the load of an object has
finished, pkram_finish_load_obj() is called to free the object.  Objects
are also deleted when a save operation is discarded.
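
As a minimal sketch, with the per-object data loading elided, iterating
over all objects of a node during load would look like this:

    /* pkram_prepare_load_obj() returns -ENODATA once no objects remain. */
    while (pkram_prepare_load_obj(&ps) == 0) {
            /* ... load folio and/or byte data from the object ... */
            pkram_finish_load_obj(&ps);
    }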

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  2 ++
 mm/pkram.c            | 72 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 8def9017b16a..83718ad0e416 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -7,6 +7,7 @@
 #include <linux/mm_types.h>
 
 struct pkram_node;
+struct pkram_obj;
 
 /**
  * enum pkram_data_flags - definition of data types contained in a pkram obj
@@ -19,6 +20,7 @@ enum pkram_data_flags {
 struct pkram_stream {
 	gfp_t gfp_mask;
 	struct pkram_node *node;
+	struct pkram_obj *obj;
 };
 
 struct pkram_access;
diff --git a/mm/pkram.c b/mm/pkram.c
index bbfd8df0874e..6e3895cb9872 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -6,9 +6,14 @@
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/pkram.h>
+#include <linux/sched.h>
 #include <linux/string.h>
 #include <linux/types.h>
 
+struct pkram_obj {
+	__u64   obj_pfn;	/* points to the next object in the list */
+};
+
 /*
  * Preserved memory is divided into nodes that can be saved or loaded
  * independently of each other. The nodes are identified by unique name
@@ -18,6 +23,7 @@
  */
 struct pkram_node {
 	__u32	flags;
+	__u64	obj_pfn;	/* points to the first obj of the node */
 
 	__u8	name[PKRAM_NAME_MAX];
 };
@@ -62,6 +68,21 @@ static struct pkram_node *pkram_find_node(const char *name)
 	return NULL;
 }
 
+static void pkram_truncate_node(struct pkram_node *node)
+{
+	unsigned long obj_pfn;
+	struct pkram_obj *obj;
+
+	obj_pfn = node->obj_pfn;
+	while (obj_pfn) {
+		obj = pfn_to_kaddr(obj_pfn);
+		obj_pfn = obj->obj_pfn;
+		pkram_free_page(obj);
+		cond_resched();
+	}
+	node->obj_pfn = 0;
+}
+
 static void pkram_stream_init(struct pkram_stream *ps,
 			     struct pkram_node *node, gfp_t gfp_mask)
 {
@@ -124,12 +145,31 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask
  *
  * Returns 0 on success, -errno on failure.
  *
+ * Error values:
+ *	%ENOMEM: insufficient memory available
+ *
  * After the save has finished, pkram_finish_save_obj() (or pkram_discard_save()
  * in case of failure) is to be called.
  */
 int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 {
-	return -EINVAL;
+	struct pkram_node *node = ps->node;
+	struct pkram_obj *obj;
+	struct page *page;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	page = pkram_alloc_page(ps->gfp_mask | __GFP_ZERO);
+	if (!page)
+		return -ENOMEM;
+	obj = page_address(page);
+
+	if (node->obj_pfn)
+		obj->obj_pfn = node->obj_pfn;
+	node->obj_pfn = page_to_pfn(page);
+
+	ps->obj = obj;
+	return 0;
 }
 
 /**
@@ -137,7 +177,9 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
  */
 void pkram_finish_save_obj(struct pkram_stream *ps)
 {
-	WARN_ON_ONCE(1);
+	struct pkram_node *node = ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 }
 
 /**
@@ -168,6 +210,7 @@ void pkram_discard_save(struct pkram_stream *ps)
 	pkram_delete_node(node);
 	mutex_unlock(&pkram_mutex);
 
+	pkram_truncate_node(node);
 	pkram_free_page(node);
 }
 
@@ -215,11 +258,26 @@ int pkram_prepare_load(struct pkram_stream *ps, const char *name)
  *
  * Returns 0 on success, -errno on failure.
  *
+ * Error values:
+ *	%ENODATA: Stream @ps has no preserved memory objects
+ *
  * After the load has finished, pkram_finish_load_obj() is to be called.
  */
 int pkram_prepare_load_obj(struct pkram_stream *ps)
 {
-	return -EINVAL;
+	struct pkram_node *node = ps->node;
+	struct pkram_obj *obj;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	if (!node->obj_pfn)
+		return -ENODATA;
+
+	obj = pfn_to_kaddr(node->obj_pfn);
+	node->obj_pfn = obj->obj_pfn;
+
+	ps->obj = obj;
+	return 0;
 }
 
 /**
@@ -229,7 +287,12 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
  */
 void pkram_finish_load_obj(struct pkram_stream *ps)
 {
-	WARN_ON_ONCE(1);
+	struct pkram_node *node = ps->node;
+	struct pkram_obj *obj = ps->obj;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	pkram_free_page(obj);
 }
 
 /**
@@ -243,6 +306,7 @@ void pkram_finish_load(struct pkram_stream *ps)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
 
+	pkram_truncate_node(node);
 	pkram_free_page(node);
 }
 
-- 
1.9.4
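
A minimal load-side sketch of the object iteration introduced above
(illustrative only, not code from this series; per-object data loading
is added by later patches):

static int example_load_all_objects(const char *name)
{
	struct pkram_stream ps;
	int err;

	err = pkram_prepare_load(&ps, name);
	if (err)
		return err;

	/* Iterate over the node's objects until -ENODATA is returned. */
	while ((err = pkram_prepare_load_obj(&ps)) == 0) {
		/* ... load the object's data here (later patches) ... */
		pkram_finish_load_obj(&ps);
	}

	pkram_finish_load(&ps);
	return err == -ENODATA ? 0 : err;
}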


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 04/21] mm: PKRAM: implement folio stream operations
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Implement pkram_save_folio() to populate a PKRAM object with in-memory
folios and pkram_load_folio() to load folios from a PKRAM object.
Saving a folio to PKRAM is accomplished by recording its pfn, order,
and mapping index and incrementing its refcount so that it will not
be freed after the last user puts it.
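
For illustration, a save-side sketch (not part of this patch; the node
name is arbitrary, and pkram_finish_save() is assumed to be the
counterpart to pkram_prepare_save() in the core API):

static int example_save_folio(struct folio *folio)
{
	struct pkram_stream ps;
	int err;

	err = pkram_prepare_save(&ps, "example", GFP_KERNEL);
	if (err)
		return err;

	err = pkram_prepare_save_obj(&ps, PKRAM_DATA_folios);
	if (err)
		goto discard;

	PKRAM_ACCESS(pa, &ps, folios);

	err = pkram_save_folio(&pa, folio);
	pkram_finish_access(&pa, err == 0);
	if (err)
		goto discard;

	pkram_finish_save_obj(&ps);
	pkram_finish_save(&ps);
	return 0;

discard:
	pkram_discard_save(&ps);
	return err;
}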

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  42 ++++++-
 mm/pkram.c            | 311 +++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 346 insertions(+), 7 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 83718ad0e416..130ab5c2d94a 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -8,22 +8,47 @@
 
 struct pkram_node;
 struct pkram_obj;
+struct pkram_link;
 
 /**
  * enum pkram_data_flags - definition of data types contained in a pkram obj
  * @PKRAM_DATA_none: No data types configured
+ * @PKRAM_DATA_folios: obj contains folio data
  */
 enum pkram_data_flags {
-	PKRAM_DATA_none		= 0x0,  /* No data types configured */
+	PKRAM_DATA_none		= 0x0,	/* No data types configured */
+	PKRAM_DATA_folios	= 0x1,	/* Contains folio data */
+};
+
+struct pkram_data_stream {
+	/* List of link pages to add/remove from */
+	__u64 *head_link_pfnp;
+	__u64 *tail_link_pfnp;
+
+	struct pkram_link *link;	/* current link */
+	unsigned int entry_idx;		/* next entry in link */
 };
 
 struct pkram_stream {
 	gfp_t gfp_mask;
 	struct pkram_node *node;
 	struct pkram_obj *obj;
+
+	__u64 *folios_head_link_pfnp;
+	__u64 *folios_tail_link_pfnp;
+};
+
+struct pkram_folios_access {
+	unsigned long next_index;
 };
 
-struct pkram_access;
+struct pkram_access {
+	enum pkram_data_flags dtype;
+	struct pkram_stream *ps;
+	struct pkram_data_stream pds;
+
+	struct pkram_folios_access folios;
+};
 
 #define PKRAM_NAME_MAX		256	/* including nul */
 
@@ -41,8 +66,19 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 void pkram_finish_load(struct pkram_stream *ps);
 void pkram_finish_load_obj(struct pkram_stream *ps);
 
+#define PKRAM_PDS_INIT(name, stream, type) {			\
+	.head_link_pfnp = (stream)->type##_head_link_pfnp,	\
+	.tail_link_pfnp = (stream)->type##_tail_link_pfnp,	\
+	}
+
+#define PKRAM_ACCESS_INIT(name, stream, type) {			\
+	.dtype = PKRAM_DATA_##type,				\
+	.ps = (stream),						\
+	.pds = PKRAM_PDS_INIT(name, stream, type),		\
+	}
+
 #define PKRAM_ACCESS(name, stream, type)			\
-	struct pkram_access name
+	struct pkram_access name = PKRAM_ACCESS_INIT(name, stream, type)
 
 void pkram_finish_access(struct pkram_access *pa, bool status_ok);
 
diff --git a/mm/pkram.c b/mm/pkram.c
index 6e3895cb9872..610ff7a88c98 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/err.h>
 #include <linux/gfp.h>
+#include <linux/io.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/mm.h>
@@ -10,8 +11,40 @@
 #include <linux/string.h>
 #include <linux/types.h>
 
+#include "internal.h"
+
+
+/*
+ * Represents a reference to a data page saved to PKRAM.
+ */
+typedef __u64 pkram_entry_t;
+
+#define PKRAM_ENTRY_FLAGS_SHIFT	0x5
+#define PKRAM_ENTRY_FLAGS_MASK	0x7f
+#define PKRAM_ENTRY_ORDER_MASK	0x1f
+
+/*
+ * Keeps references to folios saved to PKRAM.
+ * The structure occupies a memory page.
+ */
+struct pkram_link {
+	__u64	link_pfn;	/* points to the next link of the object */
+	__u64	index;		/* mapping index of first pkram_entry_t */
+
+	/*
+	 * the array occupies the rest of the link page; if the link is not
+	 * full, the rest of the array must be filled with zeros
+	 */
+	pkram_entry_t entry[];
+};
+
+#define PKRAM_LINK_ENTRIES_MAX \
+	((PAGE_SIZE-sizeof(struct pkram_link))/sizeof(pkram_entry_t))
+
 struct pkram_obj {
-	__u64   obj_pfn;	/* points to the next object in the list */
+	__u64	folios_head_link_pfn;	/* the first folios link of the object */
+	__u64	folios_tail_link_pfn;	/* the last folios link of the object */
+	__u64	obj_pfn;	/* points to the next object in the list */
 };
 
 /*
@@ -19,6 +52,10 @@ struct pkram_obj {
  * independently of each other. The nodes are identified by unique name
  * strings.
  *
+ * References to folios saved to a preserved memory node are kept in a
+ * singly-linked list of PKRAM link structures (see above); the node
+ * holds a pointer to the head of that list.
+ *
  * The structure occupies a memory page.
  */
 struct pkram_node {
@@ -68,6 +105,41 @@ static struct pkram_node *pkram_find_node(const char *name)
 	return NULL;
 }
 
+static void pkram_truncate_link(struct pkram_link *link)
+{
+	struct page *page;
+	pkram_entry_t p;
+	int i;
+
+	for (i = 0; i < PKRAM_LINK_ENTRIES_MAX; i++) {
+		p = link->entry[i];
+		if (!p)
+			continue;
+		page = pfn_to_page(PHYS_PFN(p));
+		put_page(page);
+	}
+}
+
+static void pkram_truncate_links(unsigned long link_pfn)
+{
+	struct pkram_link *link;
+
+	while (link_pfn) {
+		link = pfn_to_kaddr(link_pfn);
+		pkram_truncate_link(link);
+		link_pfn = link->link_pfn;
+		pkram_free_page(link);
+		cond_resched();
+	}
+}
+
+static void pkram_truncate_obj(struct pkram_obj *obj)
+{
+	pkram_truncate_links(obj->folios_head_link_pfn);
+	obj->folios_head_link_pfn = 0;
+	obj->folios_tail_link_pfn = 0;
+}
+
 static void pkram_truncate_node(struct pkram_node *node)
 {
 	unsigned long obj_pfn;
@@ -76,6 +148,7 @@ static void pkram_truncate_node(struct pkram_node *node)
 	obj_pfn = node->obj_pfn;
 	while (obj_pfn) {
 		obj = pfn_to_kaddr(obj_pfn);
+		pkram_truncate_obj(obj);
 		obj_pfn = obj->obj_pfn;
 		pkram_free_page(obj);
 		cond_resched();
@@ -83,6 +156,84 @@ static void pkram_truncate_node(struct pkram_node *node)
 	node->obj_pfn = 0;
 }
 
+static void pkram_add_link(struct pkram_link *link, struct pkram_data_stream *pds)
+{
+	__u64 link_pfn = page_to_pfn(virt_to_page(link));
+
+	if (!*pds->head_link_pfnp) {
+		*pds->head_link_pfnp = link_pfn;
+		*pds->tail_link_pfnp = link_pfn;
+	} else {
+		struct pkram_link *tail = pfn_to_kaddr(*pds->tail_link_pfnp);
+
+		tail->link_pfn = link_pfn;
+		*pds->tail_link_pfnp = link_pfn;
+	}
+}
+
+static struct pkram_link *pkram_remove_link(struct pkram_data_stream *pds)
+{
+	struct pkram_link *link;
+
+	if (!*pds->head_link_pfnp)
+		return NULL;
+
+	link = pfn_to_kaddr(*pds->head_link_pfnp);
+	*pds->head_link_pfnp = link->link_pfn;
+	if (!*pds->head_link_pfnp)
+		*pds->tail_link_pfnp = 0;
+	else
+		link->link_pfn = 0;
+
+	return link;
+}
+
+static struct pkram_link *pkram_new_link(struct pkram_data_stream *pds, gfp_t gfp_mask)
+{
+	struct pkram_link *link;
+	struct page *link_page;
+
+	link_page = pkram_alloc_page((gfp_mask & GFP_RECLAIM_MASK) |
+				    __GFP_ZERO);
+	if (!link_page)
+		return NULL;
+
+	link = page_address(link_page);
+	pkram_add_link(link, pds);
+	pds->link = link;
+	pds->entry_idx = 0;
+
+	return link;
+}
+
+static void pkram_add_link_entry(struct pkram_data_stream *pds, struct page *page)
+{
+	struct pkram_link *link = pds->link;
+	pkram_entry_t p;
+	short flags = 0;
+
+	p = page_to_phys(page);
+	p |= compound_order(page);
+	p |= ((flags & PKRAM_ENTRY_FLAGS_MASK) << PKRAM_ENTRY_FLAGS_SHIFT);
+	link->entry[pds->entry_idx] = p;
+	pds->entry_idx++;
+}
+
+static int pkram_next_link(struct pkram_data_stream *pds, struct pkram_link **linkp)
+{
+	struct pkram_link *link;
+
+	link = pkram_remove_link(pds);
+	if (!link)
+		return -ENODATA;
+
+	pds->link = link;
+	pds->entry_idx = 0;
+	*linkp = link;
+
+	return 0;
+}
+
 static void pkram_stream_init(struct pkram_stream *ps,
 			     struct pkram_node *node, gfp_t gfp_mask)
 {
@@ -159,6 +310,9 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
+	if (flags & ~PKRAM_DATA_folios)
+		return -EINVAL;
+
 	page = pkram_alloc_page(ps->gfp_mask | __GFP_ZERO);
 	if (!page)
 		return -ENOMEM;
@@ -168,6 +322,10 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 		obj->obj_pfn = node->obj_pfn;
 	node->obj_pfn = page_to_pfn(page);
 
+	if (flags & PKRAM_DATA_folios) {
+		ps->folios_head_link_pfnp = &obj->folios_head_link_pfn;
+		ps->folios_tail_link_pfnp = &obj->folios_tail_link_pfn;
+	}
 	ps->obj = obj;
 	return 0;
 }
@@ -274,8 +432,17 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
 		return -ENODATA;
 
 	obj = pfn_to_kaddr(node->obj_pfn);
+	if (!obj->folios_head_link_pfn) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
 	node->obj_pfn = obj->obj_pfn;
 
+	if (obj->folios_head_link_pfn) {
+		ps->folios_head_link_pfnp = &obj->folios_head_link_pfn;
+		ps->folios_tail_link_pfnp = &obj->folios_tail_link_pfn;
+	}
 	ps->obj = obj;
 	return 0;
 }
@@ -292,6 +459,7 @@ void pkram_finish_load_obj(struct pkram_stream *ps)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
 
+	pkram_truncate_obj(obj);
 	pkram_free_page(obj);
 }
 
@@ -317,7 +485,41 @@ void pkram_finish_load(struct pkram_stream *ps)
  */
 void pkram_finish_access(struct pkram_access *pa, bool status_ok)
 {
-	WARN_ON_ONCE(1);
+	if (status_ok)
+		return;
+
+	if (pa->ps->node->flags == PKRAM_SAVE)
+		return;
+
+	if (pa->pds.link)
+		pkram_truncate_link(pa->pds.link);
+}
+
+/*
+ * Add a page to a PKRAM obj allocating a new PKRAM link if necessary.
+ */
+static int __pkram_save_page(struct pkram_access *pa, struct page *page,
+			     unsigned long index)
+{
+	struct pkram_data_stream *pds = &pa->pds;
+	struct pkram_link *link = pds->link;
+
+	if (!link || pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX ||
+	    index != pa->folios.next_index) {
+		link = pkram_new_link(pds, pa->ps->gfp_mask);
+		if (!link)
+			return -ENOMEM;
+
+		pa->folios.next_index = link->index = index;
+	}
+
+	get_page(page);
+
+	pkram_add_link_entry(pds, page);
+
+	pa->folios.next_index += compound_nr(page);
+
+	return 0;
 }
 
 /**
@@ -327,10 +529,102 @@ void pkram_finish_access(struct pkram_access *pa, bool status_ok)
  * with PKRAM_ACCESS().
  *
  * Returns 0 on success, -errno on failure.
+ *
+ * Error values:
+ *	%ENOMEM: insufficient amount of memory available
+ *
+ * Saving a folio to preserved memory is simply incrementing its refcount so
+ * that it will not get freed after the last user puts it. That means it is
+ * safe to use the folio as usual after it has been saved.
  */
 int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
 {
-	return -EINVAL;
+	struct pkram_node *node = pa->ps->node;
+	struct page *page = folio_page(folio, 0);
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	return __pkram_save_page(pa, page, page->index);
+}
+
+static struct page *__pkram_prep_load_page(pkram_entry_t p)
+{
+	struct page *page;
+	int order;
+	short flags;
+
+	flags = (p >> PKRAM_ENTRY_FLAGS_SHIFT) & PKRAM_ENTRY_FLAGS_MASK;
+	order = p & PKRAM_ENTRY_ORDER_MASK;
+	if (order >= MAX_ORDER)
+		goto out_error;
+
+	page = pfn_to_page(PHYS_PFN(p));
+
+	if (!page_ref_freeze(page, 1)) {
+		pr_err("PKRAM preserved page has unexpected inflated ref count\n");
+		goto out_error;
+	}
+
+	if (order) {
+		prep_compound_page(page, order);
+		if (order > 1)
+			prep_transhuge_page(page);
+	}
+
+	page_ref_unfreeze(page, 1);
+
+	return page;
+
+out_error:
+	return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Extract the next page from preserved memory freeing a PKRAM link if it
+ * becomes empty.
+ */
+static struct page *__pkram_load_page(struct pkram_access *pa, unsigned long *index)
+{
+	struct pkram_data_stream *pds = &pa->pds;
+	struct pkram_link *link = pds->link;
+	struct page *page;
+	pkram_entry_t p;
+	int ret;
+
+	if (!link) {
+		ret = pkram_next_link(pds, &link);
+		if (ret)
+			return NULL;
+
+		if (index)
+			pa->folios.next_index = link->index;
+	}
+
+	BUG_ON(pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX);
+
+	p = link->entry[pds->entry_idx];
+	BUG_ON(!p);
+
+	page = __pkram_prep_load_page(p);
+	if (IS_ERR(page))
+		return page;
+
+	if (index) {
+		*index = pa->folios.next_index;
+		pa->folios.next_index += compound_nr(page);
+	}
+
+	/* clear to avoid double free (see pkram_truncate_link()) */
+	link->entry[pds->entry_idx] = 0;
+
+	pds->entry_idx++;
+	if (pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX ||
+	    !link->entry[pds->entry_idx]) {
+		pds->link = NULL;
+		pkram_free_page(link);
+	}
+
+	return page;
 }
 
 /**
@@ -348,7 +642,16 @@ int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
  */
 struct folio *pkram_load_folio(struct pkram_access *pa, unsigned long *index)
 {
-	return NULL;
+	struct pkram_node *node = pa->ps->node;
+	struct page *page;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	page = __pkram_load_page(pa, index);
+	if (IS_ERR_OR_NULL(page))
+		return (struct folio *)page;
+	else
+		return page_folio(page);
 }
 
 /**
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 05/21] mm: PKRAM: implement byte stream operations
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

This patch adds the ability to save arbitrary byte streams to a PKRAM
object using pkram_write() so that they can be restored later using
pkram_read().
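
A minimal sketch of the save side (illustrative, not part of the patch;
it assumes the object was prepared with the PKRAM_DATA_bytes flag via
pkram_prepare_save_obj()):

static int example_save_blob(struct pkram_stream *ps,
			     const void *buf, size_t len)
{
	ssize_t ret;

	PKRAM_ACCESS(pa, ps, bytes);

	ret = pkram_write(&pa, buf, len);
	pkram_finish_access(&pa, ret == len);

	return ret < 0 ? ret : 0;
}

The load side mirrors this with a bytes access and pkram_read().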

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  11 +++++
 mm/pkram.c            | 123 ++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 130ab5c2d94a..b614e9059bba 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -14,10 +14,12 @@
  * enum pkram_data_flags - definition of data types contained in a pkram obj
  * @PKRAM_DATA_none: No data types configured
  * @PKRAM_DATA_folios: obj contains folio data
+ * @PKRAM_DATA_bytes: obj contains byte data
  */
 enum pkram_data_flags {
 	PKRAM_DATA_none		= 0x0,	/* No data types configured */
 	PKRAM_DATA_folios	= 0x1,	/* Contains folio data */
+	PKRAM_DATA_bytes	= 0x2,	/* Contains byte data */
 };
 
 struct pkram_data_stream {
@@ -36,18 +38,27 @@ struct pkram_stream {
 
 	__u64 *folios_head_link_pfnp;
 	__u64 *folios_tail_link_pfnp;
+
+	__u64 *bytes_head_link_pfnp;
+	__u64 *bytes_tail_link_pfnp;
 };
 
 struct pkram_folios_access {
 	unsigned long next_index;
 };
 
+struct pkram_bytes_access {
+	struct page *data_page;		/* current page */
+	unsigned int data_offset;	/* offset into current page */
+};
+
 struct pkram_access {
 	enum pkram_data_flags dtype;
 	struct pkram_stream *ps;
 	struct pkram_data_stream pds;
 
 	struct pkram_folios_access folios;
+	struct pkram_bytes_access bytes;
 };
 
 #define PKRAM_NAME_MAX		256	/* including nul */
diff --git a/mm/pkram.c b/mm/pkram.c
index 610ff7a88c98..eac8cf6b0cdf 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/err.h>
 #include <linux/gfp.h>
+#include <linux/highmem.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
@@ -44,6 +45,9 @@ struct pkram_link {
 struct pkram_obj {
 	__u64	folios_head_link_pfn;	/* the first folios link of the object */
 	__u64	folios_tail_link_pfn;	/* the last folios link of the object */
+	__u64	bytes_head_link_pfn;	/* the first bytes link of the object */
+	__u64	bytes_tail_link_pfn;	/* the last bytes link of the object */
+	__u64	data_len;	/* byte data size */
 	__u64	obj_pfn;	/* points to the next object in the list */
 };
 
@@ -138,6 +142,11 @@ static void pkram_truncate_obj(struct pkram_obj *obj)
 	pkram_truncate_links(obj->folios_head_link_pfn);
 	obj->folios_head_link_pfn = 0;
 	obj->folios_tail_link_pfn = 0;
+
+	pkram_truncate_links(obj->bytes_head_link_pfn);
+	obj->bytes_head_link_pfn = 0;
+	obj->bytes_tail_link_pfn = 0;
+	obj->data_len = 0;
 }
 
 static void pkram_truncate_node(struct pkram_node *node)
@@ -310,7 +319,7 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
-	if (flags & ~PKRAM_DATA_folios)
+	if (flags & ~(PKRAM_DATA_folios | PKRAM_DATA_bytes))
 		return -EINVAL;
 
 	page = pkram_alloc_page(ps->gfp_mask | __GFP_ZERO);
@@ -326,6 +335,10 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 		ps->folios_head_link_pfnp = &obj->folios_head_link_pfn;
 		ps->folios_tail_link_pfnp = &obj->folios_tail_link_pfn;
 	}
+	if (flags & PKRAM_DATA_bytes) {
+		ps->bytes_head_link_pfnp = &obj->bytes_head_link_pfn;
+		ps->bytes_tail_link_pfnp = &obj->bytes_tail_link_pfn;
+	}
 	ps->obj = obj;
 	return 0;
 }
@@ -432,7 +445,7 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
 		return -ENODATA;
 
 	obj = pfn_to_kaddr(node->obj_pfn);
-	if (!obj->folios_head_link_pfn) {
+	if (!obj->folios_head_link_pfn && !obj->bytes_head_link_pfn) {
 		WARN_ON(1);
 		return -EINVAL;
 	}
@@ -443,6 +456,10 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
 		ps->folios_head_link_pfnp = &obj->folios_head_link_pfn;
 		ps->folios_tail_link_pfnp = &obj->folios_tail_link_pfn;
 	}
+	if (obj->bytes_head_link_pfn) {
+		ps->bytes_head_link_pfnp = &obj->bytes_head_link_pfn;
+		ps->bytes_tail_link_pfnp = &obj->bytes_tail_link_pfn;
+	}
 	ps->obj = obj;
 	return 0;
 }
@@ -493,6 +510,9 @@ void pkram_finish_access(struct pkram_access *pa, bool status_ok)
 
 	if (pa->pds.link)
 		pkram_truncate_link(pa->pds.link);
+
+	if ((pa->dtype == PKRAM_DATA_bytes) && (pa->bytes.data_page))
+		pkram_free_page(page_address(pa->bytes.data_page));
 }
 
 /*
@@ -547,6 +567,22 @@ int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
 	return __pkram_save_page(pa, page, page->index);
 }
 
+static int __pkram_bytes_save_page(struct pkram_access *pa, struct page *page)
+{
+	struct pkram_data_stream *pds = &pa->pds;
+	struct pkram_link *link = pds->link;
+
+	if (!link || pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX) {
+		link = pkram_new_link(pds, pa->ps->gfp_mask);
+		if (!link)
+			return -ENOMEM;
+	}
+
+	pkram_add_link_entry(pds, page);
+
+	return 0;
+}
+
 static struct page *__pkram_prep_load_page(pkram_entry_t p)
 {
 	struct page *page;
@@ -662,10 +698,53 @@ struct folio *pkram_load_folio(struct pkram_access *pa, unsigned long *index)
  *
  * On success, returns the number of bytes written, which is always equal to
  * @count. On failure, -errno is returned.
+ *
+ * Error values:
+ *    %ENOMEM: insufficient amount of memory available
  */
 ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count)
 {
-	return -EINVAL;
+	struct pkram_node *node = pa->ps->node;
+	struct pkram_obj *obj = pa->ps->obj;
+	size_t copy_count, write_count = 0;
+	void *addr;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	while (count > 0) {
+		if (!pa->bytes.data_page) {
+			gfp_t gfp_mask = pa->ps->gfp_mask;
+			struct page *page;
+			int err;
+
+			page = pkram_alloc_page((gfp_mask & GFP_RECLAIM_MASK) |
+					       __GFP_HIGHMEM | __GFP_ZERO);
+			if (!page)
+				return -ENOMEM;
+			err = __pkram_bytes_save_page(pa, page);
+			if (err) {
+				pkram_free_page(page_address(page));
+				return err;
+			}
+			pa->bytes.data_page = page;
+			pa->bytes.data_offset = 0;
+		}
+
+		copy_count = min_t(size_t, count, PAGE_SIZE - pa->bytes.data_offset);
+		addr = kmap_local_page(pa->bytes.data_page);
+		memcpy(addr + pa->bytes.data_offset, buf, copy_count);
+		kunmap_local(addr);
+
+		buf += copy_count;
+		obj->data_len += copy_count;
+		pa->bytes.data_offset += copy_count;
+		if (pa->bytes.data_offset >= PAGE_SIZE)
+			pa->bytes.data_page = NULL;
+
+		write_count += copy_count;
+		count -= copy_count;
+	}
+	return write_count;
 }
 
 /**
@@ -679,5 +758,41 @@ ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count)
  */
 size_t pkram_read(struct pkram_access *pa, void *buf, size_t count)
 {
-	return 0;
+	struct pkram_node *node = pa->ps->node;
+	struct pkram_obj *obj = pa->ps->obj;
+	size_t copy_count, read_count = 0;
+	char *addr;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	while (count > 0 && obj->data_len > 0) {
+		if (!pa->bytes.data_page) {
+			struct page *page;
+
+			page = __pkram_load_page(pa, NULL);
+			if (IS_ERR_OR_NULL(page))
+				break;
+			pa->bytes.data_page = page;
+			pa->bytes.data_offset = 0;
+		}
+
+		copy_count = min_t(size_t, count, PAGE_SIZE - pa->bytes.data_offset);
+		if (copy_count > obj->data_len)
+			copy_count = obj->data_len;
+		addr = kmap_local_page(pa->bytes.data_page);
+		memcpy(buf, addr + pa->bytes.data_offset, copy_count);
+		kunmap_local(addr);
+
+		buf += copy_count;
+		obj->data_len -= copy_count;
+		pa->bytes.data_offset += copy_count;
+		if (pa->bytes.data_offset >= PAGE_SIZE || !obj->data_len) {
+			put_page(pa->bytes.data_page);
+			pa->bytes.data_page = NULL;
+		}
+
+		read_count += copy_count;
+		count -= copy_count;
+	}
+	return read_count;
 }
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 06/21] mm: PKRAM: link nodes by pfn before reboot
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Since page structs, which are used for linking PKRAM nodes, are
cleared on boot, organize all PKRAM nodes into a list singly-linked
by pfns before reboot so that the node list can be restored in the
new kernel.
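
To illustrate what this linking enables, a hedged sketch of the
restore-side walk (the actual restore code arrives later in the
series):

static void example_walk_nodes(unsigned long node_pfn)
{
	while (node_pfn) {
		struct pkram_node *node = pfn_to_kaddr(node_pfn);

		pr_info("pkram: found node '%s'\n", node->name);
		node_pfn = node->node_pfn;
	}
}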

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/mm/pkram.c b/mm/pkram.c
index eac8cf6b0cdf..da166cb6afb7 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -2,12 +2,16 @@
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/highmem.h>
+#include <linux/init.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/mm.h>
+#include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/notifier.h>
 #include <linux/pkram.h>
+#include <linux/reboot.h>
 #include <linux/sched.h>
 #include <linux/string.h>
 #include <linux/types.h>
@@ -60,11 +64,15 @@ struct pkram_obj {
  * singly-linked list of PKRAM link structures (see above); the node
  * holds a pointer to the head of that list.
  *
+ * To facilitate data restore in the new kernel, before reboot all PKRAM nodes
+ * are organized into a list singly-linked by pfns (see pkram_reboot()).
+ *
  * The structure occupies a memory page.
  */
 struct pkram_node {
 	__u32	flags;
 	__u64	obj_pfn;	/* points to the first obj of the node */
+	__u64	node_pfn;	/* points to the next node in the node list */
 
 	__u8	name[PKRAM_NAME_MAX];
 };
@@ -73,6 +81,10 @@ struct pkram_node {
 #define PKRAM_LOAD		2
 #define PKRAM_ACCMODE_MASK	3
 
+/*
+ * For convenience, PKRAM nodes are kept in an auxiliary doubly-linked list
+ * connected through the lru field of the page struct.
+ */
 static LIST_HEAD(pkram_nodes);			/* linked through page::lru */
 static DEFINE_MUTEX(pkram_mutex);		/* serializes open/close */
 
@@ -796,3 +808,41 @@ size_t pkram_read(struct pkram_access *pa, void *buf, size_t count)
 	}
 	return read_count;
 }
+
+/*
+ * Build the list of PKRAM nodes.
+ */
+static void __pkram_reboot(void)
+{
+	struct page *page;
+	struct pkram_node *node;
+	unsigned long node_pfn = 0;
+
+	list_for_each_entry_reverse(page, &pkram_nodes, lru) {
+		node = page_address(page);
+		if (WARN_ON(node->flags & PKRAM_ACCMODE_MASK))
+			continue;
+		node->node_pfn = node_pfn;
+		node_pfn = page_to_pfn(page);
+	}
+}
+
+static int pkram_reboot(struct notifier_block *notifier,
+		       unsigned long val, void *v)
+{
+	if (val != SYS_RESTART)
+		return NOTIFY_DONE;
+	__pkram_reboot();
+	return NOTIFY_OK;
+}
+
+static struct notifier_block pkram_reboot_notifier = {
+	.notifier_call = pkram_reboot,
+};
+
+static int __init pkram_init(void)
+{
+	register_reboot_notifier(&pkram_reboot_notifier);
+	return 0;
+}
+module_init(pkram_init);
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 07/21] mm: PKRAM: introduce super block
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

The PKRAM super block is the starting point for restoring preserved
memory. By providing the super block to the new kernel at boot time,
preserved memory can be reserved and made available to be restored.
To point the kernel to the location of the super block, one passes
its pfn via the 'pkram' boot param. For that purpose, the pkram super
block pfn is exported via /sys/kernel/pkram. If no pfn is passed, any
previously preserved memory will not be retained, and a new super block
will be allocated.
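
As a usage sketch (illustrative only, not part of the patch), a
userspace helper preparing the next kexec could read the exported pfn
and emit the matching boot parameter. show_pkram_sb_pfn() prints the
pfn in hex and parse_pkram_sb_pfn() parses hex, so no 0x prefix is
needed:

#include <stdio.h>

int main(void)
{
	unsigned long pfn;
	FILE *f = fopen("/sys/kernel/pkram", "r");

	if (!f || fscanf(f, "%lx", &pfn) != 1)
		return 1;
	fclose(f);
	/* append this to the command line of the kernel to be kexec'ed */
	printf("pkram=%lx\n", pfn);
	return 0;
}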

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 100 insertions(+), 2 deletions(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index da166cb6afb7..c66b2ae4d520 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -5,15 +5,18 @@
 #include <linux/init.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
+#include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/notifier.h>
+#include <linux/pfn.h>
 #include <linux/pkram.h>
 #include <linux/reboot.h>
 #include <linux/sched.h>
 #include <linux/string.h>
+#include <linux/sysfs.h>
 #include <linux/types.h>
 
 #include "internal.h"
@@ -82,12 +85,38 @@ struct pkram_node {
 #define PKRAM_ACCMODE_MASK	3
 
 /*
+ * The PKRAM super block contains data needed to restore the preserved memory
+ * structure on boot. The pointer to it (pfn) should be passed via the 'pkram'
+ * boot param if one wants to restore preserved data saved by the previously
+ * executing kernel. For that purpose the kernel exports the pfn via
+ * /sys/kernel/pkram. If no pfn is passed, any previously preserved memory
+ * will not be restored and a new clean page is allocated for the super block.
+ *
+ * The structure occupies a memory page.
+ */
+struct pkram_super_block {
+	__u64	node_pfn;		/* first element of the node list */
+};
+
+static unsigned long pkram_sb_pfn __initdata;
+static struct pkram_super_block *pkram_sb;
+
+/*
  * For convenience, PKRAM nodes are kept in an auxiliary doubly-linked list
  * connected through the lru field of the page struct.
  */
 static LIST_HEAD(pkram_nodes);			/* linked through page::lru */
 static DEFINE_MUTEX(pkram_mutex);		/* serializes open/close */
 
+/*
+ * The PKRAM super block pfn, see above.
+ */
+static int __init parse_pkram_sb_pfn(char *arg)
+{
+	return kstrtoul(arg, 16, &pkram_sb_pfn);
+}
+early_param("pkram", parse_pkram_sb_pfn);
+
 static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 {
 	return alloc_page(gfp_mask);
@@ -270,6 +299,7 @@ static void pkram_stream_init(struct pkram_stream *ps,
  * @gfp_mask specifies the memory allocation mask to be used when saving data.
  *
  * Error values:
+ *	%ENODEV: PKRAM not available
  *	%ENAMETOOLONG: name len >= PKRAM_NAME_MAX
  *	%ENOMEM: insufficient memory available
  *	%EEXIST: node with specified name already exists
@@ -285,6 +315,9 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask
 	struct pkram_node *node;
 	int err = 0;
 
+	if (!pkram_sb)
+		return -ENODEV;
+
 	if (strlen(name) >= PKRAM_NAME_MAX)
 		return -ENAMETOOLONG;
 
@@ -404,6 +437,7 @@ void pkram_discard_save(struct pkram_stream *ps)
  * Returns 0 on success, -errno on failure.
  *
  * Error values:
+ *	%ENODEV: PKRAM not available
  *	%ENOENT: node with specified name does not exist
  *	%EBUSY: save to required node has not finished yet
  *
@@ -414,6 +448,9 @@ int pkram_prepare_load(struct pkram_stream *ps, const char *name)
 	struct pkram_node *node;
 	int err = 0;
 
+	if (!pkram_sb)
+		return -ENODEV;
+
 	mutex_lock(&pkram_mutex);
 	node = pkram_find_node(name);
 	if (!node) {
@@ -825,6 +862,13 @@ static void __pkram_reboot(void)
 		node->node_pfn = node_pfn;
 		node_pfn = page_to_pfn(page);
 	}
+
+	/*
+	 * Zero out pkram_sb completely since it may have been passed from
+	 * the previous boot.
+	 */
+	memset(pkram_sb, 0, PAGE_SIZE);
+	pkram_sb->node_pfn = node_pfn;
 }
 
 static int pkram_reboot(struct notifier_block *notifier,
@@ -832,7 +876,8 @@ static int pkram_reboot(struct notifier_block *notifier,
 {
 	if (val != SYS_RESTART)
 		return NOTIFY_DONE;
-	__pkram_reboot();
+	if (pkram_sb)
+		__pkram_reboot();
 	return NOTIFY_OK;
 }
 
@@ -840,9 +885,62 @@ static int pkram_reboot(struct notifier_block *notifier,
 	.notifier_call = pkram_reboot,
 };
 
+static ssize_t show_pkram_sb_pfn(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	unsigned long pfn = pkram_sb ? PFN_DOWN(__pa(pkram_sb)) : 0;
+
+	return sprintf(buf, "%lx\n", pfn);
+}
+
+static struct kobj_attribute pkram_sb_pfn_attr =
+	__ATTR(pkram, 0444, show_pkram_sb_pfn, NULL);
+
+static struct attribute *pkram_attrs[] = {
+	&pkram_sb_pfn_attr.attr,
+	NULL,
+};
+
+static struct attribute_group pkram_attr_group = {
+	.attrs = pkram_attrs,
+};
+
+/* returns non-zero on success */
+static int __init pkram_init_sb(void)
+{
+	unsigned long pfn;
+	struct pkram_node *node;
+
+	if (!pkram_sb) {
+		struct page *page;
+
+		page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!page) {
+			pr_err("PKRAM: Failed to allocate super block\n");
+			return 0;
+		}
+		pkram_sb = page_address(page);
+	}
+
+	/*
+	 * Build the auxiliary doubly-linked list of nodes connected through
+	 * page::lru for convenience.
+	 */
+	pfn = pkram_sb->node_pfn;
+	while (pfn) {
+		node = pfn_to_kaddr(pfn);
+		pkram_insert_node(node);
+		pfn = node->node_pfn;
+	}
+	return 1;
+}
+
 static int __init pkram_init(void)
 {
-	register_reboot_notifier(&pkram_reboot_notifier);
+	if (pkram_init_sb()) {
+		register_reboot_notifier(&pkram_reboot_notifier);
+		sysfs_update_group(kernel_kobj, &pkram_attr_group);
+	}
 	return 0;
 }
 module_init(pkram_init);
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 08/21] PKRAM: track preserved pages in a physical mapping pagetable
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  0 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Later patches in this series will need a way to efficiently identify
physically contiguous ranges of preserved pages independent of their
virtual addresses. To facilitate this, all pages to be preserved across
kexec are added to a pseudo identity mapping pagetable.

The pagetable makes use of the existing architecture definitions for
building a memory mapping pagetable except that a bitmap is used to
represent the presence or absence of preserved pages at the PTE level.
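
For illustration, the walker added below could be driven by a callback
like this minimal sketch; print_range() and example_dump_preserved()
are hypothetical, and pkram_find_preserved() is the entry point
introduced in this patch:

static int print_range(unsigned long base, unsigned long size, void *private)
{
	pr_info("PKRAM: preserved range %#lx - %#lx\n", base, base + size);
	return 0;	/* a nonzero return stops the walk */
}

static void example_dump_preserved(void)
{
	/* enumerate every preserved range known to the pagetable */
	pkram_find_preserved(0, PHYS_ADDR_MAX, NULL, print_range);
}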

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/Makefile          |   2 +-
 mm/pkram.c           |  30 ++++-
 mm/pkram_pagetable.c | 375 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 403 insertions(+), 4 deletions(-)
 create mode 100644 mm/pkram_pagetable.c

diff --git a/mm/Makefile b/mm/Makefile
index 7a8d5a286d48..7a1a33b67de6 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,4 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
-obj-$(CONFIG_PKRAM) += pkram.o
+obj-$(CONFIG_PKRAM) += pkram.o pkram_pagetable.o
diff --git a/mm/pkram.c b/mm/pkram.c
index c66b2ae4d520..e6c0f3c52465 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -101,6 +101,9 @@ struct pkram_super_block {
 static unsigned long pkram_sb_pfn __initdata;
 static struct pkram_super_block *pkram_sb;
 
+extern int pkram_add_identity_map(struct page *page);
+extern void pkram_remove_identity_map(struct page *page);
+
 /*
  * For convenience, PKRAM nodes are kept in an auxiliary doubly-linked list
  * connected through the lru field of the page struct.
@@ -119,11 +122,24 @@ static int __init parse_pkram_sb_pfn(char *arg)
 
 static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 {
-	return alloc_page(gfp_mask);
+	struct page *page;
+	int err;
+
+	page = alloc_page(gfp_mask);
+	if (page) {
+		err = pkram_add_identity_map(page);
+		if (err) {
+			__free_page(page);
+			page = NULL;
+		}
+	}
+
+	return page;
 }
 
 static inline void pkram_free_page(void *addr)
 {
+	pkram_remove_identity_map(virt_to_page(addr));
 	free_page((unsigned long)addr);
 }
 
@@ -161,6 +177,7 @@ static void pkram_truncate_link(struct pkram_link *link)
 		if (!p)
 			continue;
 		page = pfn_to_page(PHYS_PFN(p));
+		pkram_remove_identity_map(page);
 		put_page(page);
 	}
 }
@@ -610,10 +627,15 @@ int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
 {
 	struct pkram_node *node = pa->ps->node;
 	struct page *page = folio_page(folio, 0);
+	int err;
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
-	return __pkram_save_page(pa, page, page->index);
+	err = __pkram_save_page(pa, page, page->index);
+	if (!err)
+		err = pkram_add_identity_map(page);
+
+	return err;
 }
 
 static int __pkram_bytes_save_page(struct pkram_access *pa, struct page *page)
@@ -658,6 +680,8 @@ static struct page *__pkram_prep_load_page(pkram_entry_t p)
 
 	page_ref_unfreeze(page, 1);
 
+	pkram_remove_identity_map(page);
+
 	return page;
 
 out_error:
@@ -914,7 +938,7 @@ static int __init pkram_init_sb(void)
 	if (!pkram_sb) {
 		struct page *page;
 
-		page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 		if (!page) {
 			pr_err("PKRAM: Failed to allocate super block\n");
 			return 0;
diff --git a/mm/pkram_pagetable.c b/mm/pkram_pagetable.c
new file mode 100644
index 000000000000..85e34301ef1e
--- /dev/null
+++ b/mm/pkram_pagetable.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bitops.h>
+#include <linux/mm.h>
+
+static pgd_t *pkram_pgd;
+static DEFINE_SPINLOCK(pkram_pgd_lock);
+
+#define set_p4d(p4dp, p4d)	WRITE_ONCE(*(p4dp), (p4d))
+
+#define PKRAM_PTE_BM_BYTES	(PTRS_PER_PTE / BITS_PER_BYTE)
+#define PKRAM_PTE_BM_MASK	(PAGE_SIZE / PKRAM_PTE_BM_BYTES - 1)
+
+static pmd_t make_bitmap_pmd(unsigned long *bitmap)
+{
+	unsigned long val;
+
+	val = __pa(ALIGN_DOWN((unsigned long)bitmap, PAGE_SIZE));
+	val |= (((unsigned long)bitmap & ~PAGE_MASK) / PKRAM_PTE_BM_BYTES);
+
+	return __pmd(val);
+}
+
+static unsigned long *get_bitmap_addr(pmd_t pmd)
+{
+	unsigned long val, off;
+
+	val = pmd_val(pmd);
+	off = (val & PKRAM_PTE_BM_MASK) * PKRAM_PTE_BM_BYTES;
+
+	val = (val & PAGE_MASK) + off;
+
+	return __va(val);
+}
+
+int pkram_add_identity_map(struct page *page)
+{
+	unsigned long paddr;
+	unsigned long *bitmap;
+	unsigned int index;
+	struct page *pg;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	if (!pkram_pgd) {
+		spin_lock(&pkram_pgd_lock);
+		if (!pkram_pgd) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			pkram_pgd = page_address(pg);
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+
+	paddr = __pa(page_address(page));
+	pgd = pkram_pgd;
+	pgd += pgd_index(paddr);
+	if (pgd_none(*pgd)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pgd_none(*pgd)) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			p4d = page_address(pg);
+			set_pgd(pgd, __pgd(__pa(p4d)));
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+	p4d = p4d_offset(pgd, paddr);
+	if (p4d_none(*p4d)) {
+		spin_lock(&pkram_pgd_lock);
+		if (p4d_none(*p4d)) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			pud = page_address(pg);
+			set_p4d(p4d, __p4d(__pa(pud)));
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+	pud = pud_offset(p4d, paddr);
+	if (pud_none(*pud)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pud_none(*pud)) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			pmd = page_address(pg);
+			set_pud(pud, __pud(__pa(pmd)));
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+	pmd = pmd_offset(pud, paddr);
+	if (pmd_none(*pmd)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pmd_none(*pmd)) {
+			if (PageTransHuge(page)) {
+				set_pmd(pmd, pmd_mkhuge(*pmd));
+				spin_unlock(&pkram_pgd_lock);
+				goto done;
+			}
+			bitmap = bitmap_zalloc(PTRS_PER_PTE, GFP_ATOMIC);
+			if (!bitmap)
+				goto nomem;
+			set_pmd(pmd, make_bitmap_pmd(bitmap));
+		} else {
+			BUG_ON(pmd_leaf(*pmd));
+			bitmap = get_bitmap_addr(*pmd);
+		}
+		spin_unlock(&pkram_pgd_lock);
+	} else {
+		BUG_ON(pmd_leaf(*pmd));
+		bitmap = get_bitmap_addr(*pmd);
+	}
+
+	index = pte_index(paddr);
+	BUG_ON(test_bit(index, bitmap));
+	set_bit(index, bitmap);
+	smp_mb__after_atomic();
+	if (bitmap_full(bitmap, PTRS_PER_PTE))
+		set_pmd(pmd, pmd_mkhuge(*pmd));
+done:
+	return 0;
+nomem:
+	return -ENOMEM;
+}
+
+void pkram_remove_identity_map(struct page *page)
+{
+	unsigned long *bitmap;
+	unsigned long paddr;
+	unsigned int index;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	/*
+	 * pkram_pgd will be null when freeing metadata pages after a reboot
+	 */
+	if (!pkram_pgd)
+		return;
+
+	paddr = __pa(page_address(page));
+	pgd = pkram_pgd;
+	pgd += pgd_index(paddr);
+	if (pgd_none(*pgd)) {
+		WARN_ONCE(1, "PKRAM: %s: no pgd for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	p4d = p4d_offset(pgd, paddr);
+	if (p4d_none(*p4d)) {
+		WARN_ONCE(1, "PKRAM: %s: no p4d for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	pud = pud_offset(p4d, paddr);
+	if (pud_none(*pud)) {
+		WARN_ONCE(1, "PKRAM: %s: no pud for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	pmd = pmd_offset(pud, paddr);
+	if (pmd_none(*pmd)) {
+		WARN_ONCE(1, "PKRAM: %s: no pmd for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	if (PageTransHuge(page)) {
+		BUG_ON(!pmd_leaf(*pmd));
+		pmd_clear(pmd);
+		return;
+	}
+
+	if (pmd_leaf(*pmd)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pmd_leaf(*pmd))
+			set_pmd(pmd, __pmd(pte_val(pte_clrhuge(*(pte_t *)pmd))));
+		spin_unlock(&pkram_pgd_lock);
+	}
+
+	bitmap = get_bitmap_addr(*pmd);
+	index = pte_index(paddr);
+	clear_bit(index, bitmap);
+	smp_mb__after_atomic();
+
+	spin_lock(&pkram_pgd_lock);
+	if (!pmd_none(*pmd) && bitmap_empty(bitmap, PTRS_PER_PTE)) {
+		pmd_clear(pmd);
+		spin_unlock(&pkram_pgd_lock);
+		bitmap_free(bitmap);
+	} else {
+		spin_unlock(&pkram_pgd_lock);
+	}
+}
+
+struct pkram_pg_state {
+	int (*range_cb)(unsigned long base, unsigned long size, void *private);
+	unsigned long start_addr;
+	unsigned long curr_addr;
+	unsigned long min_addr;
+	unsigned long max_addr;
+	void *private;
+	bool tracking;
+};
+
+#define pgd_none(a)  (pgtable_l5_enabled() ? pgd_none(a) : p4d_none(__p4d(pgd_val(a))))
+
+static int note_page(struct pkram_pg_state *st, unsigned long addr, bool present)
+{
+	if (!st->tracking && present) {
+		if (addr >= st->max_addr)
+			return 1;
+		/*
+		 * addr can be < min_addr if the page straddles the
+		 * boundary
+		 */
+		st->start_addr = max(addr, st->min_addr);
+		st->tracking = true;
+	} else if (st->tracking) {
+		unsigned long base, size;
+		int ret;
+
+		/* Continue tracking if upper bound has not been reached */
+		if (present && addr < st->max_addr)
+			return 0;
+
+		addr = min(addr, st->max_addr);
+
+		base = st->start_addr;
+		size = addr - st->start_addr;
+		st->tracking = false;
+
+		ret = st->range_cb(base, size, st->private);
+
+		if (addr == st->max_addr)
+			return 1;
+		else
+			return ret;
+	}
+
+	return 0;
+}
+
+static int walk_pte_level(struct pkram_pg_state *st, pmd_t addr, unsigned long P)
+{
+	unsigned long *bitmap;
+	int present;
+	int i, ret = 0;
+
+	bitmap = get_bitmap_addr(addr);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		unsigned long curr_addr = P + i * PAGE_SIZE;
+
+		if (curr_addr < st->min_addr)
+			continue;
+		present = test_bit(i, bitmap);
+		ret = note_page(st, curr_addr, present);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static int walk_pmd_level(struct pkram_pg_state *st, pud_t addr, unsigned long P)
+{
+	pmd_t *start;
+	int i, ret = 0;
+
+	start = pud_pgtable(addr);
+	for (i = 0; i < PTRS_PER_PMD; i++, start++) {
+		unsigned long curr_addr = P + i * PMD_SIZE;
+
+		if (curr_addr + PMD_SIZE <= st->min_addr)
+			continue;
+		if (!pmd_none(*start)) {
+			if (pmd_leaf(*start))
+				ret = note_page(st, curr_addr, true);
+			else
+				ret = walk_pte_level(st, *start, curr_addr);
+		} else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static int walk_pud_level(struct pkram_pg_state *st, p4d_t addr, unsigned long P)
+{
+	pud_t *start;
+	int i, ret = 0;
+
+	start = p4d_pgtable(addr);
+	for (i = 0; i < PTRS_PER_PUD; i++, start++) {
+		unsigned long curr_addr = P + i * PUD_SIZE;
+
+		if (curr_addr + PUD_SIZE <= st->min_addr)
+			continue;
+		if (!pud_none(*start)) {
+			if (pud_leaf(*start))
+				ret = note_page(st, curr_addr, true);
+			else
+				ret = walk_pmd_level(st, *start, curr_addr);
+		} else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static int walk_p4d_level(struct pkram_pg_state *st, pgd_t addr, unsigned long P)
+{
+	p4d_t *start;
+	int i, ret = 0;
+
+	if (PTRS_PER_P4D == 1)
+		return walk_pud_level(st, __p4d(pgd_val(addr)), P);
+
+	start = (p4d_t *)pgd_page_vaddr(addr);
+	for (i = 0; i < PTRS_PER_P4D; i++, start++) {
+		unsigned long curr_addr = P + i * P4D_SIZE;
+
+		if (curr_addr + P4D_SIZE <= st->min_addr)
+			continue;
+		if (!p4d_none(*start)) {
+			if (p4d_leaf(*start))
+				ret = note_page(st, curr_addr, true);
+			else
+				ret = walk_pud_level(st, *start, curr_addr);
+		} else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+void pkram_walk_pgt(struct pkram_pg_state *st, pgd_t *pgd)
+{
+	pgd_t *start = pgd;
+	int i, ret = 0;
+
+	for (i = 0; i < PTRS_PER_PGD; i++, start++) {
+		unsigned long curr_addr = i * PGDIR_SIZE;
+
+		if (curr_addr + PGDIR_SIZE <= st->min_addr)
+			continue;
+		if (!pgd_none(*start))
+			ret = walk_p4d_level(st, *start, curr_addr);
+		else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+}
+
+void pkram_find_preserved(unsigned long start, unsigned long end, void *private, int (*callback)(unsigned long base, unsigned long size, void *private))
+{
+	struct pkram_pg_state st = {
+		.range_cb = callback,
+		.min_addr = start,
+		.max_addr = end,
+		.private = private,
+	};
+
+	if (!pkram_pgd)
+		return;
+
+	pkram_walk_pgt(&st, pkram_pgd);
+}
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 09/21] PKRAM: pass a list of preserved ranges to the next kernel
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  0 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

In order to build a new memblock reserved list during boot that
includes ranges preserved by the previous kernel, a list of preserved
ranges is passed to the next kernel via the pkram superblock. The
ranges are stored in ascending order in a linked list of pages. A more
complete memblock list is not prepared to avoid possible conflicts with
changes in a newer kernel and to avoid having to allocate a contiguous
range larger than a page.
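
A minimal sketch of how the next kernel can consume the handed-over
list, assuming sb points to the incoming super block; use_region() and
example_iterate_regions() are hypothetical, and the next patch adds
proper iterators:

static void example_iterate_regions(struct pkram_super_block *sb)
{
	struct pkram_region_list *rl = pfn_to_kaddr(sb->region_list_pfn);
	unsigned long left = sb->nr_regions;
	unsigned long i, n;

	while (left) {
		n = min(left, (unsigned long)PKRAM_REGIONS_LIST_MAX);
		for (i = 0; i < n; i++)
			use_region(rl->regions[i].base, rl->regions[i].size);
		left -= n;
		if (left)
			rl = pfn_to_kaddr(rl->next_pfn);	/* advance to next list page */
	}
}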

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 177 insertions(+), 7 deletions(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index e6c0f3c52465..3790e5180feb 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -84,6 +84,20 @@ struct pkram_node {
 #define PKRAM_LOAD		2
 #define PKRAM_ACCMODE_MASK	3
 
+struct pkram_region {
+	phys_addr_t base;
+	phys_addr_t size;
+};
+
+struct pkram_region_list {
+	__u64	prev_pfn;
+	__u64	next_pfn;
+
+	struct pkram_region regions[];
+};
+
+#define PKRAM_REGIONS_LIST_MAX \
+	((PAGE_SIZE - sizeof(struct pkram_region_list)) / sizeof(struct pkram_region))
 /*
  * The PKRAM super block contains data needed to restore the preserved memory
  * structure on boot. The pointer to it (pfn) should be passed via the 'pkram'
@@ -96,13 +110,21 @@ struct pkram_node {
  */
 struct pkram_super_block {
 	__u64	node_pfn;		/* first element of the node list */
+	__u64	region_list_pfn;
+	__u64	nr_regions;
 };
 
+static struct pkram_region_list *pkram_regions_list;
+static int pkram_init_regions_list(void);
+static unsigned long pkram_populate_regions_list(void);
+
 static unsigned long pkram_sb_pfn __initdata;
 static struct pkram_super_block *pkram_sb;
 
 extern int pkram_add_identity_map(struct page *page);
 extern void pkram_remove_identity_map(struct page *page);
+extern void pkram_find_preserved(unsigned long start, unsigned long end, void *private,
+			int (*callback)(unsigned long base, unsigned long size, void *private));
 
 /*
  * For convenience, PKRAM nodes are kept in an auxiliary doubly-linked list
@@ -878,21 +900,48 @@ static void __pkram_reboot(void)
 	struct page *page;
 	struct pkram_node *node;
 	unsigned long node_pfn = 0;
+	unsigned long rl_pfn = 0;
+	unsigned long nr_regions = 0;
+	int err = 0;
 
-	list_for_each_entry_reverse(page, &pkram_nodes, lru) {
-		node = page_address(page);
-		if (WARN_ON(node->flags & PKRAM_ACCMODE_MASK))
-			continue;
-		node->node_pfn = node_pfn;
-		node_pfn = page_to_pfn(page);
+	if (!list_empty(&pkram_nodes)) {
+		err = pkram_add_identity_map(virt_to_page(pkram_sb));
+		if (err) {
+			pr_err("PKRAM: failed to add super block to pagetable\n");
+			goto done;
+		}
+		list_for_each_entry_reverse(page, &pkram_nodes, lru) {
+			node = page_address(page);
+			if (WARN_ON(node->flags & PKRAM_ACCMODE_MASK))
+				continue;
+			node->node_pfn = node_pfn;
+			node_pfn = page_to_pfn(page);
+		}
+		err = pkram_init_regions_list();
+		if (err) {
+			pr_err("PKRAM: failed to init regions list\n");
+			goto done;
+		}
+		nr_regions = pkram_populate_regions_list();
+		if (IS_ERR_VALUE(nr_regions)) {
+			err = nr_regions;
+			pr_err("PKRAM: failed to populate regions list\n");
+			goto done;
+		}
+		rl_pfn = page_to_pfn(virt_to_page(pkram_regions_list));
 	}
 
+done:
 	/*
 	 * Zero out pkram_sb completely since it may have been passed from
 	 * the previous boot.
 	 */
 	memset(pkram_sb, 0, PAGE_SIZE);
-	pkram_sb->node_pfn = node_pfn;
+	if (!err && node_pfn) {
+		pkram_sb->node_pfn = node_pfn;
+		pkram_sb->region_list_pfn = rl_pfn;
+		pkram_sb->nr_regions = nr_regions;
+	}
 }
 
 static int pkram_reboot(struct notifier_block *notifier,
@@ -968,3 +1017,124 @@ static int __init pkram_init(void)
 	return 0;
 }
 module_init(pkram_init);
+
+static int count_region_cb(unsigned long base, unsigned long size, void *private)
+{
+	unsigned long *nr_regions = (unsigned long *)private;
+
+	(*nr_regions)++;
+	return 0;
+}
+
+static unsigned long pkram_count_regions(void)
+{
+	unsigned long nr_regions = 0;
+
+	pkram_find_preserved(0, PHYS_ADDR_MAX, &nr_regions, count_region_cb);
+
+	return nr_regions;
+}
+
+/*
+ * To facilitate rapidly building a new memblock reserved list during boot
+ * that includes the preserved memory ranges, a regions list is built
+ * before reboot.
+ * The regions list is a linked list of pages with each page containing an
+ * array of preserved memory ranges.  The ranges are stored in each page
+ * and across the list in address order.  A linked list is used rather than
+ * a single contiguous range to mitigate the possibility that a larger,
+ * contiguous allocation may fail due to fragmentation.
+ *
+ * Since the pages of the regions list must be preserved and the pkram
+ * pagetable is used to determine what ranges are preserved, the list pages
+ * must be allocated and represented in the pkram pagetable before they can
+ * be populated.  Rather than recounting the number of regions after
+ * allocating pages and repeating until a precise number of pages are
+ * allocated, the number of pages needed is estimated.
+ */
+static int pkram_init_regions_list(void)
+{
+	struct pkram_region_list *rl;
+	unsigned long nr_regions;
+	unsigned long nr_lpages;
+	struct page *page;
+
+	nr_regions = pkram_count_regions();
+
+	nr_lpages = DIV_ROUND_UP(nr_regions, PKRAM_REGIONS_LIST_MAX);
+	nr_regions += nr_lpages;
+	nr_lpages = DIV_ROUND_UP(nr_regions, PKRAM_REGIONS_LIST_MAX);
+
+	for (; nr_lpages; nr_lpages--) {
+		page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!page)
+			return -ENOMEM;
+		rl = page_address(page);
+		if (pkram_regions_list) {
+			rl->next_pfn = page_to_pfn(virt_to_page(pkram_regions_list));
+			pkram_regions_list->prev_pfn = page_to_pfn(page);
+		}
+		pkram_regions_list = rl;
+	}
+
+	return 0;
+}
+
+struct pkram_regions_priv {
+	struct pkram_region_list *curr;
+	struct pkram_region_list *last;
+	unsigned long nr_regions;
+	int idx;
+};
+
+static int add_region_cb(unsigned long base, unsigned long size, void *private)
+{
+	struct pkram_regions_priv *priv;
+	struct pkram_region_list *rl;
+	int i;
+
+	priv = (struct pkram_regions_priv *)private;
+	rl = priv->curr;
+	i = priv->idx;
+
+	if (!rl) {
+		WARN_ON(1);
+		return 1;
+	}
+
+	if (!i)
+		priv->last = priv->curr;
+
+	rl->regions[i].base = base;
+	rl->regions[i].size = size;
+
+	priv->nr_regions++;
+	i++;
+	if (i == PKRAM_REGIONS_LIST_MAX) {
+		u64 next_pfn = rl->next_pfn;
+
+		if (next_pfn)
+			priv->curr = pfn_to_kaddr(next_pfn);
+		else
+			priv->curr = NULL;
+
+		i = 0;
+	}
+	priv->idx = i;
+
+	return 0;
+}
+
+static unsigned long pkram_populate_regions_list(void)
+{
+	struct pkram_regions_priv priv = { .curr = pkram_regions_list };
+
+	pkram_find_preserved(0, PHYS_ADDR_MAX, &priv, add_region_cb);
+
+	/*
+	 * Link the first list page to the last populated one.
+	 */
+	pkram_regions_list->prev_pfn = page_to_pfn(virt_to_page(priv.last));
+
+	return priv.nr_regions;
+}
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 10/21] PKRAM: prepare for adding preserved ranges to memblock reserved
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  0 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Calling memblock_reserve() repeatedly to add preserved ranges is
inefficient and risks clobbering preserved memory if the memblock
reserved regions array must be resized.  Instead, calculate the size
needed to accommodate the preserved ranges, find a suitable range for
a new reserved regions array that does not overlap any preserved range,
and populate it with a new, merged regions array.
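
The populate step below is a standard two-way merge of two
address-sorted, non-overlapping range arrays. A distilled sketch of the
invariant used by pkram_create_merged_reserved(), with ex_range as an
illustrative stand-in for the memblock and pkram region types:

struct ex_range { unsigned long base, size; };

static int merge_ranges(struct ex_range *out, const struct ex_range *a,
			int na, const struct ex_range *b, int nb)
{
	int i = 0, j = 0, k = 0;

	while (i < na && j < nb) {
		if (a[i].base + a[i].size <= b[j].base)
			out[k++] = a[i++];	/* a entry lies wholly below b */
		else if (b[j].base + b[j].size <= a[i].base)
			out[k++] = b[j++];	/* b entry lies wholly below a */
		else
			return -EBUSY;		/* unexpected overlap */
	}
	while (i < na)
		out[k++] = a[i++];
	while (j < nb)
		out[k++] = b[j++];

	return k;	/* number of merged ranges */
}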

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 244 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 244 insertions(+)

diff --git a/mm/pkram.c b/mm/pkram.c
index 3790e5180feb..c649504fa1fa 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -7,6 +7,7 @@
 #include <linux/kernel.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
+#include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
@@ -1138,3 +1139,246 @@ static unsigned long pkram_populate_regions_list(void)
 
 	return priv.nr_regions;
 }
+
+struct pkram_region *pkram_first_region(struct pkram_super_block *sb,
+					struct pkram_region_list **rlp, int *idx)
+{
+	WARN_ON(!sb);
+	WARN_ON(!sb->region_list_pfn);
+
+	if (!sb || !sb->region_list_pfn)
+		return NULL;
+
+	*rlp = pfn_to_kaddr(sb->region_list_pfn);
+	*idx = 0;
+
+	return &(*rlp)->regions[0];
+}
+
+struct pkram_region *pkram_next_region(struct pkram_region_list **rlp, int *idx)
+{
+	struct pkram_region_list *rl = *rlp;
+	int i = *idx;
+
+	i++;
+	if (i >= PKRAM_REGIONS_LIST_MAX) {
+		if (!rl->next_pfn) {
+			pr_err("PKRAM: %s: no more pkram_region_list pages\n", __func__);
+			return NULL;
+		}
+		rl = pfn_to_kaddr(rl->next_pfn);
+		*rlp = rl;
+		i = 0;
+	}
+	*idx = i;
+
+	if (rl->regions[i].size == 0)
+		return NULL;
+
+	return &rl->regions[i];
+}
+
+struct pkram_region *pkram_first_region_topdown(struct pkram_super_block *sb,
+						struct pkram_region_list **rlp, int *idx)
+{
+	struct pkram_region_list *rl;
+
+	WARN_ON(!sb);
+	WARN_ON(!sb->region_list_pfn);
+
+	if (!sb || !sb->region_list_pfn)
+		return NULL;
+
+	rl = pfn_to_kaddr(sb->region_list_pfn);
+	if (!rl->prev_pfn) {
+		WARN_ON(1);
+		return NULL;
+	}
+	rl = pfn_to_kaddr(rl->prev_pfn);
+
+	*rlp = rl;
+
+	*idx = (sb->nr_regions - 1) % PKRAM_REGIONS_LIST_MAX;
+
+	return &rl->regions[*idx];
+}
+
+struct pkram_region *pkram_next_region_topdown(struct pkram_region_list **rlp, int *idx)
+{
+	struct pkram_region_list *rl = *rlp;
+	int i = *idx;
+
+	if (i == 0) {
+		if (!rl->prev_pfn)
+			return NULL;
+		rl = pfn_to_kaddr(rl->prev_pfn);
+		*rlp = rl;
+		i = PKRAM_REGIONS_LIST_MAX - 1;
+	} else
+		i--;
+
+	*idx = i;
+
+	return &rl->regions[i];
+}
+
+/*
+ * Use the pkram regions list to allocate a block of memory that does
+ * not overlap with preserved pages.
+ */
+phys_addr_t __init alloc_topdown(phys_addr_t size)
+{
+	phys_addr_t hole_start, hole_end, hole_size;
+	struct pkram_region_list *rl;
+	struct pkram_region *r;
+	phys_addr_t addr = 0;
+	int idx;
+
+	hole_end = memblock.current_limit;
+	r = pkram_first_region_topdown(pkram_sb, &rl, &idx);
+
+	while (r) {
+		hole_start = r->base + r->size;
+		hole_size = hole_end - hole_start;
+
+		if (hole_size >= size) {
+			addr = memblock_phys_alloc_range(size, PAGE_SIZE,
+							hole_start, hole_end);
+			if (addr)
+				break;
+		}
+
+		hole_end = r->base;
+		r = pkram_next_region_topdown(&rl, &idx);
+	}
+
+	if (!addr)
+		addr = memblock_phys_alloc_range(size, PAGE_SIZE, 0, hole_end);
+
+	return addr;
+}
+
+int __init pkram_create_merged_reserved(struct memblock_type *new)
+{
+	unsigned long cnt_a;
+	unsigned long cnt_b;
+	long i, j, k;
+	struct memblock_region *r;
+	struct memblock_region *rgn;
+	struct pkram_region *pkr;
+	struct pkram_region_list *rl;
+	int idx;
+	unsigned long total_size = 0;
+	unsigned long nr_preserved = 0;
+
+	cnt_a = memblock.reserved.cnt;
+	cnt_b = pkram_sb->nr_regions;
+
+	i = 0;
+	j = 0;
+	k = 0;
+
+	pkr = pkram_first_region(pkram_sb, &rl, &idx);
+	if (!pkr)
+		return -EINVAL;
+	while (i < cnt_a && j < cnt_b && pkr) {
+		r = &memblock.reserved.regions[i];
+		rgn = &new->regions[k];
+
+		if (r->base + r->size <= pkr->base) {
+			*rgn = *r;
+			i++;
+		} else if (pkr->base + pkr->size <= r->base) {
+			rgn->base = pkr->base;
+			rgn->size = pkr->size;
+			memblock_set_region_node(rgn, MAX_NUMNODES);
+
+		nr_preserved += (rgn->size >> PAGE_SHIFT);
+			pkr = pkram_next_region(&rl, &idx);
+			j++;
+		} else {
+			pr_err("PKRAM: unexpected overlap:\n");
+			pr_err("PKRAM: reserved: base=%pa,size=%pa,flags=0x%x\n", &r->base,
+				&r->size, (int)r->flags);
+			pr_err("PKRAM: pkram: base=%pa,size=%pa\n", &pkr->base, &pkr->size);
+			return -EBUSY;
+		}
+		total_size += rgn->size;
+		k++;
+	}
+
+	while (i < cnt_a) {
+		r = &memblock.reserved.regions[i];
+		rgn = &new->regions[k];
+
+		*rgn = *r;
+
+		total_size += rgn->size;
+		i++;
+		k++;
+	}
+	while (j < cnt_b && pkr) {
+		rgn = &new->regions[k];
+		rgn->base = pkr->base;
+		rgn->size = pkr->size;
+		memblock_set_region_node(rgn, MAX_NUMNODES);
+
+		nr_preserved += (rgn->size >> PAGE_SHIFT);
+		total_size += rgn->size;
+		pkr = pkram_next_region(&rl, &idx);
+		j++;
+		k++;
+	}
+
+	WARN_ON(cnt_a + cnt_b != k);
+	new->cnt = cnt_a + cnt_b;
+	new->total_size = total_size;
+
+	return 0;
+}
+
+/*
+ * Reserve pages that belong to preserved memory.  This is accomplished by
+ * merging the existing reserved ranges with the preserved ranges into
+ * a new, sufficiently sized memblock reserved array.
+ *
+ * This function should be called at boot time as early as possible to prevent
+ * preserved memory from being recycled.
+ */
+int __init pkram_merge_with_reserved(void)
+{
+	struct memblock_type new;
+	unsigned long new_max;
+	phys_addr_t new_size;
+	phys_addr_t addr;
+	int err;
+
+	/*
+	 * Need space to insert one more range into memblock.reserved
+	 * without memblock_double_array() being called.
+	 */
+	if (memblock.reserved.cnt == memblock.reserved.max) {
+		WARN_ONCE(1, "PKRAM: no space for new memblock list\n");
+		return -ENOMEM;
+	}
+
+	new_max = memblock.reserved.max + pkram_sb->nr_regions;
+	new_size = PAGE_ALIGN(sizeof(struct memblock_region) * new_max);
+
+	addr = alloc_topdown(new_size);
+	if (!addr)
+		return -ENOMEM;
+
+	new.regions = __va(addr);
+	new.max = new_max;
+	err = pkram_create_merged_reserved(&new);
+	if (err)
+		return err;
+
+	memblock.reserved.cnt = new.cnt;
+	memblock.reserved.max = new.max;
+	memblock.reserved.total_size = new.total_size;
+	memblock.reserved.regions = new.regions;
+
+	return 0;
+}
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 11/21] mm: PKRAM: reserve preserved memory at boot
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Prevent preserved pages from being recycled during boot by adding them
to the memblock reserved list early in boot. If the reservation fails
(e.g. a region has already been reserved), all preserved pages are
dropped.
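
The truncate and load paths updated below decode a pkram_entry_t that
packs a page's physical address together with its order and flag bits.
A hedged sketch of such an encoding (the mask and shift values are
illustrative assumptions, not the actual PKRAM definitions):

  typedef unsigned long long pkram_entry_t;

  /*
   * A page-aligned physical address leaves its low PAGE_SHIFT bits
   * zero, so the page order and flags can be packed into them.
   */
  #define ENTRY_ORDER_MASK   0x1fULL      /* assumed field width */
  #define ENTRY_FLAGS_SHIFT  5            /* assumed field position */

  static unsigned int entry_order(pkram_entry_t p)
  {
          return p & ENTRY_ORDER_MASK;    /* cf. PKRAM_ENTRY_ORDER_MASK */
  }

  static unsigned long entry_pfn(pkram_entry_t p)
  {
          return p >> 12;                 /* PHYS_PFN() with 4 KiB pages */
  }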

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/kernel/setup.c |  3 ++
 arch/x86/mm/init_64.c   |  2 ++
 include/linux/pkram.h   |  8 +++++
 mm/pkram.c              | 84 ++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 92 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 16babff771bd..2806b21236d0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -18,6 +18,7 @@
 #include <linux/memblock.h>
 #include <linux/panic_notifier.h>
 #include <linux/pci.h>
+#include <linux/pkram.h>
 #include <linux/root_dev.h>
 #include <linux/hugetlb.h>
 #include <linux/tboot.h>
@@ -1221,6 +1222,8 @@ void __init setup_arch(char **cmdline_p)
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
+	pkram_reserve();
+
 	if (boot_cpu_has(X86_FEATURE_GBPAGES))
 		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a190aae8ceaf..a46ffb434f39 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -34,6 +34,7 @@
 #include <linux/gfp.h>
 #include <linux/kcore.h>
 #include <linux/bootmem_info.h>
+#include <linux/pkram.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
@@ -1339,6 +1340,7 @@ void __init mem_init(void)
 	after_bootmem = 1;
 	x86_init.hyper.init_after_bootmem();
 
+	totalram_pages_add(pkram_reserved_pages);
 	/*
 	 * Must be done after boot memory is put on freelist, because here we
 	 * might set fields in deferred struct pages that have not yet been
diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index b614e9059bba..53d5a1ec42ff 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -99,4 +99,12 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count);
 size_t pkram_read(struct pkram_access *pa, void *buf, size_t count);
 
+#ifdef CONFIG_PKRAM
+extern unsigned long pkram_reserved_pages;
+void pkram_reserve(void);
+#else
+#define pkram_reserved_pages 0UL
+static inline void pkram_reserve(void) { }
+#endif
+
 #endif /* _LINUX_PKRAM_H */
diff --git a/mm/pkram.c b/mm/pkram.c
index c649504fa1fa..b711f94dbef4 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -134,6 +134,8 @@ extern void pkram_find_preserved(unsigned long start, unsigned long end, void *p
 static LIST_HEAD(pkram_nodes);			/* linked through page::lru */
 static DEFINE_MUTEX(pkram_mutex);		/* serializes open/close */
 
+unsigned long __initdata pkram_reserved_pages;
+
 /*
  * The PKRAM super block pfn, see above.
  */
@@ -143,6 +145,59 @@ static int __init parse_pkram_sb_pfn(char *arg)
 }
 early_param("pkram", parse_pkram_sb_pfn);
 
+static void * __init pkram_map_meta(unsigned long pfn)
+{
+	if (pfn >= max_low_pfn)
+		return ERR_PTR(-EINVAL);
+	return pfn_to_kaddr(pfn);
+}
+
+int pkram_merge_with_reserved(void);
+/*
+ * Reserve pages that belong to preserved memory.
+ *
+ * This function should be called at boot time as early as possible to prevent
+ * preserved memory from being recycled.
+ */
+void __init pkram_reserve(void)
+{
+	int err = 0;
+
+	if (!pkram_sb_pfn)
+		return;
+
+	pr_info("PKRAM: Examining preserved memory...\n");
+
+	/* Verify that nothing else has reserved the pkram_sb page */
+	if (memblock_is_region_reserved(PFN_PHYS(pkram_sb_pfn), PAGE_SIZE)) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	pkram_sb = pkram_map_meta(pkram_sb_pfn);
+	if (IS_ERR(pkram_sb)) {
+		err = PTR_ERR(pkram_sb);
+		goto out;
+	}
+	/* An empty pkram_sb is not an error */
+	if (!pkram_sb->node_pfn) {
+		pkram_sb = NULL;
+		goto done;
+	}
+
+	err = pkram_merge_with_reserved();
+out:
+	if (err) {
+		pr_err("PKRAM: Reservation failed: %d\n", err);
+		WARN_ON(pkram_reserved_pages > 0);
+		pkram_sb = NULL;
+		return;
+	}
+
+done:
+	pr_info("PKRAM: %lu pages reserved\n", pkram_reserved_pages);
+}
+
 static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 {
 	struct page *page;
@@ -162,6 +217,7 @@ static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 
 static inline void pkram_free_page(void *addr)
 {
+	__ClearPageReserved(virt_to_page(addr));
 	pkram_remove_identity_map(virt_to_page(addr));
 	free_page((unsigned long)addr);
 }
@@ -193,13 +249,23 @@ static void pkram_truncate_link(struct pkram_link *link)
 {
 	struct page *page;
 	pkram_entry_t p;
-	int i;
+	int i, j, order;
 
 	for (i = 0; i < PKRAM_LINK_ENTRIES_MAX; i++) {
 		p = link->entry[i];
 		if (!p)
 			continue;
+		order = p & PKRAM_ENTRY_ORDER_MASK;
+		if (order >= MAX_ORDER) {
+			pr_err("PKRAM: attempted truncate of invalid page\n");
+			return;
+		}
 		page = pfn_to_page(PHYS_PFN(p));
+		for (j = 0; j < (1 << order); j++) {
+			struct page *pg = page + j;
+
+			__ClearPageReserved(pg);
+		}
 		pkram_remove_identity_map(page);
 		put_page(page);
 	}
@@ -680,7 +746,7 @@ static int __pkram_bytes_save_page(struct pkram_access *pa, struct page *page)
 static struct page *__pkram_prep_load_page(pkram_entry_t p)
 {
 	struct page *page;
-	int order;
+	int i, order;
 	short flags;
 
 	flags = (p >> PKRAM_ENTRY_FLAGS_SHIFT) & PKRAM_ENTRY_FLAGS_MASK;
@@ -690,9 +756,16 @@ static struct page *__pkram_prep_load_page(pkram_entry_t p)
 
 	page = pfn_to_page(PHYS_PFN(p));
 
-	if (!page_ref_freeze(pg, 1)) {
-		pr_err("PKRAM preserved page has unexpected inflated ref count\n");
-		goto out_error;
+	for (i = 0; i < (1 << order); i++) {
+		struct page *pg = page + i;
+		int was_rsvd;
+
+		was_rsvd = PageReserved(pg);
+		__ClearPageReserved(pg);
+		if ((was_rsvd || i == 0) && !page_ref_freeze(pg, 1)) {
+			pr_err("PKRAM preserved page has unexpected inflated ref count\n");
+			goto out_error;
+		}
 	}
 
 	if (order) {
@@ -1331,6 +1404,7 @@ int __init pkram_create_merged_reserved(struct memblock_type *new)
 	}
 
 	WARN_ON(cnt_a + cnt_b != k);
+	pkram_reserved_pages = nr_preserved;
 	new->cnt = cnt_a + cnt_b;
 	new->total_size = total_size;
 
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 12/21] PKRAM: free the preserved ranges list
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Free the pages used to pass the preserved ranges list to the newly
booted kernel.
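
The region list pages are chained through a next_pfn field, with zero
terminating the chain, so freeing them needs no other bookkeeping.  A
minimal sketch of that traversal (hypothetical type and free helper):

  struct meta_page {
          unsigned long next_pfn;         /* 0 terminates the chain */
  };

  static void free_chain(unsigned long pfn)
  {
          while (pfn) {
                  struct meta_page *mp = pfn_to_kaddr(pfn);

                  pfn = mp->next_pfn;     /* read before freeing */
                  free_meta_page(mp);     /* hypothetical helper */
          }
  }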

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/mm/init_64.c |  1 +
 include/linux/pkram.h |  2 ++
 mm/pkram.c            | 20 ++++++++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a46ffb434f39..9e68f07367fa 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1340,6 +1340,7 @@ void __init mem_init(void)
 	after_bootmem = 1;
 	x86_init.hyper.init_after_bootmem();
 
+	pkram_cleanup();
 	totalram_pages_add(pkram_reserved_pages);
 	/*
 	 * Must be done after boot memory is put on freelist, because here we
diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 53d5a1ec42ff..c909aa299fc4 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -102,9 +102,11 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 #ifdef CONFIG_PKRAM
 extern unsigned long pkram_reserved_pages;
 void pkram_reserve(void);
+void pkram_cleanup(void);
 #else
 #define pkram_reserved_pages 0UL
 static inline void pkram_reserve(void) { }
+static inline void pkram_cleanup(void) { }
 #endif
 
 #endif /* _LINUX_PKRAM_H */
diff --git a/mm/pkram.c b/mm/pkram.c
index b711f94dbef4..c63b27bb711b 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1456,3 +1456,23 @@ int __init pkram_merge_with_reserved(void)
 
 	return 0;
 }
+
+void __init pkram_cleanup(void)
+{
+	struct pkram_region_list *rl;
+	unsigned long next_pfn;
+
+	if (!pkram_sb || !pkram_reserved_pages)
+		return;
+
+	next_pfn = pkram_sb->region_list_pfn;
+
+	while (next_pfn) {
+		struct page *page = pfn_to_page(next_pfn);
+
+		rl = pfn_to_kaddr(next_pfn);
+		next_pfn = rl->next_pfn;
+		__free_pages_core(page, 0);
+		pkram_reserved_pages--;
+	}
+}
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 13/21] PKRAM: prevent inadvertent use of a stale superblock
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

When pages have been saved for preservation by the current kernel, set
a magic number in the super block so that the next kernel can validate
it before trusting its contents.
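
The magic value used below, 0x706B726D, is the ASCII string "pkrm"
packed into 32 bits, which makes a valid super block easy to recognize
in a raw memory dump:

  /* 'p' = 0x70, 'k' = 0x6B, 'r' = 0x72, 'm' = 0x6D */
  #define PKRAM_MAGIC	(('p' << 24) | ('k' << 16) | ('r' << 8) | 'm')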

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/pkram.c b/mm/pkram.c
index c63b27bb711b..befdffc76940 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -22,6 +22,7 @@
 
 #include "internal.h"
 
+#define PKRAM_MAGIC		0x706B726D
 
 /*
  * Represents a reference to a data page saved to PKRAM.
@@ -110,6 +111,8 @@ struct pkram_region_list {
  * The structure occupies a memory page.
  */
 struct pkram_super_block {
+	__u32	magic;
+
 	__u64	node_pfn;		/* first element of the node list */
 	__u64	region_list_pfn;
 	__u64	nr_regions;
@@ -179,6 +182,11 @@ void __init pkram_reserve(void)
 		err = PTR_ERR(pkram_sb);
 		goto out;
 	}
+	if (pkram_sb->magic != PKRAM_MAGIC) {
+		pr_err("PKRAM: invalid super block\n");
+		err = -EINVAL;
+		goto out;
+	}
 	/* An empty pkram_sb is not an error */
 	if (!pkram_sb->node_pfn) {
 		pkram_sb = NULL;
@@ -1012,6 +1020,7 @@ static void __pkram_reboot(void)
 	 */
 	memset(pkram_sb, 0, PAGE_SIZE);
 	if (!err && node_pfn) {
+		pkram_sb->magic = PKRAM_MAGIC;
 		pkram_sb->node_pfn = node_pfn;
 		pkram_sb->region_list_pfn = rl_pfn;
 		pkram_sb->nr_regions = nr_regions;
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 14/21] PKRAM: provide a way to ban pages from use by PKRAM
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Not all memory ranges can be used for saving preserved-over-kexec data.
For example, a kexec kernel may be loaded before pages are preserved.
The memory regions where the kexec segments will be copied to on reboot
must not contain preserved pages, or those pages will be clobbered.
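
Banned regions are kept sorted and non-overlapping so that a page can
be tested with a binary search over inclusive pfn ranges.  A standalone
sketch of that test, mirroring pkram_page_banned() below (names here
are hypothetical):

  struct region { unsigned long start, end; };    /* pfns, inclusive */

  /* Return 1 if [pfn, epfn] intersects any region in sorted v[0..n). */
  static int range_banned(const struct region *v, int n,
                          unsigned long pfn, unsigned long epfn)
  {
          int l = 0, r = n - 1;

          while (l <= r) {
                  int m = l + (r - l) / 2;

                  if (epfn < v[m].start)
                          r = m - 1;
                  else if (pfn > v[m].end)
                          l = m + 1;
                  else
                          return 1;       /* ranges overlap */
          }
          return 0;
  }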

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |   2 +
 mm/pkram.c            | 205 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 207 insertions(+)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index c909aa299fc4..29109e875604 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -103,10 +103,12 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 extern unsigned long pkram_reserved_pages;
 void pkram_reserve(void);
 void pkram_cleanup(void);
+void pkram_ban_region(unsigned long start, unsigned long end);
 #else
 #define pkram_reserved_pages 0UL
 static inline void pkram_reserve(void) { }
 static inline void pkram_cleanup(void) { }
+static inline void pkram_ban_region(unsigned long start, unsigned long end) { }
 #endif
 
 #endif /* _LINUX_PKRAM_H */
diff --git a/mm/pkram.c b/mm/pkram.c
index befdffc76940..cef75bd8ba99 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -140,6 +140,28 @@ extern void pkram_find_preserved(unsigned long start, unsigned long end, void *p
 unsigned long __initdata pkram_reserved_pages;
 
 /*
+ * For tracking a region of memory that PKRAM is not allowed to use.
+ */
+struct banned_region {
+	unsigned long start, end;		/* pfn, inclusive */
+};
+
+#define MAX_NR_BANNED		(32 + MAX_NUMNODES * 2)
+
+static unsigned int nr_banned;			/* number of banned regions */
+
+/* banned regions; arranged in ascending order, do not overlap */
+static struct banned_region banned[MAX_NR_BANNED];
+/*
+ * If a page allocated for PKRAM turns out to belong to a banned region,
+ * it is placed on the banned_pages list so subsequent allocation attempts
+ * do not encounter it again. The list is shrunk when system memory is low.
+ */
+static LIST_HEAD(banned_pages);			/* linked through page::lru */
+static DEFINE_SPINLOCK(banned_pages_lock);
+static unsigned long nr_banned_pages;
+
+/*
  * The PKRAM super block pfn, see above.
  */
 static int __init parse_pkram_sb_pfn(char *arg)
@@ -206,12 +228,116 @@ void __init pkram_reserve(void)
 	pr_info("PKRAM: %lu pages reserved\n", pkram_reserved_pages);
 }
 
+/*
+ * Ban pfn range [start..end] (inclusive) from use in PKRAM.
+ */
+void pkram_ban_region(unsigned long start, unsigned long end)
+{
+	int i, merged = -1;
+
+	/* first try to merge the region with an existing one */
+	for (i = nr_banned - 1; i >= 0 && start <= banned[i].end + 1; i--) {
+		if (end + 1 >= banned[i].start) {
+			start = min(banned[i].start, start);
+			end = max(banned[i].end, end);
+			if (merged < 0)
+				merged = i;
+		} else
+			/*
+			 * Regions are arranged in ascending order and do not
+			 * intersect so the merged region cannot jump over its
+			 * predecessors.
+			 */
+			BUG_ON(merged >= 0);
+	}
+
+	i++;
+
+	if (merged >= 0) {
+		banned[i].start = start;
+		banned[i].end = end;
+		/* shift if merged with more than one region */
+		memmove(banned + i + 1, banned + merged + 1,
+			sizeof(*banned) * (nr_banned - merged - 1));
+		nr_banned -= merged - i;
+		return;
+	}
+
+	/*
+	 * The region does not intersect with an existing one;
+	 * try to create a new one.
+	 */
+	if (nr_banned == MAX_NR_BANNED) {
+		pr_err("PKRAM: Failed to ban %lu-%lu: Too many banned regions\n",
+			start, end);
+		return;
+	}
+
+	memmove(banned + i + 1, banned + i,
+		sizeof(*banned) * (nr_banned - i));
+	banned[i].start = start;
+	banned[i].end = end;
+	nr_banned++;
+}
+
+static void pkram_show_banned(void)
+{
+	int i;
+	unsigned long n, total = 0;
+
+	pr_info("PKRAM: banned regions:\n");
+	for (i = 0; i < nr_banned; i++) {
+		n = banned[i].end - banned[i].start + 1;
+		pr_info("%4d: [%08lx - %08lx] %ld pages\n",
+			i, banned[i].start, banned[i].end, n);
+		total += n;
+	}
+	pr_info("Total banned: %ld pages in %d regions\n",
+		total, nr_banned);
+}
+
+/*
+ * Returns true if the page may not be used for storing preserved data.
+ */
+static bool pkram_page_banned(struct page *page)
+{
+	unsigned long epfn, pfn = page_to_pfn(page);
+	int l = 0, r = nr_banned - 1, m;
+
+	epfn = pfn + compound_nr(page) - 1;
+
+	/* do binary search */
+	while (l <= r) {
+		m = (l + r) / 2;
+		if (epfn < banned[m].start)
+			r = m - 1;
+		else if (pfn > banned[m].end)
+			l = m + 1;
+		else
+			return true;
+	}
+	return false;
+}
+
 static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 {
 	struct page *page;
+	LIST_HEAD(list);
+	unsigned long len = 0;
 	int err;
 
 	page = alloc_page(gfp_mask);
+	while (page && pkram_page_banned(page)) {
+		len++;
+		list_add(&page->lru, &list);
+		page = alloc_page(gfp_mask);
+	}
+	if (len > 0) {
+		spin_lock(&banned_pages_lock);
+		nr_banned_pages += len;
+		list_splice(&list, &banned_pages);
+		spin_unlock(&banned_pages_lock);
+	}
 	if (page) {
 		err = pkram_add_identity_map(page);
 		if (err) {
@@ -230,6 +356,53 @@ static inline void pkram_free_page(void *addr)
 	free_page((unsigned long)addr);
 }
 
+static void __banned_pages_shrink(unsigned long nr_to_scan)
+{
+	struct page *page;
+
+	if (!nr_to_scan)
+		return;
+
+	while (nr_banned_pages > 0) {
+		BUG_ON(list_empty(&banned_pages));
+		page = list_first_entry(&banned_pages, struct page, lru);
+		list_del(&page->lru);
+		__free_page(page);
+		nr_banned_pages--;
+		nr_to_scan--;
+		if (!nr_to_scan)
+			break;
+	}
+}
+
+static unsigned long
+banned_pages_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	return nr_banned_pages;
+}
+
+static unsigned long
+banned_pages_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	unsigned long nr_left = nr_banned_pages;
+
+	if (!sc->nr_to_scan || !nr_left)
+		return nr_left;
+
+	spin_lock(&banned_pages_lock);
+	__banned_pages_shrink(sc->nr_to_scan);
+	nr_left = nr_banned_pages;
+	spin_unlock(&banned_pages_lock);
+
+	return nr_left;
+}
+
+static struct shrinker banned_pages_shrinker = {
+	.count_objects = banned_pages_count,
+	.scan_objects = banned_pages_scan,
+	.seeks = DEFAULT_SEEKS,
+};
+
 static inline void pkram_insert_node(struct pkram_node *node)
 {
 	list_add(&virt_to_page(node)->lru, &pkram_nodes);
@@ -705,6 +878,31 @@ static int __pkram_save_page(struct pkram_access *pa, struct page *page,
 	return 0;
 }
 
+static int __pkram_save_page_copy(struct pkram_access *pa, struct page *page)
+{
+	int nr_pages = compound_nr(page);
+	pgoff_t index = page->index;
+	int i, err;
+
+	for (i = 0; i < nr_pages; i++, index++) {
+		struct page *p = page + i;
+		struct page *new;
+
+		new = pkram_alloc_page(pa->ps->gfp_mask);
+		if (!new)
+			return -ENOMEM;
+
+		copy_highpage(new, p);
+		err = __pkram_save_page(pa, new, index);
+		if (err) {
+			pkram_free_page(page_address(new));
+			return err;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * Save folio @folio to the preserved memory node and object associated
  * with pkram stream access @pa. The stream must have been initialized with
@@ -728,6 +926,10 @@ int pkram_save_folio(struct pkram_access *pa, struct folio *folio)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
+	/* if page is banned, relocate it */
+	if (pkram_page_banned(page))
+		return __pkram_save_page_copy(pa, page);
+
 	err = __pkram_save_page(pa, page, page->index);
 	if (!err)
 		err = pkram_add_identity_map(page);
@@ -987,6 +1189,7 @@ static void __pkram_reboot(void)
 	int err = 0;
 
 	if (!list_empty(&pkram_nodes)) {
+		pkram_show_banned();
 		err = pkram_add_identity_map(virt_to_page(pkram_sb));
 		if (err) {
 			pr_err("PKRAM: failed to add super block to pagetable\n");
@@ -1073,6 +1276,7 @@ static int __init pkram_init_sb(void)
 		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 		if (!page) {
 			pr_err("PKRAM: Failed to allocate super block\n");
+			__banned_pages_shrink(ULONG_MAX);
 			return 0;
 		}
 		pkram_sb = page_address(page);
@@ -1095,6 +1299,7 @@ static int __init pkram_init(void)
 {
 	if (pkram_init_sb()) {
 		register_reboot_notifier(&pkram_reboot_notifier);
+		register_shrinker(&banned_pages_shrinker, "pkram");
 		sysfs_update_group(kernel_kobj, &pkram_attr_group);
 	}
 	return 0;
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 15/21] kexec: PKRAM: prevent kexec clobbering preserved pages in some cases
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

When loading a kernel for kexec, dynamically add the ranges where the
kexec segments will be copied to on reboot to the list of physical
ranges that must not be used for storing preserved pages. This ensures
that no page preserved after the new kernel has been loaded will reside
in those ranges on reboot.

Not yet handled is the case where pages have been preserved before a
kexec kernel is loaded.  This will be covered by a later patch.
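
Note how the byte range of each segment is converted to an inclusive
pfn range: the start is rounded down and the end rounded up so that
partially covered pages are banned too.  A small worked example
(assuming 4 KiB pages; PFN_DOWN()/PFN_UP() behave as in the kernel):

  /*
   * mem = 0x1000800, memsz = 0x1000:
   *   PFN_DOWN(0x1000800)              = 0x1000  (first touched page)
   *   PFN_UP(0x1000800 + 0x1000) - 1   = 0x1001  (last touched page)
   *
   * pkram_ban_region(0x1000, 0x1001) therefore covers both pages
   * the segment straddles.
   */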

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 kernel/kexec.c      |  9 +++++++++
 kernel/kexec_file.c | 10 ++++++++++
 2 files changed, 19 insertions(+)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index 92d301f98776..cd871fc07c65 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/vmalloc.h>
 #include <linux/slab.h>
+#include <linux/pkram.h>
 
 #include "kexec_internal.h"
 
@@ -153,6 +154,14 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 	if (ret)
 		goto out;
 
+	for (i = 0; i < nr_segments; i++) {
+		unsigned long mem = image->segment[i].mem;
+		size_t memsz = image->segment[i].memsz;
+
+		if (memsz)
+			pkram_ban_region(PFN_DOWN(mem), PFN_UP(mem + memsz) - 1);
+	}
+
 	/* Install the new kernel and uninstall the old */
 	image = xchg(dest_image, image);
 
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index f1a0e4e3fb5c..ca2aa2d61955 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -27,6 +27,8 @@
 #include <linux/kernel_read_file.h>
 #include <linux/syscalls.h>
 #include <linux/vmalloc.h>
+#include <linux/pkram.h>
+
 #include "kexec_internal.h"
 
 #ifdef CONFIG_KEXEC_SIG
@@ -403,6 +405,14 @@ static int kexec_image_verify_sig(struct kimage *image, void *buf,
 	if (ret)
 		goto out;
 
+	for (i = 0; i < image->nr_segments; i++) {
+		unsigned long mem = image->segment[i].mem;
+		size_t memsz = image->segment[i].memsz;
+
+		if (memsz)
+			pkram_ban_region(PFN_DOWN(mem), PFN_UP(mem + memsz) - 1);
+	}
+
 	/*
 	 * Free up any temporary buffers allocated which are not needed
 	 * after image has been loaded
-- 
1.9.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v3 16/21] PKRAM: provide a way to check if a memory range has preserved pages
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

When a kernel is loaded for kexec, the address ranges where the kexec
segments will be copied to may conflict with pages already set to be
preserved. Provide a way to determine whether preserved pages exist in
a specified range.
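
pkram_has_preserved_pages() is a thin wrapper over pkram_find_preserved(),
which invokes a callback for each preserved range intersecting
[start, end) and stops the walk as soon as the callback returns nonzero.
A hypothetical caller on the kexec side might use it to reject a
conflicting destination range (a sketch only; the actual integration
comes in a later patch):

  /* Hypothetical: refuse a segment destination that would be clobbered. */
  if (pkram_has_preserved_pages(mem, mem + memsz))
          return -EADDRNOTAVAIL;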

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  2 ++
 mm/pkram.c            | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 29109e875604..bec9ae75e802 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -104,11 +104,13 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 void pkram_reserve(void);
 void pkram_cleanup(void);
 void pkram_ban_region(unsigned long start, unsigned long end);
+int pkram_has_preserved_pages(unsigned long start, unsigned long end);
 #else
 #define pkram_reserved_pages 0UL
 static inline void pkram_reserve(void) { }
 static inline void pkram_cleanup(void) { }
 static inline void pkram_ban_region(unsigned long start, unsigned long end) { }
+static inline int pkram_has_preserved_pages(unsigned long start, unsigned long end) { return 0; }
 #endif
 
 #endif /* _LINUX_PKRAM_H */
diff --git a/mm/pkram.c b/mm/pkram.c
index cef75bd8ba99..474fb6fc8355 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1690,3 +1690,23 @@ void __init pkram_cleanup(void)
 		pkram_reserved_pages--;
 	}
 }
+
+static int has_preserved_pages_cb(unsigned long base, unsigned long size, void *private)
+{
+	int *has_preserved = (int *)private;
+
+	*has_preserved = 1;
+	return 1;
+}
+
+/*
+ * Check whether the memory range [start, end) contains preserved pages.
+ */
+int pkram_has_preserved_pages(unsigned long start, unsigned long end)
+{
+	int has_preserved = 0;
+
+	pkram_find_preserved(start, end, &has_preserved, has_preserved_pages_cb);
+
+	return has_preserved;
+}
-- 
1.9.4


* [RFC v3 17/21] kexec: PKRAM: avoid clobbering already preserved pages
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Ensure destination ranges of the kexec segments do not overlap
with any kernel pages marked to be preserved across kexec.

For kexec_load, return EADDRNOTAVAIL if overlap is detected.

For kexec_file_load, skip ranges containing preserved pages when
searching for available ranges to use.
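
From userspace the kexec_load failure shows up as an errno; a sketch of how
a (hypothetical) loader might react, assuming <sys/syscall.h> and <errno.h>:

	if (syscall(SYS_kexec_load, entry, nr_segments, segments, flags) < 0 &&
	    errno == EADDRNOTAVAIL) {
		/* a segment destination overlaps preserved pages;
		 * retry with different load addresses */
	}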

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 kernel/kexec_core.c | 3 +++
 kernel/kexec_file.c | 5 +++++
 2 files changed, 8 insertions(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 3d578c6fefee..e0d52f70cb48 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -40,6 +40,7 @@
 #include <linux/hugetlb.h>
 #include <linux/objtool.h>
 #include <linux/kmsg_dump.h>
+#include <linux/pkram.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -178,6 +179,8 @@ int sanity_check_segment_list(struct kimage *image)
 			return -EADDRNOTAVAIL;
 		if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
 			return -EADDRNOTAVAIL;
+		if (pkram_has_preserved_pages(mstart, mend))
+			return -EADDRNOTAVAIL;
 	}
 
 	/* Verify our destination addresses do not overlap.
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index ca2aa2d61955..8bca01060d32 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -490,6 +490,11 @@ static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
 			continue;
 		}
 
+		if (pkram_has_preserved_pages(temp_start, temp_end + 1)) {
+			temp_start = temp_start - PAGE_SIZE;
+			continue;
+		}
+
 		/* We found a suitable memory range */
 		break;
 	} while (1);
-- 
1.9.4


* [RFC v3 18/21] mm: PKRAM: allow preserved memory to be freed from userspace
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

To free all space utilized for preserved memory, one can write 0 to
/sys/kernel/pkram. This will destroy all PKRAM nodes that are not
currently being read or written.
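
For example (a sketch; equivalent to running: echo 0 > /sys/kernel/pkram):

	int fd = open("/sys/kernel/pkram", O_WRONLY);	/* needs <fcntl.h> */

	if (fd >= 0) {
		if (write(fd, "0", 1) != 1)	/* invokes pkram_truncate() */
			perror("pkram discard");
		close(fd);
	}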

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index 474fb6fc8355..d404e415f3cb 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -493,6 +493,32 @@ static void pkram_truncate_node(struct pkram_node *node)
 	node->obj_pfn = 0;
 }
 
+/*
+ * Free all nodes that are not under operation.
+ */
+static void pkram_truncate(void)
+{
+	struct page *page, *tmp;
+	struct pkram_node *node;
+	LIST_HEAD(dispose);
+
+	mutex_lock(&pkram_mutex);
+	list_for_each_entry_safe(page, tmp, &pkram_nodes, lru) {
+		node = page_address(page);
+		if (!(node->flags & PKRAM_ACCMODE_MASK))
+			list_move(&page->lru, &dispose);
+	}
+	mutex_unlock(&pkram_mutex);
+
+	while (!list_empty(&dispose)) {
+		page = list_first_entry(&dispose, struct page, lru);
+		list_del(&page->lru);
+		node = page_address(page);
+		pkram_truncate_node(node);
+		pkram_free_page(node);
+	}
+}
+
 static void pkram_add_link(struct pkram_link *link, struct pkram_data_stream *pds)
 {
 	__u64 link_pfn = page_to_pfn(virt_to_page(link));
@@ -1252,8 +1278,19 @@ static ssize_t show_pkram_sb_pfn(struct kobject *kobj,
 	return sprintf(buf, "%lx\n", pfn);
 }
 
+static ssize_t store_pkram_sb_pfn(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	int val;
+
+	if (kstrtoint(buf, 0, &val) || val)
+		return -EINVAL;
+	pkram_truncate();
+	return count;
+}
+
 static struct kobj_attribute pkram_sb_pfn_attr =
-	__ATTR(pkram, 0444, show_pkram_sb_pfn, NULL);
+	__ATTR(pkram, 0644, show_pkram_sb_pfn, store_pkram_sb_pfn);
 
 static struct attribute *pkram_attrs[] = {
 	&pkram_sb_pfn_attr.attr,
-- 
1.9.4


* [RFC v3 19/21] PKRAM: disable feature when running the kdump kernel
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

The kdump kernel should not preserve or restore pages.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index d404e415f3cb..f38236e5d836 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/crash_dump.h>
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/highmem.h>
@@ -188,7 +189,7 @@ void __init pkram_reserve(void)
 {
 	int err = 0;
 
-	if (!pkram_sb_pfn)
+	if (!pkram_sb_pfn || is_kdump_kernel())
 		return;
 
 	pr_info("PKRAM: Examining preserved memory...\n");
@@ -285,6 +286,9 @@ static void pkram_show_banned(void)
 	int i;
 	unsigned long n, total = 0;
 
+	if (is_kdump_kernel())
+		return;
+
 	pr_info("PKRAM: banned regions:\n");
 	for (i = 0; i < nr_banned; i++) {
 		n = banned[i].end - banned[i].start + 1;
@@ -1334,7 +1338,7 @@ static int __init pkram_init_sb(void)
 
 static int __init pkram_init(void)
 {
-	if (pkram_init_sb()) {
+	if (!is_kdump_kernel() && pkram_init_sb()) {
 		register_reboot_notifier(&pkram_reboot_notifier);
 		register_shrinker(&banned_pages_shrinker, "pkram");
 		sysfs_update_group(kernel_kobj, &pkram_attr_group);
-- 
1.9.4


* [RFC v3 20/21] x86/KASLR: PKRAM: support physical kaslr
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

Avoid regions of memory that contain preserved pages when computing
slots used to select where to put the decompressed kernel.
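
For a sense of how much data the decompressor walks: on x86-64 with 4 KiB
pages, sizeof(struct pkram_region_list) is 16 bytes (two __u64) and each
struct pkram_region is 16 bytes (two phys_addr_t), so
PKRAM_REGIONS_LIST_MAX works out to (4096 - 16) / 16 = 255 preserved
regions per region-list page.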

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/boot/compressed/Makefile |   3 ++
 arch/x86/boot/compressed/kaslr.c  |  10 +++-
 arch/x86/boot/compressed/misc.h   |  10 ++++
 arch/x86/boot/compressed/pkram.c  | 110 ++++++++++++++++++++++++++++++++++++++
 mm/pkram.c                        |   2 +-
 5 files changed, 132 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/boot/compressed/pkram.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6b6cfe607bdb..d9a5af94a797 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -103,6 +103,9 @@ ifdef CONFIG_X86_64
 	vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/mem_encrypt.o
 	vmlinux-objs-y += $(obj)/pgtable_64.o
 	vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev.o
+ifdef CONFIG_RANDOMIZE_BASE
+	vmlinux-objs-$(CONFIG_PKRAM) += $(obj)/pkram.o
+endif
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 454757fbdfe5..047b8b9a0799 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -436,6 +436,7 @@ static bool mem_avoid_overlap(struct mem_vector *img,
 	struct setup_data *ptr;
 	u64 earliest = img->start + img->size;
 	bool is_overlapping = false;
+	struct mem_vector avoid;
 
 	for (i = 0; i < MEM_AVOID_MAX; i++) {
 		if (mem_overlaps(img, &mem_avoid[i]) &&
@@ -449,8 +450,6 @@ static bool mem_avoid_overlap(struct mem_vector *img,
 	/* Avoid all entries in the setup_data linked list. */
 	ptr = (struct setup_data *)(unsigned long)boot_params->hdr.setup_data;
 	while (ptr) {
-		struct mem_vector avoid;
-
 		avoid.start = (unsigned long)ptr;
 		avoid.size = sizeof(*ptr) + ptr->len;
 
@@ -475,6 +474,12 @@ static bool mem_avoid_overlap(struct mem_vector *img,
 		ptr = (struct setup_data *)(unsigned long)ptr->next;
 	}
 
+	if (pkram_has_overlap(img, &avoid) && (avoid.start < earliest)) {
+		*overlap = avoid;
+		earliest = overlap->start;
+		is_overlapping = true;
+	}
+
 	return is_overlapping;
 }
 
@@ -836,6 +841,7 @@ void choose_random_location(unsigned long input,
 		return;
 	}
 
+	pkram_init();
 	boot_params->hdr.loadflags |= KASLR_FLAG;
 
 	if (IS_ENABLED(CONFIG_X86_32))
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 20118fb7c53b..01ff5e507064 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -124,6 +124,16 @@ static inline void console_init(void)
 { }
 #endif
 
+#ifdef CONFIG_PKRAM
+void pkram_init(void);
+int pkram_has_overlap(struct mem_vector *entry, struct mem_vector *overlap);
+#else
+static inline void pkram_init(void) { }
+static inline int pkram_has_overlap(struct mem_vector *entry,
+				    struct mem_vector *overlap)
+{ return 0; }
+#endif
+
 #ifdef CONFIG_AMD_MEM_ENCRYPT
 void sev_enable(struct boot_params *bp);
 void snp_check_features(void);
diff --git a/arch/x86/boot/compressed/pkram.c b/arch/x86/boot/compressed/pkram.c
new file mode 100644
index 000000000000..19267ca2ce8e
--- /dev/null
+++ b/arch/x86/boot/compressed/pkram.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "misc.h"
+
+#define PKRAM_MAGIC		0x706B726D
+
+struct pkram_super_block {
+	__u32	magic;
+
+	__u64	node_pfn;
+	__u64	region_list_pfn;
+	__u64	nr_regions;
+};
+
+struct pkram_region {
+	phys_addr_t base;
+	phys_addr_t size;
+};
+
+struct pkram_region_list {
+	__u64	prev_pfn;
+	__u64	next_pfn;
+
+	struct pkram_region regions[0];
+};
+
+#define PKRAM_REGIONS_LIST_MAX \
+	((PAGE_SIZE-sizeof(struct pkram_region_list))/sizeof(struct pkram_region))
+
+static u64 pkram_sb_pfn;
+static struct pkram_super_block *pkram_sb;
+
+void pkram_init(void)
+{
+	struct pkram_super_block *sb;
+	char arg[32];
+
+	if (cmdline_find_option("pkram", arg, sizeof(arg)) > 0) {
+		if (kstrtoull(arg, 16, &pkram_sb_pfn) != 0)
+			return;
+	} else
+		return;
+
+	sb = (struct pkram_super_block *)(pkram_sb_pfn << PAGE_SHIFT);
+	if (sb->magic != PKRAM_MAGIC) {
+		debug_putstr("PKRAM: invalid super block\n");
+		return;
+	}
+
+	pkram_sb = sb;
+}
+
+static struct pkram_region *pkram_first_region(struct pkram_super_block *sb,
+					       struct pkram_region_list **rlp, int *idx)
+{
+	if (!sb || !sb->region_list_pfn)
+		return NULL;
+
+	*rlp = (struct pkram_region_list *)(sb->region_list_pfn << PAGE_SHIFT);
+	*idx = 0;
+
+	return &(*rlp)->regions[0];
+}
+
+static struct pkram_region *pkram_next_region(struct pkram_region_list **rlp, int *idx)
+{
+	struct pkram_region_list *rl = *rlp;
+	int i = *idx;
+
+	i++;
+	if (i >= PKRAM_REGIONS_LIST_MAX) {
+		if (!rl->next_pfn) {
+			debug_putstr("PKRAM: no more pkram_region_list pages\n");
+			return NULL;
+		}
+		rl = (struct pkram_region_list *)(rl->next_pfn << PAGE_SHIFT);
+		*rlp = rl;
+		i = 0;
+	}
+	*idx = i;
+
+	if (rl->regions[i].size == 0)
+		return NULL;
+
+	return &rl->regions[i];
+}
+
+int pkram_has_overlap(struct mem_vector *entry, struct mem_vector *overlap)
+{
+	struct pkram_region_list *rl;
+	struct pkram_region *r;
+	int idx;
+
+	r = pkram_first_region(pkram_sb, &rl, &idx);
+
+	while (r) {
+		if (r->base + r->size <= entry->start) {
+			r = pkram_next_region(&rl, &idx);
+			continue;
+		}
+		if (r->base >= entry->start + entry->size)
+			return 0;
+
+		overlap->start = r->base;
+		overlap->size = r->size;
+		return 1;
+	}
+
+	return 0;
+}
diff --git a/mm/pkram.c b/mm/pkram.c
index f38236e5d836..a3e045b8dfe4 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -96,7 +96,7 @@ struct pkram_region_list {
 	__u64	prev_pfn;
 	__u64	next_pfn;
 
-	struct pkram_region regions[0];
+	struct pkram_region regions[];
 };
 
 #define PKRAM_REGIONS_LIST_MAX \
-- 
1.9.4


* [RFC v3 21/21] x86/boot/compressed/64: use 1GB pages for mappings
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-04-27  0:08   ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27  0:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, hpa, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

pkram kaslr code can incur multiple page faults when it walks its
preserved ranges list called via mem_avoid_overlap().  The multiple
faults can easily end up using up the small number of pages available
to be allocated for page table pages.

This patch hacks things so that mappings are 1GB which results in the need
for far fewer page table pages.  As is this breaks AMD SEV-ES which expects
the mappings to be 2M.  This could possibly be fixed by updating split
code to split a 1GB page if there aren't any other issues with using 1GB
mappings.
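
For a rough sense of the savings (x86-64, 4-level paging): a page table
page used as a PMD maps 1 GiB via 512 2M entries, so faulting in scattered
preserved ranges costs roughly one new PMD page per distinct GiB touched,
whereas with 1GB mappings a single PUD page covers 512 GiB.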

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/boot/compressed/ident_map_64.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index 321a5011042d..1e02cf6dda3c 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -95,8 +95,8 @@ void kernel_add_identity_map(unsigned long start, unsigned long end)
 	int ret;
 
 	/* Align boundary to 2M. */
-	start = round_down(start, PMD_SIZE);
-	end = round_up(end, PMD_SIZE);
+	start = round_down(start, PUD_SIZE);
+	end = round_up(end, PUD_SIZE);
 	if (start >= end)
 		return;
 
@@ -120,6 +120,7 @@ void initialize_identity_maps(void *rmode)
 	mapping_info.context = &pgt_data;
 	mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;
 	mapping_info.kernpg_flag = _KERNPG_TABLE;
+	mapping_info.direct_gbpages = true;
 
 	/*
 	 * It should be impossible for this not to already be true,
@@ -365,8 +366,8 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
 
 	ghcb_fault = sev_es_check_ghcb_fault(address);
 
-	address   &= PMD_MASK;
-	end        = address + PMD_SIZE;
+	address   &= PUD_MASK;
+	end        = address + PUD_SIZE;
 
 	/*
 	 * Check for unexpected error codes. Unexpected are:
-- 
1.9.4


* Re: [RFC v3 21/21] x86/boot/compressed/64: use 1GB pages for mappings
  2023-04-27  0:08   ` Anthony Yznaga
@ 2023-04-27 18:40     ` H. Peter Anvin
  -1 siblings, 0 replies; 64+ messages in thread
From: H. Peter Anvin @ 2023-04-27 18:40 UTC (permalink / raw)
  To: Anthony Yznaga, linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec

On April 26, 2023 5:08:57 PM PDT, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>pkram kaslr code can incur multiple page faults when it walks its
>preserved ranges list called via mem_avoid_overlap().  The multiple
>faults can easily end up using up the small number of pages available
>to be allocated for page table pages.
>
>This patch hacks things so that mappings are 1GB which results in the need
>for far fewer page table pages.  As is this breaks AMD SEV-ES which expects
>the mappings to be 2M.  This could possibly be fixed by updating split
>code to split a 1GB page if there aren't any other issues with using 1GB
>mappings.
>
>Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>---
> arch/x86/boot/compressed/ident_map_64.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
>diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
>index 321a5011042d..1e02cf6dda3c 100644
>--- a/arch/x86/boot/compressed/ident_map_64.c
>+++ b/arch/x86/boot/compressed/ident_map_64.c
>@@ -95,8 +95,8 @@ void kernel_add_identity_map(unsigned long start, unsigned long end)
> 	int ret;
> 
> 	/* Align boundary to 2M. */
>-	start = round_down(start, PMD_SIZE);
>-	end = round_up(end, PMD_SIZE);
>+	start = round_down(start, PUD_SIZE);
>+	end = round_up(end, PUD_SIZE);
> 	if (start >= end)
> 		return;
> 
>@@ -120,6 +120,7 @@ void initialize_identity_maps(void *rmode)
> 	mapping_info.context = &pgt_data;
> 	mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;
> 	mapping_info.kernpg_flag = _KERNPG_TABLE;
>+	mapping_info.direct_gbpages = true;
> 
> 	/*
> 	 * It should be impossible for this not to already be true,
>@@ -365,8 +366,8 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
> 
> 	ghcb_fault = sev_es_check_ghcb_fault(address);
> 
>-	address   &= PMD_MASK;
>-	end        = address + PMD_SIZE;
>+	address   &= PUD_MASK;
>+	end        = address + PUD_SIZE;
> 
> 	/*
> 	 * Check for unexpected error codes. Unexpected are:

Strong NAK: 1G pages are not supported by all 64-bit CPUs, *and* by your own admission breaks things ...

* Re: [RFC v3 21/21] x86/boot/compressed/64: use 1GB pages for mappings
  2023-04-27 18:40     ` H. Peter Anvin
@ 2023-04-27 22:38       ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-04-27 22:38 UTC (permalink / raw)
  To: H. Peter Anvin, linux-mm, linux-kernel
  Cc: tglx, mingo, bp, x86, dave.hansen, luto, peterz, rppt, akpm,
	ebiederm, keescook, graf, jason.zeng, lei.l.li, steven.sistare,
	fam.zheng, mgalaxy, kexec


On 4/27/23 11:40 AM, H. Peter Anvin wrote:
> On April 26, 2023 5:08:57 PM PDT, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>> pkram kaslr code can incur multiple page faults when it walks its
>> preserved ranges list called via mem_avoid_overlap().  The multiple
>> faults can easily end up using up the small number of pages available
>> to be allocated for page table pages.
>>
>> This patch hacks things so that mappings are 1GB which results in the need
>> for far fewer page table pages.  As is this breaks AMD SEV-ES which expects
>> the mappings to be 2M.  This could possibly be fixed by updating split
>> code to split a 1GB page if there aren't any other issues with using 1GB
>> mappings.
>>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>> arch/x86/boot/compressed/ident_map_64.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
>> index 321a5011042d..1e02cf6dda3c 100644
>> --- a/arch/x86/boot/compressed/ident_map_64.c
>> +++ b/arch/x86/boot/compressed/ident_map_64.c
>> @@ -95,8 +95,8 @@ void kernel_add_identity_map(unsigned long start, unsigned long end)
>> 	int ret;
>>
>> 	/* Align boundary to 2M. */
>> -	start = round_down(start, PMD_SIZE);
>> -	end = round_up(end, PMD_SIZE);
>> +	start = round_down(start, PUD_SIZE);
>> +	end = round_up(end, PUD_SIZE);
>> 	if (start >= end)
>> 		return;
>>
>> @@ -120,6 +120,7 @@ void initialize_identity_maps(void *rmode)
>> 	mapping_info.context = &pgt_data;
>> 	mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;
>> 	mapping_info.kernpg_flag = _KERNPG_TABLE;
>> +	mapping_info.direct_gbpages = true;
>>
>> 	/*
>> 	 * It should be impossible for this not to already be true,
>> @@ -365,8 +366,8 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
>>
>> 	ghcb_fault = sev_es_check_ghcb_fault(address);
>>
>> -	address   &= PMD_MASK;
>> -	end        = address + PMD_SIZE;
>> +	address   &= PUD_MASK;
>> +	end        = address + PUD_SIZE;
>>
>> 	/*
>> 	 * Check for unexpected error codes. Unexpected are:
> Strong NAK: 1G pages are not supported by all 64-bit CPUs, *and* by your own admission breaks things ...
>
I strongly suspected that this was a no-go. Thank you for taking a 
look and confirming it. I'll look into alternative solutions.


Anthony


* Re: [RFC v3 00/21] Preserved-over-Kexec RAM
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-05-26 13:57   ` Gowans, James
  -1 siblings, 0 replies; 64+ messages in thread
From: Gowans, James @ 2023-05-26 13:57 UTC (permalink / raw)
  To: linux-mm, anthony.yznaga, linux-kernel
  Cc: kexec, jason.zeng, keescook, lei.l.li, luto, rppt, dave.hansen,
	steven.sistare, Graf (AWS),
	Alexander, akpm, mgalaxy, mingo, fam.zheng, Woodhouse, David,
	tglx, yuleixzhang, ebiederm, hpa, peterz, bp, x86

On Wed, 2023-04-26 at 17:08 -0700, Anthony Yznaga wrote:
> Sending out this RFC in part to guage community interest.
> This patchset implements preserved-over-kexec memory storage or PKRAM as a
> method for saving memory pages of the currently executing kernel so that
> they may be restored after kexec into a new kernel. The patches are adapted
> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
> introduce the PKRAM kernel API.
> 
> One use case for PKRAM is preserving guest memory and/or auxillary
> supporting data (e.g. iommu data) across kexec to support reboot of the
> host with minimal disruption to the guest.

Hi Anthony,

Thanks for re-posting this - I've been wanting to rekindle the discussion
on preserving memory across kexec for a while now.

There are a few aspects at play in this space of memory management
designed specifically for the virtualisation and live update (kexec) use-
case which I think we should consider:

1. Preserving userspace-accessible memory across kexec: this is what pkram
addresses.

2. Preserving kernel state: This would include memory required for kexec
with DMA passthrough devices, like IOMMU root page and page tables, DMA-
able buffers for drivers, etc. Also certain structures for improved kernel
boot performance after kexec, like a PCI device cache, clock LPJ and
possible others, sort of what Xen breadcrumbs [0] achieves. The pkram RFC
indicates that this should be possible, though IMO this could be more
straight forward to do with a new filesystem with first-class support for
kernel persistence via something like inode types for kernel data.

3. Ensuring huge/gigantic memory allocations: to improve the TLB perf of
2-stage translations it's beneficial to allocate guest memory in large
contiguous blocks, preferably PUD-level blocks for multi-GiB guests. If
the buddy allocator is used this may be a challenge both from an
implementation and a fragmentation perspective, and it may be desirable to
have stronger guarantees about allocation sizes.
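
(For example, backing a 256 GiB guest with 1 GiB mappings needs only 256
second-stage leaf entries, versus roughly 64 million entries with 4 KiB
pages.)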

4. Removing struct page overhead: When doing the huge/gigantic
allocations, in general it won't be necessary to have 4 KiB struct
pages. This is something dmemfs [1, 2] tries to achieve by using a
large chunk of reserved memory and managing that by a new filesystem.
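
(For scale: with the common 64-byte struct page and 4 KiB base pages the
metadata overhead is 64 / 4096, i.e. about 1.6% of memory, or roughly
16 GiB of struct pages for a 1 TiB carve-out.)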

5. More "advanced" memory management APIs/ioctls for virtualisation: Being
able to support things like DMA-driven post-copy live migration, memory
oversubscription, carving out chunks of memory from a VM to launch side-
car VMs, more fine-grain control of IOMMU or MMU permissions, etc. This
may be easier to achieve with a new filesystem, rather than coupling to
tmpfs semantics and ioctls.

Overall, with the above in mind, my take is that we may have a smoother
path to implement a more comprehensive solution by going the route of a
new purpose-built file system on top of reserved memory. Sort of like
dmemfs with persistence and specifically support for kernel persistence.
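
(One existing mechanism that could provide such a carve-out, purely as an
illustration, is the memmap=nn$ss kernel parameter, e.g. memmap=32G$16G to
reserve 32 GiB starting at the 16 GiB boundary for the new filesystem to
manage.)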

Does my take here make sense?

I'm hoping to put together an RFC for something like the above (dmemfs
with persistence) soon, focusing on how the IOMMU persistence will work.
This is an important differentiating factor to cover in the RFC, IMO.

> PKRAM provides a flexible way
> for doing this without requiring that the amount of memory used by a fixed
> size created a priori.

AFAICT the main down-side of what I'm suggesting here compared to pkram,
is that as you say here: pkram doesn't require the up-front reserving of
memory - allocations from the global shared pool are dynamic. I'm on the
fence as to whether this is actually a desirable property though. Carving
out a large chunk of system memory as reserved memory for a persisted
filesystem (as I'm suggesting) has the advantages of removing struct page
overhead, providing better guarantees about huge/gigantic page
allocations, and probably makes the kexec restore path simpler and more
self contained.

I think there's an argument to be made that having a clearly-defined large
range of memory which is persisted, and the rest is normal "ephemeral"
kernel memory may be preferable.

Keen to hear your (and others) thoughts!

JG

[0] http://david.woodhou.se/live-update-handover.pdf
[1] https://lwn.net/Articles/839216/
[2] https://lkml.org/lkml/2020/12/7/342

* Re: [RFC v3 00/21] Preserved-over-Kexec RAM
  2023-05-26 13:57   ` Gowans, James
@ 2023-05-31 23:14     ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-05-31 23:14 UTC (permalink / raw)
  To: Gowans, James, linux-mm, linux-kernel
  Cc: kexec, jason.zeng, keescook, lei.l.li, luto, rppt, dave.hansen,
	steven.sistare, Graf (AWS),
	Alexander, akpm, mgalaxy, mingo, fam.zheng, Woodhouse, David,
	tglx, yuleixzhang, ebiederm, hpa, peterz, bp, x86


On 5/26/23 6:57 AM, Gowans, James wrote:
> On Wed, 2023-04-26 at 17:08 -0700, Anthony Yznaga wrote:
>> Sending out this RFC in part to guage community interest.
>> This patchset implements preserved-over-kexec memory storage or PKRAM as a
>> method for saving memory pages of the currently executing kernel so that
>> they may be restored after kexec into a new kernel. The patches are adapted
>> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
>> introduce the PKRAM kernel API.
>>
>> One use case for PKRAM is preserving guest memory and/or auxillary
>> supporting data (e.g. iommu data) across kexec to support reboot of the
>> host with minimal disruption to the guest.
> Hi Anthony,

Hi James,


Thank you for looking at this.

>
> Thanks for re-posting this - I've been wanting to rekindle the discussion
> on preserving memory across kexec for a while now.
>
> There are a few aspects at play in this space of memory management
> designed specifically for the virtualisation and live update (kexec) use-
> case which I think we should consider:
>
> 1. Preserving userspace-accessible memory across kexec: this is what pkram
> addresses.
>
> 2. Preserving kernel state: This would include memory required for kexec
> with DMA passthrough devices, like IOMMU root page and page tables, DMA-
> able buffers for drivers, etc. Also certain structures for improved kernel
> boot performance after kexec, like a PCI device cache, clock LPJ and
> possibly others - sort of what Xen breadcrumbs [0] achieves. The pkram RFC
> indicates that this should be possible, though IMO this could be more
> straightforward to do with a new filesystem with first-class support for
> kernel persistence via something like inode types for kernel data.

PKRAM as it is now can preserve kernel data by streaming bytes to a
PKRAM object, but the data must be location independent since the data
is stored in allocated 4k pages rather than being preserved in place.
This really isn't usable for things like page tables or memory expected
not to move because of DMA, etc.
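
(To make that concrete, a minimal sketch of the save path:
pkram_prepare_save() and pkram_discard_save() are from this series,
while pkram_write_bytes() and pkram_finish_save() are assumed names
for the write/finalize steps not shown in this excerpt.)

static int mydriver_save_state(const void *blob, size_t blob_len)
{
        struct pkram_stream ps;
        int err;

        err = pkram_prepare_save(&ps, "mydriver-state", GFP_KERNEL);
        if (err)
                return err;     /* e.g. -ENODEV when no super block */

        /* The blob must be position independent: no embedded pointers. */
        err = pkram_write_bytes(&ps, blob, blob_len);   /* assumed name */
        if (err)
                pkram_discard_save(&ps);
        else
                pkram_finish_save(&ps);                 /* assumed name */
        return err;
}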

One issue with preserving non-relocatable, regular memory that is not
partitioned from the kernel is the risk that a kexec kernel has already
been loaded and that its pre-computed destination, where it will be
copied to on reboot, will overwrite the preserved memory. Either some
way of re-processing the kexec kernel to load somewhere else would be
needed, or kexec load would need to be restricted from loading where
memory might be preserved. Pluses for a partitioning approach.
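
(As a purely illustrative sketch of the latter restriction - the struct
and function below are invented, not from these patches - kexec load
would need a range check of roughly this shape against the preserved
regions:)

struct preserved_range {
        unsigned long start;    /* inclusive */
        unsigned long end;      /* exclusive */
};

/* Would a proposed kexec segment clobber any preserved range? */
static bool overlaps_preserved(unsigned long start, unsigned long size,
                               const struct preserved_range *r, int nr)
{
        int i;

        for (i = 0; i < nr; i++)
                if (start < r[i].end && r[i].start < start + size)
                        return true;
        return false;
}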


>
> 3. Ensuring huge/gigantic memory allocations: to improve the TLB perf of
> 2-stage translations it's beneficial to allocate guest memory in large
> contiguous blocks, preferably PUD-level blocks for multi-GiB guests. If
> the buddy allocator is used this may be a challenge both from an
> implementation and a fragmentation perspective, and it may be desirable to
> have stronger guarantees about allocation sizes.
Agreed that guaranteeing large blocks and avoiding fragmentation are
issues for PKRAM. One possible avenue to address this could be to
support preserving hugetlb pages.


>
> 4. Removing struct page overhead: When doing the huge/gigantic
> allocations, in general it won't be necessary to have 4 KiB struct
> pages. This is something dmemfs [1, 2] tries to achieve by using a
> large chunk of reserved memory and managing it with a new filesystem.
Has using DAX been considered? Not familiar with dmemfs but it sounds
functionally similar.


>
> 5. More "advanced" memory management APIs/ioctls for virtualisation: Being
> able to support things like DMA-driven post-copy live migration, memory
> oversubscription, carving out chunks of memory from a VM to launch side-
> car VMs, more fine-grained control of IOMMU or MMU permissions, etc. This
> may be easier to achieve with a new filesystem, rather than coupling to
> tmpfs semantics and ioctls.
>
> Overall, with the above in mind, my take is that we may have a smoother
> path to implement a more comprehensive solution by going the route of a
> new purpose-built file system on top of reserved memory. Sort of like
> dmemfs with persistence, and specifically with support for kernel persistence.
>
> Does my take here make sense?
Yes, I believe so. There are some serious issues with PKRAM to address
before it could be truly viable (fragmentation, relocation, etc.), so
a memory partitioning approach might be the way to go.


>
> I'm hoping to put together an RFC for something like the above (dmemfs
> with persistence) soon, focusing on how the IOMMU persistence will work.
> This is an important differentiating factor to cover in the RFC, IMO.

Great! I'll keep an eye out for it.


Anthony


>
>> PKRAM provides a flexible way
>> for doing this without requiring that the amount of memory used by a fixed
>> size created a priori.
> AFAICT the main downside of what I'm suggesting here compared to pkram,
> is that as you say here: pkram doesn't require the up-front reserving of
> memory - allocations from the global shared pool are dynamic. I'm on the
> fence as to whether this is actually a desirable property though. Carving
> out a large chunk of system memory as reserved memory for a persisted
> filesystem (as I'm suggesting) has the advantages of removing struct page
> overhead, providing better guarantees about huge/gigantic page
> allocations, and probably makes the kexec restore path simpler and more
> self-contained.
>
> I think there's an argument to be made that having a clearly-defined large
> range of memory which is persisted, with the rest being normal "ephemeral"
> kernel memory, may be preferable.
>
> Keen to hear your (and others') thoughts!
>
> JG
>
> [0] http://david.woodhou.se/live-update-handover.pdf
> [1] https://lwn.net/Articles/839216/
> [2] https://lkml.org/lkml/2020/12/7/342

* Re: [RFC v3 00/21] Preserved-over-Kexec RAM
  2023-04-27  0:08 ` Anthony Yznaga
@ 2023-06-01  2:15   ` Baoquan He
  -1 siblings, 0 replies; 64+ messages in thread
From: Baoquan He @ 2023-06-01  2:15 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-mm, linux-kernel, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, rppt, akpm, ebiederm, keescook, graf, jason.zeng,
	lei.l.li, steven.sistare, fam.zheng, mgalaxy, kexec

On 04/26/23 at 05:08pm, Anthony Yznaga wrote:
> Sending out this RFC in part to gauge community interest.
> This patchset implements preserved-over-kexec memory storage or PKRAM as a
> method for saving memory pages of the currently executing kernel so that
> they may be restored after kexec into a new kernel. The patches are adapted
> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
> introduce the PKRAM kernel API.
> 
> One use case for PKRAM is preserving guest memory and/or auxiliary
> supporting data (e.g. iommu data) across kexec to support reboot of the
> host with minimal disruption to the guest. PKRAM provides a flexible way
> for doing this without requiring that the amount of memory used by a fixed
> size created a priori.  Another use case is for databases to preserve their
> block caches in shared memory across reboot.

If so, I have some questions; hopefully someone can help clarify:
1) Why was kexec reboot introduced, and what do we expect a kexec
   reboot to do?

2) If we need to keep all of this data, can we simply not reboot? The
   data would then definitely stay in place w/o any concern.

3) What if AI, edge computing, HPC etc. systems, inspired by this
   patch, also want to carry various kinds of userspace or kernel data,
   system status, registers etc. across a kexec reboot?

Thanks
Baoquan


* Re: [RFC v3 00/21] Preserved-over-Kexec RAM
  2023-06-01  2:15   ` Baoquan He
@ 2023-06-01 23:58     ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-06-01 23:58 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, linux-kernel, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, rppt, akpm, ebiederm, keescook, graf, jason.zeng,
	lei.l.li, steven.sistare, fam.zheng, mgalaxy, kexec


On 5/31/23 7:15 PM, Baoquan He wrote:
> On 04/26/23 at 05:08pm, Anthony Yznaga wrote:
>> Sending out this RFC in part to gauge community interest.
>> This patchset implements preserved-over-kexec memory storage or PKRAM as a
>> method for saving memory pages of the currently executing kernel so that
>> they may be restored after kexec into a new kernel. The patches are adapted
>> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
>> introduce the PKRAM kernel API.
>>
>> One use case for PKRAM is preserving guest memory and/or auxiliary
>> supporting data (e.g. iommu data) across kexec to support reboot of the
>> host with minimal disruption to the guest. PKRAM provides a flexible way
>> for doing this without requiring that the amount of memory used by a fixed
>> size created a priori.  Another use case is for databases to preserve their
>> block caches in shared memory across reboot.
> If so, I have some questions; hopefully someone can help clarify:
> 1) Why was kexec reboot introduced, and what do we expect a kexec
>     reboot to do?
>
> 2) If we need to keep all of this data, can we simply not reboot? The
>     data would then definitely stay in place w/o any concern.
>
> 3) What if AI, edge computing, HPC etc. systems, inspired by this
>     patch, also want to carry various kinds of userspace or kernel data,
>     system status, registers etc. across a kexec reboot?

Hi Baoquan,

Avoiding a more significant disruption from having to halt or migrate
VMs, failover services, etc. when a reboot is necessary to pick up
security fixes is one motivation for exploring preserving memory
across the reboot.


Anthony

>
> Thanks
> Baoquan
>


* Re: [RFC v3 07/21] mm: PKRAM: introduce super block
  2023-04-27  0:08   ` Anthony Yznaga
@ 2023-06-05  2:40     ` Coiby Xu
  -1 siblings, 0 replies; 64+ messages in thread
From: Coiby Xu @ 2023-06-05  2:40 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-mm, linux-kernel, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, rppt, akpm, ebiederm, keescook, graf, jason.zeng,
	lei.l.li, steven.sistare, fam.zheng, mgalaxy, kexec

Hi Anthony,

On Wed, Apr 26, 2023 at 05:08:43PM -0700, Anthony Yznaga wrote:
>The PKRAM super block is the starting point for restoring preserved
>memory. By providing the super block to the new kernel at boot time,
>preserved memory can be reserved and made available to be restored.
>To point the kernel to the location of the super block, one passes
>its pfn via the 'pkram' boot param. 

I'm curious to ask how the 'pkram' boot param will be passed. It seems I
can't find the answer in this patch set.


> For that purpose, the pkram super
>block pfn is exported via /sys/kernel/pkram. If none is passed, any
>preserved memory will not be kept, and a new super block will be
>allocated.
>
>Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
>Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>---
> mm/pkram.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 100 insertions(+), 2 deletions(-)
>
>diff --git a/mm/pkram.c b/mm/pkram.c
>index da166cb6afb7..c66b2ae4d520 100644
>--- a/mm/pkram.c
>+++ b/mm/pkram.c
>@@ -5,15 +5,18 @@
> #include <linux/init.h>
> #include <linux/io.h>
> #include <linux/kernel.h>
>+#include <linux/kobject.h>
> #include <linux/list.h>
> #include <linux/mm.h>
> #include <linux/module.h>
> #include <linux/mutex.h>
> #include <linux/notifier.h>
>+#include <linux/pfn.h>
> #include <linux/pkram.h>
> #include <linux/reboot.h>
> #include <linux/sched.h>
> #include <linux/string.h>
>+#include <linux/sysfs.h>
> #include <linux/types.h>
>
> #include "internal.h"
>@@ -82,12 +85,38 @@ struct pkram_node {
> #define PKRAM_ACCMODE_MASK	3
>
> /*
>+ * The PKRAM super block contains data needed to restore the preserved memory
>+ * structure on boot. The pointer to it (pfn) should be passed via the 'pkram'
>+ * boot param if one wants to restore preserved data saved by the previously
>+ * executing kernel. For that purpose the kernel exports the pfn via
>+ * /sys/kernel/pkram. If none is passed, preserved memory if any will not be
>+ * preserved and a new clean page will be allocated for the super block.
>+ *
>+ * The structure occupies a memory page.
>+ */
>+struct pkram_super_block {
>+	__u64	node_pfn;		/* first element of the node list */
>+};
>+
>+static unsigned long pkram_sb_pfn __initdata;
>+static struct pkram_super_block *pkram_sb;
>+
>+/*
>  * For convenience sake PKRAM nodes are kept in an auxiliary doubly-linked list
>  * connected through the lru field of the page struct.
>  */
> static LIST_HEAD(pkram_nodes);			/* linked through page::lru */
> static DEFINE_MUTEX(pkram_mutex);		/* serializes open/close */
>
>+/*
>+ * The PKRAM super block pfn, see above.
>+ */
>+static int __init parse_pkram_sb_pfn(char *arg)
>+{
>+	return kstrtoul(arg, 16, &pkram_sb_pfn);
>+}
>+early_param("pkram", parse_pkram_sb_pfn);
>+
> static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
> {
> 	return alloc_page(gfp_mask);
>@@ -270,6 +299,7 @@ static void pkram_stream_init(struct pkram_stream *ps,
>  * @gfp_mask specifies the memory allocation mask to be used when saving data.
>  *
>  * Error values:
>+ *	%ENODEV: PKRAM not available
>  *	%ENAMETOOLONG: name len >= PKRAM_NAME_MAX
>  *	%ENOMEM: insufficient memory available
>  *	%EEXIST: node with specified name already exists
>@@ -285,6 +315,9 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask
> 	struct pkram_node *node;
> 	int err = 0;
>
>+	if (!pkram_sb)
>+		return -ENODEV;
>+
> 	if (strlen(name) >= PKRAM_NAME_MAX)
> 		return -ENAMETOOLONG;
>
>@@ -404,6 +437,7 @@ void pkram_discard_save(struct pkram_stream *ps)
>  * Returns 0 on success, -errno on failure.
>  *
>  * Error values:
>+ *	%ENODEV: PKRAM not available
>  *	%ENOENT: node with specified name does not exist
>  *	%EBUSY: save to required node has not finished yet
>  *
>@@ -414,6 +448,9 @@ int pkram_prepare_load(struct pkram_stream *ps, const char *name)
> 	struct pkram_node *node;
> 	int err = 0;
>
>+	if (!pkram_sb)
>+		return -ENODEV;
>+
> 	mutex_lock(&pkram_mutex);
> 	node = pkram_find_node(name);
> 	if (!node) {
>@@ -825,6 +862,13 @@ static void __pkram_reboot(void)
> 		node->node_pfn = node_pfn;
> 		node_pfn = page_to_pfn(page);
> 	}
>+
>+	/*
>+	 * Zero out pkram_sb completely since it may have been passed from
>+	 * the previous boot.
>+	 */
>+	memset(pkram_sb, 0, PAGE_SIZE);
>+	pkram_sb->node_pfn = node_pfn;
> }
>
> static int pkram_reboot(struct notifier_block *notifier,
>@@ -832,7 +876,8 @@ static int pkram_reboot(struct notifier_block *notifier,
> {
> 	if (val != SYS_RESTART)
> 		return NOTIFY_DONE;
>-	__pkram_reboot();
>+	if (pkram_sb)
>+		__pkram_reboot();
> 	return NOTIFY_OK;
> }
>
>@@ -840,9 +885,62 @@ static int pkram_reboot(struct notifier_block *notifier,
> 	.notifier_call = pkram_reboot,
> };
>
>+static ssize_t show_pkram_sb_pfn(struct kobject *kobj,
>+		struct kobj_attribute *attr, char *buf)
>+{
>+	unsigned long pfn = pkram_sb ? PFN_DOWN(__pa(pkram_sb)) : 0;
>+
>+	return sprintf(buf, "%lx\n", pfn);
>+}
>+
>+static struct kobj_attribute pkram_sb_pfn_attr =
>+	__ATTR(pkram, 0444, show_pkram_sb_pfn, NULL);
>+
>+static struct attribute *pkram_attrs[] = {
>+	&pkram_sb_pfn_attr.attr,
>+	NULL,
>+};
>+
>+static struct attribute_group pkram_attr_group = {
>+	.attrs = pkram_attrs,
>+};
>+
>+/* returns non-zero on success */
>+static int __init pkram_init_sb(void)
>+{
>+	unsigned long pfn;
>+	struct pkram_node *node;
>+
>+	if (!pkram_sb) {
>+		struct page *page;
>+
>+		page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
>+		if (!page) {
>+			pr_err("PKRAM: Failed to allocate super block\n");
>+			return 0;
>+		}
>+		pkram_sb = page_address(page);
>+	}
>+
>+	/*
>+	 * Build auxiliary doubly-linked list of nodes connected through
>+	 * page::lru for convenience sake.
>+	 */
>+	pfn = pkram_sb->node_pfn;
>+	while (pfn) {
>+		node = pfn_to_kaddr(pfn);
>+		pkram_insert_node(node);
>+		pfn = node->node_pfn;
>+	}
>+	return 1;
>+}
>+
> static int __init pkram_init(void)
> {
>-	register_reboot_notifier(&pkram_reboot_notifier);
>+	if (pkram_init_sb()) {
>+		register_reboot_notifier(&pkram_reboot_notifier);
>+		sysfs_update_group(kernel_kobj, &pkram_attr_group);
>+	}
> 	return 0;
> }
> module_init(pkram_init);
>-- 
>1.9.4
>
>

-- 
Best regards,
Coiby


* Re: [RFC v3 07/21] mm: PKRAM: introduce super block
  2023-06-05  2:40     ` Coiby Xu
@ 2023-06-06  2:01       ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-06-06  2:01 UTC (permalink / raw)
  To: Coiby Xu
  Cc: linux-mm, linux-kernel, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, rppt, akpm, ebiederm, keescook, graf, jason.zeng,
	lei.l.li, steven.sistare, fam.zheng, mgalaxy, kexec

Hi Coiby,

On 6/4/23 7:40 PM, Coiby Xu wrote:
> Hi Anthony,
>
> On Wed, Apr 26, 2023 at 05:08:43PM -0700, Anthony Yznaga wrote:
>> The PKRAM super block is the starting point for restoring preserved
>> memory. By providing the super block to the new kernel at boot time,
>> preserved memory can be reserved and made available to be restored.
>> To point the kernel to the location of the super block, one passes
>> its pfn via the 'pkram' boot param. 
>
> I'm curious to ask how the 'pkram' boot param will be passed. It seems I
> can't find the answer in this patch set.

The pfn of the super block read from /sys/kernel/pkram is passed to
the next kernel by adding the boot parameter, pkram=<super block pfn>.
The next kernel picks it up through the early_param("pkram",
parse_pkram_sb_pfn) in this patch below.
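
(For illustration only, a userspace sketch of that handoff; this helper
and the surrounding kexec invocation are assumptions, not part of this
series:)

#include <stdio.h>
#include <string.h>

/*
 * Read the super block pfn and print a 'pkram=' argument suitable for
 * appending to the next kernel's command line, e.g. via
 * kexec -l ... --append="... pkram=<pfn>".
 */
int main(void)
{
        char pfn[32];
        FILE *f = fopen("/sys/kernel/pkram", "r");

        if (!f || !fgets(pfn, sizeof(pfn), f))
                return 1;
        fclose(f);
        pfn[strcspn(pfn, "\n")] = '\0';

        printf("pkram=%s\n", pfn);
        return 0;
}

Typically a management tool or init script would run something like
this right before loading the next kernel with kexec.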


Anthony


>
>
>> For that purpose, the pkram super
>> block pfn is exported via /sys/kernel/pkram. If none is passed, any
>> preserved memory will not be kept, and a new super block will be
>> allocated.
>>
>> Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>> mm/pkram.c | 102 
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 100 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/pkram.c b/mm/pkram.c
>> index da166cb6afb7..c66b2ae4d520 100644
>> --- a/mm/pkram.c
>> +++ b/mm/pkram.c
>> @@ -5,15 +5,18 @@
>> #include <linux/init.h>
>> #include <linux/io.h>
>> #include <linux/kernel.h>
>> +#include <linux/kobject.h>
>> #include <linux/list.h>
>> #include <linux/mm.h>
>> #include <linux/module.h>
>> #include <linux/mutex.h>
>> #include <linux/notifier.h>
>> +#include <linux/pfn.h>
>> #include <linux/pkram.h>
>> #include <linux/reboot.h>
>> #include <linux/sched.h>
>> #include <linux/string.h>
>> +#include <linux/sysfs.h>
>> #include <linux/types.h>
>>
>> #include "internal.h"
>> @@ -82,12 +85,38 @@ struct pkram_node {
>> #define PKRAM_ACCMODE_MASK    3
>>
>> /*
>> + * The PKRAM super block contains data needed to restore the 
>> preserved memory
>> + * structure on boot. The pointer to it (pfn) should be passed via 
>> the 'pkram'
>> + * boot param if one wants to restore preserved data saved by the 
>> previously
>> + * executing kernel. For that purpose the kernel exports the pfn via
>> + * /sys/kernel/pkram. If none is passed, preserved memory if any 
>> will not be
>> + * preserved and a new clean page will be allocated for the super 
>> block.
>> + *
>> + * The structure occupies a memory page.
>> + */
>> +struct pkram_super_block {
>> +    __u64    node_pfn;        /* first element of the node list */
>> +};
>> +
>> +static unsigned long pkram_sb_pfn __initdata;
>> +static struct pkram_super_block *pkram_sb;
>> +
>> +/*
>>  * For convenience sake PKRAM nodes are kept in an auxiliary 
>> doubly-linked list
>>  * connected through the lru field of the page struct.
>>  */
>> static LIST_HEAD(pkram_nodes);            /* linked through page::lru */
>> static DEFINE_MUTEX(pkram_mutex);        /* serializes open/close */
>>
>> +/*
>> + * The PKRAM super block pfn, see above.
>> + */
>> +static int __init parse_pkram_sb_pfn(char *arg)
>> +{
>> +    return kstrtoul(arg, 16, &pkram_sb_pfn);
>> +}
>> +early_param("pkram", parse_pkram_sb_pfn);
>> +
>> static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
>> {
>>     return alloc_page(gfp_mask);
>> @@ -270,6 +299,7 @@ static void pkram_stream_init(struct pkram_stream 
>> *ps,
>>  * @gfp_mask specifies the memory allocation mask to be used when 
>> saving data.
>>  *
>>  * Error values:
>> + *    %ENODEV: PKRAM not available
>>  *    %ENAMETOOLONG: name len >= PKRAM_NAME_MAX
>>  *    %ENOMEM: insufficient memory available
>>  *    %EEXIST: node with specified name already exists
>> @@ -285,6 +315,9 @@ int pkram_prepare_save(struct pkram_stream *ps, 
>> const char *name, gfp_t gfp_mask
>>     struct pkram_node *node;
>>     int err = 0;
>>
>> +    if (!pkram_sb)
>> +        return -ENODEV;
>> +
>>     if (strlen(name) >= PKRAM_NAME_MAX)
>>         return -ENAMETOOLONG;
>>
>> @@ -404,6 +437,7 @@ void pkram_discard_save(struct pkram_stream *ps)
>>  * Returns 0 on success, -errno on failure.
>>  *
>>  * Error values:
>> + *    %ENODEV: PKRAM not available
>>  *    %ENOENT: node with specified name does not exist
>>  *    %EBUSY: save to required node has not finished yet
>>  *
>> @@ -414,6 +448,9 @@ int pkram_prepare_load(struct pkram_stream *ps, 
>> const char *name)
>>     struct pkram_node *node;
>>     int err = 0;
>>
>> +    if (!pkram_sb)
>> +        return -ENODEV;
>> +
>>     mutex_lock(&pkram_mutex);
>>     node = pkram_find_node(name);
>>     if (!node) {
>> @@ -825,6 +862,13 @@ static void __pkram_reboot(void)
>>         node->node_pfn = node_pfn;
>>         node_pfn = page_to_pfn(page);
>>     }
>> +
>> +    /*
>> +     * Zero out pkram_sb completely since it may have been passed from
>> +     * the previous boot.
>> +     */
>> +    memset(pkram_sb, 0, PAGE_SIZE);
>> +    pkram_sb->node_pfn = node_pfn;
>> }
>>
>> static int pkram_reboot(struct notifier_block *notifier,
>> @@ -832,7 +876,8 @@ static int pkram_reboot(struct notifier_block 
>> *notifier,
>> {
>>     if (val != SYS_RESTART)
>>         return NOTIFY_DONE;
>> -    __pkram_reboot();
>> +    if (pkram_sb)
>> +        __pkram_reboot();
>>     return NOTIFY_OK;
>> }
>>
>> @@ -840,9 +885,62 @@ static int pkram_reboot(struct notifier_block 
>> *notifier,
>>     .notifier_call = pkram_reboot,
>> };
>>
>> +static ssize_t show_pkram_sb_pfn(struct kobject *kobj,
>> +        struct kobj_attribute *attr, char *buf)
>> +{
>> +    unsigned long pfn = pkram_sb ? PFN_DOWN(__pa(pkram_sb)) : 0;
>> +
>> +    return sprintf(buf, "%lx\n", pfn);
>> +}
>> +
>> +static struct kobj_attribute pkram_sb_pfn_attr =
>> +    __ATTR(pkram, 0444, show_pkram_sb_pfn, NULL);
>> +
>> +static struct attribute *pkram_attrs[] = {
>> +    &pkram_sb_pfn_attr.attr,
>> +    NULL,
>> +};
>> +
>> +static struct attribute_group pkram_attr_group = {
>> +    .attrs = pkram_attrs,
>> +};
>> +
>> +/* returns non-zero on success */
>> +static int __init pkram_init_sb(void)
>> +{
>> +    unsigned long pfn;
>> +    struct pkram_node *node;
>> +
>> +    if (!pkram_sb) {
>> +        struct page *page;
>> +
>> +        page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
>> +        if (!page) {
>> +            pr_err("PKRAM: Failed to allocate super block\n");
>> +            return 0;
>> +        }
>> +        pkram_sb = page_address(page);
>> +    }
>> +
>> +    /*
>> +     * Build auxiliary doubly-linked list of nodes connected through
>> +     * page::lru for convenience sake.
>> +     */
>> +    pfn = pkram_sb->node_pfn;
>> +    while (pfn) {
>> +        node = pfn_to_kaddr(pfn);
>> +        pkram_insert_node(node);
>> +        pfn = node->node_pfn;
>> +    }
>> +    return 1;
>> +}
>> +
>> static int __init pkram_init(void)
>> {
>> -    register_reboot_notifier(&pkram_reboot_notifier);
>> +    if (pkram_init_sb()) {
>> +        register_reboot_notifier(&pkram_reboot_notifier);
>> +        sysfs_update_group(kernel_kobj, &pkram_attr_group);
>> +    }
>>     return 0;
>> }
>> module_init(pkram_init);
>> -- 
>> 1.9.4
>>
>>
>


* Re: [RFC v3 07/21] mm: PKRAM: introduce super block
  2023-06-06  2:01       ` Anthony Yznaga
@ 2023-06-06  2:55         ` Coiby Xu
  -1 siblings, 0 replies; 64+ messages in thread
From: Coiby Xu @ 2023-06-06  2:55 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: linux-mm, linux-kernel, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, rppt, akpm, ebiederm, keescook, graf, jason.zeng,
	lei.l.li, steven.sistare, fam.zheng, mgalaxy, kexec

On Mon, Jun 05, 2023 at 07:01:56PM -0700, Anthony Yznaga wrote:
>Hi Coiby,
>
>On 6/4/23 7:40 PM, Coiby Xu wrote:
>>Hi Anthony,
>>
>>On Wed, Apr 26, 2023 at 05:08:43PM -0700, Anthony Yznaga wrote:
>>>The PKRAM super block is the starting point for restoring preserved
>>>memory. By providing the super block to the new kernel at boot time,
>>>preserved memory can be reserved and made available to be restored.
>>>To point the kernel to the location of the super block, one passes
>>>its pfn via the 'pkram' boot param.
>>
>>I'm curious to ask how the 'pkram' boot param will be passed. It seems I
>>can't find the answer in this patch set.
>
>The pfn of the super block read from /sys/kernel/pkram is passed to
>the next kernel by adding the boot parameter, pkram=<super block pfn>.
>The next kernel picks it up through the early_param("pkram",
>parse_pkram_sb_pfn) in this patch below.

Thanks for the explanation! Sorry I didn't make my question clear. I
should have asked who is going to read /sys/kernel/pkram and how this
pkram boot parameter will be added for the next kernel.

>
>
>Anthony
>
>
>>
>>
>>>For that purpose, the pkram super
>>>block pfn is exported via /sys/kernel/pkram. If none is passed, any
>>>preserved memory will not be kept, and a new super block will be
>>>allocated.
>>>
>>>Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
>>>Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>>>---
>>>mm/pkram.c | 102 
>>>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>1 file changed, 100 insertions(+), 2 deletions(-)
>>>
>>>diff --git a/mm/pkram.c b/mm/pkram.c
>>>index da166cb6afb7..c66b2ae4d520 100644
>>>--- a/mm/pkram.c
>>>+++ b/mm/pkram.c
>>>@@ -5,15 +5,18 @@
>>>#include <linux/init.h>
>>>#include <linux/io.h>
>>>#include <linux/kernel.h>
>>>+#include <linux/kobject.h>
>>>#include <linux/list.h>
>>>#include <linux/mm.h>
>>>#include <linux/module.h>
>>>#include <linux/mutex.h>
>>>#include <linux/notifier.h>
>>>+#include <linux/pfn.h>
>>>#include <linux/pkram.h>
>>>#include <linux/reboot.h>
>>>#include <linux/sched.h>
>>>#include <linux/string.h>
>>>+#include <linux/sysfs.h>
>>>#include <linux/types.h>
>>>
>>>#include "internal.h"
>>>@@ -82,12 +85,38 @@ struct pkram_node {
>>>#define PKRAM_ACCMODE_MASK    3
>>>
>>>/*
>>>+ * The PKRAM super block contains data needed to restore the 
>>>preserved memory
>>>+ * structure on boot. The pointer to it (pfn) should be passed 
>>>via the 'pkram'
>>>+ * boot param if one wants to restore preserved data saved by the 
>>>previously
>>>+ * executing kernel. For that purpose the kernel exports the pfn via
>>>+ * /sys/kernel/pkram. If none is passed, preserved memory if any 
>>>will not be
>>>+ * preserved and a new clean page will be allocated for the super 
>>>block.
>>>+ *
>>>+ * The structure occupies a memory page.
>>>+ */
>>>+struct pkram_super_block {
>>>+    __u64    node_pfn;        /* first element of the node list */
>>>+};
>>>+
>>>+static unsigned long pkram_sb_pfn __initdata;
>>>+static struct pkram_super_block *pkram_sb;
>>>+
>>>+/*
>>> * For convenience sake PKRAM nodes are kept in an auxiliary 
>>>doubly-linked list
>>> * connected through the lru field of the page struct.
>>> */
>>>static LIST_HEAD(pkram_nodes);            /* linked through page::lru */
>>>static DEFINE_MUTEX(pkram_mutex);        /* serializes open/close */
>>>
>>>+/*
>>>+ * The PKRAM super block pfn, see above.
>>>+ */
>>>+static int __init parse_pkram_sb_pfn(char *arg)
>>>+{
>>>+    return kstrtoul(arg, 16, &pkram_sb_pfn);
>>>+}
>>>+early_param("pkram", parse_pkram_sb_pfn);
>>>+
>>>static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
>>>{
>>>    return alloc_page(gfp_mask);
>>>@@ -270,6 +299,7 @@ static void pkram_stream_init(struct 
>>>pkram_stream *ps,
>>> * @gfp_mask specifies the memory allocation mask to be used when 
>>>saving data.
>>> *
>>> * Error values:
>>>+ *    %ENODEV: PKRAM not available
>>> *    %ENAMETOOLONG: name len >= PKRAM_NAME_MAX
>>> *    %ENOMEM: insufficient memory available
>>> *    %EEXIST: node with specified name already exists
>>>@@ -285,6 +315,9 @@ int pkram_prepare_save(struct pkram_stream 
>>>*ps, const char *name, gfp_t gfp_mask
>>>    struct pkram_node *node;
>>>    int err = 0;
>>>
>>>+    if (!pkram_sb)
>>>+        return -ENODEV;
>>>+
>>>    if (strlen(name) >= PKRAM_NAME_MAX)
>>>        return -ENAMETOOLONG;
>>>
>>>@@ -404,6 +437,7 @@ void pkram_discard_save(struct pkram_stream *ps)
>>> * Returns 0 on success, -errno on failure.
>>> *
>>> * Error values:
>>>+ *    %ENODEV: PKRAM not available
>>> *    %ENOENT: node with specified name does not exist
>>> *    %EBUSY: save to required node has not finished yet
>>> *
>>>@@ -414,6 +448,9 @@ int pkram_prepare_load(struct pkram_stream 
>>>*ps, const char *name)
>>>    struct pkram_node *node;
>>>    int err = 0;
>>>
>>>+    if (!pkram_sb)
>>>+        return -ENODEV;
>>>+
>>>    mutex_lock(&pkram_mutex);
>>>    node = pkram_find_node(name);
>>>    if (!node) {
>>>@@ -825,6 +862,13 @@ static void __pkram_reboot(void)
>>>        node->node_pfn = node_pfn;
>>>        node_pfn = page_to_pfn(page);
>>>    }
>>>+
>>>+    /*
>>>+     * Zero out pkram_sb completely since it may have been passed from
>>>+     * the previous boot.
>>>+     */
>>>+    memset(pkram_sb, 0, PAGE_SIZE);
>>>+    pkram_sb->node_pfn = node_pfn;
>>>}
>>>
>>>static int pkram_reboot(struct notifier_block *notifier,
>>>@@ -832,7 +876,8 @@ static int pkram_reboot(struct notifier_block 
>>>*notifier,
>>>{
>>>    if (val != SYS_RESTART)
>>>        return NOTIFY_DONE;
>>>-    __pkram_reboot();
>>>+    if (pkram_sb)
>>>+        __pkram_reboot();
>>>    return NOTIFY_OK;
>>>}
>>>
>>>@@ -840,9 +885,62 @@ static int pkram_reboot(struct notifier_block 
>>>*notifier,
>>>    .notifier_call = pkram_reboot,
>>>};
>>>
>>>+static ssize_t show_pkram_sb_pfn(struct kobject *kobj,
>>>+        struct kobj_attribute *attr, char *buf)
>>>+{
>>>+    unsigned long pfn = pkram_sb ? PFN_DOWN(__pa(pkram_sb)) : 0;
>>>+
>>>+    return sprintf(buf, "%lx\n", pfn);
>>>+}
>>>+
>>>+static struct kobj_attribute pkram_sb_pfn_attr =
>>>+    __ATTR(pkram, 0444, show_pkram_sb_pfn, NULL);
>>>+
>>>+static struct attribute *pkram_attrs[] = {
>>>+    &pkram_sb_pfn_attr.attr,
>>>+    NULL,
>>>+};
>>>+
>>>+static struct attribute_group pkram_attr_group = {
>>>+    .attrs = pkram_attrs,
>>>+};
>>>+
>>>+/* returns non-zero on success */
>>>+static int __init pkram_init_sb(void)
>>>+{
>>>+    unsigned long pfn;
>>>+    struct pkram_node *node;
>>>+
>>>+    if (!pkram_sb) {
>>>+        struct page *page;
>>>+
>>>+        page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>+        if (!page) {
>>>+            pr_err("PKRAM: Failed to allocate super block\n");
>>>+            return 0;
>>>+        }
>>>+        pkram_sb = page_address(page);
>>>+    }
>>>+
>>>+    /*
>>>+     * Build auxiliary doubly-linked list of nodes connected through
>>>+     * page::lru for convenience sake.
>>>+     */
>>>+    pfn = pkram_sb->node_pfn;
>>>+    while (pfn) {
>>>+        node = pfn_to_kaddr(pfn);
>>>+        pkram_insert_node(node);
>>>+        pfn = node->node_pfn;
>>>+    }
>>>+    return 1;
>>>+}
>>>+
>>>static int __init pkram_init(void)
>>>{
>>>-    register_reboot_notifier(&pkram_reboot_notifier);
>>>+    if (pkram_init_sb()) {
>>>+        register_reboot_notifier(&pkram_reboot_notifier);
>>>+        sysfs_update_group(kernel_kobj, &pkram_attr_group);
>>>+    }
>>>    return 0;
>>>}
>>>module_init(pkram_init);
>>>-- 
>>>1.9.4
>>>
>>>
>>>_______________________________________________
>>>kexec mailing list
>>>kexec@lists.infradead.org
>>>http://lists.infradead.org/mailman/listinfo/kexec
>>>
>>
>

-- 
Best regards,
Coiby


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v3 07/21] mm: PKRAM: introduce super block
  2023-06-06  2:55         ` Coiby Xu
@ 2023-06-06  3:12           ` Anthony Yznaga
  -1 siblings, 0 replies; 64+ messages in thread
From: Anthony Yznaga @ 2023-06-06  3:12 UTC (permalink / raw)
  To: Coiby Xu
  Cc: linux-mm, linux-kernel, tglx, mingo, bp, x86, hpa, dave.hansen,
	luto, peterz, rppt, akpm, ebiederm, keescook, graf, jason.zeng,
	lei.l.li, steven.sistare, fam.zheng, mgalaxy, kexec


On 6/5/23 7:55 PM, Coiby Xu wrote:
> On Mon, Jun 05, 2023 at 07:01:56PM -0700, Anthony Yznaga wrote:
>> Hi Coiby,
>>
>> On 6/4/23 7:40 PM, Coiby Xu wrote:
>>> Hi Anthony,
>>>
>>> On Wed, Apr 26, 2023 at 05:08:43PM -0700, Anthony Yznaga wrote:
>>>> The PKRAM super block is the starting point for restoring preserved
>>>> memory. By providing the super block to the new kernel at boot time,
>>>> preserved memory can be reserved and made available to be restored.
>>>> To point the kernel to the location of the super block, one passes
>>>> its pfn via the 'pkram' boot param.
>>>
>>> I'm curious to ask how will the 'pkram' boot param be passed. It seems I
>>> can't find the answer in this patch set.
>>
>> The pfn of the super block read from /sys/kernel/pkram is passed to
>> the next kernel by adding the boot parameter, pkram=<super block pfn>.
>> The next kernel picks it up through the early_param("pkram",
>> parse_pkram_sb_pfn) in this patch below.
>
> Thanks for the explanation! Sorry I didn't make my question clear. I
> should have asked who is going to read /sys/kernel/pkram and how this
> pkram boot parameter will be added for the next kernel.

I have not specified this in the patches. One way would be to write a script
as a wrapper around kexec that determines and appends the appropriate boot
parameter.
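
For illustration, a minimal sketch of such a wrapper (the kernel image and
initrd paths, and the reuse of the current command line, are placeholders
rather than anything prescribed by this patch set):

    #!/bin/sh
    # Read the super block pfn (hex, no 0x prefix) exported by PKRAM.
    sb_pfn=$(cat /sys/kernel/pkram)
    # Load the next kernel with pkram=<super block pfn> appended so the
    # early_param("pkram", ...) handler in the new kernel picks it up.
    kexec -l /boot/vmlinuz --initrd=/boot/initrd.img \
        --append="$(cat /proc/cmdline) pkram=${sb_pfn}"
    kexec -e

The same parameter can be appended when loading with the kexec_file_load
syscall (kexec -s) instead.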


Anthony

>
>>
>>
>> Anthony
>>
>>
>>>
>>>
>>>> For that purpose, the pkram super
>>>> block pfn is exported via /sys/kernel/pkram. If none is passed, any
>>>> preserved memory will not be kept, and a new super block will be
>>>> allocated.
>>>>
>>>> Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
>>>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>>>> ---
>>>> mm/pkram.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>> 1 file changed, 100 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/pkram.c b/mm/pkram.c
>>>> index da166cb6afb7..c66b2ae4d520 100644
>>>> --- a/mm/pkram.c
>>>> +++ b/mm/pkram.c
>>>> @@ -5,15 +5,18 @@
>>>> #include <linux/init.h>
>>>> #include <linux/io.h>
>>>> #include <linux/kernel.h>
>>>> +#include <linux/kobject.h>
>>>> #include <linux/list.h>
>>>> #include <linux/mm.h>
>>>> #include <linux/module.h>
>>>> #include <linux/mutex.h>
>>>> #include <linux/notifier.h>
>>>> +#include <linux/pfn.h>
>>>> #include <linux/pkram.h>
>>>> #include <linux/reboot.h>
>>>> #include <linux/sched.h>
>>>> #include <linux/string.h>
>>>> +#include <linux/sysfs.h>
>>>> #include <linux/types.h>
>>>>
>>>> #include "internal.h"
>>>> @@ -82,12 +85,38 @@ struct pkram_node {
>>>> #define PKRAM_ACCMODE_MASK    3
>>>>
>>>> /*
>>>> + * The PKRAM super block contains data needed to restore the 
>>>> preserved memory
>>>> + * structure on boot. The pointer to it (pfn) should be passed via 
>>>> the 'pkram'
>>>> + * boot param if one wants to restore preserved data saved by the 
>>>> previously
>>>> + * executing kernel. For that purpose the kernel exports the pfn via
>>>> + * /sys/kernel/pkram. If none is passed, preserved memory if any 
>>>> will not be
>>>> + * preserved and a new clean page will be allocated for the super 
>>>> block.
>>>> + *
>>>> + * The structure occupies a memory page.
>>>> + */
>>>> +struct pkram_super_block {
>>>> +    __u64    node_pfn;        /* first element of the node list */
>>>> +};
>>>> +
>>>> +static unsigned long pkram_sb_pfn __initdata;
>>>> +static struct pkram_super_block *pkram_sb;
>>>> +
>>>> +/*
>>>>  * For convenience sake PKRAM nodes are kept in an auxiliary doubly-linked list
>>>>  * connected through the lru field of the page struct.
>>>>  */
>>>> static LIST_HEAD(pkram_nodes);            /* linked through page::lru */
>>>> static DEFINE_MUTEX(pkram_mutex);        /* serializes open/close */
>>>>
>>>> +/*
>>>> + * The PKRAM super block pfn, see above.
>>>> + */
>>>> +static int __init parse_pkram_sb_pfn(char *arg)
>>>> +{
>>>> +    return kstrtoul(arg, 16, &pkram_sb_pfn);
>>>> +}
>>>> +early_param("pkram", parse_pkram_sb_pfn);
>>>> +
>>>> static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
>>>> {
>>>>     return alloc_page(gfp_mask);
>>>> @@ -270,6 +299,7 @@ static void pkram_stream_init(struct pkram_stream *ps,
>>>>  * @gfp_mask specifies the memory allocation mask to be used when saving data.
>>>>  *
>>>>  * Error values:
>>>> + *    %ENODEV: PKRAM not available
>>>>  *    %ENAMETOOLONG: name len >= PKRAM_NAME_MAX
>>>>  *    %ENOMEM: insufficient memory available
>>>>  *    %EEXIST: node with specified name already exists
>>>> @@ -285,6 +315,9 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask
>>>>     struct pkram_node *node;
>>>>     int err = 0;
>>>>
>>>> +    if (!pkram_sb)
>>>> +        return -ENODEV;
>>>> +
>>>>     if (strlen(name) >= PKRAM_NAME_MAX)
>>>>         return -ENAMETOOLONG;
>>>>
>>>> @@ -404,6 +437,7 @@ void pkram_discard_save(struct pkram_stream *ps)
>>>>  * Returns 0 on success, -errno on failure.
>>>>  *
>>>>  * Error values:
>>>> + *    %ENODEV: PKRAM not available
>>>>  *    %ENOENT: node with specified name does not exist
>>>>  *    %EBUSY: save to required node has not finished yet
>>>>  *
>>>> @@ -414,6 +448,9 @@ int pkram_prepare_load(struct pkram_stream *ps, const char *name)
>>>>     struct pkram_node *node;
>>>>     int err = 0;
>>>>
>>>> +    if (!pkram_sb)
>>>> +        return -ENODEV;
>>>> +
>>>>     mutex_lock(&pkram_mutex);
>>>>     node = pkram_find_node(name);
>>>>     if (!node) {
>>>> @@ -825,6 +862,13 @@ static void __pkram_reboot(void)
>>>>         node->node_pfn = node_pfn;
>>>>         node_pfn = page_to_pfn(page);
>>>>     }
>>>> +
>>>> +    /*
>>>> +     * Zero out pkram_sb completely since it may have been passed 
>>>> from
>>>> +     * the previous boot.
>>>> +     */
>>>> +    memset(pkram_sb, 0, PAGE_SIZE);
>>>> +    pkram_sb->node_pfn = node_pfn;
>>>> }
>>>>
>>>> static int pkram_reboot(struct notifier_block *notifier,
>>>> @@ -832,7 +876,8 @@ static int pkram_reboot(struct notifier_block *notifier,
>>>> {
>>>>     if (val != SYS_RESTART)
>>>>         return NOTIFY_DONE;
>>>> -    __pkram_reboot();
>>>> +    if (pkram_sb)
>>>> +        __pkram_reboot();
>>>>     return NOTIFY_OK;
>>>> }
>>>>
>>>> @@ -840,9 +885,62 @@ static int pkram_reboot(struct notifier_block *notifier,
>>>>     .notifier_call = pkram_reboot,
>>>> };
>>>>
>>>> +static ssize_t show_pkram_sb_pfn(struct kobject *kobj,
>>>> +        struct kobj_attribute *attr, char *buf)
>>>> +{
>>>> +    unsigned long pfn = pkram_sb ? PFN_DOWN(__pa(pkram_sb)) : 0;
>>>> +
>>>> +    return sprintf(buf, "%lx\n", pfn);
>>>> +}
>>>> +
>>>> +static struct kobj_attribute pkram_sb_pfn_attr =
>>>> +    __ATTR(pkram, 0444, show_pkram_sb_pfn, NULL);
>>>> +
>>>> +static struct attribute *pkram_attrs[] = {
>>>> +    &pkram_sb_pfn_attr.attr,
>>>> +    NULL,
>>>> +};
>>>> +
>>>> +static struct attribute_group pkram_attr_group = {
>>>> +    .attrs = pkram_attrs,
>>>> +};
>>>> +
>>>> +/* returns non-zero on success */
>>>> +static int __init pkram_init_sb(void)
>>>> +{
>>>> +    unsigned long pfn;
>>>> +    struct pkram_node *node;
>>>> +
>>>> +    if (!pkram_sb) {
>>>> +        struct page *page;
>>>> +
>>>> +        page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>> +        if (!page) {
>>>> +            pr_err("PKRAM: Failed to allocate super block\n");
>>>> +            return 0;
>>>> +        }
>>>> +        pkram_sb = page_address(page);
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Build auxiliary doubly-linked list of nodes connected through
>>>> +     * page::lru for convenience sake.
>>>> +     */
>>>> +    pfn = pkram_sb->node_pfn;
>>>> +    while (pfn) {
>>>> +        node = pfn_to_kaddr(pfn);
>>>> +        pkram_insert_node(node);
>>>> +        pfn = node->node_pfn;
>>>> +    }
>>>> +    return 1;
>>>> +}
>>>> +
>>>> static int __init pkram_init(void)
>>>> {
>>>> -    register_reboot_notifier(&pkram_reboot_notifier);
>>>> +    if (pkram_init_sb()) {
>>>> +        register_reboot_notifier(&pkram_reboot_notifier);
>>>> +        sysfs_update_group(kernel_kobj, &pkram_attr_group);
>>>> +    }
>>>>     return 0;
>>>> }
>>>> module_init(pkram_init);
>>>> -- 
>>>> 1.9.4
>>>>
>>>>
>>>> _______________________________________________
>>>> kexec mailing list
>>>> kexec@lists.infradead.org
>>>> http://lists.infradead.org/mailman/listinfo/kexec
>>>>
>>>
>>
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2023-06-06  3:14 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-27  0:08 [RFC v3 00/21] Preserved-over-Kexec RAM Anthony Yznaga
2023-04-27  0:08 ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 01/21] mm: add PKRAM API stubs and Kconfig Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 02/21] mm: PKRAM: implement node load and save functions Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 03/21] mm: PKRAM: implement object " Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 04/21] mm: PKRAM: implement folio stream operations Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 05/21] mm: PKRAM: implement byte " Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 06/21] mm: PKRAM: link nodes by pfn before reboot Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 07/21] mm: PKRAM: introduce super block Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-06-05  2:40   ` Coiby Xu
2023-06-05  2:40     ` Coiby Xu
2023-06-06  2:01     ` Anthony Yznaga
2023-06-06  2:01       ` Anthony Yznaga
2023-06-06  2:55       ` Coiby Xu
2023-06-06  2:55         ` Coiby Xu
2023-06-06  3:12         ` Anthony Yznaga
2023-06-06  3:12           ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 08/21] PKRAM: track preserved pages in a physical mapping pagetable Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 09/21] PKRAM: pass a list of preserved ranges to the next kernel Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 10/21] PKRAM: prepare for adding preserved ranges to memblock reserved Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 11/21] mm: PKRAM: reserve preserved memory at boot Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 12/21] PKRAM: free the preserved ranges list Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 13/21] PKRAM: prevent inadvertent use of a stale superblock Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 14/21] PKRAM: provide a way to ban pages from use by PKRAM Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 15/21] kexec: PKRAM: prevent kexec clobbering preserved pages in some cases Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 16/21] PKRAM: provide a way to check if a memory range has preserved pages Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 17/21] kexec: PKRAM: avoid clobbering already " Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 18/21] mm: PKRAM: allow preserved memory to be freed from userspace Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 19/21] PKRAM: disable feature when running the kdump kernel Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 20/21] x86/KASLR: PKRAM: support physical kaslr Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27  0:08 ` [RFC v3 21/21] x86/boot/compressed/64: use 1GB pages for mappings Anthony Yznaga
2023-04-27  0:08   ` Anthony Yznaga
2023-04-27 18:40   ` H. Peter Anvin
2023-04-27 18:40     ` H. Peter Anvin
2023-04-27 22:38     ` Anthony Yznaga
2023-04-27 22:38       ` Anthony Yznaga
2023-05-26 13:57 ` [RFC v3 00/21] Preserved-over-Kexec RAM Gowans, James
2023-05-26 13:57   ` Gowans, James
2023-05-31 23:14   ` Anthony Yznaga
2023-05-31 23:14     ` Anthony Yznaga
2023-06-01  2:15 ` Baoquan He
2023-06-01  2:15   ` Baoquan He
2023-06-01 23:58   ` Anthony Yznaga
2023-06-01 23:58     ` Anthony Yznaga

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.