* [RFC v2 00/43] PKRAM: Preserved-over-Kexec RAM
@ 2021-03-30 21:35 ` Anthony Yznaga
  0 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

This patchset implements preserved-over-kexec memory storage or PKRAM as a
method for saving memory pages of the currently executing kernel so that
they may be restored after kexec into a new kernel. The patches are adapted
from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
introduce the PKRAM kernel API and implement its use within tmpfs, allowing
tmpfs files to be preserved across kexec.

One use case for PKRAM is preserving guest memory and/or auxiliary supporting
data (e.g. iommu data) across kexec in support of VMM Fast Restart[2].
VMM Fast Restart is currently using PKRAM to support preserving "Keep Alive
State" across reboot[3].  PKRAM provides a flexible way of doing this
without requiring that a fixed amount of memory be set aside a priori.
Another use case is for databases to preserve their block caches in shared
memory across reboot.

Changes since RFC v1
  - Rebased onto 5.12-rc4
  - Refined the API to reduce the number of calls
    and better support multithreading.
  - Allow preserving byte data of arbitrary length
    (was previously limited to one page).
  - Build a new memblock reserved list with the
    preserved ranges and then substitute it for
    the existing one. (Mike Rapoport)
  - Use mem_avoid_overlap() to avoid kaslr stepping
    on preserved ranges. (Kees Cook)

-- Usage --

 1) Mount tmpfs with 'pkram=NAME' option.

    NAME is an arbitrary string specifying a preserved memory node.
    Different tmpfs trees may be saved to PKRAM if different names are
    passed.

    # mkdir -p /mnt
    # mount -t tmpfs -o pkram=mytmpfs none /mnt

 2) Populate a file under /mnt

    # head -c 2G /dev/urandom > /mnt/testfile
    # md5sum /mnt/testfile
    e281e2f019ac3bfa3bdb28aa08c4beb3  /mnt/testfile

 3) Remount tmpfs to preserve files.

    # mount -o remount,preserve,ro /mnt

 4) Load the new kernel image.

    Pass the PKRAM super block pfn via 'pkram' boot option. The pfn is
    exported via the sysfs file /sys/kernel/pkram.

    # kexec -s -l /boot/vmlinuz-$kernel --initrd=/boot/initramfs-$kernel.img \
            --append="$(cat /proc/cmdline|sed -e 's/pkram=[^ ]*//g') pkram=$(cat /sys/kernel/pkram)"

 5) Boot to the new kernel.

    # systemctl kexec

 6) Mount tmpfs with 'pkram=NAME' option.

    It should find the PKRAM node with the tmpfs tree saved in step 3 and
    restore it.

    # mount -t tmpfs -o pkram=mytmpfs none /mnt

 7) Use the restored file under /mnt

    # md5sum /mnt/testfile
    e281e2f019ac3bfa3bdb28aa08c4beb3  /mnt/testfile


 -- Implementation details --

 * When a tmpfs filesystem is mounted the first time with the 'pkram=NAME'
   option, a shmem_pkram_info is allocated to record NAME. The shmem_pkram_info
   and whether the filesystem is in the preserved state are tracked by
   shmem_sb_info.

 * A PKRAM-enabled tmpfs filesystem is saved to PKRAM on remount when the
   'preserve' mount option is specified and the filesystem is read-only.

 * Saving a file to PKRAM is done by walking the pages of the file and
   building a list of the pages and attributes needed to restore them later.
   The pages containing this metadata as well as the target file pages have
   their refcount incremented to prevent them from being freed even after
   the last user puts the pages (i.e. the filesystem is unmounted).

 * To aid in quickly finding contiguous ranges of memory containing
   preserved pages, a pseudo physical mapping pagetable is populated with
   pages as they are preserved (a rough sketch of the idea follows this
   list).

 * If a page to be preserved is found to be in a range of memory that was
   previously reserved during early boot or in a range of memory where the
   kernel will be loaded on kexec, the page will be copied to a page
   outside of those ranges and the new page will be preserved. A compound
   page will be copied and preserved as individual base pages.

 * A single page is allocated for the PKRAM super block. For the next kernel
   kexec boot to find preserved memory metadata, the pfn of the PKRAM super
   block, which is exported via /sys/kernel/pkram, is passed in the 'pkram'
   boot option.

 * In the newly booted kernel, PKRAM adds all preserved pages to the memblock
   reserve list during early boot so that they will not be recycled.

 * Since kexec may load the new kernel code to any memory region, it could
   destroy preserved memory. When the kernel selects the memory region
   (kexec_file_load syscall), kexec will avoid preserved pages.  When the
   user selects the kexec memory region to use (kexec_load syscall), kexec
   load will fail if there is a conflict with preserved pages. Pages
   preserved after a kexec kernel is loaded will be relocated if they
   conflict with the selected memory region.
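
As a rough illustration of the pagetable idea referenced above, here is a
minimal, self-contained userspace sketch. It is not the mm/pkram_pagetable.c
code added later in the series; the two-level layout, the level width, and
all names are invented for the example. It only shows how recording preserved
pfns in a sparse table makes contiguous preserved ranges cheap to recover
with a linear walk.

    /* Illustrative sketch only; not the patchset's implementation. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LVL_BITS     9                  /* 512 entries per level (hypothetical) */
    #define LVL_ENTRIES  (1UL << LVL_BITS)

    struct pfn_table {
        bool *leaves[LVL_ENTRIES];          /* second level: one flag per pfn */
    };

    /* Record that @pfn has been preserved, allocating a leaf on demand. */
    static int mark_preserved(struct pfn_table *t, unsigned long pfn)
    {
        unsigned long hi = pfn >> LVL_BITS;

        if (hi >= LVL_ENTRIES)
            return -1;
        if (!t->leaves[hi]) {
            t->leaves[hi] = calloc(LVL_ENTRIES, sizeof(bool));
            if (!t->leaves[hi])
                return -1;
        }
        t->leaves[hi][pfn & (LVL_ENTRIES - 1)] = true;
        return 0;
    }

    static bool is_preserved(struct pfn_table *t, unsigned long pfn)
    {
        unsigned long hi = pfn >> LVL_BITS;

        return hi < LVL_ENTRIES && t->leaves[hi] &&
               t->leaves[hi][pfn & (LVL_ENTRIES - 1)];
    }

    int main(void)
    {
        struct pfn_table t = { 0 };
        unsigned long pfn;

        for (pfn = 100; pfn < 110; pfn++)   /* preserve pfns 100-109 */
            mark_preserved(&t, pfn);
        for (pfn = 600; pfn < 605; pfn++)   /* preserve pfns 600-604 */
            mark_preserved(&t, pfn);

        /* Walk the table and print each contiguous preserved range. */
        for (pfn = 0; pfn < LVL_ENTRIES * LVL_ENTRIES; pfn++) {
            unsigned long start = pfn;

            if (!is_preserved(&t, pfn))
                continue;
            while (is_preserved(&t, pfn + 1))
                pfn++;
            printf("preserved range: pfn %lu-%lu\n", start, pfn);
        }
        return 0;
    }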

The current implementation has some restrictions:

 * Only regular tmpfs files without multiple hard links can be preserved.
   Save to PKRAM will abort and log an error if a directory or other file
   type is encountered.

 * Pages for PKRAM-enabled files are prevented from swapping out to avoid
   the performance penalty of swapping in and the possibility of insufficient
   memory.


-- Patches --

The patches are broken down into the following groups:

Patches 1-22 implement the API and supporting functionality.

Patches 23-27 implement the use of PKRAM within tmpfs.

The remaining patches implement optimizations to the initialization of
preserved pages and to the preservation and restoration of shmem pages.

To give an idea of the improvement in performance, here is an example
comparison with and without these patches when saving and loading a 100G
file:

  Save a 100G file:

              | No optimizations | Optimized (16 cpus) |
  ------------------------------------------------------
  huge=never  |     2265ms       |       232ms         |
  ------------------------------------------------------
  huge=always |       58ms       |        22ms         |


  Load a 100G file:

              | No optimizations | Optimized (16 cpus) |
  ------------------------------------------------------
  huge=never  |     8833ms       |       516ms         |
  ------------------------------------------------------
  huge=always |      752ms       |       105ms         |


Patches 28-31: Defer initialization of page structs for preserved pages.

Patches 32-34: Implement multi-threading of shmem page preservation and
restoration.

Patches 35-37: Implement and use an API for inserting shmem pages in bulk.

Patches 38-39: Reduce contention on the LRU lock by staging and adding pages
in bulk to the LRU.

Patches 40-43: Reduce contention on the pagecache xarray lock by inserting
pages in bulk in certain cases.

[1] https://lkml.org/lkml/2013/7/1/211

[2] https://www.youtube.com/watch?v=pBsHnf93tcQ
    https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

[3] https://www.youtube.com/watch?v=pBsHnf93tcQ
    https://static.sched.com/hosted_files/kvmforum2020/10/Device-Keepalive-State-KVMForum2020.pdf

Anthony Yznaga (43):
  mm: add PKRAM API stubs and Kconfig
  mm: PKRAM: implement node load and save functions
  mm: PKRAM: implement object load and save functions
  mm: PKRAM: implement page stream operations
  mm: PKRAM: support preserving transparent hugepages
  mm: PKRAM: implement byte stream operations
  mm: PKRAM: link nodes by pfn before reboot
  mm: PKRAM: introduce super block
  PKRAM: track preserved pages in a physical mapping pagetable
  PKRAM: pass a list of preserved ranges to the next kernel
  PKRAM: prepare for adding preserved ranges to memblock reserved
  mm: PKRAM: reserve preserved memory at boot
  PKRAM: free the preserved ranges list
  PKRAM: prevent inadvertent use of a stale superblock
  PKRAM: provide a way to ban pages from use by PKRAM
  kexec: PKRAM: prevent kexec clobbering preserved pages in some cases
  PKRAM: provide a way to check if a memory range has preserved pages
  kexec: PKRAM: avoid clobbering already preserved pages
  mm: PKRAM: allow preserved memory to be freed from userspace
  PKRAM: disable feature when running the kdump kernel
  x86/KASLR: PKRAM: support physical kaslr
  x86/boot/compressed/64: use 1GB pages for mappings
  mm: shmem: introduce shmem_insert_page
  mm: shmem: enable saving to PKRAM
  mm: shmem: prevent swapping of PKRAM-enabled tmpfs pages
  mm: shmem: specify the mm to use when inserting pages
  mm: shmem: when inserting, handle pages already charged to a memcg
  x86/mm/numa: add numa_isolate_memblocks()
  PKRAM: ensure memblocks with preserved pages init'd for numa
  memblock: PKRAM: mark memblocks that contain preserved pages
  memblock, mm: defer initialization of preserved pages
  shmem: preserve shmem files a chunk at a time
  PKRAM: atomically add and remove link pages
  shmem: PKRAM: multithread preserving and restoring shmem pages
  shmem: introduce shmem_insert_pages()
  PKRAM: add support for loading pages in bulk
  shmem: PKRAM: enable bulk loading of preserved pages into shmem
  mm: implement splicing a list of pages to the LRU
  shmem: optimize adding pages to the LRU in shmem_insert_pages()
  shmem: initial support for adding multiple pages to pagecache
  XArray: add xas_export_node() and xas_import_node()
  shmem: reduce time holding xa_lock when inserting pages
  PKRAM: improve index alignment of pkram_link entries

 Documentation/core-api/xarray.rst       |    8 +
 arch/x86/boot/compressed/Makefile       |    3 +
 arch/x86/boot/compressed/ident_map_64.c |    9 +-
 arch/x86/boot/compressed/kaslr.c        |   10 +-
 arch/x86/boot/compressed/misc.h         |   10 +
 arch/x86/boot/compressed/pkram.c        |  109 ++
 arch/x86/include/asm/numa.h             |    4 +
 arch/x86/kernel/setup.c                 |    3 +
 arch/x86/mm/init_64.c                   |    2 +
 arch/x86/mm/numa.c                      |   32 +-
 include/linux/memblock.h                |    6 +
 include/linux/mm.h                      |    2 +-
 include/linux/pkram.h                   |  120 ++
 include/linux/shmem_fs.h                |   28 +
 include/linux/swap.h                    |   13 +
 include/linux/xarray.h                  |    2 +
 kernel/kexec.c                          |    9 +
 kernel/kexec_core.c                     |    3 +
 kernel/kexec_file.c                     |   15 +
 lib/test_xarray.c                       |   45 +
 lib/xarray.c                            |  100 ++
 mm/Kconfig                              |    9 +
 mm/Makefile                             |    1 +
 mm/memblock.c                           |   11 +-
 mm/page_alloc.c                         |   55 +-
 mm/pkram.c                              | 1808 +++++++++++++++++++++++++++++++
 mm/pkram_pagetable.c                    |  376 +++++++
 mm/shmem.c                              |  494 ++++++++-
 mm/shmem_pkram.c                        |  530 +++++++++
 mm/swap.c                               |   86 ++
 30 files changed, 3869 insertions(+), 34 deletions(-)
 create mode 100644 arch/x86/boot/compressed/pkram.c
 create mode 100644 include/linux/pkram.h
 create mode 100644 mm/pkram.c
 create mode 100644 mm/pkram_pagetable.c
 create mode 100644 mm/shmem_pkram.c

-- 
1.8.3.1



* [RFC v2 01/43] mm: add PKRAM API stubs and Kconfig
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Preserved-across-kexec memory or PKRAM is a method for saving memory
pages of the currently executing kernel and restoring them after kexec
boot into a new one. This can be utilized for preserving guest VM state,
large in-memory databases, process memory, etc. across reboot. While
DRAM-as-PMEM or actual persistent memory could be used to accomplish
these things, PKRAM provides the latency of DRAM with the flexibility
of dynamically determining the amount of memory to preserve.

The proposed API:

 * Preserved memory is divided into nodes which can be saved or loaded
   independently of each other. The nodes are identified by unique name
   strings. A PKRAM node is created when save is initiated by calling
   pkram_prepare_save(). A PKRAM node is removed when load is initiated by
   calling pkram_prepare_load(). See below.

 * A node is further divided into objects. An object represents closely
   coupled data in the form of a grouping of pages and/or a stream of
   byte data.  For example, the pages and attributes of a file.
   After initiating an operation on a PKRAM node, PKRAM objects are
   initialized for saving or loading by calling pkram_prepare_save_obj()
   or pkram_prepare_load_obj().

 * For saving/loading data from a PKRAM node/object, instances of the
   pkram_stream and pkram_access structs are used.  pkram_stream tracks
   the node and object being operated on while pkram_access tracks the
   data type and position within an object.

   The pkram_stream struct is initialized by calling pkram_prepare_save()
   or pkram_prepare_load() and then pkram_prepare_save_obj() or
   pkram_prepare_load_obj().

   Once a pkram_stream is fully initialized, a pkram_access struct
   is initialized for each data type associated with the object.
   After save or load of a data type for the object is complete,
   pkram_finish_access() is called.

   After save or load is complete for the object, pkram_finish_save_obj()
   or pkram_finish_load_obj() must be called followed by pkram_finish_save()
   or pkram_finish_load() when save or load is completed for the node.
   If an error occurred during save, the saved data and the PKRAM node
   may be freed by calling pkram_discard_save() instead of
   pkram_finish_save().

 * Both page data and byte data can separately be streamed to a PKRAM
   object.  pkram_save_file_page() and pkram_load_file_page() are used
   to stream page data while pkram_write() and pkram_read() are used to
   stream byte data.

A sequence of operations for saving/loading data from PKRAM would
look like:

  * For saving data to PKRAM:

    /* create a PKRAM node and do initial stream setup */
    pkram_prepare_save()

    /* create a PKRAM object associated with the PKRAM node and complete stream initialization */
    pkram_prepare_save_obj()

    /* save data to the node/object */
    PKRAM_ACCESS(pa_pages,...)
    PKRAM_ACCESS(pa_bytes,...)
    pkram_save_file_page(pa_pages,...)[,...]  /* for file pages */
    pkram_write(pa_bytes,...)[,...]           /* for a byte stream */
    pkram_finish_access(pa_pages)
    pkram_finish_access(pa_bytes)

    pkram_finish_save_obj()

    /* commit the save or discard and delete the node */
    pkram_finish_save()          /* on success, or
    pkram_discard_save()          * ... in case of error */

  * For loading data from PKRAM:

    /* remove a PKRAM node from the list and do initial stream setup */
    pkram_prepare_load()

    /* Remove a PKRAM object from the node and complete stream initialization for loading data from it. */
    pkram_prepare_load_obj()

    /* load data from the node/object */
    PKRAM_ACCESS(pa_pages,...)
    PKRAM_ACCESS(pa_bytes,...)
    pkram_load_file_page(pa_pages,...)[,...] /* for file pages */
    pkram_read(pa_bytes,...)[,...]           /* for a byte stream */
    pkram_finish_access(pa_pages)
    pkram_finish_access(pa_bytes)

    /* free the object */
    pkram_finish_load_obj()

    /* free the node */
    pkram_finish_load()
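
To make the save sequence above concrete, here is a hypothetical caller built
only from the declarations this patch adds. The function name, the
PKRAM_DATA_pages/PKRAM_DATA_bytes flags, and the pages/bytes access types are
placeholders: this patch defines only PKRAM_DATA_none and stub
implementations, so the sketch only becomes buildable as later patches fill
in the types and identifiers.

    /* Hypothetical caller; assumes #include <linux/pkram.h>. */
    static int example_save(struct page *page, const void *attrs, size_t attr_len)
    {
        struct pkram_stream ps;
        ssize_t ret;
        int err;

        err = pkram_prepare_save(&ps, "example", GFP_KERNEL);
        if (err)
            return err;

        /* PKRAM_DATA_pages|PKRAM_DATA_bytes are placeholders for later patches. */
        err = pkram_prepare_save_obj(&ps, PKRAM_DATA_pages | PKRAM_DATA_bytes);
        if (err)
            goto discard;

        PKRAM_ACCESS(pa_pages, &ps, pages);
        PKRAM_ACCESS(pa_bytes, &ps, bytes);

        err = pkram_save_file_page(&pa_pages, page);  /* one call per file page */
        if (!err) {
            ret = pkram_write(&pa_bytes, attrs, attr_len);
            if (ret < 0)
                err = ret;
        }

        pkram_finish_access(&pa_pages, err == 0);
        pkram_finish_access(&pa_bytes, err == 0);
        pkram_finish_save_obj(&ps);

        if (err)
            goto discard;
        pkram_finish_save(&ps);       /* commit: the node becomes loadable */
        return 0;

    discard:
        pkram_discard_save(&ps);      /* drop the node and anything saved so far */
        return err;
    }

The load side follows the same shape, using the prepare/finish load calls
together with pkram_load_file_page() and pkram_read().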

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  47 +++++++++++++
 mm/Kconfig            |   9 +++
 mm/Makefile           |   1 +
 mm/pkram.c            | 179 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 236 insertions(+)
 create mode 100644 include/linux/pkram.h
 create mode 100644 mm/pkram.c

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
new file mode 100644
index 000000000000..a575da2d6c79
--- /dev/null
+++ b/include/linux/pkram.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PKRAM_H
+#define _LINUX_PKRAM_H
+
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/mm_types.h>
+
+/**
+ * enum pkram_data_flags - definition of data types contained in a pkram obj
+ * @PKRAM_DATA_none: No data types configured
+ */
+enum pkram_data_flags {
+	PKRAM_DATA_none		= 0x0,  /* No data types configured */
+};
+
+struct pkram_stream;
+struct pkram_access;
+
+#define PKRAM_NAME_MAX		256	/* including nul */
+
+int pkram_prepare_save(struct pkram_stream *ps, const char *name,
+		       gfp_t gfp_mask);
+int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags);
+
+void pkram_finish_save(struct pkram_stream *ps);
+void pkram_finish_save_obj(struct pkram_stream *ps);
+void pkram_discard_save(struct pkram_stream *ps);
+
+int pkram_prepare_load(struct pkram_stream *ps, const char *name);
+int pkram_prepare_load_obj(struct pkram_stream *ps);
+
+void pkram_finish_load(struct pkram_stream *ps);
+void pkram_finish_load_obj(struct pkram_stream *ps);
+
+#define PKRAM_ACCESS(name, stream, type)			\
+	struct pkram_access name
+
+void pkram_finish_access(struct pkram_access *pa, bool status_ok);
+
+int pkram_save_file_page(struct pkram_access *pa, struct page *page);
+struct page *pkram_load_file_page(struct pkram_access *pa, unsigned long *index);
+
+ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count);
+size_t pkram_read(struct pkram_access *pa, void *buf, size_t count);
+
+#endif /* _LINUX_PKRAM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..ea8242c91728 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,13 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config PKRAM
+	bool "Preserved-over-kexec memory storage"
+	default n
+	help
+	  This option adds the kernel API that enables saving memory pages of
+	  the currently executing kernel and restoring them after a kexec in
+	  the newly booted one. This can be utilized for speeding up reboot by
+	  leaving process memory and/or FS caches in-place.
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..ab3a724769b5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_PKRAM) += pkram.o
diff --git a/mm/pkram.c b/mm/pkram.c
new file mode 100644
index 000000000000..59e4661b2fb7
--- /dev/null
+++ b/mm/pkram.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/pkram.h>
+#include <linux/types.h>
+
+/**
+ * Create a preserved memory node with name @name and initialize stream @ps
+ * for saving data to it.
+ *
+ * @gfp_mask specifies the memory allocation mask to be used when saving data.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the save has finished, pkram_finish_save() (or pkram_discard_save() in
+ * case of failure) is to be called.
+ */
+int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask)
+{
+	return -ENOSYS;
+}
+
+/**
+ * Create a preserved memory object and initialize stream @ps for saving data
+ * to it.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the save has finished, pkram_finish_save_obj() (or pkram_discard_save()
+ * in case of failure) is to be called.
+ */
+int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
+{
+	return -ENOSYS;
+}
+
+/**
+ * Commit the object started with pkram_prepare_save_obj() to preserved memory.
+ */
+void pkram_finish_save_obj(struct pkram_stream *ps)
+{
+	BUG();
+}
+
+/**
+ * Commit the save to preserved memory started with pkram_prepare_save().
+ * After the call, the stream may not be used any more.
+ */
+void pkram_finish_save(struct pkram_stream *ps)
+{
+	BUG();
+}
+
+/**
+ * Cancel the save to preserved memory started with pkram_prepare_save() and
+ * destroy the corresponding preserved memory node freeing any data already
+ * saved to it.
+ */
+void pkram_discard_save(struct pkram_stream *ps)
+{
+	BUG();
+}
+
+/**
+ * Remove the preserved memory node with name @name and initialize stream @ps
+ * for loading data from it.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the load has finished, pkram_finish_load() is to be called.
+ */
+int pkram_prepare_load(struct pkram_stream *ps, const char *name)
+{
+	return -ENOSYS;
+}
+
+/**
+ * Remove the next preserved memory object from the stream @ps and
+ * initialize stream @ps for loading data from it.
+ *
+ * Returns 0 on success, -errno on failure.
+ *
+ * After the load has finished, pkram_finish_load_obj() is to be called.
+ */
+int pkram_prepare_load_obj(struct pkram_stream *ps)
+{
+	return -ENOSYS;
+}
+
+/**
+ * Finish the load of a preserved memory object started with
+ * pkram_prepare_load_obj() freeing the object and any data that has not
+ * been loaded from it.
+ */
+void pkram_finish_load_obj(struct pkram_stream *ps)
+{
+	BUG();
+}
+
+/**
+ * Finish the load from preserved memory started with pkram_prepare_load()
+ * freeing the corresponding preserved memory node and any data that has
+ * not been loaded from it.
+ */
+void pkram_finish_load(struct pkram_stream *ps)
+{
+	BUG();
+}
+
+/**
+ * Finish the data access to or from the preserved memory node and object
+ * associated with pkram stream access @pa.  The access must have been
+ * initialized with PKRAM_ACCESS(). 
+ */
+void pkram_finish_access(struct pkram_access *pa, bool status_ok)
+{
+	BUG();
+}
+
+/**
+ * Save file page @page to the preserved memory node and object associated
+ * with pkram stream access @pa. The stream must have been initialized with
+ * pkram_prepare_save() and pkram_prepare_save_obj() and access initialized
+ * with PKRAM_ACCESS().
+ *
+ * Returns 0 on success, -errno on failure.
+ */
+int pkram_save_file_page(struct pkram_access *pa, struct page *page)
+{
+	return -ENOSYS;
+}
+
+/**
+ * Load the next page from the preserved memory node and object associated
+ * with pkram stream access @pa. The stream must have been initialized with
+ * pkram_prepare_load() and pkram_prepare_load_obj() and access initialized
+ * with PKRAM_ACCESS().
+ *
+ * If not NULL, @index is initialized with the preserved mapping offset of the
+ * page loaded.
+ *
+ * Returns the page loaded or NULL if the node is empty.
+ *
+ * The page loaded has its refcount incremented.
+ */
+struct page *pkram_load_file_page(struct pkram_access *pa, unsigned long *index)
+{
+	return NULL;
+}
+
+/**
+ * Copy @count bytes from @buf to the preserved memory node and object
+ * associated with pkram stream access @pa. The stream must have been
+ * initialized with pkram_prepare_save() and pkram_prepare_save_obj()
+ * and access initialized with PKRAM_ACCESS().
+ *
+ * On success, returns the number of bytes written, which is always equal to
+ * @count. On failure, -errno is returned.
+ */
+ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count)
+{
+	return -ENOSYS;
+}
+
+/**
+ * Copy up to @count bytes from the preserved memory node and object
+ * associated with pkram stream access @pa to @buf. The stream must have been
+ * initialized with pkram_prepare_load() and pkram_prepare_load_obj() and
+ * access initialized with PKRAM_ACCESS().
+ *
+ * Returns the number of bytes read, which may be less than @count if the node
+ * has fewer bytes available.
+ */
+size_t pkram_read(struct pkram_access *pa, void *buf, size_t count)
+{
+	return 0;
+}
-- 
1.8.3.1



* [RFC v2 02/43] mm: PKRAM: implement node load and save functions
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Preserved memory is divided into nodes which can be saved and loaded
independently of each other. PKRAM nodes are kept on a list and
identified by unique names. Whenever a save operation is initiated by
calling pkram_prepare_save(), a new node is created and linked to the
list. When the save operation has been committed by calling
pkram_finish_save(), the node becomes loadable. A load operation can then
be initiated by calling pkram_prepare_load(), which deletes the node
from the list and prepares the corresponding stream for loading data
from it. After the load has been finished, the pkram_finish_load()
function must be called to free the node. Nodes are also deleted when a
save operation is discarded, i.e. pkram_discard_save() is called instead
of pkram_finish_save().
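
For illustration, a minimal sketch of the node lifecycle this patch implements
is below. Both halves are combined in one hypothetical function for brevity,
though in practice the save side runs before kexec and the load side runs in
the next kernel once later patches carry the node list across the reboot.
Object and data handling are still stubs here, so only the node-level calls
are shown.

    static int pkram_node_lifecycle_example(void)
    {
        struct pkram_stream ps;
        int err;

        /* Before kexec: create a node, save data to it, and commit it. */
        err = pkram_prepare_save(&ps, "example", GFP_KERNEL);
        if (err)                /* -ENAMETOOLONG, -ENOMEM or -EEXIST */
            return err;
        /* ... object and data save calls arrive in later patches ... */
        pkram_finish_save(&ps); /* the node is now loadable */

        /* After kexec: take the node off the list, load from it, free it. */
        err = pkram_prepare_load(&ps, "example");
        if (err)                /* -ENOENT or -EBUSY */
            return err;
        /* ... object and data load calls arrive in later patches ... */
        pkram_finish_load(&ps); /* frees the node page */
        return 0;
    }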

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |   8 ++-
 mm/pkram.c            | 148 ++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 150 insertions(+), 6 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index a575da2d6c79..01055a876450 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -6,6 +6,8 @@
 #include <linux/types.h>
 #include <linux/mm_types.h>
 
+struct pkram_node;
+
 /**
  * enum pkram_data_flags - definition of data types contained in a pkram obj
  * @PKRAM_DATA_none: No data types configured
@@ -14,7 +16,11 @@ enum pkram_data_flags {
 	PKRAM_DATA_none		= 0x0,  /* No data types configured */
 };
 
-struct pkram_stream;
+struct pkram_stream {
+	gfp_t gfp_mask;
+	struct pkram_node *node;
+};
+
 struct pkram_access;
 
 #define PKRAM_NAME_MAX		256	/* including nul */
diff --git a/mm/pkram.c b/mm/pkram.c
index 59e4661b2fb7..21976df6e0ea 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -2,16 +2,85 @@
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/kernel.h>
+#include <linux/list.h>
 #include <linux/mm.h>
+#include <linux/mutex.h>
 #include <linux/pkram.h>
+#include <linux/string.h>
 #include <linux/types.h>
 
+/*
+ * Preserved memory is divided into nodes that can be saved or loaded
+ * independently of each other. The nodes are identified by unique name
+ * strings.
+ *
+ * The structure occupies a memory page.
+ */
+struct pkram_node {
+	__u32	flags;
+
+	__u8	name[PKRAM_NAME_MAX];
+};
+
+#define PKRAM_SAVE		1
+#define PKRAM_LOAD		2
+#define PKRAM_ACCMODE_MASK	3
+
+static LIST_HEAD(pkram_nodes);			/* linked through page::lru */
+static DEFINE_MUTEX(pkram_mutex);		/* serializes open/close */
+
+static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
+{
+	return alloc_page(gfp_mask);
+}
+
+static inline void pkram_free_page(void *addr)
+{
+	free_page((unsigned long)addr);
+}
+
+static inline void pkram_insert_node(struct pkram_node *node)
+{
+	list_add(&virt_to_page(node)->lru, &pkram_nodes);
+}
+
+static inline void pkram_delete_node(struct pkram_node *node)
+{
+	list_del(&virt_to_page(node)->lru);
+}
+
+static struct pkram_node *pkram_find_node(const char *name)
+{
+	struct page *page;
+	struct pkram_node *node;
+
+	list_for_each_entry(page, &pkram_nodes, lru) {
+		node = page_address(page);
+		if (strcmp(node->name, name) == 0)
+			return node;
+	}
+	return NULL;
+}
+
+static void pkram_stream_init(struct pkram_stream *ps,
+			     struct pkram_node *node, gfp_t gfp_mask)
+{
+	memset(ps, 0, sizeof(*ps));
+	ps->gfp_mask = gfp_mask;
+	ps->node = node;
+}
+
 /**
  * Create a preserved memory node with name @name and initialize stream @ps
  * for saving data to it.
  *
  * @gfp_mask specifies the memory allocation mask to be used when saving data.
  *
+ * Error values:
+ *	%ENAMETOOLONG: name len >= PKRAM_NAME_MAX
+ *	%ENOMEM: insufficient memory available
+ *	%EEXIST: node with specified name already exists
+ *
  * Returns 0 on success, -errno on failure.
  *
  * After the save has finished, pkram_finish_save() (or pkram_discard_save() in
@@ -19,7 +88,34 @@
  */
 int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask)
 {
-	return -ENOSYS;
+	struct page *page;
+	struct pkram_node *node;
+	int err = 0;
+
+	if (strlen(name) >= PKRAM_NAME_MAX)
+		return -ENAMETOOLONG;
+
+	page = pkram_alloc_page(gfp_mask | __GFP_ZERO);
+	if (!page)
+		return -ENOMEM;
+	node = page_address(page);
+
+	node->flags = PKRAM_SAVE;
+	strcpy(node->name, name);
+
+	mutex_lock(&pkram_mutex);
+	if (!pkram_find_node(name))
+		pkram_insert_node(node);
+	else
+		err = -EEXIST;
+	mutex_unlock(&pkram_mutex);
+	if (err) {
+		pkram_free_page(node);
+		return err;
+	}
+
+	pkram_stream_init(ps, node, gfp_mask);
+	return 0;
 }
 
 /**
@@ -50,7 +146,12 @@ void pkram_finish_save_obj(struct pkram_stream *ps)
  */
 void pkram_finish_save(struct pkram_stream *ps)
 {
-	BUG();
+	struct pkram_node *node = ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	smp_wmb();
+	node->flags &= ~PKRAM_ACCMODE_MASK;
 }
 
 /**
@@ -60,7 +161,15 @@ void pkram_finish_save(struct pkram_stream *ps)
  */
 void pkram_discard_save(struct pkram_stream *ps)
 {
-	BUG();
+	struct pkram_node *node = ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	mutex_lock(&pkram_mutex);
+	pkram_delete_node(node);
+	mutex_unlock(&pkram_mutex);
+
+	pkram_free_page(node);
 }
 
 /**
@@ -69,11 +178,36 @@ void pkram_discard_save(struct pkram_stream *ps)
  *
  * Returns 0 on success, -errno on failure.
  *
+ * Error values:
+ *	%ENOENT: node with specified name does not exist
+ *	%EBUSY: save to required node has not finished yet
+ *
  * After the load has finished, pkram_finish_load() is to be called.
  */
 int pkram_prepare_load(struct pkram_stream *ps, const char *name)
 {
-	return -ENOSYS;
+	struct pkram_node *node;
+	int err = 0;
+
+	mutex_lock(&pkram_mutex);
+	node = pkram_find_node(name);
+	if (!node) {
+		err = -ENOENT;
+		goto out_unlock;
+	}
+	if (node->flags & PKRAM_ACCMODE_MASK) {
+		err = -EBUSY;
+		goto out_unlock;
+	}
+	pkram_delete_node(node);
+out_unlock:
+	mutex_unlock(&pkram_mutex);
+	if (err)
+		return err;
+
+	node->flags |= PKRAM_LOAD;
+	pkram_stream_init(ps, node, 0);
+	return 0;
 }
 
 /**
@@ -106,7 +240,11 @@ void pkram_finish_load_obj(struct pkram_stream *ps)
  */
 void pkram_finish_load(struct pkram_stream *ps)
 {
-	BUG();
+	struct pkram_node *node = ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	pkram_free_page(node);
 }
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 03/43] mm: PKRAM: implement object load and save functions
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

PKRAM nodes are further divided into a list of objects. After a save
operation has been initiated for a node, a save operation for an object
associated with the node is initiated by calling pkram_prepare_save_obj().
A new object is created and linked to the node.  The save operation for
the object is committed by calling pkram_finish_save_obj().  After a load
operation has been initiated, pkram_prepare_load_obj() is called to
delete the next object from the node and prepare the corresponding
stream for loading data from it.  After the load of an object has
finished, pkram_finish_load_obj() is called to free the object.  Objects
are also deleted when a save operation is discarded.
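
As a rough sketch of how these calls are intended to nest (illustrative
only, not part of the patch; the node name, GFP flags, and error
handling are simplified):

    struct pkram_stream ps;
    int err;

    err = pkram_prepare_save(&ps, "example-node", GFP_KERNEL);
    if (err)
            return err;

    err = pkram_prepare_save_obj(&ps, PKRAM_DATA_none);
    if (err) {
            pkram_discard_save(&ps);
            return err;
    }
    /* ...save data belonging to this object (added by later patches)... */
    pkram_finish_save_obj(&ps);
    pkram_finish_save(&ps);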

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  2 ++
 mm/pkram.c            | 72 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 01055a876450..a4d55af392c0 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -7,6 +7,7 @@
 #include <linux/mm_types.h>
 
 struct pkram_node;
+struct pkram_obj;
 
 /**
  * enum pkram_data_flags - definition of data types contained in a pkram obj
@@ -19,6 +20,7 @@ enum pkram_data_flags {
 struct pkram_stream {
 	gfp_t gfp_mask;
 	struct pkram_node *node;
+	struct pkram_obj *obj;
 };
 
 struct pkram_access;
diff --git a/mm/pkram.c b/mm/pkram.c
index 21976df6e0ea..7c977c5982f8 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -6,9 +6,14 @@
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/pkram.h>
+#include <linux/sched.h>
 #include <linux/string.h>
 #include <linux/types.h>
 
+struct pkram_obj {
+	__u64   obj_pfn;	/* points to the next object in the list */
+};
+
 /*
  * Preserved memory is divided into nodes that can be saved or loaded
  * independently of each other. The nodes are identified by unique name
@@ -18,6 +23,7 @@
  */
 struct pkram_node {
 	__u32	flags;
+	__u64	obj_pfn;	/* points to the first obj of the node */
 
 	__u8	name[PKRAM_NAME_MAX];
 };
@@ -62,6 +68,21 @@ static struct pkram_node *pkram_find_node(const char *name)
 	return NULL;
 }
 
+static void pkram_truncate_node(struct pkram_node *node)
+{
+	unsigned long obj_pfn;
+	struct pkram_obj *obj;
+
+	obj_pfn = node->obj_pfn;
+	while (obj_pfn) {
+		obj = pfn_to_kaddr(obj_pfn);
+		obj_pfn = obj->obj_pfn;
+		pkram_free_page(obj);
+		cond_resched();
+	}
+	node->obj_pfn = 0;
+}
+
 static void pkram_stream_init(struct pkram_stream *ps,
 			     struct pkram_node *node, gfp_t gfp_mask)
 {
@@ -124,12 +145,31 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask
  *
  * Returns 0 on success, -errno on failure.
  *
+ * Error values:
+ *	%ENOMEM: insufficient memory available
+ *
  * After the save has finished, pkram_finish_save_obj() (or pkram_discard_save()
  * in case of failure) is to be called.
  */
 int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 {
-	return -ENOSYS;
+	struct pkram_node *node = ps->node;
+	struct pkram_obj *obj;
+	struct page *page;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	page = pkram_alloc_page(ps->gfp_mask | __GFP_ZERO);
+	if (!page)
+		return -ENOMEM;
+	obj = page_address(page);
+
+	if (node->obj_pfn)
+		obj->obj_pfn = node->obj_pfn;
+	node->obj_pfn = page_to_pfn(page);
+
+	ps->obj = obj;
+	return 0;
 }
 
 /**
@@ -137,7 +177,9 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
  */
 void pkram_finish_save_obj(struct pkram_stream *ps)
 {
-	BUG();
+	struct pkram_node *node = ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 }
 
 /**
@@ -169,6 +211,7 @@ void pkram_discard_save(struct pkram_stream *ps)
 	pkram_delete_node(node);
 	mutex_unlock(&pkram_mutex);
 
+	pkram_truncate_node(node);
 	pkram_free_page(node);
 }
 
@@ -216,11 +259,26 @@ int pkram_prepare_load(struct pkram_stream *ps, const char *name)
  *
  * Returns 0 on success, -errno on failure.
  *
+ * Error values:
+ *	%ENODATA: Stream @ps has no preserved memory objects
+ *
  * After the load has finished, pkram_finish_load_obj() is to be called.
  */
 int pkram_prepare_load_obj(struct pkram_stream *ps)
 {
-	return -ENOSYS;
+	struct pkram_node *node = ps->node;
+	struct pkram_obj *obj;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	if (!node->obj_pfn)
+		return -ENODATA;
+
+	obj = pfn_to_kaddr(node->obj_pfn);
+	node->obj_pfn = obj->obj_pfn;
+
+	ps->obj = obj;
+	return 0;
 }
 
 /**
@@ -230,7 +288,12 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
  */
 void pkram_finish_load_obj(struct pkram_stream *ps)
 {
-	BUG();
+	struct pkram_node *node = ps->node;
+	struct pkram_obj *obj = ps->obj;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	pkram_free_page(obj);
 }
 
 /**
@@ -244,6 +307,7 @@ void pkram_finish_load(struct pkram_stream *ps)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
 
+	pkram_truncate_node(node);
 	pkram_free_page(node);
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 04/43] mm: PKRAM: implement page stream operations
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Using the pkram_save_file_page() function, one can populate PKRAM objects
with in-memory pages which can later be loaded using the pkram_load_file_page()
function. Saving a page to PKRAM is accomplished by recording its pfn and
mapping index and incrementing its refcount so that it will not be freed
after the last user puts it.
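
For illustration only (not part of the patch), a sketch of the intended
save path, assuming a stream pointer 'ps' already prepared with
pkram_prepare_save()/pkram_prepare_save_obj(PKRAM_DATA_pages) and a
tmpfs address_space 'mapping' to take a page from:

    PKRAM_ACCESS(pa, ps, pages);
    struct page *page;
    int err = 0;

    page = find_get_page(mapping, 0);       /* page at index 0, if any */
    if (page) {
            err = pkram_save_file_page(&pa, page);
            put_page(page);  /* drop the lookup ref; PKRAM took one on success */
    }
    pkram_finish_access(&pa, err == 0);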

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  42 +++++++-
 mm/pkram.c            | 282 +++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 317 insertions(+), 7 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index a4d55af392c0..9d8a6fd96dd9 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -8,22 +8,47 @@
 
 struct pkram_node;
 struct pkram_obj;
+struct pkram_link;
 
 /**
  * enum pkram_data_flags - definition of data types contained in a pkram obj
  * @PKRAM_DATA_none: No data types configured
+ * @PKRAM_DATA_pages: obj contains file page data
  */
 enum pkram_data_flags {
-	PKRAM_DATA_none		= 0x0,  /* No data types configured */
+	PKRAM_DATA_none		= 0x0,	/* No data types configured */
+	PKRAM_DATA_pages	= 0x1,	/* Contains file page data */
+};
+
+struct pkram_data_stream {
+	/* List of link pages to add/remove from */
+	__u64 *head_link_pfnp;
+	__u64 *tail_link_pfnp;
+
+	struct pkram_link *link;	/* current link */
+	unsigned int entry_idx;		/* next entry in link */
 };
 
 struct pkram_stream {
 	gfp_t gfp_mask;
 	struct pkram_node *node;
 	struct pkram_obj *obj;
+
+	__u64 *pages_head_link_pfnp;
+	__u64 *pages_tail_link_pfnp;
+};
+
+struct pkram_pages_access {
+	unsigned long next_index;
 };
 
-struct pkram_access;
+struct pkram_access {
+	enum pkram_data_flags dtype;
+	struct pkram_stream *ps;
+	struct pkram_data_stream pds;
+
+	struct pkram_pages_access pages;
+};
 
 #define PKRAM_NAME_MAX		256	/* including nul */
 
@@ -41,8 +66,19 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 void pkram_finish_load(struct pkram_stream *ps);
 void pkram_finish_load_obj(struct pkram_stream *ps);
 
+#define PKRAM_PDS_INIT(name, stream, type) {			\
+	.head_link_pfnp=(stream)->type##_head_link_pfnp,	\
+	.tail_link_pfnp=(stream)->type##_tail_link_pfnp,	\
+	}
+
+#define PKRAM_ACCESS_INIT(name, stream, type) {			\
+	.dtype = PKRAM_DATA_##type,				\
+	.ps = (stream),						\
+	.pds = PKRAM_PDS_INIT(name, stream, type),		\
+	}
+
 #define PKRAM_ACCESS(name, stream, type)			\
-	struct pkram_access name
+	struct pkram_access name = PKRAM_ACCESS_INIT(name, stream, type)
 
 void pkram_finish_access(struct pkram_access *pa, bool status_ok);
 
diff --git a/mm/pkram.c b/mm/pkram.c
index 7c977c5982f8..9c42db66d022 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/err.h>
 #include <linux/gfp.h>
+#include <linux/io.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/mm.h>
@@ -10,8 +11,39 @@
 #include <linux/string.h>
 #include <linux/types.h>
 
+#include "internal.h"
+
+
+/*
+ * Represents a reference to a data page saved to PKRAM.
+ */
+typedef __u64 pkram_entry_t;
+
+#define PKRAM_ENTRY_FLAGS_SHIFT	0x5
+#define PKRAM_ENTRY_FLAGS_MASK	0x7f
+
+/*
+ * Keeps references to data pages saved to PKRAM.
+ * The structure occupies a memory page.
+ */
+struct pkram_link {
+	__u64	link_pfn;	/* points to the next link of the object */
+	__u64	index;		/* mapping index of first pkram_entry_t */
+
+	/*
+	 * the array occupies the rest of the link page; if the link is not
+	 * full, the rest of the array must be filled with zeros
+	 */
+	pkram_entry_t entry[0];
+};
+
+#define PKRAM_LINK_ENTRIES_MAX \
+	((PAGE_SIZE-sizeof(struct pkram_link))/sizeof(pkram_entry_t))
+
 struct pkram_obj {
-	__u64   obj_pfn;	/* points to the next object in the list */
+	__u64	pages_head_link_pfn;	/* the first pages link of the object */
+	__u64	pages_tail_link_pfn;	/* the last pages link of the object */
+	__u64	obj_pfn;	/* points to the next object in the list */
 };
 
 /*
@@ -19,6 +51,10 @@ struct pkram_obj {
  * independently of each other. The nodes are identified by unique name
  * strings.
  *
+ * References to data pages saved to a preserved memory node are kept in a
+ * singly-linked list of PKRAM link structures (see above); the node has a
+ * pointer to the head of the list.
+ *
  * The structure occupies a memory page.
  */
 struct pkram_node {
@@ -68,6 +104,41 @@ static struct pkram_node *pkram_find_node(const char *name)
 	return NULL;
 }
 
+static void pkram_truncate_link(struct pkram_link *link)
+{
+	struct page *page;
+	pkram_entry_t p;
+	int i;
+
+	for (i = 0; i < PKRAM_LINK_ENTRIES_MAX; i++) {
+		p = link->entry[i];
+		if (!p)
+			continue;
+		page = pfn_to_page(PHYS_PFN(p));
+		put_page(page);
+	}
+}
+
+static void pkram_truncate_links(unsigned long link_pfn)
+{
+	struct pkram_link *link;
+
+	while (link_pfn) {
+		link = pfn_to_kaddr(link_pfn);
+		pkram_truncate_link(link);
+		link_pfn = link->link_pfn;
+		pkram_free_page(link);
+		cond_resched();
+	}
+}
+
+static void pkram_truncate_obj(struct pkram_obj *obj)
+{
+	pkram_truncate_links(obj->pages_head_link_pfn);
+	obj->pages_head_link_pfn = 0;
+	obj->pages_tail_link_pfn = 0;
+}
+
 static void pkram_truncate_node(struct pkram_node *node)
 {
 	unsigned long obj_pfn;
@@ -76,6 +147,7 @@ static void pkram_truncate_node(struct pkram_node *node)
 	obj_pfn = node->obj_pfn;
 	while (obj_pfn) {
 		obj = pfn_to_kaddr(obj_pfn);
+		pkram_truncate_obj(obj);
 		obj_pfn = obj->obj_pfn;
 		pkram_free_page(obj);
 		cond_resched();
@@ -83,6 +155,83 @@ static void pkram_truncate_node(struct pkram_node *node)
 	node->obj_pfn = 0;
 }
 
+static void pkram_add_link(struct pkram_link *link, struct pkram_data_stream *pds)
+{
+	__u64 link_pfn = page_to_pfn(virt_to_page(link));
+
+	if (!*pds->head_link_pfnp) {
+		*pds->head_link_pfnp = link_pfn;
+		*pds->tail_link_pfnp = link_pfn;
+	} else {
+		struct pkram_link *tail = pfn_to_kaddr(*pds->tail_link_pfnp);
+
+		tail->link_pfn = link_pfn;
+		*pds->tail_link_pfnp = link_pfn;
+	}
+}
+
+static struct pkram_link *pkram_remove_link(struct pkram_data_stream *pds)
+{
+	struct pkram_link *link;
+
+	if (!*pds->head_link_pfnp)
+		return NULL;
+
+	link = pfn_to_kaddr(*pds->head_link_pfnp);
+	*pds->head_link_pfnp = link->link_pfn;
+	if (!*pds->head_link_pfnp)
+		*pds->tail_link_pfnp = 0;
+	else
+		link->link_pfn = 0;
+
+	return link;
+}
+
+static struct pkram_link *pkram_new_link(struct pkram_data_stream *pds, gfp_t gfp_mask)
+{
+	struct pkram_link *link;
+	struct page *link_page;
+
+	link_page = pkram_alloc_page((gfp_mask & GFP_RECLAIM_MASK) |
+				    __GFP_ZERO);
+	if (!link_page)
+		return NULL;
+
+	link = page_address(link_page);
+	pkram_add_link(link, pds);
+	pds->link = link;
+	pds->entry_idx = 0;
+
+	return link;
+}
+
+static void pkram_add_link_entry(struct pkram_data_stream *pds, struct page *page)
+{
+	struct pkram_link *link = pds->link;
+	pkram_entry_t p;
+	short flags = 0;
+
+	p = page_to_phys(page);
+	p |= ((flags & PKRAM_ENTRY_FLAGS_MASK) << PKRAM_ENTRY_FLAGS_SHIFT);
+	link->entry[pds->entry_idx] = p;
+	pds->entry_idx++;
+}
+
+static int pkram_next_link(struct pkram_data_stream *pds, struct pkram_link **linkp)
+{
+	struct pkram_link *link;
+
+	link = pkram_remove_link(pds);
+	if (!link)
+		return -ENODATA;
+
+	pds->link = link;
+	pds->entry_idx = 0;
+	*linkp = link;
+
+	return 0;
+}
+
 static void pkram_stream_init(struct pkram_stream *ps,
 			     struct pkram_node *node, gfp_t gfp_mask)
 {
@@ -159,6 +308,9 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
+	if (flags & ~PKRAM_DATA_pages)
+		return -EINVAL;
+
 	page = pkram_alloc_page(ps->gfp_mask | __GFP_ZERO);
 	if (!page)
 		return -ENOMEM;
@@ -168,6 +320,10 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 		obj->obj_pfn = node->obj_pfn;
 	node->obj_pfn = page_to_pfn(page);
 
+	if (flags & PKRAM_DATA_pages) {
+		ps->pages_head_link_pfnp = &obj->pages_head_link_pfn;
+		ps->pages_tail_link_pfnp = &obj->pages_tail_link_pfn;
+	}
 	ps->obj = obj;
 	return 0;
 }
@@ -275,8 +431,17 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
 		return -ENODATA;
 
 	obj = pfn_to_kaddr(node->obj_pfn);
+	if (!obj->pages_head_link_pfn) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
 	node->obj_pfn = obj->obj_pfn;
 
+	if (obj->pages_head_link_pfn) {
+		ps->pages_head_link_pfnp = &obj->pages_head_link_pfn;
+		ps->pages_tail_link_pfnp = &obj->pages_tail_link_pfn;
+	}
 	ps->obj = obj;
 	return 0;
 }
@@ -293,6 +458,7 @@ void pkram_finish_load_obj(struct pkram_stream *ps)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
 
+	pkram_truncate_obj(obj);
 	pkram_free_page(obj);
 }
 
@@ -318,7 +484,41 @@ void pkram_finish_load(struct pkram_stream *ps)
  */
 void pkram_finish_access(struct pkram_access *pa, bool status_ok)
 {
-	BUG();
+	if (status_ok)
+		return;
+
+	if (pa->ps->node->flags == PKRAM_SAVE)
+		return;
+
+	if (pa->pds.link)
+		pkram_truncate_link(pa->pds.link);
+}
+
+/*
+ * Add file page to a PKRAM obj allocating a new PKRAM link if necessary.
+ */
+static int __pkram_save_page(struct pkram_access *pa, struct page *page,
+			     unsigned long index)
+{
+	struct pkram_data_stream *pds = &pa->pds;
+	struct pkram_link *link = pds->link;
+
+	if (!link || pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX ||
+	    index != pa->pages.next_index) {
+		link = pkram_new_link(pds, pa->ps->gfp_mask);
+		if (!link)
+			return -ENOMEM;
+
+		pa->pages.next_index = link->index = index;
+	}
+
+	get_page(page);
+
+	pkram_add_link_entry(pds, page);
+
+	pa->pages.next_index++;
+
+	return 0;
 }
 
 /**
@@ -328,10 +528,80 @@ void pkram_finish_access(struct pkram_access *pa, bool status_ok)
  * with PKRAM_ACCESS().
  *
  * Returns 0 on success, -errno on failure.
+ *
+ * Error values:
+ *	%ENOMEM: insufficient amount of memory available
+ *
+ * Saving a page to preserved memory is simply incrementing its refcount so
+ * that it will not get freed after the last user puts it. That means it is
+ * safe to use the page as usual after it has been saved.
  */
 int pkram_save_file_page(struct pkram_access *pa, struct page *page)
 {
-	return -ENOSYS;
+	struct pkram_node *node = pa->ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	BUG_ON(PageCompound(page));
+
+	return __pkram_save_page(pa, page, page->index);
+}
+
+static struct page *__pkram_prep_load_page(pkram_entry_t p)
+{
+	struct page *page;
+	short flags;
+
+	flags = (p >> PKRAM_ENTRY_FLAGS_SHIFT) & PKRAM_ENTRY_FLAGS_MASK;
+	page = pfn_to_page(PHYS_PFN(p));
+
+	return page;
+}
+
+/*
+ * Extract the next page from preserved memory freeing a PKRAM link if it
+ * becomes empty.
+ */
+static struct page *__pkram_load_page(struct pkram_access *pa, unsigned long *index)
+{
+	struct pkram_data_stream *pds = &pa->pds;
+	struct pkram_link *link = pds->link;
+	struct page *page;
+	pkram_entry_t p;
+	int ret;
+
+	if (!link) {
+		ret = pkram_next_link(pds, &link);
+		if (ret)
+			return NULL;	// XXX return error value?
+
+		if (index)
+			pa->pages.next_index = link->index;
+	}
+
+	BUG_ON(pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX);
+
+	p = link->entry[pds->entry_idx];
+	BUG_ON(!p);
+
+	page = __pkram_prep_load_page(p);
+
+	if (index) {
+		*index = pa->pages.next_index;
+		pa->pages.next_index++;
+	}
+
+	/* clear to avoid double free (see pkram_truncate_link()) */
+	link->entry[pds->entry_idx] = 0;
+
+	pds->entry_idx++;
+	if (pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX ||
+	    !link->entry[pds->entry_idx]) {
+		pds->link = NULL;
+		pkram_free_page(link);
+	}
+
+	return page;
 }
 
 /**
@@ -349,7 +619,11 @@ int pkram_save_file_page(struct pkram_access *pa, struct page *page)
  */
 struct page *pkram_load_file_page(struct pkram_access *pa, unsigned long *index)
 {
-	return NULL;
+	struct pkram_node *node = pa->ps->node;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	return __pkram_load_page(pa, index);
 }
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 05/43] mm: PKRAM: support preserving transparent hugepages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Support preserving a transparent hugepage by recording the page order and
a flag indicating it is a THP.  Use these values when the page is
restored to reconstruct the THP.
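
As an illustration (not from the patch): the head page of a 2MB THP is
2MB-aligned, so the low bits of page_to_phys() are free to hold the
extra metadata.  For a head page at physical address 0x100000000,
compound_order() is 9 and flags is PKRAM_PAGE_TRANS_HUGE (0x1), so the
encoded entry is 0x100000000 | 9 | (0x1 << PKRAM_ENTRY_FLAGS_SHIFT) =
0x100000029.  On load, the order (p & PKRAM_ENTRY_ORDER_MASK) and the
flag ((p >> PKRAM_ENTRY_FLAGS_SHIFT) & PKRAM_ENTRY_FLAGS_MASK) are
recovered before prep_compound_page() and prep_transhuge_page()
reconstruct the THP.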

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index 9c42db66d022..da44a6060c5f 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -21,6 +21,9 @@
 
 #define PKRAM_ENTRY_FLAGS_SHIFT	0x5
 #define PKRAM_ENTRY_FLAGS_MASK	0x7f
+#define PKRAM_ENTRY_ORDER_MASK	0x1f
+
+#define PKRAM_PAGE_TRANS_HUGE	0x1	/* page is a transparent hugepage */
 
 /*
  * Keeps references to data pages saved to PKRAM.
@@ -211,7 +214,11 @@ static void pkram_add_link_entry(struct pkram_data_stream *pds, struct page *pag
 	pkram_entry_t p;
 	short flags = 0;
 
+	if (PageTransHuge(page))
+		flags |= PKRAM_PAGE_TRANS_HUGE;
+
 	p = page_to_phys(page);
+	p |= compound_order(page);
 	p |= ((flags & PKRAM_ENTRY_FLAGS_MASK) << PKRAM_ENTRY_FLAGS_SHIFT);
 	link->entry[pds->entry_idx] = p;
 	pds->entry_idx++;
@@ -516,7 +523,7 @@ static int __pkram_save_page(struct pkram_access *pa, struct page *page,
 
 	pkram_add_link_entry(pds, page);
 
-	pa->pages.next_index++;
+	pa->pages.next_index += compound_nr(page);
 
 	return 0;
 }
@@ -542,19 +549,24 @@ int pkram_save_file_page(struct pkram_access *pa, struct page *page)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
-	BUG_ON(PageCompound(page));
-
 	return __pkram_save_page(pa, page, page->index);
 }
 
 static struct page *__pkram_prep_load_page(pkram_entry_t p)
 {
 	struct page *page;
+	int order;
 	short flags;
 
 	flags = (p >> PKRAM_ENTRY_FLAGS_SHIFT) & PKRAM_ENTRY_FLAGS_MASK;
 	page = pfn_to_page(PHYS_PFN(p));
 
+	if (flags & PKRAM_PAGE_TRANS_HUGE) {
+		order = p & PKRAM_ENTRY_ORDER_MASK;
+		prep_compound_page(page, order);
+		prep_transhuge_page(page);
+	}
+
 	return page;
 }
 
@@ -588,7 +600,7 @@ static struct page *__pkram_load_page(struct pkram_access *pa, unsigned long *in
 
 	if (index) {
 		*index = pa->pages.next_index;
-		pa->pages.next_index++;
+		pa->pages.next_index += compound_nr(page);
 	}
 
 	/* clear to avoid double free (see pkram_truncate_link()) */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 06/43] mm: PKRAM: implement byte stream operations
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

This patch adds the ability to save arbitrary byte streams to a PKRAM
object using pkram_write() so that they can be restored later using
pkram_read().
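
A rough sketch of the intended usage (illustrative only, not part of
the patch; 'ps' and 'lps' are assumed to be stream pointers prepared
for save and load with PKRAM_DATA_bytes, and 'struct my_state' is a
made-up payload):

    /* Save side */
    struct my_state state = { 0 };
    PKRAM_ACCESS(pa, ps, bytes);
    ssize_t ret;

    ret = pkram_write(&pa, &state, sizeof(state));
    pkram_finish_access(&pa, ret == sizeof(state));

    /* Load side, after kexec */
    PKRAM_ACCESS(la, lps, bytes);
    if (pkram_read(&la, &state, sizeof(state)) != sizeof(state))
            pr_warn("pkram: short read\n");  /* handle truncated data */
    pkram_finish_access(&la, true);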

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  11 +++++
 mm/pkram.c            | 123 ++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 9d8a6fd96dd9..4f95d4fb5339 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -14,10 +14,12 @@
  * enum pkram_data_flags - definition of data types contained in a pkram obj
  * @PKRAM_DATA_none: No data types configured
  * @PKRAM_DATA_pages: obj contains file page data
+ * @PKRAM_DATA_bytes: obj contains byte data
  */
 enum pkram_data_flags {
 	PKRAM_DATA_none		= 0x0,	/* No data types configured */
 	PKRAM_DATA_pages	= 0x1,	/* Contains file page data */
+	PKRAM_DATA_bytes	= 0x2,	/* Contains byte data */
 };
 
 struct pkram_data_stream {
@@ -36,18 +38,27 @@ struct pkram_stream {
 
 	__u64 *pages_head_link_pfnp;
 	__u64 *pages_tail_link_pfnp;
+
+	__u64 *bytes_head_link_pfnp;
+	__u64 *bytes_tail_link_pfnp;
 };
 
 struct pkram_pages_access {
 	unsigned long next_index;
 };
 
+struct pkram_bytes_access {
+	struct page *data_page;		/* current page */
+	unsigned int data_offset;	/* offset into current page */
+};
+
 struct pkram_access {
 	enum pkram_data_flags dtype;
 	struct pkram_stream *ps;
 	struct pkram_data_stream pds;
 
 	struct pkram_pages_access pages;
+	struct pkram_bytes_access bytes;
 };
 
 #define PKRAM_NAME_MAX		256	/* including nul */
diff --git a/mm/pkram.c b/mm/pkram.c
index da44a6060c5f..d81af26c9a66 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/err.h>
 #include <linux/gfp.h>
+#include <linux/highmem.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
@@ -46,6 +47,9 @@ struct pkram_link {
 struct pkram_obj {
 	__u64	pages_head_link_pfn;	/* the first pages link of the object */
 	__u64	pages_tail_link_pfn;	/* the last pages link of the object */
+	__u64	bytes_head_link_pfn;	/* the first bytes link of the object */
+	__u64	bytes_tail_link_pfn;	/* the last bytes link of the object */
+	__u64	data_len;	/* byte data size */
 	__u64	obj_pfn;	/* points to the next object in the list */
 };
 
@@ -140,6 +144,11 @@ static void pkram_truncate_obj(struct pkram_obj *obj)
 	pkram_truncate_links(obj->pages_head_link_pfn);
 	obj->pages_head_link_pfn = 0;
 	obj->pages_tail_link_pfn = 0;
+
+	pkram_truncate_links(obj->bytes_head_link_pfn);
+	obj->bytes_head_link_pfn = 0;
+	obj->bytes_tail_link_pfn = 0;
+	obj->data_len = 0;
 }
 
 static void pkram_truncate_node(struct pkram_node *node)
@@ -315,7 +324,7 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
-	if (flags & ~PKRAM_DATA_pages)
+	if (flags & ~(PKRAM_DATA_pages | PKRAM_DATA_bytes))
 		return -EINVAL;
 
 	page = pkram_alloc_page(ps->gfp_mask | __GFP_ZERO);
@@ -331,6 +340,10 @@ int pkram_prepare_save_obj(struct pkram_stream *ps, enum pkram_data_flags flags)
 		ps->pages_head_link_pfnp = &obj->pages_head_link_pfn;
 		ps->pages_tail_link_pfnp = &obj->pages_tail_link_pfn;
 	}
+	if (flags & PKRAM_DATA_bytes) {
+		ps->bytes_head_link_pfnp = &obj->bytes_head_link_pfn;
+		ps->bytes_tail_link_pfnp = &obj->bytes_tail_link_pfn;
+	}
 	ps->obj = obj;
 	return 0;
 }
@@ -438,7 +451,7 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
 		return -ENODATA;
 
 	obj = pfn_to_kaddr(node->obj_pfn);
-	if (!obj->pages_head_link_pfn) {
+	if (!obj->pages_head_link_pfn && !obj->bytes_head_link_pfn) {
 		WARN_ON(1);
 		return -EINVAL;
 	}
@@ -449,6 +462,10 @@ int pkram_prepare_load_obj(struct pkram_stream *ps)
 		ps->pages_head_link_pfnp = &obj->pages_head_link_pfn;
 		ps->pages_tail_link_pfnp = &obj->pages_tail_link_pfn;
 	}
+	if (obj->bytes_head_link_pfn) {
+		ps->bytes_head_link_pfnp = &obj->bytes_head_link_pfn;
+		ps->bytes_tail_link_pfnp = &obj->bytes_tail_link_pfn;
+	}
 	ps->obj = obj;
 	return 0;
 }
@@ -499,6 +516,9 @@ void pkram_finish_access(struct pkram_access *pa, bool status_ok)
 
 	if (pa->pds.link)
 		pkram_truncate_link(pa->pds.link);
+
+	if ((pa->dtype == PKRAM_DATA_bytes) && (pa->bytes.data_page))
+		pkram_free_page(page_address(pa->bytes.data_page));
 }
 
 /*
@@ -552,6 +572,22 @@ int pkram_save_file_page(struct pkram_access *pa, struct page *page)
 	return __pkram_save_page(pa, page, page->index);
 }
 
+static int __pkram_bytes_save_page(struct pkram_access *pa, struct page *page)
+{
+	struct pkram_data_stream *pds = &pa->pds;
+	struct pkram_link *link = pds->link;
+
+	if (!link || pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX) {
+		link = pkram_new_link(pds, pa->ps->gfp_mask);
+		if (!link)
+			return -ENOMEM;
+	}
+
+	pkram_add_link_entry(pds, page);
+
+	return 0;
+}
+
 static struct page *__pkram_prep_load_page(pkram_entry_t p)
 {
 	struct page *page;
@@ -646,10 +682,53 @@ struct page *pkram_load_file_page(struct pkram_access *pa, unsigned long *index)
  *
  * On success, returns the number of bytes written, which is always equal to
  * @count. On failure, -errno is returned.
+ *
+ * Error values:
+ *    %ENOMEM: insufficient amount of memory available
  */
 ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count)
 {
-	return -ENOSYS;
+	struct pkram_node *node = pa->ps->node;
+	struct pkram_obj *obj = pa->ps->obj;
+	size_t copy_count, write_count = 0;
+	void *addr;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
+
+	while (count > 0) {
+		if (!pa->bytes.data_page) {
+			gfp_t gfp_mask = pa->ps->gfp_mask;
+			struct page *page;
+			int err;
+
+			page = pkram_alloc_page((gfp_mask & GFP_RECLAIM_MASK) |
+					       __GFP_HIGHMEM | __GFP_ZERO);
+			if (!page)
+				return -ENOMEM;
+			err = __pkram_bytes_save_page(pa, page);
+			if (err) {
+				pkram_free_page(page_address(page));
+				return err;
+			}
+			pa->bytes.data_page = page;
+			pa->bytes.data_offset = 0;
+		}
+
+		copy_count = min_t(size_t, count, PAGE_SIZE - pa->bytes.data_offset);
+		addr = kmap_atomic(pa->bytes.data_page);
+		memcpy(addr + pa->bytes.data_offset, buf, copy_count);
+		kunmap_atomic(addr);
+
+		buf += copy_count;
+		obj->data_len += copy_count;
+		pa->bytes.data_offset += copy_count;
+		if (pa->bytes.data_offset >= PAGE_SIZE)
+			pa->bytes.data_page = NULL;
+
+		write_count += copy_count;
+		count -= copy_count;
+	}
+	return write_count;
 }
 
 /**
@@ -663,5 +742,41 @@ ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count)
  */
 size_t pkram_read(struct pkram_access *pa, void *buf, size_t count)
 {
-	return 0;
+	struct pkram_node *node = pa->ps->node;
+	struct pkram_obj *obj = pa->ps->obj;
+	size_t copy_count, read_count = 0;
+	char *addr;
+
+	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_LOAD);
+
+	while (count > 0 && obj->data_len > 0) {
+		if (!pa->bytes.data_page) {
+			struct page *page;
+
+			page = __pkram_load_page(pa, NULL);
+			if (!page)
+				break;
+			pa->bytes.data_page = page;
+			pa->bytes.data_offset = 0;
+		}
+
+		copy_count = min_t(size_t, count, PAGE_SIZE - pa->bytes.data_offset);
+		if (copy_count > obj->data_len)
+			copy_count = obj->data_len;
+		addr = kmap_atomic(pa->bytes.data_page);
+		memcpy(buf, addr + pa->bytes.data_offset, copy_count);
+		kunmap_atomic(addr);
+
+		buf += copy_count;
+		obj->data_len -= copy_count;
+		pa->bytes.data_offset += copy_count;
+		if (pa->bytes.data_offset >= PAGE_SIZE || !obj->data_len) {
+			put_page(pa->bytes.data_page);
+			pa->bytes.data_page = NULL;
+		}
+
+		read_count += copy_count;
+		count -= copy_count;
+	}
+	return read_count;
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 07/43] mm: PKRAM: link nodes by pfn before reboot
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Since the page structs used to link PKRAM nodes are cleared on boot,
organize all PKRAM nodes into a list singly-linked by pfn before reboot
so that the node list can be restored in the new kernel.

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
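For illustration only: a hypothetical sketch of how the next kernel can
walk the pfn-linked node list that __pkram_reboot() builds. Here head_pfn
stands for the pfn handed over from the previous kernel (a later patch in
this series passes it via a super block page):

	static void pkram_show_preserved_nodes(unsigned long head_pfn)
	{
		unsigned long pfn = head_pfn;

		while (pfn) {
			struct pkram_node *node = pfn_to_kaddr(pfn);

			pr_info("PKRAM: preserved node at pfn 0x%lx\n", pfn);
			pfn = node->node_pfn;	/* 0 terminates the list */
		}
	}
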
 mm/pkram.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/mm/pkram.c b/mm/pkram.c
index d81af26c9a66..975f200aef38 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -2,12 +2,16 @@
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/highmem.h>
+#include <linux/init.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/mm.h>
+#include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/notifier.h>
 #include <linux/pkram.h>
+#include <linux/reboot.h>
 #include <linux/sched.h>
 #include <linux/string.h>
 #include <linux/types.h>
@@ -62,11 +66,15 @@ struct pkram_obj {
  * singly-linked list of PKRAM link structures (see above), the node has a
  * pointer to the head of.
  *
+ * To facilitate data restore in the new kernel, before reboot all PKRAM nodes
+ * are organized into a list singly-linked by pfn's (see pkram_reboot()).
+ *
  * The structure occupies a memory page.
  */
 struct pkram_node {
 	__u32	flags;
 	__u64	obj_pfn;	/* points to the first obj of the node */
+	__u64	node_pfn;	/* points to the next node in the node list */
 
 	__u8	name[PKRAM_NAME_MAX];
 };
@@ -75,6 +83,10 @@ struct pkram_node {
 #define PKRAM_LOAD		2
 #define PKRAM_ACCMODE_MASK	3
 
+/*
+ * For convenience sake PKRAM nodes are kept in an auxiliary doubly-linked list
+ * connected through the lru field of the page struct.
+ */
 static LIST_HEAD(pkram_nodes);			/* linked through page::lru */
 static DEFINE_MUTEX(pkram_mutex);		/* serializes open/close */
 
@@ -780,3 +792,41 @@ size_t pkram_read(struct pkram_access *pa, void *buf, size_t count)
 	}
 	return read_count;
 }
+
+/*
+ * Build the list of PKRAM nodes.
+ */
+static void __pkram_reboot(void)
+{
+	struct page *page;
+	struct pkram_node *node;
+	unsigned long node_pfn = 0;
+
+	list_for_each_entry_reverse(page, &pkram_nodes, lru) {
+		node = page_address(page);
+		if (WARN_ON(node->flags & PKRAM_ACCMODE_MASK))
+			continue;
+		node->node_pfn = node_pfn;
+		node_pfn = page_to_pfn(page);
+	}
+}
+
+static int pkram_reboot(struct notifier_block *notifier,
+		       unsigned long val, void *v)
+{
+	if (val != SYS_RESTART)
+		return NOTIFY_DONE;
+	__pkram_reboot();
+	return NOTIFY_OK;
+}
+
+static struct notifier_block pkram_reboot_notifier = {
+	.notifier_call = pkram_reboot,
+};
+
+static int __init pkram_init(void)
+{
+	register_reboot_notifier(&pkram_reboot_notifier);
+	return 0;
+}
+module_init(pkram_init);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 08/43] mm: PKRAM: introduce super block
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

The PKRAM super block is the starting point for restoring preserved
memory. If the super block is provided to the new kernel at boot time,
the preserved memory can be reserved and made available to be restored.
To point the kernel to the location of the super block, one passes its
pfn via the 'pkram' boot param; for that purpose the super block pfn is
exported via /sys/kernel/pkram. If no pfn is passed, any previously
preserved memory is not kept, and a new super block is allocated.

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
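For illustration only: a hypothetical caller probing whether PKRAM is
usable. With this patch pkram_prepare_save() and pkram_prepare_load()
fail with -ENODEV when no super block is available; the probe name and
helper below are made up, and pkram_discard_save() is used to abandon
the probe's unfinished save:

	static int foo_pkram_usable(void)
	{
		struct pkram_stream ps;
		int err;

		err = pkram_prepare_save(&ps, "foo-probe", GFP_KERNEL);
		if (err == -ENODEV)
			return 0;		/* PKRAM not available */
		if (err)
			return err;

		pkram_discard_save(&ps);
		return 1;			/* PKRAM available */
	}
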
 mm/pkram.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 100 insertions(+), 2 deletions(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index 975f200aef38..2809371a9aec 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -5,15 +5,18 @@
 #include <linux/init.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
+#include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/notifier.h>
+#include <linux/pfn.h>
 #include <linux/pkram.h>
 #include <linux/reboot.h>
 #include <linux/sched.h>
 #include <linux/string.h>
+#include <linux/sysfs.h>
 #include <linux/types.h>
 
 #include "internal.h"
@@ -84,12 +87,38 @@ struct pkram_node {
 #define PKRAM_ACCMODE_MASK	3
 
 /*
+ * The PKRAM super block contains data needed to restore the preserved memory
+ * structure on boot. The pointer to it (pfn) should be passed via the 'pkram'
+ * boot param if one wants to restore preserved data saved by the previously
+ * executing kernel. For that purpose the kernel exports the pfn via
+ * /sys/kernel/pkram. If none is passed, preserved memory if any will not be
+ * preserved and a new clean page will be allocated for the super block.
+ *
+ * The structure occupies a memory page.
+ */
+struct pkram_super_block {
+	__u64	node_pfn;		/* first element of the node list */
+};
+
+static unsigned long pkram_sb_pfn __initdata;
+static struct pkram_super_block *pkram_sb;
+
+/*
  * For convenience sake PKRAM nodes are kept in an auxiliary doubly-linked list
  * connected through the lru field of the page struct.
  */
 static LIST_HEAD(pkram_nodes);			/* linked through page::lru */
 static DEFINE_MUTEX(pkram_mutex);		/* serializes open/close */
 
+/*
+ * The PKRAM super block pfn, see above.
+ */
+static int __init parse_pkram_sb_pfn(char *arg)
+{
+	return kstrtoul(arg, 16, &pkram_sb_pfn);
+}
+early_param("pkram", parse_pkram_sb_pfn);
+
 static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 {
 	return alloc_page(gfp_mask);
@@ -275,6 +304,7 @@ static void pkram_stream_init(struct pkram_stream *ps,
  * @gfp_mask specifies the memory allocation mask to be used when saving data.
  *
  * Error values:
+ *	%ENODEV: PKRAM not available
  *	%ENAMETOOLONG: name len >= PKRAM_NAME_MAX
  *	%ENOMEM: insufficient memory available
  *	%EEXIST: node with specified name already exists
@@ -290,6 +320,9 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name, gfp_t gfp_mask
 	struct pkram_node *node;
 	int err = 0;
 
+	if (!pkram_sb)
+		return -ENODEV;
+
 	if (strlen(name) >= PKRAM_NAME_MAX)
 		return -ENAMETOOLONG;
 
@@ -410,6 +443,7 @@ void pkram_discard_save(struct pkram_stream *ps)
  * Returns 0 on success, -errno on failure.
  *
  * Error values:
+ *	%ENODEV: PKRAM not available
  *	%ENOENT: node with specified name does not exist
  *	%EBUSY: save to required node has not finished yet
  *
@@ -420,6 +454,9 @@ int pkram_prepare_load(struct pkram_stream *ps, const char *name)
 	struct pkram_node *node;
 	int err = 0;
 
+	if (!pkram_sb)
+		return -ENODEV;
+
 	mutex_lock(&pkram_mutex);
 	node = pkram_find_node(name);
 	if (!node) {
@@ -809,6 +846,13 @@ static void __pkram_reboot(void)
 		node->node_pfn = node_pfn;
 		node_pfn = page_to_pfn(page);
 	}
+
+	/*
+	 * Zero out pkram_sb completely since it may have been passed from
+	 * the previous boot.
+	 */
+	memset(pkram_sb, 0, PAGE_SIZE);
+	pkram_sb->node_pfn = node_pfn;
 }
 
 static int pkram_reboot(struct notifier_block *notifier,
@@ -816,7 +860,8 @@ static int pkram_reboot(struct notifier_block *notifier,
 {
 	if (val != SYS_RESTART)
 		return NOTIFY_DONE;
-	__pkram_reboot();
+	if (pkram_sb)
+		__pkram_reboot();
 	return NOTIFY_OK;
 }
 
@@ -824,9 +869,62 @@ static int pkram_reboot(struct notifier_block *notifier,
 	.notifier_call = pkram_reboot,
 };
 
+static ssize_t show_pkram_sb_pfn(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	unsigned long pfn = pkram_sb ? PFN_DOWN(__pa(pkram_sb)) : 0;
+
+	return sprintf(buf, "%lx\n", pfn);
+}
+
+static struct kobj_attribute pkram_sb_pfn_attr =
+	__ATTR(pkram, 0444, show_pkram_sb_pfn, NULL);
+
+static struct attribute *pkram_attrs[] = {
+	&pkram_sb_pfn_attr.attr,
+	NULL,
+};
+
+static struct attribute_group pkram_attr_group = {
+	.attrs = pkram_attrs,
+};
+
+/* returns non-zero on success */
+static int __init pkram_init_sb(void)
+{
+	unsigned long pfn;
+	struct pkram_node *node;
+
+	if (!pkram_sb) {
+		struct page *page;
+
+		page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!page) {
+			pr_err("PKRAM: Failed to allocate super block\n");
+			return 0;
+		}
+		pkram_sb = page_address(page);
+	}
+
+	/*
+	 * Build auxiliary doubly-linked list of nodes connected through
+	 * page::lru for convenience sake.
+	 */
+	pfn = pkram_sb->node_pfn;
+	while (pfn) {
+		node = pfn_to_kaddr(pfn);
+		pkram_insert_node(node);
+		pfn = node->node_pfn;
+	}
+	return 1;
+}
+
 static int __init pkram_init(void)
 {
-	register_reboot_notifier(&pkram_reboot_notifier);
+	if (pkram_init_sb()) {
+		register_reboot_notifier(&pkram_reboot_notifier);
+		sysfs_update_group(kernel_kobj, &pkram_attr_group);
+	}
 	return 0;
 }
 module_init(pkram_init);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 09/43] PKRAM: track preserved pages in a physical mapping pagetable
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Later patches in this series will need a way to efficiently identify
physically contiguous ranges of preserved pages independent of their
virtual addresses. To facilitate this, all pages to be preserved across
kexec are added to a pseudo identity-mapping pagetable.

The pagetable makes use of the existing architecture definitions for
building a memory mapping pagetable, except that a bitmap is used to
represent the presence or absence of preserved pages at the PTE level.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
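A note on the PTE-level representation: on x86-64, PTRS_PER_PTE is 512,
so each bitmap is 64 bytes, 64 bitmaps fit in a page, and the low bits of
the pmd value select which 64-byte bitmap within that page is meant; a
fully populated bitmap is promoted to a huge pmd entry. For illustration
only, a hypothetical user of pkram_find_preserved() that totals the
preserved ranges (both functions below are made up for this sketch):

	static int pkram_add_range_size(unsigned long base, unsigned long size,
					void *private)
	{
		unsigned long *total = private;

		*total += size;
		return 0;	/* a non-zero return stops the walk early */
	}

	static unsigned long pkram_preserved_bytes(void)
	{
		unsigned long total = 0;

		/* walk all preserved ranges recorded in the pagetable */
		pkram_find_preserved(0, -1UL, &total, pkram_add_range_size);
		return total;
	}
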
 mm/Makefile          |   2 +-
 mm/pkram.c           |  30 +++-
 mm/pkram_pagetable.c | 376 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 404 insertions(+), 4 deletions(-)
 create mode 100644 mm/pkram_pagetable.c

diff --git a/mm/Makefile b/mm/Makefile
index ab3a724769b5..f5c0dd0a3707 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,4 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
-obj-$(CONFIG_PKRAM) += pkram.o
+obj-$(CONFIG_PKRAM) += pkram.o pkram_pagetable.o
diff --git a/mm/pkram.c b/mm/pkram.c
index 2809371a9aec..a9e6cd8ca084 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -103,6 +103,9 @@ struct pkram_super_block {
 static unsigned long pkram_sb_pfn __initdata;
 static struct pkram_super_block *pkram_sb;
 
+extern int pkram_add_identity_map(struct page *page);
+extern void pkram_remove_identity_map(struct page *page);
+
 /*
  * For convenience sake PKRAM nodes are kept in an auxiliary doubly-linked list
  * connected through the lru field of the page struct.
@@ -121,11 +124,24 @@ static int __init parse_pkram_sb_pfn(char *arg)
 
 static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 {
-	return alloc_page(gfp_mask);
+	struct page *page;
+	int err;
+
+	page = alloc_page(gfp_mask);
+	if (page) {
+		err = pkram_add_identity_map(page);
+		if (err) {
+			__free_page(page);
+			page = NULL;
+		}
+	}
+
+	return page;
 }
 
 static inline void pkram_free_page(void *addr)
 {
+	pkram_remove_identity_map(virt_to_page(addr));
 	free_page((unsigned long)addr);
 }
 
@@ -163,6 +179,7 @@ static void pkram_truncate_link(struct pkram_link *link)
 		if (!p)
 			continue;
 		page = pfn_to_page(PHYS_PFN(p));
+		pkram_remove_identity_map(page);
 		put_page(page);
 	}
 }
@@ -615,10 +632,15 @@ static int __pkram_save_page(struct pkram_access *pa, struct page *page,
 int pkram_save_file_page(struct pkram_access *pa, struct page *page)
 {
 	struct pkram_node *node = pa->ps->node;
+	int err;
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
-	return __pkram_save_page(pa, page, page->index);
+	err = __pkram_save_page(pa, page, page->index);
+	if (!err)
+		err = pkram_add_identity_map(page);
+
+	return err;
 }
 
 static int __pkram_bytes_save_page(struct pkram_access *pa, struct page *page)
@@ -652,6 +674,8 @@ static struct page *__pkram_prep_load_page(pkram_entry_t p)
 		prep_transhuge_page(page);
 	}
 
+	pkram_remove_identity_map(page);
+
 	return page;
 }
 
@@ -898,7 +922,7 @@ static int __init pkram_init_sb(void)
 	if (!pkram_sb) {
 		struct page *page;
 
-		page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 		if (!page) {
 			pr_err("PKRAM: Failed to allocate super block\n");
 			return 0;
diff --git a/mm/pkram_pagetable.c b/mm/pkram_pagetable.c
new file mode 100644
index 000000000000..9c5443bd7686
--- /dev/null
+++ b/mm/pkram_pagetable.c
@@ -0,0 +1,376 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bitops.h>
+//#include <asm/pgtable.h>
+#include <linux/mm.h>
+
+static pgd_t *pkram_pgd;
+static DEFINE_SPINLOCK(pkram_pgd_lock);
+
+#define set_p4d(p4dp, p4d)	WRITE_ONCE(*(p4dp), (p4d))
+
+#define PKRAM_PTE_BM_BYTES	(PTRS_PER_PTE / BITS_PER_BYTE)
+#define PKRAM_PTE_BM_MASK	(PAGE_SIZE / PKRAM_PTE_BM_BYTES - 1)
+
+static pmd_t make_bitmap_pmd(unsigned long *bitmap)
+{
+	unsigned long val;
+
+	val = __pa(ALIGN_DOWN((unsigned long)bitmap, PAGE_SIZE));
+	val |= (((unsigned long)bitmap & ~PAGE_MASK) / PKRAM_PTE_BM_BYTES);
+
+	return __pmd(val);
+}
+
+static unsigned long *get_bitmap_addr(pmd_t pmd)
+{
+	unsigned long val, off;
+
+	val = pmd_val(pmd);
+	off = (val & PKRAM_PTE_BM_MASK) * PKRAM_PTE_BM_BYTES;
+
+	val = (val & PAGE_MASK) + off;
+
+	return __va(val);
+}
+
+int pkram_add_identity_map(struct page *page)
+{
+	unsigned long paddr;
+	unsigned long *bitmap;
+	unsigned int index;
+	struct page *pg;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	if (!pkram_pgd) {
+		spin_lock(&pkram_pgd_lock);
+		if (!pkram_pgd) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			pkram_pgd = page_address(pg);
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+
+	paddr = __pa(page_address(page));
+	pgd = pkram_pgd;
+	pgd += pgd_index(paddr);
+	if (pgd_none(*pgd)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pgd_none(*pgd)) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			p4d = page_address(pg);
+			set_pgd(pgd, __pgd(__pa(p4d)));
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+	p4d = p4d_offset(pgd, paddr);
+	if (p4d_none(*p4d)) {
+		spin_lock(&pkram_pgd_lock);
+		if (p4d_none(*p4d)) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			pud = page_address(pg);
+			set_p4d(p4d, __p4d(__pa(pud)));
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+	pud = pud_offset(p4d, paddr);
+	if (pud_none(*pud)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pud_none(*pud)) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			pmd = page_address(pg);
+			set_pud(pud, __pud(__pa(pmd)));
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+	pmd = pmd_offset(pud, paddr);
+	if (pmd_none(*pmd)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pmd_none(*pmd)) {
+			if (PageTransHuge(page)) {
+				set_pmd(pmd, pmd_mkhuge(*pmd));
+				spin_unlock(&pkram_pgd_lock);
+				goto done;
+			}
+			bitmap = bitmap_zalloc(PTRS_PER_PTE, GFP_ATOMIC);
+			if (!bitmap)
+				goto nomem;
+			set_pmd(pmd, make_bitmap_pmd(bitmap));
+		} else {
+			BUG_ON(pmd_large(*pmd));
+			bitmap = get_bitmap_addr(*pmd);
+		}
+		spin_unlock(&pkram_pgd_lock);
+	} else {
+		BUG_ON(pmd_large(*pmd));
+		bitmap = get_bitmap_addr(*pmd);
+	}
+
+	index = pte_index(paddr);
+	BUG_ON(test_bit(index, bitmap));
+	set_bit(index, bitmap);
+	smp_mb__after_atomic();
+	if (bitmap_full(bitmap, PTRS_PER_PTE))
+		set_pmd(pmd, pmd_mkhuge(*pmd));
+done:
+	return 0;
+nomem:
+	return -ENOMEM;
+}
+
+void pkram_remove_identity_map(struct page *page)
+{
+	unsigned long *bitmap;
+	unsigned long paddr;
+	unsigned int index;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	/*
+	 * pkram_pgd will be null when freeing metadata pages after a reboot
+	 */
+	if (!pkram_pgd)
+		return;
+
+	paddr = __pa(page_address(page));
+	pgd = pkram_pgd;
+	pgd += pgd_index(paddr);
+	if (pgd_none(*pgd)) {
+		WARN_ONCE(1, "PKRAM: %s: no pgd for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	p4d = p4d_offset(pgd, paddr);
+	if (p4d_none(*p4d)) {
+		WARN_ONCE(1, "PKRAM: %s: no p4d for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	pud = pud_offset(p4d, paddr);
+	if (pud_none(*pud)) {
+		WARN_ONCE(1, "PKRAM: %s: no pud for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	pmd = pmd_offset(pud, paddr);
+	if (pmd_none(*pmd)) {
+		WARN_ONCE(1, "PKRAM: %s: no pmd for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	if (PageTransHuge(page)) {
+		BUG_ON(!pmd_large(*pmd));
+		pmd_clear(pmd);
+		return;
+	}
+
+	if (pmd_large(*pmd)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pmd_large(*pmd))
+			set_pmd(pmd, __pmd(pte_val(pte_clrhuge(*(pte_t *)pmd))));
+		spin_unlock(&pkram_pgd_lock);
+	}
+
+	bitmap = get_bitmap_addr(*pmd);
+	index = pte_index(paddr);
+	clear_bit(index, bitmap);
+	smp_mb__after_atomic();
+
+	spin_lock(&pkram_pgd_lock);
+	if (!pmd_none(*pmd) && bitmap_empty(bitmap, PTRS_PER_PTE)) {
+		pmd_clear(pmd);
+		spin_unlock(&pkram_pgd_lock);
+		bitmap_free(bitmap);
+	} else {
+		spin_unlock(&pkram_pgd_lock);
+	}
+}
+
+struct pkram_pg_state {
+	int (*range_cb)(unsigned long base, unsigned long size, void *private);
+	unsigned long start_addr;
+	unsigned long curr_addr;
+	unsigned long min_addr;
+	unsigned long max_addr;
+	void *private;
+	bool tracking;
+};
+
+#define pgd_none(a)  (pgtable_l5_enabled() ? pgd_none(a) : p4d_none(__p4d(pgd_val(a))))
+
+static int note_page(struct pkram_pg_state *st, unsigned long addr, bool present)
+{
+	if (!st->tracking && present) {
+		if (addr >= st->max_addr)
+			return 1;
+		/*
+		 * addr can be < min_addr if the page straddles the
+		 * boundary
+		 */
+		st->start_addr = max(addr, st->min_addr);
+		st->tracking = true;
+	} else if (st->tracking) {
+		unsigned long base, size;
+		int ret;
+
+		/* Continue tracking if upper bound has not been reached */
+		if (present && addr < st->max_addr)
+			return 0;
+
+		addr = min(addr, st->max_addr);
+
+		base = st->start_addr;
+		size = addr - st->start_addr;
+		st->tracking = false;
+
+		ret = st->range_cb(base, size, st->private);
+
+		if (addr == st->max_addr)
+			return 1;
+		else
+			return ret;
+	}
+
+	return 0;
+}
+
+static int walk_pte_level(struct pkram_pg_state *st, pmd_t addr, unsigned long P)
+{
+	unsigned long *bitmap;
+	int present;
+	int i, ret;
+
+	bitmap = get_bitmap_addr(addr);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		unsigned long curr_addr = P + i * PAGE_SIZE;
+
+		if (curr_addr < st->min_addr)
+			continue;
+		present = test_bit(i, bitmap);
+		ret = note_page(st, curr_addr, present);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static int walk_pmd_level(struct pkram_pg_state *st, pud_t addr, unsigned long P)
+{
+	pmd_t *start;
+	int i, ret;
+
+	start = (pmd_t *)pud_page_vaddr(addr);
+	for (i = 0; i < PTRS_PER_PMD; i++, start++) {
+		unsigned long curr_addr = P + i * PMD_SIZE;
+
+		if (curr_addr + PMD_SIZE <= st->min_addr)
+			continue;
+		if (!pmd_none(*start)) {
+			if (pmd_large(*start))
+				ret = note_page(st, curr_addr, true);
+			else
+				ret = walk_pte_level(st, *start, curr_addr);
+		} else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static int walk_pud_level(struct pkram_pg_state *st, p4d_t addr, unsigned long P)
+{
+	pud_t *start;
+	int i, ret;
+
+	start = (pud_t *)p4d_page_vaddr(addr);
+	for (i = 0; i < PTRS_PER_PUD; i++, start++) {
+		unsigned long curr_addr = P + i * PUD_SIZE;
+
+		if (curr_addr + PUD_SIZE <= st->min_addr)
+			continue;
+		if (!pud_none(*start)) {
+			if (pud_large(*start))
+				ret = note_page(st, curr_addr, true);
+			else
+				ret = walk_pmd_level(st, *start, curr_addr);
+		} else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static int walk_p4d_level(struct pkram_pg_state *st, pgd_t addr, unsigned long P)
+{
+	p4d_t *start;
+	int i, ret;
+
+	if (PTRS_PER_P4D == 1)
+		return walk_pud_level(st, __p4d(pgd_val(addr)), P);
+
+	start = (p4d_t *)pgd_page_vaddr(addr);
+	for (i = 0; i < PTRS_PER_P4D; i++, start++) {
+		unsigned long curr_addr = P + i * P4D_SIZE;
+
+		if (curr_addr + P4D_SIZE <= st->min_addr)
+			continue;
+		if (!p4d_none(*start)) {
+			if (p4d_large(*start))
+				ret = note_page(st, curr_addr, true);
+			else
+				ret = walk_pud_level(st, *start, curr_addr);
+		} else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+void pkram_walk_pgt(struct pkram_pg_state *st, pgd_t *pgd)
+{
+	pgd_t *start = pgd;
+	int i, ret = 0;
+
+	for (i = 0; i < PTRS_PER_PGD; i++, start++) {
+		unsigned long curr_addr = i * PGDIR_SIZE;
+
+		if (curr_addr + PGDIR_SIZE <= st->min_addr)
+			continue;
+		if (!pgd_none(*start))
+			ret = walk_p4d_level(st, *start, curr_addr);
+		else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+}
+
+void pkram_find_preserved(unsigned long start, unsigned long end, void *private, int (*callback)(unsigned long base, unsigned long size, void *private))
+{
+	struct pkram_pg_state st = {
+		.range_cb = callback,
+		.min_addr = start,
+		.max_addr = end,
+		.private = private,
+	};
+
+	if (!pkram_pgd)
+		return;
+
+	pkram_walk_pgt(&st, pkram_pgd);
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 09/43] PKRAM: track preserved pages in a physical mapping pagetable
@ 2021-03-30 21:35   ` Anthony Yznaga
  0 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Later patches in this series will need a way to efficiently identify
physically contiguous ranges of preserved pages independent of their
virtual addresses. To facilitate this all pages to be preserved across
kexec are added to a pseudo identity mapping pagetable.

The pagetable makes use of the existing architecture definitions for
building a memory mapping pagetable except that a bitmap is used to
represent the presence or absence of preserved pages at the PTE level.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/Makefile          |   2 +-
 mm/pkram.c           |  30 +++-
 mm/pkram_pagetable.c | 376 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 404 insertions(+), 4 deletions(-)
 create mode 100644 mm/pkram_pagetable.c

diff --git a/mm/Makefile b/mm/Makefile
index ab3a724769b5..f5c0dd0a3707 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,4 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
-obj-$(CONFIG_PKRAM) += pkram.o
+obj-$(CONFIG_PKRAM) += pkram.o pkram_pagetable.o
diff --git a/mm/pkram.c b/mm/pkram.c
index 2809371a9aec..a9e6cd8ca084 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -103,6 +103,9 @@ struct pkram_super_block {
 static unsigned long pkram_sb_pfn __initdata;
 static struct pkram_super_block *pkram_sb;
 
+extern int pkram_add_identity_map(struct page *page);
+extern void pkram_remove_identity_map(struct page *page);
+
 /*
  * For convenience sake PKRAM nodes are kept in an auxiliary doubly-linked list
  * connected through the lru field of the page struct.
@@ -121,11 +124,24 @@ static int __init parse_pkram_sb_pfn(char *arg)
 
 static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 {
-	return alloc_page(gfp_mask);
+	struct page *page;
+	int err;
+
+	page = alloc_page(gfp_mask);
+	if (page) {
+		err = pkram_add_identity_map(page);
+		if (err) {
+			__free_page(page);
+			page = NULL;
+		}
+	}
+
+	return page;
 }
 
 static inline void pkram_free_page(void *addr)
 {
+	pkram_remove_identity_map(virt_to_page(addr));
 	free_page((unsigned long)addr);
 }
 
@@ -163,6 +179,7 @@ static void pkram_truncate_link(struct pkram_link *link)
 		if (!p)
 			continue;
 		page = pfn_to_page(PHYS_PFN(p));
+		pkram_remove_identity_map(page);
 		put_page(page);
 	}
 }
@@ -615,10 +632,15 @@ static int __pkram_save_page(struct pkram_access *pa, struct page *page,
 int pkram_save_file_page(struct pkram_access *pa, struct page *page)
 {
 	struct pkram_node *node = pa->ps->node;
+	int err;
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
-	return __pkram_save_page(pa, page, page->index);
+	err = __pkram_save_page(pa, page, page->index);
+	if (!err)
+		err = pkram_add_identity_map(page);
+
+	return err;
 }
 
 static int __pkram_bytes_save_page(struct pkram_access *pa, struct page *page)
@@ -652,6 +674,8 @@ static struct page *__pkram_prep_load_page(pkram_entry_t p)
 		prep_transhuge_page(page);
 	}
 
+	pkram_remove_identity_map(page);
+
 	return page;
 }
 
@@ -898,7 +922,7 @@ static int __init pkram_init_sb(void)
 	if (!pkram_sb) {
 		struct page *page;
 
-		page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 		if (!page) {
 			pr_err("PKRAM: Failed to allocate super block\n");
 			return 0;
diff --git a/mm/pkram_pagetable.c b/mm/pkram_pagetable.c
new file mode 100644
index 000000000000..9c5443bd7686
--- /dev/null
+++ b/mm/pkram_pagetable.c
@@ -0,0 +1,376 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bitops.h>
+//#include <asm/pgtable.h>
+#include <linux/mm.h>
+
+static pgd_t *pkram_pgd;
+static DEFINE_SPINLOCK(pkram_pgd_lock);
+
+#define set_p4d(p4dp, p4d)	WRITE_ONCE(*(p4dp), (p4d))
+
+#define PKRAM_PTE_BM_BYTES	(PTRS_PER_PTE / BITS_PER_BYTE)
+#define PKRAM_PTE_BM_MASK	(PAGE_SIZE / PKRAM_PTE_BM_BYTES - 1)
+
+static pmd_t make_bitmap_pmd(unsigned long *bitmap)
+{
+	unsigned long val;
+
+	val = __pa(ALIGN_DOWN((unsigned long)bitmap, PAGE_SIZE));
+	val |= (((unsigned long)bitmap & ~PAGE_MASK) / PKRAM_PTE_BM_BYTES);
+
+	return __pmd(val);
+}
+
+static unsigned long *get_bitmap_addr(pmd_t pmd)
+{
+	unsigned long val, off;
+
+	val = pmd_val(pmd);
+	off = (val & PKRAM_PTE_BM_MASK) * PKRAM_PTE_BM_BYTES;
+
+	val = (val & PAGE_MASK) + off;
+
+	return __va(val);
+}
+
+int pkram_add_identity_map(struct page *page)
+{
+	unsigned long paddr;
+	unsigned long *bitmap;
+	unsigned int index;
+	struct page *pg;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	if (!pkram_pgd) {
+		spin_lock(&pkram_pgd_lock);
+		if (!pkram_pgd) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			pkram_pgd = page_address(pg);
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+
+	paddr = __pa(page_address(page));
+	pgd = pkram_pgd;
+	pgd += pgd_index(paddr);
+	if (pgd_none(*pgd)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pgd_none(*pgd)) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			p4d = page_address(pg);
+			set_pgd(pgd, __pgd(__pa(p4d)));
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+	p4d = p4d_offset(pgd, paddr);
+	if (p4d_none(*p4d)) {
+		spin_lock(&pkram_pgd_lock);
+		if (p4d_none(*p4d)) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			pud = page_address(pg);
+			set_p4d(p4d, __p4d(__pa(pud)));
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+	pud = pud_offset(p4d, paddr);
+	if (pud_none(*pud)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pud_none(*pud)) {
+			pg = alloc_page(GFP_ATOMIC|__GFP_ZERO);
+			if (!pg)
+				goto nomem;
+			pmd = page_address(pg);
+			set_pud(pud, __pud(__pa(pmd)));
+		}
+		spin_unlock(&pkram_pgd_lock);
+	}
+	pmd = pmd_offset(pud, paddr);
+	if (pmd_none(*pmd)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pmd_none(*pmd)) {
+			if (PageTransHuge(page)) {
+				set_pmd(pmd, pmd_mkhuge(*pmd));
+				spin_unlock(&pkram_pgd_lock);
+				goto done;
+			}
+			bitmap = bitmap_zalloc(PTRS_PER_PTE, GFP_ATOMIC);
+			if (!bitmap)
+				goto nomem;
+			set_pmd(pmd, make_bitmap_pmd(bitmap));
+		} else {
+			BUG_ON(pmd_large(*pmd));
+			bitmap = get_bitmap_addr(*pmd);
+		}
+		spin_unlock(&pkram_pgd_lock);
+	} else {
+		BUG_ON(pmd_large(*pmd));
+		bitmap = get_bitmap_addr(*pmd);
+	}
+
+	index = pte_index(paddr);
+	BUG_ON(test_bit(index, bitmap));
+	set_bit(index, bitmap);
+	smp_mb__after_atomic();
+	if (bitmap_full(bitmap, PTRS_PER_PTE))
+		set_pmd(pmd, pmd_mkhuge(*pmd));
+done:
+	return 0;
+nomem:
+	return -ENOMEM;
+}
+
+void pkram_remove_identity_map(struct page *page)
+{
+	unsigned long *bitmap;
+	unsigned long paddr;
+	unsigned int index;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	/*
+	 * pkram_pgd will be null when freeing metadata pages after a reboot
+	 */
+	if (!pkram_pgd)
+		return;
+
+	paddr = __pa(page_address(page));
+	pgd = pkram_pgd;
+	pgd += pgd_index(paddr);
+	if (pgd_none(*pgd)) {
+		WARN_ONCE(1, "PKRAM: %s: no pgd for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	p4d = p4d_offset(pgd, paddr);
+	if (p4d_none(*p4d)) {
+		WARN_ONCE(1, "PKRAM: %s: no p4d for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	pud = pud_offset(p4d, paddr);
+	if (pud_none(*pud)) {
+		WARN_ONCE(1, "PKRAM: %s: no pud for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	pmd = pmd_offset(pud, paddr);
+	if (pmd_none(*pmd)) {
+		WARN_ONCE(1, "PKRAM: %s: no pmd for 0x%lx\n", __func__, paddr);
+		return;
+	}
+	if (PageTransHuge(page)) {
+		BUG_ON(!pmd_large(*pmd));
+		pmd_clear(pmd);
+		return;
+	}
+
+	if (pmd_large(*pmd)) {
+		spin_lock(&pkram_pgd_lock);
+		if (pmd_large(*pmd))
+			set_pmd(pmd, __pmd(pte_val(pte_clrhuge(*(pte_t *)pmd))));
+		spin_unlock(&pkram_pgd_lock);
+	}
+
+	bitmap = get_bitmap_addr(*pmd);
+	index = pte_index(paddr);
+	clear_bit(index, bitmap);
+	smp_mb__after_atomic();
+
+	spin_lock(&pkram_pgd_lock);
+	if (!pmd_none(*pmd) && bitmap_empty(bitmap, PTRS_PER_PTE)) {
+		pmd_clear(pmd);
+		spin_unlock(&pkram_pgd_lock);
+		bitmap_free(bitmap);
+	} else {
+		spin_unlock(&pkram_pgd_lock);
+	}
+}
+
+struct pkram_pg_state {
+	int (*range_cb)(unsigned long base, unsigned long size, void *private);
+	unsigned long start_addr;
+	unsigned long curr_addr;
+	unsigned long min_addr;
+	unsigned long max_addr;
+	void *private;
+	bool tracking;
+};
+
+#define pgd_none(a)  (pgtable_l5_enabled() ? pgd_none(a) : p4d_none(__p4d(pgd_val(a))))
+
+static int note_page(struct pkram_pg_state *st, unsigned long addr, bool present)
+{
+	if (!st->tracking && present) {
+		if (addr >= st->max_addr)
+			return 1;
+		/*
+		 * addr can be < min_addr if the page straddles the
+		 * boundary
+		 */
+		st->start_addr = max(addr, st->min_addr);
+		st->tracking = true;
+	} else if (st->tracking) {
+		unsigned long base, size;
+		int ret;
+
+		/* Continue tracking if upper bound has not been reached */
+		if (present && addr < st->max_addr)
+			return 0;
+
+		addr = min(addr, st->max_addr);
+
+		base = st->start_addr;
+		size = addr - st->start_addr;
+		st->tracking = false;
+
+		ret = st->range_cb(base, size, st->private);
+
+		if (addr == st->max_addr)
+			return 1;
+		else
+			return ret;
+	}
+
+	return 0;
+}
+
+static int walk_pte_level(struct pkram_pg_state *st, pmd_t addr, unsigned long P)
+{
+	unsigned long *bitmap;
+	int present;
+	int i, ret = 0;
+
+	bitmap = get_bitmap_addr(addr);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		unsigned long curr_addr = P + i * PAGE_SIZE;
+
+		if (curr_addr < st->min_addr)
+			continue;
+		present = test_bit(i, bitmap);
+		ret = note_page(st, curr_addr, present);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static int walk_pmd_level(struct pkram_pg_state *st, pud_t addr, unsigned long P)
+{
+	pmd_t *start;
+	int i, ret = 0;
+
+	start = (pmd_t *)pud_page_vaddr(addr);
+	for (i = 0; i < PTRS_PER_PMD; i++, start++) {
+		unsigned long curr_addr = P + i * PMD_SIZE;
+
+		if (curr_addr + PMD_SIZE <= st->min_addr)
+			continue;
+		if (!pmd_none(*start)) {
+			if (pmd_large(*start))
+				ret = note_page(st, curr_addr, true);
+			else
+				ret = walk_pte_level(st, *start, curr_addr);
+		} else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static int walk_pud_level(struct pkram_pg_state *st, p4d_t addr, unsigned long P)
+{
+	pud_t *start;
+	int i, ret = 0;
+
+	start = (pud_t *)p4d_page_vaddr(addr);
+	for (i = 0; i < PTRS_PER_PUD; i++, start++) {
+		unsigned long curr_addr = P + i * PUD_SIZE;
+
+		if (curr_addr + PUD_SIZE <= st->min_addr)
+			continue;
+		if (!pud_none(*start)) {
+			if (pud_large(*start))
+				ret = note_page(st, curr_addr, true);
+			else
+				ret = walk_pmd_level(st, *start, curr_addr);
+		} else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static int walk_p4d_level(struct pkram_pg_state *st, pgd_t addr, unsigned long P)
+{
+	p4d_t *start;
+	int i, ret = 0;
+
+	if (PTRS_PER_P4D == 1)
+		return walk_pud_level(st, __p4d(pgd_val(addr)), P);
+
+	start = (p4d_t *)pgd_page_vaddr(addr);
+	for (i = 0; i < PTRS_PER_P4D; i++, start++) {
+		unsigned long curr_addr = P + i * P4D_SIZE;
+
+		if (curr_addr + P4D_SIZE <= st->min_addr)
+			continue;
+		if (!p4d_none(*start)) {
+			if (p4d_large(*start))
+				ret = note_page(st, curr_addr, true);
+			else
+				ret = walk_pud_level(st, *start, curr_addr);
+		} else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+void pkram_walk_pgt(struct pkram_pg_state *st, pgd_t *pgd)
+{
+	pgd_t *start = pgd;
+	int i, ret = 0;
+
+	for (i = 0; i < PTRS_PER_PGD; i++, start++) {
+		unsigned long curr_addr = i * PGDIR_SIZE;
+
+		if (curr_addr + PGDIR_SIZE <= st->min_addr)
+			continue;
+		if (!pgd_none(*start))
+			ret = walk_p4d_level(st, *start, curr_addr);
+		else
+			ret = note_page(st, curr_addr, false);
+		if (ret)
+			break;
+	}
+}
+
+void pkram_find_preserved(unsigned long start, unsigned long end, void *private, int (*callback)(unsigned long base, unsigned long size, void *private))
+{
+	struct pkram_pg_state st = {
+		.range_cb = callback,
+		.min_addr = start,
+		.max_addr = end,
+		.private = private,
+	};
+
+	if (!pkram_pgd)
+		return;
+
+	pkram_walk_pgt(&st, pkram_pgd);
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 10/43] PKRAM: pass a list of preserved ranges to the next kernel
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

In order to build a new memblock reserved list during boot that
includes ranges preserved by the previous kernel, a list of preserved
ranges is passed to the next kernel via the pkram superblock. The
ranges are stored in ascending order in a linked list of pages. A
fully formed memblock list is deliberately not passed, both to avoid
possible conflicts with memblock changes in a newer kernel and to
avoid having to allocate a contiguous range larger than a page.
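
To make the layout concrete, here is a minimal sketch (not part of the
patch) of how a reader of the superblock could walk the list.
walk_preserved_ranges() and process_range() are names made up for this
sketch; struct pkram_region_list, its next_pfn link, and
PKRAM_REGIONS_LIST_MAX are the definitions added below.  List pages are
allocated zeroed, so a zero-size entry marks the end of the populated
entries.

static void walk_preserved_ranges(struct pkram_super_block *sb,
		void (*process_range)(phys_addr_t base, phys_addr_t size))
{
	struct pkram_region_list *rl;
	int i;

	if (!sb->region_list_pfn)
		return;

	rl = pfn_to_kaddr(sb->region_list_pfn);
	while (rl) {
		for (i = 0; i < PKRAM_REGIONS_LIST_MAX; i++) {
			/* list pages are zeroed; a zero size ends the list */
			if (!rl->regions[i].size)
				return;
			process_range(rl->regions[i].base,
				      rl->regions[i].size);
		}
		/* follow next_pfn to the next page of the list, if any */
		rl = rl->next_pfn ? pfn_to_kaddr(rl->next_pfn) : NULL;
	}
}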

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 176 insertions(+), 7 deletions(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index a9e6cd8ca084..4cfa236a4126 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -86,6 +86,20 @@ struct pkram_node {
 #define PKRAM_LOAD		2
 #define PKRAM_ACCMODE_MASK	3
 
+struct pkram_region {
+	phys_addr_t base;
+	phys_addr_t size;
+};
+
+struct pkram_region_list {
+	__u64	prev_pfn;
+	__u64	next_pfn;
+
+	struct pkram_region regions[0];
+};
+
+#define PKRAM_REGIONS_LIST_MAX \
+	((PAGE_SIZE-sizeof(struct pkram_region_list))/sizeof(struct pkram_region))
 /*
  * The PKRAM super block contains data needed to restore the preserved memory
  * structure on boot. The pointer to it (pfn) should be passed via the 'pkram'
@@ -98,13 +112,20 @@ struct pkram_node {
  */
 struct pkram_super_block {
 	__u64	node_pfn;		/* first element of the node list */
+	__u64	region_list_pfn;
+	__u64	nr_regions;
 };
 
+static struct pkram_region_list *pkram_regions_list;
+static int pkram_init_regions_list(void);
+static unsigned long pkram_populate_regions_list(void);
+
 static unsigned long pkram_sb_pfn __initdata;
 static struct pkram_super_block *pkram_sb;
 
 extern int pkram_add_identity_map(struct page *page);
 extern void pkram_remove_identity_map(struct page *page);
+extern void pkram_find_preserved(unsigned long start, unsigned long end, void *private, int (*callback)(unsigned long base, unsigned long size, void *private));
 
 /*
  * For convenience sake PKRAM nodes are kept in an auxiliary doubly-linked list
@@ -862,21 +883,48 @@ static void __pkram_reboot(void)
 	struct page *page;
 	struct pkram_node *node;
 	unsigned long node_pfn = 0;
+	unsigned long rl_pfn = 0;
+	unsigned long nr_regions = 0;
+	int err = 0;
 
-	list_for_each_entry_reverse(page, &pkram_nodes, lru) {
-		node = page_address(page);
-		if (WARN_ON(node->flags & PKRAM_ACCMODE_MASK))
-			continue;
-		node->node_pfn = node_pfn;
-		node_pfn = page_to_pfn(page);
+	if (!list_empty(&pkram_nodes)) {
+		err = pkram_add_identity_map(virt_to_page(pkram_sb));
+		if (err) {
+			pr_err("PKRAM: failed to add super block to pagetable\n");
+			goto done;
+		}
+		list_for_each_entry_reverse(page, &pkram_nodes, lru) {
+			node = page_address(page);
+			if (WARN_ON(node->flags & PKRAM_ACCMODE_MASK))
+				continue;
+			node->node_pfn = node_pfn;
+			node_pfn = page_to_pfn(page);
+		}
+		err = pkram_init_regions_list();
+		if (err) {
+			pr_err("PKRAM: failed to init regions list\n");
+			goto done;
+		}
+		nr_regions = pkram_populate_regions_list();
+		if (IS_ERR_VALUE(nr_regions)) {
+			err = nr_regions;
+			pr_err("PKRAM: failed to populate regions list\n");
+			goto done;
+		}
+		rl_pfn = page_to_pfn(virt_to_page(pkram_regions_list));
 	}
 
+done:
 	/*
 	 * Zero out pkram_sb completely since it may have been passed from
 	 * the previous boot.
 	 */
 	memset(pkram_sb, 0, PAGE_SIZE);
-	pkram_sb->node_pfn = node_pfn;
+	if (!err && node_pfn) {
+		pkram_sb->node_pfn = node_pfn;
+		pkram_sb->region_list_pfn = rl_pfn;
+		pkram_sb->nr_regions = nr_regions;
+	}
 }
 
 static int pkram_reboot(struct notifier_block *notifier,
@@ -952,3 +1000,124 @@ static int __init pkram_init(void)
 	return 0;
 }
 module_init(pkram_init);
+
+static int count_region_cb(unsigned long base, unsigned long size, void *private)
+{
+	unsigned long *nr_regions = (unsigned long *)private;
+
+	(*nr_regions)++;
+	return 0;
+}
+
+static unsigned long pkram_count_regions(void)
+{
+	unsigned long nr_regions = 0;
+
+	pkram_find_preserved(0, PHYS_ADDR_MAX, &nr_regions, count_region_cb);
+
+	return nr_regions;
+}
+
+/*
+ * To facilitate rapidly building a new memblock reserved list during boot
+ * with the addition of preserved memory ranges a regions list is built
+ * before reboot.
+ * The regions list is a linked list of pages with each page containing an
+ * array of preserved memory ranges.  The ranges are stored in each page
+ * and across the list in address order.  A linked list is used rather than
+ * a single contiguous range to mitigate against the possibility that a
+ * larger, contiguous allocation may fail due to fragmentation.
+ *
+ * Since the pages of the regions list must be preserved and the pkram
+ * pagetable is used to determine what ranges are preserved, the list pages
+ * must be allocated and represented in the pkram pagetable before they can
+ * be populated.  Rather than recounting the number of regions after
+ * allocating pages and repeating until a precise number of pages is
+ * allocated, the number of pages needed is estimated.
+ */
+static int pkram_init_regions_list(void)
+{
+	struct pkram_region_list *rl;
+	unsigned long nr_regions;
+	unsigned long nr_lpages;
+	struct page *page;
+
+	nr_regions = pkram_count_regions();
+
+	nr_lpages = DIV_ROUND_UP(nr_regions, PKRAM_REGIONS_LIST_MAX);
+	nr_regions += nr_lpages;
+	nr_lpages = DIV_ROUND_UP(nr_regions, PKRAM_REGIONS_LIST_MAX);
+
+	for (; nr_lpages; nr_lpages--) {
+		page = pkram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+		if (!page)
+			return -ENOMEM;
+		rl = page_address(page);
+		if (pkram_regions_list) {
+			rl->next_pfn = page_to_pfn(virt_to_page(pkram_regions_list));
+			pkram_regions_list->prev_pfn = page_to_pfn(page);
+		}
+		pkram_regions_list = rl;
+	}
+
+	return 0;
+}
+
+struct pkram_regions_priv {
+	struct pkram_region_list *curr;
+	struct pkram_region_list *last;
+	unsigned long nr_regions;
+	int idx;
+};
+
+static int add_region_cb(unsigned long base, unsigned long size, void *private)
+{
+	struct pkram_regions_priv *priv;
+	struct pkram_region_list *rl;
+	int i;
+
+	priv = (struct pkram_regions_priv *)private;
+	rl = priv->curr;
+	i = priv->idx;
+
+	if (!rl) {
+		WARN_ON(1);
+		return 1;
+	}
+
+	if (!i)
+		priv->last = priv->curr;
+
+	rl->regions[i].base = base;
+	rl->regions[i].size = size;
+
+	priv->nr_regions++;
+	i++;
+	if (i == PKRAM_REGIONS_LIST_MAX) {
+		u64 next_pfn = rl->next_pfn;
+
+		if (next_pfn)
+			priv->curr = pfn_to_kaddr(next_pfn);
+		else
+			priv->curr = NULL;
+
+		i = 0;
+	}
+	priv->idx = i;
+
+	return 0;
+}
+
+static unsigned long pkram_populate_regions_list(void)
+{
+	struct pkram_regions_priv priv = { .curr = pkram_regions_list };
+
+	pkram_find_preserved(0, PHYS_ADDR_MAX, &priv, add_region_cb);
+
+	/*
+	 * Link the first node to the last populated one.
+	 */
+	pkram_regions_list->prev_pfn = page_to_pfn(virt_to_page(priv.last));
+
+	return priv.nr_regions;
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 11/43] PKRAM: prepare for adding preserved ranges to memblock reserved
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Calling memblock_reserve() repeatedly to add preserved ranges is
inefficient and risks clobbering preserved memory if the memblock
reserved regions array must be resized.  Instead, calculate the size
needed to accommodate the preserved ranges, find a suitable range for
a new reserved regions array that does not overlap any preserved range,
and populate it with a new, merged regions array.
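
The core of pkram_create_merged_reserved() below is a plain two-way
merge of two range arrays, each sorted by base address and mutually
non-overlapping.  The standalone sketch below shows just that idea
using a hypothetical struct range instead of the memblock and pkram
types used in the real code:

struct range {
	unsigned long base;
	unsigned long size;
};

/*
 * Merge sorted, mutually non-overlapping arrays a[] and b[] into out[]
 * (sized for na + nb entries).  Returns the number of entries written,
 * or -1 if an unexpected overlap is found.
 */
static int merge_ranges(struct range *out, const struct range *a, int na,
			const struct range *b, int nb)
{
	int i = 0, j = 0, k = 0;

	while (i < na && j < nb) {
		if (a[i].base + a[i].size <= b[j].base)
			out[k++] = a[i++];
		else if (b[j].base + b[j].size <= a[i].base)
			out[k++] = b[j++];
		else
			return -1;
	}
	while (i < na)
		out[k++] = a[i++];
	while (j < nb)
		out[k++] = b[j++];

	return k;		/* equals na + nb on success */
}

In the patch the "a" side is memblock.reserved and the "b" side is the
preserved ranges read back with pkram_first_region()/pkram_next_region().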

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 241 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 241 insertions(+)

diff --git a/mm/pkram.c b/mm/pkram.c
index 4cfa236a4126..b4a14837946a 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -7,6 +7,7 @@
 #include <linux/kernel.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
+#include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
@@ -1121,3 +1122,243 @@ static unsigned long pkram_populate_regions_list(void)
 
 	return priv.nr_regions;
 }
+
+struct pkram_region *pkram_first_region(struct pkram_super_block *sb, struct pkram_region_list **rlp, int *idx)
+{
+	if (WARN_ON(!sb))
+		return NULL;
+
+	if (WARN_ON(!sb->region_list_pfn))
+		return NULL;
+
+	*rlp = pfn_to_kaddr(sb->region_list_pfn);
+	*idx = 0;
+
+	return &(*rlp)->regions[0];
+}
+
+struct pkram_region *pkram_next_region(struct pkram_region_list **rlp, int *idx)
+{
+	struct pkram_region_list *rl = *rlp;
+	int i = *idx;
+
+	i++;
+	if (i >= PKRAM_REGIONS_LIST_MAX) {
+		if (!rl->next_pfn) {
+			pr_err("PKRAM: %s: no more pkram_region_list pages\n", __func__);
+			return NULL;
+		}
+		rl = pfn_to_kaddr(rl->next_pfn);
+		*rlp = rl;
+		i = 0;
+	}
+	*idx = i;
+
+	if (rl->regions[i].size == 0)
+		return NULL;
+
+	return &rl->regions[i];
+}
+
+struct pkram_region *pkram_first_region_topdown(struct pkram_super_block *sb, struct pkram_region_list **rlp, int *idx)
+{
+	struct pkram_region_list *rl;
+
+	if (WARN_ON(!sb))
+		return NULL;
+
+	if (WARN_ON(!sb->region_list_pfn))
+		return NULL;
+
+	rl = pfn_to_kaddr(sb->region_list_pfn);
+	if (!rl->prev_pfn) {
+		WARN_ON(1);
+		return NULL;
+	}
+	rl = pfn_to_kaddr(rl->prev_pfn);
+
+	*rlp = rl;
+
+	*idx = (sb->nr_regions - 1) % PKRAM_REGIONS_LIST_MAX;
+
+	return &rl->regions[*idx];
+}
+
+struct pkram_region *pkram_next_region_topdown(struct pkram_region_list **rlp, int *idx)
+{
+	struct pkram_region_list *rl = *rlp;
+	int i = *idx;
+
+	if (i == 0) {
+		if (!rl->prev_pfn)
+			return NULL;
+		rl = pfn_to_kaddr(rl->prev_pfn);
+		*rlp = rl;
+		i = PKRAM_REGIONS_LIST_MAX - 1;
+	} else
+		i--;
+
+	*idx = i;
+
+	return &rl->regions[i];
+}
+
+/*
+ * Use the pkram regions list to find an available block of memory that does
+ * not overlap with preserved pages.
+ */
+phys_addr_t __init find_available_topdown(phys_addr_t size)
+{
+	phys_addr_t hole_start, hole_end, hole_size;
+	struct pkram_region_list *rl;
+	struct pkram_region *r;
+	phys_addr_t addr = 0;
+	int idx;
+
+	hole_end = memblock.current_limit;
+	r = pkram_first_region_topdown(pkram_sb, &rl, &idx);
+
+	while (r) {
+		hole_start = r->base + r->size;
+		hole_size = hole_end - hole_start;
+
+		if (hole_size >= size) {
+			addr = memblock_find_in_range(hole_start, hole_end,
+							size, PAGE_SIZE);
+			if (addr)
+				break;
+		}
+
+		hole_end = r->base;
+		r = pkram_next_region_topdown(&rl, &idx);
+	}
+
+	if (!addr)
+		addr = memblock_find_in_range(0, hole_end, size, PAGE_SIZE);
+
+	return addr;
+}
+
+int __init pkram_create_merged_reserved(struct memblock_type *new)
+{
+	unsigned long cnt_a;
+	unsigned long cnt_b;
+	long i, j, k;
+	struct memblock_region *r;
+	struct memblock_region *rgn;
+	struct pkram_region *pkr;
+	struct pkram_region_list *rl;
+	int idx;
+	unsigned long total_size = 0;
+	unsigned long nr_preserved = 0;
+
+	cnt_a = memblock.reserved.cnt;
+	cnt_b = pkram_sb->nr_regions;
+
+	i = 0;
+	j = 0;
+	k = 0;
+
+	pkr = pkram_first_region(pkram_sb, &rl, &idx);
+	if (!pkr)
+		return -EINVAL;
+	while (i < cnt_a && j < cnt_b && pkr) {
+		r = &memblock.reserved.regions[i];
+		rgn = &new->regions[k];
+
+		if (r->base + r->size <= pkr->base) {
+			*rgn = *r;
+			i++;
+		} else if (pkr->base + pkr->size <= r->base) {
+			rgn->base = pkr->base;
+			rgn->size = pkr->size;
+			memblock_set_region_node(rgn, MAX_NUMNODES);
+
+			nr_preserved += (rgn->size >> PAGE_SHIFT);
+			pkr = pkram_next_region(&rl, &idx);
+			j++;
+		} else {
+			pr_err("PKRAM: unexpected overlap:\n");
+			pr_err("PKRAM: reserved: base=%pa,size=%pa,flags=0x%x\n", &r->base, &r->size, (int)r->flags);
+			pr_err("PKRAM: pkram: base=%pa,size=%pa\n", &pkr->base, &pkr->size);
+			return -EBUSY;
+		}
+		total_size += rgn->size;
+		k++;
+	}
+
+	while (i < cnt_a) {
+		r = &memblock.reserved.regions[i];
+		rgn = &new->regions[k];
+
+		*rgn = *r;
+
+		total_size += rgn->size;
+		i++;
+		k++;
+	}
+	while (j < cnt_b && pkr) {
+		rgn = &new->regions[k];
+		rgn->base = pkr->base;
+		rgn->size = pkr->size;
+		memblock_set_region_node(rgn, MAX_NUMNODES);
+
+		nr_preserved += (rgn->size >> PAGE_SHIFT);
+		total_size += rgn->size;
+		pkr = pkram_next_region(&rl, &idx);
+		j++;
+		k++;
+	}
+
+	WARN_ON(cnt_a + cnt_b != k);
+	new->cnt = cnt_a + cnt_b;
+	new->total_size = total_size;
+
+	return 0;
+}
+
+/*
+ * Reserve pages that belong to preserved memory.  This is accomplished by
+ * merging the existing reserved ranges with the preserved ranges into
+ * a new, sufficiently sized memblock reserved array.
+ *
+ * This function should be called at boot time as early as possible to prevent
+ * preserved memory from being recycled.
+ */
+int __init pkram_merge_with_reserved(void)
+{
+	struct memblock_type new;
+	unsigned long new_max;
+	phys_addr_t new_size;
+	phys_addr_t addr;
+	int err;
+
+	/*
+	 * Need space to insert one more range into memblock.reserved
+	 * without memblock_double_array() being called.
+	 */
+	if (memblock.reserved.cnt == memblock.reserved.max) {
+		WARN_ONCE(1, "PKRAM: no space for new memblock list\n");
+		return -ENOMEM;
+	}
+
+	new_max = memblock.reserved.max + pkram_sb->nr_regions;
+	new_size = PAGE_ALIGN(sizeof(struct memblock_region) * new_max);
+
+	addr = find_available_topdown(new_size);
+	if (!addr || memblock_reserve(addr, new_size))
+		return -ENOMEM;
+
+	new.regions = __va(addr);
+	new.max = new_max;
+	err = pkram_create_merged_reserved(&new);
+	if (err)
+		return err;
+
+	memblock.reserved.cnt = new.cnt;
+	memblock.reserved.max = new.max;
+	memblock.reserved.total_size = new.total_size;
+	memblock.reserved.regions = new.regions;
+
+	return 0;
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 12/43] mm: PKRAM: reserve preserved memory at boot
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Keep preserved pages from being recycled by adding them to the
memblock reserved list during early boot. If memory reservation fails
(e.g. because a region has already been reserved), all preserved pages
are dropped.
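
The <linux/pkram.h> hunk keeps callers free of #ifdefs: with
CONFIG_PKRAM disabled, pkram_reserve() becomes an empty inline and
pkram_reserved_pages is the constant 0UL.  A sketch of a hypothetical
caller (the function name is made up for illustration):

#include <linux/pkram.h>

void __init example_arch_reserve(void)
{
	/* No-op when CONFIG_PKRAM is not set. */
	pkram_reserve();

	if (pkram_reserved_pages)
		pr_info("preserved memory: %lu pages reserved\n",
			pkram_reserved_pages);
}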

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/kernel/setup.c |  3 ++
 arch/x86/mm/init_64.c   |  2 ++
 include/linux/pkram.h   |  8 ++++++
 mm/pkram.c              | 76 +++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d883176ef2ce..fbd85964719d 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -15,6 +15,7 @@
 #include <linux/iscsi_ibft.h>
 #include <linux/memblock.h>
 #include <linux/pci.h>
+#include <linux/pkram.h>
 #include <linux/root_dev.h>
 #include <linux/hugetlb.h>
 #include <linux/tboot.h>
@@ -1146,6 +1147,8 @@ void __init setup_arch(char **cmdline_p)
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
+	pkram_reserve();
+
 	if (boot_cpu_has(X86_FEATURE_GBPAGES))
 		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b5a3fa4033d3..8efb2fb2a88b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -33,6 +33,7 @@
 #include <linux/nmi.h>
 #include <linux/gfp.h>
 #include <linux/kcore.h>
+#include <linux/pkram.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
@@ -1293,6 +1294,7 @@ void __init mem_init(void)
 	after_bootmem = 1;
 	x86_init.hyper.init_after_bootmem();
 
+	totalram_pages_add(pkram_reserved_pages);
 	/*
 	 * Must be done after boot memory is put on freelist, because here we
 	 * might set fields in deferred struct pages that have not yet been
diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 4f95d4fb5339..8d3d780d9fe1 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -99,4 +99,12 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count);
 size_t pkram_read(struct pkram_access *pa, void *buf, size_t count);
 
+#ifdef CONFIG_PKRAM
+extern unsigned long pkram_reserved_pages;
+void pkram_reserve(void);
+#else
+#define pkram_reserved_pages 0UL
+static inline void pkram_reserve(void) { }
+#endif
+
 #endif /* _LINUX_PKRAM_H */
diff --git a/mm/pkram.c b/mm/pkram.c
index b4a14837946a..03731bb6af26 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -135,6 +135,8 @@ struct pkram_super_block {
 static LIST_HEAD(pkram_nodes);			/* linked through page::lru */
 static DEFINE_MUTEX(pkram_mutex);		/* serializes open/close */
 
+unsigned long __initdata pkram_reserved_pages;
+
 /*
  * The PKRAM super block pfn, see above.
  */
@@ -144,6 +146,59 @@ static int __init parse_pkram_sb_pfn(char *arg)
 }
 early_param("pkram", parse_pkram_sb_pfn);
 
+static void * __init pkram_map_meta(unsigned long pfn)
+{
+	if (pfn >= max_low_pfn)
+		return ERR_PTR(-EINVAL);
+	return pfn_to_kaddr(pfn);
+}
+
+int pkram_merge_with_reserved(void);
+/*
+ * Reserve pages that belong to preserved memory.
+ *
+ * This function should be called at boot time as early as possible to prevent
+ * preserved memory from being recycled.
+ */
+void __init pkram_reserve(void)
+{
+	int err = 0;
+
+	if (!pkram_sb_pfn)
+		return;
+
+	pr_info("PKRAM: Examining preserved memory...\n");
+
+	/* Verify that nothing else has reserved the pkram_sb page */
+	if (memblock_is_region_reserved(PFN_PHYS(pkram_sb_pfn), PAGE_SIZE)) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	pkram_sb = pkram_map_meta(pkram_sb_pfn);
+	if (IS_ERR(pkram_sb)) {
+		err = PTR_ERR(pkram_sb);
+		goto out;
+	}
+	/* An empty pkram_sb is not an error */
+	if (!pkram_sb->node_pfn) {
+		pkram_sb = NULL;
+		goto done;
+	}
+
+	err = pkram_merge_with_reserved();
+out:
+	if (err) {
+		pr_err("PKRAM: Reservation failed: %d\n", err);
+		WARN_ON(pkram_reserved_pages > 0);
+		pkram_sb = NULL;
+		return;
+	}
+
+done:
+	pr_info("PKRAM: %lu pages reserved\n", pkram_reserved_pages);
+}
+
 static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 {
 	struct page *page;
@@ -163,6 +218,11 @@ static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 
 static inline void pkram_free_page(void *addr)
 {
+	/*
+	 * The page may have the reserved bit set since preserved pages
+	 * are reserved early in boot.
+	 */
+	ClearPageReserved(virt_to_page(addr));
 	pkram_remove_identity_map(virt_to_page(addr));
 	free_page((unsigned long)addr);
 }
@@ -201,6 +261,11 @@ static void pkram_truncate_link(struct pkram_link *link)
 		if (!p)
 			continue;
 		page = pfn_to_page(PHYS_PFN(p));
+		/*
+		 * The page may have the reserved bit set since preserved pages
+		 * are reserved early in boot.
+		 */
+		ClearPageReserved(page);
 		pkram_remove_identity_map(page);
 		put_page(page);
 	}
@@ -684,14 +749,20 @@ static int __pkram_bytes_save_page(struct pkram_access *pa, struct page *page)
 static struct page *__pkram_prep_load_page(pkram_entry_t p)
 {
 	struct page *page;
-	int order;
+	int i, order;
 	short flags;
 
 	flags = (p >> PKRAM_ENTRY_FLAGS_SHIFT) & PKRAM_ENTRY_FLAGS_MASK;
+	order = p & PKRAM_ENTRY_ORDER_MASK;
 	page = pfn_to_page(PHYS_PFN(p));
 
+	for (i = 0; i < (1 << order); i++) {
+		struct page *pg = page + i;
+
+		ClearPageReserved(pg);
+	}
+
 	if (flags & PKRAM_PAGE_TRANS_HUGE) {
-		order = p & PKRAM_ENTRY_ORDER_MASK;
 		prep_compound_page(page, order);
 		prep_transhuge_page(page);
 	}
@@ -1311,6 +1382,7 @@ int __init pkram_create_merged_reserved(struct memblock_type *new)
 	}
 
 	WARN_ON(cnt_a + cnt_b != k);
+	pkram_reserved_pages = nr_preserved;
 	new->cnt = cnt_a + cnt_b;
 	new->total_size = total_size;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 13/43] PKRAM: free the preserved ranges list
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Free the pages used to pass the preserved ranges to the new kernel.
Once the preserved ranges have been merged into the memblock reserved
list, the region list pages are no longer needed and can be returned
to the page allocator.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/mm/init_64.c |  1 +
 include/linux/pkram.h |  2 ++
 mm/pkram.c            | 20 ++++++++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8efb2fb2a88b..69bd71996b8b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1294,6 +1294,7 @@ void __init mem_init(void)
 	after_bootmem = 1;
 	x86_init.hyper.init_after_bootmem();
 
+	pkram_cleanup();
 	totalram_pages_add(pkram_reserved_pages);
 	/*
 	 * Must be done after boot memory is put on freelist, because here we
diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 8d3d780d9fe1..c2099a4f2004 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -102,9 +102,11 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 #ifdef CONFIG_PKRAM
 extern unsigned long pkram_reserved_pages;
 void pkram_reserve(void);
+void pkram_cleanup(void);
 #else
 #define pkram_reserved_pages 0UL
 static inline void pkram_reserve(void) { }
+static inline void pkram_cleanup(void) { }
 #endif
 
 #endif /* _LINUX_PKRAM_H */
diff --git a/mm/pkram.c b/mm/pkram.c
index 03731bb6af26..dab6657080bf 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1434,3 +1434,23 @@ int __init pkram_merge_with_reserved(void)
 
 	return 0;
 }
+
+void __init pkram_cleanup(void)
+{
+	struct pkram_region_list *rl;
+	unsigned long next_pfn;
+
+	if (!pkram_sb || !pkram_reserved_pages)
+		return;
+
+	next_pfn = pkram_sb->region_list_pfn;
+
+	while (next_pfn) {
+		struct page *page = pfn_to_page(next_pfn);
+
+		rl = pfn_to_kaddr(next_pfn);
+		next_pfn = rl->next_pfn;
+		__free_pages_core(page, 0);
+		pkram_reserved_pages--;
+	}
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 14/43] PKRAM: prevent inadvertent use of a stale superblock
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

When pages have been saved to be preserved by the current boot, set
a magic number on the super block to be validated by the next kernel.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/pkram.c b/mm/pkram.c
index dab6657080bf..8670d1633a9d 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -22,6 +22,7 @@
 
 #include "internal.h"
 
+#define PKRAM_MAGIC		0x706B726D	/* ASCII "pkrm" */
 
 /*
  * Represents a reference to a data page saved to PKRAM.
@@ -112,6 +113,8 @@ struct pkram_region_list {
  * The structure occupies a memory page.
  */
 struct pkram_super_block {
+	__u32	magic;
+
 	__u64	node_pfn;		/* first element of the node list */
 	__u64	region_list_pfn;
 	__u64	nr_regions;
@@ -180,6 +183,11 @@ void __init pkram_reserve(void)
 		err = PTR_ERR(pkram_sb);
 		goto out;
 	}
+	if (pkram_sb->magic != PKRAM_MAGIC) {
+		pr_err("PKRAM: invalid super block\n");
+		err = -EINVAL;
+		goto out;
+	}
 	/* An empty pkram_sb is not an error */
 	if (!pkram_sb->node_pfn) {
 		pkram_sb = NULL;
@@ -993,6 +1001,7 @@ static void __pkram_reboot(void)
 	 */
 	memset(pkram_sb, 0, PAGE_SIZE);
 	if (!err && node_pfn) {
+		pkram_sb->magic = PKRAM_MAGIC;
 		pkram_sb->node_pfn = node_pfn;
 		pkram_sb->region_list_pfn = rl_pfn;
 		pkram_sb->nr_regions = nr_regions;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 15/43] PKRAM: provide a way to ban pages from use by PKRAM
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Not all memory ranges can be used for saving preserved-over-kexec data.
For example, a kexec kernel may be loaded before pages are preserved.
The memory regions where the kexec segments will be copied to on kexec
must not contain preserved pages or else they will be clobbered.
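
To make the semantics of pkram_ban_region() below concrete: banning pfns
100-199 and then 150-299 leaves a single tracked region covering 100-299,
and a subsequent ban of 300-349 is folded into it as well, since regions
that merely touch (end + 1 == start) are coalesced rather than stored
separately.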

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |   2 +
 mm/pkram.c            | 205 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 207 insertions(+)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index c2099a4f2004..97a7c2ac44a9 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -103,10 +103,12 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 extern unsigned long pkram_reserved_pages;
 void pkram_reserve(void);
 void pkram_cleanup(void);
+void pkram_ban_region(unsigned long start, unsigned long end);
 #else
 #define pkram_reserved_pages 0UL
 static inline void pkram_reserve(void) { }
 static inline void pkram_cleanup(void) { }
+static inline void pkram_ban_region(unsigned long start, unsigned long end) { }
 #endif
 
 #endif /* _LINUX_PKRAM_H */
diff --git a/mm/pkram.c b/mm/pkram.c
index 8670d1633a9d..d15be75c1032 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -141,6 +141,28 @@ struct pkram_super_block {
 unsigned long __initdata pkram_reserved_pages;
 
 /*
+ * For tracking a region of memory that PKRAM is not allowed to use.
+ */
+struct banned_region {
+	unsigned long start, end;		/* pfn, inclusive */
+};
+
+#define MAX_NR_BANNED		(32 + MAX_NUMNODES * 2)
+
+static unsigned int nr_banned;			/* number of banned regions */
+
+/* banned regions; arranged in ascending order, do not overlap */
+static struct banned_region banned[MAX_NR_BANNED];
+/*
+ * If a page allocated for PKRAM turns out to belong to a banned region,
+ * it is placed on the banned_pages list so subsequent allocation attempts
+ * do not encounter it again. The list is shrunk when system memory is low.
+ */
+static LIST_HEAD(banned_pages);			/* linked through page::lru */
+static DEFINE_SPINLOCK(banned_pages_lock);
+static unsigned long nr_banned_pages;
+
+/*
  * The PKRAM super block pfn, see above.
  */
 static int __init parse_pkram_sb_pfn(char *arg)
@@ -207,12 +229,116 @@ void __init pkram_reserve(void)
 	pr_info("PKRAM: %lu pages reserved\n", pkram_reserved_pages);
 }
 
+/*
+ * Ban pfn range [start..end] (inclusive) from use in PKRAM.
+ */
+void pkram_ban_region(unsigned long start, unsigned long end)
+{
+	int i, merged = -1;
+
+	/* first try to merge the region with an existing one */
+	for (i = nr_banned - 1; i >= 0 && start <= banned[i].end + 1; i--) {
+		if (end + 1 >= banned[i].start) {
+			start = min(banned[i].start, start);
+			end = max(banned[i].end, end);
+			if (merged < 0)
+				merged = i;
+		} else
+			/*
+			 * Regions are arranged in ascending order and do not
+			 * intersect so the merged region cannot jump over its
+			 * predecessors.
+			 */
+			BUG_ON(merged >= 0);
+	}
+
+	i++;
+
+	if (merged >= 0) {
+		banned[i].start = start;
+		banned[i].end = end;
+		/* shift if merged with more than one region */
+		memmove(banned + i + 1, banned + merged + 1,
+			sizeof(*banned) * (nr_banned - merged - 1));
+		nr_banned -= merged - i;
+		return;
+	}
+
+	/*
+	 * The region does not intersect with an existing one;
+	 * try to create a new one.
+	 */
+	if (nr_banned == MAX_NR_BANNED) {
+		pr_err("PKRAM: Failed to ban %lu-%lu: "
+		       "Too many banned regions\n", start, end);
+		return;
+	}
+
+	memmove(banned + i + 1, banned + i,
+		sizeof(*banned) * (nr_banned - i));
+	banned[i].start = start;
+	banned[i].end = end;
+	nr_banned++;
+}
+
+static void pkram_show_banned(void)
+{
+	int i;
+	unsigned long n, total = 0;
+
+	pr_info("PKRAM: banned regions:\n");
+	for (i = 0; i < nr_banned; i++) {
+		n = banned[i].end - banned[i].start + 1;
+		pr_info("%4d: [%08lx - %08lx] %ld pages\n",
+			i, banned[i].start, banned[i].end, n);
+		total += n;
+	}
+	pr_info("Total banned: %ld pages in %d regions\n",
+		total, nr_banned);
+}
+
+/*
+ * Returns true if the page may not be used for storing preserved data.
+ */
+static bool pkram_page_banned(struct page *page)
+{
+	unsigned long epfn, pfn = page_to_pfn(page);
+	int l = 0, r = nr_banned - 1, m;
+
+	epfn = pfn + compound_nr(page) - 1;
+
+	/* do binary search */
+	while (l <= r) {
+		m = (l + r) / 2;
+		if (epfn < banned[m].start)
+			r = m - 1;
+		else if (pfn > banned[m].end)
+			l = m + 1;
+		else
+			return true;
+	}
+	return false;
+}
+
 static inline struct page *pkram_alloc_page(gfp_t gfp_mask)
 {
 	struct page *page;
+	LIST_HEAD(list);
+	unsigned long len = 0;
 	int err;
 
 	page = alloc_page(gfp_mask);
+	while (page && pkram_page_banned(page)) {
+		len++;
+		list_add(&page->lru, &list);
+		page = alloc_page(gfp_mask);
+	}
+	if (len > 0) {
+		spin_lock(&banned_pages_lock);
+		nr_banned_pages += len;
+		list_splice(&list, &banned_pages);
+		spin_unlock(&banned_pages_lock);
+	}
 	if (page) {
 		err = pkram_add_identity_map(page);
 		if (err) {
@@ -235,6 +361,53 @@ static inline void pkram_free_page(void *addr)
 	free_page((unsigned long)addr);
 }
 
+static void __banned_pages_shrink(unsigned long nr_to_scan)
+{
+	struct page *page;
+
+	if (nr_to_scan <= 0)
+		return;
+
+	while (nr_banned_pages > 0) {
+		BUG_ON(list_empty(&banned_pages));
+		page = list_first_entry(&banned_pages, struct page, lru);
+		list_del(&page->lru);
+		__free_page(page);
+		nr_banned_pages--;
+		nr_to_scan--;
+		if (!nr_to_scan)
+			break;
+	}
+}
+
+static unsigned long
+banned_pages_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	return nr_banned_pages;
+}
+
+static unsigned long
+banned_pages_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int nr_left = nr_banned_pages;
+
+	if (!sc->nr_to_scan || !nr_left)
+		return nr_left;
+
+	spin_lock(&banned_pages_lock);
+	__banned_pages_shrink(sc->nr_to_scan);
+	nr_left = nr_banned_pages;
+	spin_unlock(&banned_pages_lock);
+
+	return nr_left;
+}
+
+static struct shrinker banned_pages_shrinker = {
+	.count_objects = banned_pages_count,
+	.scan_objects = banned_pages_scan,
+	.seeks = DEFAULT_SEEKS,
+};
+
 static inline void pkram_insert_node(struct pkram_node *node)
 {
 	list_add(&virt_to_page(node)->lru, &pkram_nodes);
@@ -709,6 +882,31 @@ static int __pkram_save_page(struct pkram_access *pa, struct page *page,
 	return 0;
 }
 
+static int __pkram_save_page_copy(struct pkram_access *pa, struct page *page)
+{
+	int nr_pages = compound_nr(page);
+	pgoff_t index = page->index;
+	int i, err;
+
+	for (i = 0; i < nr_pages; i++, index++) {
+		struct page *p = page + i;
+		struct page *new;
+
+		new = pkram_alloc_page(pa->ps->gfp_mask);
+		if (!new)
+			return -ENOMEM;
+
+		copy_highpage(new, p);
+		err = __pkram_save_page(pa, new, index);
+		if (err) {
+			pkram_free_page(page_address(new));
+			return err;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * Save file page @page to the preserved memory node and object associated
  * with pkram stream access @pa. The stream must have been initialized with
@@ -731,6 +929,10 @@ int pkram_save_file_page(struct pkram_access *pa, struct page *page)
 
 	BUG_ON((node->flags & PKRAM_ACCMODE_MASK) != PKRAM_SAVE);
 
+	/* if page is banned, relocate it */
+	if (pkram_page_banned(page))
+		return __pkram_save_page_copy(pa, page);
+
 	err = __pkram_save_page(pa, page, page->index);
 	if (!err)
 		err = pkram_add_identity_map(page);
@@ -968,6 +1170,7 @@ static void __pkram_reboot(void)
 	int err = 0;
 
 	if (!list_empty(&pkram_nodes)) {
+		pkram_show_banned();
 		err = pkram_add_identity_map(virt_to_page(pkram_sb));
 		if (err) {
 			pr_err("PKRAM: failed to add super block to pagetable\n");
@@ -1054,6 +1257,7 @@ static int __init pkram_init_sb(void)
 		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 		if (!page) {
 			pr_err("PKRAM: Failed to allocate super block\n");
+			__banned_pages_shrink(ULONG_MAX);
 			return 0;
 		}
 		pkram_sb = page_address(page);
@@ -1076,6 +1280,7 @@ static int __init pkram_init(void)
 {
 	if (pkram_init_sb()) {
 		register_reboot_notifier(&pkram_reboot_notifier);
+		register_shrinker(&banned_pages_shrinker);
 		sysfs_update_group(kernel_kobj, &pkram_attr_group);
 	}
 	return 0;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 16/43] kexec: PKRAM: prevent kexec clobbering preserved pages in some cases
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

When loading a kernel for kexec, dynamically add the ranges where the
kexec segments will be copied to on reboot to the list of physical ranges
that must not be used for storing preserved pages. This ensures that no
page preserved after the new kernel has been loaded will reside in these
ranges on reboot.
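
As an example of the pfn rounding used below (assuming 4K pages): a
segment with mem = 0x100800 and memsz = 0x1000 straddles two pages, and
PFN_DOWN(mem) = 0x100 together with PFN_UP(mem + memsz) - 1 = 0x101 bans
exactly the two page frames the segment touches.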

Not yet handled is the case where pages have been preserved before a
kexec kernel is loaded.  This will be covered by a later patch.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 kernel/kexec.c      |  9 +++++++++
 kernel/kexec_file.c | 10 ++++++++++
 2 files changed, 19 insertions(+)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index c82c6c06f051..826c8fb824d8 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/vmalloc.h>
 #include <linux/slab.h>
+#include <linux/pkram.h>
 
 #include "kexec_internal.h"
 
@@ -163,6 +164,14 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
 	if (ret)
 		goto out;
 
+	for (i = 0; i < nr_segments; i++) {
+		unsigned long mem = image->segment[i].mem;
+		size_t memsz = image->segment[i].memsz;
+
+		if (memsz)
+			pkram_ban_region(PFN_DOWN(mem), PFN_UP(mem + memsz) - 1);
+	}
+
 	/* Install the new kernel and uninstall the old */
 	image = xchg(dest_image, image);
 
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 5c3447cf7ad5..1ec47a3c60dd 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -27,6 +27,8 @@
 #include <linux/kernel_read_file.h>
 #include <linux/syscalls.h>
 #include <linux/vmalloc.h>
+#include <linux/pkram.h>
+
 #include "kexec_internal.h"
 
 static int kexec_calculate_store_digests(struct kimage *image);
@@ -429,6 +431,14 @@ void kimage_file_post_load_cleanup(struct kimage *image)
 	if (ret)
 		goto out;
 
+	for (i = 0; i < image->nr_segments; i++) {
+		unsigned long mem = image->segment[i].mem;
+		size_t memsz = image->segment[i].memsz;
+
+		if (memsz)
+			pkram_ban_region(PFN_DOWN(mem), PFN_UP(mem + memsz) - 1);
+	}
+
 	/*
 	 * Free up any temporary buffers allocated which are not needed
 	 * after image has been loaded
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 17/43] PKRAM: provide a way to check if a memory range has preserved pages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

When a kernel is loaded for kexec, the address ranges where the kexec
segments will be copied to may conflict with pages already set to be
preserved. Provide a way to determine if preserved pages exist in a
specified range.
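
The check is built on an internal walker, pkram_find_preserved(), which
invokes a callback for each preserved range intersecting [start, end).
Purely as an illustration (this helper is not part of the patch, and it
assumes a zero return lets the walk continue while a nonzero return stops
it, as the presence check below relies on), another caller could tally
preserved bytes in a range like this:

	/* Hypothetical example, not in this series. */
	static int count_preserved_cb(unsigned long base, unsigned long size,
				      void *private)
	{
		unsigned long *total = private;

		*total += size;	/* accumulate preserved bytes */
		return 0;	/* zero: keep walking the ranges */
	}

	static unsigned long pkram_preserved_bytes(unsigned long start,
						   unsigned long end)
	{
		unsigned long total = 0;

		pkram_find_preserved(start, end, &total, count_preserved_cb);
		return total;
	}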

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  2 ++
 mm/pkram.c            | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 97a7c2ac44a9..977cf45a1bcf 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -104,11 +104,13 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 void pkram_reserve(void);
 void pkram_cleanup(void);
 void pkram_ban_region(unsigned long start, unsigned long end);
+int pkram_has_preserved_pages(unsigned long start, unsigned long end);
 #else
 #define pkram_reserved_pages 0UL
 static inline void pkram_reserve(void) { }
 static inline void pkram_cleanup(void) { }
 static inline void pkram_ban_region(unsigned long start, unsigned long end) { }
+static inline int pkram_has_preserved_pages(unsigned long start, unsigned long end) { return 0; }
 #endif
 
 #endif /* _LINUX_PKRAM_H */
diff --git a/mm/pkram.c b/mm/pkram.c
index d15be75c1032..dcf84ba785a7 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1668,3 +1668,23 @@ void __init pkram_cleanup(void)
 		pkram_reserved_pages--;
 	}
 }
+
+static int has_preserved_pages_cb(unsigned long base, unsigned long size, void *private)
+{
+	int *has_preserved = (int *)private;
+
+	*has_preserved = 1;
+	return 1;
+}
+
+/*
+ * Check whether the memory range [start, end) contains preserved pages.
+ */
+int pkram_has_preserved_pages(unsigned long start, unsigned long end)
+{
+	int has_preserved = 0;
+
+	pkram_find_preserved(start, end, &has_preserved, has_preserved_pages_cb);
+
+	return has_preserved;
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 18/43] kexec: PKRAM: avoid clobbering already preserved pages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Ensure destination ranges of the kexec segments do not overlap
with any kernel pages marked to be preserved across kexec.

For kexec_load, return EADDRNOTAVAIL if overlap is detected.

For kexec_file_load, skip ranges containing preserved pages when
searching for available ranges to use.
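
Note the interval convention: pkram_has_preserved_pages() takes a
half-open [start, end) range, so the segment sanity check below passes
the segment's exclusive end directly, while the hole search in
kexec_file.c passes temp_end + 1 to convert its inclusive end.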

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 kernel/kexec_core.c | 3 +++
 kernel/kexec_file.c | 5 +++++
 2 files changed, 8 insertions(+)

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index a0b6780740c8..fda4abb865ff 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -37,6 +37,7 @@
 #include <linux/compiler.h>
 #include <linux/hugetlb.h>
 #include <linux/objtool.h>
+#include <linux/pkram.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -175,6 +176,8 @@ int sanity_check_segment_list(struct kimage *image)
 			return -EADDRNOTAVAIL;
 		if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
 			return -EADDRNOTAVAIL;
+		if (pkram_has_preserved_pages(mstart, mend))
+			return -EADDRNOTAVAIL;
 	}
 
 	/* Verify our destination addresses do not overlap.
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 1ec47a3c60dd..94109bcdbeff 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -516,6 +516,11 @@ static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
 			continue;
 		}
 
+		if (pkram_has_preserved_pages(temp_start, temp_end + 1)) {
+			temp_start = temp_start - PAGE_SIZE;
+			continue;
+		}
+
 		/* We found a suitable memory range */
 		break;
 	} while (1);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 19/43] mm: PKRAM: allow preserved memory to be freed from userspace
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

To free all space utilized for preserved memory, one can write 0 to
/sys/kernel/pkram. This will destroy all PKRAM nodes that are not
currently being read or written.
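
For example, from a root shell:

    # echo 0 > /sys/kernel/pkram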

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index dcf84ba785a7..8700fd77dc67 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -493,6 +493,32 @@ static void pkram_truncate_node(struct pkram_node *node)
 	node->obj_pfn = 0;
 }
 
+/*
+ * Free all nodes that are not under operation.
+ */
+static void pkram_truncate(void)
+{
+	struct page *page, *tmp;
+	struct pkram_node *node;
+	LIST_HEAD(dispose);
+
+	mutex_lock(&pkram_mutex);
+	list_for_each_entry_safe(page, tmp, &pkram_nodes, lru) {
+		node = page_address(page);
+		if (!(node->flags & PKRAM_ACCMODE_MASK))
+			list_move(&page->lru, &dispose);
+	}
+	mutex_unlock(&pkram_mutex);
+
+	while (!list_empty(&dispose)) {
+		page = list_first_entry(&dispose, struct page, lru);
+		list_del(&page->lru);
+		node = page_address(page);
+		pkram_truncate_node(node);
+		pkram_free_page(node);
+	}
+}
+
 static void pkram_add_link(struct pkram_link *link, struct pkram_data_stream *pds)
 {
 	__u64 link_pfn = page_to_pfn(virt_to_page(link));
@@ -1233,8 +1259,19 @@ static ssize_t show_pkram_sb_pfn(struct kobject *kobj,
 	return sprintf(buf, "%lx\n", pfn);
 }
 
+static ssize_t store_pkram_sb_pfn(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	int val;
+
+	if (kstrtoint(buf, 0, &val) || val)
+		return -EINVAL;
+	pkram_truncate();
+	return count;
+}
+
 static struct kobj_attribute pkram_sb_pfn_attr =
-	__ATTR(pkram, 0444, show_pkram_sb_pfn, NULL);
+	__ATTR(pkram, 0644, show_pkram_sb_pfn, store_pkram_sb_pfn);
 
 static struct attribute *pkram_attrs[] = {
 	&pkram_sb_pfn_attr.attr,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 20/43] PKRAM: disable feature when running the kdump kernel
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

The kdump kernel should not preserve or restore pages.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index 8700fd77dc67..aea069cc49be 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/crash_dump.h>
 #include <linux/err.h>
 #include <linux/gfp.h>
 #include <linux/highmem.h>
@@ -189,7 +190,7 @@ void __init pkram_reserve(void)
 {
 	int err = 0;
 
-	if (!pkram_sb_pfn)
+	if (!pkram_sb_pfn || is_kdump_kernel())
 		return;
 
 	pr_info("PKRAM: Examining preserved memory...\n");
@@ -286,6 +287,9 @@ static void pkram_show_banned(void)
 	int i;
 	unsigned long n, total = 0;
 
+	if (is_kdump_kernel())
+		return;
+
 	pr_info("PKRAM: banned regions:\n");
 	for (i = 0; i < nr_banned; i++) {
 		n = banned[i].end - banned[i].start + 1;
@@ -1315,7 +1319,7 @@ static int __init pkram_init_sb(void)
 
 static int __init pkram_init(void)
 {
-	if (pkram_init_sb()) {
+	if (!is_kdump_kernel() && pkram_init_sb()) {
 		register_reboot_notifier(&pkram_reboot_notifier);
 		register_shrinker(&banned_pages_shrinker);
 		sysfs_update_group(kernel_kobj, &pkram_attr_group);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 21/43] x86/KASLR: PKRAM: support physical kaslr
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Avoid regions of memory that contain preserved pages when computing
slots used to select where to put the decompressed kernel.
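
For a sense of scale: with 4K pages and 8-byte phys_addr_t, each
pkram_region_list page added below holds (4096 - 16) / 16 = 255 regions,
so the boot-time walk only needs to chase next_pfn once per 255 preserved
ranges.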

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/boot/compressed/Makefile |   3 ++
 arch/x86/boot/compressed/kaslr.c  |  10 +++-
 arch/x86/boot/compressed/misc.h   |  10 ++++
 arch/x86/boot/compressed/pkram.c  | 109 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 130 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/boot/compressed/pkram.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index e0bc3988c3fa..ef27d411b641 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -93,6 +93,9 @@ ifdef CONFIG_X86_64
 	vmlinux-objs-y += $(obj)/mem_encrypt.o
 	vmlinux-objs-y += $(obj)/pgtable_64.o
 	vmlinux-objs-$(CONFIG_AMD_MEM_ENCRYPT) += $(obj)/sev-es.o
+ifdef CONFIG_RANDOMIZE_BASE
+	vmlinux-objs-$(CONFIG_PKRAM) += $(obj)/pkram.o
+endif
 endif
 
 vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index b92fffbe761f..a007363a7698 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -440,6 +440,7 @@ static bool mem_avoid_overlap(struct mem_vector *img,
 	struct setup_data *ptr;
 	u64 earliest = img->start + img->size;
 	bool is_overlapping = false;
+	struct mem_vector avoid;
 
 	for (i = 0; i < MEM_AVOID_MAX; i++) {
 		if (mem_overlaps(img, &mem_avoid[i]) &&
@@ -453,8 +454,6 @@ static bool mem_avoid_overlap(struct mem_vector *img,
 	/* Avoid all entries in the setup_data linked list. */
 	ptr = (struct setup_data *)(unsigned long)boot_params->hdr.setup_data;
 	while (ptr) {
-		struct mem_vector avoid;
-
 		avoid.start = (unsigned long)ptr;
 		avoid.size = sizeof(*ptr) + ptr->len;
 
@@ -479,6 +478,12 @@ static bool mem_avoid_overlap(struct mem_vector *img,
 		ptr = (struct setup_data *)(unsigned long)ptr->next;
 	}
 
+	if (pkram_has_overlap(img, &avoid) && (avoid.start < earliest)) {
+		*overlap = avoid;
+		earliest = overlap->start;
+		is_overlapping = true;
+	}
+
 	return is_overlapping;
 }
 
@@ -840,6 +845,7 @@ void choose_random_location(unsigned long input,
 		return;
 	}
 
+	pkram_init();
 	boot_params->hdr.loadflags |= KASLR_FLAG;
 
 	if (IS_ENABLED(CONFIG_X86_32))
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 901ea5ebec22..f8232ffd8141 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -116,6 +116,16 @@ static inline void console_init(void)
 { }
 #endif
 
+#ifdef CONFIG_PKRAM
+void pkram_init(void);
+int pkram_has_overlap(struct mem_vector *entry, struct mem_vector *overlap);
+#else
+static inline void pkram_init(void) { }
+static inline int pkram_has_overlap(struct mem_vector *entry,
+				    struct mem_vector *overlap)
+{ return 0; }
+#endif
+
 void set_sev_encryption_mask(void);
 
 #ifdef CONFIG_AMD_MEM_ENCRYPT
diff --git a/arch/x86/boot/compressed/pkram.c b/arch/x86/boot/compressed/pkram.c
new file mode 100644
index 000000000000..60380f074c3f
--- /dev/null
+++ b/arch/x86/boot/compressed/pkram.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "misc.h"
+
+#define PKRAM_MAGIC		0x706B726D
+
+struct pkram_super_block {
+	__u32	magic;
+
+	__u64	node_pfn;
+	__u64	region_list_pfn;
+	__u64	nr_regions;
+};
+
+struct pkram_region {
+	phys_addr_t base;
+	phys_addr_t size;
+};
+
+struct pkram_region_list {
+	__u64	prev_pfn;
+	__u64	next_pfn;
+
+	struct pkram_region regions[0];
+};
+
+#define PKRAM_REGIONS_LIST_MAX \
+	((PAGE_SIZE-sizeof(struct pkram_region_list))/sizeof(struct pkram_region))
+
+static u64 pkram_sb_pfn;
+static struct pkram_super_block *pkram_sb;
+
+void pkram_init(void)
+{
+	struct pkram_super_block *sb;
+	char arg[32];
+
+	if (cmdline_find_option("pkram", arg, sizeof(arg)) > 0) {
+		if (kstrtoull(arg, 16, &pkram_sb_pfn) != 0)
+			return;
+	} else
+		return;
+
+	sb = (struct pkram_super_block *)(pkram_sb_pfn << PAGE_SHIFT);
+	if (sb->magic != PKRAM_MAGIC) {
+		debug_putstr("PKRAM: invalid super block\n");
+		return;
+	}
+
+	pkram_sb = sb;
+}
+
+static struct pkram_region *pkram_first_region(struct pkram_super_block *sb, struct pkram_region_list **rlp, int *idx)
+{
+	if (!sb || !sb->region_list_pfn)
+		return NULL;
+
+	*rlp = (struct pkram_region_list *)(sb->region_list_pfn << PAGE_SHIFT);
+	*idx = 0;
+
+	return &(*rlp)->regions[0];
+}
+
+static struct pkram_region *pkram_next_region(struct pkram_region_list **rlp, int *idx)
+{
+	struct pkram_region_list *rl = *rlp;
+	int i = *idx;
+
+	i++;
+	if (i >= PKRAM_REGIONS_LIST_MAX) {
+		if (!rl->next_pfn) {
+			debug_putstr("PKRAM: no more pkram_region_list pages\n");
+			return NULL;
+		}
+		rl = (struct pkram_region_list *)(rl->next_pfn << PAGE_SHIFT);
+		*rlp = rl;
+		i = 0;
+	}
+	*idx = i;
+
+	if (rl->regions[i].size == 0)
+		return NULL;
+
+	return &rl->regions[i];
+}
+
+int pkram_has_overlap(struct mem_vector *entry, struct mem_vector *overlap)
+{
+	struct pkram_region_list *rl;
+	struct pkram_region *r;
+	int idx;
+
+	r = pkram_first_region(pkram_sb, &rl, &idx);
+
+	while (r) {
+		if (r->base + r->size <= entry->start) {
+			r = pkram_next_region(&rl, &idx);
+			continue;
+		}
+		if (r->base >= entry->start + entry->size)
+			return 0;
+
+		overlap->start = r->base;
+		overlap->size = r->size;
+		return 1;
+	}
+
+	return 0;
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 22/43] x86/boot/compressed/64: use 1GB pages for mappings
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

The pkram kaslr code can incur multiple page faults when it walks its
preserved ranges list via mem_avoid_overlap().  These faults can
quickly exhaust the small number of pages available for allocating
page table pages.

This patch hacks things so that mappings are 1GB, which requires far
fewer page table pages.  As is, this breaks AMD SEV-ES, which expects
the mappings to be 2M.  This could possibly be fixed by updating the
split code to split a 1GB page, provided there aren't any other issues
with using 1GB mappings.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/boot/compressed/ident_map_64.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index f7213d0943b8..6ff02da4cc1a 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -95,8 +95,8 @@ static void add_identity_map(unsigned long start, unsigned long end)
 	int ret;
 
 	/* Align boundary to 2M. */
-	start = round_down(start, PMD_SIZE);
-	end = round_up(end, PMD_SIZE);
+	start = round_down(start, PUD_SIZE);
+	end = round_up(end, PUD_SIZE);
 	if (start >= end)
 		return;
 
@@ -119,6 +119,7 @@ void initialize_identity_maps(void *rmode)
 	mapping_info.context = &pgt_data;
 	mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;
 	mapping_info.kernpg_flag = _KERNPG_TABLE;
+	mapping_info.direct_gbpages = true;
 
 	/*
 	 * It should be impossible for this not to already be true,
@@ -329,8 +330,8 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
 
 	ghcb_fault = sev_es_check_ghcb_fault(address);
 
-	address   &= PMD_MASK;
-	end        = address + PMD_SIZE;
+	address   &= PUD_MASK;
+	end        = address + PUD_SIZE;
 
 	/*
 	 * Check for unexpected error codes. Unexpected are:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 22/43] x86/boot/compressed/64: use 1GB pages for mappings
@ 2021-03-30 21:35   ` Anthony Yznaga
  0 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

The pkram kaslr code can incur multiple page faults when it walks its
preserved ranges list via mem_avoid_overlap().  These faults can
quickly exhaust the small number of pages available for allocating
page table pages.

This patch hacks things so that mappings are 1GB, which requires far
fewer page table pages.  As is, this breaks AMD SEV-ES, which expects
the mappings to be 2M.  This could possibly be fixed by updating the
split code to split a 1GB page, provided there aren't any other issues
with using 1GB mappings.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/boot/compressed/ident_map_64.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index f7213d0943b8..6ff02da4cc1a 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -95,8 +95,8 @@ static void add_identity_map(unsigned long start, unsigned long end)
 	int ret;
 
 	/* Align boundary to 2M. */
-	start = round_down(start, PMD_SIZE);
-	end = round_up(end, PMD_SIZE);
+	start = round_down(start, PUD_SIZE);
+	end = round_up(end, PUD_SIZE);
 	if (start >= end)
 		return;
 
@@ -119,6 +119,7 @@ void initialize_identity_maps(void *rmode)
 	mapping_info.context = &pgt_data;
 	mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;
 	mapping_info.kernpg_flag = _KERNPG_TABLE;
+	mapping_info.direct_gbpages = true;
 
 	/*
 	 * It should be impossible for this not to already be true,
@@ -329,8 +330,8 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
 
 	ghcb_fault = sev_es_check_ghcb_fault(address);
 
-	address   &= PMD_MASK;
-	end        = address + PMD_SIZE;
+	address   &= PUD_MASK;
+	end        = address + PUD_SIZE;
 
 	/*
 	 * Check for unexpected error codes. Unexpected are:
-- 
1.8.3.1


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 23/43] mm: shmem: introduce shmem_insert_page
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

The function inserts a page into a shmem file at a specified offset.
The page can be a regular PAGE_SIZE page or a transparent huge page.
If there is something at the offset (page or swap), the function fails.

The function will be used by the next patch.
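
As an illustration of the intended calling convention, a caller could do
something like the sketch below.  This is a minimal example only, not
part of this patch; the helper name and allocation context are made up:

  /*
   * Hypothetical example: place one newly allocated page at byte
   * offset 'off' of a tmpfs inode.  Mirrors how a later patch calls
   * shmem_insert_page() and then drops its own page reference.
   */
  static int example_insert_at(struct mm_struct *mm, struct inode *inode,
                               loff_t off)
  {
          struct page *page = alloc_page(GFP_KERNEL);
          int err;

          if (!page)
                  return -ENOMEM;

          /* Fails if a page or swap entry already exists at the offset. */
          err = shmem_insert_page(mm, inode, off >> PAGE_SHIFT, page);

          /* On success the page cache and LRU hold their own references. */
          put_page(page);
          return err;
  }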

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/shmem_fs.h |  3 ++
 mm/shmem.c               | 77 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..3f0dd95efd46 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -103,6 +103,9 @@ enum sgp_type {
 extern int shmem_getpage(struct inode *inode, pgoff_t index,
 		struct page **pagep, enum sgp_type sgp);
 
+extern int shmem_insert_page(struct mm_struct *mm, struct inode *inode,
+		pgoff_t index, struct page *page);
+
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
 {
diff --git a/mm/shmem.c b/mm/shmem.c
index b2db4ed0fbc7..60e4f0ad23b9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -755,6 +755,83 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 	BUG_ON(error);
 }
 
+int shmem_insert_page(struct mm_struct *mm, struct inode *inode, pgoff_t index,
+		      struct page *page)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	int err;
+	int nr;
+	pgoff_t hindex = index;
+	bool on_lru = PageLRU(page);
+
+	if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
+		return -EFBIG;
+
+	nr = thp_nr_pages(page);
+retry:
+	err = 0;
+	if (!shmem_inode_acct_block(inode, nr))
+		err = -ENOSPC;
+	if (err) {
+		int retry = 5;
+
+		/*
+		 * Try to reclaim some space by splitting a huge page
+		 * beyond i_size on the filesystem.
+		 */
+		while (retry--) {
+			int ret;
+
+			ret = shmem_unused_huge_shrink(sbinfo, NULL, 1);
+			if (ret == SHRINK_STOP)
+				break;
+			if (ret)
+				goto retry;
+		}
+		goto failed;
+	}
+
+	if (!on_lru) {
+		__SetPageLocked(page);
+		__SetPageSwapBacked(page);
+	} else {
+		lock_page(page);
+	}
+
+	hindex = round_down(index, nr);
+	__SetPageReferenced(page);
+
+	err = shmem_add_to_page_cache(page, mapping, hindex,
+				      NULL, gfp & GFP_RECLAIM_MASK, mm);
+	if (err)
+		goto out_unlock;
+
+	if (!on_lru)
+		lru_cache_add(page);
+
+	spin_lock(&info->lock);
+	info->alloced += nr;
+	inode->i_blocks += BLOCKS_PER_PAGE << thp_order(page);
+	shmem_recalc_inode(inode);
+	spin_unlock(&info->lock);
+
+	flush_dcache_page(page);
+	SetPageUptodate(page);
+	set_page_dirty(page);
+
+	unlock_page(page);
+	return 0;
+
+out_unlock:
+	unlock_page(page);
+	shmem_inode_unacct_blocks(inode, nr);
+failed:
+	return err;
+}
+
 /*
  * Remove swap entry from page cache, free the swap and its page cache.
  */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 23/43] mm: shmem: introduce shmem_insert_page
@ 2021-03-30 21:35   ` Anthony Yznaga
  0 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

The function inserts a page into a shmem file at a specified offset.
The page can be a regular PAGE_SIZE page or a transparent huge page.
If there is something at the offset (page or swap), the function fails.

The function will be used by the next patch.
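
As an illustration of the intended calling convention, a caller could do
something like the sketch below.  This is a minimal example only, not
part of this patch; the helper name and allocation context are made up:

  /*
   * Hypothetical example: place one newly allocated page at byte
   * offset 'off' of a tmpfs inode.  Mirrors how a later patch calls
   * shmem_insert_page() and then drops its own page reference.
   */
  static int example_insert_at(struct mm_struct *mm, struct inode *inode,
                               loff_t off)
  {
          struct page *page = alloc_page(GFP_KERNEL);
          int err;

          if (!page)
                  return -ENOMEM;

          /* Fails if a page or swap entry already exists at the offset. */
          err = shmem_insert_page(mm, inode, off >> PAGE_SHIFT, page);

          /* On success the page cache and LRU hold their own references. */
          put_page(page);
          return err;
  }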

Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/shmem_fs.h |  3 ++
 mm/shmem.c               | 77 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index d82b6f396588..3f0dd95efd46 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -103,6 +103,9 @@ enum sgp_type {
 extern int shmem_getpage(struct inode *inode, pgoff_t index,
 		struct page **pagep, enum sgp_type sgp);
 
+extern int shmem_insert_page(struct mm_struct *mm, struct inode *inode,
+		pgoff_t index, struct page *page);
+
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
 {
diff --git a/mm/shmem.c b/mm/shmem.c
index b2db4ed0fbc7..60e4f0ad23b9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -755,6 +755,83 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 	BUG_ON(error);
 }
 
+int shmem_insert_page(struct mm_struct *mm, struct inode *inode, pgoff_t index,
+		      struct page *page)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	int err;
+	int nr;
+	pgoff_t hindex = index;
+	bool on_lru = PageLRU(page);
+
+	if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
+		return -EFBIG;
+
+	nr = thp_nr_pages(page);
+retry:
+	err = 0;
+	if (!shmem_inode_acct_block(inode, nr))
+		err = -ENOSPC;
+	if (err) {
+		int retry = 5;
+
+		/*
+		 * Try to reclaim some space by splitting a huge page
+		 * beyond i_size on the filesystem.
+		 */
+		while (retry--) {
+			int ret;
+
+			ret = shmem_unused_huge_shrink(sbinfo, NULL, 1);
+			if (ret == SHRINK_STOP)
+				break;
+			if (ret)
+				goto retry;
+		}
+		goto failed;
+	}
+
+	if (!on_lru) {
+		__SetPageLocked(page);
+		__SetPageSwapBacked(page);
+	} else {
+		lock_page(page);
+	}
+
+	hindex = round_down(index, nr);
+	__SetPageReferenced(page);
+
+	err = shmem_add_to_page_cache(page, mapping, hindex,
+				      NULL, gfp & GFP_RECLAIM_MASK, mm);
+	if (err)
+		goto out_unlock;
+
+	if (!on_lru)
+		lru_cache_add(page);
+
+	spin_lock(&info->lock);
+	info->alloced += nr;
+	inode->i_blocks += BLOCKS_PER_PAGE << thp_order(page);
+	shmem_recalc_inode(inode);
+	spin_unlock(&info->lock);
+
+	flush_dcache_page(page);
+	SetPageUptodate(page);
+	set_page_dirty(page);
+
+	unlock_page(page);
+	return 0;
+
+out_unlock:
+	unlock_page(page);
+	shmem_inode_unacct_blocks(inode, nr);
+failed:
+	return err;
+}
+
 /*
  * Remove swap entry from page cache, free the swap and its page cache.
  */
-- 
1.8.3.1


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 24/43] mm: shmem: enable saving to PKRAM
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:35   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

This patch illustrates how the PKRAM API can be used for preserving tmpfs.
Two options are added to tmpfs:
    The 'pkram=' option specifies the PKRAM node to load/save the
    filesystem tree from/to.
    The 'preserve' option initiates preservation of a read-only
    filesystem tree.

If the 'pkram=' option is passed on mount, shmem will look for the
corresponding PKRAM node and load the FS tree from it.

If the 'pkram=' option was passed on mount and the 'preserve' option is
passed on a read-only remount, shmem will save the FS tree to the
PKRAM node.

A typical usage scenario looks like:

 # mount -t tmpfs -o pkram=mytmpfs none /mnt
 # echo something > /mnt/smth
 # mount -o remount,ro,preserve /mnt
 <possibly kexec>
 # mount -t tmpfs -o pkram=mytmpfs none /mnt
 # cat /mnt/smth

Each FS tree is saved into a PKRAM node, and each file is saved into a
PKRAM object. A byte stream written to the object is used for saving file
metadata (name, permissions, etc) while the page stream written to
the object accommodates file content pages and their offsets.
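
For illustration, the per-file save pattern condenses to roughly the
sketch below (a simplified, hypothetical helper; the real logic lives in
save_file() and save_file_content() in this patch):

  /*
   * Illustrative only: metadata goes to the byte stream, content pages
   * to the page stream of the same PKRAM object.
   */
  static int example_save_file(struct pkram_stream *ps,
                               struct file_header *hdr, const char *name,
                               struct page *page)
  {
          PKRAM_ACCESS(pa_bytes, ps, bytes);      /* metadata stream */
          PKRAM_ACCESS(pa_pages, ps, pages);      /* content stream  */
          ssize_t ret;
          int err = 0;

          ret = pkram_write(&pa_bytes, hdr, sizeof(*hdr));
          if (ret >= 0)
                  ret = pkram_write(&pa_bytes, name, hdr->namelen);
          if (ret < 0)
                  err = ret;
          else
                  err = pkram_save_file_page(&pa_pages, page);

          pkram_finish_access(&pa_pages, err == 0);
          pkram_finish_access(&pa_bytes, err == 0);
          return err;
  }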

This implementation serves as a demonstration and is therefore
simplified: it supports only regular files in the root directory, does
not handle multiple hard links, and does not save swapped-out files
(it aborts if any are found). However, it can be elaborated to fully
support tmpfs.

Originally-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/shmem_fs.h |  24 +++
 mm/Makefile              |   2 +-
 mm/shmem.c               |  64 ++++++++
 mm/shmem_pkram.c         | 385 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 474 insertions(+), 1 deletion(-)
 create mode 100644 mm/shmem_pkram.c

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 3f0dd95efd46..78149d702a62 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -26,6 +26,11 @@ struct shmem_inode_info {
 	struct inode		vfs_inode;
 };
 
+#define SHMEM_PKRAM_NAME_MAX	128
+struct shmem_pkram_info {
+	char name[SHMEM_PKRAM_NAME_MAX];
+};
+
 struct shmem_sb_info {
 	unsigned long max_blocks;   /* How many blocks are allowed */
 	struct percpu_counter used_blocks;  /* How many are allocated */
@@ -43,6 +48,8 @@ struct shmem_sb_info {
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
 	struct list_head shrinklist;  /* List of shinkable inodes */
 	unsigned long shrinklist_len; /* Length of shrinklist */
+	struct shmem_pkram_info *pkram;
+	bool preserve;		    /* PKRAM-enabled data is preserved */
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
@@ -106,6 +113,23 @@ extern int shmem_getpage(struct inode *inode, pgoff_t index,
 extern int shmem_insert_page(struct mm_struct *mm, struct inode *inode,
 		pgoff_t index, struct page *page);
 
+#ifdef CONFIG_PKRAM
+extern int shmem_parse_pkram(const char *str, struct shmem_pkram_info **pkram);
+extern void shmem_show_pkram(struct seq_file *seq, struct shmem_pkram_info *pkram,
+			bool preserve);
+extern int shmem_save_pkram(struct super_block *sb);
+extern void shmem_load_pkram(struct super_block *sb);
+extern int shmem_release_pkram(struct super_block *sb);
+#else
+static inline int shmem_parse_pkram(const char *str,
+			struct shmem_pkram_info **pkram) { return 1; }
+static inline void shmem_show_pkram(struct seq_file *seq,
+			struct shmem_pkram_info *pkram, bool preserve) { }
+static inline int shmem_save_pkram(struct super_block *sb) { return 0; }
+static inline void shmem_load_pkram(struct super_block *sb) { }
+static inline int shmem_release_pkram(struct super_block *sb) { return 0; }
+#endif
+
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
 {
diff --git a/mm/Makefile b/mm/Makefile
index f5c0dd0a3707..a4e9dd5545df 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,4 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
-obj-$(CONFIG_PKRAM) += pkram.o pkram_pagetable.o
+obj-$(CONFIG_PKRAM) += pkram.o pkram_pagetable.o shmem_pkram.o
diff --git a/mm/shmem.c b/mm/shmem.c
index 60e4f0ad23b9..c1c5760465f2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -111,16 +111,20 @@ struct shmem_options {
 	unsigned long long blocks;
 	unsigned long long inodes;
 	struct mempolicy *mpol;
+	struct shmem_pkram_info *pkram;
 	kuid_t uid;
 	kgid_t gid;
 	umode_t mode;
 	bool full_inums;
+	bool preserve;
 	int huge;
 	int seen;
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
 #define SHMEM_SEEN_INUMS 8
+#define SHMEM_SEEN_PKRAM 16
+#define SHMEM_SEEN_PRESERVE 32
 };
 
 #ifdef CONFIG_TMPFS
@@ -3441,6 +3445,8 @@ enum shmem_param {
 	Opt_uid,
 	Opt_inode32,
 	Opt_inode64,
+	Opt_pkram,
+	Opt_preserve,
 };
 
 static const struct constant_table shmem_param_enums_huge[] = {
@@ -3462,6 +3468,8 @@ enum shmem_param {
 	fsparam_u32   ("uid",		Opt_uid),
 	fsparam_flag  ("inode32",	Opt_inode32),
 	fsparam_flag  ("inode64",	Opt_inode64),
+	fsparam_string("pkram",		Opt_pkram),
+	fsparam_flag_no("preserve",	Opt_preserve),
 	{}
 };
 
@@ -3545,6 +3553,22 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->full_inums = true;
 		ctx->seen |= SHMEM_SEEN_INUMS;
 		break;
+	case Opt_pkram:
+		if (IS_ENABLED(CONFIG_PKRAM)) {
+			kfree(ctx->pkram);
+			if (shmem_parse_pkram(param->string, &ctx->pkram))
+				goto bad_value;
+			ctx->seen |= SHMEM_SEEN_PKRAM;
+			break;
+		}
+		goto unsupported_parameter;
+	case Opt_preserve:
+		if (IS_ENABLED(CONFIG_PKRAM)) {
+			ctx->preserve = result.boolean;
+			ctx->seen |= SHMEM_SEEN_PRESERVE;
+			break;
+		}
+		goto unsupported_parameter;
 	}
 	return 0;
 
@@ -3641,6 +3665,41 @@ static int shmem_reconfigure(struct fs_context *fc)
 		err = "Current inum too high to switch to 32-bit inums";
 		goto out;
 	}
+	if (ctx->seen & SHMEM_SEEN_PRESERVE) {
+		if (!sbinfo->pkram && !(ctx->seen & SHMEM_SEEN_PKRAM)) {
+			err = "Cannot set preserve/nopreserve. Not enabled for PKRAM";
+			goto out;
+		}
+		if (ctx->preserve && !(fc->sb_flags & SB_RDONLY)) {
+			err = "Cannot preserve. Filesystem must be read-only";
+			goto out;
+		}
+	}
+
+	if (ctx->pkram) {
+		kfree(sbinfo->pkram);
+		sbinfo->pkram = ctx->pkram;
+	}
+
+	if (ctx->seen & SHMEM_SEEN_PRESERVE) {
+		int error;
+
+		if (!sbinfo->preserve && ctx->preserve) {
+			error = shmem_save_pkram(fc->root->d_sb);
+			if (error) {
+				err = "Failed to preserve";
+				goto out;
+			}
+			sbinfo->preserve = true;
+		} else if (sbinfo->preserve && !ctx->preserve) {
+			error = shmem_release_pkram(fc->root->d_sb);
+			if (error) {
+				err = "Failed to unpreserve";
+				goto out;
+			}
+			sbinfo->preserve = false;
+		}
+	}
 
 	if (ctx->seen & SHMEM_SEEN_HUGE)
 		sbinfo->huge = ctx->huge;
@@ -3714,6 +3773,7 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
 #endif
 	shmem_show_mpol(seq, sbinfo->mpol);
+	shmem_show_pkram(seq, sbinfo->pkram, sbinfo->preserve);
 	return 0;
 }
 
@@ -3726,6 +3786,7 @@ static void shmem_put_super(struct super_block *sb)
 	free_percpu(sbinfo->ino_batch);
 	percpu_counter_destroy(&sbinfo->used_blocks);
 	mpol_put(sbinfo->mpol);
+	kfree(sbinfo->pkram);
 	kfree(sbinfo);
 	sb->s_fs_info = NULL;
 }
@@ -3780,6 +3841,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sbinfo->huge = ctx->huge;
 	sbinfo->mpol = ctx->mpol;
 	ctx->mpol = NULL;
+	sbinfo->pkram = ctx->pkram;
+	ctx->pkram = NULL;
 
 	spin_lock_init(&sbinfo->stat_lock);
 	if (percpu_counter_init(&sbinfo->used_blocks, 0, GFP_KERNEL))
@@ -3809,6 +3872,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_root = d_make_root(inode);
 	if (!sb->s_root)
 		goto failed;
+	shmem_load_pkram(sb);
 	return 0;
 
 failed:
diff --git a/mm/shmem_pkram.c b/mm/shmem_pkram.c
new file mode 100644
index 000000000000..904b1b861ce5
--- /dev/null
+++ b/mm/shmem_pkram.c
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/crash_dump.h>
+#include <linux/dcache.h>
+#include <linux/err.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/mutex.h>
+#include <linux/namei.h>
+#include <linux/pagemap.h>
+#include <linux/pagevec.h>
+#include <linux/pkram.h>
+#include <linux/seq_file.h>
+#include <linux/shmem_fs.h>
+#include <linux/spinlock.h>
+#include <linux/string.h>
+#include <linux/time.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+
+struct file_header {
+	__u32	mode;
+	kuid_t	uid;
+	kgid_t	gid;
+	__u32	namelen;
+	__u64	size;
+	__u64	atime;
+	__u64	mtime;
+	__u64	ctime;
+};
+
+int shmem_parse_pkram(const char *str, struct shmem_pkram_info **pkram)
+{
+	struct shmem_pkram_info *new;
+	size_t len;
+
+	len = strlen(str);
+	if (!len || len >= SHMEM_PKRAM_NAME_MAX)
+		return 1;
+	new = kzalloc(sizeof(*new), GFP_KERNEL);
+	if (!new)
+		return 1;
+	strcpy(new->name, str);
+	*pkram = new;
+	return 0;
+}
+
+void shmem_show_pkram(struct seq_file *seq, struct shmem_pkram_info *pkram, bool preserve)
+{
+	if (pkram) {
+		seq_printf(seq, ",pkram=%s", pkram->name);
+		seq_printf(seq, ",%s", preserve ? "preserve" : "nopreserve");
+	}
+}
+
+static int shmem_pkram_name(char *buf, size_t bufsize,
+			   struct shmem_sb_info *sbinfo)
+{
+	if (snprintf(buf, bufsize, "shmem-%s", sbinfo->pkram->name) >= bufsize)
+		return -ENAMETOOLONG;
+	return 0;
+}
+
+static int save_page(struct page *page, struct pkram_access *pa)
+{
+	int err = 0;
+
+	if (page)
+		err = pkram_save_file_page(pa, page);
+
+	return err;
+}
+
+static int save_file_content(struct pkram_stream *ps, struct address_space *mapping)
+{
+	PKRAM_ACCESS(pa, ps, pages);
+	struct pagevec pvec;
+	unsigned long start, end;
+	int err = 0;
+	int i;
+
+	start = 0;
+	end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+	pagevec_init(&pvec);
+	for ( ; ; ) {
+		pvec.nr = find_get_pages_range(mapping, &start, end,
+					PAGEVEC_SIZE, pvec.pages);
+		if (!pvec.nr)
+			break;
+		for (i = 0; i < pagevec_count(&pvec); ) {
+			struct page *page = pvec.pages[i];
+
+			lock_page(page);
+			BUG_ON(page->mapping != mapping);
+			err = save_page(page, &pa);
+			if (PageCompound(page)) {
+				start = page->index + compound_nr(page);
+				i += compound_nr(page);
+			} else {
+				i++;
+			}
+
+			unlock_page(page);
+			if (err)
+				break;
+		}
+		pagevec_release(&pvec);
+		if (err || (start > end))
+			break;
+		cond_resched();
+	}
+
+	pkram_finish_access(&pa, err == 0);
+	return err;
+}
+
+static int save_file(struct dentry *dentry, struct pkram_stream *ps)
+{
+	PKRAM_ACCESS(pa_bytes, ps, bytes);
+	struct inode *inode = dentry->d_inode;
+	umode_t mode = inode->i_mode;
+	struct file_header hdr;
+	ssize_t ret;
+	int err;
+
+	if (WARN_ON_ONCE(!S_ISREG(mode)))
+		return -EINVAL;
+	if (WARN_ON_ONCE(inode->i_nlink > 1))
+		return -EINVAL;
+
+	hdr.mode = mode;
+	hdr.uid = inode->i_uid;
+	hdr.gid = inode->i_gid;
+	hdr.namelen = dentry->d_name.len;
+	hdr.size = i_size_read(inode);
+	hdr.atime = timespec64_to_ns(&inode->i_atime);
+	hdr.mtime = timespec64_to_ns(&inode->i_mtime);
+	hdr.ctime = timespec64_to_ns(&inode->i_ctime);
+
+
+	ret = pkram_write(&pa_bytes, &hdr, sizeof(hdr));
+	if (ret < 0) {
+		err = ret;
+		goto out;
+	}
+	ret = pkram_write(&pa_bytes, dentry->d_name.name, dentry->d_name.len);
+	if (ret < 0) {
+		err = ret;
+		goto out;
+	}
+
+	err = save_file_content(ps, inode->i_mapping);
+out:
+	pkram_finish_access(&pa_bytes, err == 0);
+	return err;
+}
+
+static int save_tree(struct super_block *sb, struct pkram_stream *ps)
+{
+	struct dentry *dentry, *root = sb->s_root;
+	int err = 0;
+
+	inode_lock(d_inode(root));
+	spin_lock(&root->d_lock);
+	list_for_each_entry(dentry, &root->d_subdirs, d_child) {
+		if (d_unhashed(dentry) || !dentry->d_inode)
+			continue;
+		dget(dentry);
+		spin_unlock(&root->d_lock);
+
+		err = pkram_prepare_save_obj(ps, PKRAM_DATA_pages|PKRAM_DATA_bytes);
+		if (!err)
+			err = save_file(dentry, ps);
+		if (!err)
+			pkram_finish_save_obj(ps);
+		spin_lock(&root->d_lock);
+		dput(dentry);
+		if (err)
+			break;
+	}
+	spin_unlock(&root->d_lock);
+	inode_unlock(d_inode(root));
+
+	return err;
+}
+
+int shmem_save_pkram(struct super_block *sb)
+{
+	struct shmem_sb_info *sbinfo = sb->s_fs_info;
+	struct pkram_stream ps;
+	char *buf;
+	int err = -ENOMEM;
+
+	if (!sbinfo || !sbinfo->pkram || is_kdump_kernel())
+		return 0;
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	err = shmem_pkram_name(buf, PAGE_SIZE, sbinfo);
+	if (!err)
+		err = pkram_prepare_save(&ps, buf, GFP_KERNEL);
+	if (err)
+		goto out_free_buf;
+
+	err = save_tree(sb, &ps);
+	if (err)
+		goto out_discard_save;
+
+	pkram_finish_save(&ps);
+	goto out_free_buf;
+
+out_discard_save:
+	pkram_discard_save(&ps);
+out_free_buf:
+	free_page((unsigned long)buf);
+out:
+	if (err)
+		pr_err("SHMEM: PKRAM save failed: %d\n", err);
+
+	return err;
+}
+
+static int load_file_content(struct pkram_stream *ps, struct address_space *mapping)
+{
+	PKRAM_ACCESS(pa, ps, pages);
+	unsigned long index;
+	struct page *page;
+	int err = 0;
+
+	do {
+		page = pkram_load_file_page(&pa, &index);
+		if (!page)
+			break;
+
+		err = shmem_insert_page(current->mm, mapping->host, index, page);
+		put_page(page);
+		cond_resched();
+	} while (!err);
+
+	pkram_finish_access(&pa, err == 0);
+	return err;
+}
+
+static int load_file(struct dentry *parent, struct pkram_stream *ps,
+		     char *buf, size_t bufsize)
+{
+	PKRAM_ACCESS(pa_bytes, ps, bytes);
+	struct dentry *dentry;
+	struct inode *inode;
+	struct file_header hdr;
+	size_t ret;
+	umode_t mode;
+	int namelen;
+	int err = -EINVAL;
+
+	ret = pkram_read(&pa_bytes, &hdr, sizeof(hdr));
+	if (ret != sizeof(hdr))
+		goto out;
+
+	mode = hdr.mode;
+	namelen = hdr.namelen;
+	if (!S_ISREG(mode) || namelen > bufsize)
+		goto out;
+	if (pkram_read(&pa_bytes, buf, namelen) != namelen)
+		goto out;
+
+	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
+
+	dentry = lookup_one_len(buf, parent, namelen);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto out_unlock;
+	}
+
+	err = vfs_create(&init_user_ns, parent->d_inode, dentry, mode, NULL);
+	dput(dentry); /* on success shmem pinned it */
+	if (err)
+		goto out_unlock;
+
+	inode = dentry->d_inode;
+	inode->i_mode = mode;
+	inode->i_uid = hdr.uid;
+	inode->i_gid = hdr.gid;
+	inode->i_atime = ns_to_timespec64(hdr.atime);
+	inode->i_mtime = ns_to_timespec64(hdr.mtime);
+	inode->i_ctime = ns_to_timespec64(hdr.ctime);
+	i_size_write(inode, hdr.size);
+
+	err = load_file_content(ps, inode->i_mapping);
+out_unlock:
+	inode_unlock(d_inode(parent));
+out:
+	pkram_finish_access(&pa_bytes, err == 0);
+	return err;
+}
+
+static int load_tree(struct super_block *sb, struct pkram_stream *ps,
+		     char *buf, size_t bufsize)
+{
+	int err;
+
+	do {
+		err = pkram_prepare_load_obj(ps);
+		if (err) {
+			if (err == -ENODATA)
+				err = 0;
+			break;
+		}
+		err = load_file(sb->s_root, ps, buf, PAGE_SIZE);
+		pkram_finish_load_obj(ps);
+	} while (!err);
+
+	return err;
+}
+
+void shmem_load_pkram(struct super_block *sb)
+{
+	struct shmem_sb_info *sbinfo = sb->s_fs_info;
+	struct pkram_stream ps;
+	char *buf;
+	int err = -ENOMEM;
+
+	if (!sbinfo->pkram)
+		return;
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	err = shmem_pkram_name(buf, PAGE_SIZE, sbinfo);
+	if (!err)
+		err = pkram_prepare_load(&ps, buf);
+	if (err) {
+		if (err == -ENOENT)
+			err = 0;
+		goto out_free_buf;
+	}
+
+	err = load_tree(sb, &ps, buf, PAGE_SIZE);
+
+	pkram_finish_load(&ps);
+out_free_buf:
+	free_page((unsigned long)buf);
+out:
+	if (err)
+		pr_err("SHMEM: PKRAM load failed: %d\n", err);
+}
+
+int shmem_release_pkram(struct super_block *sb)
+{
+	struct shmem_sb_info *sbinfo = sb->s_fs_info;
+	struct pkram_stream ps;
+	char *buf;
+	int err = -ENOMEM;
+
+	if (!sbinfo->pkram)
+		return 0;
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	err = shmem_pkram_name(buf, PAGE_SIZE, sbinfo);
+	if (!err)
+		err = pkram_prepare_load(&ps, buf);
+	if (err) {
+		if (err == -ENOENT)
+			err = 0;
+		goto out_free_buf;
+	}
+
+	pkram_finish_load(&ps);
+out_free_buf:
+	free_page((unsigned long)buf);
+out:
+	if (err)
+		pr_err("SHMEM: PKRAM release failed: %d\n", err);
+
+	return err;
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 24/43] mm: shmem: enable saving to PKRAM
@ 2021-03-30 21:35   ` Anthony Yznaga
  0 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

This patch illustrates how the PKRAM API can be used for preserving tmpfs.
Two options are added to tmpfs:
    The 'pkram=' option specifies the PKRAM node to load/save the
    filesystem tree from/to.
    The 'preserve' option initiates preservation of a read-only
    filesystem tree.

If the 'pkram=' option is passed on mount, shmem will look for the
corresponding PKRAM node and load the FS tree from it.

If the 'pkram=' option was passed on mount and the 'preserve' option is
passed on a read-only remount, shmem will save the FS tree to the
PKRAM node.

A typical usage scenario looks like:

 # mount -t tmpfs -o pkram=mytmpfs none /mnt
 # echo something > /mnt/smth
 # mount -o remount,ro,preserve /mnt
 <possibly kexec>
 # mount -t tmpfs -o pkram=mytmpfs none /mnt
 # cat /mnt/smth

Each FS tree is saved into a PKRAM node, and each file is saved into a
PKRAM object. A byte stream written to the object is used for saving file
metadata (name, permissions, etc) while the page stream written to
the object accommodates file content pages and their offsets.
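
For illustration, the per-file save pattern condenses to roughly the
sketch below (a simplified, hypothetical helper; the real logic lives in
save_file() and save_file_content() in this patch):

  /*
   * Illustrative only: metadata goes to the byte stream, content pages
   * to the page stream of the same PKRAM object.
   */
  static int example_save_file(struct pkram_stream *ps,
                               struct file_header *hdr, const char *name,
                               struct page *page)
  {
          PKRAM_ACCESS(pa_bytes, ps, bytes);      /* metadata stream */
          PKRAM_ACCESS(pa_pages, ps, pages);      /* content stream  */
          ssize_t ret;
          int err = 0;

          ret = pkram_write(&pa_bytes, hdr, sizeof(*hdr));
          if (ret >= 0)
                  ret = pkram_write(&pa_bytes, name, hdr->namelen);
          if (ret < 0)
                  err = ret;
          else
                  err = pkram_save_file_page(&pa_pages, page);

          pkram_finish_access(&pa_pages, err == 0);
          pkram_finish_access(&pa_bytes, err == 0);
          return err;
  }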

This implementation serves as a demonstration and is therefore
simplified: it supports only regular files in the root directory, does
not handle multiple hard links, and does not save swapped-out files
(it aborts if any are found). However, it can be elaborated to fully
support tmpfs.

Originally-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/shmem_fs.h |  24 +++
 mm/Makefile              |   2 +-
 mm/shmem.c               |  64 ++++++++
 mm/shmem_pkram.c         | 385 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 474 insertions(+), 1 deletion(-)
 create mode 100644 mm/shmem_pkram.c

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 3f0dd95efd46..78149d702a62 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -26,6 +26,11 @@ struct shmem_inode_info {
 	struct inode		vfs_inode;
 };
 
+#define SHMEM_PKRAM_NAME_MAX	128
+struct shmem_pkram_info {
+	char name[SHMEM_PKRAM_NAME_MAX];
+};
+
 struct shmem_sb_info {
 	unsigned long max_blocks;   /* How many blocks are allowed */
 	struct percpu_counter used_blocks;  /* How many are allocated */
@@ -43,6 +48,8 @@ struct shmem_sb_info {
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
 	struct list_head shrinklist;  /* List of shinkable inodes */
 	unsigned long shrinklist_len; /* Length of shrinklist */
+	struct shmem_pkram_info *pkram;
+	bool preserve;		    /* PKRAM-enabled data is preserved */
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
@@ -106,6 +113,23 @@ extern int shmem_getpage(struct inode *inode, pgoff_t index,
 extern int shmem_insert_page(struct mm_struct *mm, struct inode *inode,
 		pgoff_t index, struct page *page);
 
+#ifdef CONFIG_PKRAM
+extern int shmem_parse_pkram(const char *str, struct shmem_pkram_info **pkram);
+extern void shmem_show_pkram(struct seq_file *seq, struct shmem_pkram_info *pkram,
+			bool preserve);
+extern int shmem_save_pkram(struct super_block *sb);
+extern void shmem_load_pkram(struct super_block *sb);
+extern int shmem_release_pkram(struct super_block *sb);
+#else
+static inline int shmem_parse_pkram(const char *str,
+			struct shmem_pkram_info **pkram) { return 1; }
+static inline void shmem_show_pkram(struct seq_file *seq,
+			struct shmem_pkram_info *pkram, bool preserve) { }
+static inline int shmem_save_pkram(struct super_block *sb) { return 0; }
+static inline void shmem_load_pkram(struct super_block *sb) { }
+static inline int shmem_release_pkram(struct super_block *sb) { return 0; }
+#endif
+
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
 {
diff --git a/mm/Makefile b/mm/Makefile
index f5c0dd0a3707..a4e9dd5545df 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,4 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
-obj-$(CONFIG_PKRAM) += pkram.o pkram_pagetable.o
+obj-$(CONFIG_PKRAM) += pkram.o pkram_pagetable.o shmem_pkram.o
diff --git a/mm/shmem.c b/mm/shmem.c
index 60e4f0ad23b9..c1c5760465f2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -111,16 +111,20 @@ struct shmem_options {
 	unsigned long long blocks;
 	unsigned long long inodes;
 	struct mempolicy *mpol;
+	struct shmem_pkram_info *pkram;
 	kuid_t uid;
 	kgid_t gid;
 	umode_t mode;
 	bool full_inums;
+	bool preserve;
 	int huge;
 	int seen;
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
 #define SHMEM_SEEN_INUMS 8
+#define SHMEM_SEEN_PKRAM 16
+#define SHMEM_SEEN_PRESERVE 32
 };
 
 #ifdef CONFIG_TMPFS
@@ -3441,6 +3445,8 @@ enum shmem_param {
 	Opt_uid,
 	Opt_inode32,
 	Opt_inode64,
+	Opt_pkram,
+	Opt_preserve,
 };
 
 static const struct constant_table shmem_param_enums_huge[] = {
@@ -3462,6 +3468,8 @@ enum shmem_param {
 	fsparam_u32   ("uid",		Opt_uid),
 	fsparam_flag  ("inode32",	Opt_inode32),
 	fsparam_flag  ("inode64",	Opt_inode64),
+	fsparam_string("pkram",		Opt_pkram),
+	fsparam_flag_no("preserve",	Opt_preserve),
 	{}
 };
 
@@ -3545,6 +3553,22 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->full_inums = true;
 		ctx->seen |= SHMEM_SEEN_INUMS;
 		break;
+	case Opt_pkram:
+		if (IS_ENABLED(CONFIG_PKRAM)) {
+			kfree(ctx->pkram);
+			if (shmem_parse_pkram(param->string, &ctx->pkram))
+				goto bad_value;
+			ctx->seen |= SHMEM_SEEN_PKRAM;
+			break;
+		}
+		goto unsupported_parameter;
+	case Opt_preserve:
+		if (IS_ENABLED(CONFIG_PKRAM)) {
+			ctx->preserve = result.boolean;
+			ctx->seen |= SHMEM_SEEN_PRESERVE;
+			break;
+		}
+		goto unsupported_parameter;
 	}
 	return 0;
 
@@ -3641,6 +3665,41 @@ static int shmem_reconfigure(struct fs_context *fc)
 		err = "Current inum too high to switch to 32-bit inums";
 		goto out;
 	}
+	if (ctx->seen & SHMEM_SEEN_PRESERVE) {
+		if (!sbinfo->pkram && !(ctx->seen & SHMEM_SEEN_PKRAM)) {
+			err = "Cannot set preserve/nopreserve. Not enabled for PKRAM";
+			goto out;
+		}
+		if (ctx->preserve && !(fc->sb_flags & SB_RDONLY)) {
+			err = "Cannot preserve. Filesystem must be read-only";
+			goto out;
+		}
+	}
+
+	if (ctx->pkram) {
+		kfree(sbinfo->pkram);
+		sbinfo->pkram = ctx->pkram;
+	}
+
+	if (ctx->seen & SHMEM_SEEN_PRESERVE) {
+		int error;
+
+		if (!sbinfo->preserve && ctx->preserve) {
+			error = shmem_save_pkram(fc->root->d_sb);
+			if (error) {
+				err = "Failed to preserve";
+				goto out;
+			}
+			sbinfo->preserve = true;
+		} else if (sbinfo->preserve && !ctx->preserve) {
+			error = shmem_release_pkram(fc->root->d_sb);
+			if (error) {
+				err = "Failed to unpreserve";
+				goto out;
+			}
+			sbinfo->preserve = false;
+		}
+	}
 
 	if (ctx->seen & SHMEM_SEEN_HUGE)
 		sbinfo->huge = ctx->huge;
@@ -3714,6 +3773,7 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
 #endif
 	shmem_show_mpol(seq, sbinfo->mpol);
+	shmem_show_pkram(seq, sbinfo->pkram, sbinfo->preserve);
 	return 0;
 }
 
@@ -3726,6 +3786,7 @@ static void shmem_put_super(struct super_block *sb)
 	free_percpu(sbinfo->ino_batch);
 	percpu_counter_destroy(&sbinfo->used_blocks);
 	mpol_put(sbinfo->mpol);
+	kfree(sbinfo->pkram);
 	kfree(sbinfo);
 	sb->s_fs_info = NULL;
 }
@@ -3780,6 +3841,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sbinfo->huge = ctx->huge;
 	sbinfo->mpol = ctx->mpol;
 	ctx->mpol = NULL;
+	sbinfo->pkram = ctx->pkram;
+	ctx->pkram = NULL;
 
 	spin_lock_init(&sbinfo->stat_lock);
 	if (percpu_counter_init(&sbinfo->used_blocks, 0, GFP_KERNEL))
@@ -3809,6 +3872,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_root = d_make_root(inode);
 	if (!sb->s_root)
 		goto failed;
+	shmem_load_pkram(sb);
 	return 0;
 
 failed:
diff --git a/mm/shmem_pkram.c b/mm/shmem_pkram.c
new file mode 100644
index 000000000000..904b1b861ce5
--- /dev/null
+++ b/mm/shmem_pkram.c
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/crash_dump.h>
+#include <linux/dcache.h>
+#include <linux/err.h>
+#include <linux/fs.h>
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/mutex.h>
+#include <linux/namei.h>
+#include <linux/pagemap.h>
+#include <linux/pagevec.h>
+#include <linux/pkram.h>
+#include <linux/seq_file.h>
+#include <linux/shmem_fs.h>
+#include <linux/spinlock.h>
+#include <linux/string.h>
+#include <linux/time.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+
+struct file_header {
+	__u32	mode;
+	kuid_t	uid;
+	kgid_t	gid;
+	__u32	namelen;
+	__u64	size;
+	__u64	atime;
+	__u64	mtime;
+	__u64	ctime;
+};
+
+int shmem_parse_pkram(const char *str, struct shmem_pkram_info **pkram)
+{
+	struct shmem_pkram_info *new;
+	size_t len;
+
+	len = strlen(str);
+	if (!len || len >= SHMEM_PKRAM_NAME_MAX)
+		return 1;
+	new = kzalloc(sizeof(*new), GFP_KERNEL);
+	if (!new)
+		return 1;
+	strcpy(new->name, str);
+	*pkram = new;
+	return 0;
+}
+
+void shmem_show_pkram(struct seq_file *seq, struct shmem_pkram_info *pkram, bool preserve)
+{
+	if (pkram) {
+		seq_printf(seq, ",pkram=%s", pkram->name);
+		seq_printf(seq, ",%s", preserve ? "preserve" : "nopreserve");
+	}
+}
+
+static int shmem_pkram_name(char *buf, size_t bufsize,
+			   struct shmem_sb_info *sbinfo)
+{
+	if (snprintf(buf, bufsize, "shmem-%s", sbinfo->pkram->name) >= bufsize)
+		return -ENAMETOOLONG;
+	return 0;
+}
+
+static int save_page(struct page *page, struct pkram_access *pa)
+{
+	int err = 0;
+
+	if (page)
+		err = pkram_save_file_page(pa, page);
+
+	return err;
+}
+
+static int save_file_content(struct pkram_stream *ps, struct address_space *mapping)
+{
+	PKRAM_ACCESS(pa, ps, pages);
+	struct pagevec pvec;
+	unsigned long start, end;
+	int err = 0;
+	int i;
+
+	start = 0;
+	end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+	pagevec_init(&pvec);
+	for ( ; ; ) {
+		pvec.nr = find_get_pages_range(mapping, &start, end,
+					PAGEVEC_SIZE, pvec.pages);
+		if (!pvec.nr)
+			break;
+		for (i = 0; i < pagevec_count(&pvec); ) {
+			struct page *page = pvec.pages[i];
+
+			lock_page(page);
+			BUG_ON(page->mapping != mapping);
+			err = save_page(page, &pa);
+			if (PageCompound(page)) {
+				start = page->index + compound_nr(page);
+				i += compound_nr(page);
+			} else {
+				i++;
+			}
+
+			unlock_page(page);
+			if (err)
+				break;
+		}
+		pagevec_release(&pvec);
+		if (err || (start > end))
+			break;
+		cond_resched();
+	}
+
+	pkram_finish_access(&pa, err == 0);
+	return err;
+}
+
+static int save_file(struct dentry *dentry, struct pkram_stream *ps)
+{
+	PKRAM_ACCESS(pa_bytes, ps, bytes);
+	struct inode *inode = dentry->d_inode;
+	umode_t mode = inode->i_mode;
+	struct file_header hdr;
+	ssize_t ret;
+	int err;
+
+	if (WARN_ON_ONCE(!S_ISREG(mode)))
+		return -EINVAL;
+	if (WARN_ON_ONCE(inode->i_nlink > 1))
+		return -EINVAL;
+
+	hdr.mode = mode;
+	hdr.uid = inode->i_uid;
+	hdr.gid = inode->i_gid;
+	hdr.namelen = dentry->d_name.len;
+	hdr.size = i_size_read(inode);
+	hdr.atime = timespec64_to_ns(&inode->i_atime);
+	hdr.mtime = timespec64_to_ns(&inode->i_mtime);
+	hdr.ctime = timespec64_to_ns(&inode->i_ctime);
+
+
+	ret = pkram_write(&pa_bytes, &hdr, sizeof(hdr));
+	if (ret < 0) {
+		err = ret;
+		goto out;
+	}
+	ret = pkram_write(&pa_bytes, dentry->d_name.name, dentry->d_name.len);
+	if (ret < 0) {
+		err = ret;
+		goto out;
+	}
+
+	err = save_file_content(ps, inode->i_mapping);
+out:
+	pkram_finish_access(&pa_bytes, err == 0);
+	return err;
+}
+
+static int save_tree(struct super_block *sb, struct pkram_stream *ps)
+{
+	struct dentry *dentry, *root = sb->s_root;
+	int err = 0;
+
+	inode_lock(d_inode(root));
+	spin_lock(&root->d_lock);
+	list_for_each_entry(dentry, &root->d_subdirs, d_child) {
+		if (d_unhashed(dentry) || !dentry->d_inode)
+			continue;
+		dget(dentry);
+		spin_unlock(&root->d_lock);
+
+		err = pkram_prepare_save_obj(ps, PKRAM_DATA_pages|PKRAM_DATA_bytes);
+		if (!err)
+			err = save_file(dentry, ps);
+		if (!err)
+			pkram_finish_save_obj(ps);
+		spin_lock(&root->d_lock);
+		dput(dentry);
+		if (err)
+			break;
+	}
+	spin_unlock(&root->d_lock);
+	inode_unlock(d_inode(root));
+
+	return err;
+}
+
+int shmem_save_pkram(struct super_block *sb)
+{
+	struct shmem_sb_info *sbinfo = sb->s_fs_info;
+	struct pkram_stream ps;
+	char *buf;
+	int err = -ENOMEM;
+
+	if (!sbinfo || !sbinfo->pkram || is_kdump_kernel())
+		return 0;
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	err = shmem_pkram_name(buf, PAGE_SIZE, sbinfo);
+	if (!err)
+		err = pkram_prepare_save(&ps, buf, GFP_KERNEL);
+	if (err)
+		goto out_free_buf;
+
+	err = save_tree(sb, &ps);
+	if (err)
+		goto out_discard_save;
+
+	pkram_finish_save(&ps);
+	goto out_free_buf;
+
+out_discard_save:
+	pkram_discard_save(&ps);
+out_free_buf:
+	free_page((unsigned long)buf);
+out:
+	if (err)
+		pr_err("SHMEM: PKRAM save failed: %d\n", err);
+
+	return err;
+}
+
+static int load_file_content(struct pkram_stream *ps, struct address_space *mapping)
+{
+	PKRAM_ACCESS(pa, ps, pages);
+	unsigned long index;
+	struct page *page;
+	int err = 0;
+
+	do {
+		page = pkram_load_file_page(&pa, &index);
+		if (!page)
+			break;
+
+		err = shmem_insert_page(current->mm, mapping->host, index, page);
+		put_page(page);
+		cond_resched();
+	} while (!err);
+
+	pkram_finish_access(&pa, err == 0);
+	return err;
+}
+
+static int load_file(struct dentry *parent, struct pkram_stream *ps,
+		     char *buf, size_t bufsize)
+{
+	PKRAM_ACCESS(pa_bytes, ps, bytes);
+	struct dentry *dentry;
+	struct inode *inode;
+	struct file_header hdr;
+	size_t ret;
+	umode_t mode;
+	int namelen;
+	int err = -EINVAL;
+
+	ret = pkram_read(&pa_bytes, &hdr, sizeof(hdr));
+	if (ret != sizeof(hdr))
+		goto out;
+
+	mode = hdr.mode;
+	namelen = hdr.namelen;
+	if (!S_ISREG(mode) || namelen > bufsize)
+		goto out;
+	if (pkram_read(&pa_bytes, buf, namelen) != namelen)
+		goto out;
+
+	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
+
+	dentry = lookup_one_len(buf, parent, namelen);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto out_unlock;
+	}
+
+	err = vfs_create(&init_user_ns, parent->d_inode, dentry, mode, NULL);
+	dput(dentry); /* on success shmem pinned it */
+	if (err)
+		goto out_unlock;
+
+	inode = dentry->d_inode;
+	inode->i_mode = mode;
+	inode->i_uid = hdr.uid;
+	inode->i_gid = hdr.gid;
+	inode->i_atime = ns_to_timespec64(hdr.atime);
+	inode->i_mtime = ns_to_timespec64(hdr.mtime);
+	inode->i_ctime = ns_to_timespec64(hdr.ctime);
+	i_size_write(inode, hdr.size);
+
+	err = load_file_content(ps, inode->i_mapping);
+out_unlock:
+	inode_unlock(d_inode(parent));
+out:
+	pkram_finish_access(&pa_bytes, err == 0);
+	return err;
+}
+
+static int load_tree(struct super_block *sb, struct pkram_stream *ps,
+		     char *buf, size_t bufsize)
+{
+	int err;
+
+	do {
+		err = pkram_prepare_load_obj(ps);
+		if (err) {
+			if (err == -ENODATA)
+				err = 0;
+			break;
+		}
+		err = load_file(sb->s_root, ps, buf, PAGE_SIZE);
+		pkram_finish_load_obj(ps);
+	} while (!err);
+
+	return err;
+}
+
+void shmem_load_pkram(struct super_block *sb)
+{
+	struct shmem_sb_info *sbinfo = sb->s_fs_info;
+	struct pkram_stream ps;
+	char *buf;
+	int err = -ENOMEM;
+
+	if (!sbinfo->pkram)
+		return;
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	err = shmem_pkram_name(buf, PAGE_SIZE, sbinfo);
+	if (!err)
+		err = pkram_prepare_load(&ps, buf);
+	if (err) {
+		if (err == -ENOENT)
+			err = 0;
+		goto out_free_buf;
+	}
+
+	err = load_tree(sb, &ps, buf, PAGE_SIZE);
+
+	pkram_finish_load(&ps);
+out_free_buf:
+	free_page((unsigned long)buf);
+out:
+	if (err)
+		pr_err("SHMEM: PKRAM load failed: %d\n", err);
+}
+
+int shmem_release_pkram(struct super_block *sb)
+{
+	struct shmem_sb_info *sbinfo = sb->s_fs_info;
+	struct pkram_stream ps;
+	char *buf;
+	int err = -ENOMEM;
+
+	if (!sbinfo->pkram)
+		return 0;
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	err = shmem_pkram_name(buf, PAGE_SIZE, sbinfo);
+	if (!err)
+		err = pkram_prepare_load(&ps, buf);
+	if (err) {
+		if (err == -ENOENT)
+			err = 0;
+		goto out_free_buf;
+	}
+
+	pkram_finish_load(&ps);
+out_free_buf:
+	free_page((unsigned long)buf);
+out:
+	if (err)
+		pr_err("SHMEM: PKRAM release failed: %d\n", err);
+
+	return err;
+}
-- 
1.8.3.1


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 25/43] mm: shmem: prevent swapping of PKRAM-enabled tmpfs pages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Work around the limitation that shmem pages must be in memory in order
to be preserved by preventing them from being swapped out in the first
place.  Do this by marking shmem pages associated with a PKRAM node
as unevictable.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index c1c5760465f2..8dfe80aeee97 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2400,6 +2400,8 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 		INIT_LIST_HEAD(&info->swaplist);
 		simple_xattrs_init(&info->xattrs);
 		cache_no_acl(inode);
+		if (sbinfo->pkram)
+			mapping_set_unevictable(inode->i_mapping);
 
 		switch (mode & S_IFMT) {
 		default:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 25/43] mm: shmem: prevent swapping of PKRAM-enabled tmpfs pages
@ 2021-03-30 21:36   ` Anthony Yznaga
  0 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Work around the limitation that shmem pages must be in memory in order
to be preserved by preventing them from being swapped out in the first
place.  Do this by marking shmem pages associated with a PKRAM node
as unevictable.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index c1c5760465f2..8dfe80aeee97 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2400,6 +2400,8 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 		INIT_LIST_HEAD(&info->swaplist);
 		simple_xattrs_init(&info->xattrs);
 		cache_no_acl(inode);
+		if (sbinfo->pkram)
+			mapping_set_unevictable(inode->i_mapping);
 
 		switch (mode & S_IFMT) {
 		default:
-- 
1.8.3.1


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 26/43] mm: shmem: specify the mm to use when inserting pages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Explicitly specify the mm to pass to shmem_insert_page() when
the pkram_stream is initialized rather than use the mm of the
current thread.  This will allow for multiple kernel threads to
target the same mm when inserting pages in parallel.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem_pkram.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/shmem_pkram.c b/mm/shmem_pkram.c
index 904b1b861ce5..8682b0c002c0 100644
--- a/mm/shmem_pkram.c
+++ b/mm/shmem_pkram.c
@@ -225,7 +225,7 @@ int shmem_save_pkram(struct super_block *sb)
 	return err;
 }
 
-static int load_file_content(struct pkram_stream *ps, struct address_space *mapping)
+static int load_file_content(struct pkram_stream *ps, struct address_space *mapping, struct mm_struct *mm)
 {
 	PKRAM_ACCESS(pa, ps, pages);
 	unsigned long index;
@@ -237,7 +237,7 @@ static int load_file_content(struct pkram_stream *ps, struct address_space *mapp
 		if (!page)
 			break;
 
-		err = shmem_insert_page(current->mm, mapping->host, index, page);
+		err = shmem_insert_page(mm, mapping->host, index, page);
 		put_page(page);
 		cond_resched();
 	} while (!err);
@@ -291,7 +291,7 @@ static int load_file(struct dentry *parent, struct pkram_stream *ps,
 	inode->i_ctime = ns_to_timespec64(hdr.ctime);
 	i_size_write(inode, hdr.size);
 
-	err = load_file_content(ps, inode->i_mapping);
+	err = load_file_content(ps, inode->i_mapping, current->mm);
 out_unlock:
 	inode_unlock(d_inode(parent));
 out:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 27/43] mm: shmem: when inserting, handle pages already charged to a memcg
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

If shmem_insert_page() is called to insert a page that was preserved
using PKRAM on the current boot (i.e. the preserved page is restored
without an intervening kexec boot), the page is still charged to a
memory cgroup because it was never freed. Don't try to charge it again.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 8dfe80aeee97..44cc158ab34d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -671,7 +671,7 @@ static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo)
 static int shmem_add_to_page_cache(struct page *page,
 				   struct address_space *mapping,
 				   pgoff_t index, void *expected, gfp_t gfp,
-				   struct mm_struct *charge_mm)
+				   struct mm_struct *charge_mm, bool skipcharge)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, compound_order(page));
 	unsigned long i = 0;
@@ -688,7 +688,7 @@ static int shmem_add_to_page_cache(struct page *page,
 	page->mapping = mapping;
 	page->index = index;
 
-	if (!PageSwapCache(page)) {
+	if (!skipcharge && !PageSwapCache(page)) {
 		error = mem_cgroup_charge(page, charge_mm, gfp);
 		if (error) {
 			if (PageTransHuge(page)) {
@@ -770,6 +770,7 @@ int shmem_insert_page(struct mm_struct *mm, struct inode *inode, pgoff_t index,
 	int nr;
 	pgoff_t hindex = index;
 	bool on_lru = PageLRU(page);
+	bool ischarged = page_memcg(page) ? true : false;
 
 	if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
 		return -EFBIG;
@@ -809,7 +810,8 @@ int shmem_insert_page(struct mm_struct *mm, struct inode *inode, pgoff_t index,
 	__SetPageReferenced(page);
 
 	err = shmem_add_to_page_cache(page, mapping, hindex,
-				      NULL, gfp & GFP_RECLAIM_MASK, mm);
+				      NULL, gfp & GFP_RECLAIM_MASK,
+				      mm, ischarged);
 	if (err)
 		goto out_unlock;
 
@@ -1829,7 +1831,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
 
 	error = shmem_add_to_page_cache(page, mapping, index,
 					swp_to_radix_entry(swap), gfp,
-					charge_mm);
+					charge_mm, false);
 	if (error)
 		goto failed;
 
@@ -2009,7 +2011,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 
 	error = shmem_add_to_page_cache(page, mapping, hindex,
 					NULL, gfp & GFP_RECLAIM_MASK,
-					charge_mm);
+					charge_mm, false);
 	if (error)
 		goto unacct;
 	lru_cache_add(page);
@@ -2500,7 +2502,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release;
 
 	ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
-				      gfp & GFP_RECLAIM_MASK, dst_mm);
+				      gfp & GFP_RECLAIM_MASK, dst_mm, false);
 	if (ret)
 		goto out_release;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 28/43] x86/mm/numa: add numa_isolate_memblocks()
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Provide a way for a caller external to the NUMA code to ensure that
memblocks in the memblock reserved list do not cross node boundaries
and have a node id assigned to them.  This will be used by PKRAM to
ensure that initialization of page structs for preserved pages can be
deferred and multithreaded efficiently.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/include/asm/numa.h |  4 ++++
 arch/x86/mm/numa.c          | 32 ++++++++++++++++++++------------
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..632b5b6d8cb3 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -41,6 +41,7 @@ static inline void set_apicid_to_node(int apicid, s16 node)
 }
 
 extern int numa_cpu_node(int cpu);
+extern void __init numa_isolate_memblocks(void);
 
 #else	/* CONFIG_NUMA */
 static inline void set_apicid_to_node(int apicid, s16 node)
@@ -51,6 +52,9 @@ static inline int numa_cpu_node(int cpu)
 {
 	return NUMA_NO_NODE;
 }
+static inline void numa_isolate_memblocks(void)
+{
+}
 #endif	/* CONFIG_NUMA */
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 5eb4dc2b97da..dd85098f9d72 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -473,6 +473,25 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
 	return true;
 }
 
+void __init numa_isolate_memblocks(void)
+{
+	int i;
+
+	/*
+	 * Iterate over all memory known to the x86 architecture,
+	 * and use those ranges to set the nid in memblock.reserved.
+	 * This will split up the memblock regions along node
+	 * boundaries and will set the node IDs as well.
+	 */
+	for (i = 0; i < numa_meminfo.nr_blks; i++) {
+		struct numa_memblk *mb = numa_meminfo.blk + i;
+		int ret;
+
+		ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid);
+		WARN_ON_ONCE(ret);
+	}
+}
+
 /*
  * Mark all currently memblock-reserved physical memory (which covers the
  * kernel's own memory ranges) as hot-unswappable.
@@ -491,19 +510,8 @@ static void __init numa_clear_kernel_node_hotplug(void)
 	 * used by the kernel, but those regions are not split up
 	 * along node boundaries yet, and don't necessarily have their
 	 * node ID set yet either.
-	 *
-	 * So iterate over all memory known to the x86 architecture,
-	 * and use those ranges to set the nid in memblock.reserved.
-	 * This will split up the memblock regions along node
-	 * boundaries and will set the node IDs as well.
 	 */
-	for (i = 0; i < numa_meminfo.nr_blks; i++) {
-		struct numa_memblk *mb = numa_meminfo.blk + i;
-		int ret;
-
-		ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid);
-		WARN_ON_ONCE(ret);
-	}
+	numa_isolate_memblocks();
 
 	/*
 	 * Now go over all reserved memblock regions, to construct a
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread
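
To make the isolation step concrete, here is a minimal standalone sketch
(not kernel code; the addresses and the two-node layout are hypothetical)
of splitting one reserved range at a node boundary so that each resulting
piece carries a single node id, which is what memblock_set_node() does for
memblock.reserved above:

    #include <stdio.h>

    struct range { unsigned long long start, end; int nid; };

    /* Hypothetical layout: node 0 covers [0, 8G), node 1 covers [8G, 16G). */
    #define NODE0_END 0x200000000ULL

    int main(void)
    {
        /* A reserved range that straddles the node boundary. */
        struct range reserved = { 0x1fff00000ULL, 0x200100000ULL, -1 };
        struct range split[2];
        int n = 0;

        if (reserved.start < NODE0_END && reserved.end > NODE0_END) {
            split[n++] = (struct range){ reserved.start, NODE0_END, 0 };
            split[n++] = (struct range){ NODE0_END, reserved.end, 1 };
        } else {
            split[n] = reserved;
            split[n++].nid = reserved.end <= NODE0_END ? 0 : 1;
        }

        for (int i = 0; i < n; i++)
            printf("[%#llx-%#llx) nid %d\n",
                   split[i].start, split[i].end, split[i].nid);
        return 0;
    }

After the split, each reserved region lies entirely within one node, so a
per-node thread can later initialize its page structs without consulting
the others.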

* [RFC v2 29/43] PKRAM: ensure memblocks with preserved pages init'd for numa
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

In order to facilitate fast initialization of page structs for
preserved pages, memblocks with preserved pages must not cross
numa node boundaries and must have a node id assigned to them.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/pkram.c b/mm/pkram.c
index aea069cc49be..b8d6b549fa6c 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -21,6 +21,7 @@
 #include <linux/sysfs.h>
 #include <linux/types.h>
 
+#include <asm/numa.h>
 #include "internal.h"
 
 #define PKRAM_MAGIC		0x706B726D
@@ -226,6 +227,14 @@ void __init pkram_reserve(void)
 		return;
 	}
 
+	/*
+	 * Fix up the reserved memblock list to ensure the
+	 * memblock regions are split along node boundaries
+	 * and have a node ID set.  This will allow the page
+	 * structs for the preserved pages to be initialized
+	 * more efficiently.
+	 */
+	numa_isolate_memblocks();
 done:
 	pr_info("PKRAM: %lu pages reserved\n", pkram_reserved_pages);
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 30/43] memblock: PKRAM: mark memblocks that contain preserved pages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

To support deferred initialization of page structs for preserved pages,
separate memblocks containing preserved pages by setting a new flag
when adding them to the memblock reserved list.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/memblock.h | 6 ++++++
 mm/pkram.c               | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index d13e3cd938b4..39c53d08d9f7 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -37,6 +37,7 @@ enum memblock_flags {
 	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
 	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
 	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
+	MEMBLOCK_PRESERVED	= 0x8,	/* preserved pages region */
 };
 
 /**
@@ -248,6 +249,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
 	return m->flags & MEMBLOCK_NOMAP;
 }
 
+static inline bool memblock_is_preserved(struct memblock_region *m)
+{
+	return m->flags & MEMBLOCK_PRESERVED;
+}
+
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/pkram.c b/mm/pkram.c
index b8d6b549fa6c..08144c18d425 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1607,6 +1607,7 @@ int __init pkram_create_merged_reserved(struct memblock_type *new)
 		} else if (pkr->base + pkr->size <= r->base) {
 			rgn->base = pkr->base;
 			rgn->size = pkr->size;
+			rgn->flags = MEMBLOCK_PRESERVED;
 			memblock_set_region_node(rgn, MAX_NUMNODES);
 
 			nr_preserved +=  (rgn->size >> PAGE_SHIFT);
@@ -1636,6 +1637,7 @@ int __init pkram_create_merged_reserved(struct memblock_type *new)
 		rgn = &new->regions[k];
 		rgn->base = pkr->base;
 		rgn->size = pkr->size;
+		rgn->flags = MEMBLOCK_PRESERVED;
 		memblock_set_region_node(rgn, MAX_NUMNODES);
 
 		nr_preserved += (rgn->size >> PAGE_SHIFT);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread
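
As a rough illustration of how the new flag is meant to be consumed (a
standalone sketch, not kernel code), regions carrying the preserved flag
can be picked out of the reserved list and handled separately, which is
what the next patch does when deferring their struct page initialization:

    #include <stdio.h>

    enum region_flags {
        REGION_HOTPLUG   = 0x1,
        REGION_MIRROR    = 0x2,
        REGION_NOMAP     = 0x4,
        REGION_PRESERVED = 0x8,     /* the flag added by this patch */
    };

    struct region {
        unsigned long long base, size;
        enum region_flags flags;
    };

    static int region_is_preserved(const struct region *r)
    {
        return r->flags & REGION_PRESERVED;
    }

    int main(void)
    {
        struct region reserved[] = {
            { 0x100000,     0x800000, 0 },
            { 0x40000000, 0x40000000, REGION_PRESERVED },
        };

        for (unsigned int i = 0; i < sizeof(reserved) / sizeof(reserved[0]); i++)
            printf("[%#llx-%#llx) %s\n", reserved[i].base,
                   reserved[i].base + reserved[i].size,
                   region_is_preserved(&reserved[i]) ?
                        "preserved (defer)" : "init now");
        return 0;
    }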

* [RFC v2 31/43] memblock, mm: defer initialization of preserved pages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Preserved pages are represented in the memblock reserved list, but page
structs for pages in the reserved list are initialized early, while boot
is still single-threaded, which means that a large number of preserved
pages can lengthen boot time. To mitigate this, defer initialization of
preserved pages by skipping them when other reserved pages are
initialized and initializing them later with a separate kernel thread.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/mm/init_64.c |  1 -
 include/linux/mm.h    |  2 +-
 mm/memblock.c         | 11 +++++++++--
 mm/page_alloc.c       | 55 +++++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 69bd71996b8b..8efb2fb2a88b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1294,7 +1294,6 @@ void __init mem_init(void)
 	after_bootmem = 1;
 	x86_init.hyper.init_after_bootmem();
 
-	pkram_cleanup();
 	totalram_pages_add(pkram_reserved_pages);
 	/*
 	 * Must be done after boot memory is put on freelist, because here we
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64a71bf20536..2a93b2a6ec8d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2337,7 +2337,7 @@ extern unsigned long free_reserved_area(void *start, void *end,
 extern void adjust_managed_page_count(struct page *page, long count);
 extern void mem_init_print_info(const char *str);
 
-extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
+extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end, int nid);
 
 /* Free the reserved page into the buddy system, so it gets managed. */
 static inline void free_reserved_page(struct page *page)
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..461ea0f85495 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2007,11 +2007,18 @@ static unsigned long __init free_low_memory_core_early(void)
 	unsigned long count = 0;
 	phys_addr_t start, end;
 	u64 i;
+	struct memblock_region *r;
 
 	memblock_clear_hotplug(0, -1);
 
-	for_each_reserved_mem_range(i, &start, &end)
-		reserve_bootmem_region(start, end);
+	for_each_reserved_mem_region(r) {
+		if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT) && memblock_is_preserved(r))
+			continue;
+
+		start = r->base;
+		end = r->base + r->size;
+		reserve_bootmem_region(start, end, NUMA_NO_NODE);
+	}
 
 	/*
 	 * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cfc72873961d..999fcc8fe907 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -72,6 +72,7 @@
 #include <linux/padata.h>
 #include <linux/khugepaged.h>
 #include <linux/buffer_head.h>
+#include <linux/pkram.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -1475,15 +1476,18 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __meminit init_reserved_page(unsigned long pfn)
+static void __meminit init_reserved_page(unsigned long pfn, int nid)
 {
 	pg_data_t *pgdat;
-	int nid, zid;
+	int zid;
 
-	if (!early_page_uninitialised(pfn))
-		return;
+	if (nid == NUMA_NO_NODE) {
+		if (!early_page_uninitialised(pfn))
+			return;
+
+		nid = early_pfn_to_nid(pfn);
+	}
 
-	nid = early_pfn_to_nid(pfn);
 	pgdat = NODE_DATA(nid);
 
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
@@ -1495,7 +1499,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
 	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
 }
 #else
-static inline void init_reserved_page(unsigned long pfn)
+static inline void init_reserved_page(unsigned long pfn, int nid)
 {
 }
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
@@ -1506,7 +1510,7 @@ static inline void init_reserved_page(unsigned long pfn)
  * marks the pages PageReserved. The remaining valid pages are later
  * sent to the buddy page allocator.
  */
-void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
+void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end, int nid)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long end_pfn = PFN_UP(end);
@@ -1515,7 +1519,7 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 		if (pfn_valid(start_pfn)) {
 			struct page *page = pfn_to_page(start_pfn);
 
-			init_reserved_page(start_pfn);
+			init_reserved_page(start_pfn, nid);
 
 			/* Avoid false-positive PageTail() */
 			INIT_LIST_HEAD(&page->lru);
@@ -2008,6 +2012,35 @@ static int __init deferred_init_memmap(void *data)
 	return 0;
 }
 
+#ifdef CONFIG_PKRAM
+static int __init deferred_init_preserved(void *dummy)
+{
+	unsigned long start = jiffies;
+	unsigned long nr_pages = 0;
+	struct memblock_region *r;
+	phys_addr_t spa, epa;
+	int nid;
+
+	for_each_reserved_mem_region(r) {
+		if (!memblock_is_preserved(r))
+			continue;
+
+		spa = r->base;
+		epa = r->base + r->size;
+		nid = memblock_get_region_node(r);
+
+		reserve_bootmem_region(spa, epa, nid);
+		nr_pages += ((epa - spa) >> PAGE_SHIFT);
+	}
+
+	pr_info("initialised %lu preserved pages in %ums\n", nr_pages,
+					jiffies_to_msecs(jiffies - start));
+
+	pgdat_init_report_one_done();
+	return 0;
+}
+#endif /* CONFIG_PKRAM */
+
 /*
  * If this zone has deferred pages, try to grow it by initializing enough
  * deferred pages to satisfy the allocation specified by order, rounded up to
@@ -2107,6 +2140,10 @@ void __init page_alloc_init_late(void)
 
 	/* There will be num_node_state(N_MEMORY) threads */
 	atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
+#ifdef CONFIG_PKRAM
+	atomic_inc(&pgdat_init_n_undone);
+	kthread_run(deferred_init_preserved, NULL, "pgdatainit_preserved");
+#endif
 	for_each_node_state(nid, N_MEMORY) {
 		kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid);
 	}
@@ -2114,6 +2151,8 @@ void __init page_alloc_init_late(void)
 	/* Block until all are initialised */
 	wait_for_completion(&pgdat_init_all_done_comp);
 
+	pkram_cleanup();
+
 	/*
 	 * The number of managed pages has changed due to the initialisation
 	 * so the pcpu batch and high limits needs to be updated or the limits
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread
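
The deferred path reuses the pgdat_init_n_undone counter and completion
that the existing per-node init threads already use. A userspace sketch of
that pattern (an assumed analogue with pthreads, not the kernel code): one
extra worker is added for the preserved ranges, every worker decrements a
shared "undone" counter when it finishes, and the last one to finish wakes
the waiter:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NR_NODE_WORKERS 2        /* stand-in for num_node_state(N_MEMORY) */

    static atomic_int n_undone;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t all_done = PTHREAD_COND_INITIALIZER;

    static void report_one_done(void)    /* ~ pgdat_init_report_one_done() */
    {
        if (atomic_fetch_sub(&n_undone, 1) == 1) {   /* last worker out */
            pthread_mutex_lock(&lock);
            pthread_cond_signal(&all_done);
            pthread_mutex_unlock(&lock);
        }
    }

    static void *node_worker(void *arg)
    {
        printf("init memmap for node %ld\n", (long)arg);
        report_one_done();
        return NULL;
    }

    static void *preserved_worker(void *arg)
    {
        printf("init preserved ranges\n");
        report_one_done();
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NR_NODE_WORKERS + 1];

        atomic_store(&n_undone, NR_NODE_WORKERS);
        atomic_fetch_add(&n_undone, 1);          /* the extra PKRAM worker */

        pthread_create(&tid[NR_NODE_WORKERS], NULL, preserved_worker, NULL);
        for (long i = 0; i < NR_NODE_WORKERS; i++)
            pthread_create(&tid[i], NULL, node_worker, (void *)i);

        pthread_mutex_lock(&lock);               /* ~ wait_for_completion() */
        while (atomic_load(&n_undone) > 0)
            pthread_cond_wait(&all_done, &lock);
        pthread_mutex_unlock(&lock);

        for (int i = 0; i < NR_NODE_WORKERS + 1; i++)
            pthread_join(tid[i], NULL);
        printf("all deferred struct page init done\n");
        return 0;
    }

Built with -pthread, this prints the per-worker messages and then the final
line once every worker has reported in.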

* [RFC v2 32/43] shmem: preserve shmem files a chunk at a time
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

To prepare for multithreading the work to preserve a shmem file,
divide the work into subranges of the total index range of the file.
The chunk size is a rather arbitrary 256k indices.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem_pkram.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 7 deletions(-)

diff --git a/mm/shmem_pkram.c b/mm/shmem_pkram.c
index 8682b0c002c0..e52722b3a709 100644
--- a/mm/shmem_pkram.c
+++ b/mm/shmem_pkram.c
@@ -74,16 +74,14 @@ static int save_page(struct page *page, struct pkram_access *pa)
 	return err;
 }
 
-static int save_file_content(struct pkram_stream *ps, struct address_space *mapping)
+static int save_file_content_range(struct pkram_access *pa,
+				   struct address_space *mapping,
+				   unsigned long start, unsigned long end)
 {
-	PKRAM_ACCESS(pa, ps, pages);
 	struct pagevec pvec;
-	unsigned long start, end;
 	int err = 0;
 	int i;
 
-	start = 0;
-	end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
 	pagevec_init(&pvec);
 	for ( ; ; ) {
 		pvec.nr = find_get_pages_range(mapping, &start, end,
@@ -95,7 +93,7 @@ static int save_file_content(struct pkram_stream *ps, struct address_space *mapp
 
 			lock_page(page);
 			BUG_ON(page->mapping != mapping);
-			err = save_page(page, &pa);
+			err = save_page(page, pa);
 			if (PageCompound(page)) {
 				start = page->index + compound_nr(page);
 				i += compound_nr(page);
@@ -113,10 +111,62 @@ static int save_file_content(struct pkram_stream *ps, struct address_space *mapp
 		cond_resched();
 	}
 
-	pkram_finish_access(&pa, err == 0);
 	return err;
 }
 
+struct shmem_pkram_arg {
+	struct pkram_stream *ps;
+	struct address_space *mapping;
+	struct mm_struct *mm;
+	atomic64_t next;
+};
+
+unsigned long shmem_pkram_max_index_range = 512 * 512;
+
+static int get_save_range(unsigned long max, atomic64_t *next, unsigned long *start, unsigned long *end)
+{
+	unsigned long index;
+ 
+	index = atomic64_fetch_add(shmem_pkram_max_index_range, next);
+	if (index >= max)
+		return -ENODATA;
+ 
+	*start = index;
+	*end = index + shmem_pkram_max_index_range - 1;
+ 
+	return 0;
+}
+
+static int do_save_file_content(struct pkram_stream *ps,
+				struct address_space *mapping,
+				atomic64_t *next)
+{
+	PKRAM_ACCESS(pa, ps, pages);
+	unsigned long start, end, max;
+	int ret;
+ 
+	max = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+ 
+	do {
+		ret = get_save_range(max, next, &start, &end);
+		if (!ret)
+			ret = save_file_content_range(&pa, mapping, start, end);
+	} while (!ret);
+ 
+	if (ret == -ENODATA)
+		ret = 0;
+ 
+	pkram_finish_access(&pa, ret == 0);
+	return ret;
+}
+
+static int save_file_content(struct pkram_stream *ps, struct address_space *mapping)
+{
+	struct shmem_pkram_arg arg = { ps, mapping, NULL, ATOMIC64_INIT(0) };
+ 
+	return do_save_file_content(arg.ps, arg.mapping, &arg.next);
+}
+
 static int save_file(struct dentry *dentry, struct pkram_stream *ps)
 {
 	PKRAM_ACCESS(pa_bytes, ps, bytes);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread
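
256k indices per chunk is 512 * 512 = 262144 page indices, i.e. 1 GiB of
file data with 4 KiB base pages. A minimal standalone sketch of the
arithmetic (the file size is hypothetical, and the shared cursor that
get_save_range() advances with atomic64_fetch_add() is simplified to a
plain variable here):

    #include <stdio.h>

    #define PAGE_SIZE       4096ULL
    #define CHUNK_INDICES   (512ULL * 512ULL)  /* shmem_pkram_max_index_range */
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    int main(void)
    {
        unsigned long long file_size = 10ULL << 30;  /* hypothetical 10 GiB file */
        unsigned long long max = DIV_ROUND_UP(file_size, PAGE_SIZE);
        unsigned long long next = 0;    /* an atomic64_t cursor in the patch */

        while (next < max) {
            unsigned long long start = next;
            unsigned long long end = next + CHUNK_INDICES - 1;

            printf("chunk: indices %llu-%llu\n", start, end);
            next += CHUNK_INDICES;
        }
        return 0;
    }

A 10 GiB file is carved into 10 such chunks, each claimed by whichever
thread advances the cursor first once the work is multithreaded in a later
patch.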

* [RFC v2 33/43] PKRAM: atomically add and remove link pages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Add and remove pkram_link pages from a pkram_obj atomically to prepare
for multithreading.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 39 ++++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index 08144c18d425..382ccf6f789f 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -535,33 +535,42 @@ static void pkram_truncate(void)
 static void pkram_add_link(struct pkram_link *link, struct pkram_data_stream *pds)
 {
 	__u64 link_pfn = page_to_pfn(virt_to_page(link));
+	__u64 *tail = pds->tail_link_pfnp;
+	__u64 tail_pfn;
 
-	if (!*pds->head_link_pfnp) {
+	do {
+		tail_pfn = *tail;
+	} while (cmpxchg64(tail, tail_pfn, link_pfn) != tail_pfn);
+
+	if (!tail_pfn) {
 		*pds->head_link_pfnp = link_pfn;
-		*pds->tail_link_pfnp = link_pfn;
 	} else {
-		struct pkram_link *tail = pfn_to_kaddr(*pds->tail_link_pfnp);
+		struct pkram_link *prev_tail = pfn_to_kaddr(tail_pfn);
 
-		tail->link_pfn = link_pfn;
-		*pds->tail_link_pfnp = link_pfn;
+		prev_tail->link_pfn = link_pfn;
 	}
 }
 
 static struct pkram_link *pkram_remove_link(struct pkram_data_stream *pds)
 {
-	struct pkram_link *link;
+	__u64 *head = pds->head_link_pfnp;
+	__u64 head_pfn = *head;
 
-	if (!*pds->head_link_pfnp)
-		return NULL;
+	while (head_pfn) {
+		struct pkram_link *link = pfn_to_kaddr(head_pfn);
 
-	link = pfn_to_kaddr(*pds->head_link_pfnp);
-	*pds->head_link_pfnp = link->link_pfn;
-	if (!*pds->head_link_pfnp)
-		*pds->tail_link_pfnp = 0;
-	else
-		link->link_pfn = 0;
+		if (cmpxchg64(head, head_pfn, link->link_pfn) == head_pfn) {
+			if (!*head)
+				*pds->tail_link_pfnp = 0;
+			else
+				link->link_pfn = 0;
+			return link;
+		}
 
-	return link;
+		head_pfn = *head;
+	}
+
+	return NULL;
 }
 
 static struct pkram_link *pkram_new_link(struct pkram_data_stream *pds, gfp_t gfp_mask)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread
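
The append in pkram_add_link() is a classic tail-swing: atomically swap the
tail pointer to the new node with compare-and-swap, then point the previous
tail at it. Below is a userspace sketch of that pattern with C11 atomics
(an assumed analogue: plain pointers instead of pfns, and only the add side
is shown; like the patch, it also publishes the head when the list was
empty):

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdio.h>

    struct node {
        struct node *next;
        int val;
    };

    static _Atomic(struct node *) tail;
    static struct node *head;

    static void add_node(struct node *n)    /* ~ pkram_add_link() */
    {
        struct node *prev;

        n->next = NULL;
        do {
            prev = atomic_load(&tail);
        } while (!atomic_compare_exchange_weak(&tail, &prev, n));

        if (!prev)
            head = n;           /* first node: publish the head */
        else
            prev->next = n;     /* link the previous tail to the new node */
    }

    int main(void)
    {
        struct node a = { .val = 1 }, b = { .val = 2 };

        add_node(&a);
        add_node(&b);
        for (struct node *n = head; n; n = n->next)
            printf("%d\n", n->val);
        return 0;
    }

Concurrent adders are safe because only the thread that wins the tail swap
writes the previous tail's next pointer; the removal side in the patch
retries its own cmpxchg on the head pfn in the same way.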

* [RFC v2 34/43] shmem: PKRAM: multithread preserving and restoring shmem pages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Improve performance by multithreading the work to preserve and restore
shmem pages.

When preserving pages each thread saves non-overlapping ranges of a file
to a pkram_obj until all pages are preserved.

When restoring pages each thread loads pages using a local pkram_access.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem_pkram.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 89 insertions(+), 5 deletions(-)

diff --git a/mm/shmem_pkram.c b/mm/shmem_pkram.c
index e52722b3a709..354c2b58962c 100644
--- a/mm/shmem_pkram.c
+++ b/mm/shmem_pkram.c
@@ -115,6 +115,7 @@ static int save_file_content_range(struct pkram_access *pa,
 }
 
 struct shmem_pkram_arg {
+	int *error;
 	struct pkram_stream *ps;
 	struct address_space *mapping;
 	struct mm_struct *mm;
@@ -137,6 +138,16 @@ static int get_save_range(unsigned long max, atomic64_t *next, unsigned long *st
 	return 0;
 }
 
+/* Completion tracking for save_file_content_thr() threads */
+static atomic_t pkram_save_n_undone;
+static DECLARE_COMPLETION(pkram_save_all_done_comp);
+
+static inline void pkram_save_report_one_done(void)
+{
+	if (atomic_dec_and_test(&pkram_save_n_undone))
+		complete(&pkram_save_all_done_comp);
+}
+
 static int do_save_file_content(struct pkram_stream *ps,
 				struct address_space *mapping,
 				atomic64_t *next)
@@ -160,11 +171,40 @@ static int do_save_file_content(struct pkram_stream *ps,
 	return ret;
 }
 
-static int save_file_content(struct pkram_stream *ps, struct address_space *mapping)
+static int save_file_content_thr(void *data)
 {
-	struct shmem_pkram_arg arg = { ps, mapping, NULL, ATOMIC64_INIT(0) };
- 
-	return do_save_file_content(arg.ps, arg.mapping, &arg.next);
+	struct shmem_pkram_arg *arg = data;
+	int ret;
+
+	ret = do_save_file_content(arg->ps, arg->mapping, &arg->next);
+	if (ret && !*arg->error)
+		*arg->error = ret;
+
+	pkram_save_report_one_done();
+	return 0;
+}
+
+static int shmem_pkram_max_threads = 16;
+
+static int save_file_content(struct pkram_stream *ps, struct address_space *mapping)
+ {
+	int err = 0;
+	struct shmem_pkram_arg arg = { &err, ps, mapping, NULL, ATOMIC64_INIT(0) };
+	unsigned int thr, nr_threads;
+
+	nr_threads = num_online_cpus() - 1;
+	nr_threads = clamp_val(shmem_pkram_max_threads, 1, nr_threads);
+
+	if (nr_threads == 1)
+		return do_save_file_content(arg.ps, arg.mapping, &arg.next);
+
+	atomic_set(&pkram_save_n_undone, nr_threads);
+	for (thr = 0; thr < nr_threads; thr++)
+		kthread_run(save_file_content_thr, &arg, "pkram_save%d", thr);
+
+	wait_for_completion(&pkram_save_all_done_comp);
+
+	return err;
 }
 
 static int save_file(struct dentry *dentry, struct pkram_stream *ps)
@@ -275,7 +315,17 @@ int shmem_save_pkram(struct super_block *sb)
 	return err;
 }
 
-static int load_file_content(struct pkram_stream *ps, struct address_space *mapping, struct mm_struct *mm)
+/* Completion tracking for load_file_content_thr() threads */
+static atomic_t pkram_load_n_undone;
+static DECLARE_COMPLETION(pkram_load_all_done_comp);
+
+static inline void pkram_load_report_one_done(void)
+{
+	if (atomic_dec_and_test(&pkram_load_n_undone))
+		complete(&pkram_load_all_done_comp);
+}
+
+static int do_load_file_content(struct pkram_stream *ps, struct address_space *mapping, struct mm_struct *mm)
 {
 	PKRAM_ACCESS(pa, ps, pages);
 	unsigned long index;
@@ -296,6 +346,40 @@ static int load_file_content(struct pkram_stream *ps, struct address_space *mapp
 	return err;
 }
 
+static int load_file_content_thr(void *data)
+{
+	struct shmem_pkram_arg *arg = data;
+	int ret;
+
+	ret = do_load_file_content(arg->ps, arg->mapping, arg->mm);
+	if (ret && !*arg->error)
+		*arg->error = ret;
+
+	pkram_load_report_one_done();
+	return 0;
+}
+
+static int load_file_content(struct pkram_stream *ps, struct address_space *mapping, struct mm_struct *mm)
+{
+	int err = 0;
+	struct shmem_pkram_arg arg = { &err, ps, mapping, mm };
+	unsigned int thr, nr_threads;
+
+	nr_threads = max_t(unsigned int, 1, num_online_cpus() - 1);
+	nr_threads = clamp_val(shmem_pkram_max_threads, 1, nr_threads);
+
+	if (nr_threads == 1)
+		return do_load_file_content(ps, mapping, mm);
+
+	atomic_set(&pkram_load_n_undone, nr_threads);
+	for (thr = 0; thr < nr_threads; thr++)
+		kthread_run(load_file_content_thr, &arg, "pkram_load%d", thr);
+
+	wait_for_completion(&pkram_load_all_done_comp);
+
+	return err;
+}
+
 static int load_file(struct dentry *parent, struct pkram_stream *ps,
 		     char *buf, size_t bufsize)
 {
-- 
1.8.3.1


* [RFC v2 35/43] shmem: introduce shmem_insert_pages()
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Calling shmem_insert_page() to insert one page at a time does not scale
well when multiple threads are inserting pages into the same shmem
segment.  This is primarily due to the locking needed when adding to the
pagecache and LRU, but also due to contention on the shmem_inode_info
lock. To address the shmem_inode_info lock contention and prepare for
future optimizations, introduce shmem_insert_pages(), which allows a
caller to pass an array of pages to be inserted into a shmem segment.
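
For illustration, a sketch of how a caller might use the new interface;
it mirrors the PKRAM restore path converted later in this series.
demo_insert_batch() is hypothetical and error handling is abbreviated.

static int demo_insert_batch(struct mm_struct *mm, struct inode *inode,
			     pgoff_t index, struct page **pages, int npages)
{
	int i, err;

	/* one call covers the whole batch instead of npages calls */
	err = shmem_insert_pages(mm, inode, index, pages, npages);

	/* the page cache holds its own references now; drop ours */
	for (i = 0; i < npages; i++)
		put_page(pages[i]);

	return err;
}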

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/shmem_fs.h |  3 +-
 mm/shmem.c               | 93 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 78149d702a62..bc116c4fe145 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -112,7 +112,8 @@ extern int shmem_getpage(struct inode *inode, pgoff_t index,
 
 extern int shmem_insert_page(struct mm_struct *mm, struct inode *inode,
 		pgoff_t index, struct page *page);
-
+extern int shmem_insert_pages(struct mm_struct *mm, struct inode *inode,
+			      pgoff_t index, struct page *pages[], int npages);
 #ifdef CONFIG_PKRAM
 extern int shmem_parse_pkram(const char *str, struct shmem_pkram_info **pkram);
 extern void shmem_show_pkram(struct seq_file *seq, struct shmem_pkram_info *pkram,
diff --git a/mm/shmem.c b/mm/shmem.c
index 44cc158ab34d..c3fa72061d8a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -838,6 +838,99 @@ int shmem_insert_page(struct mm_struct *mm, struct inode *inode, pgoff_t index,
 	return err;
 }
 
+int shmem_insert_pages(struct mm_struct *charge_mm, struct inode *inode,
+		       pgoff_t index, struct page *pages[], int npages)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	int i, err;
+	int nr = 0;
+
+	for (i = 0; i < npages; i++)
+		nr += thp_nr_pages(pages[i]);
+
+	if (index + nr - 1 > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
+		return -EFBIG;
+
+retry:
+	err = 0;
+	if (!shmem_inode_acct_block(inode, nr))
+		err = -ENOSPC;
+	if (err) {
+		int retry = 5;
+
+		/*
+		 * Try to reclaim some space by splitting a huge page
+		 * beyond i_size on the filesystem.
+		 */
+		while (retry--) {
+			int ret;
+
+			ret = shmem_unused_huge_shrink(sbinfo, NULL, 1);
+			if (ret == SHRINK_STOP)
+				break;
+			if (ret)
+				goto retry;
+		}
+		goto failed;
+	}
+
+	for (i = 0; i < npages; i++) {
+		if (!PageLRU(pages[i])) {
+			__SetPageLocked(pages[i]);
+			__SetPageSwapBacked(pages[i]);
+		} else {
+			lock_page(pages[i]);
+		}
+
+		__SetPageReferenced(pages[i]);
+	}
+
+	for (i = 0; i < npages; i++) {
+		bool ischarged = page_memcg(pages[i]) ? true : false;
+
+		err = shmem_add_to_page_cache(pages[i], mapping, index,
+					NULL, gfp & GFP_RECLAIM_MASK,
+					charge_mm, ischarged);
+		if (err)
+			goto out_release;
+
+		index += thp_nr_pages(pages[i]);
+	}
+
+	spin_lock(&info->lock);
+	info->alloced += nr;
+	inode->i_blocks += BLOCKS_PER_PAGE * nr;
+	shmem_recalc_inode(inode);
+	spin_unlock(&info->lock);
+
+	for (i = 0; i < npages; i++) {
+		if (!PageLRU(pages[i]))
+			lru_cache_add(pages[i]);
+
+		flush_dcache_page(pages[i]);
+		SetPageUptodate(pages[i]);
+		set_page_dirty(pages[i]);
+
+		unlock_page(pages[i]);
+	}
+
+	return 0;
+
+out_release:
+	while (--i >= 0)
+		delete_from_page_cache(pages[i]);
+
+	for (i = 0; i < npages; i++)
+		unlock_page(pages[i]);
+
+	shmem_inode_unacct_blocks(inode, nr);
+failed:
+	return err;
+}
+
 /*
  * Remove swap entry from page cache, free the swap and its page cache.
  */
-- 
1.8.3.1


* [RFC v2 36/43] PKRAM: add support for loading pages in bulk
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Implement a new API function, pkram_load_file_pages(), to support
loading pages in bulk.  A caller-provided buffer of at least
PKRAM_PAGES_BUFSIZE bytes is populated with page pointers that are
contiguous by their original mapping index values.  The number of pages
in the buffer and the mapping index of the first page are returned to
the caller.
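
For illustration, a sketch of the intended calling convention (the
in-tree caller is converted in the next patch); demo_drain_object() is
hypothetical and the per-page processing is elided.

static int demo_drain_object(struct pkram_access *pa)
{
	struct page **pages;
	unsigned int i, nr_pages;
	unsigned long index;
	int err;

	/* the buffer must be at least PKRAM_PAGES_BUFSIZE bytes */
	pages = kzalloc(PKRAM_PAGES_BUFSIZE, GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	while (!(err = pkram_load_file_pages(pa, pages, &nr_pages, &index))) {
		/* pages[0..nr_pages-1] are contiguous starting at mapping index 'index' */
		for (i = 0; i < nr_pages; i++)
			put_page(pages[i]);	/* drop the preserve-time reference */
	}

	kfree(pages);
	return err == -ENODATA ? 0 : err;
}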

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/pkram.h |  4 ++++
 mm/pkram.c            | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/pkram.h b/include/linux/pkram.h
index 977cf45a1bcf..ca46e5eafe71 100644
--- a/include/linux/pkram.h
+++ b/include/linux/pkram.h
@@ -96,6 +96,10 @@ int pkram_prepare_save(struct pkram_stream *ps, const char *name,
 int pkram_save_file_page(struct pkram_access *pa, struct page *page);
 struct page *pkram_load_file_page(struct pkram_access *pa, unsigned long *index);
 
+#define PKRAM_PAGES_BUFSIZE	PAGE_SIZE
+
+int pkram_load_file_pages(struct pkram_access *pa, struct page *pages[], unsigned int *nr_pages, unsigned long *index);
+
 ssize_t pkram_write(struct pkram_access *pa, const void *buf, size_t count);
 size_t pkram_read(struct pkram_access *pa, void *buf, size_t count);
 
diff --git a/mm/pkram.c b/mm/pkram.c
index 382ccf6f789f..b63b2a3958e7 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -1099,6 +1099,52 @@ struct page *pkram_load_file_page(struct pkram_access *pa, unsigned long *index)
 }
 
 /**
+ * Load pages from the preserved memory node and object associated with
+ * pkram stream access @pa. The stream must have been initialized with
+ * pkram_prepare_load() and pkram_prepare_load_obj() and access initialized
+ * with PKRAM_ACCESS().
+ * The page entries of a single pkram_link are processed, and @pages is
+ * populated with the page pointers.  @nr_pages is set to the number of
+ * pages, and @index is set to the mapping index of the first page.
+ *
+ * Returns 0 if one or more pages are loaded or -ENODATA if there are no
+ * pages to load.
+ *
+ * The pages loaded have an incremented refcount either because the page
+ * was initialized with a refcount of 1 at boot or because the page was
+ * subsequently preserved which increased the refcount.
+ */
+int pkram_load_file_pages(struct pkram_access *pa, struct page *pages[], unsigned int *nr_pages, unsigned long *index)
+{
+	struct pkram_data_stream *pds = &pa->pds;
+	struct pkram_link *link;
+	int nr_entries = 0;
+	int i, ret;
+
+	ret = pkram_next_link(pds, &link);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < PKRAM_LINK_ENTRIES_MAX; i++) {
+		unsigned long p = link->entry[i];
+
+		if (!p)
+			break;
+
+		pages[i] = __pkram_prep_load_page(p);
+		nr_entries++;
+	}
+
+	*nr_pages = nr_entries;
+	*index = link->index;
+
+	pkram_free_page(link);
+	pds->link = NULL;
+
+	return 0;
+}
+
+/**
  * Copy @count bytes from @buf to the preserved memory node and object
  * associated with pkram stream access @pa. The stream must have been
  * initialized with pkram_prepare_save() and pkram_prepare_save_obj()
-- 
1.8.3.1


* [RFC v2 37/43] shmem: PKRAM: enable bulk loading of preserved pages into shmem
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Make use of the new interfaces for loading preserved pages and inserting
them into a shmem file in bulk.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem_pkram.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/shmem_pkram.c b/mm/shmem_pkram.c
index 354c2b58962c..24a1ebb4af59 100644
--- a/mm/shmem_pkram.c
+++ b/mm/shmem_pkram.c
@@ -328,20 +328,31 @@ static inline void pkram_load_report_one_done(void)
 static int do_load_file_content(struct pkram_stream *ps, struct address_space *mapping, struct mm_struct *mm)
 {
 	PKRAM_ACCESS(pa, ps, pages);
+	struct page **pages;
+	unsigned int nr_pages;
 	unsigned long index;
-	struct page *page;
-	int err = 0;
+	int i, err;
+
+	pages = kzalloc(PKRAM_PAGES_BUFSIZE, GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
 
 	do {
-		page = pkram_load_file_page(&pa, &index);
-		if (!page)
+		err = pkram_load_file_pages(&pa, pages, &nr_pages, &index);
+		if (err) {
+			if (err == -ENODATA)
+				err = 0;
 			break;
+		}
+
+		err = shmem_insert_pages(mm, mapping->host, index, pages, nr_pages);
 
-		err = shmem_insert_page(mm, mapping->host, index, page);
-		put_page(page);
+		for (i = 0; i < nr_pages; i++)
+			put_page(pages[i]);
 		cond_resched();
 	} while (!err);
 
+	kfree(pages);
 	pkram_finish_access(&pa, err == 0);
 	return err;
 }
-- 
1.8.3.1


* [RFC v2 38/43] mm: implement splicing a list of pages to the LRU
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Considerable contention on the LRU lock occurs when multiple threads
insert pages into a shmem file in parallel. To alleviate this, provide a
way to stage pages destined for the same LRU so that they can later be
added in one go by splicing the staged list onto the LRU and updating the
stats once while the lock is held. For now only unevictable pages are
supported.
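
For illustration, a sketch of how the interface is meant to be used
(shmem_insert_pages() is converted in the next patch);
demo_add_unevictable() is hypothetical.

static void demo_add_unevictable(struct page **pages, int npages)
{
	LRU_SPLICE(splice);
	int i;

	/* stage the pages locally; no LRU lock is taken here */
	for (i = 0; i < npages; i++)
		lru_splice_add(pages[i], &splice);

	/* one lock acquisition splices the list and updates the stats */
	add_splice_to_lru_list(&splice);
}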

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/swap.h | 13 ++++++++
 mm/swap.c            | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4cc6ec3bf0ab..254c9c8d71d0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -351,6 +351,19 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
 
 extern void lru_cache_add_inactive_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
+struct lru_splice {
+	struct list_head	splice;
+	struct list_head	*lru_head;
+	struct lruvec		*lruvec;
+	enum lru_list		lru;
+	unsigned long		nr_pages[MAX_NR_ZONES];
+	unsigned long		pgculled;
+};
+#define LRU_SPLICE_INIT(name)	{ .splice = LIST_HEAD_INIT(name.splice) }
+#define LRU_SPLICE(name) \
+	struct lru_splice name = LRU_SPLICE_INIT(name)
+extern void lru_splice_add(struct page *page, struct lru_splice *splice);
+extern void add_splice_to_lru_list(struct lru_splice *splice);
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d4ed94..a1db6a748608 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,6 +200,92 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 }
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
+/*
+ * Update stats and move accumulated pages from an lru_splice to the lru.
+ */
+void add_splice_to_lru_list(struct lru_splice *splice)
+{
+	struct lruvec *lruvec = splice->lruvec;
+	enum lru_list lru = splice->lru;
+	unsigned long flags = 0;
+	int zid;
+
+	if (list_empty(&splice->splice))
+		return;
+
+	spin_lock_irqsave(&lruvec->lru_lock, flags);
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		if (splice->nr_pages[zid])
+			update_lru_size(lruvec, lru, zid, splice->nr_pages[zid]);
+	}
+	count_vm_events(UNEVICTABLE_PGCULLED, splice->pgculled);
+	list_splice_init(&splice->splice, splice->lru_head);
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
+static void add_page_to_lru_splice(struct page *page, struct lru_splice *splice,
+				   struct lruvec *lruvec, enum lru_list lru)
+{
+	if (list_empty(&splice->splice)) {
+		int zid;
+
+		splice->lruvec = lruvec;
+		splice->lru_head = &lruvec->lists[lru];
+		splice->lru = lru;
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			splice->nr_pages[zid] = 0;
+		splice->pgculled = 0;
+	}
+
+	BUG_ON(splice->lruvec != lruvec);
+	BUG_ON(splice->lru_head != &lruvec->lists[lru]);
+
+	list_add(&page->lru, &splice->splice);
+	splice->nr_pages[page_zonenum(page)] += thp_nr_pages(page);
+}
+
+/*
+ * Similar in functionality to __pagevec_lru_add_fn() but here the page is
+ * being added to an lru_splice and the LRU lock is not held.
+ */
+static void page_lru_splice_add(struct page *page, struct lru_splice *splice, struct lruvec *lruvec)
+{
+	enum lru_list lru;
+	int was_unevictable = TestClearPageUnevictable(page);
+	int nr_pages = thp_nr_pages(page);
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	/* XXX only supports unevictable pages at the moment */
+	VM_BUG_ON_PAGE(was_unevictable, page);
+
+	SetPageLRU(page);
+	smp_mb__after_atomic();
+
+	lru = LRU_UNEVICTABLE;
+	ClearPageActive(page);
+	SetPageUnevictable(page);
+	if (!was_unevictable)
+		splice->pgculled += nr_pages;
+
+	add_page_to_lru_splice(page, splice, lruvec, lru);
+	trace_mm_lru_insertion(page);
+}
+
+void lru_splice_add(struct page *page, struct lru_splice *splice)
+{
+	struct lruvec *lruvec;
+
+	VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	get_page(page);
+	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+	if (lruvec != splice->lruvec)
+		add_splice_to_lru_list(splice);
+	page_lru_splice_add(page, splice, lruvec);
+	put_page(page);
+}
+
 static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
-- 
1.8.3.1


* [RFC v2 39/43] shmem: optimize adding pages to the LRU in shmem_insert_pages()
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Reduce LRU lock contention when inserting shmem pages by staging pages
to be added to the same LRU and adding them en masse.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c3fa72061d8a..63299da75166 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -845,6 +845,7 @@ int shmem_insert_pages(struct mm_struct *charge_mm, struct inode *inode,
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	gfp_t gfp = mapping_gfp_mask(mapping);
+	LRU_SPLICE(splice);
 	int i, err;
 	int nr = 0;
 
@@ -908,7 +909,7 @@ int shmem_insert_pages(struct mm_struct *charge_mm, struct inode *inode,
 
 	for (i = 0; i < npages; i++) {
 		if (!PageLRU(pages[i]))
-			lru_cache_add(pages[i]);
+			lru_splice_add(pages[i], &splice);
 
 		flush_dcache_page(pages[i]);
 		SetPageUptodate(pages[i]);
@@ -917,6 +918,8 @@ int shmem_insert_pages(struct mm_struct *charge_mm, struct inode *inode,
 		unlock_page(pages[i]);
 	}
 
+	add_splice_to_lru_list(&splice);
+
 	return 0;
 
 out_release:
-- 
1.8.3.1


* [RFC v2 40/43] shmem: initial support for adding multiple pages to pagecache
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

shmem_insert_pages() currently loops over the array of pages passed
to it and calls shmem_add_to_page_cache() for each one. Prepare for
adding pages to the pagecache in bulk by introducing and using a
shmem_add_pages_to_cache() helper.  For now the helper still iterates
over the array and adds pages individually, but performance improves
when multiple threads are adding to the same pagecache because it calls
a new shmem_add_to_page_cache_fast() function that does not check for
conflicting entries and drops the xarray lock before updating stats.
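
For illustration, a reduced sketch of the locking order the fast path
relies on: slots are populated under the xarray lock, the lock is
dropped, and statistics are updated with interrupts still disabled.
demo_store_one() and the atomic counter standing in for the lruvec
statistics are hypothetical; error handling for the store is omitted.

static void demo_store_one(struct xarray *xa, unsigned long index,
			   struct page *page, atomic_long_t *nr_file_pages)
{
	XA_STATE(xas, xa, index);

	xas_lock_irq(&xas);		/* xa_lock taken, interrupts off */
	xas_store(&xas, page);
	xas_unlock(&xas);		/* xa_lock dropped, interrupts still off */

	atomic_long_inc(nr_file_pages);	/* stands in for __mod_lruvec_page_state() */
	local_irq_enable();
}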

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 108 insertions(+), 15 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 63299da75166..f495af51042e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -738,6 +738,74 @@ static int shmem_add_to_page_cache(struct page *page,
 	return error;
 }
 
+static int shmem_add_to_page_cache_fast(struct page *page,
+				   struct address_space *mapping,
+				   pgoff_t index, gfp_t gfp,
+				   struct mm_struct *charge_mm, bool skipcharge)
+{
+	XA_STATE_ORDER(xas, &mapping->i_pages, index, thp_order(page));
+	unsigned long nr = thp_nr_pages(page);
+	unsigned long i = 0;
+	int error;
+
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	VM_BUG_ON_PAGE(index != round_down(index, nr), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+
+	page_ref_add(page, nr);
+	page->mapping = mapping;
+	page->index = index;
+
+	if (!skipcharge && !PageSwapCache(page)) {
+		error = mem_cgroup_charge(page, charge_mm, gfp);
+		if (error) {
+			if (PageTransHuge(page)) {
+				count_vm_event(THP_FILE_FALLBACK);
+				count_vm_event(THP_FILE_FALLBACK_CHARGE);
+			}
+			goto error;
+		}
+	}
+	cgroup_throttle_swaprate(page, gfp);
+
+	do {
+		xas_lock_irq(&xas);
+		xas_create_range(&xas);
+		if (xas_error(&xas))
+			goto unlock;
+next:
+		xas_store(&xas, page);
+		if (++i < nr) {
+			xas_next(&xas);
+			goto next;
+		}
+		mapping->nrpages += nr;
+		xas_unlock(&xas);
+		if (PageTransHuge(page)) {
+			count_vm_event(THP_FILE_ALLOC);
+			__inc_node_page_state(page, NR_SHMEM_THPS);
+		}
+		__mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
+		__mod_lruvec_page_state(page, NR_SHMEM, nr);
+		local_irq_enable();
+		break;
+unlock:
+		xas_unlock_irq(&xas);
+	} while (xas_nomem(&xas, gfp));
+
+	if (xas_error(&xas)) {
+		error = xas_error(&xas);
+		goto error;
+	}
+
+	return 0;
+error:
+	page->mapping = NULL;
+	page_ref_sub(page, nr);
+	return error;
+}
+
 /*
  * Like delete_from_page_cache, but substitutes swap for page.
  */
@@ -759,6 +827,41 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 	BUG_ON(error);
 }
 
+static int shmem_add_pages_to_cache(struct page *pages[], int npages,
+				struct address_space *mapping,
+				pgoff_t start, gfp_t gfp,
+				struct mm_struct *charge_mm)
+{
+	pgoff_t index = start;
+	int i, err;
+
+	i = 0;
+	while (i < npages) {
+		if (PageTransHuge(pages[i])) {
+			err = shmem_add_to_page_cache_fast(pages[i], mapping, index, gfp, charge_mm, page_memcg(pages[i]) ? true : false);
+			if (err)
+				goto out_release;
+			index += thp_nr_pages(pages[i]);
+			i++;
+			continue;
+		}
+
+		err = shmem_add_to_page_cache_fast(pages[i], mapping, index, gfp, charge_mm, page_memcg(pages[i]) ? true : false);
+		if (err)
+			goto out_release;
+		index++;
+		i++;
+	}
+	return 0;
+
+out_release:
+	while (i > 0) {
+		i--;
+		delete_from_page_cache(pages[i]);
+	}
+	return err;
+}
+
 int shmem_insert_page(struct mm_struct *mm, struct inode *inode, pgoff_t index,
 		      struct page *page)
 {
@@ -889,17 +992,10 @@ int shmem_insert_pages(struct mm_struct *charge_mm, struct inode *inode,
 		__SetPageReferenced(pages[i]);
 	}
 
-	for (i = 0; i < npages; i++) {
-		bool ischarged = page_memcg(pages[i]) ? true : false;
-
-		err = shmem_add_to_page_cache(pages[i], mapping, index,
-					NULL, gfp & GFP_RECLAIM_MASK,
-					charge_mm, ischarged);
-		if (err)
-			goto out_release;
-
-		index += thp_nr_pages(pages[i]);
-	}
+	err = shmem_add_pages_to_cache(pages, npages, mapping, index,
+					gfp & GFP_RECLAIM_MASK, charge_mm);
+	if (err)
+		goto out_unlock;
 
 	spin_lock(&info->lock);
 	info->alloced += nr;
@@ -922,10 +1018,7 @@ int shmem_insert_pages(struct mm_struct *charge_mm, struct inode *inode,
 
 	return 0;
 
-out_release:
-	while (--i >= 0)
-		delete_from_page_cache(pages[i]);
-
+out_unlock:
 	for (i = 0; i < npages; i++)
 		unlock_page(pages[i]);
 
-- 
1.8.3.1


* [RFC v2 41/43] XArray: add xas_export_node() and xas_import_node()
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Contention on the xarray lock when multiple threads are adding to the
same xarray can be mitigated by providing a way to add entries in
bulk.

Allow a caller to allocate and populate an xarray node outside of
the target xarray and then only take the xarray lock long enough to
import the node into it.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 Documentation/core-api/xarray.rst |   8 +++
 include/linux/xarray.h            |   2 +
 lib/test_xarray.c                 |  45 +++++++++++++++++
 lib/xarray.c                      | 100 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 155 insertions(+)

diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index a137a0e6d068..12ec59038fc8 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -444,6 +444,14 @@ called each time the XArray updates a node.  This is used by the page
 cache workingset code to maintain its list of nodes which contain only
 shadow entries.
 
+xas_export_node() is used to remove and return a node from an XArray
+while xas_import_node() is used to add a node to an XArray.  Together
+these can be used, for example, to reduce lock contention when multiple
+threads are updating an XArray by allowing a caller to allocate and
+populate a node outside of the target XArray in a local XArray, export
+the node, and then take the target XArray lock just long enough to import
+the node.
+
 Multi-Index Entries
 -------------------
 
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 92c0160b3352..1eda38cbe020 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1506,6 +1506,8 @@ static inline bool xas_retry(struct xa_state *xas, const void *entry)
 void xas_pause(struct xa_state *);
 
 void xas_create_range(struct xa_state *);
+struct xa_node *xas_export_node(struct xa_state *xas);
+void xas_import_node(struct xa_state *xas, struct xa_node *node);
 
 #ifdef CONFIG_XARRAY_MULTI
 int xa_get_order(struct xarray *, unsigned long index);
diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 8294f43f4981..9cca0921cf9b 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1765,6 +1765,50 @@ static noinline void check_destroy(struct xarray *xa)
 #endif
 }
 
+static noinline void check_export_import_1(struct xarray *xa,
+		unsigned long index, unsigned int order)
+{
+	int xa_shift = order + XA_CHUNK_SHIFT - (order % XA_CHUNK_SHIFT);
+	XA_STATE(xas, xa, index);
+	struct xa_node *node;
+	unsigned long i;
+
+	xa_store_many_order(xa, index, xa_shift);
+
+	xas_lock(&xas);
+	xas_set_order(&xas, index, xa_shift);
+	node = xas_export_node(&xas);
+	xas_unlock(&xas);
+
+	XA_BUG_ON(xa, !xa_empty(xa));
+
+	do {
+		xas_lock(&xas);
+		xas_set_order(&xas, index, xa_shift);
+		xas_import_node(&xas, node);
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
+
+	for (i = index; i < index + (1UL << xa_shift); i++)
+		xa_erase_index(xa, i);
+
+	XA_BUG_ON(xa, !xa_empty(xa));
+}
+
+static noinline void check_export_import(struct xarray *xa)
+{
+	unsigned int order;
+	unsigned int max_order = IS_ENABLED(CONFIG_XARRAY_MULTI) ? 12 : 1;
+
+	for (order = 0; order < max_order; order += XA_CHUNK_SHIFT) {
+		int xa_shift = order + XA_CHUNK_SHIFT;
+		unsigned long j;
+
+		for (j = 0; j < XA_CHUNK_SIZE; j++)
+			check_export_import_1(xa, j << xa_shift, order);
+	}
+}
+
 static DEFINE_XARRAY(array);
 
 static int xarray_checks(void)
@@ -1797,6 +1841,7 @@ static int xarray_checks(void)
 	check_workingset(&array, 0);
 	check_workingset(&array, 64);
 	check_workingset(&array, 4096);
+	check_export_import(&array);
 
 	printk("XArray: %u of %u tests passed\n", tests_passed, tests_run);
 	return (tests_run == tests_passed) ? 0 : -EINVAL;
diff --git a/lib/xarray.c b/lib/xarray.c
index 5fa51614802a..58d58333f0d0 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -510,6 +510,30 @@ static void xas_delete_node(struct xa_state *xas)
 		xas_shrink(xas);
 }
 
+static void xas_unlink_node(struct xa_state *xas)
+{
+	struct xa_node *node = xas->xa_node;
+	struct xa_node *parent;
+
+	parent = xa_parent_locked(xas->xa, node);
+	xas->xa_node = parent;
+	xas->xa_offset = node->offset;
+
+	if (!parent) {
+		xas->xa->xa_head = NULL;
+		xas->xa_node = XAS_BOUNDS;
+		return;
+	}
+
+	parent->slots[xas->xa_offset] = NULL;
+	parent->count--;
+	XA_NODE_BUG_ON(parent, parent->count > XA_CHUNK_SIZE);
+
+	xas_update(xas, parent);
+
+	xas_delete_node(xas);
+}
+
 /**
  * xas_free_nodes() - Free this node and all nodes that it references
  * @xas: Array operation state.
@@ -1690,6 +1714,82 @@ static void xas_set_range(struct xa_state *xas, unsigned long first,
 }
 
 /**
+ * xas_export_node() - remove and return a node from an XArray
+ * @xas: XArray operation state
+ *
+ * The range covered by @xas must be aligned to and cover a single node
+ * at any level of the tree.
+ *
+ * Return: On success, returns the removed node.  If the range is invalid,
+ * returns %NULL and sets -EINVAL in @xas.  Otherwise returns %NULL if the
+ * node does not exist.
+ */
+struct xa_node *xas_export_node(struct xa_state *xas)
+{
+	struct xa_node *node;
+
+	if (!xas->xa_shift || xas->xa_sibs) {
+		xas_set_err(xas, -EINVAL);
+		return NULL;
+	}
+
+	xas->xa_shift -= XA_CHUNK_SHIFT;
+
+	if (!xas_find(xas, xas->xa_index))
+		return NULL;
+	node = xas->xa_node;
+	xas_unlink_node(xas);
+	node->parent = NULL;
+
+	return node;
+}
+
+/**
+ * xas_import_node() - add a node to an XArray
+ * @xas: XArray operation state
+ * @node: The node to add
+ *
+ * The range covered by @xas must be aligned to and cover a single node
+ * at any level of the tree.  No nodes should already exist within the
+ * range.
+ * Sets an error in @xas if the range is invalid or xas_create() fails
+ */
+void xas_import_node(struct xa_state *xas, struct xa_node *node)
+{
+	struct xa_node *parent = NULL;
+	void __rcu **slot = &xas->xa->xa_head;
+	int count = 0;
+
+	if (!xas->xa_shift || xas->xa_sibs) {
+		xas_set_err(xas, -EINVAL);
+		return;
+	}
+
+	if (xas->xa_index || xa_head_locked(xas->xa)) {
+		xas_set_order(xas, xas->xa_index, node->shift + XA_CHUNK_SHIFT);
+		xas_create(xas, true);
+
+		if (xas_invalid(xas))
+			return;
+
+		parent = xas->xa_node;
+	}
+
+	if (parent) {
+		slot = &parent->slots[xas->xa_offset];
+		node->offset = xas->xa_offset;
+		count++;
+	}
+
+	RCU_INIT_POINTER(node->parent, parent);
+	node->array = xas->xa;
+
+	rcu_assign_pointer(*slot, xa_mk_node(node));
+
+	update_node(xas, parent, count, 0);
+}
+
+/**
  * xa_store_range() - Store this entry at a range of indices in the XArray.
  * @xa: XArray.
  * @first: First index to affect.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 42/43] shmem: reduce time holding xa_lock when inserting pages
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

Rather than adding one page at a time to the page cache and taking the
page cache xarray lock each time, where possible add pages in bulk by
first populating an xarray node outside of the page cache before taking
the lock to insert it.
When a group of pages to be inserted will fill an xarray node, add them
to a local xarray, export the xarray node, and then take the lock on the
page cache xarray and insert the node.
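
(Illustrative arithmetic, not part of the patch: where the 64, 8 and 4096 in
the hunks below come from.  This is a standalone userspace sketch; the values
XA_CHUNK_SHIFT = 6 and HPAGE_PMD_ORDER = 9 are assumptions corresponding to
64-slot xarray nodes and 2M THPs with 4K base pages.)

#include <stdio.h>

#define XA_CHUNK_SHIFT	6	/* assumed: 64-slot xarray nodes */
#define HPAGE_PMD_ORDER	9	/* assumed: 2M THP, 4K base pages */

int main(void)
{
	int orders[] = { 0, HPAGE_PMD_ORDER };
	int i;

	for (i = 0; i < 2; i++) {
		int order = orders[i];
		/* Smallest node-aligned shift covering pages of this order. */
		int xa_shift = order + XA_CHUNK_SHIFT - (order % XA_CHUNK_SHIFT);

		printf("order %d: index alignment %lu, pages per full node %lu\n",
		       order, 1UL << xa_shift, (1UL << xa_shift) >> order);
	}
	/*
	 * Prints: order 0 -> alignment 64, 64 base pages fill one node;
	 *         order 9 -> alignment 4096, 8 THPs fill one node.
	 */
	return 0;
}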

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/shmem.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 156 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index f495af51042e..a7c23b43b57f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -827,17 +827,149 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 	BUG_ON(error);
 }
 
+static int shmem_add_aligned_to_page_cache(struct page *pages[], int npages,
+					   struct address_space *mapping,
+					   pgoff_t index, gfp_t gfp, int order,
+					   struct mm_struct *charge_mm)
+{
+	int xa_shift = order + XA_CHUNK_SHIFT - (order % XA_CHUNK_SHIFT);
+	XA_STATE_ORDER(xas, &mapping->i_pages, index, xa_shift);
+	struct xarray xa_tmp;
+	/*
+	 * Specify order so xas_create_range() only needs to be called once
+	 * to allocate the entire range.  This guarantees that xas_store()
+	 * will not fail due to lack of memory.
+	 * Specify index == 0 so the minimum necessary nodes are allocated.
+	 */
+	XA_STATE_ORDER(xas_tmp, &xa_tmp, 0, xa_shift);
+	unsigned long nr = 1UL << order;
+	struct xa_node *node;
+	int i, error;
+
+	if (npages * nr != 1 << xa_shift) {
+		WARN_ONCE(1, "npages (%d) not aligned to xa_shift\n", npages);
+		return -EINVAL;
+	}
+	if (!IS_ALIGNED(index, 1 << xa_shift)) {
+		WARN_ONCE(1, "index (%lu) not aligned to xa_shift\n", index);
+		return -EINVAL;
+	}
+
+	for (i = 0; i < npages; i++) {
+		bool skipcharge = page_memcg(pages[i]) ? true : false;
+
+		VM_BUG_ON_PAGE(PageTail(pages[i]), pages[i]);
+		VM_BUG_ON_PAGE(!PageLocked(pages[i]), pages[i]);
+		VM_BUG_ON_PAGE(!PageSwapBacked(pages[i]), pages[i]);
+
+		page_ref_add(pages[i], nr);
+		pages[i]->mapping = mapping;
+		pages[i]->index = index + (i * nr);
+
+		if (!skipcharge && !PageSwapCache(pages[i])) {
+			error = mem_cgroup_charge(pages[i], charge_mm, gfp);
+			if (error) {
+				if (PageTransHuge(pages[i])) {
+					count_vm_event(THP_FILE_FALLBACK);
+					count_vm_event(THP_FILE_FALLBACK_CHARGE);
+				}
+				goto error;
+			}
+		}
+		cgroup_throttle_swaprate(pages[i], gfp);
+	}
+
+	xa_init(&xa_tmp);
+	do {
+		xas_lock(&xas_tmp);
+		xas_create_range(&xas_tmp);
+		if (xas_error(&xas_tmp))
+			goto unlock;
+		for (i = 0; i < npages; i++) {
+			int j = 0;
+next:
+			xas_store(&xas_tmp, pages[i]);
+			if (++j < nr) {
+				xas_next(&xas_tmp);
+				goto next;
+			}
+			if (i < npages - 1)
+				xas_next(&xas_tmp);
+		}
+		xas_set_order(&xas_tmp, 0, xa_shift);
+		node = xas_export_node(&xas_tmp);
+unlock:
+		xas_unlock(&xas_tmp);
+	} while (xas_nomem(&xas_tmp, gfp));
+
+	if (xas_error(&xas_tmp)) {
+		error = xas_error(&xas_tmp);
+		i = npages - 1;
+		goto error;
+	}
+
+	do {
+		xas_lock_irq(&xas);
+		xas_import_node(&xas, node);
+		if (xas_error(&xas))
+			goto unlock1;
+		mapping->nrpages += nr * npages;
+		xas_unlock(&xas);
+		for (i = 0; i < npages; i++) {
+			__mod_lruvec_page_state(pages[i], NR_FILE_PAGES, nr);
+			__mod_lruvec_page_state(pages[i], NR_SHMEM, nr);
+			if (PageTransHuge(pages[i])) {
+				count_vm_event(THP_FILE_ALLOC);
+				__inc_node_page_state(pages[i], NR_SHMEM_THPS);
+			}
+		}
+		local_irq_enable();
+		break;
+unlock1:
+		xas_unlock_irq(&xas);
+	} while (xas_nomem(&xas, gfp));
+
+	if (xas_error(&xas)) {
+		error = xas_error(&xas);
+		goto error;
+	}
+
+	return 0;
+error:
+	while (i != 0) {
+		pages[i]->mapping = NULL;
+		page_ref_sub(pages[i], nr);
+		i--;
+	}
+	return error;
+}
+
 static int shmem_add_pages_to_cache(struct page *pages[], int npages,
 				struct address_space *mapping,
 				pgoff_t start, gfp_t gfp,
 				struct mm_struct *charge_mm)
 {
 	pgoff_t index = start;
-	int i, err;
+	int i, j, err;
 
 	i = 0;
 	while (i < npages) {
 		if (PageTransHuge(pages[i])) {
+			if (IS_ALIGNED(index, 4096) && i+8 <= npages) {
+				for (j = 1; j < 8; j++) {
+					if (!PageTransHuge(pages[i+j]))
+						break;
+				}
+				if (j == 8) {
+					err = shmem_add_aligned_to_page_cache(&pages[i], 8, mapping, index, gfp, HPAGE_PMD_ORDER, charge_mm);
+					if (err)
+						goto out_release;
+					index += HPAGE_PMD_NR * 8;
+					i += 8;
+					continue;
+				}
+			}
+
 			err = shmem_add_to_page_cache_fast(pages[i], mapping, index, gfp, charge_mm, page_memcg(pages[i]) ? true : false);
 			if (err)
 				goto out_release;
@@ -846,11 +978,29 @@ static int shmem_add_pages_to_cache(struct page *pages[], int npages,
 			continue;
 		}
 
-		err = shmem_add_to_page_cache_fast(pages[i], mapping, index, gfp, charge_mm, page_memcg(pages[i]) ? true : false);
-		if (err)
-			goto out_release;
-		index++;
-		i++;
+		for (j = 1; i + j < npages; j++) {
+			if (PageTransHuge(pages[i + j]))
+				break;
+		}
+
+		while (j > 0) {
+			if (IS_ALIGNED(index, 64) && j >= 64) {
+				err = shmem_add_aligned_to_page_cache(&pages[i], 64, mapping, index, gfp, 0, charge_mm);
+				if (err)
+					goto out_release;
+				index += 64;
+				i += 64;
+				j -= 64;
+				continue;
+			}
+
+			err = shmem_add_to_page_cache_fast(pages[i], mapping, index, gfp, charge_mm, page_memcg(pages[i]) ? true : false);
+			if (err)
+				goto out_release;
+			index++;
+			i++;
+			j--;
+		}
 	}
 	return 0;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [RFC v2 43/43] PKRAM: improve index alignment of pkram_link entries
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-03-30 21:36   ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-30 21:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

To take advantage of optimizations when adding pages to the page cache
via shmem_insert_pages(), improve the likelihood that the pages array
passed to shmem_insert_pages() starts on an aligned index.  Do this when
preserving pages by starting a new pkram_link page whenever the index of the
current page is aligned but a full aligned group of entries would no longer
fit on the current pkram_link page.
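
(Illustrative sketch, not part of the patch: the new "start a new link page"
condition restated as a standalone predicate.  XA_CHUNK_SIZE and
PKRAM_LINK_ENTRIES_MAX here are assumed values for demonstration only.)

#include <stdbool.h>
#include <stdio.h>

#define XA_CHUNK_SIZE		64UL	/* assumed */
#define PKRAM_LINK_ENTRIES_MAX	509UL	/* assumed, not taken from the patch */

/*
 * Start a new pkram_link page if @index begins an aligned group and a full
 * group of @align_cnt entries would spill past the current link page.
 */
static bool start_new_link(unsigned long index, unsigned long entry_idx,
			   unsigned long align, unsigned long align_cnt)
{
	return (index % align == 0) &&
	       (entry_idx + align_cnt > PKRAM_LINK_ENTRIES_MAX);
}

int main(void)
{
	/* Base pages: align == align_cnt == XA_CHUNK_SIZE. */
	printf("%d\n", start_new_link(4096, 480, XA_CHUNK_SIZE, XA_CHUNK_SIZE)); /* 1: start new page */
	printf("%d\n", start_new_link(4096, 400, XA_CHUNK_SIZE, XA_CHUNK_SIZE)); /* 0: group still fits */
	return 0;
}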

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/pkram.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/pkram.c b/mm/pkram.c
index b63b2a3958e7..3f43809c8a85 100644
--- a/mm/pkram.c
+++ b/mm/pkram.c
@@ -911,9 +911,20 @@ static int __pkram_save_page(struct pkram_access *pa, struct page *page,
 {
 	struct pkram_data_stream *pds = &pa->pds;
 	struct pkram_link *link = pds->link;
+	int align, align_cnt;
+
+	if (PageTransHuge(page)) {
+		align = 1 << (HPAGE_PMD_ORDER + XA_CHUNK_SHIFT - (HPAGE_PMD_ORDER % XA_CHUNK_SHIFT));
+		align_cnt = align >> HPAGE_PMD_ORDER;
+	} else {
+		align = XA_CHUNK_SIZE;
+		align_cnt = XA_CHUNK_SIZE;
+	}
 
 	if (!link || pds->entry_idx >= PKRAM_LINK_ENTRIES_MAX ||
-	    index != pa->pages.next_index) {
+	    index != pa->pages.next_index ||
+	    (IS_ALIGNED(index, align) &&
+	    (pds->entry_idx + align_cnt > PKRAM_LINK_ENTRIES_MAX))) {
 		link = pkram_new_link(pds, pa->ps->gfp_mask);
 		if (!link)
 			return -ENOMEM;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: [RFC v2 01/43] mm: add PKRAM API stubs and Kconfig
  2021-03-30 21:35   ` Anthony Yznaga
@ 2021-03-31 18:43     ` Randy Dunlap
  -1 siblings, 0 replies; 94+ messages in thread
From: Randy Dunlap @ 2021-03-31 18:43 UTC (permalink / raw)
  To: Anthony Yznaga, linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

On 3/30/21 2:35 PM, Anthony Yznaga wrote:
> Preserved-across-kexec memory or PKRAM is a method for saving memory
> pages of the currently executing kernel and restoring them after kexec
> boot into a new one. This can be utilized for preserving guest VM state,
> large in-memory databases, process memory, etc. across reboot. While
> DRAM-as-PMEM or actual persistent memory could be used to accomplish
> these things, PKRAM provides the latency of DRAM with the flexibility
> of dynamically determining the amount of memory to preserve.
> 
...

> 
> Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  include/linux/pkram.h |  47 +++++++++++++
>  mm/Kconfig            |   9 +++
>  mm/Makefile           |   1 +
>  mm/pkram.c            | 179 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 236 insertions(+)
>  create mode 100644 include/linux/pkram.h
>  create mode 100644 mm/pkram.c
> 
> diff --git a/mm/pkram.c b/mm/pkram.c
> new file mode 100644
> index 000000000000..59e4661b2fb7
> --- /dev/null
> +++ b/mm/pkram.c
> @@ -0,0 +1,179 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/err.h>
> +#include <linux/gfp.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <linux/pkram.h>
> +#include <linux/types.h>
> +

Hi,

a) There are several doc blocks that begin with "/**" but that are not
in kernel-doc format (/** means kernel-doc format when inside the kernel
source tree).

Please either change those to "/*" or convert them to kernel-doc format.
The latter is preferable for exported interfaces.

> +/**
> + * Create a preserved memory node with name @name and initialize stream @ps
> + * for saving data to it.
> + *
> + * @gfp_mask specifies the memory allocation mask to be used when saving data.
> + *
> + * Returns 0 on success, -errno on failure.
> + *
> + * After the save has finished, pkram_finish_save() (or pkram_discard_save() in
> + * case of failure) is to be called.
> + */
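
For reference, the same block rewritten in kernel-doc format might look like
the sketch below; the function name, prototype, and struct name are
assumptions for illustration only, not taken from the patch:

/**
 * pkram_prepare_save - create a preserved memory node and begin saving to it
 * @ps: stream to initialize for saving data to the node
 * @name: name of the preserved memory node to create
 * @gfp_mask: memory allocation mask to be used when saving data
 *
 * After the save has finished, pkram_finish_save() (or pkram_discard_save()
 * in case of failure) is to be called.
 *
 * Return: 0 on success, -errno on failure.
 */
int pkram_prepare_save(struct pkram_stream *ps, const char *name,
		       gfp_t gfp_mask);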


b) from patch 00/43:

 documentation/core-api/xarray.rst       |    8 +

How did "documentation" become lower case (instead of Documentation)?


thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC v2 01/43] mm: add PKRAM API stubs and Kconfig
  2021-03-31 18:43     ` Randy Dunlap
@ 2021-03-31 20:28       ` Anthony Yznaga
  -1 siblings, 0 replies; 94+ messages in thread
From: Anthony Yznaga @ 2021-03-31 20:28 UTC (permalink / raw)
  To: Randy Dunlap, linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec

On 3/31/21 11:43 AM, Randy Dunlap wrote:
> On 3/30/21 2:35 PM, Anthony Yznaga wrote:
>> Preserved-across-kexec memory or PKRAM is a method for saving memory
>> pages of the currently executing kernel and restoring them after kexec
>> boot into a new one. This can be utilized for preserving guest VM state,
>> large in-memory databases, process memory, etc. across reboot. While
>> DRAM-as-PMEM or actual persistent memory could be used to accomplish
>> these things, PKRAM provides the latency of DRAM with the flexibility
>> of dynamically determining the amount of memory to preserve.
>>
> ...
>
>> Originally-by: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>  include/linux/pkram.h |  47 +++++++++++++
>>  mm/Kconfig            |   9 +++
>>  mm/Makefile           |   1 +
>>  mm/pkram.c            | 179 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 236 insertions(+)
>>  create mode 100644 include/linux/pkram.h
>>  create mode 100644 mm/pkram.c
>>
>> diff --git a/mm/pkram.c b/mm/pkram.c
>> new file mode 100644
>> index 000000000000..59e4661b2fb7
>> --- /dev/null
>> +++ b/mm/pkram.c
>> @@ -0,0 +1,179 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/err.h>
>> +#include <linux/gfp.h>
>> +#include <linux/kernel.h>
>> +#include <linux/mm.h>
>> +#include <linux/pkram.h>
>> +#include <linux/types.h>
>> +
> Hi,
>
> There are several doc blocks that begin with "/**" but that are not
> in kernel-doc format (/** means kernel-doc format when inside the kernel
> source tree).
>
> Please either change those to "/*" or convert them to kernel-doc format.
> The latter is preferable for exported interfaces.
Thank you.  I'll fix these up.

>
>> +/**
>> + * Create a preserved memory node with name @name and initialize stream @ps
>> + * for saving data to it.
>> + *
>> + * @gfp_mask specifies the memory allocation mask to be used when saving data.
>> + *
>> + * Returns 0 on success, -errno on failure.
>> + *
>> + * After the save has finished, pkram_finish_save() (or pkram_discard_save() in
>> + * case of failure) is to be called.
>> + */
>
> b) from patch 00/43:
>
>  documentation/core-api/xarray.rst       |    8 +
>
> How did "documentation" become lower case (instead of Documentation)?
That is odd.  The patch (41) has it correct.

Anthony

>
>
> thanks.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC v2 00/43] PKRAM: Preserved-over-Kexec RAM
  2021-03-30 21:35 ` Anthony Yznaga
@ 2021-06-05 13:39   ` Pavel Tatashin
  -1 siblings, 0 replies; 94+ messages in thread
From: Pavel Tatashin @ 2021-06-05 13:39 UTC (permalink / raw)
  To: Anthony Yznaga, linux-mm, linux-kernel
  Cc: willy, corbet, tglx, mingo, bp, x86, hpa, dave.hansen, luto,
	peterz, rppt, akpm, hughd, ebiederm, keescook, ardb, nivedita,
	jroedel, masahiroy, nathan, terrelln, vincenzo.frascino,
	martin.b.radev, andreyknvl, daniel.kiper, rafael.j.wysocki,
	dan.j.williams, Jonathan.Cameron, bhe, rminnich, ashish.kalra,
	guro, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, david,
	richard.weiyang, vdavydov.dev, graf, jason.zeng, lei.l.li,
	daniel.m.jordan, steven.sistare, linux-fsdevel, linux-doc, kexec



On 3/30/21 5:35 PM, Anthony Yznaga wrote:
> This patchset implements preserved-over-kexec memory storage or PKRAM as a
> method for saving memory pages of the currently executing kernel so that
> they may be restored after kexec into a new kernel. The patches are adapted
> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
> introduce the PKRAM kernel API and implement its use within tmpfs, allowing
> tmpfs files to be preserved across kexec.
> 
> One use case for PKRAM is preserving guest memory and/or auxillary supporting
> data (e.g. iommu data) across kexec in support of VMM Fast Restart[2].
> VMM Fast Restart is currently using PKRAM to support preserving "Keep Alive
> State" across reboot[3].  PKRAM provides a flexible way for doing this
> without requiring that the amount of memory used by a fixed size created
> a priori.  Another use case is for databases to preserve their block caches
> in shared memory across reboot.

Hi Anthony,

I have several concerns about preserving arbitrary, non-prereserved segments of memory across reboot.

1. PKRAM does not work across firmware reboots
With emulated persistent memory it is possible to reboot through firmware and not lose the preserved memory. The firmware can be modified to mark the required ranges of pages as PRAM, and Linux will treat them as such. The benefit is that this works for both cases, kexec and reboot through firmware. The disadvantage is that you have to know in advance how much memory needs to be preserved. However, with the ability to hot-plug/hot-remove PMEM, that disadvantage becomes moot, as it is possible to mark a large chunk of memory as PMEM if needed. I have designed something like this for one of our projects, and it has already been used in the fleet. Rebooting through firmware allows us to service the firmware in addition to the kernel.

2. Boot failures due to memory fragmentation
We also considered using PRAM instead of PMEM. PRAM was one of the previous attempts to do the persistent memory thing via a tmpfs flag ("mount -t tmpfs -o pram=mytmpfs none /mnt/crdump"); that project was never upstreamed. However, we gave up on that idea because, in addition to losing the ability to reboot through the firmware, it also adds memory fragmentation. For example, if the new kernel requires larger contiguous memory chunks to be allocated during boot than the previous kernel did (e.g. the next kernel has new drivers or some debug feature enabled), the boot might simply fail because of the extra memory ranges being reserved.

3. New intra-kernel dependencies
Kexec reboot is when one Linux kernel works as a bootloader for the next one. Currently, very little information is passed from the old kernel to the next kernel. Adding more information that two independent kernels must know about each other is not a very good thing from an architectural point of view. It limits the flexibility of kexec.

However, we do need PKRAM and the ability to preserve kernel memory across reboot for things like fast hypervisor updates. User pages can already be preserved across reboot on emulated or real persistent memory. The easiest way is via DAXFS placed on that memory.
The kernel cannot preserve its memory on PMEM across the reboot. However, the functionality can be extended so that kernel memory can be preserved on both emulated and real persistent memory. PKRAM could provide an interface to save kernel data to a file, and that file could be placed on any filesystem, including DAXFS. When placed on DAXFS, that file can be used for IOMMU data, as it is actually located in physical memory and does not move. It is preserved across a firmware or kexec reboot, with device state surviving the reboot intact. During boot, device drivers that use the PKRAM preserve functionality can map the saved files from DAXFS in order to get IOMMU functionality working again.

Thank you,
Pasha

^ permalink raw reply	[flat|nested] 94+ messages in thread

end of thread, other threads:[~2021-06-05 13:40 UTC | newest]

Thread overview: 94+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-30 21:35 [RFC v2 00/43] PKRAM: Preserved-over-Kexec RAM Anthony Yznaga
2021-03-30 21:35 ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 01/43] mm: add PKRAM API stubs and Kconfig Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-31 18:43   ` Randy Dunlap
2021-03-31 18:43     ` Randy Dunlap
2021-03-31 20:28     ` Anthony Yznaga
2021-03-31 20:28       ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 02/43] mm: PKRAM: implement node load and save functions Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 03/43] mm: PKRAM: implement object " Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 04/43] mm: PKRAM: implement page stream operations Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 05/43] mm: PKRAM: support preserving transparent hugepages Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 06/43] mm: PKRAM: implement byte stream operations Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 07/43] mm: PKRAM: link nodes by pfn before reboot Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 08/43] mm: PKRAM: introduce super block Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 09/43] PKRAM: track preserved pages in a physical mapping pagetable Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 10/43] PKRAM: pass a list of preserved ranges to the next kernel Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 11/43] PKRAM: prepare for adding preserved ranges to memblock reserved Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 12/43] mm: PKRAM: reserve preserved memory at boot Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 13/43] PKRAM: free the preserved ranges list Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 14/43] PKRAM: prevent inadvertent use of a stale superblock Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 15/43] PKRAM: provide a way to ban pages from use by PKRAM Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 16/43] kexec: PKRAM: prevent kexec clobbering preserved pages in some cases Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 17/43] PKRAM: provide a way to check if a memory range has preserved pages Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 18/43] kexec: PKRAM: avoid clobbering already " Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 19/43] mm: PKRAM: allow preserved memory to be freed from userspace Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 20/43] PKRAM: disable feature when running the kdump kernel Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 21/43] x86/KASLR: PKRAM: support physical kaslr Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 22/43] x86/boot/compressed/64: use 1GB pages for mappings Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 23/43] mm: shmem: introduce shmem_insert_page Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:35 ` [RFC v2 24/43] mm: shmem: enable saving to PKRAM Anthony Yznaga
2021-03-30 21:35   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 25/43] mm: shmem: prevent swapping of PKRAM-enabled tmpfs pages Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 26/43] mm: shmem: specify the mm to use when inserting pages Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 27/43] mm: shmem: when inserting, handle pages already charged to a memcg Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 28/43] x86/mm/numa: add numa_isolate_memblocks() Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 29/43] PKRAM: ensure memblocks with preserved pages init'd for numa Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 30/43] memblock: PKRAM: mark memblocks that contain preserved pages Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 31/43] memblock, mm: defer initialization of " Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 32/43] shmem: preserve shmem files a chunk at a time Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 33/43] PKRAM: atomically add and remove link pages Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 34/43] shmem: PKRAM: multithread preserving and restoring shmem pages Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 35/43] shmem: introduce shmem_insert_pages() Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 36/43] PKRAM: add support for loading pages in bulk Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 37/43] shmem: PKRAM: enable bulk loading of preserved pages into shmem Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 38/43] mm: implement splicing a list of pages to the LRU Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 39/43] shmem: optimize adding pages to the LRU in shmem_insert_pages() Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 40/43] shmem: initial support for adding multiple pages to pagecache Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 41/43] XArray: add xas_export_node() and xas_import_node() Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 42/43] shmem: reduce time holding xa_lock when inserting pages Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-03-30 21:36 ` [RFC v2 43/43] PKRAM: improve index alignment of pkram_link entries Anthony Yznaga
2021-03-30 21:36   ` Anthony Yznaga
2021-06-05 13:39 ` [RFC v2 00/43] PKRAM: Preserved-over-Kexec RAM Pavel Tatashin
2021-06-05 13:39   ` Pavel Tatashin
