From: Anthony Yznaga <anthony.yznaga@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: willy@infradead.org, corbet@lwn.net, tglx@linutronix.de,
mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com,
dave.hansen@linux.intel.com, luto@kernel.org,
peterz@infradead.org, rppt@kernel.org, akpm@linux-foundation.org,
hughd@google.com, ebiederm@xmission.com, keescook@chromium.org,
ardb@kernel.org, nivedita@alum.mit.edu, jroedel@suse.de,
masahiroy@kernel.org, nathan@kernel.org, terrelln@fb.com,
vincenzo.frascino@arm.com, martin.b.radev@gmail.com,
andreyknvl@google.com, daniel.kiper@oracle.com,
rafael.j.wysocki@intel.com, dan.j.williams@intel.com,
Jonathan.Cameron@huawei.com, bhe@redhat.com, rminnich@gmail.com,
ashish.kalra@amd.com, guro@fb.com, hannes@cmpxchg.org,
mhocko@kernel.org, iamjoonsoo.kim@lge.com, vbabka@suse.cz,
alex.shi@linux.alibaba.com, david@redhat.com,
richard.weiyang@gmail.com, vdavydov.dev@gmail.com,
graf@amazon.com, jason.zeng@intel.com, lei.l.li@intel.com,
daniel.m.jordan@oracle.com, steven.sistare@oracle.com,
linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org,
kexec@lists.infradead.org
Subject: [RFC v2 00/43] PKRAM: Preserved-over-Kexec RAM
Date: Tue, 30 Mar 2021 14:35:35 -0700
Message-ID: <1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com>
This patchset implements preserved-over-kexec memory storage or PKRAM as a
method for saving memory pages of the currently executing kernel so that
they may be restored after kexec into a new kernel. The patches are adapted
from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
introduce the PKRAM kernel API and implement its use within tmpfs, allowing
tmpfs files to be preserved across kexec.
One use case for PKRAM is preserving guest memory and/or auxiliary supporting
data (e.g. iommu data) across kexec in support of VMM Fast Restart [2].
VMM Fast Restart currently uses PKRAM to preserve "Keep Alive
State" across reboot [3]. PKRAM provides a flexible way of doing this
without requiring that the amount of memory used be a fixed size created
a priori. Another use case is for databases to preserve their block caches
in shared memory across reboot.
Changes since RFC v1
- Rebased onto 5.12-rc4
- Refined the API to reduce the number of calls
and better support multithreading.
- Allow preserving byte data of arbitrary length
(was previously limited to one page).
- Build a new memblock reserved list with the
preserved ranges and then substitute it for
the existing one. (Mike Rapoport)
- Use mem_avoid_overlap() to avoid kaslr stepping
on preserved ranges. (Kees Cook)
-- Usage --
1) Mount tmpfs with 'pkram=NAME' option.
NAME is an arbitrary string specifying a preserved memory node.
Different tmpfs trees may be saved to PKRAM if different names are
passed.
# mkdir -p /mnt
# mount -t tmpfs -o pkram=mytmpfs none /mnt
2) Populate a file under /mnt
# head -c 2G /dev/urandom > /mnt/testfile
# md5sum /mnt/testfile
e281e2f019ac3bfa3bdb28aa08c4beb3 /mnt/testfile
3) Remount tmpfs to preserve files.
# mount -o remount,preserve,ro /mnt
4) Load the new kernel image.
Pass the PKRAM super block pfn via 'pkram' boot option. The pfn is
exported via the sysfs file /sys/kernel/pkram.
# kexec -s -l /boot/vmlinuz-$kernel --initrd=/boot/initramfs-$kernel.img \
--append="$(cat /proc/cmdline|sed -e 's/pkram=[^ ]*//g') pkram=$(cat /sys/kernel/pkram)"
5) Boot to the new kernel.
# systemctl kexec
6) Mount tmpfs with 'pkram=NAME' option.
It should find the PKRAM node with the tmpfs tree saved by the earlier
'preserve' remount and restore it.
# mount -t tmpfs -o pkram=mytmpfs none /mnt
7) Use the restored file under /mnt
# md5sum /mnt/testfile
e281e2f019ac3bfa3bdb28aa08c4beb3 /mnt/testfile
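The command-line handling in step 4 can be pulled out into a small helper so
that it can be checked without a PKRAM-enabled kernel. This is only a sketch:
the function name pkram_append is hypothetical and not part of the patchset,
and the pfn is passed through as an opaque string in whatever format
/sys/kernel/pkram reports it.

```shell
# Hypothetical helper (not part of the patchset): build the --append string
# for kexec from an existing kernel command line and a PKRAM super block pfn.
pkram_append() {
    cmdline=$1
    pfn=$2
    # Drop any stale pkram=... token, then squeeze the leftover whitespace.
    cleaned=$(printf '%s\n' "$cmdline" |
        sed -e 's/pkram=[^ ]*//g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//')
    printf '%s pkram=%s\n' "$cleaned" "$pfn"
}

# On a PKRAM-enabled kernel this would be called roughly as:
#   pkram_append "$(cat /proc/cmdline)" "$(cat /sys/kernel/pkram)"
```

Removing the old pkram= token first matters on the second and later kexec
cycles, where /proc/cmdline already carries the pfn from the previous boot.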
-- Implementation details --
* When a tmpfs filesystem is mounted the first time with the 'pkram=NAME'
option, a shmem_pkram_info is allocated to record NAME. The shmem_pkram_info
and whether the filesystem is in the preserved state are tracked by
shmem_sb_info.
* A PKRAM-enabled tmpfs filesystem is saved to PKRAM on remount when the
'preserve' mount option is specified and the filesystem is read-only.
* Saving a file to PKRAM is done by walking the pages of the file and
building a list of the pages and attributes needed to restore them later.
The pages containing this metadata as well as the target file pages have
their refcount incremented to prevent them from being freed even after
the last user puts the pages (i.e. the filesystem is unmounted).
* To aid in quickly finding contiguous ranges of memory containing
preserved pages a pseudo physical mapping pagetable is populated
with pages as they are preserved.
* If a page to be preserved is found to be in a range of memory that was
previously reserved during early boot or in a range of memory where the
kernel will be loaded on kexec, the page will be copied to a page
outside of those ranges and the new page will be preserved. A compound
page will be copied to and preserved as individual base pages.
* A single page is allocated for the PKRAM super block. For the next kernel
to find the preserved memory metadata after kexec, the pfn of the PKRAM super
block, which is exported via /sys/kernel/pkram, is passed in the 'pkram'
boot option.
* In the newly booted kernel, PKRAM adds all preserved pages to the memblock
reserve list during early boot so that they will not be recycled.
* Since kexec may load the new kernel code to any memory region, it could
destroy preserved memory. When the kernel selects the memory region
(kexec_file_load syscall), kexec will avoid preserved pages. When the
user selects the kexec memory region to use (kexec_load syscall), the kexec
load will fail if there is a conflict with preserved pages. Pages preserved
after a kexec kernel is loaded will be relocated if they conflict with
the selected memory region.
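The conflict rule for kexec_load described above comes down to an interval
overlap test. The following is a userspace model of that test only, not the
kernel code from this series: the function names are hypothetical, and ranges
are assumed to be half-open pfn intervals [start, end) written as "start:end".

```shell
# Hypothetical userspace model, not the kernel implementation.
# Two half-open ranges [s1, e1) and [s2, e2) overlap iff s1 < e2 and s2 < e1.
ranges_overlap() {
    # usage: ranges_overlap START1 END1 START2 END2
    [ "$1" -lt "$4" ] && [ "$3" -lt "$2" ]
}

# Return 0 if the candidate segment [start, end) conflicts with any preserved
# range listed as space-separated "start:end" words in the third argument.
segment_conflicts() {
    seg_start=$1
    seg_end=$2
    for r in $3; do
        if ranges_overlap "$seg_start" "$seg_end" "${r%%:*}" "${r##*:}"; then
            return 0
        fi
    done
    return 1
}
```

For example, segment_conflicts 100 200 "0:50 150:160" succeeds because the
second range lies inside the segment, while "0:100 200:300" does not conflict
since the ranges only touch at their ends.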
The current implementation has some restrictions:
* Only regular tmpfs files without multiple hard links can be preserved.
Save to PKRAM will abort and log an error if a directory or other file
type is encountered.
* Pages for PKRAM-enabled files are prevented from swapping out to avoid
the performance penalty of swapping in and the possibility of insufficient
memory.
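Because a save aborts on the first unsupported file, it can be useful to scan
the mount before the 'preserve' remount. The sketch below is a hypothetical
pre-check, not part of the patchset; it assumes a find(1) that supports
-mindepth (GNU findutils does) and flags anything under the mount point that
is not a regular file or that has more than one hard link.

```shell
# Hypothetical pre-check, not part of the patchset.
# Print entries under the given directory that the restrictions above would
# reject: non-regular files, or regular files with more than one hard link.
pkram_unsupported() {
    find "$1" -mindepth 1 \( ! -type f -o -links +1 \) -print
}

# Return 0 if nothing unsupported was found, 1 otherwise.
pkram_precheck() {
    bad=$(pkram_unsupported "$1")
    if [ -n "$bad" ]; then
        printf 'not preservable:\n%s\n' "$bad" >&2
        return 1
    fi
}

# Typical use before step 3 of the usage section:
#   pkram_precheck /mnt && mount -o remount,preserve,ro /mnt
```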
-- Patches --
The patches are broken down into the following groups:
Patches 1-22 implement the API and supporting functionality.
Patches 23-27 implement the use of PKRAM within tmpfs.
The remaining patches implement optimizations to the initialization of
preserved pages and to the preservation and restoration of shmem pages.
To give an idea of the improvement in performance here is an example
comparison with and without these patches when saving and loading a 100G
file:
Save a 100G file:

            | No optimizations | Optimized (16 cpus) |
------------------------------------------------------
huge=never  |      2265ms      |        232ms        |
------------------------------------------------------
huge=always |        58ms      |         22ms        |

Load a 100G file:

            | No optimizations | Optimized (16 cpus) |
------------------------------------------------------
huge=never  |      8833ms      |        516ms        |
------------------------------------------------------
huge=always |       752ms      |        105ms        |
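To read the tables as ratios, the speedup is the unoptimized time divided by
the optimized time. A small helper (hypothetical, shown only for the
arithmetic) computes it from the numbers above:

```shell
# Speedup = unoptimized time / optimized time (both in milliseconds).
speedup() {
    awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.1f\n", base / opt }'
}

speedup 2265 232   # save, huge=never
speedup 58 22      # save, huge=always
speedup 8833 516   # load, huge=never
speedup 752 105    # load, huge=always
```

This works out to roughly 9.8x and 2.6x for save, and 17.1x and 7.2x for load.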
Patches 28-31: Defer initialization of page structs for preserved pages.
Patches 32-34: Implement multi-threading of shmem page preservation and
restoration.
Patches 35-37: Implement and use an API for inserting shmem pages in bulk.
Patches 38-39: Reduce contention on the LRU lock by staging and adding pages
in bulk to the LRU.
Patches 40-43: Reduce contention on the pagecache xarray lock by inserting
pages in bulk in certain cases.
[1] https://lkml.org/lkml/2013/7/1/211
[2] https://www.youtube.com/watch?v=pBsHnf93tcQ
https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
[3] https://www.youtube.com/watch?v=pBsHnf93tcQ
https://static.sched.com/hosted_files/kvmforum2020/10/Device-Keepalive-State-KVMForum2020.pdf
Anthony Yznaga (43):
mm: add PKRAM API stubs and Kconfig
mm: PKRAM: implement node load and save functions
mm: PKRAM: implement object load and save functions
mm: PKRAM: implement page stream operations
mm: PKRAM: support preserving transparent hugepages
mm: PKRAM: implement byte stream operations
mm: PKRAM: link nodes by pfn before reboot
mm: PKRAM: introduce super block
PKRAM: track preserved pages in a physical mapping pagetable
PKRAM: pass a list of preserved ranges to the next kernel
PKRAM: prepare for adding preserved ranges to memblock reserved
mm: PKRAM: reserve preserved memory at boot
PKRAM: free the preserved ranges list
PKRAM: prevent inadvertent use of a stale superblock
PKRAM: provide a way to ban pages from use by PKRAM
kexec: PKRAM: prevent kexec clobbering preserved pages in some cases
PKRAM: provide a way to check if a memory range has preserved pages
kexec: PKRAM: avoid clobbering already preserved pages
mm: PKRAM: allow preserved memory to be freed from userspace
PKRAM: disable feature when running the kdump kernel
x86/KASLR: PKRAM: support physical kaslr
x86/boot/compressed/64: use 1GB pages for mappings
mm: shmem: introduce shmem_insert_page
mm: shmem: enable saving to PKRAM
mm: shmem: prevent swapping of PKRAM-enabled tmpfs pages
mm: shmem: specify the mm to use when inserting pages
mm: shmem: when inserting, handle pages already charged to a memcg
x86/mm/numa: add numa_isolate_memblocks()
PKRAM: ensure memblocks with preserved pages init'd for numa
memblock: PKRAM: mark memblocks that contain preserved pages
memblock, mm: defer initialization of preserved pages
shmem: preserve shmem files a chunk at a time
PKRAM: atomically add and remove link pages
shmem: PKRAM: multithread preserving and restoring shmem pages
shmem: introduce shmem_insert_pages()
PKRAM: add support for loading pages in bulk
shmem: PKRAM: enable bulk loading of preserved pages into shmem
mm: implement splicing a list of pages to the LRU
shmem: optimize adding pages to the LRU in shmem_insert_pages()
shmem: initial support for adding multiple pages to pagecache
XArray: add xas_export_node() and xas_import_node()
shmem: reduce time holding xa_lock when inserting pages
PKRAM: improve index alignment of pkram_link entries
Documentation/core-api/xarray.rst | 8 +
arch/x86/boot/compressed/Makefile | 3 +
arch/x86/boot/compressed/ident_map_64.c | 9 +-
arch/x86/boot/compressed/kaslr.c | 10 +-
arch/x86/boot/compressed/misc.h | 10 +
arch/x86/boot/compressed/pkram.c | 109 ++
arch/x86/include/asm/numa.h | 4 +
arch/x86/kernel/setup.c | 3 +
arch/x86/mm/init_64.c | 2 +
arch/x86/mm/numa.c | 32 +-
include/linux/memblock.h | 6 +
include/linux/mm.h | 2 +-
include/linux/pkram.h | 120 ++
include/linux/shmem_fs.h | 28 +
include/linux/swap.h | 13 +
include/linux/xarray.h | 2 +
kernel/kexec.c | 9 +
kernel/kexec_core.c | 3 +
kernel/kexec_file.c | 15 +
lib/test_xarray.c | 45 +
lib/xarray.c | 100 ++
mm/Kconfig | 9 +
mm/Makefile | 1 +
mm/memblock.c | 11 +-
mm/page_alloc.c | 55 +-
mm/pkram.c | 1808 +++++++++++++++++++++++++++++++
mm/pkram_pagetable.c | 376 +++++++
mm/shmem.c | 494 ++++++++-
mm/shmem_pkram.c | 530 +++++++++
mm/swap.c | 86 ++
30 files changed, 3869 insertions(+), 34 deletions(-)
create mode 100644 arch/x86/boot/compressed/pkram.c
create mode 100644 include/linux/pkram.h
create mode 100644 mm/pkram.c
create mode 100644 mm/pkram_pagetable.c
create mode 100644 mm/shmem_pkram.c
--
1.8.3.1