* [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
@ 2021-08-07  3:25 Peter Xu
  2021-08-07  3:25 ` [PATCH RFC 1/4] mm: Introduce PTE_MARKER swap entry Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Peter Xu @ 2021-08-07  3:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Alistair Popple, Tiberiu Georgescu, ivan.teterevkov,
	Mike Rapoport, Hugh Dickins, peterx, Matthew Wilcox,
	Andrea Arcangeli, David Hildenbrand, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz

Summary
=======

[Based on v5.14-rc4]

This patchset enables the PM_SWAP bit of pagemap for shmem.  IOW userspace will
be able to detect whether a shmem page is swapped out, just as it already can
for anonymous pages.

This feature can be enabled with CONFIG_PTE_MARKER_PAGEOUT.  When enabled, it
adds a 0.8% overhead to shmem swap-in performance, so I didn't make it the
default yet.  However, IMHO 0.8% is still within an acceptable range, so we
could even make it the default eventually.  Comments are welcome here.

There's a previous series by Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
that wanted to address the same issue in another way, here:

https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/

That series does it by looking up the page cache for all none ptes.  However,
I raised a concern about its 4x performance degradation for all shmem pagemap
users.

Unlike the other approach, this series has zero overhead on pagemap read
because the PM_SWAP info is consolidated into the zapped PTEs directly.

Goals
=====

One major goal of this series is to add PM_SWAP support; the reason is as
stated by Tiberiu and Ivan in the other patchset:

https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/

As a summary: userspace needs, for its own reasons, to scan the pages in the
background; however, that scanning could misguide page reclaim about which
pages are hot and which are cold.  With correct PM_SWAP information, userspace
can correct the behavior of page reclaim by first fetching that info from
pagemap, then calling madvise(MADV_PAGEOUT) explicitly.  In their case the
pages back a guest, but it could be any shmem page.
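For illustration, a minimal sketch of one such scanner step might look like
below.  The PM_SWAP/PM_PRESENT bit positions are per
Documentation/admin-guide/mm/pagemap.rst; the helper names and overall
structure are made up for this sketch, not taken from any real scanner:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT	21	/* from <linux/mman.h>, since Linux 5.4 */
#endif

#define PM_SWAP		(1ULL << 62)	/* pagemap: page is swapped out */
#define PM_PRESENT	(1ULL << 63)	/* pagemap: page is present in RAM */

/* Fetch the 64-bit pagemap entry covering "addr" of this process */
static int pagemap_read(void *addr, uint64_t *ent)
{
	long psize = sysconf(_SC_PAGESIZE);
	off_t off = (uintptr_t)addr / psize * sizeof(uint64_t);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	int ret = -1;

	if (fd < 0)
		return -1;
	if (pread(fd, ent, sizeof(*ent), off) == sizeof(*ent))
		ret = 0;
	close(fd);
	return ret;
}

/*
 * One scanner step over a page-aligned shmem address: remember whether
 * the page was swapped out before touching it; if the scan faulted a
 * cold page back in, explicitly page it out again so that reclaim is
 * not misguided by the scanner's own access.
 */
static void scan_one_page(volatile char *addr)
{
	uint64_t ent;
	int was_swapped;

	if (pagemap_read((void *)addr, &ent))
		return;
	was_swapped = !!(ent & PM_SWAP);

	(void)*addr;		/* the background scan access itself */

	if (was_swapped)
		madvise((void *)addr, sysconf(_SC_PAGESIZE), MADV_PAGEOUT);
}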

Another major goal of this series is to provide a proof of concept of the PTE
marker idea, which is also the major reason why it's an RFC.  So far, PTE
markers can potentially solve the three problems below that I'm aware of:

  (a) PM_SWAP on shmem

  (b) Userfaultfd-wp on shmem/hugetlbfs

  (c) PM_SOFT_DIRTY lost for shmem over swapping

This series tries to resolve problem (a), which should be the simplest.
Ideally it should solve the immediate live migration issue raised by Tiberiu
and Ivan around proactively paging out unused guest pages.

Both (a) and (c) are performance- or statistics-oriented.

Scenario (b) will require pte markers as part of the machinery to trap writes
to uffd-wp protected regions whose pages were e.g. swapped out or zapped for
any reason.

Currently, the uffd-wp shmem work (still under review on the list, latest v5,
[1]) uses another solution called "special swap pte".  It works similarly to
PTE markers in that both approaches persist information into zapped ptes, but
people raised concerns about that idea and suggested a safer (swp-entry level,
not pte level) and arch-independent approach.

Hopefully PTE markers satisfy these demands.

Before I rework the uffd-wp series, I wanted to know whether this approach can
be accepted upstream.  So besides the swap part, comments on PTE markers would
be extremely welcome.

What are PTE Markers?
=====================

PTE markers are special PTEs that work like a "marker" in the everyday sense.
A marker uses a swap type, IOW it's not a valid/present pte, so the processor
will trigger a page fault when it's accessed.  Meanwhile, the format of the
PTE is well-defined, so it can carry information that we would like to know
before/while the page access happens.

In this specific case, when a shmem page is paged out, we set a marker showing
that the page was paged out; then when pagemap reads this pte, we know it's a
swapped-out/very-cold page.

This use case is not an obvious one, but it is the simplest.  The uffd-wp use
case is more obvious (wr-protect is per-pte, so we can't save it into the page
cache; meanwhile we need that info to persist across zappings, e.g. thp split
or pageout of shmem pages).

In the future, it can contain more information, e.g. whether this pte is
wr-protected by userfaultfd, or whether this pte was written to in this mm
context for soft-dirty tracking.  On 64-bit systems, we have a total of 58
bits available (the swp_offset).

I'm also curious whether it can be expanded further to other mm areas.  E.g.,
logically it could also work for non-RAM-based memory outside shmem/hugetlbfs,
such as a common file system like ext4 or btrfs.  As long as there's a need to
store some per-pte information across zapping of the ptes, it can be
considered.
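To make the encoding concrete, here is a small userspace model of the
arch-independent swap entry layout this series builds on (a sketch: the
58-bit shift matches x86_64 and the SWP_PTE_MARKER value is merely
illustrative; this is not the kernel's code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Userspace model of the arch-independent swap entry layout: 5 bits of
 * type above 58 bits of offset (cf. include/linux/swapops.h).  A marker
 * pte reuses one swap type; the "offset" carries the marker bits.
 */
#define SWP_TYPE_SHIFT		58
#define SWP_OFFSET_MASK		((1UL << SWP_TYPE_SHIFT) - 1)
#define SWP_PTE_MARKER		31	/* illustrative: a type past MAX_SWAPFILES */
#define PTE_MARKER_PAGEOUT	(1UL << 0)

typedef struct { unsigned long val; } swp_entry_t;

static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
{
	return (swp_entry_t){ (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK) };
}

static bool is_pte_marker_entry(swp_entry_t e)
{
	return (e.val >> SWP_TYPE_SHIFT) == SWP_PTE_MARKER;
}

static unsigned long pte_marker_get(swp_entry_t e)
{
	return e.val & PTE_MARKER_PAGEOUT;	/* mask with the known marker bits */
}

int main(void)
{
	/* What the pageout path installs... */
	swp_entry_t e = swp_entry(SWP_PTE_MARKER, PTE_MARKER_PAGEOUT);

	/* ...and what the fault/pagemap paths later decode */
	printf("is marker: %d, pageout bit: %lu\n",
	       is_pte_marker_entry(e), pte_marker_get(e));
	return 0;
}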

Known Issues/Concerns
=====================

About THP
---------

Currently we don't need to worry about THP, because paged-out shmem pages are
split when shrinking; IOW we only need to consider PTEs, and the markers will
only be applied to shmem ptes, not pmds or bigger.

About PM_SWAP Accuracy
----------------------

This is not an "accurate" solution for providing the PM_SWAP bit.  Two examples:

  - When process A & B both map shmem page P somewhere, it can happen that only
    one of these ptes gets marked with the pte marker.  Imagine the sequence
    below:

    0. Process A & B both map shmem page P somewhere
    1. Process A zaps its pte of page P for some reason (e.g. thp split)
    2. System decides to recycle page P
    3. System replaces process B's pte (pointing to P) with a PTE marker
    4. System _didn't_ replace process A's pte because it was a none pte, and
       it'll continue to be a none pte
    5. Only process B's relevant pte carries the PTE marker after P is swapped
       out

  - On fork, we don't copy shmem vma ptes, including the pte markers.  So even
    if page P was swapped out, only the parent process has the pte marker
    installed; in the child it'll be a none pte if fork() happened after the
    pageout.

Conclusion: just as before, PM_SWAP is best-effort.  But it should work in
99.99% of cases, and it should already start to solve problems.

About Performance Impact
------------------------

Due to the special PTE marker, the page fault logic needs to understand this
pte, and there is some extra logic to handle it.  The overhead is barely
observable: a 0.82% perf drop.

For more information, please see the test section below where I wrote a test
for it.  If that small difference really matters, the user can also disable
the shmem PM_SWAP support with !CONFIG_PTE_MARKER_PAGEOUT.

Tests
=====

The test case I used is here:

https://github.com/xzpeter/clibs/blob/master/bsd/pagemap.c
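For context when reading the output below, the sequence of operations behind
the labels is roughly the following.  This is my condensed reading of the
test, not a copy of it; in particular, mapping the REMOVE label to
MADV_REMOVE is an assumption:

/* Condensed sequence behind the labels below (an assumption from the
 * labels; the real code is in the pagemap.c linked above).  After each
 * step the test dumps the present/swap bits from /proc/self/pagemap. */
long psize = sysconf(_SC_PAGESIZE);
int fd = memfd_create("pagemap-test", 0);	/* shmem-backed file */
char *page;

ftruncate(fd, psize);
page = mmap(NULL, psize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

*(volatile char *)page = 1;			/* FAULT1:  fault the page in     */
madvise(page, psize, MADV_PAGEOUT);		/* PAGEOUT: page it out to swap   */
(void)*(volatile char *)page;			/* FAULT2:  swap the page back in */
madvise(page, psize, MADV_REMOVE);		/* REMOVE:  punch it from shmem   */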

Functional test
---------------

Running with !CONFIG_PTE_MARKER_PAGEOUT, we miss PM_SWAP when pages are paged
out (note the swap bit is always zero):

       FAULT1 (expect swap==0): present bit 1, swap bit 0
      PAGEOUT (expect swap==1): present bit 0, swap bit 0
       FAULT2 (expect swap==0): present bit 1, swap bit 0
       REMOVE (expect swap==0): present bit 0, swap bit 0
      PAGEOUT (expect swap==1): present bit 0, swap bit 0
       REMOVE (expect swap==0): present bit 0, swap bit 0

Running with CONFIG_PTE_MARKER_PAGEOUT, we can observe the correct PM_SWAP:

       FAULT1 (expect swap==0): present bit 1, swap bit 0
      PAGEOUT (expect swap==1): present bit 0, swap bit 1
       FAULT2 (expect swap==0): present bit 1, swap bit 0
       REMOVE (expect swap==0): present bit 0, swap bit 0
      PAGEOUT (expect swap==1): present bit 0, swap bit 1
       REMOVE (expect swap==0): present bit 0, swap bit 0

Performance test
----------------

The performance test is not about pagemap reading, because that should be the
same as before.  Instead, there is indeed extra overhead in the fault path when
the page is swapped in from disk.  I ran sequential swap-in tests over a 1GB
range (5 iterations each) to measure the difference.
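The measurement loop has roughly the shape below (a sketch with hypothetical
buf/size parameters; the real harness is the pagemap.c linked above):

#include <stddef.h>
#include <time.h>

/* Time how long it takes to fault back in a shmem range whose pages
 * were paged out beforehand; returns microseconds. */
static double time_swapin(volatile char *buf, size_t size, size_t psize)
{
	struct timespec t0, t1;
	size_t i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < size; i += psize)
		(void)buf[i];			/* read fault -> swap in */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (t1.tv_sec - t0.tv_sec) * 1e6 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e3;
}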

Hardware I used:

        Processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
        Memory:    32GB memory, 16GB swap (on a PERC H330 Mini 2TBi disk)
        Test Size: 1GB shmem

I only measured the time to fault the pages back in from disk, so the
measurement does not include pageout time; refer to the .c file for details.
Results:

   |-----------------------------------+------------------+------------|
   | Config                            | Time used (us)   | Change (%) |
   |-----------------------------------+------------------+------------|
   | !PTE_MARKER                       | 519652 (+-0.73%) |        N/A |
   | PTE_MARKER && !PTE_MARKER_PAGEOUT | 519874 (+-0.40%) |     -0.04% |
   | PTE_MARKER && PTE_MARKER_PAGEOUT  | 523914 (+-0.71%) |     -0.82% |
   |-----------------------------------+------------------+------------|

Any comments would be greatly appreciated.

[1] https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

Peter Xu (4):
  mm: Introduce PTE_MARKER swap entry
  mm: Check against orig_pte for finish_fault()
  mm: Handle PTE_MARKER page faults
  mm: Install marker pte when page out for shmem pages

 fs/proc/task_mmu.c      |  1 +
 include/linux/rmap.h    |  1 +
 include/linux/swap.h    | 14 ++++++++++++-
 include/linux/swapops.h | 45 +++++++++++++++++++++++++++++++++++++++++
 mm/Kconfig              | 17 ++++++++++++++++
 mm/memory.c             | 43 ++++++++++++++++++++++++++++++++++++++-
 mm/rmap.c               | 19 +++++++++++++++++
 mm/vmscan.c             |  2 +-
 8 files changed, 139 insertions(+), 3 deletions(-)

-- 
2.32.0




* [PATCH RFC 1/4] mm: Introduce PTE_MARKER swap entry
  2021-08-07  3:25 [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER Peter Xu
@ 2021-08-07  3:25 ` Peter Xu
  2021-08-07  3:25 ` [PATCH RFC 2/4] mm: Check against orig_pte for finish_fault() Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-08-07  3:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Alistair Popple, Tiberiu Georgescu, ivan.teterevkov,
	Mike Rapoport, Hugh Dickins, peterx, Matthew Wilcox,
	Andrea Arcangeli, David Hildenbrand, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz

This patch introduces a new swap entry type called PTE_MARKER.  It can be
installed in any pte that maps file-backed memory when the pte is temporarily
zapped, so as to maintain per-pte information.

The information kept in the pte is called a "marker".  Here we define the
marker as an "unsigned long" just to match pgoff_t; however, it only works if
it still fits in swp_offset(), which is currently e.g. 58 bits on x86_64.

The first marker bit introduced together with the new swap pte is the PAGEOUT
marker.  When that bit is set, it means this pte used to point to a page which
got swapped out.  For now it's mostly a definition, so the swap type isn't
completely empty; the functions to handle the new swap type are not
implemented yet.

A new config option, CONFIG_PTE_MARKER, is introduced too; it defaults to off.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swap.h    | 14 ++++++++++++-
 include/linux/swapops.h | 45 +++++++++++++++++++++++++++++++++++++++++
 mm/Kconfig              | 17 ++++++++++++++++
 3 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6f5a43251593..545dc8e0b0fb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -55,6 +55,18 @@ static inline int current_is_kswapd(void)
  * actions on faults.
  */
 
+/*
+ * PTE markers are used to persist information on PTEs that map
+ * file-backed memory.
+ */
+#ifdef CONFIG_PTE_MARKER
+#define SWP_PTE_MARKER_NUM 1
+#define SWP_PTE_MARKER     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
+			    SWP_MIGRATION_NUM + SWP_DEVICE_NUM)
+#else
+#define SWP_PTE_MARKER_NUM 0
+#endif
+
 /*
  * Unaddressable device memory support. See include/linux/hmm.h and
  * Documentation/vm/hmm.rst. Short description is we need struct pages for
@@ -100,7 +112,7 @@ static inline int current_is_kswapd(void)
 
 #define MAX_SWAPFILES \
 	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
-	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_PTE_MARKER_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..3fec83449e1e 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -247,6 +247,51 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
 
 #endif
 
+#ifdef CONFIG_PTE_MARKER
+
+#ifdef CONFIG_PTE_MARKER_PAGEOUT
+/* When this bit is set, it means this page was previously swapped out */
+#define  PTE_MARKER_PAGEOUT  (1UL << 0)
+#else
+#define  PTE_MARKER_PAGEOUT  0
+#endif
+
+#define  PTE_MARKER_MASK     (PTE_MARKER_PAGEOUT)
+
+static inline swp_entry_t make_pte_marker_entry(unsigned long marker)
+{
+	return swp_entry(SWP_PTE_MARKER, marker);
+}
+
+static inline bool is_pte_marker_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_PTE_MARKER;
+}
+
+static inline unsigned long pte_marker_get(swp_entry_t entry)
+{
+	return swp_offset(entry) & PTE_MARKER_MASK;
+}
+
+#else /* CONFIG_PTE_MARKER */
+
+static inline swp_entry_t make_pte_marker_entry(unsigned long marker)
+{
+	return swp_entry(0, 0);
+}
+
+static inline bool is_pte_marker_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline unsigned long pte_marker_get(swp_entry_t entry)
+{
+	return 0;
+}
+
+#endif /* CONFIG_PTE_MARKER */
+
 static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 {
 	struct page *p = pfn_to_page(swp_offset(entry));
diff --git a/mm/Kconfig b/mm/Kconfig
index 40a9bfcd5062..6043d8f1c066 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -889,4 +889,21 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+config PTE_MARKER
+        def_bool n
+        bool "Marker PTEs support"
+
+	help
+	  Allow creation of marker PTEs for file-backed memory.
+
+config PTE_MARKER_PAGEOUT
+        def_bool n
+        depends on PTE_MARKER
+        bool "Shmem pagemap PM_SWAP support"
+
+	help
+	  Allow creation of marker PTEs for file-backed memory when the page
+	  is swapped out.  This is required for pagemap to report PM_SWAP
+	  correctly for shmem pages that have been swapped out.
+
 endmenu
-- 
2.32.0



* [PATCH RFC 2/4] mm: Check against orig_pte for finish_fault()
  2021-08-07  3:25 [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER Peter Xu
  2021-08-07  3:25 ` [PATCH RFC 1/4] mm: Introduce PTE_MARKER swap entry Peter Xu
@ 2021-08-07  3:25 ` Peter Xu
  2021-08-07  3:25 ` [PATCH RFC 3/4] mm: Handle PTE_MARKER page faults Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-08-07  3:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Alistair Popple, Tiberiu Georgescu, ivan.teterevkov,
	Mike Rapoport, Hugh Dickins, peterx, Matthew Wilcox,
	Andrea Arcangeli, David Hildenbrand, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz

We used to check against the none pte, and in those cases orig_pte should
always be a none pte anyway.

This change prepares us to be able to call do_fault() on !none ptes.  For
example, we should allow that to happen for a pte marker that has PAGEOUT set.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 25fc46e87214..7288f585544a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4047,7 +4047,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_none(*vmf->pte)))
+	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 		do_set_pte(vmf, page, vmf->address);
 	else
 		ret = VM_FAULT_NOPAGE;
-- 
2.32.0



* [PATCH RFC 3/4] mm: Handle PTE_MARKER page faults
  2021-08-07  3:25 [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER Peter Xu
  2021-08-07  3:25 ` [PATCH RFC 1/4] mm: Introduce PTE_MARKER swap entry Peter Xu
  2021-08-07  3:25 ` [PATCH RFC 2/4] mm: Check against orig_pte for finish_fault() Peter Xu
@ 2021-08-07  3:25 ` Peter Xu
  2021-08-07  3:25 ` [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages Peter Xu
  2021-08-17  9:04 ` [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER David Hildenbrand
  4 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-08-07  3:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Alistair Popple, Tiberiu Georgescu, ivan.teterevkov,
	Mike Rapoport, Hugh Dickins, peterx, Matthew Wilcox,
	Andrea Arcangeli, David Hildenbrand, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz

handle_pte_marker() is the function that parses and handles all the pte marker
faults.  For the PAGEOUT marker, it's as simple as dropping the pte and doing
the fault just like a none pte.

An alternative would be to clear the pte to a none pte and retry the fault;
however, that would be slower than handling it right away.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 7288f585544a..47f8ca064459 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -98,6 +98,8 @@ struct page *mem_map;
 EXPORT_SYMBOL(mem_map);
 #endif
 
+static vm_fault_t do_fault(struct vm_fault *vmf);
+
 /*
  * A number of key systems in x86 including ioremap() rely on the assumption
  * that high_memory defines the upper bound on direct map memory, then end
@@ -1394,6 +1396,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			put_page(page);
 			continue;
+		} else if (is_pte_marker_entry(entry)) {
+			/* Drop PTE_MARKER_PAGEOUT when zapped */
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			continue;
 		}
 
 		/* If details->check_mapping, we leave swap entries. */
@@ -3467,6 +3473,39 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+/*
+ * This function parses PTE markers and handles the faults.  For a PAGEOUT
+ * marker we complete the fault via do_fault(), just like a none pte; for
+ * any marker we cannot handle, we return VM_FAULT_SIGBUS.
+ */
+static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
+{
+	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
+	unsigned long marker;
+
+	marker = pte_marker_get(entry);
+
+	/*
+	 * PTE markers should always be with file-backed memories, and the
+	 * marker should never be empty.  If anything weird happened, the best
+	 * thing to do is to kill the process along with its mm.
+	 */
+	if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
+		return VM_FAULT_SIGBUS;
+
+#ifdef CONFIG_PTE_MARKER_PAGEOUT
+	if (marker == PTE_MARKER_PAGEOUT)
+		/*
+		 * This pte is previously zapped for swap, the PAGEOUT is only
+		 * a flag before it's accessed again.  Safe to drop it now.
+		 */
+		return do_fault(vmf);
+#endif
+
+	/* We see some marker that we can't handle */
+	return VM_FAULT_SIGBUS;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3503,6 +3542,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_pte_marker_entry(entry)) {
+			ret = handle_pte_marker(vmf);
 		} else {
 			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
-- 
2.32.0



* [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages
  2021-08-07  3:25 [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER Peter Xu
                   ` (2 preceding siblings ...)
  2021-08-07  3:25 ` [PATCH RFC 3/4] mm: Handle PTE_MARKER page faults Peter Xu
@ 2021-08-07  3:25 ` Peter Xu
  2021-08-13 15:18   ` Tiberiu Georgescu
  2021-08-17  9:04 ` [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER David Hildenbrand
  4 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-08-07  3:25 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Alistair Popple, Tiberiu Georgescu, ivan.teterevkov,
	Mike Rapoport, Hugh Dickins, peterx, Matthew Wilcox,
	Andrea Arcangeli, David Hildenbrand, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz

When shmem pages are swapped out, instead of clearing the pte entry, we leave a
marker pte showing that this page is swapped out as a hint for pagemap.  A new
TTU flag is introduced to identify this case.

This can be useful for detecting swapped-out cold shmem pages.  Then, after
some background memory scanning work (which will fault in the shmem page and
confuse page reclaim), we can do MADV_PAGEOUT explicitly on this page to swap
it out again, as we know it was cold.

For pagemap, we don't need to explicitly set the PM_SWAP bit, because
SWP_PTE_MARKER ptes are by nature already counted as PM_SWAP due to their
swap pte format.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/proc/task_mmu.c   |  1 +
 include/linux/rmap.h |  1 +
 mm/rmap.c            | 19 +++++++++++++++++++
 mm/vmscan.c          |  2 +-
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..21b8594abc1d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1384,6 +1384,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (pm->show_pfn)
 			frame = swp_type(entry) |
 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+		/* NOTE: this covers PTE_MARKER_PAGEOUT too */
 		flags |= PM_SWAP;
 		if (is_pfn_swap_entry(entry))
 			page = pfn_swap_entry_to_page(entry);
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c976cc6de257..318a0e95c7fb 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -95,6 +95,7 @@ enum ttu_flags {
 					 * do a final flush if necessary */
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
+	TTU_HINT_PAGEOUT	= 0x100, /* Hint for pageout operation */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/rmap.c b/mm/rmap.c
index b9eb5c12f3fe..24a70b36b6da 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1384,6 +1384,22 @@ void page_remove_rmap(struct page *page, bool compound)
 	unlock_page_memcg(page);
 }
 
+static inline void
+pte_marker_install(struct vm_area_struct *vma, pte_t *pte,
+		   struct page *page, unsigned long address)
+{
+#ifdef CONFIG_PTE_MARKER_PAGEOUT
+	swp_entry_t entry;
+	pte_t pteval;
+
+	if (vma_is_shmem(vma) && !PageAnon(page) && pte_none(*pte)) {
+		entry = make_pte_marker_entry(PTE_MARKER_PAGEOUT);
+		pteval = swp_entry_to_pte(entry);
+		set_pte_at(vma->vm_mm, address, pte, pteval);
+	}
+#endif
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1628,6 +1644,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 */
 			dec_mm_counter(mm, mm_counter_file(page));
 		}
+
+		if (flags & TTU_HINT_PAGEOUT)
+			pte_marker_install(vma, pvmw.pte, page, address);
 discard:
 		/*
 		 * No need to call mmu_notifier_invalidate_range() it has be
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..4754af6fa24b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1493,7 +1493,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page)) {
-			enum ttu_flags flags = TTU_BATCH_FLUSH;
+			enum ttu_flags flags = TTU_BATCH_FLUSH | TTU_HINT_PAGEOUT;
 			bool was_swapbacked = PageSwapBacked(page);
 
 			if (unlikely(PageTransHuge(page)))
-- 
2.32.0



* Re: [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages
  2021-08-07  3:25 ` [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages Peter Xu
@ 2021-08-13 15:18   ` Tiberiu Georgescu
  2021-08-13 16:01     ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Tiberiu Georgescu @ 2021-08-13 15:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Alistair Popple, Ivan Teterevkov,
	Mike Rapoport, Hugh Dickins, Matthew Wilcox, Andrea Arcangeli,
	David Hildenbrand, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz

Hello Peter,

Sorry for commenting so late.

> On 7 Aug 2021, at 04:25, Peter Xu <peterx@redhat.com> wrote:
> 
> When shmem pages are swapped out, instead of clearing the pte entry, we leave a
> marker pte showing that this page is swapped out as a hint for pagemap.  A new
> TTU flag is introduced to identify this case.
> 
> This can be useful for detecting swapped-out cold shmem pages.  Then, after
> some background memory scanning work (which will fault in the shmem page and
> confuse page reclaim), we can do MADV_PAGEOUT explicitly on this page to swap
> it out again, as we know it was cold.
> 
> For pagemap, we don't need to explicitly set the PM_SWAP bit, because
> SWP_PTE_MARKER ptes are by nature already counted as PM_SWAP due to their
> swap pte format.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> fs/proc/task_mmu.c   |  1 +
> include/linux/rmap.h |  1 +
> mm/rmap.c            | 19 +++++++++++++++++++
> mm/vmscan.c          |  2 +-
> 4 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index eb97468dfe4c..21b8594abc1d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1384,6 +1384,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> 		if (pm->show_pfn)
> 			frame = swp_type(entry) |
> 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> +		/* NOTE: this covers PTE_MARKER_PAGEOUT too */
> 		flags |= PM_SWAP;
> 		if (is_pfn_swap_entry(entry))
> 			page = pfn_swap_entry_to_page(entry);
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index c976cc6de257..318a0e95c7fb 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -95,6 +95,7 @@ enum ttu_flags {
> 					 * do a final flush if necessary */
> 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
> 					 * caller holds it */
> +	TTU_HINT_PAGEOUT	= 0x100, /* Hint for pageout operation */
> };
> 
> #ifdef CONFIG_MMU
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b9eb5c12f3fe..24a70b36b6da 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1384,6 +1384,22 @@ void page_remove_rmap(struct page *page, bool compound)
> 	unlock_page_memcg(page);
> }
> 
> +static inline void
> +pte_marker_install(struct vm_area_struct *vma, pte_t *pte,
> +		   struct page *page, unsigned long address)
> +{
> +#ifdef CONFIG_PTE_MARKER_PAGEOUT
> +	swp_entry_t entry;
> +	pte_t pteval;
> +
> +	if (vma_is_shmem(vma) && !PageAnon(page) && pte_none(*pte)) {
> +		entry = make_pte_marker_entry(PTE_MARKER_PAGEOUT);
> +		pteval = swp_entry_to_pte(entry);
> +		set_pte_at(vma->vm_mm, address, pte, pteval);
> +	}
> +#endif
> +}
> +
> /*
>  * @arg: enum ttu_flags will be passed to this argument
>  */
> @@ -1628,6 +1644,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> 			 */
> 			dec_mm_counter(mm, mm_counter_file(page));
> 		}
> +
> +		if (flags & TTU_HINT_PAGEOUT)
> +			pte_marker_install(vma, pvmw.pte, page, address);
> discard:
> 		/*
> 		 * No need to call mmu_notifier_invalidate_range() it has be
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4620df62f0ff..4754af6fa24b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1493,7 +1493,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> 		 * processes. Try to unmap it here.
> 		 */
> 		if (page_mapped(page)) {
> -			enum ttu_flags flags = TTU_BATCH_FLUSH;
> +			enum ttu_flags flags = TTU_BATCH_FLUSH | TTU_HINT_PAGEOUT;
> 			bool was_swapbacked = PageSwapBacked(page);
> 
> 			if (unlikely(PageTransHuge(page)))
> -- 
> 2.32.0
> 
The RFC looks good to me.  It makes it much cleaner to verify whether a page
is in swap or not, and there is a great performance benefit compared to my
patch.

Currently, your patch does not have a way of putting the swap offset onto the
entry when a shared page is swapped, so it does not cover all use cases
IMO.

To counter that, I added my patch on top of yours.  By making use of your
mechanism, we don't need to read the page cache xarray when the ptes are
definitely none.  This removes much unnecessary overhead.

Main diff in fs/proc/task_mmu.c:pte_to_pagemap_entry():
        } else if (is_swap_pte(pte)) {
                swp_entry_t entry;
                ...
                entry = pte_to_swp_entry(pte);
+               if (is_pte_marker_entry(entry)) {
+                       void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
                ...

For reference: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/.

After some preliminary performance tests on VMs, I noticed a significant
performance improvement in my patch when yours is added.  Here is the most
interesting result:

Program I tested it on: Same as https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/

Specs:
    Memory: 32GB memory, 64GB swap
    Params: 16GB allocated shmem, 50% dirty, 4GB cgroup, no batching

In short, the pagemap read deals with 
    * ~4GB of physical memory (present pages), 
    * ~4GB in swap (swapped pages)
    * 8GB non-dirty (none pages).

Results:
+------------+-------+-------+
| Peter\Tibi |  Pre  | Post  | (in seconds)
+------------+-------+-------+
| Pre        | 11.82 | 38.44 |
| Post       | 13.28 | 22.34 |
+------------+-------+-------+

With your patch, mine yields the results in twice the baseline time, down from
4x.  I am aware it is not ideal, but until it is decided whether a whole
different system is preferred over PTE markers, our solutions together
currently deliver the best performance for correctness.  Thank you for finding
this solution.

I am happy to introduce a configuration variable to enable my pagemap
patch only if necessary. Either way, if performance is a must, batching is still
the best way to access multiple pagemap entries.

Looking forward to updates,
Tibi


* Re: [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages
  2021-08-13 15:18   ` Tiberiu Georgescu
@ 2021-08-13 16:01     ` Peter Xu
  2021-08-18 18:02       ` Tiberiu Georgescu
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-08-13 16:01 UTC (permalink / raw)
  To: Tiberiu Georgescu
  Cc: linux-kernel, linux-mm, Alistair Popple, Ivan Teterevkov,
	Mike Rapoport, Hugh Dickins, Matthew Wilcox, Andrea Arcangeli,
	David Hildenbrand, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz

On Fri, Aug 13, 2021 at 03:18:22PM +0000, Tiberiu Georgescu wrote:
> Hello Peter,

Hi, Tiberiu,

> 
> Sorry for commenting so late.

No worries.  I very much appreciate your follow-up on this problem.

> 
> > On 7 Aug 2021, at 04:25, Peter Xu <peterx@redhat.com> wrote:
> > 
> > When shmem pages are swapped out, instead of clearing the pte entry, we leave a
> > marker pte showing that this page is swapped out as a hint for pagemap.  A new
> > TTU flag is introduced to identify this case.
> > 
> > This can be useful for detecting swapped-out cold shmem pages.  Then, after
> > some background memory scanning work (which will fault in the shmem page and
> > confuse page reclaim), we can do MADV_PAGEOUT explicitly on this page to swap
> > it out again, as we know it was cold.
> > 
> > For pagemap, we don't need to explicitly set the PM_SWAP bit, because
> > SWP_PTE_MARKER ptes are by nature already counted as PM_SWAP due to their
> > swap pte format.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > fs/proc/task_mmu.c   |  1 +
> > include/linux/rmap.h |  1 +
> > mm/rmap.c            | 19 +++++++++++++++++++
> > mm/vmscan.c          |  2 +-
> > 4 files changed, 22 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index eb97468dfe4c..21b8594abc1d 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1384,6 +1384,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> > 		if (pm->show_pfn)
> > 			frame = swp_type(entry) |
> > 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> > +		/* NOTE: this covers PTE_MARKER_PAGEOUT too */
> > 		flags |= PM_SWAP;
> > 		if (is_pfn_swap_entry(entry))
> > 			page = pfn_swap_entry_to_page(entry);
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index c976cc6de257..318a0e95c7fb 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -95,6 +95,7 @@ enum ttu_flags {
> > 					 * do a final flush if necessary */
> > 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
> > 					 * caller holds it */
> > +	TTU_HINT_PAGEOUT	= 0x100, /* Hint for pageout operation */
> > };
> > 
> > #ifdef CONFIG_MMU
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index b9eb5c12f3fe..24a70b36b6da 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1384,6 +1384,22 @@ void page_remove_rmap(struct page *page, bool compound)
> > 	unlock_page_memcg(page);
> > }
> > 
> > +static inline void
> > +pte_marker_install(struct vm_area_struct *vma, pte_t *pte,
> > +		   struct page *page, unsigned long address)
> > +{
> > +#ifdef CONFIG_PTE_MARKER_PAGEOUT
> > +	swp_entry_t entry;
> > +	pte_t pteval;
> > +
> > +	if (vma_is_shmem(vma) && !PageAnon(page) && pte_none(*pte)) {
> > +		entry = make_pte_marker_entry(PTE_MARKER_PAGEOUT);
> > +		pteval = swp_entry_to_pte(entry);
> > +		set_pte_at(vma->vm_mm, address, pte, pteval);
> > +	}
> > +#endif
> > +}
> > +
> > /*
> >  * @arg: enum ttu_flags will be passed to this argument
> >  */
> > @@ -1628,6 +1644,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > 			 */
> > 			dec_mm_counter(mm, mm_counter_file(page));
> > 		}
> > +
> > +		if (flags & TTU_HINT_PAGEOUT)
> > +			pte_marker_install(vma, pvmw.pte, page, address);
> > discard:
> > 		/*
> > 		 * No need to call mmu_notifier_invalidate_range() it has be
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4620df62f0ff..4754af6fa24b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1493,7 +1493,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> > 		 * processes. Try to unmap it here.
> > 		 */
> > 		if (page_mapped(page)) {
> > -			enum ttu_flags flags = TTU_BATCH_FLUSH;
> > +			enum ttu_flags flags = TTU_BATCH_FLUSH | TTU_HINT_PAGEOUT;
> > 			bool was_swapbacked = PageSwapBacked(page);
> > 
> > 			if (unlikely(PageTransHuge(page)))
> > -- 
> > 2.32.0
> > 
> The RFC looks good to me. It makes it much cleaner to verify whether the page
> is in swap or not, and there is great performance benefit compared to my patch.
> 
> Currently, your patch does not have a way of putting the swap offset onto the
> entry when a shared page is swapped, so it does not cover all use cases
> IMO.

Could you remind me why you need the swap offset?  I was trying to understand
your scenario and I hope I summarized it right: what we want is to be able to
MADV_PAGEOUT pages that were swapped out.  Then IIUC the PM_SWAP bit should be
enough for it.  Did I miss something important?

> 
> To counter that, I added my patch on top of yours. By making use of your
> mechanism, we don't need to read the page cache xarray when we pages
> are definitely none. This reduces much unnecessary overhead.
> 
> Main diff in fs/proc/task_mmu.c:pte_to_pagemap_entry():
>         } else if (is_swap_pte(pte)) {
>                 swp_entry_t entry;
>                 ...
>                 entry = pte_to_swp_entry(pte);
> +               if (is_pte_marker_entry(entry)) {
> +                       void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
>                 ...
> 
> For reference: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/.
> 
> After some preliminary performance tests on VMs, I noticed a significant
> performance improvement in my patch, when yours is added. Here is the
> most interesting result:
> 
> Program I tested it on: Same as https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
> 
> Specs:
>     Memory: 32GB memory, 64GB swap
>     Params: 16GB allocated shmem, 50% dirty, 4GB cgroup, no batching
> 
> In short, the pagemap read deals with 
>     * ~4GB of physical memory (present pages), 
>     * ~4GB in swap (swapped pages)
>     * 8GB non-dirty (none pages).
> 
> Results:
> +------------+-------+-------+
> | Peter\Tibi |  Pre  | Post  | (in seconds)
> +------------+-------+-------+
> | Pre        | 11.82 | 38.44 |
> | Post       | 13.28 | 22.34 |
> +------------+-------+-------+
> 
> With your patch, mine yields the results in twice the time, compared to 4x.
> I am aware it is not ideal, but until it is decided whether a whole different
> system is preferred to PTE markers, our solutions together currently deliver
> the best performance for correctness. Thank you for finding this solution.

Thanks for trying that out!  It's great to know these test results.

> 
> I am happy to introduce a configuration variable to enable my pagemap
> patch only if necessary.

Right.  We can use a config for that, but I didn't mention it when replying to
your previous patchset because I thought that, with or without the config, it
should still be better to use pte markers, as they should be more efficient.

However, I think I need to double-check that I didn't miss anything you're
looking for, e.g. I need to understand why the swp offset matters here, as I
asked.  I must confess that cannot be trivially done with pte markers yet;
keeping a swap hint is definitely not the same story as keeping a swp entry.

> Either way, if performance is a must, batching is still the best way to
> access multiple pagemap entries.

I agree, especially since with pmd pgtable locks things can happen
concurrently.

It's just a pity that the major overhead compared to the old way is the page
cache lookup, especially since, as you pointed out, the capability to
distinguish used ptes from empty ptes matters.  That's kind of orthogonal to
batching to me.

Thanks,

-- 
Peter Xu



* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-07  3:25 [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER Peter Xu
                   ` (3 preceding siblings ...)
  2021-08-07  3:25 ` [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages Peter Xu
@ 2021-08-17  9:04 ` David Hildenbrand
  2021-08-17 17:09   ` Peter Xu
  4 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2021-08-17  9:04 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Alistair Popple, Tiberiu Georgescu, ivan.teterevkov,
	Mike Rapoport, Hugh Dickins, Matthew Wilcox, Andrea Arcangeli,
	Kirill A . Shutemov, Andrew Morton, Mike Kravetz

Hi Peter, a couple of comments, sorry for the late reply.

> Summary
> =======
> 
> [Based on v5.14-rc4]
> 
> This patchset enables PM_SWAP of pagemap on shmem.  IOW userspace will be able
> to detect whether a shmem page is swapped out, just like anonymous pages.
> 
> This feature can be enabled with CONFIG_PTE_MARKER_PAGEOUT. When enabled, it
> brings 0.8% overhead on swap-in performance on a shmem page, so I didn't make
> it the default yet.  However IMHO 0.8% is still in an acceptable range that we
> can even make it the default at last.  Comments are welcomed here.

A special config option and added complexity, to handle a corner-case feature
partially more correctly.  Hm.

> 
> There's one previous series that wanted to address the same issue but in
> another way by Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>, here:
> 
> https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
> 
> In that series it's done by looking up page cache for all none ptes.  However I
> raised concern on 4x performance degradation for all shmem pagemap users.

Who cares? I am asking because for me, it's hard to imagine a workload 
that actually cares about a 4x performance degradation when querying the 
pagemap in very special cases, especially if it involves gigantic shmem 
ranges. VM live migration -- sync will be a bit slower? CRIU sync will 
be a bit slower? I mean, actual page dumping is a lot more expensive. 
Really a problem?

I read that CRIU cares about shmem via pagemap [1], at least for anonymous
shared memory; not sure how memfd is treated, I assume in a similar way.  But
I do wonder how it even works reliably, because it relies on present/swapped
out and soft-dirty tracking, which are both essentially broken when swapping
out, AFAICT.  Looks like this really needs a proper fix.

[1] https://criu.org/Shared_memory

> 
> Unlike the other approach, this series has zero overhead on pagemap read
> because the PM_SWAP info is consolidated into the zapped PTEs directly.
> 
> Goals
> =====
> 
> One major goal of this series is to add the PM_SWAP support, the reason is as
> stated by Tiberiu and Ivan in the other patchset:
> 
> https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/
> 
> As a summary: for some reason the userspace needs to scan the pages in the
> background, however that scanning could misguide page reclaim on which page is
> hot and which is cold.  With correct PM_SWAP information, the userspace can
> correct the behavior of page reclaim by firstly fetching that info from
> pagemap, and explicit madvise(MADV_PAGEOUT).  In this case, the pages are for
> the guest, but it can be any shmem page.
> 
> Another major goal of this series is to do a proof-of-concept of the PTE marker
> idea, and that's also the major reason why it's RFC.  So far PTE marker can
> potentially be the solution for below three problems that I'm aware of:
> 
>    (a) PM_SWAP on shmem
> 
>    (b) Userfaultfd-wp on shmem/hugetlbfs
> 
>    (c) PM_SOFT_DIRTY lost for shmem over swapping
> 
> This series tries to resolve problem (a) which should be the simplest, ideally
> it should solve immediate problem for the live migration issue raised by
> Tiberiu and Ivan on proactive paging out unused guest pages.
> 
> Both (a) and (c) will be for performance-wise or statistic-wise.
> 
> Scenario (b) will require pte markers as part of the function to trap writes to
> uffd-wp protected regions when the pages were e.g. swapped out or zapped for
> any reason.
> 
> Currently, uffd-wp shmem work (still during review on the list, latest v5, [1])
> used another solution called "special swap pte".  It works similarly like PTE
> markers as both of the approaches are to persist information into zapped ptes,
> but people showed concern about that idea and it's suggested to use a safer
> (swp-entry level operation, not pte level), and arch-independent approach.
> 
> Hopefully PTE markers satisfy these demands.
> 
> Before I rework the uffd-wp series, I wanted to know whether this approach can
> be accepted upstream.  So besides the swap part, comments on PTE markers will
> be extremely welcomed.

For uffd-wp in its current form, it would certainly be the way to go, I think.
AFAICT, the idea of special swap entries isn't new, it's just limited to
anonymous memory for now, which makes things like fork and new mappings a lot
cheaper.

> 
> > What are PTE Markers?
> > =====================
> 
> PTE markers are defined as some special PTEs that works like a "marker" just
> like in normal life.  Firstly it uses a swap type, IOW it's not a valid/present
> pte, so processor will trigger a page fault when it's accessed.  Meanwhile, the
> format of the PTE is well-defined, so as to contain some information that we
> would like to know before/during the page access happening.
> 
> In this specific case, when the shmem page is paged out, we set a marker
> showing that this page was paged out, then when pagemap is read about this pte,
> we know this is a swapped-out/very-cold page.
> 
> This use case is not an obvious one but the most simplest.  The uffd-wp use
> case is more obvious (wr-protect is per-pte, so we can't save into page cache;
> meanwhile we need that info to persist across zappings e.g. thp split or page
> out of shmem pages).
> 
> So in the future, it can contain more information, e.g., whether this pte is
> wr-protected by userfaultfd; whether this pte was written in this mm context
> for soft-dirtying.  On 64 bit systems, we have a total of 58 bits (swp_offset).
> 
> I'm also curious whether it can be further expanded to other mm areas.  E.g.,
> logically it can work too for non-RAM based memories outside shmem/hugetlbfs,
> e.g. a common file system like ext4 or btrfs?  As long as there will be a need
> to store some per-pte information across zapping of the ptes, then maybe it can
> be considered.

As already expressed, we should try storing as little information in 
page tables as possible if we're dealing with shared memory. The 
features we design around this all seem to over-complicate the actual 
users, over-complicate fork, over-complicate handling on new mappings.

For uffd-wp in its current form, there seems to be no way around it, and 
PTE markers seem to be what you want -- but as I already raised, the 
feature itself on shmem is somewhat suboptimal, just like SOFTDIRTY 
tracking on shmem.

But I guess I'm biased at this point because the main users of these 
features actually want to query/set such properties for all sharers, not 
individual processes; so the opinion of others would be appreciated.

> 
> Known Issues/Concerns
> =====================
> 
> About THP
> ---------
> 
> Currently we don't need to worry about THP because paged out shmem pages will
> be split when shrinking, IOW we only need to consider PTE, and the markers will
> only be applied to a shmem pte not pmd or bigger.
> 
> About PM_SWAP Accuracy
> ----------------------
> 
> > This is not an "accurate" solution to provide PM_SWAP bit.  Two examples:
> 
>    - When process A & B both map shmem page P somewhere, it can happen that only
>      one of these ptes got marked with the pte marker.  Imagine below sequence:
> 
>      0. Process A & B both map shmem page P somewhere
>      1. Process A zap pte of page P for some reason (e.g. thp split)
>      2. System decides to recycle page P
>      3. System replace process B's pte (pointed to P) by PTE marker
>      4. System _didn't_ replace process A's pte because it was none pte, and
>         it'll continue to be none pte
>      5. Only process B's relevant pte has the PTE marker after P swapped out
> 
>    - When fork, we don't copy shmem vma ptes, including the pte markers.  So
>      even if page P was swapped out, only the parent process has the pte marker
>      installed, in child it'll be none pte if fork() happened after pageout.
> 
> Conclusion: just like it used to be, the PM_SWAP is best-effort.  But it should
> work in 99.99% cases and it should already start to solve problems.

At least I don't like these semantics at all.  PM_SWAP becomes a cached value
which might be under-represented and consequently wrong.  Take CRIU as an
example: it has to be correct even if a process remaps a memory region,
fork()s, and unmaps in the parent, as far as I understand, ...

If we really care about performance for users with the old semantics, 
introduce some runtime toggle that enables the new behavior (even for a 
single process?) and consequently is a bit slower in corner cases. But I 
really do wonder if we care at all about the performance degradation in 
corner cases.

If we really care about performance for users with new semantics, then 
let's do it properly and see how we can actually speed it up without 
per-process page table hacks.

Anyhow, just my 2 cents.

-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-17  9:04 ` [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER David Hildenbrand
@ 2021-08-17 17:09   ` Peter Xu
  2021-08-17 18:46     ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-08-17 17:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Alistair Popple, Tiberiu Georgescu,
	ivan.teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz

On Tue, Aug 17, 2021 at 11:04:18AM +0200, David Hildenbrand wrote:
> Hi Peter, a couple of comments, sorry for the late reply.
> 
> > Summary
> > =======
> > 
> > [Based on v5.14-rc4]
> > 
> > This patchset enables PM_SWAP of pagemap on shmem.  IOW userspace will be able
> > to detect whether a shmem page is swapped out, just like anonymous pages.
> > 
> > This feature can be enabled with CONFIG_PTE_MARKER_PAGEOUT. When enabled, it
> > brings 0.8% overhead on swap-in performance on a shmem page, so I didn't make
> > it the default yet.  However IMHO 0.8% is still in an acceptable range that we
> > can even make it the default at last.  Comments are welcomed here.
> 
> Special config option and added complexity for handling a corner case
> feature partially more correct. Hm.
> 
> > 
> > There's one previous series that wanted to address the same issue but in
> > another way by Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>, here:
> > 
> > https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
> > 
> > In that series it's done by looking up page cache for all none ptes.  However I
> > raised concern on 4x performance degradation for all shmem pagemap users.
> 
> Who cares? I am asking because for me, it's hard to imagine a workload that
> actually cares about a 4x performance degradation when querying the pagemap
> in very special cases,

I wouldn't say it's a "special case": it affects any pagemap reading if
there's shmem; even if the reader isn't looking for PM_SWAP, it'll suffer from
trying to look up the page cache when the pte is none.

Yes, it needs a relatively large memory area to show an effect, but IMHO
that's not a good enough reason to make it 4x slower.

> especially if it involves gigantic shmem ranges. VM
> live migration -- sync will be a bit slower? CRIU sync will be a bit slower?
> I mean, actual page dumping is a lot more expensive. Really a problem?
> 
> I read that CRIU cares about shmem via pagemap [1], at least for anonymous
> shared memory; not sure how memfd is treated, I assume in a similar way. But
> I do wonder how it even works reliably, because it relies on present/swapped
> out and sofrtdirty tracking, which are both essentially broken e.g., when
> swapping out AFAIKT. Looks like this really needs a proper fix.
> 
> [1] https://criu.org/Shared_memory

I don't have enough knowledge to judge CRIU here, but I can kind of understand
Tiberiu's problem and hopefully I digested it right.  What I'm targeting with
this series is just to see whether it can fix the problem there, assuming that
is a good summary of how it can also be used elsewhere.

We have the other solution of the page cache lookup, but I think that's slower
than this one, so given the 0.8% overhead I think this one is much better.
The downside is that the overhead is added to the swap-in path, which is not
ideal for fast swap devices, but I don't have much knowledge on that part
either.  To my understanding most swap devices are still slow, like hard
disks, which can take milliseconds in the worst case to read a sector.  I
think the overhead should be mostly unmeasurable and lost in the white noise.

> 
> > 
> > Unlike the other approach, this series has zero overhead on pagemap read
> > because the PM_SWAP info is consolidated into the zapped PTEs directly.
> > 
> > Goals
> > =====
> > 
> > One major goal of this series is to add the PM_SWAP support, the reason is as
> > stated by Tiberiu and Ivan in the other patchset:
> > 
> > https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/
> > 
> > As a summary: for some reason the userspace needs to scan the pages in the
> > background, however that scanning could misguide page reclaim on which page is
> > hot and which is cold.  With correct PM_SWAP information, the userspace can
> > correct the behavior of page reclaim by firstly fetching that info from
> > pagemap, and explicit madvise(MADV_PAGEOUT).  In this case, the pages are for
> > the guest, but it can be any shmem page.
> > 
> > Another major goal of this series is to do a proof-of-concept of the PTE marker
> > idea, and that's also the major reason why it's RFC.  So far PTE marker can
> > potentially be the solution for below three problems that I'm aware of:
> > 
> >    (a) PM_SWAP on shmem
> > 
> >    (b) Userfaultfd-wp on shmem/hugetlbfs
> > 
> >    (c) PM_SOFT_DIRTY lost for shmem over swapping
> > 
> > This series tries to resolve problem (a) which should be the simplest, ideally
> > it should solve immediate problem for the live migration issue raised by
> > Tiberiu and Ivan on proactive paging out unused guest pages.
> > 
> > Both (a) and (c) will be for performance-wise or statistic-wise.
> > 
> > Scenario (b) will require pte markers as part of the function to trap writes to
> > uffd-wp protected regions when the pages were e.g. swapped out or zapped for
> > any reason.
> > 
> > Currently, uffd-wp shmem work (still during review on the list, latest v5, [1])
> > used another solution called "special swap pte".  It works similarly like PTE
> > markers as both of the approaches are to persist information into zapped ptes,
> > but people showed concern about that idea and it's suggested to use a safer
> > (swp-entry level operation, not pte level), and arch-independent approach.
> > 
> > Hopefully PTE markers satisfy these demands.
> > 
> > Before I rework the uffd-wp series, I wanted to know whether this approach can
> > be accepted upstream.  So besides the swap part, comments on PTE markers will
> > be extremely welcomed.
> 
> For uffd-wp in its current form, it would certainly be the way to go I
> think. AFAIKT, the idea of special swap entries isn't new, just that it's
> limited to anonymous memory for now, which makes things like fork and new
> mappings a lot cheaper.

Thanks for reviewing this series separately; yes, I definitely wanted to get
comments on both sides: one on the pte marker idea, the other on whether it's
applicable to this swap+shmem use case.

I don't plan to handle fork or new mappings for the swap hint.

For fork(), as we've discussed, and as also mentioned below in Known Issues,
we don't copy these markers.  So children may lose these hints in their ptes,
but I think that's fine if they're hardly used.

Here I really want to make the pte marker flexible: it can be strict when
necessary (it will be 100% strict with uffd-wp), but it can also be a hint,
just like the soft-dirty, idle, and accessed bits we have on present ptes.
I wanted the swap bit to be of that kind, so we add zero overhead to fork()
and still solve problems.

Same thing for "newly mapped shmem".  Do we have a use case for that?  If it's
a hint bit, can we ignore it?

> 
> > 
> > What are PTE Markers?
> > =====================
> > 
> > PTE markers are defined as some special PTEs that works like a "marker" just
> > like in normal life.  Firstly it uses a swap type, IOW it's not a valid/present
> > pte, so processor will trigger a page fault when it's accessed.  Meanwhile, the
> > format of the PTE is well-defined, so as to contain some information that we
> > would like to know before/during the page access happening.
> > 
> > In this specific case, when the shmem page is paged out, we set a marker
> > showing that this page was paged out, then when pagemap is read about this pte,
> > we know this is a swapped-out/very-cold page.
> > 
> > This use case is not an obvious one, but it is the simplest.  The uffd-wp use
> > case is more obvious (wr-protect is per-pte, so we can't save it into the page
> > cache; meanwhile we need that info to persist across zappings, e.g. thp split
> > or page-out of shmem pages).
> > 
> > So in the future, it can contain more information, e.g., whether this pte is
> > wr-protected by userfaultfd; whether this pte was written in this mm context
> > for soft-dirtying.  On 64 bit systems, we have a total of 58 bits (swp_offset).
> > 
> > I'm also curious whether it can be further expanded to other mm areas.  E.g.,
> > logically it could also work for non-RAM-based memory outside shmem/hugetlbfs,
> > e.g. a common file system like ext4 or btrfs.  As long as there is a need to
> > store some per-pte information across zapping of the ptes, then maybe it can
> > be considered.
> 
> As already expressed, we should try storing as little information in page
> tables as possible if we're dealing with shared memory. The features we
> design around this all seem to over-complicate the actual users,
> over-complicate fork, over-complicate handling on new mappings.

I'll skip the last two "over-complicated" items, because as we've discussed I
don't think we need to take care of them for now.  We can revisit when they
become some kind of requirement.

To me, having PM_SWAP 99% right on shmem is still progress compared to
missing it completely, even if it's not 100% right.  It's used for performance
reasons on PAGEOUT and for doing finer-grained memory control from userspace;
it's not a strict requirement.
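
To make the intended flow concrete, a rough sketch of the userspace loop
(page_is_swapped() being the hypothetical pagemap helper sketched earlier;
MADV_PAGEOUT exists since Linux 5.4):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: for pages the pagemap reported as swapped (and which the
 * background scan therefore faulted back in), tell reclaim they are
 * still cold, so the scan doesn't misguide page reclaim. */
static void scan_and_repageout(char *base, size_t npages)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t i;

	for (i = 0; i < npages; i++) {
		char *page = base + i * psz;

		if (page_is_swapped(page) != 1)
			continue;
		/* ... inspect the page here; this faults it in ... */
		madvise(page, psz, MADV_PAGEOUT);
	}
}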

So I still cannot quite follow why storing information in the pte is so bad for
file-backed memory, though I can see you really don't like it.  Could you share
a detailed example?

> 
> For uffd-wp in its current form, there seems to be no way around it, and PTE
> markers seem to be what you want -- but as I already raised, the feature
> itself on shmem is somewhat suboptimal, just like SOFTDIRTY tracking on
> shmem.

Emm, I don't know whether we should call it sub-optimal.  To me it's still a
clean and self-contained way to do things.  I don't know how to do that from
the page cache layer, but I bet issues will come up if we go that way too; then
we may figure out "oh, this way is 90% beautiful, but that way is 85%".  To me
it's sometimes not easy to figure out the ideal solution for all the problems.

> 
> But I guess I'm biased at this point because the main users of these
> features actually want to query/set such properties for all sharers, not
> individual processes; so the opinion of others would be appreciated.
> 
> > 
> > Known Issues/Concerns
> > =====================
> > 
> > About THP
> > ---------
> > 
> > Currently we don't need to worry about THP because paged out shmem pages will
> > be split when shrinking, IOW we only need to consider PTE, and the markers will
> > only be applied to a shmem pte not pmd or bigger.
> > 
> > About PM_SWAP Accuracy
> > ----------------------
> > 
> > This is not an "accurate" solution to provide the PM_SWAP bit.  Two examples:
> > 
> >    - When processes A & B both map shmem page P somewhere, it can happen that
> >      only one of these ptes gets marked with the pte marker.  Imagine the
> >      sequence below:
> > 
> >      0. Processes A & B both map shmem page P somewhere
> >      1. Process A zaps the pte of page P for some reason (e.g. thp split)
> >      2. System decides to recycle page P
> >      3. System replaces process B's pte (pointing to P) with a PTE marker
> >      4. System _didn't_ replace process A's pte because it was a none pte,
> >         and it'll continue to be a none pte
> >      5. Only process B's relevant pte has the PTE marker after P is swapped out
> > 
> >    - At fork(), we don't copy shmem vma ptes, including the pte markers.  So
> >      even if page P was swapped out, only the parent process has the pte marker
> >      installed; in the child it'll be a none pte if fork() happened after pageout.
> > 
> > Conclusion: just like it used to be, PM_SWAP is best-effort.  But it should
> > work in 99.99% of cases and it should already start to solve problems.
> 
> At least I don't like these semantics at all. PM_SWAP is a cached value
> which might be under-represented and consequently wrong.

Please have a look at the current pagemap implementation in
pte_to_pagemap_entry().  It hasn't been accurate from day one, imho.  E.g.,
when a page is being migrated from numa node 1 to node 2, we'll mark it PM_SWAP
even though that's not really the case.  We can make it more accurate, but I
think it's fine, because it's a hint.

> Take CRIU as an example, it has to be correct even if a process would remap a
> memory region, fork() and unmap in the parent as far as I understand, ...

Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
bit needs to be accurate.  Maybe you mean the dirty bit?

> 
> If we really care about performance for users with the old semantics,
> introduce some runtime toggle that enables the new behavior (even for a
> single process?) and consequently is a bit slower in corner cases. But I
> really do wonder if we care at all about the performance degradation in
> corner cases.

Yes, that's okay.  If we want to go with Tiberiu's approach, we can use an
option to avoid regressing users.  However I still prefer the current approach,
and I even want to make it the default option if it can prove itself with some
more real swap workloads somewhere (maybe the kernel test bot has some?  I
haven't checked yet).

> 
> If we really care about performance for users with new semantics, then let's
> do it properly and see how we can actually speed it up without per-process
> page table hacks.
> 
> Anyhow, just my 2 cents.

Thanks for the discussion!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-17 17:09   ` Peter Xu
@ 2021-08-17 18:46     ` David Hildenbrand
  2021-08-17 20:24       ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2021-08-17 18:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Alistair Popple, Tiberiu Georgescu,
	ivan.teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz

>>
>> For uffd-wp in its current form, it would certainly be the way to go I
>> think. AFAICT, the idea of special swap entries isn't new, just that it's
>> limited to anonymous memory for now, which makes things like fork and new
>> mappings a lot cheaper.
> 
> Thanks for reviewing this series separately; yes I definitely wanted to get
> comments on both sides: one on the pte marker idea, the other on whether it's
> applicable to this swap+shmem use case.

Exactly.

> 
> Here I really want to make the pte marker flexible - it can be strict when
> necessary (it will be 100% strict with uffd-wp), but it can also be a hint,
> just like the soft-dirty, idle, and accessed bits on present ptes.  I wanted
> the swap bit to be that kind of hint, so we add zero overhead to fork() and
> still solve problems.
> 
> The same goes for "newly mapped shmem".  Do we have a use case for that?  If
> it's a hint bit, can we ignore it?

I am really not a fan of taking an already broken feature (broken
presence information for shmem in pagemap) and, instead of fixing it
properly, making it less broken while crossing fingers that nobody will
notice the remaining (undocumented) cases.

[...]

>> As already expressed, we should try storing as little information in page
>> tables as possible if we're dealing with shared memory. The features we
>> design around this all seem to over-complicate the actual users,
>> over-complicate fork, over-complicate handling on new mappings.
> 
> I'll skip the last two "over-complicated" items, because as we've discussed I
> don't think we need to take care of them so far.  We can revisit when they
> become some kind of requirement.
> 
> To me, having PM_SWAP 99% right on shmem is still progress compared to
> missing it completely, even if it's not 100% right.  It's used for performance
> reasons on PAGEOUT and for doing finer-grained memory control from userspace;
> it's not a strict requirement.
> 
> So I still cannot quite follow why storing information in the pte is so bad for
> file-backed memory, though I can see you really don't like it.  Could you share
> a detailed example?

I am not a fan of your approach of "hints", because while it might work
for some use cases, it might not work for others (see below for CRIU); I
would rather avoid such "inconsistent caches" where they're not really
required. But again, this is just my opinion and there might be plenty
of other people who disagree.

Storing hints in page tables isn't actually that bad, because we can 
drop hints whenever we like (well, there are side effects, and once we 
drop hints too often people might complain again) -- for example, when 
reclaiming "empty" but actually "filled with hints" page tables. When we 
rely on consistent values, fork() and mmap() are a problem. Well, and 
page tables will have to stick around. At least for uffd-wp, mmap() is 
not an issue, and we don't expect too many uffd-wp users such that page 
table consumption would matter -- my guess.

So I repeat, for uffd-wp in its current form, it sounds like the right 
thing to do.

>> But I guess I'm biased at this point because the main users of these
>> features actually want to query/set such properties for all sharers, not
>> individual processes; so the opinion of others would be appreciated.
>>
>>>
>>> Known Issues/Concerns
>>> =====================
>>>
>>> About THP
>>> ---------
>>>
>>> Currently we don't need to worry about THP because paged out shmem pages will
>>> be split when shrinking, IOW we only need to consider PTE, and the markers will
>>> only be applied to a shmem pte not pmd or bigger.
>>>
>>> About PM_SWAP Accuracy
>>> ----------------------
>>>
>>> This is not an "accurate" solution to provide the PM_SWAP bit.  Two examples:
>>>
>>>     - When processes A & B both map shmem page P somewhere, it can happen that
>>>       only one of these ptes gets marked with the pte marker.  Imagine the
>>>       sequence below:
>>>
>>>       0. Processes A & B both map shmem page P somewhere
>>>       1. Process A zaps the pte of page P for some reason (e.g. thp split)
>>>       2. System decides to recycle page P
>>>       3. System replaces process B's pte (pointing to P) with a PTE marker
>>>       4. System _didn't_ replace process A's pte because it was a none pte,
>>>          and it'll continue to be a none pte
>>>       5. Only process B's relevant pte has the PTE marker after P is swapped out
>>>
>>>     - At fork(), we don't copy shmem vma ptes, including the pte markers.  So
>>>       even if page P was swapped out, only the parent process has the pte marker
>>>       installed; in the child it'll be a none pte if fork() happened after pageout.
>>>
>>> Conclusion: just like it used to be, PM_SWAP is best-effort.  But it should
>>> work in 99.99% of cases and it should already start to solve problems.
>>
>> At least I don't like these semantics at all. PM_SWAP is a cached value
>> which might be under-represented and consequently wrong.
> 
> Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
> accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
> make it more accurate, but I think it's fine, because it's a hint.

That inconsistency doesn't really matter, as you can determine that
something is present and worth dumping if it's either swapped or
present. As long as it's one of the two and not simply nothing.

I will shamelessly reference 
tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that 
checks exactly for that (the test case uses only private anonymous memory).
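
(For readers without the tree at hand, that check boils down to roughly the
following - a sketch assuming the documented pagemap bit layout, not the
selftest code verbatim:)

#include <stdbool.h>
#include <stdint.h>

#define PM_SWAP		(1ULL << 62)	/* page is swapped out */
#define PM_PRESENT	(1ULL << 63)	/* page is resident in RAM */

/* Rough equivalent of the selftest's pagemap_is_populated(): a page is
 * "populated" if it holds data somewhere, i.e. it is either resident
 * or swapped out. */
static bool pagemap_entry_is_populated(uint64_t entry)
{
	return entry & (PM_SWAP | PM_PRESENT);
}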

> 
>> Take CRIU as an example, it has to be correct even if a process would remap a
>> memory region, fork() and unmap in the parent as far as I understand, ...
> 
> Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
> bit needs to be accurate.  Maybe you mean the dirty bit?

https://criu.org/Shared_memory

"Dumping present pages"

"... CRIU does not dump all of the data. Instead, it determines which 
pages contain it, and only dumps those pages. This is done similarly to 
how regular memory dumping and restoring works, i.e. by looking for 
PRESENT or SWAPPED bits in owners' pagemap entries."

-> Neither PRESENT nor SWAPPED results in memory not getting dumped, 
which makes perfect sense.

1) Process A sets up shared memory and writes data to it.
2) System swaps out memory, hints are setup.
3) Process A forks Process B, hints are not copied.
4) Process A unmaps shared memory, hints are dropped.
5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in 
pagemap.
6) Process B uses memory in shared memory region. Pages were not migrated.

Just one example; feel free to correct me.


There is a notion of the mincore() system call:

"There is one particular feature of shared memory dumps worth 
mentioning. Sometimes, a shared memory page can exist in the kernel, but 
it is not mapped to any process. CRIU detects such pages by calling 
mincore() on the shmem segment, which reports back the page in-memory 
status. The mincore bitmap is when ANDed with the per-process ones. "

Not sure if they actually mean ORed, because otherwise they'd be losing 
pages that have been swapped out. "mincore() returns a vector that 
indicates whether pages of the calling process's virtual memory are 
resident in core (RAM)"

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-17 18:46     ` David Hildenbrand
@ 2021-08-17 20:24       ` Peter Xu
  2021-08-18  8:24         ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-08-17 20:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Alistair Popple, Tiberiu Georgescu,
	ivan.teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz

On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote:
> > Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
> > accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
> > node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
> > make it more accurate, but I think it's fine, because it's a hint.
> 
> That inconsistency doesn't really matter as you can determine if something
> is present and worth dumping if it's either swapped or present. As long as
> it's one of the two but not simply nothing.
> 
> I will shamelessly reference
> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that
> checks exactly for that (the test case uses only private anonymous memory).

Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on
PM_SWAP for that once they go beyond anonymous private memory - when shmem is
swapped out the pte can be none, so the test case can fail even when it
shouldn't, imho.

The mincore() syscall seems to be ideally the thing you'd want to make it
accurate, but again it's not a problem for current anonymous private memory.

> 
> > 
> > > Take CRIU as an example, it has to be correct even if a process would remap a
> > > memory region, fork() and unmap in the parent as far as I understand, ...
> > 
> > Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
> > bit needs to be accurate.  Maybe you mean the dirty bit?
> 
> https://criu.org/Shared_memory
> 
> "Dumping present pages"
> 
> "... CRIU does not dump all of the data. Instead, it determines which pages
> contain it, and only dumps those pages. This is done similarly to how
> regular memory dumping and restoring works, i.e. by looking for PRESENT or
> SWAPPED bits in owners' pagemap entries."
> 
> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which
> makes perfect sense.
> 
> 1) Process A sets up shared memory and writes data to it.
> 2) System swaps out memory, hints are setup.
> 3) Process A forks Process B, hints are not copied.
> 4) Process A unmaps shared memory, hints are dropped.
> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
> pagemap.
> 6) Process B uses memory in shared memory region. Pages were not migrated.
> 
> Just one example; feel free to correct me.

I think pte markers won't crash criu; what will happen is that it'll see more
ptes that used to be none become pte markers.  This reminded me that maybe I
should teach at least the mincore() syscall, and all non_swap_entry() callers,
to be aware of the pte marker.

> 
> 
> There is a notion of the mincore() system call:
> 
> "There is one particular feature of shared memory dumps worth mentioning.
> Sometimes, a shared memory page can exist in the kernel, but it is not
> mapped to any process. CRIU detects such pages by calling mincore() on the
> shmem segment, which reports back the page in-memory status. The mincore
> bitmap is then ANDed with the per-process ones. "
> 
> Not sure if they actually mean ORed, because otherwise they'd be losing
> pages that have been swapped out. "mincore() returns a vector that indicates
> whether pages of the calling process's virtual memory are resident in core
> (RAM)"

I am wildly guessing they ORed the two just because PM_SWAP is not working
properly for shmem, so the OR happens only for shmem.  Criu may not rely only
on mincore() because they also want the dirty bits.

Btw, I noticed in 2016 criu switched from mincore() to lseek():

https://github.com/checkpoint-restore/criu/commit/1821acedd04b602b37b587eac5a481094b6274ae

Criu should want to know "whether this page has valid data", not "whether this
page has been swapped out", so lseek() seems to be more suitable - something I
wasn't aware of before.

I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
should just still be a bit slow because it'll look up the cache too, but it
should work similarly to the original proposal.
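
For illustration only, a sketch of such a mincore()-based residency check
(residency_vector() is a hypothetical helper; note the swap-cache caveat that
comes up later in this thread):

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: ask the kernel which pages of a mapping are resident.
 * mincore() sets bit 0 of a vector byte when the corresponding page is
 * in the page cache (or swap cache); fully swapped-out pages report 0.
 * Returns a malloc()ed vector the caller must free, or NULL on error. */
static unsigned char *residency_vector(void *addr, size_t len)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t npages = (len + psz - 1) / psz;
	unsigned char *vec = malloc(npages);

	if (!vec)
		return NULL;
	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return NULL;
	}
	return vec;
}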

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-17 20:24       ` Peter Xu
@ 2021-08-18  8:24         ` David Hildenbrand
  2021-08-18 17:52           ` Tiberiu Georgescu
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2021-08-18  8:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Alistair Popple, Tiberiu Georgescu,
	ivan.teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz

On 17.08.21 22:24, Peter Xu wrote:
> On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote:
>>> Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
>>> accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
>>> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
>>> make it more accurate, but I think it's fine, because it's a hint.
>>
>> That inconsistency doesn't really matter as you can determine if something
>> is present and worth dumping if it's either swapped or present. As long as
>> it's one of the two but not simply nothing.
>>
>> I will shamelessly reference
>> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that
>> checks exactly for that (the test case uses only private anonymous memory).
> 
> Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on
> PM_SWAP for that when it goes beyond anonymous private memories - when shmem
> swapped out the pte can be none, then the test case can fail even if it
> shouldn't, imho.

Exactly, because the pagemap is pretty much completely broken for shmem.

> 
> The mincore() syscall seems to be ideally the thing you may want to make it
> accurate, but again it's not a problem for current anonymous private memories.

I haven't checked the details, but I believe the mincore() syscall won't 
report swapped out pages. At least according to its documentation:

"mincore()  returns a vector that indicates whether pages of the calling 
process's virtual memory are resident in core (RAM), and so will not 
cause a disk access  (page  fault)  if  referenced."

(to protect it from swapping and relying on mincore() we would have to 
mlock that memory; we'd want MCL_ONFAULT to be able to test 
MADV_POPULATE_READ|WRITE; or we'd just want to rely on lseek)

> 
>>
>>>
>>>> Take CRIU as an example, it has to be correct even if a process would remap a
>>>> memory region, fork() and unmap in the parent as far as I understand, ...
>>>
>>> Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
>>> bit needs to be accurate.  Maybe you mean the dirty bit?
>>
>> https://criu.org/Shared_memory
>>
>> "Dumping present pages"
>>
>> "... CRIU does not dump all of the data. Instead, it determines which pages
>> contain it, and only dumps those pages. This is done similarly to how
>> regular memory dumping and restoring works, i.e. by looking for PRESENT or
>> SWAPPED bits in owners' pagemap entries."
>>
>> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which
>> makes perfect sense.
>>
>> 1) Process A sets up shared memory and writes data to it.
>> 2) System swaps out memory, hints are setup.
>> 3) Process A forks Process B, hints are not copied.
>> 4) Process A unmaps shared memory, hints are dropped.
>> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
>> pagemap.
>> 6) Process B uses memory in shared memory region. Pages were not migrated.
>>
>> Just one example; feel free to correct me.
> 
> I think pte marker won't crash criu, what will happen is that it'll see more
> ptes that used to be none that become the pte markers.  This reminded me that
> maybe I should teach up mincore() syscall to also be aware of the pte marker at
> least, and all non_swap_entry() callers.
> 

I haven't checked what mincore() is doing, but from what I understand 
when reading the CRIU doc and the mincore() doc, it does the right thing 
without requiring any fiddling with pte marker hints. I assume you 
merely have a performance improvement in mind.

>>
>>
>> There is a notion of the mincore() system call:
>>
>> "There is one particular feature of shared memory dumps worth mentioning.
>> Sometimes, a shared memory page can exist in the kernel, but it is not
>> mapped to any process. CRIU detects such pages by calling mincore() on the
>> shmem segment, which reports back the page in-memory status. The mincore
>> bitmap is then ANDed with the per-process ones. "
>>
>> Not sure if they actually mean ORed, because otherwise they'd be losing
>> pages that have been swapped out. "mincore() returns a vector that indicates
>> whether pages of the calling process's virtual memory are resident in core
>> (RAM)"
> 
> I am wildly guessing they ORed the two just because PM_SWAP is not working
> properly for shmem, so the OR happens only for shmem.  Criu may not only rely
> on mincore() because they also want the dirty bits.
> 
> Btw, I noticed in 2016 criu switched from mincore() to lseek():
> 
> https://github.com/checkpoint-restore/criu/commit/1821acedd04b602b37b587eac5a481094b6274ae

Interesting. That's certainly what we want when it comes to skipping 
holes in files. (before reading that, I wasn't even aware that mincore() 
existed)

> 
> Criu should want to know "whether this page has valid data" not "whether this
> page has swapped out", so lseek() seems to be more suitable, which I'm not
> aware of before.

Again, just as you, I learned a lot :)

> 
> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
> should just still be a bit slow because it'll look up the cache too, but it
> should work similarly to the original proposal.
> 

Very right, maybe we can just avoid tampering with pagemap on shmem 
completely (which sounds like an excellent idea to me) and document it 
as "On shared memory, we will never indicate SWAPPED if the pages have 
been swapped out. Further, PRESENT might be under-indicated: if a shared 
page is currently not mapped into the page table of a process.". I saw 
there was a related, proposed doc update, maybe we can finetune that.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-18  8:24         ` David Hildenbrand
@ 2021-08-18 17:52           ` Tiberiu Georgescu
  2021-08-18 18:13             ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Tiberiu Georgescu @ 2021-08-18 17:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, linux-kernel, linux-mm, Alistair Popple,
	Ivan Teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, Carl Waldspurger [C],
	Florian Schmidt, Jonathan Davies


> On 18 Aug 2021, at 09:24, David Hildenbrand <david@redhat.com> wrote:
> 
> On 17.08.21 22:24, Peter Xu wrote:
>> On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote:
>>>> Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
>>>> accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
>>>> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
>>>> make it more accurate, but I think it's fine, because it's a hint.
>>> 
>>> That inconsistency doesn't really matter as you can determine if something
>>> is present and worth dumping if it's either swapped or present. As long as
>>> it's one of the two but not simply nothing.
>>> 
>>> I will shamelessly reference
>>> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that
>>> checks exactly for that (the test case uses only private anonymous memory).
>> Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on
>> PM_SWAP for that when it goes beyond anonymous private memories - when shmem
>> swapped out the pte can be none, then the test case can fail even if it
>> shouldn't, imho.
> 
> Exactly, because the pagemap is pretty much completely broken for shmem.
> 
>> The mincore() syscall seems to be ideally the thing you may want to make it
>> accurate, but again it's not a problem for current anonymous private memories.
> 
> I haven't checked the details, but I believe the mincore() syscall won't report swapped out pages. At least according to its documentation:
> 
> "mincore()  returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access  (page  fault)  if  referenced."
> 
> (to protect it from swapping and relying on mincore() we would have to mlock that memory; we'd want MCL_ONFAULT to be able to test MADV_POPULATE_READ|WRITE; or we'd just want to rely on lseek)

After some digging and testing, I found out that the docs for mincore() are a little outdated, and that
the RFC has an unexpected side effect on the syscall.

The output vector is supposed to set the flag to 1 if the page it indicates is present in either the
page cache or the swap cache. I would like to highlight that if a page got swapped out and reached
swap (i.e. the page content is no longer stored in the swap cache), the flag gets set to 0.

So indeed, mincore does not report swapped out pages, but AFAIK it does report pages which are
still in the swap cache.

There was an attempt to rework mincore altogether and make it retrieve mappings instead [1], but
it was quickly reverted [2] because the removed functionality was necessary for some long-running
systems.

With Peter's RFC, it now looks like mincore()'s flags are set to 1 for any shared page that has been
dirtied, whether it is still present, in the swap cache, or has arrived in swap. AFAIK, only none pages
have their flags set to zero. For private pages, mincore still seems to behave normally.

[1] https://github.com/torvalds/linux/commit/574823bfab82d9d8fa47f422778043fbb4b4f50e
[2] https://github.com/torvalds/linux/commit/30bac164aca750892b93eef350439a0562a68647

> 
>>> 
>>>> 
>>>>> Take CRIU as an example, it has to be correct even if a process would remap a
>>>>> memory region, fork() and unmap in the parent as far as I understand, ...
>>>> 
>>>> Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
>>>> bit needs to be accurate.  Maybe you mean the dirty bit?
>>> 
>>> https://criu.org/Shared_memory
>>>
>>> "Dumping present pages"
>>> 
>>> "... CRIU does not dump all of the data. Instead, it determines which pages
>>> contain it, and only dumps those pages. This is done similarly to how
>>> regular memory dumping and restoring works, i.e. by looking for PRESENT or
>>> SWAPPED bits in owners' pagemap entries."
>>> 
>>> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which
>>> makes perfect sense.
>>> 
>>> 1) Process A sets up shared memory and writes data to it.
>>> 2) System swaps out memory, hints are setup.
>>> 3) Process A forks Process B, hints are not copied.
>>> 4) Process A unmaps shared memory, hints are dropped.
>>> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
>>> pagemap.
>>> 6) Process B uses memory in shared memory region. Pages were not migrated.
>>> 
>>> Just one example; feel free to correct me.
>> I think pte marker won't crash criu, what will happen is that it'll see more
>> ptes that used to be none that become the pte markers.  This reminded me that
>> maybe I should teach up mincore() syscall to also be aware of the pte marker at
>> least, and all non_swap_entry() callers.

I think in mincore_pte_range, the call to non_swap_entry(entry) could return true, setting the flag on
the vector to 1 prematurely. Please read my comment above.
> 
> I haven't checked what mincore() is doing, but from what I understand when reading the CRIU doc and the mincore() doc, it does the right thing without requiring any fiddling with pte marker hints. I assume you merely have a performance improvement in mind.
> 
>>> 
>>> 
>>> There is a notion of the mincore() system call:
>>> 
>>> "There is one particular feature of shared memory dumps worth mentioning.
>>> Sometimes, a shared memory page can exist in the kernel, but it is not
>>> mapped to any process. CRIU detects such pages by calling mincore() on the
>>> shmem segment, which reports back the page in-memory status. The mincore
>>> bitmap is then ANDed with the per-process ones. "
>>> 
>>> Not sure if they actually mean ORed, because otherwise they'd be losing
>>> pages that have been swapped out. "mincore() returns a vector that indicates
>>> whether pages of the calling process's virtual memory are resident in core
>>> (RAM)"
>> I am wildly guessing they ORed the two just because PM_SWAP is not working
>> properly for shmem, so the OR happens only for shmem.  Criu may not only rely
>> on mincore() because they also want the dirty bits.
>> Btw, I noticed in 2016 criu switched from mincore() to lseek():
>> https://github.com/checkpoint-restore/criu/commit/1821acedd04b602b37b587eac5a481094b6274ae
> 
> Interesting. That's certainly what we want when it comes to skipping holes in files. (before reading that, I wasn't even aware that mincore() existed)
> 
>> Criu should want to know "whether this page has valid data" not "whether this
>> page has swapped out", so lseek() seems to be more suitable, which I'm not
>> aware of before.
> 
> Again, just as you, I learned a lot :)

Same :)
> 
>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>> should just still be a bit slow because it'll look up the cache too, but it
>> should work similarly to the original proposal.

I am afraid that the information returned by mincore is a little too vague to be of much help,
compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.
> 
> Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that.
> 
We could take into consideration an alternative approach to retrieving the shared page info in user
space, like storing it in sysfs instead of per process. However, just leaving the pagemap functionality
incomplete, and not providing an alternative to retrieve the missing information, does not seem right.
Updating the docs with a "can't do" should only be temporary, until an alternative or a fix exists.

Also, I think you are talking about my own doc update patch[3]. If not, please share a link with your
next reply.

[3] https://marc.info/?m=162878395426774
> 
> -- 
> Thanks,
> 
> David / dhildenb
--
Kind regards,

Tibi Georgescu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages
  2021-08-13 16:01     ` Peter Xu
@ 2021-08-18 18:02       ` Tiberiu Georgescu
  0 siblings, 0 replies; 21+ messages in thread
From: Tiberiu Georgescu @ 2021-08-18 18:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Alistair Popple, Ivan Teterevkov,
	Mike Rapoport, Hugh Dickins, Matthew Wilcox, Andrea Arcangeli,
	David Hildenbrand, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, Carl Waldspurger [C],
	Jonathan Davies, Florian Schmidt


> On 13 Aug 2021, at 17:01, Peter Xu <peterx@redhat.com> wrote:
> 
> On Fri, Aug 13, 2021 at 03:18:22PM +0000, Tiberiu Georgescu wrote:
>> Hello Peter,
> 
> Hi, Tiberiu,
> 
>> 
>> Sorry for commenting so late.
> 
> No worry on that.  I appreciate a lot your follow up on this problem.
> 
>> 
>>> On 7 Aug 2021, at 04:25, Peter Xu <peterx@redhat.com> wrote:
>>> 
>>> When shmem pages are swapped out, instead of clearing the pte entry, we leave a
>>> marker pte showing that this page is swapped out as a hint for pagemap.  A new
>>> TTU flag is introduced to identify this case.
>>> 
>>> This can be useful for detecting swapped out cold shmem pages.  Then after some
>>> memory background scanning work (which will fault in the shmem page and
>>> confusing page reclaim), we can do MADV_PAGEOUT explicitly on this page to swap
>>> it out again as we know it was cold.
>>> 
>>> For pagemap, we don't need to explicitly set PM_SWAP bit, because by nature
>>> SWP_PTE_MARKER ptes are already counted as PM_SWAP due to its format as swap.
>>> 
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> fs/proc/task_mmu.c   |  1 +
>>> include/linux/rmap.h |  1 +
>>> mm/rmap.c            | 19 +++++++++++++++++++
>>> mm/vmscan.c          |  2 +-
>>> 4 files changed, 22 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index eb97468dfe4c..21b8594abc1d 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -1384,6 +1384,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>> 		if (pm->show_pfn)
>>> 			frame = swp_type(entry) |
>>> 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>>> +		/* NOTE: this covers PTE_MARKER_PAGEOUT too */
>>> 		flags |= PM_SWAP;
>>> 		if (is_pfn_swap_entry(entry))
>>> 			page = pfn_swap_entry_to_page(entry);
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index c976cc6de257..318a0e95c7fb 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -95,6 +95,7 @@ enum ttu_flags {
>>> 					 * do a final flush if necessary */
>>> 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>>> 					 * caller holds it */
>>> +	TTU_HINT_PAGEOUT	= 0x100, /* Hint for pageout operation */
>>> };
>>> 
>>> #ifdef CONFIG_MMU
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index b9eb5c12f3fe..24a70b36b6da 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1384,6 +1384,22 @@ void page_remove_rmap(struct page *page, bool compound)
>>> 	unlock_page_memcg(page);
>>> }
>>> 
>>> +static inline void
>>> +pte_marker_install(struct vm_area_struct *vma, pte_t *pte,
>>> +		   struct page *page, unsigned long address)
>>> +{
>>> +#ifdef CONFIG_PTE_MARKER_PAGEOUT
>>> +	swp_entry_t entry;
>>> +	pte_t pteval;
>>> +
>>> +	if (vma_is_shmem(vma) && !PageAnon(page) && pte_none(*pte)) {
>>> +		entry = make_pte_marker_entry(PTE_MARKER_PAGEOUT);
>>> +		pteval = swp_entry_to_pte(entry);
>>> +		set_pte_at(vma->vm_mm, address, pte, pteval);
>>> +	}
>>> +#endif
>>> +}
>>> +
>>> /*
>>> * @arg: enum ttu_flags will be passed to this argument
>>> */
>>> @@ -1628,6 +1644,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> 			 */
>>> 			dec_mm_counter(mm, mm_counter_file(page));
>>> 		}
>>> +
>>> +		if (flags & TTU_HINT_PAGEOUT)
>>> +			pte_marker_install(vma, pvmw.pte, page, address);
>>> discard:
>>> 		/*
>>> 		 * No need to call mmu_notifier_invalidate_range() it has be
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 4620df62f0ff..4754af6fa24b 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1493,7 +1493,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>>> 		 * processes. Try to unmap it here.
>>> 		 */
>>> 		if (page_mapped(page)) {
>>> -			enum ttu_flags flags = TTU_BATCH_FLUSH;
>>> +			enum ttu_flags flags = TTU_BATCH_FLUSH | TTU_HINT_PAGEOUT;
>>> 			bool was_swapbacked = PageSwapBacked(page);
>>> 
>>> 			if (unlikely(PageTransHuge(page)))
>>> -- 
>>> 2.32.0
>>> 
>> The RFC looks good to me. It makes it much cleaner to verify whether the page
>> is in swap or not, and there is a great performance benefit compared to my patch.
>> 
>> Currently, your patch does not have a way of putting the swap offset onto the
>> entry when a shared page is swapped, so it does not cover all use cases
>> IMO.
> 
Could you remind me why you need the swap offset?  I was trying to
understand your scenario and I hope I summarized it right: what we want to do
is to be able to MADV_PAGEOUT pages that were swapped out.  Then IIUC the
PM_SWAP bit should be enough for it.  Did I miss something important?

That's right, PM_SWAP is enough for that use case. However, there are
more optimisations we are considering for QEMU live migration that require
access to the swap offset as well. For one of those, we need a mechanism in
place to make a page's swap offset accessible outside kernel space. This is
where our patches would come in.
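
For reference, when pagemap does carry a swap entry (and the reader has
CAP_SYS_ADMIN, since the frame field is zeroed otherwise), decoding it is just
a couple of shifts per the documented bit layout; a sketch:

#include <stdint.h>
#include <stdio.h>

/* Sketch: decode the swap location from a 64-bit pagemap entry of a
 * swapped page.  Bits 0-4: swap type, bits 5-54: swap offset, bit 62:
 * PM_SWAP, per Documentation/admin-guide/mm/pagemap.rst. */
static void print_swap_location(uint64_t entry)
{
	if (!(entry & (1ULL << 62)))	/* not PM_SWAP */
		return;
	printf("swap type %llu, offset %llu\n",
	       (unsigned long long)(entry & 0x1f),
	       (unsigned long long)((entry >> 5) & ((1ULL << 50) - 1)));
}
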
> 
>> 
>> To counter that, I added my patch on top of yours. By making use of your
>> mechanism, we don't need to read the page cache xarray when the pages
>> are definitely none. This removes a lot of unnecessary overhead.
>> 
>> Main diff in fs/proc/task_mmu.c:pte_to_pagemap_entry():
>>        } else if (is_swap_pte(pte)) {
>>                swp_entry_t entry;
>>                ...
>>                entry = pte_to_swp_entry(pte);
>> +               if (is_pte_marker_entry(entry)) {
>> +                       void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
>>                ...
>> 
>> For reference: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/.
>> 
>> After some preliminary performance tests on VMs, I noticed a significant
>> performance improvement in my patch, when yours is added. Here is the
>> most interesting result:
>> 
>> Program I tested it on: Same as https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
>> 
>> Specs:
>>    Memory: 32GB memory, 64GB swap
>>    Params: 16GB allocated shmem, 50% dirty, 4GB cgroup, no batching
>> 
>> In short, the pagemap read deals with 
>>    * ~4GB of physical memory (present pages), 
>>    * ~4GB in swap (swapped pages)
>>    * 8GB non-dirty (none pages).
>> 
>> Results:
>> +------------+-------+-------+
>> | Peter\Tibi |  Pre  | Post  | (in seconds)
>> +------------+-------+-------+
>> | Pre        | 11.82 | 38.44 |
>> | Post       | 13.28 | 22.34 |
>> +------------+-------+-------+
>> 
>> With your patch, mine yields the results in twice the baseline time, instead of 4x.
>> I am aware it is not ideal, but until it is decided whether a whole different
>> system is preferred to PTE markers, our solutions together currently deliver
>> the best performance for correctness. Thank you for finding this solution.
> 
> Thanks for trying that out!  It's great to know these test results.
> 
>> 
>> I am happy to introduce a configuration variable to enable my pagemap
>> patch only if necessary.
> 
Right.  We can use a config for that, but I didn't mention it when replying to
your previous patchset because I thought that, with or without the config, it
should still be better to use the pte markers since they should be more
efficient.
> 
> However I think I need to double check I didn't miss anything that you're
> looking for. E.g. I need to understand why swp offset matters here as I asked.
> I must confess that cannot be trivially done with pte markers yet - keeping a
> swap hint is definitely not the same story as keeping a swp entry.

I am well aware. Initially, I considered doing something similar. When I found
out there was already a backend in place that keeps track of the swap location
(i.e. the page cache), I thought it was unnecessary to duplicate it into the
PTE, so I just extracted it in the pagemap. This way, the overhead only applies
when retrieving pagemap data, not across the whole kernel.

> 
>> Either way, if performance is a must, batching is still the best way to
>> access multiple pagemap entries.
> 
I agree, especially since with pmd pgtable locks things can happen
concurrently.
> 
It's just that it's a pity the major overhead compared to the old way is the
page cache lookup, especially as you pointed out - the capability to
distinguish used ptes from empty ptes matters.  That's kind of orthogonal to
batching to me.
> 
> Thanks,
> 
> -- 
> Peter Xu

--
Kind regards,

Tibi Georgescu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-18 17:52           ` Tiberiu Georgescu
@ 2021-08-18 18:13             ` David Hildenbrand
  2021-08-19 14:54               ` Tiberiu Georgescu
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2021-08-18 18:13 UTC (permalink / raw)
  To: Tiberiu Georgescu
  Cc: Peter Xu, linux-kernel, linux-mm, Alistair Popple,
	Ivan Teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, Carl Waldspurger [C],
	Florian Schmidt, Jonathan Davies

>>
>>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>>> should just still be a bit slow because it'll look up the cache too, but it
>>> should work similarly to the original proposal.
> 
> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.

Well, with mincore() you could at least decide "page is present" vs.
"page is swapped or nonexistent". At least for making pageout decisions
it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop.

But I'm not 100% sure what exactly your use case is here and what you 
would really need, so you know best :)

>>
>> Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that.
>>
> We could take into consideration an alternative approach to retrieving the shared page info in user
> space, like storing it in sys/fs instead of per process. However, just leaving the pagemap functionality
> incomplete, and not providing an alternative to retrieve the missing information, does not seem right. Updating the docs with a "can't do" should be temporary, until an alternative or fix.
> 

As I stated before, making pagemap less broken is not a good idea IMHO.
Either make it really correct or just leave it all broken -- and
document that e.g., other interfaces (lseek) shall be used. It sounds
like they exist and are good enough for CRIU.

And TBH, if other interfaces already exist and get the job done, I'm
more than happy that we can avoid mixing more shmem stuff into pagemap
and trying to compensate for performance problems by introducing inconsistency.

If it has an fd and we can punch that into syscalls, we should much
rather use that fd to look up stuff than going via process page tables --
if possible of course (to be evaluated, because I haven't looked into
the CRIU details and how they use lseek with anonymous shared memory).

> Also, I think you are talking about my own doc update patch[3]. If not, please share a link with your
> next reply.
> 
> [3] https://marc.info/?m=162878395426774

No, that's it.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-18 18:13             ` David Hildenbrand
@ 2021-08-19 14:54               ` Tiberiu Georgescu
  2021-08-19 17:26                 ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Tiberiu Georgescu @ 2021-08-19 14:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, linux-kernel, linux-mm, Alistair Popple,
	Ivan Teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, Carl Waldspurger [C],
	Florian Schmidt, Jonathan Davies


> On 18 Aug 2021, at 19:13, David Hildenbrand <david@redhat.com> wrote:
> 
>>> 
>>>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>>>> should just still be a bit slow because it'll look up the cache too, but it
>>>> should work similarly to the original proposal.
>> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
>> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.
> 
> Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop.

I think you are right. In the optimisation we first presented, we should be able to
issue the madvise(MADV_PAGEOUT) call quite safely even if the page is none,
and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none"
question can be answered by the current pagemap too. Nice catch.

However, not all use cases are the same. AFAIK, there is still no way to figure
out whether a shared page is swapped out or none unless it is directly
read/accessed after a pagemap check. Bringing a page into memory to check
if it previously was in swap does not seem ideal.

Also, we still have no mechanism to retrieve the swap offsets of shmem pages
AFAIK. There is one more QEMU optimisation we are working on that requires
these mappings available outside of kernel space.
> 
> But I'm not 100% sure what exactly your use case is here and what you would really need, so you know best :)

Maybe, but I am always open to (m)advise :)
> 
>>> 
>>> Very right, maybe we can just avoid tampering with pagemap on shmem completely (which sounds like an excellent idea to me) and document it as "On shared memory, we will never indicate SWAPPED if the pages have been swapped out. Further, PRESENT might be under-indicated: if a shared page is currently not mapped into the page table of a process.". I saw there was a related, proposed doc update, maybe we can finetune that.
>>> 
>> We could take into consideration an alternative approach to retrieving the shared page info in user
>> space, like storing it in sys/fs instead of per process. However, just leaving the pagemap functionality
>> incomplete, and not providing an alternative to retrieve the missing information, does not seem right. Updating the docs with a "can't do" should be temporary, until an alternative or fix.
> 
> As I stated before, making pagemap less broken is not a good idea IMHO. Either make it really correct or just leave it all broken -- and document that e.g., other interfaces (lseek) shall be used. It sounds like they exist and are good enough for CRIU.
> 
> And TBH, if other interfaces already exist and get the job done, I'm more than happy that we can avoid mixing more shmem stuff into pagemap and trying to compensate for performance problems by introducing inconsistency.
> 
> If it has an fd and we can punch that into syscalls, we should much rather use that fd to look up stuff than going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory).

I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations
using proc/pid/map_files, which is neat. Still, CRIU does not seem to care
whether a page is swapped out or just empty, only whether it is present in the
page cache. The holes that lseek finds would not be able to convey this
difference, AFAIK. Will test the behaviour to make sure.
> 
>> Also, I think you are talking about my own doc update patch[3]. If not, please share a link with your
>> next reply.
>> [3] https://marc.info/?m=162878395426774
> 
> No, that's it.
> 
Great, I'll head right there.

--
Kind regards,

Tibi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-19 14:54               ` Tiberiu Georgescu
@ 2021-08-19 17:26                 ` David Hildenbrand
  2021-08-20 16:49                   ` Tiberiu Georgescu
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2021-08-19 17:26 UTC (permalink / raw)
  To: Tiberiu Georgescu
  Cc: Peter Xu, linux-kernel, linux-mm, Alistair Popple,
	Ivan Teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, Carl Waldspurger [C],
	Florian Schmidt, Jonathan Davies

On 19.08.21 16:54, Tiberiu Georgescu wrote:
> 
>> On 18 Aug 2021, at 19:13, David Hildenbrand <david@redhat.com> wrote:
>>
>>>>
>>>>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>>>>> should just still be a bit slow because it'll look up the cache too, but it
>>>>> should work similarly to the original proposal.
>>> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
>>> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.
>>
>> Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop.
> 
> I think you are right. In the optimisation we first presented, we should be able to
> send the madvise(MADV_PAGEOUT) call even if the page is none quite safely
> and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none"
> question can be answered by the current pagemap too. Nice catch.
> 
> However, not all use cases are the same. AFAIK, there is still no way to figure
> out whether a shared page is swapped out or none unless it is directly
> read/accessed after a pagemap check. Bringing a page into memory to check
> if it previously was in swap does not seem ideal.

Well, you can lseek() to remove all the holes and use mincore() to 
remove all in-memory pages. You're left with the swapped ones. Not the 
most efficient interface maybe, but there is a way :)
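
A sketch of that three-way classification, assuming a tmpfs fd plus a mapping
of the same file, and page-aligned offsets:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

enum page_state { PAGE_HOLE, PAGE_RESIDENT, PAGE_SWAPPED };

/* Sketch: classify one page of a tmpfs file.  lseek(SEEK_DATA) tells us
 * whether the offset is a hole; mincore() on the mapped page tells us
 * whether it is resident; what remains must be swapped out (modulo the
 * swap-cache corner cases discussed in this thread). */
static enum page_state classify_page(int fd, void *mapped_page, off_t off)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char vec = 0;

	if (lseek(fd, off, SEEK_DATA) != off)
		return PAGE_HOLE;	/* no data allocated at this offset */
	if (mincore(mapped_page, psz, &vec) == 0 && (vec & 1))
		return PAGE_RESIDENT;
	return PAGE_SWAPPED;
}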

> 
> Also, we still have no mechanism to retrieve the swap offsets of shmem pages
> AFAIK. There is one more QEMU optimisation we are working on that requires
> these mappings available outside of kernel space.

How exactly would the swap offset really help? IMHO that's a kernel 
internal that shouldn't be of any value to user space -- it's merely for 
debugging purposes. But I'd love to learn details.

[...]

>> If it has an fd and we can punch that into syscalls, we should much rather use that fd to look up stuff than going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory).
> 
> I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations
> using proc/pid/map_files, which is neat. Still, CRIU does not seem to care
> whether a page is swapped out or just empty, only if it is present on page cache.
> The holes that lseek finds would not be able to infer this difference, AFAIK. Will
> test the behaviour to make sure.

CRIU wants to migrate everything. lseek() gives you the definitive
answer on what needs migration -- whether it's swapped out or resident. Just
skip the holes.
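
In loop form, a dump pass that skips holes could be sketched as below;
dump_range() is a hypothetical callback that would save [start, end), resident
and swapped pages alike:

#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

/* Sketch: walk the allocated (data) extents of a shmem file with
 * SEEK_DATA/SEEK_HOLE, skipping holes entirely. */
static void dump_data_extents(int fd, off_t size,
			      void (*dump_range)(off_t start, off_t end))
{
	off_t data = 0, hole;

	while (data < size && (data = lseek(fd, data, SEEK_DATA)) >= 0) {
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			hole = size;
		dump_range(data, hole);
		data = hole;
	}
}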

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-19 17:26                 ` David Hildenbrand
@ 2021-08-20 16:49                   ` Tiberiu Georgescu
  2021-08-20 19:12                     ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Tiberiu Georgescu @ 2021-08-20 16:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, linux-kernel, linux-mm, Alistair Popple,
	Ivan Teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, Carl Waldspurger [C],
	Florian Schmidt, Jonathan Davies


> On 19 Aug 2021, at 18:26, David Hildenbrand <david@redhat.com> wrote:
> 
> On 19.08.21 16:54, Tiberiu Georgescu wrote:
>>> On 18 Aug 2021, at 19:13, David Hildenbrand <david@redhat.com> wrote:
>>> 
>>>>> 
>>>>>> I'm now wondering whether for Tiberiu's case mincore() can also be used.  It
>>>>>> should just still be a bit slow because it'll look up the cache too, but it
>>>>>> should work similarly to the original proposal.
>>>> I am afraid that the information returned by mincore is a little too vague to be of better help, compared to what the pagemap should provide in theory. I will have a look to see whether lseek on
>>>> proc/map_files works as a "PM_SWAP" equivalent. However, the swap offset would still be missing.
>>> 
>>> Well, with mincore() you could at least decide "page is present" vs. "page is swapped or not existent". At least for making pageout decisions it shouldn't really matter, no? madvise(MADV_PAGEOUT) on a hole is a nop.
>> I think you are right. In the optimisation we first presented, we should be able to
>> send the madvise(MADV_PAGEOUT) call even if the page is none quite safely
>> and get the wanted behaviour. Also, the "is_present" or "is_swap_or_none"
>> question can be answered by the current pagemap too. Nice catch.
>> However, not all use cases are the same. AFAIK, there is still no way to figure
>> out whether a shared page is swapped out or none unless it is directly
>> read/accessed after a pagemap check. Bringing a page into memory to check
>> if it previously was in swap does not seem ideal.
> 
> Well, you can lseek() to remove all the holes and use mincore() to remove all in-memory pages. You're left with the swapped ones. Not the most efficient interface maybe, but there is a way :)

Ok, that could work. Still, I have a couple of concerns.

Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from
swap, so using it would be a direct factor on its own output. If people are working
on Live Migration, this would not be ideal. I am not 100% sure this is how lseek
works, so please feel free to contradict me, but I think it would swap in some
of the pages that it seeks through, if not all, to figure out when to stop. Unless it
leverages the page cache somehow, or an internal bitmap.

Secondly, mincore() could return some "false positives" for this particular use
case. That is because it returns flag=1 for pages which are still in the swap
cache, so the output becomes ambiguous.

I am not saying this would never be needed. Some people could actually be
looking for exactly this scenario, and lseeking during the check could be an
advantage. It just does not look very flexible. That is why the pagemap would
have been ideal for us.

Alternatively, to get all logically swapped-out pages, combining lseek with the
pagemap should do the trick. As you said, we remove holes with lseek, but we
remove in-memory pages with is_present(pte) instead. This solution would still
suffer from my first concern, but it should do the job.
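
To be concrete, something like the sketch below is what I mean by
is_present(pte) -- reading bit 63 of the 64-bit /proc/<pid>/pagemap entry
(the helper name is mine, just for illustration):

	#include <stdint.h>
	#include <unistd.h>

	/* Returns 1 if the page backing "vaddr" is present in RAM,
	 * 0 if not, -1 on error.  pagemap_fd is an open pagemap file. */
	static int page_present(int pagemap_fd, unsigned long vaddr)
	{
		uint64_t entry;
		off_t off = (vaddr / sysconf(_SC_PAGESIZE)) * sizeof(entry);

		if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
			return -1;
		return (int)((entry >> 63) & 1);	/* bit 63: present */
	}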

> 
>> Also, we still have no mechanism to retrieve the swap offsets of shmem pages
>> AFAIK. There is one more QEMU optimisation we are working on that requires
>> these mappings available outside of kernel space.
> 
> How exactly would the swap offset really help? IMHO that's a kernel internal that shouldn't be of any value to user space -- it's merely for debugging purposes. But I'd love to learn details.

It is possible for the swap device to be network attached and shared, so multiple
hosts would need to understand its content. Then it is no longer internal to one
kernel only.

By being swap-aware, we can skip swapped-out pages during migration (to
prevent IO and potential thrashing), and transfer those pages in another way
that is zero-copy.
> 
> [...]
> 
>>> If it has an fd and we can punch that into syscalls, we should much rather use that fd to look up stuff than going via process page tables -- if possible of course (to be evaluated, because I haven't looked into the CRIU details and how they use lseek with anonymous shared memory).
>> I found out that it is possible to retrieve the fds of shmem/tmpfs file allocations
>> using proc/pid/map_files, which is neat. Still, CRIU does not seem to care
>> whether a page is swapped out or just empty, only whether it is present in the
>> page cache. The holes that lseek finds would not reveal this difference, AFAIK.
>> Will test the behaviour to make sure.
> 
> CRIU wants to migrate everything. lseek() gives you the definitive answer as
> to what needs migration -- whether it's swapped out or resident. Just skip
> the holes.

Thank you for the summary. I see why the use case is sufficient for CRIU then.
In our case, the optimisations aim to make QEMU's migration swap-aware.

--
Kind regards,
Tibi Georgescu



* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-20 16:49                   ` Tiberiu Georgescu
@ 2021-08-20 19:12                     ` Peter Xu
  2021-08-25 13:40                       ` Tiberiu Georgescu
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-08-20 19:12 UTC (permalink / raw)
  To: Tiberiu Georgescu
  Cc: David Hildenbrand, linux-kernel, linux-mm, Alistair Popple,
	Ivan Teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, Carl Waldspurger [C],
	Florian Schmidt, Jonathan Davies

Hello, Tiberiu,

On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote:
> Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from
> swap, so using it would be a direct factor on its own output. If people are working
> on Live Migration, this would not be ideal. I am not 100% sure this is how lseek
> works, so please feel free to contradict me, but I think it would swap in some
> of the pages that it seeks through, if not all, to figure out when to stop. Unless it
> leverages the page cache somehow, or an internal bitmap.

It shouldn't.  Man page is clear on that:

       SEEK_DATA
              Adjust the file offset to the next location in the file greater
              than or equal to offset containing data.  If offset points to
              data, then the file offset is set to offset.

Again, I think your requirement is different from CRIU, so I think mincore() is
the right thing for you.

> 
> Secondly, mincore() could return some "false positives" for this particular use
> case. That is because it returns flag=1 for pages which are still in the swap
> cache, so the output becomes ambiguous.

I don't think so; mincore() should return flag=0 if it's either in the swap
cache or even got dropped from it.  I think its name/doc also reflects that:
as long as it's not in RAM, the flag is cleared.  That's why I think it
should indeed be what you're looking for, if the swp entry can be ignored.
More below on that.

Note that my series is, as you mentioned, missing the changes to support
mincore() (otherwise I'd have known of its existence!).  It'll be trivial to
add that, but let's see whether mincore() will satisfy your need.

[...]

> It is possible for the swap device to be network attached and shared, so multiple
> hosts would need to understand its content. Then it is no longer internal to one
> kernel only.
> 
> By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that
> is zero-copy.

That sounds reasonable, but I'm not aware of any user-API that exposes swap
entries to userspace, or is there one?

I.e., how do you know which swap device is which?  How do you guarantee the
kernel swp entry information won't change along with time?

Thanks,

-- 
Peter Xu



* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-20 19:12                     ` Peter Xu
@ 2021-08-25 13:40                       ` Tiberiu Georgescu
  2021-08-25 14:59                         ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Tiberiu Georgescu @ 2021-08-25 13:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, linux-kernel, linux-mm, Alistair Popple,
	Ivan Teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, Carl Waldspurger [C],
	Florian Schmidt, Jonathan Davies, Chris Riches

Hello Peter, sorry for the late reply,

> On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote:
>> Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from
>> swap, so using it would be a direct factor on its own output. If people are working
>> on Live Migration, this would not be ideal. I am not 100% sure this is how lseek
>> works, so please feel free to contradict me, but I think it would swap in some
>> of the pages that it seeks through, if not all, to figure out when to stop. Unless it
>> leverages the page cache somehow, or an internal bitmap.
> 
> It shouldn't.  Man page is clear on that:
> 
>       SEEK_DATA
>              Adjust the file offset to the next location in the file greater
>              than or equal to offset containing data.  If offset points to
>              data, then the file offset is set to offset.

Ok, I got to test it out and you are right. lseek does not swap in pages. That is
great news.
> 
> Again, I think your requirement is different from CRIU, so I think mincore() is
> the right thing for you.
> 
>> 
>> Secondly, mincore() could return some "false positives" for this particular use
>> case. That is because it returns flag=1 for pages which are still in the swap
>> cache, so the output becomes ambiguous.
> 
> I don't think so; mincore() should return flag=0 if it's either in the swap
> cache or even got dropped from it.  I think its name/doc also reflects that:
> as long as it's not in RAM, the flag is cleared.  That's why I think it
> should indeed be what you're looking for, if the swp entry can be ignored.
> More below on that.

By saying there are "false positives", I do not mean that mincore() would not
work as expected, only that its definition is a little more subtle than that,
and that it does not suit our needs entirely by itself.

I tested mincore() against the pagemap, and discovered that more flags were
set to 1 (not necessarily contiguous) than there were pages the pagemap was
reporting as present. By also looking through the code, I could only conclude
that pages in the swap cache were considered "still in RAM", so they were set
to 1 as well. Looking into what the swap cache does, this makes sense.

We could use mincore() and pagemap to find the pages in the swap cache.

In short, mincore() is not enough because it does not differentiate between
present pages and swap-cache entries, as both are in RAM, but the latter
is also in swap. It can be used with other tools to get more specific information
though, so it is useful.
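
Illustratively, reusing the hypothetical page_present() pagemap helper from
my previous mail, the swap-cache case is the page that is "in RAM" per
mincore() but has no present pte per pagemap:

	/* mincore_byte is the page's byte from the mincore() vector. */
	static int in_swap_cache(int pagemap_fd, unsigned char mincore_byte,
				 unsigned long vaddr)
	{
		return (mincore_byte & 1) &&
		       page_present(pagemap_fd, vaddr) == 0;
	}
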
> 
> Note that my series is, as you mentioned, missing the changes to support
> mincore() (otherwise I'd have known of its existence!).  It'll be trivial to
> add that, but let's see whether mincore() will satisfy your need.

We are currently trying to make use of all tools that we have learned of so far
during our discussions (lseek, map_files, even mincore) to get the information
that we need about swap pages. In theory, for many of our use cases, a
combination of 2 or 3 should be enough.
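
For example (names and error handling are illustrative; the path format is
the one map_files uses, <start>-<end> in hex from /proc/<pid>/maps), this is
how we would get an fd we can lseek() on for another process's tmpfs mapping:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/types.h>

	/* Open the file backing a mapping via /proc/<pid>/map_files.
	 * Typically requires privileges (CAP_SYS_ADMIN). */
	static int open_backing_file(pid_t pid, unsigned long start,
				     unsigned long end)
	{
		char path[64];

		snprintf(path, sizeof(path), "/proc/%d/map_files/%lx-%lx",
			 (int)pid, start, end);
		return open(path, O_RDONLY);
	}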

It is a little more convoluted than a simple pagemap call, but it can be more
versatile (using lseek to skip multiple unallocated pages). As to whether the
swap bit (and more) should eventually be added to the pagemap, it may make
more sense to continue that topic on the Documentation thread.
> 
> [...]
> 
>> It is possible for the swap device to be network attached and shared, so multiple
>> hosts would need to understand its content. Then it is no longer internal to one
>> kernel only.
>> 
>> By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that
>> is zero-copy.
> 
> That sounds reasonable, but I'm not aware of any user-API that exposes swap
> entries to userspace, or is there one?

Good question. AFAIK, the swap device can be retrieved by using the swap type,
which is part of the swap entry. During our discussions, I was always assuming
that, if the pagemap entry kept track of the swap offset, it would keep track
of the swap type and, hence, the swap device as well. Sorry if I haven't made
this assumption clear until now.

So we were relying on the pagemap to expose swap entry information. Seeing it
works for private pages, we thought it made sense to make it work on shared
pages as well.
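
Concretely, per Documentation/admin-guide/mm/pagemap.rst, a swapped pagemap
entry already encodes both fields for private pages today, so decoding would
look like this sketch:

	#include <stdint.h>

	/* Pagemap entry when bit 62 (swapped) is set: bits 0-4 hold the
	 * swap type (which identifies the swap device), bits 5-54 the
	 * swap offset. */
	static int decode_swap_entry(uint64_t entry,
				     unsigned *type, uint64_t *offset)
	{
		if (!((entry >> 62) & 1))
			return -1;	/* not swapped */
		*type = entry & 0x1f;
		*offset = (entry >> 5) & ((1ULL << 50) - 1);
		return 0;
	}
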
> 
> I.e., how do you know which swap device is which?  How do you guarantee the
> kernel swp entry information won't change along with time?

I don't think we can guarantee it unless we halt the guest. But QEMU does most
migration work in pre-copy using a best-effort approach anyway.

So, having a way to retrieve temporary, but accurate information about swap
entries (i.e. post-patch pagemap) should be enough to guarantee a smoother
migration process. It is intended to be repeated, unless there is no change
between iterations.

--
Kind regards,

Tibi



* Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
  2021-08-25 13:40                       ` Tiberiu Georgescu
@ 2021-08-25 14:59                         ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-08-25 14:59 UTC (permalink / raw)
  To: Tiberiu Georgescu
  Cc: David Hildenbrand, linux-kernel, linux-mm, Alistair Popple,
	Ivan Teterevkov, Mike Rapoport, Hugh Dickins, Matthew Wilcox,
	Andrea Arcangeli, Kirill A . Shutemov, Andrew Morton,
	Mike Kravetz, Carl Waldspurger [C],
	Florian Schmidt, Jonathan Davies, Chris Riches

On Wed, Aug 25, 2021 at 01:40:13PM +0000, Tiberiu Georgescu wrote:
> Hello Peter, sorry for the late reply,

Hi, Tiberiu,

No worries on that.

> 
> > On Fri, Aug 20, 2021 at 04:49:58PM +0000, Tiberiu Georgescu wrote:
> >> Firstly, I am worried lseek with the SEEK_HOLE flag would page in pages from
> >> swap, so using it would be a direct factor on its own output. If people are working
> >> on Live Migration, this would not be ideal. I am not 100% sure this is how lseek
> >> works, so please feel free to contradict me, but I think it would swap in some
> >> of the pages that it seeks through, if not all, to figure out when to stop. Unless it
> >> leverages the page cache somehow, or an internal bitmap.
> > 
> > It shouldn't.  Man page is clear on that:
> > 
> >       SEEK_DATA
> >              Adjust the file offset to the next location in the file greater
> >              than or equal to offset containing data.  If offset points to
> >              data, then the file offset is set to offset.
> 
> Ok, I got to test it out and you are right. lseek does not swap in pages. That is
> great news.
> > 
> > Again, I think your requirement is different from CRIU, so I think mincore() is
> > the right thing for you.
> > 
> >> 
> >> Secondly, mincore() could return some "false positives" for this particular use
> >> case. That is because it returns flag=1 for pages which are still in the swap
> >> cache, so the output becomes ambiguous.
> > 
> > I don't think so; mincore() should return flag=0 if it's either in the swap
> > cache or even got dropped from it.  I think its name/doc also reflects that:
> > as long as it's not in RAM, the flag is cleared.  That's why I think it
> > should indeed be what you're looking for, if the swp entry can be ignored.
> > More below on that.
> 
> By saying there are "false positives", I do not mean that mincore() would not
> work as expected, only that its definition is a little more subtle than that,
> and that it does not suit our needs entirely by itself.
> 
> I tested mincore() against the pagemap, and discovered that more flags were
> set to 1 (not necessarily contiguous) than there were pages the pagemap was
> reporting as present. By also looking through the code, I could only conclude
> that pages in the swap cache were considered "still in RAM", so they were set
> to 1 as well. Looking into what the swap cache does, this makes sense.

Please see mincore_page():

	/*
	 * When tmpfs swaps out a page from a file, any process mapping that
	 * file will not get a swp_entry_t in its pte, but rather it is like
	 * any other file mapping (ie. marked !present and faulted in with
	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
	 */
	page = find_get_incore_page(mapping, index);
	if (page) {
		present = PageUptodate(page);
		put_page(page);
	}

I think the "testing" means that when the page is swapped out, the returned
page will be NULL; if only the pte got zapped, the page will be returned.  The
call stack should look like:

  find_get_incore_page -> find_get_page -> pagecache_get_page(fgp_flags==0).

I think fgp_flags==0 guarantees it, since FGP_ENTRY is not set.

Did you test mincore() without my patch (as my current patch will indeed cause
more 1's returned than it should)?  My guess is there's something else that
made your test read more 1's with mincore() than pagemap, but I have no solid
idea on that.

> 
> We could use mincore() and pagemap to find the pages in the swap cache.
> 
> In short, mincore() is not enough because it does not differentiate between
> present pages and swap-cache entries, as both are in RAM, but the latter
> is also in swap. It can be used with other tools to get more specific information
> though, so it is useful.
> > 
> > Note that my series is, as you mentioned, missing the changes to support
> > mincore() (otherwise I'd have known of its existence!).  It'll be trivial to
> > add that, but let's see whether mincore() will satisfy your need.
> 
> We are currently trying to make use of all tools that we have learned of so far
> during our discussions (lseek, map_files, even mincore) to get the information
> that we need about swap pages. In theory, for many of our use cases, a
> combination of 2 or 3 should be enough.
> 
> It is a little more convoluted than a simple pagemap call, but it can be more
> versatile (using lseek to skip multiple unallocated pages). As to whether the
> swap bit (and more) should eventually be added to the pagemap, it may make
> more sense to continue that topic on the Documentation thread.
> > 
> > [...]
> > 
> >> It is possible for the swap device to be network attached and shared, so multiple
> >> hosts would need to understand its content. Then it is no longer internal to one
> >> kernel only.
> >> 
> >> By being swap-aware, we can skip swapped-out pages during migration (to prevent IO and potential thrashing), and transfer those pages in another way that
> >> is zero-copy.
> > 
> > That sounds reasonable, but I'm not aware of any user-API that exposes swap
> > entries to userspace, or is there one?
> 
> > Good question. AFAIK, the swap device can be retrieved by using the swap type,
> > which is part of the swap entry. During our discussions, I was always assuming
> > that, if the pagemap entry kept track of the swap offset, it would keep track
> > of the swap type and, hence, the swap device as well. Sorry if I haven't made
> > this assumption clear until now.
> 
> So we were relying on the pagemap to expose swap entry information. Seeing it
> works for private pages, we thought it made sense to make it work on shared
> pages as well.
> > 
> > I.e., how do you know which swap device is which?  How do you guarantee the
> > kernel swp entry information won't change along with time?
> 
> I don't think we can guarantee it unless we halt the guest.

Yes, halting the guest helps, though then performance starts to matter because
all the time spent in either pagemap or mincore() will be counted as downtime
of the VM live migration, and it's not "live" at all during this period.

I'm not sure how it was done with private mappings before, because I thought
that's prerequisite knowledge for deciding whether we should migrate a page or
not, but I might have missed something.  We can stop the VM, sample, and start
the VM again, but that could cause hiccups in the guest too, or otherwise
contribute to downtime when the src/dst VMs switch over.

> But QEMU does most
> migration work in pre-copy using a best-effort approach anyway.
> 
> So, having a way to retrieve temporary, but accurate information about swap
> entries (i.e. post-patch pagemap) should be enough to guarantee a smoother
> migration process. It is intended to be repeated, unless there is no change
> between iterations.

The kernel will allocate a swap device index which is assigned as the
swp_type, right?  If there are multiple swap devices, how do you know which
swp_entry is located on which device?  I wanted to look for that info in
"swapon -s" but didn't find it.  Or maybe that solution only works if there's
only one swap device?

Besides, I also still have a question on the accuracy.  If there's no formal
way for userspace to interact with the kernel, I'm wondering how to guarantee
that the page will be kept swapped out, and that the data located on the swap
device will always be the latest.  IMHO the kernel can swap in pages as it
wishes, even if they are not accessed from userspace.  After all, all these
operations should be transparent to userspace.

One example in my mind: we do have page fault-around enabled for shmem, so
even if the VM is stopped, its pages can be faulted in if (unluckily, in this
case) some page near the swapped-out page gets faulted in - it could be some
QEMU malloc()ed region that (again, unluckily) was allocated at a virtual
address very close to the mmap()ed guest memory.

I am not sure whether it's a real problem; e.g., even if some page is swapped
in while the guest is halted for some reason, as long as no one is writing to
that page, the data on the page will still be identical to the copy on the
swap device.  However, I think that still sounds too tricky, and maybe
fragile.

Thanks,

-- 
Peter Xu



Thread overview: 21+ messages
2021-08-07  3:25 [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER Peter Xu
2021-08-07  3:25 ` [PATCH RFC 1/4] mm: Introduce PTE_MARKER swap entry Peter Xu
2021-08-07  3:25 ` [PATCH RFC 2/4] mm: Check against orig_pte for finish_fault() Peter Xu
2021-08-07  3:25 ` [PATCH RFC 3/4] mm: Handle PTE_MARKER page faults Peter Xu
2021-08-07  3:25 ` [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages Peter Xu
2021-08-13 15:18   ` Tiberiu Georgescu
2021-08-13 16:01     ` Peter Xu
2021-08-18 18:02       ` Tiberiu Georgescu
2021-08-17  9:04 ` [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER David Hildenbrand
2021-08-17 17:09   ` Peter Xu
2021-08-17 18:46     ` David Hildenbrand
2021-08-17 20:24       ` Peter Xu
2021-08-18  8:24         ` David Hildenbrand
2021-08-18 17:52           ` Tiberiu Georgescu
2021-08-18 18:13             ` David Hildenbrand
2021-08-19 14:54               ` Tiberiu Georgescu
2021-08-19 17:26                 ` David Hildenbrand
2021-08-20 16:49                   ` Tiberiu Georgescu
2021-08-20 19:12                     ` Peter Xu
2021-08-25 13:40                       ` Tiberiu Georgescu
2021-08-25 14:59                         ` Peter Xu
