linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] mm: Remember a/d bits for migration entries
@ 2022-08-04 20:39 Peter Xu
  2022-08-04 20:39 ` [PATCH v2 1/2] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Peter Xu @ 2022-08-04 20:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Huang Ying, Andrea Arcangeli, David Hildenbrand, Minchan Kim,
	Andrew Morton, Vlastimil Babka, Nadav Amit, Hugh Dickins,
	Andi Kleen, peterx, Kirill A . Shutemov

v2:
- Fix build for !CONFIG_SWAP [syzbot]
- Carry over dirty bit too [Nadav]

rfc: https://lore.kernel.org/all/20220729014041.21292-1-peterx@redhat.com
v1:  https://lore.kernel.org/all/20220803012159.36551-1-peterx@redhat.com

Problem
=======

When migrating a page, right now we always mark the migrated page as old &
clean.

However that could lead to at least two problems:

  (1) We lose the real hot/cold information that we could have persisted.
      That information shouldn't change even if the backing page is changed
      after the migration.

  (2) There can always be extra overhead on the immediate next access to
      any migrated page, because the hardware MMU needs cycles to set the
      young bit again for reads, and the dirty bit for writes, as long as
      the hardware MMU supports these bits.

Much of the recent upstream work showed that (2) is not trivial and is
actually very measurable.  In my test case, reading a 1G chunk of memory -
jumping in page size intervals - could take 99ms just because of the extra
work of setting the young bit on a generic x86_64 system, compared to 4ms
if the young bit is already set.
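(For scale: the 1G chunk in 4K pages is 262144 pages, so the ~95ms delta
works out to roughly 360ns of extra cost per page on its first access,
which matches the few-hundred-nanosecond young-bit set cost mentioned in
patch 2.)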

This issue was originally reported by Andrea Arcangeli.

Solution
========

To solve this problem, this patchset tries to remember the young/dirty bits
in the migration entries and carry them over when recovering the ptes.

We have the chance to do so because on many systems the swap offset is not
really fully used.  Migration entries use the swp offset to store the PFN
only, while the PFN is normally smaller than what the swp offset can hold.
It means we do have some free bits in the swp offset that we can use to
store things like the A/D bits, and that's how this series approaches the
problem.

max_swapfile_size() is used here to detect the per-arch offset length of
swp entries.  We'll automatically remember the A/D bits when we find that
we have enough swp offset bits to keep both the PFN and the extra bits.
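
For illustration, when the A/D bits can be carried over, the swp offset of
a migration entry is laid out roughly as below (a sketch only; the exact
field widths are arch specific, with SWP_PFN_BITS derived from
MAX_PHYSMEM_BITS and PAGE_SHIFT):

  |<------------------ swp offset ------------------>|
  | ...unused... | D | Y |            PFN            |
                   ^   ^  |<----- SWP_PFN_BITS ----->|
                   |   +---- SWP_MIG_YOUNG_BIT
                   +-------- SWP_MIG_DIRTY_BIT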

This series mainly solves the lost-bits issue for the migration procedure
only.  Besides that, a few topics related to this series are worth
mentioning:

  (1) Page Idle Tracking

  Before this series, idle tracking could report a false negative if an
  accessed page got migrated, since after migration the young bit would be
  lost.  After this series it'll be better in that the young bit will be
  persisted across migration, so it'll be detected correctly by the page
  idle logic when walking the pgtable.

  However nothing is done yet for the case where a page idle reset is
  carried out while the migration procedure is in progress, but that should
  be a separate topic to be addressed (e.g. teaching the rmap pgtable walk
  code to walk both present ptes and migration ptes).

  (2) MADV_COLD/MADV_FREE

  This series doesn't teach the madvise() code to recognize the new
  entries, for the following reasons.

  Currently MADV_COLD is not handled for migration entries carrying the A
  bit.  Logically the A bit should be dropped when colding a page, but here
  the more important part is probably the LRU operation, which is still
  missing.  That may suggest that COLD on migration entries is not a major
  scenario.

  A similar argument applies to MADV_FREE: logically we should consider
  dropping the migration entry as a whole if one is found.  In all cases,
  this series doesn't cover any of the cases above, assuming they'll either
  be kept as-is or be addressed in a separate patchset.

Tests
=====

After the patchset is applied, the immediate read access test [1] of the
above 1G chunk after migration shrinks from 99ms to 4ms.  The test is done
by moving 1G of pages from node 0->1->0 and then reading it in page-size
jumps.  The test machine uses an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

A similar effect can also be measured when writing the memory for the
first time after migration.

After applying the patchset, both the initial immediate read and write
after a page is migrated will perform similarly to before the migration
happened.
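
For reference, the core of the read test is conceptually just the loop
below (a simplified sketch of [1], not the exact program; the buffer is
assumed to be the 1G region that has just been migrated, e.g. with
move_pages()):

#include <stddef.h>
#include <stdint.h>

/* Touch one byte per page; each access may force the MMU to set the A bit */
static uint64_t read_in_page_jumps(volatile char *buf, size_t size,
				   size_t page_size)
{
	uint64_t sum = 0;
	size_t off;

	for (off = 0; off < size; off += page_size)
		sum += buf[off];

	/* time this function right after the migration completes */
	return sum;
}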

Patch Layout
============

Patch 1:  Add swp_offset_pfn() and apply it to all pfn swap entries.  We
          should also stop treating swp_offset() as the PFN, because it can
          contain more information starting from the next patch.
Patch 2:  The core patch to remember young/dirty bit in swap offsets.

Please review, thanks.

[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c

Peter Xu (2):
  mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  mm: Remember young/dirty bit for page migrations

 arch/arm64/mm/hugetlbpage.c |   2 +-
 include/linux/swapops.h     | 126 ++++++++++++++++++++++++++++++++++--
 mm/hmm.c                    |   2 +-
 mm/huge_memory.c            |  26 +++++++-
 mm/memory-failure.c         |   2 +-
 mm/migrate.c                |   6 +-
 mm/migrate_device.c         |   4 ++
 mm/page_vma_mapped.c        |   6 +-
 mm/rmap.c                   |   5 +-
 9 files changed, 163 insertions(+), 16 deletions(-)

-- 
2.32.0



* [PATCH v2 1/2] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  2022-08-04 20:39 [PATCH v2 0/2] mm: Remember a/d bits for migration entries Peter Xu
@ 2022-08-04 20:39 ` Peter Xu
  2022-08-04 20:39 ` [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations Peter Xu
  2022-08-04 22:17 ` [PATCH v2 0/2] mm: Remember a/d bits for migration entries Nadav Amit
  2 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2022-08-04 20:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Huang Ying, Andrea Arcangeli, David Hildenbrand, Minchan Kim,
	Andrew Morton, Vlastimil Babka, Nadav Amit, Hugh Dickins,
	Andi Kleen, peterx, Kirill A . Shutemov

We've got a bunch of special swap entries that store the PFN inside the
swap offset field.  To fetch the PFN, normally the user just calls
swp_offset(), assuming that'll be the PFN.

Add a helper swp_offset_pfn() to fetch the PFN instead, masking out only
the maximum possible width of a PFN on the host, meanwhile doing a proper
check against MAX_PHYSMEM_BITS using the BUILD_BUG_ON() in
is_pfn_swap_entry() to make sure the swap offset can always store the PFN
properly.

One reason to do so is that we never tried to sanitize whether the swap
offset can really fit a PFN.  Meanwhile, this patch also prepares us for
the future possibility of storing more information inside the swp offset
field, so the assumption that "swp_offset(entry)" is the PFN will soon no
longer hold.

Replace many of the swp_offset() callers with swp_offset_pfn() where
appropriate.  Note that many of the existing users are not candidates for
the replacement, e.g.:

  (1) When the swap entry is not a pfn swap entry at all, or,
  (2) when we want to keep the whole swp_offset but only change the swp type.

For the latter, it can happen when fork() triggers on a write-migration
swap entry pte: we may want to change only the migration type from
write->read but keep the rest, so it's not "fetching the PFN" but "changing
the swap type only".  They're left aside so that when there's more
information within the swp offset it'll be carried over naturally in those
cases.
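
As an illustration of that fork() case, the copy path does roughly the
below (a simplified sketch mirroring what mm/memory.c's
copy_nonpresent_pte() already does today; not part of this patch):

	if (is_migration_entry(entry) &&
	    !is_readable_migration_entry(entry) && is_cow_mapping(vm_flags)) {
		/*
		 * Downgrade write->read, but keep the whole swp offset so
		 * that anything stored alongside the PFN carries over as-is.
		 */
		entry = make_readable_migration_entry(swp_offset(entry));
		pte = swp_entry_to_pte(entry);
		set_pte_at(src_mm, addr, src_pte, pte);
	}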

While at it, drop hwpoison_entry_to_pfn() because that's exactly what the
new swp_offset_pfn() is about.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/mm/hugetlbpage.c |  2 +-
 include/linux/swapops.h     | 35 +++++++++++++++++++++++++++++------
 mm/hmm.c                    |  2 +-
 mm/memory-failure.c         |  2 +-
 mm/page_vma_mapped.c        |  6 +++---
 5 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 7430060cb0d6..f897d40821dd 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -242,7 +242,7 @@ static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
 {
 	VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
 
-	return page_folio(pfn_to_page(swp_offset(entry)));
+	return page_folio(pfn_to_page(swp_offset_pfn(entry)));
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index a3d435bf9f97..1d17e4bb3d2f 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -23,6 +23,20 @@
 #define SWP_TYPE_SHIFT	(BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
 #define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)
 
+/*
+ * Definitions only for PFN swap entries (see is_pfn_swap_entry()).  To
+ * store PFN, we only need SWP_PFN_BITS bits.  Each of the pfn swap entries
+ * can use the extra bits to store other information besides PFN.
+ */
+#ifdef MAX_PHYSMEM_BITS
+#define SWP_PFN_BITS			(MAX_PHYSMEM_BITS - PAGE_SHIFT)
+#else
+#define SWP_PFN_BITS			(BITS_PER_LONG - PAGE_SHIFT)
+#endif
+#define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
+
+static inline bool is_pfn_swap_entry(swp_entry_t entry);
+
 /* Clear all flags but only keep swp_entry_t related information */
 static inline pte_t pte_swp_clear_flags(pte_t pte)
 {
@@ -64,6 +78,17 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
 	return entry.val & SWP_OFFSET_MASK;
 }
 
+/*
+ * This should only be called upon a pfn swap entry to get the PFN stored
+ * in the swap entry.  Please refer to is_pfn_swap_entry() for the definition
+ * of pfn swap entry.
+ */
+static inline unsigned long swp_offset_pfn(swp_entry_t entry)
+{
+	VM_BUG_ON(!is_pfn_swap_entry(entry));
+	return swp_offset(entry) & SWP_PFN_MASK;
+}
+
 /* check whether a pte points to a swap entry */
 static inline int is_swap_pte(pte_t pte)
 {
@@ -369,7 +394,7 @@ static inline int pte_none_mostly(pte_t pte)
 
 static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 {
-	struct page *p = pfn_to_page(swp_offset(entry));
+	struct page *p = pfn_to_page(swp_offset_pfn(entry));
 
 	/*
 	 * Any use of migration entries may only occur while the
@@ -387,6 +412,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
  */
 static inline bool is_pfn_swap_entry(swp_entry_t entry)
 {
+	/* Make sure the swp offset can always store the needed fields */
+	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
+
 	return is_migration_entry(entry) || is_device_private_entry(entry) ||
 	       is_device_exclusive_entry(entry);
 }
@@ -475,11 +503,6 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
-static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
-{
-	return swp_offset(entry);
-}
-
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index f2aa63b94d9b..3850fb625dda 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -253,7 +253,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			cpu_flags = HMM_PFN_VALID;
 			if (is_writable_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
-			*hmm_pfn = swp_offset(entry) | cpu_flags;
+			*hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
 			return 0;
 		}
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index cc6fc9be8d22..e451219124dd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -632,7 +632,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 		swp_entry_t swp = pte_to_swp_entry(pte);
 
 		if (is_hwpoison_entry(swp))
-			pfn = hwpoison_entry_to_pfn(swp);
+			pfn = swp_offset_pfn(swp);
 	}
 
 	if (!pfn || pfn != poisoned_pfn)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8e9e574d535a..93e13fc17d3c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -86,7 +86,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		    !is_device_exclusive_entry(entry))
 			return false;
 
-		pfn = swp_offset(entry);
+		pfn = swp_offset_pfn(entry);
 	} else if (is_swap_pte(*pvmw->pte)) {
 		swp_entry_t entry;
 
@@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		    !is_device_exclusive_entry(entry))
 			return false;
 
-		pfn = swp_offset(entry);
+		pfn = swp_offset_pfn(entry);
 	} else {
 		if (!pte_present(*pvmw->pte))
 			return false;
@@ -221,7 +221,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 					return not_found(pvmw);
 				entry = pmd_to_swp_entry(pmde);
 				if (!is_migration_entry(entry) ||
-				    !check_pmd(swp_offset(entry), pvmw))
+				    !check_pmd(swp_offset_pfn(entry), pvmw))
 					return not_found(pvmw);
 				return true;
 			}
-- 
2.32.0



* [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-04 20:39 [PATCH v2 0/2] mm: Remember a/d bits for migration entries Peter Xu
  2022-08-04 20:39 ` [PATCH v2 1/2] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
@ 2022-08-04 20:39 ` Peter Xu
  2022-08-04 22:40   ` Nadav Amit
                     ` (2 more replies)
  2022-08-04 22:17 ` [PATCH v2 0/2] mm: Remember a/d bits for migration entries Nadav Amit
  2 siblings, 3 replies; 16+ messages in thread
From: Peter Xu @ 2022-08-04 20:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Huang Ying, Andrea Arcangeli, David Hildenbrand, Minchan Kim,
	Andrew Morton, Vlastimil Babka, Nadav Amit, Hugh Dickins,
	Andi Kleen, peterx, Kirill A . Shutemov

When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, mark the page as old in the new page table using either
pte_mkold() or pmd_mkold(), and keep the pte clean.

That's fine functionally, but it's not friendly to page reclaim because
the page being moved can be actively accessed within the procedure.  Not to
mention that hardware setting the young bit can bring quite some overhead
on some systems, e.g. x86_64 needs a few hundred nanoseconds to set the
bit.  The same slowdown applies to the dirty bit when the memory is first
written after the page migration happened.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated.  To achieve it, define a new set of
bits in the migration swap offset field to cache the A/D bits of the old
pte.  Then when removing/recovering the migration entry, we can recover the
A/D bits even if the backing page has changed.

One thing to mention is that here we use max_swapfile_size() to detect how
many swp offset bits we have, and we'll only enable this feature if we know
the swp offset is big enough to store both the PFN value and the A/D bits.
Otherwise the A/D bits are dropped like before.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c        | 26 +++++++++++-
 mm/migrate.c            |  6 ++-
 mm/migrate_device.c     |  4 ++
 mm/rmap.c               |  5 ++-
 5 files changed, 128 insertions(+), 4 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 1d17e4bb3d2f..34aa448ac6ee 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -8,6 +8,8 @@
 
 #ifdef CONFIG_MMU
 
+#include <linux/swapfile.h>
+
 /*
  * swapcache pages are stored in the swapper_space radix tree.  We want to
  * get good packing density in that tree, so the index should be dense in
@@ -35,6 +37,24 @@
 #endif
 #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
 
+/**
+ * Migration swap entry specific bitfield definitions.
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
+ * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
+ *
+ * Note: these bits will be stored in migration entries iff there're enough
+ * free bits in arch specific swp offset.  By default we'll ignore A/D bits
+ * when migrating a page.  Please refer to migration_entry_supports_ad()
+ * for more information.
+ */
+#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
+#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
+#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
+
+#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
+#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
+
 static inline bool is_pfn_swap_entry(swp_entry_t entry);
 
 /* Clear all flags but only keep swp_entry_t related information */
@@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 	return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
+/*
+ * Returns whether the host has large enough swap offset field to support
+ * carrying over pgtable A/D bits for page migrations.  The result is
+ * pretty much arch specific.
+ */
+static inline bool migration_entry_supports_ad(void)
+{
+	/*
+	 * max_swapfile_size() returns the max supported swp-offset plus 1.
+	 * We can support the migration A/D bits iff the pfn swap entry has
+	 * the offset large enough to cover all of them (PFN, A & D bits).
+	 */
+#ifdef CONFIG_SWAP
+	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+#else
+	return false;
+#endif
+}
+
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_YOUNG);
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_YOUNG;
+	/* Keep the old behavior of aging page after migration */
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_DIRTY);
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
+	/* Keep the old behavior of clean page after migration */
+	return false;
+}
+
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl);
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -311,6 +382,26 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
 	return 0;
 }
 
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	return false;
+}
+
 #endif
 
 typedef unsigned long pte_marker;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 29e3628687a6..7d79db4bd76a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2005,6 +2005,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	pmd_t old_pmd, _pmd;
 	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
 	bool anon_exclusive = false;
+	bool dirty = false;
 	unsigned long addr;
 	int i;
 
@@ -2088,7 +2089,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_writable_migration_entry(entry);
 		if (PageAnon(page))
 			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = false;
+		young = is_migration_entry_young(entry);
+		dirty = is_migration_entry_dirty(entry);
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2097,6 +2099,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			SetPageDirty(page);
 		write = pmd_write(old_pmd);
 		young = pmd_young(old_pmd);
+		dirty = pmd_dirty(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
 		uffd_wp = pmd_uffd_wp(old_pmd);
 
@@ -2146,6 +2149,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			else
 				swp_entry = make_readable_migration_entry(
 							page_to_pfn(page + i));
+			if (young)
+				swp_entry = make_migration_entry_young(swp_entry);
+			if (dirty)
+				swp_entry = make_migration_entry_dirty(swp_entry);
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
@@ -2160,6 +2167,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_wrprotect(entry);
 			if (!young)
 				entry = pte_mkold(entry);
+			if (dirty)
+				/*
+				 * NOTE: this may contain setting soft
+				 * dirty too on some archs like x86.
+				 */
+				entry = pte_mkdirty(entry);
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
 			if (uffd_wp)
@@ -3148,6 +3161,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
 	else
 		entry = make_readable_migration_entry(page_to_pfn(page));
+	if (pmd_young(pmdval))
+		entry = make_migration_entry_young(entry);
+	if (pmd_dirty(pmdval))
+		entry = make_migration_entry_dirty(entry);
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3173,13 +3190,18 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	get_page(new);
-	pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+	if (!is_migration_entry_young(entry))
+		pmde = pmd_mkold(pmde);
+	if (PageDirty(new) && is_migration_entry_dirty(entry))
+		/* NOTE: this may contain setting soft-dirty on some archs */
+		pmde = pmd_mkdirty(pmde);
 
 	if (PageAnon(new)) {
 		rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 1649270bc1a7..957e8e6ee28e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -199,7 +199,7 @@ static bool remove_migration_pte(struct folio *folio,
 #endif
 
 		folio_get(folio);
-		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
 		if (pte_swp_soft_dirty(*pvmw.pte))
 			pte = pte_mksoft_dirty(pte);
 
@@ -207,6 +207,10 @@ static bool remove_migration_pte(struct folio *folio,
 		 * Recheck VMA as permissions can change since migration started
 		 */
 		entry = pte_to_swp_entry(*pvmw.pte);
+		if (!is_migration_entry_young(entry))
+			pte = pte_mkold(pte);
+		if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
+			pte = pte_mkdirty(pte);
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 7feeb447e3b9..47d90df04cc6 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(page));
+			if (pte_young(pte))
+				entry = make_migration_entry_young(entry);
+			if (pte_dirty(pte))
+				entry = make_migration_entry_dirty(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_present(pte)) {
 				if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..28aef434ea41 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-
+			if (pte_young(pteval))
+				entry = make_migration_entry_young(entry);
+			if (pte_dirty(pteval))
+				entry = make_migration_entry_dirty(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-- 
2.32.0



* Re: [PATCH v2 0/2] mm: Remember a/d bits for migration entries
  2022-08-04 20:39 [PATCH v2 0/2] mm: Remember a/d bits for migration entries Peter Xu
  2022-08-04 20:39 ` [PATCH v2 1/2] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
  2022-08-04 20:39 ` [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations Peter Xu
@ 2022-08-04 22:17 ` Nadav Amit
  2 siblings, 0 replies; 16+ messages in thread
From: Nadav Amit @ 2022-08-04 22:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, LKML, Huang Ying, Andrea Arcangeli, David Hildenbrand,
	Minchan Kim, Andrew Morton, Vlastimil Babka, Hugh Dickins,
	Andi Kleen, Kirill A . Shutemov

On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:

>  (1) Page Idle Tracking
> 
>  Before this series, idle tracking can cause false negative if an accessed
>  page got migrated, since after migration the young bit will get lost.
>  After this series, it'll be better in that after migration young bit will
>  be persisted, so it'll be able to be detected correctly by page idle
>  logic when walking the pgtable.
> 
>  However there's still nothing done when page idle reset was carried out
>  during migration procedure in progress, but that should be a separate
>  topic to be addressed (e.g. to teach rmap pgtable walk code to be able to
>  walk with both present ptes and migration ptes).

IIUC, when a migration entry is set, page_remove_rmap() is called by
try_to_migrate_one(), so improving page-idle accuracy should be done in a
different way.



* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-04 20:39 ` [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations Peter Xu
@ 2022-08-04 22:40   ` Nadav Amit
  2022-08-05 16:30     ` Peter Xu
  2022-08-05 12:17   ` David Hildenbrand
  2022-08-09  8:40   ` Huang, Ying
  2 siblings, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2022-08-04 22:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, linux-kernel, Huang Ying, Andrea Arcangeli,
	David Hildenbrand, Minchan Kim, Andrew Morton, Vlastimil Babka,
	Hugh Dickins, Andi Kleen, Kirill A . Shutemov

On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:

> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> 
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure.  Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
> 
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
> 
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
> bit.  Otherwise the A/D bits are dropped like before.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
> mm/huge_memory.c        | 26 +++++++++++-
> mm/migrate.c            |  6 ++-
> mm/migrate_device.c     |  4 ++
> mm/rmap.c               |  5 ++-
> 5 files changed, 128 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 1d17e4bb3d2f..34aa448ac6ee 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -8,6 +8,8 @@
> 
> #ifdef CONFIG_MMU
> 
> +#include <linux/swapfile.h>

Shouldn’t the ifdef go into linux/swapfile.h if that’s the right thing to do
to prevent others from mistakenly including it?

> +
> /*
>  * swapcache pages are stored in the swapper_space radix tree.  We want to
>  * get good packing density in that tree, so the index should be dense in
> @@ -35,6 +37,24 @@
> #endif
> #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
> 
> +/**
> + * Migration swap entry specific bitfield definitions.
> + *
> + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> + *
> + * Note: these bits will be stored in migration entries iff there're enough
> + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> + * when migrating a page.  Please refer to migration_entry_supports_ad()
> + * for more information.
> + */
> +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> +
> +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)

Any reason not to use BIT(x) ?
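I.e., something along the lines of (just to illustrate the suggestion):

#define SWP_MIG_YOUNG			BIT(SWP_MIG_YOUNG_BIT)
#define SWP_MIG_DIRTY			BIT(SWP_MIG_DIRTY_BIT)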

> +
> static inline bool is_pfn_swap_entry(swp_entry_t entry);
> 
> /* Clear all flags but only keep swp_entry_t related information */
> @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> 	return swp_entry(SWP_MIGRATION_WRITE, offset);
> }
> 
> +/*
> + * Returns whether the host has large enough swap offset field to support
> + * carrying over pgtable A/D bits for page migrations.  The result is
> + * pretty much arch specific.
> + */
> +static inline bool migration_entry_supports_ad(void)
> +{
> +	/*
> +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> +	 * We can support the migration A/D bits iff the pfn swap entry has
> +	 * the offset large enough to cover all of them (PFN, A & D bits).
> +	 */
> +#ifdef CONFIG_SWAP
> +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);

This is an actual function call (unless LTO has some trick). A bit of a
shame it cannot at least be memoized.

Or at least mark max_swapfile_size() as __attribute_const__ so it would not
be called twice for make_migration_entry_young() and
make_migration_entry_dirty().
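
E.g. the declaration could become something like this (illustrative only,
untested):

extern unsigned long max_swapfile_size(void) __attribute_const__;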

> +#else
> +	return false;
> +#endif
> +}
> +
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_entry(swp_type(entry),
> +				 swp_offset(entry) | SWP_MIG_YOUNG);
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_offset(entry) & SWP_MIG_YOUNG;
> +	/* Keep the old behavior of aging page after migration */
> +	return false;
> +}
> +
> +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_entry(swp_type(entry),
> +				 swp_offset(entry) | SWP_MIG_DIRTY);
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;

Shouldn’t it be SWP_MIG_DIRTY ?

> +	/* Keep the old behavior of clean page after migration */
> +	return false;
> +}
> +
> extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> 					spinlock_t *ptl);
> extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> @@ -311,6 +382,26 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
> 	return 0;
> }
> 
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_young(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> +{
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> #endif

While at it, can you change to:

#endif /* CONFIG_MIGRATION */

[ these ifdefs burn my eyes ]

Other than that looks good.

Thanks,
Nadav




* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-04 20:39 ` [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations Peter Xu
  2022-08-04 22:40   ` Nadav Amit
@ 2022-08-05 12:17   ` David Hildenbrand
  2022-08-05 16:36     ` Peter Xu
  2022-08-09  8:40   ` Huang, Ying
  2 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2022-08-05 12:17 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Huang Ying, Andrea Arcangeli, Minchan Kim, Andrew Morton,
	Vlastimil Babka, Nadav Amit, Hugh Dickins, Andi Kleen,
	Kirill A . Shutemov

On 04.08.22 22:39, Peter Xu wrote:
> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> 
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure.  Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
> 
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
> 
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
> bit.  Otherwise the A/D bits are dropped like before.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c        | 26 +++++++++++-
>  mm/migrate.c            |  6 ++-
>  mm/migrate_device.c     |  4 ++
>  mm/rmap.c               |  5 ++-
>  5 files changed, 128 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 1d17e4bb3d2f..34aa448ac6ee 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -8,6 +8,8 @@
>  
>  #ifdef CONFIG_MMU
>  
> +#include <linux/swapfile.h>
> +
>  /*
>   * swapcache pages are stored in the swapper_space radix tree.  We want to
>   * get good packing density in that tree, so the index should be dense in
> @@ -35,6 +37,24 @@
>  #endif
>  #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
>  
> +/**
> + * Migration swap entry specific bitfield definitions.
> + *
> + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> + *
> + * Note: these bits will be stored in migration entries iff there're enough
> + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> + * when migrating a page.  Please refer to migration_entry_supports_ad()
> + * for more information.
> + */
> +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> +
> +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
> +
>  static inline bool is_pfn_swap_entry(swp_entry_t entry);
>  
>  /* Clear all flags but only keep swp_entry_t related information */
> @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
>  	return swp_entry(SWP_MIGRATION_WRITE, offset);
>  }
>  
> +/*
> + * Returns whether the host has large enough swap offset field to support
> + * carrying over pgtable A/D bits for page migrations.  The result is
> + * pretty much arch specific.
> + */
> +static inline bool migration_entry_supports_ad(void)
> +{
> +	/*
> +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> +	 * We can support the migration A/D bits iff the pfn swap entry has
> +	 * the offset large enough to cover all of them (PFN, A & D bits).
> +	 */
> +#ifdef CONFIG_SWAP
> +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> +#else
> +	return false;
> +#endif
> +}


This looks much cleaner to me. It might be helpful to draw an ascii
picture of where exactly these bits reside inside the offset.

> +
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())

Do we maybe want to turn that into a static key and enable it once and
for all? As Nadav says, the repeated max_swapfile_size() calls/checks
might be worth optimizing out.
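
Something along these lines, perhaps (a rough, untested sketch; the init
hook is made up and would have to be wired into a suitable init path):

DEFINE_STATIC_KEY_FALSE(migration_ad_supported);

static int __init migration_ad_init(void)
{
	if (max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS))
		static_branch_enable(&migration_ad_supported);
	return 0;
}
core_initcall(migration_ad_init);

static inline bool migration_entry_supports_ad(void)
{
	return static_branch_likely(&migration_ad_supported);
}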

[...]


-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-04 22:40   ` Nadav Amit
@ 2022-08-05 16:30     ` Peter Xu
  2022-08-09  8:45       ` Huang, Ying
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2022-08-05 16:30 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, linux-kernel, Huang Ying, Andrea Arcangeli,
	David Hildenbrand, Minchan Kim, Andrew Morton, Vlastimil Babka,
	Hugh Dickins, Andi Kleen, Kirill A . Shutemov

On Thu, Aug 04, 2022 at 03:40:57PM -0700, Nadav Amit wrote:
> On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > When page migration happens, we always ignore the young/dirty bit settings
> > in the old pgtable, and marking the page as old in the new page table using
> > either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> > 
> > That's fine from functional-wise, but that's not friendly to page reclaim
> > because the moving page can be actively accessed within the procedure.  Not
> > to mention hardware setting the young bit can bring quite some overhead on
> > some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> > The same slowdown problem to dirty bits when the memory is first written
> > after page migration happened.
> > 
> > Actually we can easily remember the A/D bit configuration and recover the
> > information after the page is migrated.  To achieve it, define a new set of
> > bits in the migration swap offset field to cache the A/D bits for old pte.
> > Then when removing/recovering the migration entry, we can recover the A/D
> > bits even if the page changed.
> > 
> > One thing to mention is that here we used max_swapfile_size() to detect how
> > many swp offset bits we have, and we'll only enable this feature if we know
> > the swp offset can be big enough to store both the PFN value and the young
> > bit.  Otherwise the A/D bits are dropped like before.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
> > mm/huge_memory.c        | 26 +++++++++++-
> > mm/migrate.c            |  6 ++-
> > mm/migrate_device.c     |  4 ++
> > mm/rmap.c               |  5 ++-
> > 5 files changed, 128 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index 1d17e4bb3d2f..34aa448ac6ee 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -8,6 +8,8 @@
> > 
> > #ifdef CONFIG_MMU
> > 
> > +#include <linux/swapfile.h>
> 
> Shouldn’t the ifdef go into linux/swapfile.h if that’s the right thing to do
> to prevent others from mistakenly including it?

swapfile.h is for max_swapfile_size(), which is referenced in the new code.
If !CONFIG_MMU then there shouldn't be a reference to max_swapfile_size()
even if swapops.h is included, so logically it should be after CONFIG_MMU?

> 
> > +
> > /*
> >  * swapcache pages are stored in the swapper_space radix tree.  We want to
> >  * get good packing density in that tree, so the index should be dense in
> > @@ -35,6 +37,24 @@
> > #endif
> > #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
> > 
> > +/**
> > + * Migration swap entry specific bitfield definitions.
> > + *
> > + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> > + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> > + *
> > + * Note: these bits will be stored in migration entries iff there're enough
> > + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> > + * when migrating a page.  Please refer to migration_entry_supports_ad()
> > + * for more information.
> > + */
> > +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> > +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> > +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> > +
> > +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> > +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
> 
> Any reason not to use BIT(x) ?

Yeh why not..

> 
> > +
> > static inline bool is_pfn_swap_entry(swp_entry_t entry);
> > 
> > /* Clear all flags but only keep swp_entry_t related information */
> > @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> > 	return swp_entry(SWP_MIGRATION_WRITE, offset);
> > }
> > 
> > +/*
> > + * Returns whether the host has large enough swap offset field to support
> > + * carrying over pgtable A/D bits for page migrations.  The result is
> > + * pretty much arch specific.
> > + */
> > +static inline bool migration_entry_supports_ad(void)
> > +{
> > +	/*
> > +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> > +	 * We can support the migration A/D bits iff the pfn swap entry has
> > +	 * the offset large enough to cover all of them (PFN, A & D bits).
> > +	 */
> > +#ifdef CONFIG_SWAP
> > +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> 
> This is an actual a function call (unless LTO has some trick). A bit of a
> shame it cannot be at least memoized.
> 
> Or at least mark max_swapfile_size() as __attribute_const__ so it would not
> be called twice for make_migration_entry_young() and
> make_migration_entry_dirty().

I didn't spend too much effort on this one since we're on the swap path
and I assumed that's not a super hot path.  But __attribute_const__ sounds
good and easy to get, thanks.

Perhaps I should mark it on migration_entry_supports_ad() as a whole?  Note
that unfortunately SWP_MIG_TOTAL_BITS may not be a constant either (see how
that define roots back to MAX_PHYSMEM_BITS, which on x86_64 needs to check
for 5-level paging).
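
(For reference, on x86_64 it is roughly:

	# define MAX_PHYSMEM_BITS	(pgtable_l5_enabled() ? 52 : 46)

so SWP_PFN_BITS ends up being a runtime expression there rather than a
compile-time constant.)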

> 
> > +#else
> > +	return false;
> > +#endif
> > +}
> > +
> > +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> > +		return swp_entry(swp_type(entry),
> > +				 swp_offset(entry) | SWP_MIG_YOUNG);
> > +	return entry;
> > +}
> > +
> > +static inline bool is_migration_entry_young(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> > +		return swp_offset(entry) & SWP_MIG_YOUNG;
> > +	/* Keep the old behavior of aging page after migration */
> > +	return false;
> > +}
> > +
> > +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> > +		return swp_entry(swp_type(entry),
> > +				 swp_offset(entry) | SWP_MIG_DIRTY);
> > +	return entry;
> > +}
> > +
> > +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> > +		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
> 
> Shouldn’t it be SWP_MIG_DIRTY ?

Oops, yes.

> 
> > +	/* Keep the old behavior of clean page after migration */
> > +	return false;
> > +}
> > +
> > extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > 					spinlock_t *ptl);
> > extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> > @@ -311,6 +382,26 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
> > 	return 0;
> > }
> > 
> > +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> > +{
> > +	return entry;
> > +}
> > +
> > +static inline bool is_migration_entry_young(swp_entry_t entry)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> > +{
> > +	return entry;
> > +}
> > +
> > +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> > +{
> > +	return false;
> > +}
> > +
> > #endif
> 
> While at it, can you change to:
> 
> #endif /* CONFIG_MIGRATION */
> 
> [ these ifdefs burn my eyes ]

Ok, but there are a bunch of "ifdef"s in this header that are missing such
comments.  Maybe I'll add a separate patch for all of them just to do the
commenting.

> 
> Other than that looks good.

Thanks,

-- 
Peter Xu



* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-05 12:17   ` David Hildenbrand
@ 2022-08-05 16:36     ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2022-08-05 16:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Minchan Kim, Andrew Morton, Vlastimil Babka, Nadav Amit,
	Hugh Dickins, Andi Kleen, Kirill A . Shutemov

On Fri, Aug 05, 2022 at 02:17:55PM +0200, David Hildenbrand wrote:
> On 04.08.22 22:39, Peter Xu wrote:
> > When page migration happens, we always ignore the young/dirty bit settings
> > in the old pgtable, and marking the page as old in the new page table using
> > either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> > 
> > That's fine from functional-wise, but that's not friendly to page reclaim
> > because the moving page can be actively accessed within the procedure.  Not
> > to mention hardware setting the young bit can bring quite some overhead on
> > some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> > The same slowdown problem to dirty bits when the memory is first written
> > after page migration happened.
> > 
> > Actually we can easily remember the A/D bit configuration and recover the
> > information after the page is migrated.  To achieve it, define a new set of
> > bits in the migration swap offset field to cache the A/D bits for old pte.
> > Then when removing/recovering the migration entry, we can recover the A/D
> > bits even if the page changed.
> > 
> > One thing to mention is that here we used max_swapfile_size() to detect how
> > many swp offset bits we have, and we'll only enable this feature if we know
> > the swp offset can be big enough to store both the PFN value and the young
> > bit.  Otherwise the A/D bits are dropped like before.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
> >  mm/huge_memory.c        | 26 +++++++++++-
> >  mm/migrate.c            |  6 ++-
> >  mm/migrate_device.c     |  4 ++
> >  mm/rmap.c               |  5 ++-
> >  5 files changed, 128 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index 1d17e4bb3d2f..34aa448ac6ee 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -8,6 +8,8 @@
> >  
> >  #ifdef CONFIG_MMU
> >  
> > +#include <linux/swapfile.h>
> > +
> >  /*
> >   * swapcache pages are stored in the swapper_space radix tree.  We want to
> >   * get good packing density in that tree, so the index should be dense in
> > @@ -35,6 +37,24 @@
> >  #endif
> >  #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
> >  
> > +/**
> > + * Migration swap entry specific bitfield definitions.
> > + *
> > + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> > + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> > + *
> > + * Note: these bits will be stored in migration entries iff there're enough
> > + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> > + * when migrating a page.  Please refer to migration_entry_supports_ad()
> > + * for more information.
> > + */
> > +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> > +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> > +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> > +
> > +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> > +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
> > +
> >  static inline bool is_pfn_swap_entry(swp_entry_t entry);
> >  
> >  /* Clear all flags but only keep swp_entry_t related information */
> > @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> >  	return swp_entry(SWP_MIGRATION_WRITE, offset);
> >  }
> >  
> > +/*
> > + * Returns whether the host has large enough swap offset field to support
> > + * carrying over pgtable A/D bits for page migrations.  The result is
> > + * pretty much arch specific.
> > + */
> > +static inline bool migration_entry_supports_ad(void)
> > +{
> > +	/*
> > +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> > +	 * We can support the migration A/D bits iff the pfn swap entry has
> > +	 * the offset large enough to cover all of them (PFN, A & D bits).
> > +	 */
> > +#ifdef CONFIG_SWAP
> > +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> > +#else
> > +	return false;
> > +#endif
> > +}
> 
> 
> This looks much cleaner to me. It might be helpful to draw an ascii
> picture where exatcly these bits reside isnide the offset.

Yes, that'll be helpful, especially when more bits are defined.  Not sure
how much it'll help for now, but I can definitely do that.

> 
> > +
> > +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> 
> Do we maybe want to turn that into a static key and enable it once and
> for all? As Nadav says, the repeated max_swapfile_size() calls/checks
> might be worth optimizing out.

Since there are a few arch related issues to answer (as replied to Nadav -
both max_swapfile_size() and SWP_MIG_TOTAL_BITS may not be constant), my
current plan is to first attach the const attribute, then leave the other
optimizations for later.

If this is a super hot path, I'd probably need to do this in the reverse
order, but hopefully that's fine for this case.

Thanks,

-- 
Peter Xu



* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-04 20:39 ` [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations Peter Xu
  2022-08-04 22:40   ` Nadav Amit
  2022-08-05 12:17   ` David Hildenbrand
@ 2022-08-09  8:40   ` Huang, Ying
  2022-08-09 17:59     ` Peter Xu
  2 siblings, 1 reply; 16+ messages in thread
From: Huang, Ying @ 2022-08-09  8:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, David Hildenbrand,
	Minchan Kim, Andrew Morton, Vlastimil Babka, Nadav Amit,
	Hugh Dickins, Andi Kleen, Kirill A . Shutemov

Peter Xu <peterx@redhat.com> writes:

> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
>
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure.  Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
>
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
>
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
> bit.  Otherwise the A/D bits are dropped like before.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c        | 26 +++++++++++-
>  mm/migrate.c            |  6 ++-
>  mm/migrate_device.c     |  4 ++
>  mm/rmap.c               |  5 ++-
>  5 files changed, 128 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 1d17e4bb3d2f..34aa448ac6ee 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -8,6 +8,8 @@
>  
>  #ifdef CONFIG_MMU
>  
> +#include <linux/swapfile.h>
> +
>  /*
>   * swapcache pages are stored in the swapper_space radix tree.  We want to
>   * get good packing density in that tree, so the index should be dense in
> @@ -35,6 +37,24 @@
>  #endif
>  #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
>  
> +/**
> + * Migration swap entry specific bitfield definitions.
> + *
> + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> + *
> + * Note: these bits will be stored in migration entries iff there're enough
> + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> + * when migrating a page.  Please refer to migration_entry_supports_ad()
> + * for more information.
> + */
> +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> +
> +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
> +
>  static inline bool is_pfn_swap_entry(swp_entry_t entry);
>  
>  /* Clear all flags but only keep swp_entry_t related information */
> @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
>  	return swp_entry(SWP_MIGRATION_WRITE, offset);
>  }
>  
> +/*
> + * Returns whether the host has large enough swap offset field to support
> + * carrying over pgtable A/D bits for page migrations.  The result is
> + * pretty much arch specific.
> + */
> +static inline bool migration_entry_supports_ad(void)
> +{
> +	/*
> +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> +	 * We can support the migration A/D bits iff the pfn swap entry has
> +	 * the offset large enough to cover all of them (PFN, A & D bits).
> +	 */
> +#ifdef CONFIG_SWAP
> +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> +#else
> +	return false;
> +#endif
> +}
> +
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_entry(swp_type(entry),
> +				 swp_offset(entry) | SWP_MIG_YOUNG);
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_offset(entry) & SWP_MIG_YOUNG;
> +	/* Keep the old behavior of aging page after migration */
> +	return false;
> +}
> +
> +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_entry(swp_type(entry),
> +				 swp_offset(entry) | SWP_MIG_DIRTY);
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
> +	/* Keep the old behavior of clean page after migration */
> +	return false;
> +}
> +
>  extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>  					spinlock_t *ptl);
>  extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> @@ -311,6 +382,26 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
>  	return 0;
>  }
>  
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_young(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> +{
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
>  #endif
>  
>  typedef unsigned long pte_marker;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 29e3628687a6..7d79db4bd76a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2005,6 +2005,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	pmd_t old_pmd, _pmd;
>  	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
>  	bool anon_exclusive = false;
> +	bool dirty = false;
>  	unsigned long addr;
>  	int i;
>  
> @@ -2088,7 +2089,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		write = is_writable_migration_entry(entry);
>  		if (PageAnon(page))
>  			anon_exclusive = is_readable_exclusive_migration_entry(entry);
> -		young = false;
> +		young = is_migration_entry_young(entry);
> +		dirty = is_migration_entry_dirty(entry);
>  		soft_dirty = pmd_swp_soft_dirty(old_pmd);
>  		uffd_wp = pmd_swp_uffd_wp(old_pmd);
>  	} else {
> @@ -2097,6 +2099,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			SetPageDirty(page);
>  		write = pmd_write(old_pmd);
>  		young = pmd_young(old_pmd);
> +		dirty = pmd_dirty(old_pmd);
>  		soft_dirty = pmd_soft_dirty(old_pmd);
>  		uffd_wp = pmd_uffd_wp(old_pmd);
>  
> @@ -2146,6 +2149,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			else
>  				swp_entry = make_readable_migration_entry(
>  							page_to_pfn(page + i));
> +			if (young)
> +				swp_entry = make_migration_entry_young(swp_entry);
> +			if (dirty)
> +				swp_entry = make_migration_entry_dirty(swp_entry);
>  			entry = swp_entry_to_pte(swp_entry);
>  			if (soft_dirty)
>  				entry = pte_swp_mksoft_dirty(entry);
> @@ -2160,6 +2167,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  				entry = pte_wrprotect(entry);
>  			if (!young)
>  				entry = pte_mkold(entry);
> +			if (dirty)
> +				/*
> +				 * NOTE: this may contain setting soft
> +				 * dirty too on some archs like x86.
> +				 */

Personally, I prefer to put comments above "if (dirty)".  But you can
choose your favorite way unless it violates coding style.

> +				entry = pte_mkdirty(entry);

We didn't track the dirty flag even for normal PTEs before.  So I think we
should separate the dirty flag tracking for normal PTEs into a separate
patch.

>  			if (soft_dirty)
>  				entry = pte_mksoft_dirty(entry);
>  			if (uffd_wp)
> @@ -3148,6 +3161,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
>  	else
>  		entry = make_readable_migration_entry(page_to_pfn(page));
> +	if (pmd_young(pmdval))
> +		entry = make_migration_entry_young(entry);
> +	if (pmd_dirty(pmdval))
> +		entry = make_migration_entry_dirty(entry);
>  	pmdswp = swp_entry_to_pmd(entry);
>  	if (pmd_soft_dirty(pmdval))
>  		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
> @@ -3173,13 +3190,18 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>  
>  	entry = pmd_to_swp_entry(*pvmw->pmd);
>  	get_page(new);
> -	pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
> +	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
>  	if (pmd_swp_soft_dirty(*pvmw->pmd))
>  		pmde = pmd_mksoft_dirty(pmde);
>  	if (is_writable_migration_entry(entry))
>  		pmde = maybe_pmd_mkwrite(pmde, vma);
>  	if (pmd_swp_uffd_wp(*pvmw->pmd))
>  		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
> +	if (!is_migration_entry_young(entry))
> +		pmde = pmd_mkold(pmde);
> +	if (PageDirty(new) && is_migration_entry_dirty(entry))
> +		/* NOTE: this may contain setting soft-dirty on some archs */
> +		pmde = pmd_mkdirty(pmde);
>  
>  	if (PageAnon(new)) {
>  		rmap_t rmap_flags = RMAP_COMPOUND;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1649270bc1a7..957e8e6ee28e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -199,7 +199,7 @@ static bool remove_migration_pte(struct folio *folio,
>  #endif
>  
>  		folio_get(folio);
> -		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
> +		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
>  		if (pte_swp_soft_dirty(*pvmw.pte))
>  			pte = pte_mksoft_dirty(pte);
>  
> @@ -207,6 +207,10 @@ static bool remove_migration_pte(struct folio *folio,
>  		 * Recheck VMA as permissions can change since migration started
>  		 */
>  		entry = pte_to_swp_entry(*pvmw.pte);
> +		if (!is_migration_entry_young(entry))
> +			pte = pte_mkold(pte);
> +		if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
> +			pte = pte_mkdirty(pte);
>  		if (is_writable_migration_entry(entry))
>  			pte = maybe_mkwrite(pte, vma);
>  		else if (pte_swp_uffd_wp(*pvmw.pte))
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 7feeb447e3b9..47d90df04cc6 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			else
>  				entry = make_readable_migration_entry(
>  							page_to_pfn(page));
> +			if (pte_young(pte))
> +				entry = make_migration_entry_young(entry);
> +			if (pte_dirty(pte))
> +				entry = make_migration_entry_dirty(entry);

I don't find pte_dirty() is synced to PageDirty() as in
try_to_migrate_one().  Is it an issue in the original code?

Best Regards,
Huang, Ying

>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_present(pte)) {
>  				if (pte_soft_dirty(pte))
> diff --git a/mm/rmap.c b/mm/rmap.c
> index af775855e58f..28aef434ea41 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2065,7 +2065,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			else
>  				entry = make_readable_migration_entry(
>  							page_to_pfn(subpage));
> -
> +			if (pte_young(pteval))
> +				entry = make_migration_entry_young(entry);
> +			if (pte_dirty(pteval))
> +				entry = make_migration_entry_dirty(entry);
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-05 16:30     ` Peter Xu
@ 2022-08-09  8:45       ` Huang, Ying
  2022-08-09  8:47         ` David Hildenbrand
  2022-08-09 14:59         ` Peter Xu
  0 siblings, 2 replies; 16+ messages in thread
From: Huang, Ying @ 2022-08-09  8:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, Linux MM, linux-kernel, Andrea Arcangeli,
	David Hildenbrand, Minchan Kim, Andrew Morton, Vlastimil Babka,
	Hugh Dickins, Andi Kleen, Kirill A . Shutemov

Peter Xu <peterx@redhat.com> writes:

> On Thu, Aug 04, 2022 at 03:40:57PM -0700, Nadav Amit wrote:
>> On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:
>> > +
>> > static inline bool is_pfn_swap_entry(swp_entry_t entry);
>> > 
>> > /* Clear all flags but only keep swp_entry_t related information */
>> > @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
>> > 	return swp_entry(SWP_MIGRATION_WRITE, offset);
>> > }
>> > 
>> > +/*
>> > + * Returns whether the host has large enough swap offset field to support
>> > + * carrying over pgtable A/D bits for page migrations.  The result is
>> > + * pretty much arch specific.
>> > + */
>> > +static inline bool migration_entry_supports_ad(void)
>> > +{
>> > +	/*
>> > +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
>> > +	 * We can support the migration A/D bits iff the pfn swap entry has
>> > +	 * the offset large enough to cover all of them (PFN, A & D bits).
>> > +	 */
>> > +#ifdef CONFIG_SWAP
>> > +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
>> 
>> This is an actual function call (unless LTO has some trick). A bit of a
>> shame it cannot be at least memoized.
>> 
>> Or at least mark max_swapfile_size() as __attribute_const__ so it would not
>> be called twice for make_migration_entry_young() and
>> make_migration_entry_dirty().
>
> I didn't take too much effort on this one since we're on swap path and I
> assumed that's not a super hot path.  But __attribute_const__ sounds good
> and easy to get, thanks.
>
> Perhaps I should mark it on migration_entry_supports_ad() as a whole?  Note
> that unfortunately SWP_MIG_TOTAL_BITS may not be a const either (see how
> that define roots back to MAX_PHYSMEM_BITS, where on x86_64 it needs to
> check 5-lvl).

I think it's possible to memoize max_swapfile_size() or
migration_entry_supports_ad().  Although they are not constants, they do
not change after initialization.  The challenge is to find a clean way to
initialize it.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-09  8:45       ` Huang, Ying
@ 2022-08-09  8:47         ` David Hildenbrand
  2022-08-09 14:59         ` Peter Xu
  1 sibling, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2022-08-09  8:47 UTC (permalink / raw)
  To: Huang, Ying, Peter Xu
  Cc: Nadav Amit, Linux MM, linux-kernel, Andrea Arcangeli,
	Minchan Kim, Andrew Morton, Vlastimil Babka, Hugh Dickins,
	Andi Kleen, Kirill A . Shutemov

On 09.08.22 10:45, Huang, Ying wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Thu, Aug 04, 2022 at 03:40:57PM -0700, Nadav Amit wrote:
>>> On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:
>>>> +
>>>> static inline bool is_pfn_swap_entry(swp_entry_t entry);
>>>>
>>>> /* Clear all flags but only keep swp_entry_t related information */
>>>> @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
>>>> 	return swp_entry(SWP_MIGRATION_WRITE, offset);
>>>> }
>>>>
>>>> +/*
>>>> + * Returns whether the host has large enough swap offset field to support
>>>> + * carrying over pgtable A/D bits for page migrations.  The result is
>>>> + * pretty much arch specific.
>>>> + */
>>>> +static inline bool migration_entry_supports_ad(void)
>>>> +{
>>>> +	/*
>>>> +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
>>>> +	 * We can support the migration A/D bits iff the pfn swap entry has
>>>> +	 * the offset large enough to cover all of them (PFN, A & D bits).
>>>> +	 */
>>>> +#ifdef CONFIG_SWAP
>>>> +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
>>>
>>> This is an actual function call (unless LTO has some trick). A bit of a
>>> shame it cannot be at least memoized.
>>>
>>> Or at least mark max_swapfile_size() as __attribute_const__ so it would not
>>> be called twice for make_migration_entry_young() and
>>> make_migration_entry_dirty().
>>
>> I didn't take too much effort on this one since we're on swap path and I
>> assumed that's not a super hot path.  But __attribute_const__ sounds good
>> and easy to get, thanks.
>>
>> Perhaps I should mark it on migration_entry_supports_ad() as a whole?  Note
>> that unfortunately SWP_MIG_TOTAL_BITS may not be a const either (see how
>> that define roots back to MAX_PHYSMEM_BITS, where on x86_64 it needs to
>> check 5-lvl).
> 
> I think it's possible to memoize max_swapfile_size() or
> migration_entry_supports_ad().  Although they are not constants, they do
> not change after initialization.  The challenge is to find a clean way to
> initialize it.

We could max_swapfile_size()->__max_swapfile_size()

and then simply have a new max_swapfile_size() that caches the value in
a static variable. If that turns out to be worth the trouble.
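
For illustration, a minimal sketch of that caching idea (the
__max_swapfile_size() rename and the static variable are assumptions of
this sketch, not an actual upstream change):

/* mm/swapfile.c -- sketch only; __max_swapfile_size() is the renamed arch-aware code */
static unsigned long cached_max_swapfile_size;

unsigned long max_swapfile_size(void)
{
	/*
	 * One-time, race-tolerant initialization: concurrent callers would
	 * simply compute and store the same value, so no locking is needed.
	 */
	if (unlikely(!cached_max_swapfile_size))
		cached_max_swapfile_size = __max_swapfile_size();

	return cached_max_swapfile_size;
}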

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-09  8:45       ` Huang, Ying
  2022-08-09  8:47         ` David Hildenbrand
@ 2022-08-09 14:59         ` Peter Xu
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Xu @ 2022-08-09 14:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Nadav Amit, Linux MM, linux-kernel, Andrea Arcangeli,
	David Hildenbrand, Minchan Kim, Andrew Morton, Vlastimil Babka,
	Hugh Dickins, Andi Kleen, Kirill A . Shutemov

On Tue, Aug 09, 2022 at 04:45:32PM +0800, Huang, Ying wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, Aug 04, 2022 at 03:40:57PM -0700, Nadav Amit wrote:
> >> On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:
> >> > +
> >> > static inline bool is_pfn_swap_entry(swp_entry_t entry);
> >> > 
> >> > /* Clear all flags but only keep swp_entry_t related information */
> >> > @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> >> > 	return swp_entry(SWP_MIGRATION_WRITE, offset);
> >> > }
> >> > 
> >> > +/*
> >> > + * Returns whether the host has large enough swap offset field to support
> >> > + * carrying over pgtable A/D bits for page migrations.  The result is
> >> > + * pretty much arch specific.
> >> > + */
> >> > +static inline bool migration_entry_supports_ad(void)
> >> > +{
> >> > +	/*
> >> > +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> >> > +	 * We can support the migration A/D bits iff the pfn swap entry has
> >> > +	 * the offset large enough to cover all of them (PFN, A & D bits).
> >> > +	 */
> >> > +#ifdef CONFIG_SWAP
> >> > +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> >> 
> >> This is an actual function call (unless LTO has some trick). A bit of a
> >> shame it cannot be at least memoized.
> >> 
> >> Or at least mark max_swapfile_size() as __attribute_const__ so it would not
> >> be called twice for make_migration_entry_young() and
> >> make_migration_entry_dirty().
> >
> > I didn't take too much effort on this one since we're on swap path and I
> > assumed that's not a super hot path.  But __attribute_const__ sounds good
> > and easy to get, thanks.
> >
> > Perhaps I should mark it on migration_entry_supports_ad() as a whole?  Note
> > that unfortunately SWP_MIG_TOTAL_BITS may not be a const either (see how
> > that define roots back to MAX_PHYSMEM_BITS, where on x86_64 it needs to
> > check 5-lvl).
> 
> I think it's possible to memoize max_swapfile_size() or
> migration_entry_supports_ad().  Although they are not constants, they do
> not change after initialization.  The challenge is to find a clean way to
> initialize it.

I checked it today; the conclusion is that I think it's safe to initialize
the migration A/D bits support in swapfile_init(), and I'll do that in the
next version.

Longer proof material below.

The generic max_swapfile_size() is pretty much a constant except on x86.
x86 has two dependencies: (1) X86_BUG_L1TF, and (2) l1tf_mitigation.

The other challenge is the reference to MAX_PHYSMEM_BITS, which is also
only complicated on x86 with 5-level page tables.

Luckily the CPU bits are all set up within early_cpu_init().  The setup of
the l1tf_mitigation variable happens later, but still earlier than most of
the initcalls (swapfile_init() belongs to initcall level 4).

A full graph for reference:

- start_kernel
  - setup_arch
    - early_cpu_init
      - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
      - early_identify_cpu --> setup X86_BUG_L1TF
      - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
  - parse_early_param
    - l1tf_cmdline --> set l1tf_mitigation
  - check_bugs
    - l1tf_select_mitigation --> set l1tf_mitigation
  - arch_call_rest_init
    - rest_init
      - kernel_init
        - kernel_init_freeable
          - do_basic_setup
            - do_initcalls --> calls swapfile_init() (initcall level 4)

I'll add one (or maybe more than one) patch on top in the next post to
optimize the fetching of these states.
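
For reference, one possible shape of such a follow-up (the names below are
placeholders for this sketch; it is shown as a standalone initcall even
though the plan above is to hook into swapfile_init() itself):

/* Sketch: settle the migration A/D capability once, at initcall level 4. */
static bool swap_migration_ad_supported __read_mostly;

static int __init swap_migration_ad_init(void)
{
	/*
	 * By initcall level 4, X86_BUG_L1TF and l1tf_mitigation have been
	 * settled (see the graph above), so max_swapfile_size() is stable.
	 */
	swap_migration_ad_supported =
		max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
	return 0;
}
subsys_initcall(swap_migration_ad_init);

migration_entry_supports_ad() could then simply return the cached flag
instead of calling max_swapfile_size() on every use.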

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-09  8:40   ` Huang, Ying
@ 2022-08-09 17:59     ` Peter Xu
  2022-08-10  0:53       ` Huang, Ying
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2022-08-09 17:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, David Hildenbrand,
	Minchan Kim, Andrew Morton, Vlastimil Babka, Nadav Amit,
	Hugh Dickins, Andi Kleen, Kirill A . Shutemov

On Tue, Aug 09, 2022 at 04:40:12PM +0800, Huang, Ying wrote:
> > @@ -2160,6 +2167,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  				entry = pte_wrprotect(entry);
> >  			if (!young)
> >  				entry = pte_mkold(entry);
> > +			if (dirty)
> > +				/*
> > +				 * NOTE: this may contain setting soft
> > +				 * dirty too on some archs like x86.
> > +				 */
> 
> Personally, I prefer to put comments above "if (dirty)".  But you can
> choose your favorite way unless it violates coding style.

Sure.

> 
> > +				entry = pte_mkdirty(entry);
> 
> We didn't track the dirty flag even for normal PTEs before.  So I think we
> should separate the dirty flag tracking for normal PTEs into a separate
> patch.

It's kinda convenient to touch that up, but for sure I can split that into
a tiny but separate patch too.

[...]

> I don't find pte_dirty() is synced to PageDirty() as in
> try_to_migrate_one().  Is it an issue in the original code?

I think it has?  There is:

		/* Set the dirty flag on the folio now the pte is gone. */
		if (pte_dirty(pteval))
			folio_mark_dirty(folio);

?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-09 17:59     ` Peter Xu
@ 2022-08-10  0:53       ` Huang, Ying
  2022-08-10 19:21         ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Ying @ 2022-08-10  0:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, David Hildenbrand,
	Minchan Kim, Andrew Morton, Vlastimil Babka, Nadav Amit,
	Hugh Dickins, Andi Kleen, Kirill A . Shutemov, Christoph Hellwig

Peter Xu <peterx@redhat.com> writes:

> On Tue, Aug 09, 2022 at 04:40:12PM +0800, Huang, Ying wrote:
[snip]
>
>> I don't find pte_dirty() is synced to PageDirty() as in
>> try_to_migrate_one().  Is it an issue in the original code?
>
> I think it has?  There is:
>
> 		/* Set the dirty flag on the folio now the pte is gone. */
> 		if (pte_dirty(pteval))
> 			folio_mark_dirty(folio);
>

Sorry, my original words are confusing.  Yes, there's dirty bit syncing
in try_to_migrate_one().  But I don't find that in migrate_device.c

 $ grep dirty mm/migrate_device.c
				if (pte_soft_dirty(pte))
					swp_pte = pte_swp_mksoft_dirty(swp_pte);
				if (pte_swp_soft_dirty(pte))
					swp_pte = pte_swp_mksoft_dirty(swp_pte);
			entry = pte_mkwrite(pte_mkdirty(entry));

I guess that migrate_device.c is used to migrate between CPU-visible
pages and CPU-invisible (device-visible) pages, so the rule is different?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-10  0:53       ` Huang, Ying
@ 2022-08-10 19:21         ` Peter Xu
  2022-08-11  5:44           ` Alistair Popple
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2022-08-10 19:21 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, David Hildenbrand,
	Minchan Kim, Andrew Morton, Vlastimil Babka, Nadav Amit,
	Hugh Dickins, Andi Kleen, Kirill A . Shutemov, Christoph Hellwig,
	Alistair Popple, Jason Gunthorpe

On Wed, Aug 10, 2022 at 08:53:49AM +0800, Huang, Ying wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Aug 09, 2022 at 04:40:12PM +0800, Huang, Ying wrote:
> [snip]
> >
> >> I don't find pte_dirty() is synced to PageDirty() as in
> >> try_to_migrate_one().  Is it an issue in the original code?
> >
> > I think it has?  There is:
> >
> > 		/* Set the dirty flag on the folio now the pte is gone. */
> > 		if (pte_dirty(pteval))
> > 			folio_mark_dirty(folio);
> >
> 
> Sorry, my original words are confusing.  Yes, there's dirty bit syncing
> in try_to_migrate_one().  But I don't find that in migrate_device.c
> 
>  $ grep dirty mm/migrate_device.c
> 				if (pte_soft_dirty(pte))
> 					swp_pte = pte_swp_mksoft_dirty(swp_pte);
> 				if (pte_swp_soft_dirty(pte))
> 					swp_pte = pte_swp_mksoft_dirty(swp_pte);
> 			entry = pte_mkwrite(pte_mkdirty(entry));
> 
> I guess that migrate_device.c is used to migrate between CPU-visible
> pages and CPU-invisible (device-visible) pages, so the rule is different?

IIUC migrate_vma_collect() handles migrations for both directions (RAM <->
device mem).

Yeah, indeed I also didn't see how migrate_vma_collect_pmd() handles the
carry-over of pte dirty to page dirty, which looks a bit odd.  I also don't
see why the dirty bit doesn't need to be maintained, e.g. when a page was
previously dirty, after a ram->dev->ram migration it seems to end up clean
with the current code.

Another scenario: even if the page was clean, the page could be migrated
to device mem, the device could DMA into the page, and then the page could
be migrated back to RAM.  I also didn't see how we could detect the DMAs
and set the pte/page dirty properly after migrating back.

Copy Alistair and Jason..

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations
  2022-08-10 19:21         ` Peter Xu
@ 2022-08-11  5:44           ` Alistair Popple
  0 siblings, 0 replies; 16+ messages in thread
From: Alistair Popple @ 2022-08-11  5:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: Huang, Ying, linux-mm, linux-kernel, Andrea Arcangeli,
	David Hildenbrand, Minchan Kim, Andrew Morton, Vlastimil Babka,
	Nadav Amit, Hugh Dickins, Andi Kleen, Kirill A . Shutemov,
	Christoph Hellwig, Jason Gunthorpe


Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 10, 2022 at 08:53:49AM +0800, Huang, Ying wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Tue, Aug 09, 2022 at 04:40:12PM +0800, Huang, Ying wrote:
>> [snip]
>> >
>> >> I don't find pte_dirty() is synced to PageDirty() as in
>> >> try_to_migrate_one().  Is it an issue in the original code?
>> >
>> > I think it has?  There is:
>> >
>> > 		/* Set the dirty flag on the folio now the pte is gone. */
>> > 		if (pte_dirty(pteval))
>> > 			folio_mark_dirty(folio);
>> >
>>
>> Sorry, my original words are confusing.  Yes, there's dirty bit syncing
>> in try_to_migrate_one().  But I don't find that in migrate_device.c
>>
>>  $ grep dirty mm/migrate_device.c
>> 				if (pte_soft_dirty(pte))
>> 					swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> 				if (pte_swp_soft_dirty(pte))
>> 					swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> 			entry = pte_mkwrite(pte_mkdirty(entry));
>>
>> I guess that migrate_device.c is used to migrate between CPU-visible
>> pages and CPU-invisible (device-visible) pages, so the rule is different?
>
> IIUC migrate_vma_collect() handles migrations for both directions (RAM <->
> device mem).

That's correct.

> Yeah, indeed I also didn't see how migrate_vma_collect_pmd() handles the
> carry-over of pte dirty to page dirty, which looks a bit odd.  I also don't
> see why the dirty bit doesn't need to be maintained, e.g. when a page was
> previously dirty, after a ram->dev->ram migration it seems to end up clean
> with the current code.

That's a bug - it does need to be maintained. migrate_vma_*() currently
only works with anonymous private mappings. We could still lose data if
we attempt (but fail) to migrate a page that has been swapped in from
disk though, depending on the precise sequence.

Will post a fix for this, thanks for pointing it out.
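
Roughly, the idea would be to carry the pte dirty bit over to the folio in
migrate_vma_collect_pmd(), along these lines (a sketch only; the exact
placement and helpers may differ in the actual fix):

	/* after clearing the present pte, before installing the migration entry */
	pte = ptep_get_and_clear(mm, addr, ptep);
	if (pte_dirty(pte))
		folio_mark_dirty(page_folio(page));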

> Another scenario: even if the page was clean, the page could be migrated
> to device mem, the device could DMA into the page, and then the page could
> be migrated back to RAM.  I also didn't see how we could detect the DMAs
> and set the pte/page dirty properly after migrating back.

That would be up to the driver, unless we assume the page is always
dirty, which is probably not a bad default. In practice I don't think
this will currently be a problem, as any pages migrated to the device
won't have pages allocated in swap and this only works with private
anonymous mappings. But I think we should fix it anyway, so I will include
it in the fix.

> Copy Alistair and Jason..

Thanks. I will take a look at this series too, but probably won't get to
it until next week.

 - Alistair

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-08-11  6:02 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-04 20:39 [PATCH v2 0/2] mm: Remember a/d bits for migration entries Peter Xu
2022-08-04 20:39 ` [PATCH v2 1/2] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
2022-08-04 20:39 ` [PATCH v2 2/2] mm: Remember young/dirty bit for page migrations Peter Xu
2022-08-04 22:40   ` Nadav Amit
2022-08-05 16:30     ` Peter Xu
2022-08-09  8:45       ` Huang, Ying
2022-08-09  8:47         ` David Hildenbrand
2022-08-09 14:59         ` Peter Xu
2022-08-05 12:17   ` David Hildenbrand
2022-08-05 16:36     ` Peter Xu
2022-08-09  8:40   ` Huang, Ying
2022-08-09 17:59     ` Peter Xu
2022-08-10  0:53       ` Huang, Ying
2022-08-10 19:21         ` Peter Xu
2022-08-11  5:44           ` Alistair Popple
2022-08-04 22:17 ` [PATCH v2 0/2] mm: Remember a/d bits for migration entries Nadav Amit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).