linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/7] mm: Remember a/d bits for migration entries
@ 2022-08-11 16:13 Peter Xu
  2022-08-11 16:13 ` [PATCH v4 1/7] mm/x86: Use SWP_TYPE_BITS in 3-level swap macros Peter Xu
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Peter Xu @ 2022-08-11 16:13 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Kirill A . Shutemov, Alistair Popple, peterx,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka

v4:
- Added r-bs for Ying
- Some cosmetic changes here and there [Ying]
- Fix smaps to only dump PFN for pfn swap entries for both pte/pmd [Ying]
- Remove max_swapfile_size(), export swapfile_maximum_size variable [Ying]
- In migrate_vma_collect_pmd() only read A/D if pte_present()

rfc: https://lore.kernel.org/all/20220729014041.21292-1-peterx@redhat.com
v1:  https://lore.kernel.org/all/20220803012159.36551-1-peterx@redhat.com
v2:  https://lore.kernel.org/all/20220804203952.53665-1-peterx@redhat.com
v3:  https://lore.kernel.org/all/20220809220100.20033-1-peterx@redhat.com

Problem
=======

When migrating a page, right now we always mark the migrated page as old &
clean.

However, that can lead to at least two problems:

  (1) We lose the real hot/cold information that we could have persisted.
      That information shouldn't change even if the backing page is changed
      after the migration.

  (2) There is always extra overhead on the first access to any migrated
      page, because the hardware MMU needs cycles to set the young bit
      again for reads, and the dirty bit for writes, as long as the
      hardware MMU supports these bits.

Several recent upstream works showed that (2) is not trivial and is
actually very measurable.  In my test case, reading a 1G chunk of memory -
jumping in page-size intervals - took 99ms on a generic x86_64 system just
because of the extra work of setting the young bit, compared to 4ms when
the young bit was already set.

This issue was originally reported by Andrea Arcangeli.

Solution
========

To solve this problem, this patchset tries to remember the young/dirty bits
in the migration entries and carry them over when recovering the ptes.

We have the chance to do so because on many systems the swap offset is not
really fully used.  Migration entries use the swp offset to store the PFN
only, and the PFN normally needs fewer bits than the swp offset provides.
That means we have some free bits in the swp offset that we can use to
store things like the A/D bits, and that's how this series approaches the
problem.

max_swapfile_size() is used here to detect the per-arch width of the
offset field in swp entries.  We automatically remember the A/D bits when
we find that the swp offset field is wide enough to keep both the PFN and
the extra bits.
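
To make this concrete, the migration entry layout below is taken from
patch 5, and the snippet that follows mirrors the helpers introduced
there (the layout and the SWP_MIG_* names are from patch 5 itself; the
snippet is only an illustration of how one of the bits is set and read
back, not new code):

  |----------+--------------------|
  | swp_type | swp_offset         |
  |----------+--------+-+-+-------|
  |          | resv   |D|A|  PFN  |
  |----------+--------+-+-+-------|

  #define SWP_MIG_YOUNG_BIT	(SWP_PFN_BITS)
  #define SWP_MIG_YOUNG		BIT(SWP_MIG_YOUNG_BIT)

  /* Encode: remember the young bit in the free offset bits (patch 5) */
  static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
  {
	if (migration_entry_supports_ad())
		return swp_entry(swp_type(entry),
				 swp_offset(entry) | SWP_MIG_YOUNG);
	return entry;
  }

  /* Decode: recover the young bit when removing the migration entry */
  static inline bool is_migration_entry_young(swp_entry_t entry)
  {
	if (migration_entry_supports_ad())
		return swp_offset(entry) & SWP_MIG_YOUNG;
	/* Without A/D support, keep the old "always old" behavior */
	return false;
  }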

Since max_swapfile_size() can be slow, the last two patches cache its
result, and also cache whether migration A/D bits are supported as a whole
(swap_migration_ad_supported).

Known Issues / TODOs
====================

We still haven't taught madvise() to recognize the new A/D bits in
migration entries, namely for MADV_COLD/MADV_FREE.  E.g. when MADV_COLD
hits a migration entry, it's not clear yet whether we should clear the A
bit or just drop the entry directly.

We didn't teach idle page tracking about the new migration entries,
because that would need a larger rework of the rmap pgtable walk in the
tree.  However, things should already be better: before this patchset a
page was always old after migration, so the series fixes a potential false
negative in idle page tracking when pages were migrated before being
observed.

The other thing is that migration A/D bits will not yet work for private
device swap entries.  The code is there for completeness, but since
private device swap entries do not yet have fields to store the A/D bits,
even if we persist the A/D bits when a present pte is switched to a
migration entry, we'll lose them again when the migration entry is
converted to a private device swap entry.

Tests
=====

After the patchset is applied, the immediate read access test [1] on the
above 1G chunk shrinks from 99ms to 4ms after migration.  The test is done
by moving the 1G of pages from node 0->1->0 and then reading the memory in
page-size jumps.  The test machine uses an Intel(R) Xeon(R) CPU E5-2630 v4
@ 2.20GHz.

A similar effect can also be measured when writing the memory for the
first time after migration.

With the patchset applied, both the initial read and the initial write
after a page is migrated perform similarly to before the migration
happened.
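
For reference, a minimal sketch of the measurement idea is below.  The
real test program is swap-young.c from [1]; the snippet here is only an
assumed simplification of it (buffer sizes, node numbers and error
handling are all illustrative): populate a 1G buffer, bounce it between
NUMA nodes with move_pages(), then time a one-byte-per-page read pass.

  /* build with: gcc -O2 -o mig-read mig-read.c -lnuma */
  #include <numaif.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  #define CHUNK	(1UL << 30)	/* 1G */

  static double read_pass(volatile char *buf, size_t len, size_t step)
  {
	struct timespec t0, t1;
	size_t off;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (off = 0; off < len; off += step)
		(void)buf[off];		/* touch one byte per page */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6;
  }

  static void migrate_to(char *buf, size_t len, size_t psize, int node)
  {
	unsigned long npages = len / psize, i;
	void **pages = malloc(npages * sizeof(void *));
	int *nodes = malloc(npages * sizeof(int));
	int *status = malloc(npages * sizeof(int));

	for (i = 0; i < npages; i++) {
		pages[i] = buf + i * psize;
		nodes[i] = node;
	}
	/* pid == 0 means "move pages of the calling process" */
	move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
	free(pages);
	free(nodes);
	free(status);
  }

  int main(void)
  {
	size_t psize = sysconf(_SC_PAGESIZE);
	char *buf = aligned_alloc(psize, CHUNK);

	memset(buf, 1, CHUNK);			/* populate on local node */
	migrate_to(buf, CHUNK, psize, 1);	/* node 0 -> 1 */
	migrate_to(buf, CHUNK, psize, 0);	/* node 1 -> 0 */
	printf("read after migration: %.2f ms\n",
	       read_pass(buf, CHUNK, psize));
	return 0;
  }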

Patch Layout
============

Patch 1-2:  Cleanups from either previous versions or on swapops.h macros.

Patch 3-4:  Prepare for the introduction of migration A/D bits.

Patch 5:    The core patch to remember young/dirty bit in swap offsets.

Patch 6-7:  Cache relevant fields to make migration_entry_supports_ad() fast.

Please review, thanks.

[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c

Peter Xu (7):
  mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
  mm/swap: Comment all the ifdef in swapops.h
  mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  mm/thp: Carry over dirty bit when thp splits on pmd
  mm: Remember young/dirty bit for page migrations
  mm/swap: Cache maximum swapfile size when init swap
  mm/swap: Cache swap migration A/D bits support

 arch/arm64/mm/hugetlbpage.c           |   2 +-
 arch/x86/include/asm/pgtable-3level.h |   8 +-
 arch/x86/mm/init.c                    |   2 +-
 fs/proc/task_mmu.c                    |  20 +++-
 include/linux/swapfile.h              |   5 +-
 include/linux/swapops.h               | 145 +++++++++++++++++++++++---
 mm/hmm.c                              |   2 +-
 mm/huge_memory.c                      |  27 ++++-
 mm/memory-failure.c                   |   2 +-
 mm/migrate.c                          |   6 +-
 mm/migrate_device.c                   |   6 ++
 mm/page_vma_mapped.c                  |   6 +-
 mm/rmap.c                             |   5 +-
 mm/swapfile.c                         |  15 ++-
 14 files changed, 214 insertions(+), 37 deletions(-)

-- 
2.32.0



* [PATCH v4 1/7] mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
  2022-08-11 16:13 [PATCH v4 0/7] mm: Remember a/d bits for migration entries Peter Xu
@ 2022-08-11 16:13 ` Peter Xu
  2022-08-11 16:13 ` [PATCH v4 2/7] mm/swap: Comment all the ifdef in swapops.h Peter Xu
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2022-08-11 16:13 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Kirill A . Shutemov, Alistair Popple, peterx,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka

Replace all the magic "5"s with the SWP_TYPE_BITS macro.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable-3level.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24..28421a887209 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -256,10 +256,10 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
 /* We always extract/encode the offset by shifting it all the way up, and then down again */
 #define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
 
-#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
-#define __swp_type(x)			(((x).val) & 0x1f)
-#define __swp_offset(x)			((x).val >> 5)
-#define __swp_entry(type, offset)	((swp_entry_t){(type) | (offset) << 5})
+#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
+#define __swp_type(x)			(((x).val) & ((1UL << SWP_TYPE_BITS) - 1))
+#define __swp_offset(x)			((x).val >> SWP_TYPE_BITS)
+#define __swp_entry(type, offset)	((swp_entry_t){(type) | (offset) << SWP_TYPE_BITS})
 
 /*
  * Normally, __swp_entry() converts from arch-independent swp_entry_t to
-- 
2.32.0



* [PATCH v4 2/7] mm/swap: Comment all the ifdef in swapops.h
  2022-08-11 16:13 [PATCH v4 0/7] mm: Remember a/d bits for migration entries Peter Xu
  2022-08-11 16:13 ` [PATCH v4 1/7] mm/x86: Use SWP_TYPE_BITS in 3-level swap macros Peter Xu
@ 2022-08-11 16:13 ` Peter Xu
  2022-08-15  6:03   ` Alistair Popple
  2022-08-11 16:13 ` [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2022-08-11 16:13 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Kirill A . Shutemov, Alistair Popple, peterx,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka

swapops.h contains quite a few layers of ifdefs, and some of the "else"
and "endif" lines don't get a proper comment on the macro, so it's hard to
follow what they are referring to.  Add the comments.

Suggested-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index a3d435bf9f97..3a2901ff4f1e 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -247,8 +247,8 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 #ifdef CONFIG_HUGETLB_PAGE
 extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
 extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
-#endif
-#else
+#endif	/* CONFIG_HUGETLB_PAGE */
+#else  /* CONFIG_MIGRATION */
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
 	return swp_entry(0, 0);
@@ -276,7 +276,7 @@ static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 #ifdef CONFIG_HUGETLB_PAGE
 static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
 static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
-#endif
+#endif	/* CONFIG_HUGETLB_PAGE */
 static inline int is_writable_migration_entry(swp_entry_t entry)
 {
 	return 0;
@@ -286,7 +286,7 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
 	return 0;
 }
 
-#endif
+#endif	/* CONFIG_MIGRATION */
 
 typedef unsigned long pte_marker;
 
@@ -426,7 +426,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 {
 	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
 }
-#else
+#else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		struct page *page)
 {
@@ -455,7 +455,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 {
 	return 0;
 }
-#endif
+#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
 
 #ifdef CONFIG_MEMORY_FAILURE
 
@@ -495,7 +495,7 @@ static inline void num_poisoned_pages_sub(long i)
 	atomic_long_sub(i, &num_poisoned_pages);
 }
 
-#else
+#else  /* CONFIG_MEMORY_FAILURE */
 
 static inline swp_entry_t make_hwpoison_entry(struct page *page)
 {
@@ -514,7 +514,7 @@ static inline void num_poisoned_pages_inc(void)
 static inline void num_poisoned_pages_sub(long i)
 {
 }
-#endif
+#endif  /* CONFIG_MEMORY_FAILURE */
 
 static inline int non_swap_entry(swp_entry_t entry)
 {
-- 
2.32.0



* [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  2022-08-11 16:13 [PATCH v4 0/7] mm: Remember a/d bits for migration entries Peter Xu
  2022-08-11 16:13 ` [PATCH v4 1/7] mm/x86: Use SWP_TYPE_BITS in 3-level swap macros Peter Xu
  2022-08-11 16:13 ` [PATCH v4 2/7] mm/swap: Comment all the ifdef in swapops.h Peter Xu
@ 2022-08-11 16:13 ` Peter Xu
  2022-08-12  2:33   ` Huang, Ying
  2022-08-11 16:13 ` [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd Peter Xu
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2022-08-11 16:13 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Kirill A . Shutemov, Alistair Popple, peterx,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka

We've got a bunch of special swap entries that store a PFN inside the swap
offset field.  To fetch the PFN, normally the user just calls swp_offset()
assuming that'll be the PFN.

Add a helper swp_offset_pfn() to fetch the PFN instead, masking off only
the maximum possible width of a PFN on the host, meanwhile doing a proper
check against MAX_PHYSMEM_BITS to make sure the swap offset can always
store the PFNs properly, using the BUILD_BUG_ON() in is_pfn_swap_entry().

One reason to do so is that we never tried to sanitize whether the swap
offset can really fit a PFN.  In the meantime, this patch also prepares us
for the future possibility of storing more information inside the swp
offset field, so assuming "swp_offset(entry)" to be the PFN will very soon
no longer hold.

Replace many of the swp_offset() callers with swp_offset_pfn() where
appropriate.  Note that many of the existing users are not candidates for
the replacement, e.g.:

  (1) When the swap entry is not a pfn swap entry at all, or,
  (2) when we want to keep the whole swp_offset but only change the swp type.

For the latter, it can happen when fork() is triggered on a write-migration
swap entry pte: we may want to only change the migration type from
write->read but keep the rest, so it's not "fetching the PFN" but "changing
the swap type only".  They're left aside so that when there's more
information within the swp offset it'll be carried over naturally in those
cases.

While at it, drop hwpoison_entry_to_pfn() because that's exactly what the
new swp_offset_pfn() is about.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/mm/hugetlbpage.c |  2 +-
 fs/proc/task_mmu.c          | 20 +++++++++++++++++---
 include/linux/swapops.h     | 35 +++++++++++++++++++++++++++++------
 mm/hmm.c                    |  2 +-
 mm/memory-failure.c         |  2 +-
 mm/page_vma_mapped.c        |  6 +++---
 6 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 0795028f017c..35e9a468d13e 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -245,7 +245,7 @@ static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
 {
 	VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
 
-	return page_folio(pfn_to_page(swp_offset(entry)));
+	return page_folio(pfn_to_page(swp_offset_pfn(entry)));
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d56c65f98d00..b3e79128fca0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1419,9 +1419,19 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (pte_swp_uffd_wp(pte))
 			flags |= PM_UFFD_WP;
 		entry = pte_to_swp_entry(pte);
-		if (pm->show_pfn)
+		if (pm->show_pfn) {
+			pgoff_t offset;
+			/*
+			 * For PFN swap offsets, keeping the offset field
+			 * to be PFN only to be compatible with old smaps.
+			 */
+			if (is_pfn_swap_entry(entry))
+				offset = swp_offset_pfn(entry);
+			else
+				offset = swp_offset(entry);
 			frame = swp_type(entry) |
-				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+			    (offset << MAX_SWAPFILES_SHIFT);
+		}
 		flags |= PM_SWAP;
 		migration = is_migration_entry(entry);
 		if (is_pfn_swap_entry(entry))
@@ -1478,7 +1488,11 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			unsigned long offset;
 
 			if (pm->show_pfn) {
-				offset = swp_offset(entry) +
+				if (is_pfn_swap_entry(entry))
+					offset = swp_offset_pfn(entry);
+				else
+					offset = swp_offset(entry);
+				offset = offset +
 					((addr & ~PMD_MASK) >> PAGE_SHIFT);
 				frame = swp_type(entry) |
 					(offset << MAX_SWAPFILES_SHIFT);
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 3a2901ff4f1e..bd4c6f0c2103 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -23,6 +23,20 @@
 #define SWP_TYPE_SHIFT	(BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
 #define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)
 
+/*
+ * Definitions only for PFN swap entries (see is_pfn_swap_entry()).  To
+ * store PFN, we only need SWP_PFN_BITS bits.  Each of the pfn swap entries
+ * can use the extra bits to store other information besides PFN.
+ */
+#ifdef MAX_PHYSMEM_BITS
+#define SWP_PFN_BITS			(MAX_PHYSMEM_BITS - PAGE_SHIFT)
+#else  /* MAX_PHYSMEM_BITS */
+#define SWP_PFN_BITS			(BITS_PER_LONG - PAGE_SHIFT)
+#endif	/* MAX_PHYSMEM_BITS */
+#define SWP_PFN_MASK			(BIT(SWP_PFN_BITS) - 1)
+
+static inline bool is_pfn_swap_entry(swp_entry_t entry);
+
 /* Clear all flags but only keep swp_entry_t related information */
 static inline pte_t pte_swp_clear_flags(pte_t pte)
 {
@@ -64,6 +78,17 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
 	return entry.val & SWP_OFFSET_MASK;
 }
 
+/*
+ * This should only be called upon a pfn swap entry to get the PFN stored
+ * in the swap entry.  Please refers to is_pfn_swap_entry() for definition
+ * of pfn swap entry.
+ */
+static inline unsigned long swp_offset_pfn(swp_entry_t entry)
+{
+	VM_BUG_ON(!is_pfn_swap_entry(entry));
+	return swp_offset(entry) & SWP_PFN_MASK;
+}
+
 /* check whether a pte points to a swap entry */
 static inline int is_swap_pte(pte_t pte)
 {
@@ -369,7 +394,7 @@ static inline int pte_none_mostly(pte_t pte)
 
 static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 {
-	struct page *p = pfn_to_page(swp_offset(entry));
+	struct page *p = pfn_to_page(swp_offset_pfn(entry));
 
 	/*
 	 * Any use of migration entries may only occur while the
@@ -387,6 +412,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
  */
 static inline bool is_pfn_swap_entry(swp_entry_t entry)
 {
+	/* Make sure the swp offset can always store the needed fields */
+	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
+
 	return is_migration_entry(entry) || is_device_private_entry(entry) ||
 	       is_device_exclusive_entry(entry);
 }
@@ -475,11 +503,6 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
-static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
-{
-	return swp_offset(entry);
-}
-
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index f2aa63b94d9b..3850fb625dda 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -253,7 +253,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			cpu_flags = HMM_PFN_VALID;
 			if (is_writable_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
-			*hmm_pfn = swp_offset(entry) | cpu_flags;
+			*hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
 			return 0;
 		}
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0dfed9d7b273..e48f6f6a259d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -632,7 +632,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 		swp_entry_t swp = pte_to_swp_entry(pte);
 
 		if (is_hwpoison_entry(swp))
-			pfn = hwpoison_entry_to_pfn(swp);
+			pfn = swp_offset_pfn(swp);
 	}
 
 	if (!pfn || pfn != poisoned_pfn)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8e9e574d535a..93e13fc17d3c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -86,7 +86,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		    !is_device_exclusive_entry(entry))
 			return false;
 
-		pfn = swp_offset(entry);
+		pfn = swp_offset_pfn(entry);
 	} else if (is_swap_pte(*pvmw->pte)) {
 		swp_entry_t entry;
 
@@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		    !is_device_exclusive_entry(entry))
 			return false;
 
-		pfn = swp_offset(entry);
+		pfn = swp_offset_pfn(entry);
 	} else {
 		if (!pte_present(*pvmw->pte))
 			return false;
@@ -221,7 +221,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 					return not_found(pvmw);
 				entry = pmd_to_swp_entry(pmde);
 				if (!is_migration_entry(entry) ||
-				    !check_pmd(swp_offset(entry), pvmw))
+				    !check_pmd(swp_offset_pfn(entry), pvmw))
 					return not_found(pvmw);
 				return true;
 			}
-- 
2.32.0



* [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd
  2022-08-11 16:13 [PATCH v4 0/7] mm: Remember a/d bits for migration entries Peter Xu
                   ` (2 preceding siblings ...)
  2022-08-11 16:13 ` [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
@ 2022-08-11 16:13 ` Peter Xu
  2022-10-21 16:06   ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) Anatoly Pugachev
  2022-08-11 16:13 ` [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations Peter Xu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2022-08-11 16:13 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Kirill A . Shutemov, Alistair Popple, peterx,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka

Carry over the dirty bit from the pmd to the ptes when a huge pmd splits.
It shouldn't be a correctness issue, since when pmd_dirty() is set we'll
have the page marked dirty anyway; however, having the dirty bit carried
over helps the first writes to the split ptes on some archs like x86.

Reviewed-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3222b40a0f6d..2f68e034ddec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2027,7 +2027,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
-	bool anon_exclusive = false;
+	bool anon_exclusive = false, dirty = false;
 	unsigned long addr;
 	int i;
 
@@ -2116,8 +2116,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
 		page = pmd_page(old_pmd);
-		if (pmd_dirty(old_pmd))
+		if (pmd_dirty(old_pmd)) {
+			dirty = true;
 			SetPageDirty(page);
+		}
 		write = pmd_write(old_pmd);
 		young = pmd_young(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
@@ -2183,6 +2185,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_wrprotect(entry);
 			if (!young)
 				entry = pte_mkold(entry);
+			/* NOTE: this may set soft-dirty too on some archs */
+			if (dirty)
+				entry = pte_mkdirty(entry);
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
 			if (uffd_wp)
-- 
2.32.0



* [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations
  2022-08-11 16:13 [PATCH v4 0/7] mm: Remember a/d bits for migration entries Peter Xu
                   ` (3 preceding siblings ...)
  2022-08-11 16:13 ` [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd Peter Xu
@ 2022-08-11 16:13 ` Peter Xu
  2022-09-11 23:48   ` Andrew Morton
  2022-08-11 16:13 ` [PATCH v4 6/7] mm/swap: Cache maximum swapfile size when init swap Peter Xu
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2022-08-11 16:13 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Kirill A . Shutemov, Alistair Popple, peterx,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka

When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, marking the page as old in the new page table using
either pte_mkold() or pmd_mkold(), and keeping the pte clean.

That's fine functionality-wise, but it's not friendly to page reclaim
because the page being moved can be actively accessed during the
procedure.  Not to mention that letting the hardware set the young bit can
bring quite some overhead on some systems, e.g. x86_64 needs a few hundred
nanoseconds to set the bit.  The same slowdown applies to the dirty bit
when the memory is first written after the page migration happened.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated.  To achieve it, define a new set
of bits in the migration swap offset field to cache the A/D bits of the
old pte.  Then when removing/recovering the migration entry, we can
recover the A/D bits even if the page changed.

One thing to mention is that here we use max_swapfile_size() to detect how
many swp offset bits we have, and we'll only enable this feature if we
know the swp offset is big enough to store both the PFN value and the A/D
bits.  Otherwise the A/D bits are dropped like before.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c        | 18 +++++++-
 mm/migrate.c            |  6 ++-
 mm/migrate_device.c     |  6 +++
 mm/rmap.c               |  5 ++-
 5 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index bd4c6f0c2103..36e462e116af 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -8,6 +8,10 @@
 
 #ifdef CONFIG_MMU
 
+#ifdef CONFIG_SWAP
+#include <linux/swapfile.h>
+#endif	/* CONFIG_SWAP */
+
 /*
  * swapcache pages are stored in the swapper_space radix tree.  We want to
  * get good packing density in that tree, so the index should be dense in
@@ -35,6 +39,31 @@
 #endif	/* MAX_PHYSMEM_BITS */
 #define SWP_PFN_MASK			(BIT(SWP_PFN_BITS) - 1)
 
+/**
+ * Migration swap entry specific bitfield definitions.  Layout:
+ *
+ *   |----------+--------------------|
+ *   | swp_type | swp_offset         |
+ *   |----------+--------+-+-+-------|
+ *   |          | resv   |D|A|  PFN  |
+ *   |----------+--------+-+-+-------|
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set (bit A)
+ * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set (bit D)
+ *
+ * Note: A/D bits will be stored in migration entries iff there're enough
+ * free bits in arch specific swp offset.  By default we'll ignore A/D bits
+ * when migrating a page.  Please refer to migration_entry_supports_ad()
+ * for more information.  If there're more bits besides PFN and A/D bits,
+ * they should be reserved and always be zeros.
+ */
+#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
+#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
+#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
+
+#define SWP_MIG_YOUNG			BIT(SWP_MIG_YOUNG_BIT)
+#define SWP_MIG_DIRTY			BIT(SWP_MIG_DIRTY_BIT)
+
 static inline bool is_pfn_swap_entry(swp_entry_t entry);
 
 /* Clear all flags but only keep swp_entry_t related information */
@@ -265,6 +294,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 	return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
+/*
+ * Returns whether the host has large enough swap offset field to support
+ * carrying over pgtable A/D bits for page migrations.  The result is
+ * pretty much arch specific.
+ */
+static inline bool migration_entry_supports_ad(void)
+{
+	/*
+	 * max_swapfile_size() returns the max supported swp-offset plus 1.
+	 * We can support the migration A/D bits iff the pfn swap entry has
+	 * the offset large enough to cover all of them (PFN, A & D bits).
+	 */
+#ifdef CONFIG_SWAP
+	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+#else  /* CONFIG_SWAP */
+	return false;
+#endif	/* CONFIG_SWAP */
+}
+
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_YOUNG);
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_YOUNG;
+	/* Keep the old behavior of aging page after migration */
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_DIRTY);
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_DIRTY;
+	/* Keep the old behavior of clean page after migration */
+	return false;
+}
+
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl);
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -311,6 +391,25 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
 	return 0;
 }
 
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	return false;
+}
 #endif	/* CONFIG_MIGRATION */
 
 typedef unsigned long pte_marker;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f68e034ddec..ac858fd9c1f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2111,7 +2111,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_writable_migration_entry(entry);
 		if (PageAnon(page))
 			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = false;
+		young = is_migration_entry_young(entry);
+		dirty = is_migration_entry_dirty(entry);
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2171,6 +2172,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			else
 				swp_entry = make_readable_migration_entry(
 							page_to_pfn(page + i));
+			if (young)
+				swp_entry = make_migration_entry_young(swp_entry);
+			if (dirty)
+				swp_entry = make_migration_entry_dirty(swp_entry);
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
@@ -3180,6 +3185,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
 	else
 		entry = make_readable_migration_entry(page_to_pfn(page));
+	if (pmd_young(pmdval))
+		entry = make_migration_entry_young(entry);
+	if (pmd_dirty(pmdval))
+		entry = make_migration_entry_dirty(entry);
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3205,13 +3214,18 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	get_page(new);
-	pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+	if (!is_migration_entry_young(entry))
+		pmde = pmd_mkold(pmde);
+	/* NOTE: this may contain setting soft-dirty on some archs */
+	if (PageDirty(new) && is_migration_entry_dirty(entry))
+		pmde = pmd_mkdirty(pmde);
 
 	if (PageAnon(new)) {
 		rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..0433a71d2bee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -198,7 +198,7 @@ static bool remove_migration_pte(struct folio *folio,
 #endif
 
 		folio_get(folio);
-		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
 		if (pte_swp_soft_dirty(*pvmw.pte))
 			pte = pte_mksoft_dirty(pte);
 
@@ -206,6 +206,10 @@ static bool remove_migration_pte(struct folio *folio,
 		 * Recheck VMA as permissions can change since migration started
 		 */
 		entry = pte_to_swp_entry(*pvmw.pte);
+		if (!is_migration_entry_young(entry))
+			pte = pte_mkold(pte);
+		if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
+			pte = pte_mkdirty(pte);
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d65476..e450b318b01b 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(page));
+			if (pte_present(pte)) {
+				if (pte_young(pte))
+					entry = make_migration_entry_young(entry);
+				if (pte_dirty(pte))
+					entry = make_migration_entry_dirty(entry);
+			}
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_present(pte)) {
 				if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..28aef434ea41 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-
+			if (pte_young(pteval))
+				entry = make_migration_entry_young(entry);
+			if (pte_dirty(pteval))
+				entry = make_migration_entry_dirty(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-- 
2.32.0



* [PATCH v4 6/7] mm/swap: Cache maximum swapfile size when init swap
  2022-08-11 16:13 [PATCH v4 0/7] mm: Remember a/d bits for migration entries Peter Xu
                   ` (4 preceding siblings ...)
  2022-08-11 16:13 ` [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations Peter Xu
@ 2022-08-11 16:13 ` Peter Xu
  2022-08-12  2:34   ` Huang, Ying
  2022-08-11 16:13 ` [PATCH v4 7/7] mm/swap: Cache swap migration A/D bits support Peter Xu
  2022-11-21  5:15 ` [PATCH v4 0/7] mm: Remember a/d bits for migration entries Raghavendra K T
  7 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2022-08-11 16:13 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Kirill A . Shutemov, Alistair Popple, peterx,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka

We used to have max_swapfile_size() fetching the per-arch maximum
swapfile size.

As the callers of max_swapfile_size() grow, this patch introduces a
variable "swapfile_maximum_size" to cache the value of the old
max_swapfile_size(), so that we don't need to calculate the value every
time.

Caching the value in swapfile_init() is safe because by the time we reach
that phase we should have initialized all the relevant information.  Here
the major arch to take care of is x86, which defines the max swapfile size
based on the L1TF mitigation.

Here both X86_BUG_L1TF and l1tf_mitigation should have been set up
properly by the time swapfile_init() is reached.  As a reference, the code
path looks like this for x86:

- start_kernel
  - setup_arch
    - early_cpu_init
      - early_identify_cpu --> setup X86_BUG_L1TF
  - parse_early_param
    - l1tf_cmdline --> set l1tf_mitigation
  - check_bugs
    - l1tf_select_mitigation --> set l1tf_mitigation
  - arch_call_rest_init
    - rest_init
      - kernel_init
        - kernel_init_freeable
          - do_basic_setup
            - do_initcalls --> calls swapfile_init() (initcall level 4)

The swapfile size only depends on the swp pte format on non-x86 archs, so
caching it is safe too.

While at it, rename max_swapfile_size() to arch_max_swapfile_size()
because an arch can define its own version of the function, so it's more
straightforward to have "arch_" as its prefix.  In the meantime, export
swapfile_maximum_size to replace the old usages of max_swapfile_size().

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/mm/init.c       | 2 +-
 include/linux/swapfile.h | 3 ++-
 include/linux/swapops.h  | 2 +-
 mm/swapfile.c            | 7 +++++--
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 82a042c03824..9121bc1b9453 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1054,7 +1054,7 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
 }
 
 #ifdef CONFIG_SWAP
-unsigned long max_swapfile_size(void)
+unsigned long arch_max_swapfile_size(void)
 {
 	unsigned long pages;
 
diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index 54078542134c..165e0bd04862 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -8,6 +8,7 @@
  */
 extern struct swap_info_struct *swap_info[];
 extern unsigned long generic_max_swapfile_size(void);
-extern unsigned long max_swapfile_size(void);
+/* Maximum swapfile size supported for the arch (not inclusive). */
+extern unsigned long swapfile_maximum_size;
 
 #endif /* _LINUX_SWAPFILE_H */
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 36e462e116af..f25b566643f1 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -307,7 +307,7 @@ static inline bool migration_entry_supports_ad(void)
 	 * the offset large enough to cover all of them (PFN, A & D bits).
 	 */
 #ifdef CONFIG_SWAP
-	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+	return swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS);
 #else  /* CONFIG_SWAP */
 	return false;
 #endif	/* CONFIG_SWAP */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1fdccd2f1422..3cc64399df44 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -63,6 +63,7 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
 long total_swap_pages;
 static int least_priority = -1;
+unsigned long swapfile_maximum_size;
 
 static const char Bad_file[] = "Bad swap file entry ";
 static const char Unused_file[] = "Unused swap file entry ";
@@ -2816,7 +2817,7 @@ unsigned long generic_max_swapfile_size(void)
 }
 
 /* Can be overridden by an architecture for additional checks. */
-__weak unsigned long max_swapfile_size(void)
+__weak unsigned long arch_max_swapfile_size(void)
 {
 	return generic_max_swapfile_size();
 }
@@ -2856,7 +2857,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	p->cluster_next = 1;
 	p->cluster_nr = 0;
 
-	maxpages = max_swapfile_size();
+	maxpages = swapfile_maximum_size;
 	last_page = swap_header->info.last_page;
 	if (!last_page) {
 		pr_warn("Empty swap-file\n");
@@ -3677,6 +3678,8 @@ static int __init swapfile_init(void)
 	for_each_node(nid)
 		plist_head_init(&swap_avail_heads[nid]);
 
+	swapfile_maximum_size = arch_max_swapfile_size();
+
 	return 0;
 }
 subsys_initcall(swapfile_init);
-- 
2.32.0



* [PATCH v4 7/7] mm/swap: Cache swap migration A/D bits support
  2022-08-11 16:13 [PATCH v4 0/7] mm: Remember a/d bits for migration entries Peter Xu
                   ` (5 preceding siblings ...)
  2022-08-11 16:13 ` [PATCH v4 6/7] mm/swap: Cache maximum swapfile size when init swap Peter Xu
@ 2022-08-11 16:13 ` Peter Xu
  2022-11-21  5:15 ` [PATCH v4 0/7] mm: Remember a/d bits for migration entries Raghavendra K T
  7 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2022-08-11 16:13 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Kirill A . Shutemov, Alistair Popple, peterx,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka

Introduce a variable swap_migration_ad_supported to cache whether the arch
supports swap migration A/D bits.

One thing to mention is that SWP_MIG_TOTAL_BITS internally references the
other macro MAX_PHYSMEM_BITS, which is a function call on x86 (and a
constant on all the other archs).

It's safe to reference it in swapfile_init() because by then we're already
at initcall level 4, so for x86_64 the 5-level pgtable setup must have
been initialized (right after early_identify_cpu() finishes).

- start_kernel
  - setup_arch
    - early_cpu_init
      - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
      - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
  - arch_call_rest_init
    - rest_init
      - kernel_init
        - kernel_init_freeable
          - do_basic_setup
            - do_initcalls --> calls swapfile_init() (initcall level 4)

This should slightly speed up the handling of migration swap entries.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapfile.h | 2 ++
 include/linux/swapops.h  | 7 +------
 mm/swapfile.c            | 8 ++++++++
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index 165e0bd04862..2fbcc9afd814 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -10,5 +10,7 @@ extern struct swap_info_struct *swap_info[];
 extern unsigned long generic_max_swapfile_size(void);
 /* Maximum swapfile size supported for the arch (not inclusive). */
 extern unsigned long swapfile_maximum_size;
+/* Whether swap migration entry supports storing A/D bits for the arch */
+extern bool swap_migration_ad_supported;
 
 #endif /* _LINUX_SWAPFILE_H */
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index f25b566643f1..dbf9df854124 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -301,13 +301,8 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
  */
 static inline bool migration_entry_supports_ad(void)
 {
-	/*
-	 * max_swapfile_size() returns the max supported swp-offset plus 1.
-	 * We can support the migration A/D bits iff the pfn swap entry has
-	 * the offset large enough to cover all of them (PFN, A & D bits).
-	 */
 #ifdef CONFIG_SWAP
-	return swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS);
+	return swap_migration_ad_supported;
 #else  /* CONFIG_SWAP */
 	return false;
 #endif	/* CONFIG_SWAP */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3cc64399df44..263b19e693cf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -64,6 +64,9 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
 long total_swap_pages;
 static int least_priority = -1;
 unsigned long swapfile_maximum_size;
+#ifdef CONFIG_MIGRATION
+bool swap_migration_ad_supported;
+#endif	/* CONFIG_MIGRATION */
 
 static const char Bad_file[] = "Bad swap file entry ";
 static const char Unused_file[] = "Unused swap file entry ";
@@ -3680,6 +3683,11 @@ static int __init swapfile_init(void)
 
 	swapfile_maximum_size = arch_max_swapfile_size();
 
+#ifdef CONFIG_MIGRATION
+	if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
+		swap_migration_ad_supported = true;
+#endif	/* CONFIG_MIGRATION */
+
 	return 0;
 }
 subsys_initcall(swapfile_init);
-- 
2.32.0



* Re: [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  2022-08-11 16:13 ` [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
@ 2022-08-12  2:33   ` Huang, Ying
  2022-08-23 21:01     ` Yu Zhao
  0 siblings, 1 reply; 27+ messages in thread
From: Huang, Ying @ 2022-08-12  2:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Kirill A . Shutemov,
	Alistair Popple, Andrea Arcangeli, Minchan Kim, Andrew Morton,
	David Hildenbrand, Andi Kleen, Nadav Amit, Vlastimil Babka

Peter Xu <peterx@redhat.com> writes:

> We've got a bunch of special swap entries that stores PFN inside the swap
> offset fields.  To fetch the PFN, normally the user just calls swp_offset()
> assuming that'll be the PFN.
>
> Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
> max possible length of a PFN on the host, meanwhile doing proper check with
> MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the PFNs
> properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
>
> One reason to do so is we never tried to sanitize whether swap offset can
> really fit for storing PFN.  At the meantime, this patch also prepares us
> with the future possibility to store more information inside the swp offset
> field, so assuming "swp_offset(entry)" to be the PFN will not stand any
> more very soon.
>
> Replace many of the swp_offset() callers to use swp_offset_pfn() where
> proper.  Note that many of the existing users are not candidates for the
> replacement, e.g.:
>
>   (1) When the swap entry is not a pfn swap entry at all, or,
>   (2) when we wanna keep the whole swp_offset but only change the swp type.
>
> For the latter, it can happen when fork() triggered on a write-migration
> swap entry pte, we may want to only change the migration type from
> write->read but keep the rest, so it's not "fetching PFN" but "changing
> swap type only".  They're left aside so that when there're more information
> within the swp offset they'll be carried over naturally in those cases.
>
> Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
> the new swp_offset_pfn() is about.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  arch/arm64/mm/hugetlbpage.c |  2 +-
>  fs/proc/task_mmu.c          | 20 +++++++++++++++++---
>  include/linux/swapops.h     | 35 +++++++++++++++++++++++++++++------
>  mm/hmm.c                    |  2 +-
>  mm/memory-failure.c         |  2 +-
>  mm/page_vma_mapped.c        |  6 +++---
>  6 files changed, 52 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 0795028f017c..35e9a468d13e 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -245,7 +245,7 @@ static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
>  {
>  	VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
>  
> -	return page_folio(pfn_to_page(swp_offset(entry)));
> +	return page_folio(pfn_to_page(swp_offset_pfn(entry)));
>  }
>  
>  void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index d56c65f98d00..b3e79128fca0 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1419,9 +1419,19 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		if (pte_swp_uffd_wp(pte))
>  			flags |= PM_UFFD_WP;
>  		entry = pte_to_swp_entry(pte);
> -		if (pm->show_pfn)
> +		if (pm->show_pfn) {
> +			pgoff_t offset;
> +			/*
> +			 * For PFN swap offsets, keeping the offset field
> +			 * to be PFN only to be compatible with old smaps.
> +			 */
> +			if (is_pfn_swap_entry(entry))
> +				offset = swp_offset_pfn(entry);
> +			else
> +				offset = swp_offset(entry);
>  			frame = swp_type(entry) |
> -				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> +			    (offset << MAX_SWAPFILES_SHIFT);
> +		}
>  		flags |= PM_SWAP;
>  		migration = is_migration_entry(entry);
>  		if (is_pfn_swap_entry(entry))
> @@ -1478,7 +1488,11 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  			unsigned long offset;
>  
>  			if (pm->show_pfn) {
> -				offset = swp_offset(entry) +
> +				if (is_pfn_swap_entry(entry))
> +					offset = swp_offset_pfn(entry);
> +				else
> +					offset = swp_offset(entry);
> +				offset = offset +
>  					((addr & ~PMD_MASK) >> PAGE_SHIFT);
>  				frame = swp_type(entry) |
>  					(offset << MAX_SWAPFILES_SHIFT);
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 3a2901ff4f1e..bd4c6f0c2103 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -23,6 +23,20 @@
>  #define SWP_TYPE_SHIFT	(BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
>  #define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)
>  
> +/*
> + * Definitions only for PFN swap entries (see is_pfn_swap_entry()).  To
> + * store PFN, we only need SWP_PFN_BITS bits.  Each of the pfn swap entries
> + * can use the extra bits to store other information besides PFN.
> + */
> +#ifdef MAX_PHYSMEM_BITS
> +#define SWP_PFN_BITS			(MAX_PHYSMEM_BITS - PAGE_SHIFT)
> +#else  /* MAX_PHYSMEM_BITS */
> +#define SWP_PFN_BITS			(BITS_PER_LONG - PAGE_SHIFT)
> +#endif	/* MAX_PHYSMEM_BITS */
> +#define SWP_PFN_MASK			(BIT(SWP_PFN_BITS) - 1)
> +
> +static inline bool is_pfn_swap_entry(swp_entry_t entry);
> +
>  /* Clear all flags but only keep swp_entry_t related information */
>  static inline pte_t pte_swp_clear_flags(pte_t pte)
>  {
> @@ -64,6 +78,17 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
>  	return entry.val & SWP_OFFSET_MASK;
>  }
>  
> +/*
> + * This should only be called upon a pfn swap entry to get the PFN stored
> + * in the swap entry.  Please refers to is_pfn_swap_entry() for definition
> + * of pfn swap entry.
> + */
> +static inline unsigned long swp_offset_pfn(swp_entry_t entry)
> +{
> +	VM_BUG_ON(!is_pfn_swap_entry(entry));
> +	return swp_offset(entry) & SWP_PFN_MASK;
> +}
> +
>  /* check whether a pte points to a swap entry */
>  static inline int is_swap_pte(pte_t pte)
>  {
> @@ -369,7 +394,7 @@ static inline int pte_none_mostly(pte_t pte)
>  
>  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
>  {
> -	struct page *p = pfn_to_page(swp_offset(entry));
> +	struct page *p = pfn_to_page(swp_offset_pfn(entry));
>  
>  	/*
>  	 * Any use of migration entries may only occur while the
> @@ -387,6 +412,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
>   */
>  static inline bool is_pfn_swap_entry(swp_entry_t entry)
>  {
> +	/* Make sure the swp offset can always store the needed fields */
> +	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
> +
>  	return is_migration_entry(entry) || is_device_private_entry(entry) ||
>  	       is_device_exclusive_entry(entry);
>  }
> @@ -475,11 +503,6 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
>  	return swp_type(entry) == SWP_HWPOISON;
>  }
>  
> -static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
> -{
> -	return swp_offset(entry);
> -}
> -
>  static inline void num_poisoned_pages_inc(void)
>  {
>  	atomic_long_inc(&num_poisoned_pages);
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f2aa63b94d9b..3850fb625dda 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -253,7 +253,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  			cpu_flags = HMM_PFN_VALID;
>  			if (is_writable_device_private_entry(entry))
>  				cpu_flags |= HMM_PFN_WRITE;
> -			*hmm_pfn = swp_offset(entry) | cpu_flags;
> +			*hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
>  			return 0;
>  		}
>  
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 0dfed9d7b273..e48f6f6a259d 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -632,7 +632,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
>  		swp_entry_t swp = pte_to_swp_entry(pte);
>  
>  		if (is_hwpoison_entry(swp))
> -			pfn = hwpoison_entry_to_pfn(swp);
> +			pfn = swp_offset_pfn(swp);
>  	}
>  
>  	if (!pfn || pfn != poisoned_pfn)
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 8e9e574d535a..93e13fc17d3c 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -86,7 +86,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  		    !is_device_exclusive_entry(entry))
>  			return false;
>  
> -		pfn = swp_offset(entry);
> +		pfn = swp_offset_pfn(entry);
>  	} else if (is_swap_pte(*pvmw->pte)) {
>  		swp_entry_t entry;
>  
> @@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  		    !is_device_exclusive_entry(entry))
>  			return false;
>  
> -		pfn = swp_offset(entry);
> +		pfn = swp_offset_pfn(entry);
>  	} else {
>  		if (!pte_present(*pvmw->pte))
>  			return false;
> @@ -221,7 +221,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  					return not_found(pvmw);
>  				entry = pmd_to_swp_entry(pmde);
>  				if (!is_migration_entry(entry) ||
> -				    !check_pmd(swp_offset(entry), pvmw))
> +				    !check_pmd(swp_offset_pfn(entry), pvmw))
>  					return not_found(pvmw);
>  				return true;
>  			}


* Re: [PATCH v4 6/7] mm/swap: Cache maximum swapfile size when init swap
  2022-08-11 16:13 ` [PATCH v4 6/7] mm/swap: Cache maximum swapfile size when init swap Peter Xu
@ 2022-08-12  2:34   ` Huang, Ying
  0 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2022-08-12  2:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Kirill A . Shutemov,
	Alistair Popple, Andrea Arcangeli, Minchan Kim, Andrew Morton,
	David Hildenbrand, Andi Kleen, Nadav Amit, Vlastimil Babka

Peter Xu <peterx@redhat.com> writes:

> We used to have swapfile_maximum_size() fetching a maximum value of
> swapfile size per-arch.
>
> As the caller of max_swapfile_size() grows, this patch introduce a variable
> "swapfile_maximum_size" and cache the value of old max_swapfile_size(), so
> that we don't need to calculate the value every time.
>
> Caching the value in swapfile_init() is safe because when reaching the
> phase we should have initialized all the relevant information.  Here the
> major arch to take care of is x86, which defines the max swapfile size
> based on L1TF mitigation.
>
> Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly
> when reaching swapfile_init(). As a reference, the code path looks like
> this for x86:
>
> - start_kernel
>   - setup_arch
>     - early_cpu_init
>       - early_identify_cpu --> setup X86_BUG_L1TF
>   - parse_early_param
>     - l1tf_cmdline --> set l1tf_mitigation
>   - check_bugs
>     - l1tf_select_mitigation --> set l1tf_mitigation
>   - arch_call_rest_init
>     - rest_init
>       - kernel_init
>         - kernel_init_freeable
>           - do_basic_setup
>             - do_initcalls --> calls swapfile_init() (initcall level 4)
>
> The swapfile size only depends on swp pte format on non-x86 archs, so
> caching it is safe too.
>
> Since at it, rename max_swapfile_size() to arch_max_swapfile_size() because
> arch can define its own function, so it's more straightforward to have
> "arch_" as its prefix.  At the meantime, export swapfile_maximum_size to
> replace the old usages of max_swapfile_size().
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

LGTM.

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

Best Regards,
Huang, Ying

> ---
>  arch/x86/mm/init.c       | 2 +-
>  include/linux/swapfile.h | 3 ++-
>  include/linux/swapops.h  | 2 +-
>  mm/swapfile.c            | 7 +++++--
>  4 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 82a042c03824..9121bc1b9453 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -1054,7 +1054,7 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
>  }
>  
>  #ifdef CONFIG_SWAP
> -unsigned long max_swapfile_size(void)
> +unsigned long arch_max_swapfile_size(void)
>  {
>  	unsigned long pages;
>  
> diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
> index 54078542134c..165e0bd04862 100644
> --- a/include/linux/swapfile.h
> +++ b/include/linux/swapfile.h
> @@ -8,6 +8,7 @@
>   */
>  extern struct swap_info_struct *swap_info[];
>  extern unsigned long generic_max_swapfile_size(void);
> -extern unsigned long max_swapfile_size(void);
> +/* Maximum swapfile size supported for the arch (not inclusive). */
> +extern unsigned long swapfile_maximum_size;
>  
>  #endif /* _LINUX_SWAPFILE_H */
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 36e462e116af..f25b566643f1 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -307,7 +307,7 @@ static inline bool migration_entry_supports_ad(void)
>  	 * the offset large enough to cover all of them (PFN, A & D bits).
>  	 */
>  #ifdef CONFIG_SWAP
> -	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> +	return swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS);
>  #else  /* CONFIG_SWAP */
>  	return false;
>  #endif	/* CONFIG_SWAP */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1fdccd2f1422..3cc64399df44 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -63,6 +63,7 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
>  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
>  long total_swap_pages;
>  static int least_priority = -1;
> +unsigned long swapfile_maximum_size;
>  
>  static const char Bad_file[] = "Bad swap file entry ";
>  static const char Unused_file[] = "Unused swap file entry ";
> @@ -2816,7 +2817,7 @@ unsigned long generic_max_swapfile_size(void)
>  }
>  
>  /* Can be overridden by an architecture for additional checks. */
> -__weak unsigned long max_swapfile_size(void)
> +__weak unsigned long arch_max_swapfile_size(void)
>  {
>  	return generic_max_swapfile_size();
>  }
> @@ -2856,7 +2857,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
>  	p->cluster_next = 1;
>  	p->cluster_nr = 0;
>  
> -	maxpages = max_swapfile_size();
> +	maxpages = swapfile_maximum_size;
>  	last_page = swap_header->info.last_page;
>  	if (!last_page) {
>  		pr_warn("Empty swap-file\n");
> @@ -3677,6 +3678,8 @@ static int __init swapfile_init(void)
>  	for_each_node(nid)
>  		plist_head_init(&swap_avail_heads[nid]);
>  
> +	swapfile_maximum_size = arch_max_swapfile_size();
> +
>  	return 0;
>  }
>  subsys_initcall(swapfile_init);
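
The pattern the patch follows - an arch-overridable helper whose result is
computed once at init time and then read from a plain variable - can be
sketched in self-contained user-space C roughly as below; the names mirror
the patch, but this is only an illustration, not the kernel code itself.

#include <stdio.h>

/* Generic default, analogous to generic_max_swapfile_size(). */
static unsigned long generic_limit(void)
{
	return 1UL << 20;	/* arbitrary placeholder value */
}

/* Weak default, analogous to the __weak arch_max_swapfile_size(). */
__attribute__((weak)) unsigned long arch_max_swapfile_size(void)
{
	return generic_limit();
}

/* Cached once, analogous to the exported swapfile_maximum_size. */
static unsigned long swapfile_maximum_size;

static void swapfile_init(void)
{
	swapfile_maximum_size = arch_max_swapfile_size();
}

int main(void)
{
	swapfile_init();
	/* Later callers only read the cached value. */
	printf("max swapfile pages: %lu\n", swapfile_maximum_size);
	return 0;
}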


* Re: [PATCH v4 2/7] mm/swap: Comment all the ifdef in swapops.h
  2022-08-11 16:13 ` [PATCH v4 2/7] mm/swap: Comment all the ifdef in swapops.h Peter Xu
@ 2022-08-15  6:03   ` Alistair Popple
  0 siblings, 0 replies; 27+ messages in thread
From: Alistair Popple @ 2022-08-15  6:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Kirill A . Shutemov,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka


Looks good:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

Peter Xu <peterx@redhat.com> writes:

> swapops.h contains quite a few layers of ifdef; some of the "else" and
> "endif" lines don't carry a comment naming the macro, so it's hard to
> follow what they refer to.  Add the comments.
>
> Suggested-by: Nadav Amit <nadav.amit@gmail.com>
> Reviewed-by: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/swapops.h | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index a3d435bf9f97..3a2901ff4f1e 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -247,8 +247,8 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  #ifdef CONFIG_HUGETLB_PAGE
>  extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
>  extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
> -#endif
> -#else
> +#endif	/* CONFIG_HUGETLB_PAGE */
> +#else  /* CONFIG_MIGRATION */
>  static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
>  {
>  	return swp_entry(0, 0);
> @@ -276,7 +276,7 @@ static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>  #ifdef CONFIG_HUGETLB_PAGE
>  static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
>  static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
> -#endif
> +#endif	/* CONFIG_HUGETLB_PAGE */
>  static inline int is_writable_migration_entry(swp_entry_t entry)
>  {
>  	return 0;
> @@ -286,7 +286,7 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
>  	return 0;
>  }
>
> -#endif
> +#endif	/* CONFIG_MIGRATION */
>
>  typedef unsigned long pte_marker;
>
> @@ -426,7 +426,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>  {
>  	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
>  }
> -#else
> +#else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>  static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  		struct page *page)
>  {
> @@ -455,7 +455,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>  {
>  	return 0;
>  }
> -#endif
> +#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>
>  #ifdef CONFIG_MEMORY_FAILURE
>
> @@ -495,7 +495,7 @@ static inline void num_poisoned_pages_sub(long i)
>  	atomic_long_sub(i, &num_poisoned_pages);
>  }
>
> -#else
> +#else  /* CONFIG_MEMORY_FAILURE */
>
>  static inline swp_entry_t make_hwpoison_entry(struct page *page)
>  {
> @@ -514,7 +514,7 @@ static inline void num_poisoned_pages_inc(void)
>  static inline void num_poisoned_pages_sub(long i)
>  {
>  }
> -#endif
> +#endif  /* CONFIG_MEMORY_FAILURE */
>
>  static inline int non_swap_entry(swp_entry_t entry)
>  {


* Re: [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  2022-08-12  2:33   ` Huang, Ying
@ 2022-08-23 21:01     ` Yu Zhao
  2022-08-23 22:04       ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: Yu Zhao @ 2022-08-23 21:01 UTC (permalink / raw)
  To: Peter Xu, Andrew Morton
  Cc: Huang, Ying, Linux-MM, linux-kernel, Hugh Dickins,
	Kirill A . Shutemov, Alistair Popple, Andrea Arcangeli,
	Minchan Kim, David Hildenbrand, Andi Kleen, Nadav Amit,
	Vlastimil Babka

On Thu, Aug 11, 2022 at 8:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Peter Xu <peterx@redhat.com> writes:
>
> > We've got a bunch of special swap entries that store a PFN inside the swap
> > offset field.  To fetch the PFN, normally the user just calls swp_offset()
> > assuming that'll be the PFN.
> >
> > Add a helper swp_offset_pfn() to fetch the PFN instead, reading only the
> > max possible length of a PFN on the host, and use the BUILD_BUG_ON() in
> > is_pfn_swap_entry() to check against MAX_PHYSMEM_BITS that the swap
> > offsets can always store the PFNs properly.
> >
> > One reason to do so is that we never tried to sanitize whether the swap
> > offset can really fit a PFN.  Meanwhile, this patch also prepares us for
> > the future possibility of storing more information inside the swp offset
> > field, so assuming "swp_offset(entry)" to be the PFN will not hold for
> > much longer.
> >
> > Replace many of the swp_offset() callers with swp_offset_pfn() where
> > proper.  Note that many of the existing users are not candidates for the
> > replacement, e.g.:
> >
> >   (1) When the swap entry is not a pfn swap entry at all, or,
> >   (2) when we want to keep the whole swp_offset but only change the swp type.
> >
> > The latter can happen when fork() triggers on a write-migration swap entry
> > pte: we may want to only change the migration type from write->read but
> > keep the rest, so it's not "fetching the PFN" but "changing the swap type
> > only".  Those are left aside so that when there is more information within
> > the swp offset it will be carried over naturally in those cases.
> >
> > While at it, drop hwpoison_entry_to_pfn() because that's exactly what the
> > new swp_offset_pfn() is about.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
>
> LGTM, Thanks!
>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
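
For reference, the helper the quoted patch adds boils down to a masked read
of the swap offset; a rough sketch of its shape is below, where
SWP_PFN_MASK stands for a mask covering the maximum PFN width on the host
(this is a sketch of the idea, not necessarily the exact upstream code):

static inline unsigned long swp_offset_pfn(swp_entry_t entry)
{
	/* Only pfn swap entries may be asked for their PFN. */
	VM_BUG_ON(!is_pfn_swap_entry(entry));
	return swp_offset(entry) & SWP_PFN_MASK;
}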

Hi,

I hit the following crash on mm-everything-2022-08-22-22-59. Please take a look.

Thanks.

  kernel BUG at include/linux/swapops.h:117!
  CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S         O L
6.0.0-dbg-DEV #2
  RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0
  Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6
c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b
48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48
  RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282
  RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000
  RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b
  RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000
  R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738
  R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a
  FS:  00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   change_pte_range+0x36e/0x880
   change_p4d_range+0x2e8/0x670
   change_protection_range+0x14e/0x2c0
   mprotect_fixup+0x1ee/0x330
   do_mprotect_pkey+0x34c/0x440
   __x64_sys_mprotect+0x1d/0x30


* Re: [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  2022-08-23 21:01     ` Yu Zhao
@ 2022-08-23 22:04       ` Peter Xu
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2022-08-23 22:04 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Huang, Ying, Linux-MM, linux-kernel, Hugh Dickins,
	Kirill A . Shutemov, Alistair Popple, Andrea Arcangeli,
	Minchan Kim, David Hildenbrand, Andi Kleen, Nadav Amit,
	Vlastimil Babka

On Tue, Aug 23, 2022 at 03:01:09PM -0600, Yu Zhao wrote:
> On Thu, Aug 11, 2022 at 8:33 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Peter Xu <peterx@redhat.com> writes:
> >
> > > We've got a bunch of special swap entries that store a PFN inside the swap
> > > offset field.  To fetch the PFN, normally the user just calls swp_offset()
> > > assuming that'll be the PFN.
> > >
> > > Add a helper swp_offset_pfn() to fetch the PFN instead, reading only the
> > > max possible length of a PFN on the host, and use the BUILD_BUG_ON() in
> > > is_pfn_swap_entry() to check against MAX_PHYSMEM_BITS that the swap
> > > offsets can always store the PFNs properly.
> > >
> > > One reason to do so is that we never tried to sanitize whether the swap
> > > offset can really fit a PFN.  Meanwhile, this patch also prepares us for
> > > the future possibility of storing more information inside the swp offset
> > > field, so assuming "swp_offset(entry)" to be the PFN will not hold for
> > > much longer.
> > >
> > > Replace many of the swp_offset() callers with swp_offset_pfn() where
> > > proper.  Note that many of the existing users are not candidates for the
> > > replacement, e.g.:
> > >
> > >   (1) When the swap entry is not a pfn swap entry at all, or,
> > >   (2) when we want to keep the whole swp_offset but only change the swp type.
> > >
> > > The latter can happen when fork() triggers on a write-migration swap entry
> > > pte: we may want to only change the migration type from write->read but
> > > keep the rest, so it's not "fetching the PFN" but "changing the swap type
> > > only".  Those are left aside so that when there is more information within
> > > the swp offset it will be carried over naturally in those cases.
> > >
> > > While at it, drop hwpoison_entry_to_pfn() because that's exactly what the
> > > new swp_offset_pfn() is about.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> >
> > LGTM, Thanks!
> >
> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> 
> Hi,
> 
> I hit the following crash on mm-everything-2022-08-22-22-59. Please take a look.
> 
> Thanks.
> 
>   kernel BUG at include/linux/swapops.h:117!
>   CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S         O L
> 6.0.0-dbg-DEV #2
>   RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0
>   Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6
> c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b
> 48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48
>   RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282
>   RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000
>   RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b
>   RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000
>   R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738
>   R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a
>   FS:  00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0
>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   Call Trace:
>    <TASK>
>    change_pte_range+0x36e/0x880
>    change_p4d_range+0x2e8/0x670
>    change_protection_range+0x14e/0x2c0
>    mprotect_fixup+0x1ee/0x330
>    do_mprotect_pkey+0x34c/0x440
>    __x64_sys_mprotect+0x1d/0x30

The VM_BUG_ON added in this patch seems to have revealed a real bug,
because we probably shouldn't call pfn_swap_entry_to_page() upon e.g. a
genuine swap pte.
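
In other words, callers are expected to check the entry type before asking
for the page; roughly something like the sketch below (illustration only,
not the actual fix):

#include <linux/swapops.h>

/*
 * Only pfn swap entries (migration, device private, hwpoison) have a
 * struct page behind the swap offset; a genuine swap entry does not,
 * so bail out instead of tripping the VM_BUG_ON.
 */
static struct page *entry_to_page_or_null(swp_entry_t entry)
{
	if (!is_pfn_swap_entry(entry))
		return NULL;

	return pfn_swap_entry_to_page(entry);
}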

I'll post a patch shortly, thanks.

-- 
Peter Xu



* Re: [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations
  2022-08-11 16:13 ` [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations Peter Xu
@ 2022-09-11 23:48   ` Andrew Morton
  2022-09-13  0:55     ` Huang, Ying
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2022-09-11 23:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Kirill A . Shutemov,
	Alistair Popple, Andrea Arcangeli, Minchan Kim,
	David Hildenbrand, Andi Kleen, Nadav Amit, Huang Ying,
	Vlastimil Babka

On Thu, 11 Aug 2022 12:13:29 -0400 Peter Xu <peterx@redhat.com> wrote:

> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, mark the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keep the pte clean.
> 
> That's fine functionally, but it's not friendly to page reclaim because
> the page being moved can be actively accessed during the procedure.  Not
> to mention that hardware setting the young bit can bring quite some
> overhead on some systems, e.g. x86_64 needs a few hundred nanoseconds to
> set the bit.  The same slowdown applies to the dirty bit when the memory
> is first written after the page migration happened.
> 
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
> 
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset is big enough to store both the PFN value and the A/D bits.
> Otherwise the A/D bits are dropped like before.
> 
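
The encoding described above - the PFN in the low bits of the swap offset,
with young/dirty carried in otherwise-free higher bits - can be illustrated
with the self-contained sketch below.  The field widths and macro names
here are made up for the example and are not the kernel's actual SWP_MIG_*
definitions.

#include <stdbool.h>
#include <stdio.h>

/* Example-only layout: 52 offset bits for the PFN, then A/D bits. */
#define EX_PFN_BITS	52
#define EX_PFN_MASK	((1ULL << EX_PFN_BITS) - 1)
#define EX_YOUNG_BIT	(1ULL << EX_PFN_BITS)
#define EX_DIRTY_BIT	(1ULL << (EX_PFN_BITS + 1))

/* Pack the PFN plus the old pte's A/D bits into one "swap offset". */
static unsigned long long mig_offset_pack(unsigned long long pfn,
					  bool young, bool dirty)
{
	unsigned long long off = pfn & EX_PFN_MASK;

	if (young)
		off |= EX_YOUNG_BIT;
	if (dirty)
		off |= EX_DIRTY_BIT;
	return off;
}

int main(void)
{
	unsigned long long off = mig_offset_pack(0x12345, true, false);

	/* Unpack when the migration entry is restored into a present pte. */
	printf("pfn=%#llx young=%d dirty=%d\n",
	       off & EX_PFN_MASK,
	       !!(off & EX_YOUNG_BIT),
	       !!(off & EX_DIRTY_BIT));
	return 0;
}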

There was some discussion over v3 of this patch, but none over v4.

Can people please review this patch series so we can get moving with it?

Thanks.



* Re: [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations
  2022-09-11 23:48   ` Andrew Morton
@ 2022-09-13  0:55     ` Huang, Ying
  0 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2022-09-13  0:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, linux-mm, linux-kernel, Hugh Dickins,
	Kirill A . Shutemov, Alistair Popple, Andrea Arcangeli,
	Minchan Kim, David Hildenbrand, Andi Kleen, Nadav Amit,
	Vlastimil Babka

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 11 Aug 2022 12:13:29 -0400 Peter Xu <peterx@redhat.com> wrote:
>
>> When page migration happens, we always ignore the young/dirty bit settings
>> in the old pgtable, mark the page as old in the new page table using
>> either pte_mkold() or pmd_mkold(), and keep the pte clean.
>> 
>> That's fine functionally, but it's not friendly to page reclaim because
>> the page being moved can be actively accessed during the procedure.  Not
>> to mention that hardware setting the young bit can bring quite some
>> overhead on some systems, e.g. x86_64 needs a few hundred nanoseconds to
>> set the bit.  The same slowdown applies to the dirty bit when the memory
>> is first written after the page migration happened.
>> 
>> Actually we can easily remember the A/D bit configuration and recover the
>> information after the page is migrated.  To achieve it, define a new set of
>> bits in the migration swap offset field to cache the A/D bits for old pte.
>> Then when removing/recovering the migration entry, we can recover the A/D
>> bits even if the page changed.
>> 
>> One thing to mention is that here we used max_swapfile_size() to detect how
>> many swp offset bits we have, and we'll only enable this feature if we know
>> the swp offset is big enough to store both the PFN value and the A/D bits.
>> Otherwise the A/D bits are dropped like before.
>> 
>
> There was some discussion over v3 of this patch, but none over v4.
>
> Can people please review this patch series so we can get moving with it?

Most of the discussion over v3 was about the migrate_device.c code.  There
were some bugs, and they have been fixed by Alistair via [1].

This patch itself is good.  Sorry for bothering.

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

[1] https://lore.kernel.org/linux-mm/9f801e9d8d830408f2ca27821f606e09aa856899.1662078528.git-series.apopple@nvidia.com/

Best Regards,
Huang, Ying


* dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd)
  2022-08-11 16:13 ` [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd Peter Xu
@ 2022-10-21 16:06   ` Anatoly Pugachev
  2022-10-23 13:33     ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) #forregzbot Thorsten Leemhuis
  2022-10-23 19:52     ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) Peter Xu
  0 siblings, 2 replies; 27+ messages in thread
From: Anatoly Pugachev @ 2022-10-21 16:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Kirill A . Shutemov,
	Alistair Popple, Andrea Arcangeli, Minchan Kim, Andrew Morton,
	David Hildenbrand, Andi Kleen, Nadav Amit, Huang Ying,
	Vlastimil Babka, sparclinux

On Thu, Aug 11, 2022 at 12:13:28PM -0400, Peter Xu wrote:
> Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
> shouldn't be a correctness issue since when pmd_dirty() we'll have the page
> marked dirty anyway, however having dirty bit carried over helps the next
> initial writes of split ptes on some archs like x86.
> 
> Reviewed-by: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/huge_memory.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)


Hello!

Tried to update my debian sparc64 sid (unstable) linux distro to latest
version of available packages, got dpkg segfault... 

$ apt update -y
...
Unpacking linux-image-sparc64-smp (6.0.2-1) ...
E: Sub-process /usr/bin/dpkg received a segmentation fault.

Downgraded dpkg from 1.21.9 to 1.21.8 / 1.21.7 (2-3 month old
versions) - still getting a segfault on package install (which was never
an issue before, even on these old dpkg versions).

Tried to gdb backtrace the core file, which was unlucky:


root@ttip:/# apt install -y linux-image-sparc64-smp ccache qemu-utils xdelta qemu-system-x86 distcc qemu-efi-aarch64 pkg-kde-tools
...
Preparing to unpack .../2-linux-image-6.0.0-1-sparc64-smp_6.0.2-1_sparc64.deb ...
Unpacking linux-image-6.0.0-1-sparc64-smp (6.0.2-1) ...
Selecting previously unselected package linux-image-sparc64-smp.
Preparing to unpack .../3-linux-image-sparc64-smp_6.0.2-1_sparc64.deb ...
Unpacking linux-image-sparc64-smp (6.0.2-1) ...
E: Sub-process /usr/bin/dpkg received a segmentation fault.
root@ttip:/# ls -l core.4751
-rw------- 1 root root 25042944 Oct 21 14:38 core.4751
root@ttip:/# gdb -q -c core.4751
GNU gdb (Debian 12.1-4) 12.1
[New LWP 4751]
Core was generated by `/usr/bin/dpkg --status-fd 15 --no-triggers --unpack --auto-deconfigure --recurs'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xfff800010089cde4 in ?? ()
(gdb) bt
#0  0xfff800010089cde4 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)


rebooted from my compiled kernel 6.1.0-rc1 to older (debian) kernel -
5.19.0-2-sparc64-smp

dpkg installed packages without any problems. Removed just installed
packages, rebooted to 6.1.0-rc1 and tried to install packages, dpkg got
segfault again.

Recompiled 6.1.0-rc1 with gcc-11 instead of gcc-12, still segfaults...
... bisect time ...

mator@ttip:~/linux-2.6$ git bisect log
# bad: [9abf2313adc1ca1b6180c508c25f22f9395cc780] Linux 6.1-rc1
# good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
git bisect start 'v6.1-rc1' 'v6.0'
# good: [18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect good 18fd049731e67651009f316195da9281b756f2cf
# good: [4c540c92b46497dcda59203eea78e4620bc96f47] RISC-V: Add mvendorid, marchid, and mimpid to /proc/cpuinfo output
git bisect good 4c540c92b46497dcda59203eea78e4620bc96f47
# bad: [27bc50fc90647bbf7b734c3fc306a5e61350da53] Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 27bc50fc90647bbf7b734c3fc306a5e61350da53
# good: [ada3bfb6492a6d0d3eca50f3b61315fe032efc72] Merge tag 'tpmdd-next-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
git bisect good ada3bfb6492a6d0d3eca50f3b61315fe032efc72
# bad: [5f7fa13fa858c17580ed513bd5e0a4b36d68fdd6] mm: add pageblock_align() macro
git bisect bad 5f7fa13fa858c17580ed513bd5e0a4b36d68fdd6
# bad: [54a611b605901c7d5d05b6b8f5d04a6ceb0962aa] Maple Tree: add new data structure
git bisect bad 54a611b605901c7d5d05b6b8f5d04a6ceb0962aa
# good: [59298997df89e19aad426d4ae0a7e5037074da5a] x86/uaccess: avoid check_object_size() in copy_from_user_nmi()
git bisect good 59298997df89e19aad426d4ae0a7e5037074da5a
# good: [04c6b79ae4f0bcbd96afd7cea5e1a8848162438e] btrfs: convert __process_pages_contig() to use filemap_get_folios_contig()
git bisect good 04c6b79ae4f0bcbd96afd7cea5e1a8848162438e
# good: [da29499124cd2221539b235c1f93c7d93faf6565] mm, hwpoison: use __PageMovable() to detect non-lru movable pages
git bisect good da29499124cd2221539b235c1f93c7d93faf6565
# bad: [eed9a328aa1ae6ac1edaa026957e6882f57de0dd] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
git bisect bad eed9a328aa1ae6ac1edaa026957e6882f57de0dd
# bad: [f347c9d2697fcbbb64e077f7113a3887a181b8c0] filemap: make the accounting of thrashing more consistent
git bisect bad f347c9d2697fcbbb64e077f7113a3887a181b8c0
# good: [eba4d770efc86a3710e36b828190858abfa3bb74] mm/swap: comment all the ifdef in swapops.h
git bisect good eba4d770efc86a3710e36b828190858abfa3bb74
# bad: [2e3468778dbe3ec389a10c21a703bb8e5be5cfbc] mm: remember young/dirty bit for page migrations
git bisect bad 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
# bad: [0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb] mm/thp: carry over dirty bit when thp splits on pmd
git bisect bad 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb
# good: [0d206b5d2e0d7d7f09ac9540e3ab3e35a34f536e] mm/swap: add swp_offset_pfn() to fetch PFN from swap entry
git bisect good 0d206b5d2e0d7d7f09ac9540e3ab3e35a34f536e
# first bad commit: [0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb] mm/thp: carry over dirty bit when thp splits on pmd


mator@ttip:~/linux-2.6$ git bisect good
0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb is the first bad commit
commit 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb
Author: Peter Xu <peterx@redhat.com>
Date:   Thu Aug 11 12:13:28 2022 -0400

    mm/thp: carry over dirty bit when thp splits on pmd

    Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
    shouldn't be a correctness issue since when pmd_dirty() we'll have the
    page marked dirty anyway, however having dirty bit carried over helps the
    next initial writes of split ptes on some archs like x86.

    Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.com



So, v6.0-rc3-176-g0d206b5d2e0d does not segfault dpkg,
v6.0-rc3-177-g0ccf7f168e17 segfaults it on package install.

dpkg test was (apt) install/remove some packages, segfaults only on install
(not remove).

Reverted 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb from top of v6.1-rc1 and
tried to compile kernel, but got error 

mm/huge_memory.c: In function ‘__split_huge_pmd_locked’:
mm/huge_memory.c:2129:17: error: ‘dirty’ undeclared (first use in this function)
 2129 |                 dirty = is_migration_entry_dirty(entry);
      |                 ^~~~~
mm/huge_memory.c:2129:17: note: each undeclared identifier is reported only once for each function it appears in
make[2]: *** [scripts/Makefile.build:250: mm/huge_memory.o] Error 1

So can't test v6.1-rc1 with patch reverted...


* Re: dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) #forregzbot
  2022-10-21 16:06   ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) Anatoly Pugachev
@ 2022-10-23 13:33     ` Thorsten Leemhuis
  2022-11-04 10:39       ` Thorsten Leemhuis
  2022-10-23 19:52     ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) Peter Xu
  1 sibling, 1 reply; 27+ messages in thread
From: Thorsten Leemhuis @ 2022-10-23 13:33 UTC (permalink / raw)
  To: regressions; +Cc: linux-mm, linux-kernel, sparclinux

[Note: this mail is primarily send for documentation purposes and/or for
regzbot, my Linux kernel regression tracking bot. That's why I removed
most or all folks from the list of recipients, but left any that looked
like a mailing lists. These mails usually contain '#forregzbot' in the
subject, to make them easy to spot and filter out.]

[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few template
paragraphs you might have encountered already in similar form.]

Hi, this is your Linux kernel regression tracker.  CCing the regression
mailing list, as it should be in the loop for all regressions, as
explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

On 21.10.22 18:06, Anatoly Pugachev wrote:
> On Thu, Aug 11, 2022 at 12:13:28PM -0400, Peter Xu wrote:
>> Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
>> shouldn't be a correctness issue since when pmd_dirty() we'll have the page
>> marked dirty anyway, however having dirty bit carried over helps the next
>> initial writes of split ptes on some archs like x86.
>>
>> Reviewed-by: Huang Ying <ying.huang@intel.com>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>>  mm/huge_memory.c | 9 +++++++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> 
> Hello!
> 
> Tried to update my debian sparc64 sid (unstable) linux distro to latest
> version of available packages, got dpkg segfault... 

Thanks for the report. To be sure below issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:

#regzbot ^introduced 0ccf7f168e17bb7
#regzbot title mm: sparc64: dpkg fails on sparc64 since "mm/thp: Carry
over dirty bit when thp splits on pmd)"
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply -- ideally with also
telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/

Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained in
the Linux kernel's documentation; the above webpage explains why this is
important for tracked regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

> $ apt update -y
> ...
> Unpacking linux-image-sparc64-smp (6.0.2-1) ...
> E: Sub-process /usr/bin/dpkg received a segmentation fault.
> 
> Downgraded dpkg from 1.21.9 to 1.21.8 / 1.21.7 (2-3 monthes old
> versions) - still getting segfault on package install (which was never
> an issue before, even on this old dpkg versions).
> 
> Tried to gdb backtrace core file, which is unlucky :
> 
> 
> root@ttip:/# apt install -y linux-image-sparc64-smp ccache qemu-utils xdelta qemu-system-x86 distcc qemu-efi-aarch64 pkg-kde-tools
> ...
> Preparing to unpack .../2-linux-image-6.0.0-1-sparc64-smp_6.0.2-1_sparc64.deb ...
> Unpacking linux-image-6.0.0-1-sparc64-smp (6.0.2-1) ...
> Selecting previously unselected package linux-image-sparc64-smp.
> Preparing to unpack .../3-linux-image-sparc64-smp_6.0.2-1_sparc64.deb ...
> Unpacking linux-image-sparc64-smp (6.0.2-1) ...
> E: Sub-process /usr/bin/dpkg received a segmentation fault.
> root@ttip:/# ls -l core.4751
> -rw------- 1 root root 25042944 Oct 21 14:38 core.4751
> root@ttip:/# gdb -q -c core.4751
> GNU gdb (Debian 12.1-4) 12.1
> [New LWP 4751]
> Core was generated by `/usr/bin/dpkg --status-fd 15 --no-triggers --unpack --auto-deconfigure --recurs'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0xfff800010089cde4 in ?? ()
> (gdb) bt
> #0  0xfff800010089cde4 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> (gdb)
> 
> 
> rebooted from my compiled kernel 6.1.0-rc1 to older (debian) kernel -
> 5.19.0-2-sparc64-smp
> 
> dpkg installed packages without any problems. Removed just installed
> packages, rebooted to 6.1.0-rc1 and tried to install packages, dpkg got
> segfault again.
> 
> Recompiled 6.1.0-rc1 with gcc-11 instead of gcc-12, still segfaults...
> ... bisect time ...
> 
> mator@ttip:~/linux-2.6$ git bisect log
> # bad: [9abf2313adc1ca1b6180c508c25f22f9395cc780] Linux 6.1-rc1
> # good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
> git bisect start 'v6.1-rc1' 'v6.0'
> # good: [18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> git bisect good 18fd049731e67651009f316195da9281b756f2cf
> # good: [4c540c92b46497dcda59203eea78e4620bc96f47] RISC-V: Add mvendorid, marchid, and mimpid to /proc/cpuinfo output
> git bisect good 4c540c92b46497dcda59203eea78e4620bc96f47
> # bad: [27bc50fc90647bbf7b734c3fc306a5e61350da53] Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> git bisect bad 27bc50fc90647bbf7b734c3fc306a5e61350da53
> # good: [ada3bfb6492a6d0d3eca50f3b61315fe032efc72] Merge tag 'tpmdd-next-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
> git bisect good ada3bfb6492a6d0d3eca50f3b61315fe032efc72
> # bad: [5f7fa13fa858c17580ed513bd5e0a4b36d68fdd6] mm: add pageblock_align() macro
> git bisect bad 5f7fa13fa858c17580ed513bd5e0a4b36d68fdd6
> # bad: [54a611b605901c7d5d05b6b8f5d04a6ceb0962aa] Maple Tree: add new data structure
> git bisect bad 54a611b605901c7d5d05b6b8f5d04a6ceb0962aa
> # good: [59298997df89e19aad426d4ae0a7e5037074da5a] x86/uaccess: avoid check_object_size() in copy_from_user_nmi()
> git bisect good 59298997df89e19aad426d4ae0a7e5037074da5a
> # good: [04c6b79ae4f0bcbd96afd7cea5e1a8848162438e] btrfs: convert __process_pages_contig() to use filemap_get_folios_contig()
> git bisect good 04c6b79ae4f0bcbd96afd7cea5e1a8848162438e
> # good: [da29499124cd2221539b235c1f93c7d93faf6565] mm, hwpoison: use __PageMovable() to detect non-lru movable pages
> git bisect good da29499124cd2221539b235c1f93c7d93faf6565
> # bad: [eed9a328aa1ae6ac1edaa026957e6882f57de0dd] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> git bisect bad eed9a328aa1ae6ac1edaa026957e6882f57de0dd
> # bad: [f347c9d2697fcbbb64e077f7113a3887a181b8c0] filemap: make the accounting of thrashing more consistent
> git bisect bad f347c9d2697fcbbb64e077f7113a3887a181b8c0
> # good: [eba4d770efc86a3710e36b828190858abfa3bb74] mm/swap: comment all the ifdef in swapops.h
> git bisect good eba4d770efc86a3710e36b828190858abfa3bb74
> # bad: [2e3468778dbe3ec389a10c21a703bb8e5be5cfbc] mm: remember young/dirty bit for page migrations
> git bisect bad 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
> # bad: [0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb] mm/thp: carry over dirty bit when thp splits on pmd
> git bisect bad 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb
> # good: [0d206b5d2e0d7d7f09ac9540e3ab3e35a34f536e] mm/swap: add swp_offset_pfn() to fetch PFN from swap entry
> git bisect good 0d206b5d2e0d7d7f09ac9540e3ab3e35a34f536e
> # first bad commit: [0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb] mm/thp: carry over dirty bit when thp splits on pmd
> 
> 
> mator@ttip:~/linux-2.6$ git bisect good
> 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb is the first bad commit
> commit 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb
> Author: Peter Xu <peterx@redhat.com>
> Date:   Thu Aug 11 12:13:28 2022 -0400
> 
>     mm/thp: carry over dirty bit when thp splits on pmd
> 
>     Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
>     shouldn't be a correctness issue since when pmd_dirty() we'll have the
>     page marked dirty anyway, however having dirty bit carried over helps the
>     next initial writes of split ptes on some archs like x86.
> 
>     Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.com
> 
> 
> 
> So, v6.0-rc3-176-g0d206b5d2e0d) does not segfault dpkg,
> v6.0-rc3-177-g0ccf7f168e17 segfaults it on package install.
> 
> dpkg test was (apt) install/remove some packages, segfaults only on install
> (not remove).
> 
> Reverted 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb from top of v6.1-rc1 and
> tried to compile kernel, but got error 
> 
> mm/huge_memory.c: In function ‘__split_huge_pmd_locked’:
> mm/huge_memory.c:2129:17: error: ‘dirty’ undeclared (first use in this function)
>  2129 |                 dirty = is_migration_entry_dirty(entry);
>       |                 ^~~~~
> mm/huge_memory.c:2129:17: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [scripts/Makefile.build:250: mm/huge_memory.o] Error 1
> 
> So can't test v6.1-rc1 with patch reverted...


* Re: dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd)
  2022-10-21 16:06   ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) Anatoly Pugachev
  2022-10-23 13:33     ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) #forregzbot Thorsten Leemhuis
@ 2022-10-23 19:52     ` Peter Xu
  2022-10-25 10:22       ` Anatoly Pugachev
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Xu @ 2022-10-23 19:52 UTC (permalink / raw)
  To: Anatoly Pugachev, David Miller
  Cc: linux-mm, linux-kernel, Hugh Dickins, Kirill A . Shutemov,
	Alistair Popple, Andrea Arcangeli, Minchan Kim, Andrew Morton,
	David Hildenbrand, Andi Kleen, Nadav Amit, Huang Ying,
	Vlastimil Babka, sparclinux

[-- Attachment #1: Type: text/plain, Size: 8617 bytes --]

On Fri, Oct 21, 2022 at 07:06:03PM +0300, Anatoly Pugachev wrote:
> On Thu, Aug 11, 2022 at 12:13:28PM -0400, Peter Xu wrote:
> > Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
> > shouldn't be a correctness issue since when pmd_dirty() we'll have the page
> > marked dirty anyway, however having dirty bit carried over helps the next
> > initial writes of split ptes on some archs like x86.
> > 
> > Reviewed-by: Huang Ying <ying.huang@intel.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/huge_memory.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> 
> Hello!

Hi, Anatoly,

> 
> Tried to update my debian sparc64 sid (unstable) linux distro to latest
> version of available packages, got dpkg segfault... 
> 
> $ apt update -y
> ...
> Unpacking linux-image-sparc64-smp (6.0.2-1) ...
> E: Sub-process /usr/bin/dpkg received a segmentation fault.
> 
> Downgraded dpkg from 1.21.9 to 1.21.8 / 1.21.7 (2-3 monthes old
> versions) - still getting segfault on package install (which was never
> an issue before, even on this old dpkg versions).
> 
> Tried to gdb backtrace core file, which is unlucky :
> 
> 
> root@ttip:/# apt install -y linux-image-sparc64-smp ccache qemu-utils xdelta qemu-system-x86 distcc qemu-efi-aarch64 pkg-kde-tools
> ...
> Preparing to unpack .../2-linux-image-6.0.0-1-sparc64-smp_6.0.2-1_sparc64.deb ...
> Unpacking linux-image-6.0.0-1-sparc64-smp (6.0.2-1) ...
> Selecting previously unselected package linux-image-sparc64-smp.
> Preparing to unpack .../3-linux-image-sparc64-smp_6.0.2-1_sparc64.deb ...
> Unpacking linux-image-sparc64-smp (6.0.2-1) ...
> E: Sub-process /usr/bin/dpkg received a segmentation fault.
> root@ttip:/# ls -l core.4751
> -rw------- 1 root root 25042944 Oct 21 14:38 core.4751
> root@ttip:/# gdb -q -c core.4751
> GNU gdb (Debian 12.1-4) 12.1
> [New LWP 4751]
> Core was generated by `/usr/bin/dpkg --status-fd 15 --no-triggers --unpack --auto-deconfigure --recurs'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0xfff800010089cde4 in ?? ()
> (gdb) bt
> #0  0xfff800010089cde4 in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> (gdb)
> 
> 
> rebooted from my compiled kernel 6.1.0-rc1 to older (debian) kernel -
> 5.19.0-2-sparc64-smp
> 
> dpkg installed packages without any problems. Removed just installed
> packages, rebooted to 6.1.0-rc1 and tried to install packages, dpkg got
> segfault again.
> 
> Recompiled 6.1.0-rc1 with gcc-11 instead of gcc-12, still segfaults...
> ... bisect time ...
> 
> mator@ttip:~/linux-2.6$ git bisect log
> # bad: [9abf2313adc1ca1b6180c508c25f22f9395cc780] Linux 6.1-rc1
> # good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
> git bisect start 'v6.1-rc1' 'v6.0'
> # good: [18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> git bisect good 18fd049731e67651009f316195da9281b756f2cf
> # good: [4c540c92b46497dcda59203eea78e4620bc96f47] RISC-V: Add mvendorid, marchid, and mimpid to /proc/cpuinfo output
> git bisect good 4c540c92b46497dcda59203eea78e4620bc96f47
> # bad: [27bc50fc90647bbf7b734c3fc306a5e61350da53] Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> git bisect bad 27bc50fc90647bbf7b734c3fc306a5e61350da53
> # good: [ada3bfb6492a6d0d3eca50f3b61315fe032efc72] Merge tag 'tpmdd-next-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
> git bisect good ada3bfb6492a6d0d3eca50f3b61315fe032efc72
> # bad: [5f7fa13fa858c17580ed513bd5e0a4b36d68fdd6] mm: add pageblock_align() macro
> git bisect bad 5f7fa13fa858c17580ed513bd5e0a4b36d68fdd6
> # bad: [54a611b605901c7d5d05b6b8f5d04a6ceb0962aa] Maple Tree: add new data structure
> git bisect bad 54a611b605901c7d5d05b6b8f5d04a6ceb0962aa
> # good: [59298997df89e19aad426d4ae0a7e5037074da5a] x86/uaccess: avoid check_object_size() in copy_from_user_nmi()
> git bisect good 59298997df89e19aad426d4ae0a7e5037074da5a
> # good: [04c6b79ae4f0bcbd96afd7cea5e1a8848162438e] btrfs: convert __process_pages_contig() to use filemap_get_folios_contig()
> git bisect good 04c6b79ae4f0bcbd96afd7cea5e1a8848162438e
> # good: [da29499124cd2221539b235c1f93c7d93faf6565] mm, hwpoison: use __PageMovable() to detect non-lru movable pages
> git bisect good da29499124cd2221539b235c1f93c7d93faf6565
> # bad: [eed9a328aa1ae6ac1edaa026957e6882f57de0dd] mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
> git bisect bad eed9a328aa1ae6ac1edaa026957e6882f57de0dd
> # bad: [f347c9d2697fcbbb64e077f7113a3887a181b8c0] filemap: make the accounting of thrashing more consistent
> git bisect bad f347c9d2697fcbbb64e077f7113a3887a181b8c0
> # good: [eba4d770efc86a3710e36b828190858abfa3bb74] mm/swap: comment all the ifdef in swapops.h
> git bisect good eba4d770efc86a3710e36b828190858abfa3bb74
> # bad: [2e3468778dbe3ec389a10c21a703bb8e5be5cfbc] mm: remember young/dirty bit for page migrations
> git bisect bad 2e3468778dbe3ec389a10c21a703bb8e5be5cfbc
> # bad: [0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb] mm/thp: carry over dirty bit when thp splits on pmd
> git bisect bad 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb
> # good: [0d206b5d2e0d7d7f09ac9540e3ab3e35a34f536e] mm/swap: add swp_offset_pfn() to fetch PFN from swap entry
> git bisect good 0d206b5d2e0d7d7f09ac9540e3ab3e35a34f536e
> # first bad commit: [0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb] mm/thp: carry over dirty bit when thp splits on pmd
> 
> 
> mator@ttip:~/linux-2.6$ git bisect good
> 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb is the first bad commit
> commit 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb
> Author: Peter Xu <peterx@redhat.com>
> Date:   Thu Aug 11 12:13:28 2022 -0400
> 
>     mm/thp: carry over dirty bit when thp splits on pmd
> 
>     Carry over the dirty bit from pmd to pte when a huge pmd splits.  It
>     shouldn't be a correctness issue since when pmd_dirty() we'll have the
>     page marked dirty anyway, however having dirty bit carried over helps the
>     next initial writes of split ptes on some archs like x86.
> 
>     Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.com
> 
> 
> 
> So, v6.0-rc3-176-g0d206b5d2e0d) does not segfault dpkg,
> v6.0-rc3-177-g0ccf7f168e17 segfaults it on package install.
> 
> dpkg test was (apt) install/remove some packages, segfaults only on install
> (not remove).
> 
> Reverted 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb from top of v6.1-rc1 and
> tried to compile kernel, but got error 
> 
> mm/huge_memory.c: In function ‘__split_huge_pmd_locked’:
> mm/huge_memory.c:2129:17: error: ‘dirty’ undeclared (first use in this function)
>  2129 |                 dirty = is_migration_entry_dirty(entry);
>       |                 ^~~~~
> mm/huge_memory.c:2129:17: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [scripts/Makefile.build:250: mm/huge_memory.o] Error 1
> 
> So can't test v6.1-rc1 with patch reverted...

Sorry to hear this, and thanks for the report and debugging.  The revert
won't work because the dirty variable is used by a later patch for the swap
path too.  I've attached a partial (and minimal) revert, feel free to try it.

I had a feeling that it's somehow related to the special impl of sparc64
pte_mkdirty() where a kernel patching mechanism is used to share code
between sun4[uv].  I'd assume your machine is sun4v?  As that's the one
that needs the patching, iiuc.

The sparc64 impl goes back to commit cf627156c450 ("[SPARC64]: Use inline
patching for critical PTE operations.", 2006-03-20).  I believe it has
worked solidly for all these years, so I really have no quick clue on why
it can fail with the new code added.

I think the magic is done with sun4v_patch_2insn_range().  What I can think
of is that this thp patch can add many more places in the kernel that will
need patching, because both __split_huge_pmd() and split_huge_pmd() are
defined as macros, not functions.  However I don't see a problem with that
so far, e.g., I don't see a limitation preventing __sun4v_2insn_patch_end
from growing to cover all those new spots.

I'm copying David Miller, who implemented the sparc64 pte operations.  I
know he's probably always very busy, but just in case there are quick
answers, we may not need the revert patch and can instead make it work for
sparc64 too.  With the revert patch we'll start to lose the dirty bit again
on many archs when a thp splits, like before, but I assume that's still
better than breaking an arch or adding an arch-specific ifdef, so we can
revisit later.

Thanks,

-- 
Peter Xu

[-- Attachment #2: 0001-Partly-revert-mm-thp-carry-over-dirty-bit-when-thp-s.patch --]
[-- Type: text/plain, Size: 1670 bytes --]

From 1ea9c520b3d0bb10cc4195893dd4326f451c3dad Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Sun, 23 Oct 2022 15:29:29 -0400
Subject: [PATCH] Partly revert "mm/thp: carry over dirty bit when thp splits
 on pmd"
Content-type: text/plain

Anatoly Pugachev <matorola@gmail.com> reported sparc64 breakage on the
patch:

https://lore.kernel.org/r/20221021160603.GA23307@u164.east.ru

The sparc64 impl of pte_mkdirty() is special in that it leverages a code
patching mechanism for sun4u/sun4v on the relevant pgtable entry
operations.

Until we have a clue why sparc64 is special and why the patch caused
processes to SIGSEGV, revert the patch for now.  The swap path of dirty
bit inheritance is kept because that uses the shared swap code, so we
assume it will not be affected.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ad17c8d3c0fe..72b9b4622a38 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2160,9 +2160,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_wrprotect(entry);
 			if (!young)
 				entry = pte_mkold(entry);
-			/* NOTE: this may set soft-dirty too on some archs */
-			if (dirty)
-				entry = pte_mkdirty(entry);
+			/*
+			 * NOTE: we don't do pte_mkdirty when dirty==true
+			 * because it breaks sparc64 which can sigsegv
+			 * random process.  Need to revisit when we figure
+			 * out what is special with sparc64.
+			 */
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
 			if (uffd_wp)
-- 
2.37.3



* Re: dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd)
  2022-10-23 19:52     ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) Peter Xu
@ 2022-10-25 10:22       ` Anatoly Pugachev
  2022-10-25 14:43         ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: Anatoly Pugachev @ 2022-10-25 10:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Miller, linux-mm, linux-kernel, Hugh Dickins,
	Kirill A . Shutemov, Alistair Popple, Andrea Arcangeli,
	Minchan Kim, Andrew Morton, David Hildenbrand, Andi Kleen,
	Nadav Amit, Huang Ying, Vlastimil Babka, sparclinux

On Sun, Oct 23, 2022 at 10:53 PM Peter Xu <peterx@redhat.com> wrote:
> On Fri, Oct 21, 2022 at 07:06:03PM +0300, Anatoly Pugachev wrote:
> >
> >     Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.com
> >
> > So, v6.0-rc3-176-g0d206b5d2e0d) does not segfault dpkg,
> > v6.0-rc3-177-g0ccf7f168e17 segfaults it on package install.
> >
> > dpkg test was (apt) install/remove some packages, segfaults only on install
> > (not remove).
> >
> > Reverted 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb from top of v6.1-rc1 and
> > tried to compile kernel, but got error
> >
> > mm/huge_memory.c: In function ‘__split_huge_pmd_locked’:
> > mm/huge_memory.c:2129:17: error: ‘dirty’ undeclared (first use in this function)
> >  2129 |                 dirty = is_migration_entry_dirty(entry);
> >       |                 ^~~~~
> > mm/huge_memory.c:2129:17: note: each undeclared identifier is reported only once for each function it appears in
> > make[2]: *** [scripts/Makefile.build:250: mm/huge_memory.o] Error 1
> >
> > So can't test v6.1-rc1 with patch reverted...
>
> Sorry to know this, and thanks for the report and debugging.  The revert
> won't work because dirty variable is used in later patch for the swap path
> too.  I've attached a partial (and minimum) revert, feel free to try.

Peter,

Tested again with 6.1.0-rc2: the non-patched kernel segfaults dpkg, and
applying your patch makes dpkg (or the kernel) behave properly.
Thanks!

> I had a feeling that it's somehow related to the special impl of sparc64
> pte_mkdirty() where a kernel patching mechanism is used to share code
> between sun4[uv].  I'd assume your machine is sun4v?  As that's the one
> that needs the patching, iiuc.

kernel boot log reports
ARCH: SUN4V


* Re: dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd)
  2022-10-25 10:22       ` Anatoly Pugachev
@ 2022-10-25 14:43         ` Peter Xu
  2022-11-01 13:13           ` Anatoly Pugachev
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2022-10-25 14:43 UTC (permalink / raw)
  To: Anatoly Pugachev
  Cc: David Miller, linux-mm, linux-kernel, Hugh Dickins,
	Kirill A . Shutemov, Alistair Popple, Andrea Arcangeli,
	Minchan Kim, Andrew Morton, David Hildenbrand, Andi Kleen,
	Nadav Amit, Huang Ying, Vlastimil Babka, sparclinux

On Tue, Oct 25, 2022 at 01:22:45PM +0300, Anatoly Pugachev wrote:
> On Sun, Oct 23, 2022 at 10:53 PM Peter Xu <peterx@redhat.com> wrote:
> > On Fri, Oct 21, 2022 at 07:06:03PM +0300, Anatoly Pugachev wrote:
> > >
> > >     Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.com
> > >
> > > So, v6.0-rc3-176-g0d206b5d2e0d) does not segfault dpkg,
> > > v6.0-rc3-177-g0ccf7f168e17 segfaults it on package install.
> > >
> > > dpkg test was (apt) install/remove some packages, segfaults only on install
> > > (not remove).
> > >
> > > Reverted 0ccf7f168e17bb7eb5a322397ba5a841f4fbaccb from top of v6.1-rc1 and
> > > tried to compile kernel, but got error
> > >
> > > mm/huge_memory.c: In function ‘__split_huge_pmd_locked’:
> > > mm/huge_memory.c:2129:17: error: ‘dirty’ undeclared (first use in this function)
> > >  2129 |                 dirty = is_migration_entry_dirty(entry);
> > >       |                 ^~~~~
> > > mm/huge_memory.c:2129:17: note: each undeclared identifier is reported only once for each function it appears in
> > > make[2]: *** [scripts/Makefile.build:250: mm/huge_memory.o] Error 1
> > >
> > > So can't test v6.1-rc1 with patch reverted...
> >
> > Sorry to know this, and thanks for the report and debugging.  The revert
> > won't work because dirty variable is used in later patch for the swap path
> > too.  I've attached a partial (and minimum) revert, feel free to try.
> 
> Peter,
> 
> tested again with 6.1.0-rc2 already, non patched kernel segfaulting
> dpkg, using your patch makes dpkg
> (or kernel) to behave properly.
> Thanks!

Thanks for the quick feedback.

> 
> > I had a feeling that it's somehow related to the special impl of sparc64
> > pte_mkdirty() where a kernel patching mechanism is used to share code
> > between sun4[uv].  I'd assume your machine is sun4v?  As that's the one
> > that needs the patching, iiuc.
> 
> kernel boot log reports
> ARCH: SUN4V

Then it's expected, but unfortunate too, as QEMU doesn't seem to have
support for sun4v, so I cannot even try that out with a VM.

https://wiki.qemu.org/Documentation/Platforms/SPARC

I'd also expect there's nothing useful in either dmesg or relevant logs
because it's segv, but please share if you find anything that may be
helpful.

Maybe we need to take the minimal revert for v6.1 before we have more
clues.

Thanks,

-- 
Peter Xu



* Re: dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd)
  2022-10-25 14:43         ` Peter Xu
@ 2022-11-01 13:13           ` Anatoly Pugachev
  2022-11-02 18:34             ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: Anatoly Pugachev @ 2022-11-01 13:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Miller, linux-mm, linux-kernel, Hugh Dickins,
	Kirill A . Shutemov, Alistair Popple, Andrea Arcangeli,
	Minchan Kim, Andrew Morton, David Hildenbrand, Andi Kleen,
	Nadav Amit, Huang Ying, Vlastimil Babka, sparclinux

On Tue, Oct 25, 2022 at 5:43 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Oct 25, 2022 at 01:22:45PM +0300, Anatoly Pugachev wrote:
> > On Sun, Oct 23, 2022 at 10:53 PM Peter Xu <peterx@redhat.com> wrote:
> > > On Fri, Oct 21, 2022 at 07:06:03PM +0300, Anatoly Pugachev wrote:
> > > >
> > > >     Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.com
>
> Maybe we need to have the minimum revert for v6.1 before we have more
> clues.

Just a quick update on 6.1.0-rc3:

Tested again with 6.1.0-rc3 - it segfaults dpkg... with the patch applied,
no dpkg segfaults.


* Re: dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd)
  2022-11-01 13:13           ` Anatoly Pugachev
@ 2022-11-02 18:34             ` Peter Xu
  2022-11-02 18:47               ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2022-11-02 18:34 UTC (permalink / raw)
  To: Anatoly Pugachev, Andrew Morton
  Cc: David Miller, linux-mm, linux-kernel, Hugh Dickins,
	Kirill A . Shutemov, Alistair Popple, Andrea Arcangeli,
	Minchan Kim, Andrew Morton, David Hildenbrand, Andi Kleen,
	Nadav Amit, Huang Ying, Vlastimil Babka, sparclinux

On Tue, Nov 01, 2022 at 04:13:20PM +0300, Anatoly Pugachev wrote:
> On Tue, Oct 25, 2022 at 5:43 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Oct 25, 2022 at 01:22:45PM +0300, Anatoly Pugachev wrote:
> > > On Sun, Oct 23, 2022 at 10:53 PM Peter Xu <peterx@redhat.com> wrote:
> > > > On Fri, Oct 21, 2022 at 07:06:03PM +0300, Anatoly Pugachev wrote:
> > > > >
> > > > >     Link: https://lkml.kernel.org/r/20220811161331.37055-5-peterx@redhat.com
> >
> > Maybe we need to have the minimum revert for v6.1 before we have more
> > clues.
> 
> Just a quick update on 6.1.0-rc3
> 
> Tested again with 6.1.0-rc3, segfaults dpkg... applied patch - no dpkg
> segfaults.

Andrew, shall we apply the minimal revert for this patch for now?  The
patch was attached to my earlier reply to Anatoly:

https://lore.kernel.org/all/Y1Wbi4yyVvDtg4zN@x1n/

Thanks,

-- 
Peter Xu



* Re: dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd)
  2022-11-02 18:34             ` Peter Xu
@ 2022-11-02 18:47               ` Andrew Morton
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2022-11-02 18:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Anatoly Pugachev, David Miller, linux-mm, linux-kernel,
	Hugh Dickins, Kirill A . Shutemov, Alistair Popple,
	Andrea Arcangeli, Minchan Kim, David Hildenbrand, Andi Kleen,
	Nadav Amit, Huang Ying, Vlastimil Babka, sparclinux

On Wed, 2 Nov 2022 14:34:17 -0400 Peter Xu <peterx@redhat.com> wrote:

> > Tested again with 6.1.0-rc3, segfaults dpkg... applied patch - no dpkg
> > segfaults.
> 
> Andrew, shall we apply the minimum revert for this patch for now?  The
> one-liner was attached in this email I replied to Anatoly:
> 
> https://lore.kernel.org/all/Y1Wbi4yyVvDtg4zN@x1n/

Oh.  I missed that in the email flood.

I added the Fixes: and queued it, thanks.


* Re: dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) #forregzbot
  2022-10-23 13:33     ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) #forregzbot Thorsten Leemhuis
@ 2022-11-04 10:39       ` Thorsten Leemhuis
  2022-11-13 17:56         ` Thorsten Leemhuis
  0 siblings, 1 reply; 27+ messages in thread
From: Thorsten Leemhuis @ 2022-11-04 10:39 UTC (permalink / raw)
  To: regressions; +Cc: linux-mm, linux-kernel, sparclinux

On 23.10.22 15:33, Thorsten Leemhuis wrote:
> On 21.10.22 18:06, Anatoly Pugachev wrote:
>> Tried to update my debian sparc64 sid (unstable) linux distro to latest
>> version of available packages, got dpkg segfault... 
> #regzbot ^introduced 0ccf7f168e17bb7
> #regzbot title mm: sparc64: dpkg fails on sparc64 since "mm/thp: Carry
> over dirty bit when thp splits on pmd)"
> #regzbot ignore-activity

#regzbot fixed-by: 434e3d15d92b


* Re: dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) #forregzbot
  2022-11-04 10:39       ` Thorsten Leemhuis
@ 2022-11-13 17:56         ` Thorsten Leemhuis
  0 siblings, 0 replies; 27+ messages in thread
From: Thorsten Leemhuis @ 2022-11-13 17:56 UTC (permalink / raw)
  To: regressions; +Cc: linux-mm, linux-kernel, sparclinux



On 04.11.22 11:39, Thorsten Leemhuis wrote:
> On 23.10.22 15:33, Thorsten Leemhuis wrote:
>> On 21.10.22 18:06, Anatoly Pugachev wrote:
>>> Tried to update my debian sparc64 sid (unstable) linux distro to latest
>>> version of available packages, got dpkg segfault... 
>> #regzbot ^introduced 0ccf7f168e17bb7
>> #regzbot title mm: sparc64: dpkg fails on sparc64 since "mm/thp: Carry
>> over dirty bit when thp splits on pmd"
>> #regzbot ignore-activity
> 
> #regzbot fixed-by: 434e3d15d92b

#regzbot fixed-by: 624a2c94f5b7

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 0/7] mm: Remember a/d bits for migration entries
  2022-08-11 16:13 [PATCH v4 0/7] mm: Remember a/d bits for migration entries Peter Xu
                   ` (6 preceding siblings ...)
  2022-08-11 16:13 ` [PATCH v4 7/7] mm/swap: Cache swap migration A/D bits support Peter Xu
@ 2022-11-21  5:15 ` Raghavendra K T
  2022-11-21 14:57   ` Peter Xu
  7 siblings, 1 reply; 27+ messages in thread
From: Raghavendra K T @ 2022-11-21  5:15 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Hugh Dickins, Kirill A . Shutemov, Alistair Popple,
	Andrea Arcangeli, Minchan Kim, Andrew Morton, David Hildenbrand,
	Andi Kleen, Nadav Amit, Huang Ying, Vlastimil Babka

On 8/11/2022 9:43 PM, Peter Xu wrote:
> v4:
> - Added r-bs for Ying
> - Some cosmetic changes here and there [Ying]
> - Fix smaps to only dump PFN for pfn swap entries for both pte/pmd [Ying]
> - Remove max_swapfile_size(), export swapfile_maximum_size variable [Ying]
> - In migrate_vma_collect_pmd() only read A/D if pte_present()
> 
> rfc: https://lore.kernel.org/all/20220729014041.21292-1-peterx@redhat.com
> v1:  https://lore.kernel.org/all/20220803012159.36551-1-peterx@redhat.com
> v2:  https://lore.kernel.org/all/20220804203952.53665-1-peterx@redhat.com
> v3:  https://lore.kernel.org/all/20220809220100.20033-1-peterx@redhat.com
> 
> Problem
> =======
> 
> When migrate a page, right now we always mark the migrated page as old &
> clean.
> 
> However that could lead to at least two problems:
> 
>    (1) We lost the real hot/cold information while we could have persisted.
>        That information shouldn't change even if the backing page is changed
>        after the migration,
> 
>    (2) There can be always extra overhead on the immediate next access to
>        any migrated page, because hardware MMU needs cycles to set the young
>        bit again for reads, and dirty bits for write, as long as the
>        hardware MMU supports these bits.
> 
> Many of the recent upstream works showed that (2) is not something trivial
> and actually very measurable.  In my test case, reading 1G chunk of memory
> - jumping in page size intervals - could take 99ms just because of the
> extra setting on the young bit on a generic x86_64 system, comparing to 4ms
> if young set.
> 
> This issue is originally reported by Andrea Arcangeli.
> 
> Solution
> ========
> 
> To solve this problem, this patchset tries to remember the young/dirty bits
> in the migration entries and carry them over when recovering the ptes.
> 
> We have the chance to do so because in many systems the swap offset is not
> really fully used.  Migration entries use swp offset to store PFN only,
> while the PFN is normally not as large as swp offset and normally smaller.
> It means we do have some free bits in swp offset that we can use to store
> things like A/D bits, and that's how this series tried to approach this
> problem.
> 
> max_swapfile_size() is used here to detect per-arch offset length in swp
> entries.  We'll automatically remember the A/D bits when we find that we
> have enough swp offset field to keep both the PFN and the extra bits.
> 
> Since max_swapfile_size() can be slow, the last two patches cache the
> results for it and also swap_migration_ad_supported as a whole.
> 
> Known Issues / TODOs
> ====================
> 
> We still haven't taught madvise() to recognize the new A/D bits in
> migration entries, namely MADV_COLD/MADV_FREE.  E.g. when MADV_COLD upon a
> migration entry.  It's not clear yet on whether we should clear the A bit,
> or we should just drop the entry directly.
> 
> We didn't teach idle page tracking on the new migration entries, because
> it'll need larger rework on the tree on rmap pgtable walk.  However it
> should make it already better because before this patchset page will be old
> page after migration, so the series fixes a potential false negative in
> idle page tracking when pages are migrated before being observed.
> 
> The other thing is that migration A/D bits will not yet work for private
> device swap entries.  The code is there for completeness, but since private
> device swap entries do not yet have fields to store A/D bits, even if we
> persist A/D across a present pte switching to a migration entry, we'll lose
> them again when the migration entry is converted to a private device swap
> entry.
> 
> Tests
> =====
> 
> With the patchset applied, the immediate read access test [1] on the above
> 1G chunk after migration shrinks from 99ms to 4ms.  The test is done by
> moving the 1G of pages from node 0->1->0 and then reading it in page-size
> jumps.  The test was run on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
> 
> A similar effect can also be measured when writing the memory for the
> first time after migration.
> 
> After applying the patchset, the initial read and write immediately after
> a page is migrated both perform similarly to before the migration.

I was able to test on an AMD EPYC 64-core, 2-NUMA-node (Milan) system
clocked at 3.72 GHz.

I am seeing a similar improvement for the test mentioned above (swap-young).

base: (6.0)
--------------
Write (node 0) took 562202 (us)
Read (node 0) took 7790 (us)
Move to node 1 took 474876(us)
Move to node 0 took 642805(us)
Read (node 0) took 81364 (us)
Write (node 0) took 12887 (us)
Read (node 0) took 5202 (us)
Write (node 0) took 4533 (us)
Read (node 0) took 5229 (us)
Write (node 0) took 4558 (us)
Read (node 0) took 5198 (us)
Write (node 0) took 4551 (us)
Read (node 0) took 5218 (us)
Write (node 0) took 4534 (us)

patched
-------------
Write (node 0) took 250232 (us)
Read (node 0) took 3262 (us)
Move to node 1 took 640636(us)
Move to node 0 took 449051(us)
Read (node 0) took 2966 (us)
Write (node 0) took 2720 (us)
Read (node 0) took 2891 (us)
Write (node 0) took 2560 (us)
Read (node 0) took 2899 (us)
Write (node 0) took 2568 (us)
Read (node 0) took 2890 (us)
Write (node 0) took 2568 (us)
Read (node 0) took 2897 (us)
Write (node 0) took 2563 (us)

Please feel free to add FWIW
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
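
For anyone wanting to reproduce a similar measurement, below is a minimal
sketch of the idea.  It is not the actual swap-young.c linked from the
cover letter as [1]; the file name, the build line, and the use of
libnuma's move_pages() on a two-node box are assumptions made purely for
illustration, and it only times the read side.

/*
 * swap-young-sketch.c -- illustrative only, not the real swap-young.c:
 * populate a 1G buffer, migrate it node 0 -> 1 -> 0 with move_pages(2),
 * then time one read per page so the first post-migration access pays
 * any extra cost of re-setting the young bit.
 *
 * Assumed build line (libnuma provides move_pages()):
 *   gcc -O2 -o swap-young-sketch swap-young-sketch.c -lnuma
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <numaif.h>

#define SIZE (1UL << 30)	/* 1G, as in the cover letter's test */

static long now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000L + ts.tv_nsec / 1000;
}

static void migrate_to(char *buf, size_t pagesz, int node)
{
	unsigned long n = SIZE / pagesz, i;
	void **pages = malloc(n * sizeof(*pages));
	int *nodes = malloc(n * sizeof(*nodes));
	int *status = malloc(n * sizeof(*status));

	for (i = 0; i < n; i++) {
		pages[i] = buf + i * pagesz;
		nodes[i] = node;
	}
	/* pid 0 == self; MPOL_MF_MOVE only moves pages exclusive to us */
	if (move_pages(0, n, pages, nodes, status, MPOL_MF_MOVE))
		perror("move_pages");
	free(pages);
	free(nodes);
	free(status);
}

int main(void)
{
	size_t pagesz = sysconf(_SC_PAGESIZE);
	char *buf = aligned_alloc(pagesz, SIZE);
	volatile char sink = 0;
	size_t off;
	long t;

	if (!buf)
		return 1;
	memset(buf, 1, SIZE);		/* first touch on the local node */

	migrate_to(buf, pagesz, 1);	/* node 0 -> 1 */
	migrate_to(buf, pagesz, 0);	/* node 1 -> 0 */

	t = now_us();
	for (off = 0; off < SIZE; off += pagesz)	/* page-size jumps */
		sink += buf[off];
	printf("First read after migration took %ld us\n", now_us() - t);
	(void)sink;
	return 0;
}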

> 
> Patch Layout
> ============
> 
> Patch 1-2:  Cleanups from either previous versions or on swapops.h macros.
> 
> Patch 3-4:  Prepare for the introduction of migration A/D bits
> 
> Patch 5:    The core patch to remember young/dirty bit in swap offsets.
> 
> Patch 6-7:  Cache relevant fields to make migration_entry_supports_ad() fast.
> 
> Please review, thanks.
> 
> [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
> 
> Peter Xu (7):
>    mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
>    mm/swap: Comment all the ifdef in swapops.h
>    mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
>    mm/thp: Carry over dirty bit when thp splits on pmd
>    mm: Remember young/dirty bit for page migrations
>    mm/swap: Cache maximum swapfile size when init swap
>    mm/swap: Cache swap migration A/D bits support
> 
>   arch/arm64/mm/hugetlbpage.c           |   2 +-
>   arch/x86/include/asm/pgtable-3level.h |   8 +-
>   arch/x86/mm/init.c                    |   2 +-
>   fs/proc/task_mmu.c                    |  20 +++-
>   include/linux/swapfile.h              |   5 +-
>   include/linux/swapops.h               | 145 +++++++++++++++++++++++---
>   mm/hmm.c                              |   2 +-
>   mm/huge_memory.c                      |  27 ++++-
>   mm/memory-failure.c                   |   2 +-
>   mm/migrate.c                          |   6 +-
>   mm/migrate_device.c                   |   6 ++
>   mm/page_vma_mapped.c                  |   6 +-
>   mm/rmap.c                             |   5 +-
>   mm/swapfile.c                         |  15 ++-
>   14 files changed, 214 insertions(+), 37 deletions(-)
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 0/7] mm: Remember a/d bits for migration entries
  2022-11-21  5:15 ` [PATCH v4 0/7] mm: Remember a/d bits for migration entries Raghavendra K T
@ 2022-11-21 14:57   ` Peter Xu
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2022-11-21 14:57 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-mm, linux-kernel, Hugh Dickins, Kirill A . Shutemov,
	Alistair Popple, Andrea Arcangeli, Minchan Kim, Andrew Morton,
	David Hildenbrand, Andi Kleen, Nadav Amit, Huang Ying,
	Vlastimil Babka

Hi, Raghavendra,

On Mon, Nov 21, 2022 at 10:45:45AM +0530, Raghavendra K T wrote:
> I was able to test on an AMD EPYC 64-core, 2-NUMA-node (Milan) system
> clocked at 3.72 GHz.
> 
> I am seeing a similar improvement for the test mentioned above (swap-young).
> 
> base: (6.0)
> --------------
> Write (node 0) took 562202 (us)
> Read (node 0) took 7790 (us)
> Move to node 1 took 474876(us)
> Move to node 0 took 642805(us)
> Read (node 0) took 81364 (us)
> Write (node 0) took 12887 (us)
> Read (node 0) took 5202 (us)
> Write (node 0) took 4533 (us)
> Read (node 0) took 5229 (us)
> Write (node 0) took 4558 (us)
> Read (node 0) took 5198 (us)
> Write (node 0) took 4551 (us)
> Read (node 0) took 5218 (us)
> Write (node 0) took 4534 (us)
> 
> patched
> -------------
> Write (node 0) took 250232 (us)
> Read (node 0) took 3262 (us)
> Move to node 1 took 640636(us)
> Move to node 0 took 449051(us)
> Read (node 0) took 2966 (us)
> Write (node 0) took 2720 (us)
> Read (node 0) took 2891 (us)
> Write (node 0) took 2560 (us)
> Read (node 0) took 2899 (us)
> Write (node 0) took 2568 (us)
> Read (node 0) took 2890 (us)
> Write (node 0) took 2568 (us)
> Read (node 0) took 2897 (us)
> Write (node 0) took 2563 (us)
> 
> Please feel free to add FWIW
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

The series has already landed in v6.1-rc1, so it is a bit late to apply
the Tested-by tag, but thanks a lot for your tests and the update anyway!

It seems the memory size differs between the two rounds of tests, since
even the initial write takes a different amount of time.  I think the
results still show the improvement, though, because what matters is the
first read/write after migration, which can be compared against the
2nd/3rd/... reads/writes within the same run.
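Concretely, with the numbers quoted above: the base run's first read after
migration took 81364us against roughly 5200us for the steady-state reads
that follow (about a 15x penalty), while the patched run's first read took
2966us against roughly 2890us afterwards, i.e. essentially no penalty.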

Side note: one tiny piece of the series, the handling of the dirty bit on
thp split, was actually removed (434e3d15d92b), but it looks like the root
cause of the issues on sparc64 and loongarch may have been found, so we
may have a chance to re-apply it.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-11-21 15:08 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-11 16:13 [PATCH v4 0/7] mm: Remember a/d bits for migration entries Peter Xu
2022-08-11 16:13 ` [PATCH v4 1/7] mm/x86: Use SWP_TYPE_BITS in 3-level swap macros Peter Xu
2022-08-11 16:13 ` [PATCH v4 2/7] mm/swap: Comment all the ifdef in swapops.h Peter Xu
2022-08-15  6:03   ` Alistair Popple
2022-08-11 16:13 ` [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
2022-08-12  2:33   ` Huang, Ying
2022-08-23 21:01     ` Yu Zhao
2022-08-23 22:04       ` Peter Xu
2022-08-11 16:13 ` [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd Peter Xu
2022-10-21 16:06   ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) Anatoly Pugachev
2022-10-23 13:33     ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) #forregzbot Thorsten Leemhuis
2022-11-04 10:39       ` Thorsten Leemhuis
2022-11-13 17:56         ` Thorsten Leemhuis
2022-10-23 19:52     ` dpkg fails on sparc64 (was: [PATCH v4 4/7] mm/thp: Carry over dirty bit when thp splits on pmd) Peter Xu
2022-10-25 10:22       ` Anatoly Pugachev
2022-10-25 14:43         ` Peter Xu
2022-11-01 13:13           ` Anatoly Pugachev
2022-11-02 18:34             ` Peter Xu
2022-11-02 18:47               ` Andrew Morton
2022-08-11 16:13 ` [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations Peter Xu
2022-09-11 23:48   ` Andrew Morton
2022-09-13  0:55     ` Huang, Ying
2022-08-11 16:13 ` [PATCH v4 6/7] mm/swap: Cache maximum swapfile size when init swap Peter Xu
2022-08-12  2:34   ` Huang, Ying
2022-08-11 16:13 ` [PATCH v4 7/7] mm/swap: Cache swap migration A/D bits support Peter Xu
2022-11-21  5:15 ` [PATCH v4 0/7] mm: Remember a/d bits for migration entries Raghavendra K T
2022-11-21 14:57   ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).