* [PATCH RFC 0/4] mm: Remember young bit for migration entries
@ 2022-07-29  1:40 Peter Xu
  2022-07-29  1:40 ` [PATCH RFC 1/4] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Peter Xu @ 2022-07-29  1:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Huang Ying, peterx, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

[Marking as RFC; only x86 is supported for now, plan to add a few more
 archs when there's a formal version]

Problem
=======

When migrating a page, we currently always mark the migrated page as old.
The reason is probably that we don't really know whether the page is hot or
cold, so defaulting to old (cold) was assumed to be the safer choice.

However that could lead to at least two problems:

  (1) We lose the real hot/cold information that we could have preserved.
      That information shouldn't change even if the backing page is changed
      after the migration,

  (2) There is always extra overhead on the first access to any migrated
      page, because the hardware MMU needs extra cycles to set the young
      bit again (on MMUs that support it).

Much recent upstream work has shown that (2) is not trivial and is actually
very measurable.  In my test case, reading a 1G chunk of memory - jumping
in page-size intervals - took 99ms on a generic x86_64 system just because
of the extra work to set the young bit, compared to 4ms when the young bit
was already set.  Amortized over the 262144 pages touched, that's roughly
380ns versus 15ns per page.

This issue was originally reported by Andrea Arcangeli.

Solution
========

To solve this problem, this patchset tries to remember the young bit in the
migration entries and carry it over when recovering the ptes.

We have the chance to do so because on many systems the swap offset is not
fully used.  Migration entries use the swp offset to store a PFN only, and
a PFN normally needs fewer bits than the swp offset provides.  That means
we do have some free bits in the swp offset that we can use to store things
like the young bit, and that's how this series approaches the problem.

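As a back-of-the-envelope example (the exact numbers are arch- and
config-dependent, so treat them as illustrative): on x86_64 with 4-level
page tables, MAX_PHYSMEM_BITS is 46 and PAGE_SHIFT is 12, so a PFN needs at
most 46 - 12 = 34 bits, while the swap pte leaves roughly 50 bits for the
swp offset - leaving on the order of 16 spare bits.
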
One tricky thing here is that even though we're embedding the information
into the swap entry, which looks like a very generic data structure, the
number of free bits is still arch dependent - not only because the size of
swp_entry_t differs, but also because of the different layouts of swap ptes
on different archs.

This series therefore requires each arch to define an extra macro called
__ARCH_SWP_OFFSET_BITS, representing the size of the swp offset.  With this
information, the swap logic knows whether there are spare bits to use, and
if so it remembers the young bit when possible.  By default, it keeps the
old behavior of marking all migrated pages cold.

Tests
=====

With the patchset applied, the immediate read access test [1] of the 1G
chunk above shrinks from 99ms to 4ms after migration.  The test moves 1G of
pages from node 0->1->0 and then reads it in page-size jumps.

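The test linked at [1] is the authoritative version; what follows is only a
minimal sketch of the same idea (it assumes a two-node system, 4K pages and
libnuma installed - build with -lnuma; names and error handling are
simplified):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK	(1UL << 30)	/* 1G */
#define PSIZE	4096UL		/* page size, assumed 4K */

/* Migrate every page of buf to the given node using move_pages(2) */
static void move_to_node(char *buf, int node)
{
	unsigned long i, npages = CHUNK / PSIZE;
	void **pages = malloc(npages * sizeof(void *));
	int *nodes = malloc(npages * sizeof(int));
	int *status = malloc(npages * sizeof(int));

	for (i = 0; i < npages; i++) {
		pages[i] = buf + i * PSIZE;
		nodes[i] = node;
	}
	if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	free(pages);
	free(nodes);
	free(status);
}

int main(void)
{
	char *buf = aligned_alloc(PSIZE, CHUNK);
	struct timespec t1, t2;
	volatile char c;
	unsigned long i;

	memset(buf, 1, CHUNK);	/* populate; ptes start out young */
	move_to_node(buf, 1);	/* node 0 -> 1 */
	move_to_node(buf, 0);	/* node 1 -> 0 */

	/* Time the immediate read access, jumping in page-size intervals */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (i = 0; i < CHUNK; i += PSIZE)
		c = buf[i];
	clock_gettime(CLOCK_MONOTONIC, &t2);

	printf("stride read: %ld us\n",
	       (t2.tv_sec - t1.tv_sec) * 1000000 +
	       (t2.tv_nsec - t1.tv_nsec) / 1000);
	return 0;
}
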
For this series, __ARCH_SWP_OFFSET_BITS is currently defined only on x86,
and the series has only been tested on x86_64 with an Intel(R) Xeon(R) CPU
E5-2630 v4 @ 2.20GHz.

Patch Layout
============

Patch 1:  Add swp_offset_pfn() and apply it to all pfn swap entries.  We
          should also stop treating swp_offset() as the PFN, because it can
          contain more information starting from the next patch.
Patch 2:  The core patch to remember young bit in swap offsets.
Patch 3:  A cleanup for x86 32 bits pgtable.h.
Patch 4:  Define __ARCH_SWP_OFFSET_BITS on x86, enabling the young bit for
          migration entries.

Please review, thanks.

[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c

Peter Xu (4):
  mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  mm: Remember young bit for page migrations
  mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
  mm/x86: Define __ARCH_SWP_OFFSET_BITS

 arch/arm64/mm/hugetlbpage.c           |  2 +-
 arch/x86/include/asm/pgtable-2level.h |  6 ++
 arch/x86/include/asm/pgtable-3level.h | 15 +++--
 arch/x86/include/asm/pgtable_64.h     |  5 ++
 include/linux/swapops.h               | 85 +++++++++++++++++++++++++--
 mm/hmm.c                              |  2 +-
 mm/huge_memory.c                      | 10 +++-
 mm/memory-failure.c                   |  2 +-
 mm/migrate.c                          |  4 +-
 mm/migrate_device.c                   |  2 +
 mm/page_vma_mapped.c                  |  6 +-
 mm/rmap.c                             |  3 +-
 12 files changed, 122 insertions(+), 20 deletions(-)

-- 
2.32.0



* [PATCH RFC 1/4] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  2022-07-29  1:40 [PATCH RFC 0/4] mm: Remember young bit for migration entries Peter Xu
@ 2022-07-29  1:40 ` Peter Xu
  2022-08-01  3:13   ` Huang, Ying
  2022-07-29  1:40 ` [PATCH RFC 2/4] mm: Remember young bit for page migrations Peter Xu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2022-07-29  1:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Huang Ying, peterx, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

We've got a bunch of special swap entries that store a PFN inside the swap
offset field.  To fetch the PFN, the user normally just calls swp_offset(),
assuming the result will be the PFN.

Add a helper swp_offset_pfn() to fetch the PFN instead.  It masks out only
the bits a PFN can occupy on the host, while a BUILD_BUG_ON() in
is_pfn_swap_entry() checks against MAX_PHYSMEM_BITS to make sure the swap
offset can always store the PFNs properly.

One reason to do this is that we never sanitized whether the swap offset
can really fit a PFN.  Meanwhile, this patch also prepares for the future
possibility of storing more information inside the swp offset field, after
which the assumption that "swp_offset(entry)" is the PFN will no longer
hold.

Replace the swp_offset() callers with swp_offset_pfn() where appropriate.
Note that many of the existing users are not candidates for the
replacement, e.g.:

  (1) When the swap entry is not a pfn swap entry at all, or,
  (2) when we want to keep the whole swp offset but only change the swp type.

The latter can happen when fork() triggers on a write-migration swap entry
pte: we may want to change only the migration type from write to read but
keep the rest, so it's not "fetching the PFN" but "changing the swap type
only".  Those callers are left alone so that when there is more information
within the swp offset it will be carried over naturally in those cases.

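As an illustration of the latter case, the fork() path does roughly the
following (a simplified sketch of the copy_nonpresent_pte() logic, not the
verbatim kernel code):

	if (!is_readable_migration_entry(entry) && is_cow_mapping(vm_flags)) {
		/*
		 * Demote write->read, but keep the rest of the swp offset
		 * intact so that any extra bits stored there are carried
		 * over as-is.
		 */
		entry = make_readable_migration_entry(swp_offset(entry));
		pte = swp_entry_to_pte(entry);
		if (pte_swp_soft_dirty(*src_pte))
			pte = pte_swp_mksoft_dirty(pte);
		set_pte_at(src_mm, addr, src_pte, pte);
	}
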
While at it, drop hwpoison_entry_to_pfn(), because that's exactly what the
new swp_offset_pfn() does.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/mm/hugetlbpage.c |  2 +-
 include/linux/swapops.h     | 28 ++++++++++++++++++++++------
 mm/hmm.c                    |  2 +-
 mm/memory-failure.c         |  2 +-
 mm/page_vma_mapped.c        |  6 +++---
 5 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 7430060cb0d6..f897d40821dd 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -242,7 +242,7 @@ static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
 {
 	VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
 
-	return page_folio(pfn_to_page(swp_offset(entry)));
+	return page_folio(pfn_to_page(swp_offset_pfn(entry)));
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index a3d435bf9f97..5378f77860fb 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -23,6 +23,14 @@
 #define SWP_TYPE_SHIFT	(BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
 #define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)
 
+/*
+ * Definitions only for PFN swap entries (see is_pfn_swap_entry()).  To
+ * store PFN, we only need SWP_PFN_BITS bits.  Each of the pfn swap entries
+ * can use the extra bits to store other information besides PFN.
+ */
+#define SWP_PFN_BITS			(MAX_PHYSMEM_BITS - PAGE_SHIFT)
+#define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
+
 /* Clear all flags but only keep swp_entry_t related information */
 static inline pte_t pte_swp_clear_flags(pte_t pte)
 {
@@ -64,6 +72,16 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
 	return entry.val & SWP_OFFSET_MASK;
 }
 
+/*
+ * This should only be called upon a pfn swap entry to get the PFN stored
+ * in the swap entry.  Please refer to is_pfn_swap_entry() for the
+ * definition of a pfn swap entry.
+ */
+static inline unsigned long swp_offset_pfn(swp_entry_t entry)
+{
+	return swp_offset(entry) & SWP_PFN_MASK;
+}
+
 /* check whether a pte points to a swap entry */
 static inline int is_swap_pte(pte_t pte)
 {
@@ -369,7 +387,7 @@ static inline int pte_none_mostly(pte_t pte)
 
 static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 {
-	struct page *p = pfn_to_page(swp_offset(entry));
+	struct page *p = pfn_to_page(swp_offset_pfn(entry));
 
 	/*
 	 * Any use of migration entries may only occur while the
@@ -387,6 +405,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
  */
 static inline bool is_pfn_swap_entry(swp_entry_t entry)
 {
+	/* Make sure the swp offset can always store the needed fields */
+	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
+
 	return is_migration_entry(entry) || is_device_private_entry(entry) ||
 	       is_device_exclusive_entry(entry);
 }
@@ -475,11 +496,6 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
 	return swp_type(entry) == SWP_HWPOISON;
 }
 
-static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
-{
-	return swp_offset(entry);
-}
-
 static inline void num_poisoned_pages_inc(void)
 {
 	atomic_long_inc(&num_poisoned_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index f2aa63b94d9b..3850fb625dda 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -253,7 +253,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			cpu_flags = HMM_PFN_VALID;
 			if (is_writable_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
-			*hmm_pfn = swp_offset(entry) | cpu_flags;
+			*hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
 			return 0;
 		}
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index cc6fc9be8d22..e451219124dd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -632,7 +632,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 		swp_entry_t swp = pte_to_swp_entry(pte);
 
 		if (is_hwpoison_entry(swp))
-			pfn = hwpoison_entry_to_pfn(swp);
+			pfn = swp_offset_pfn(swp);
 	}
 
 	if (!pfn || pfn != poisoned_pfn)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8e9e574d535a..93e13fc17d3c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -86,7 +86,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		    !is_device_exclusive_entry(entry))
 			return false;
 
-		pfn = swp_offset(entry);
+		pfn = swp_offset_pfn(entry);
 	} else if (is_swap_pte(*pvmw->pte)) {
 		swp_entry_t entry;
 
@@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		    !is_device_exclusive_entry(entry))
 			return false;
 
-		pfn = swp_offset(entry);
+		pfn = swp_offset_pfn(entry);
 	} else {
 		if (!pte_present(*pvmw->pte))
 			return false;
@@ -221,7 +221,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 					return not_found(pvmw);
 				entry = pmd_to_swp_entry(pmde);
 				if (!is_migration_entry(entry) ||
-				    !check_pmd(swp_offset(entry), pvmw))
+				    !check_pmd(swp_offset_pfn(entry), pvmw))
 					return not_found(pvmw);
 				return true;
 			}
-- 
2.32.0



* [PATCH RFC 2/4] mm: Remember young bit for page migrations
  2022-07-29  1:40 [PATCH RFC 0/4] mm: Remember young bit for migration entries Peter Xu
  2022-07-29  1:40 ` [PATCH RFC 1/4] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
@ 2022-07-29  1:40 ` Peter Xu
  2022-07-29  1:40 ` [PATCH RFC 3/4] mm/x86: Use SWP_TYPE_BITS in 3-level swap macros Peter Xu
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-07-29  1:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Huang Ying, peterx, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

When page migration happens, we currently always ignore the young bit
setting in the old pgtable and mark the page as old in the new page table,
using either pte_mkold() or pmd_mkold().

That's fine functionally, but it's not friendly to page reclaim, because
the page being moved may be under active access during the whole procedure.

Actually we can easily remember the young bit and recover that information
when the page is migrated.  To achieve this, define a new bit in the
migration swap offset field that records whether the old pte had the young
bit set.  Then, when removing/recovering the migration entry, we can
restore the young bit even if the backing page changed.

One thing to mention is that the whole feature relies on an arch-specific
macro, __ARCH_SWP_OFFSET_BITS, that needs to be defined per arch.  The
macro tells how many bits are available in the arch-specific swp offset
field.  When the macro is not defined, we assume there are no free bits in
the migration swap entry offset, so we can't persist the young bit.

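With free bits available, the migration swp offset ends up laid out as
below (a sketch based on the definitions in this patch; the young bit sits
directly above the PFN field):

	|<-- free bits -->|Y|<------- SWP_PFN_BITS ------->|
	                   ^
	                   SWP_MIG_YOUNG_BIT = 1UL << SWP_PFN_BITS
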
So far this patch brings no functional change at all, since no arch has
defined __ARCH_SWP_OFFSET_BITS yet.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 57 +++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c        | 10 ++++++--
 mm/migrate.c            |  4 ++-
 mm/migrate_device.c     |  2 ++
 mm/rmap.c               |  3 ++-
 5 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5378f77860fb..3bbb57aa6742 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -31,6 +31,28 @@
 #define SWP_PFN_BITS			(MAX_PHYSMEM_BITS - PAGE_SHIFT)
 #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
 
+#ifdef __ARCH_SWP_OFFSET_BITS
+#define  SWP_PFN_OFFSET_FREE_BITS	(__ARCH_SWP_OFFSET_BITS - SWP_PFN_BITS)
+#else
+/*
+ * If __ARCH_SWP_OFFSET_BITS is not defined, assume we don't have free
+ * bits, to be on the safe side.
+ */
+#define  SWP_PFN_OFFSET_FREE_BITS	0
+#endif
+
+/**
+ * Migration swap entry specific bitfield definitions.
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
+ *
+ * Note: these bits are used only if there are free bits in the arch
+ * specific swp offset field.  An arch needs __ARCH_SWP_OFFSET_BITS
+ * defined to use the bits/features.
+ */
+#define SWP_MIG_YOUNG_BIT		(1UL << SWP_PFN_BITS)
+#define SWP_MIG_OFFSET_BITS		(SWP_PFN_BITS + 1)
+
 /* Clear all flags but only keep swp_entry_t related information */
 static inline pte_t pte_swp_clear_flags(pte_t pte)
 {
@@ -258,6 +280,30 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 	return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	/*
+	 * Due to a limitation on x86_64 we can't use #ifdef, as
+	 * SWP_PFN_OFFSET_FREE_BITS value can be changed dynamically for
+	 * 4/5 level pgtables.  For all the non-x86_64 archs (where the
+	 * macro MAX_PHYSMEM_BITS is constant) this branching should be
+	 * optimized out by the compiler.
+	 */
+	if (SWP_PFN_OFFSET_FREE_BITS)
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_YOUNG_BIT);
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	/* Please refer to comment in make_migration_entry_young() */
+	if (SWP_PFN_OFFSET_FREE_BITS)
+		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
+	/* Keep the old behavior of aging page after migration */
+	return false;
+}
+
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl);
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -304,6 +350,16 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
 	return 0;
 }
 
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	return false;
+}
+
 #endif
 
 typedef unsigned long pte_marker;
@@ -407,6 +463,7 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
 {
 	/* Make sure the swp offset can always store the needed fields */
 	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
+	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_MIG_OFFSET_BITS);
 
 	return is_migration_entry(entry) || is_device_private_entry(entry) ||
 	       is_device_exclusive_entry(entry);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 29e3628687a6..131fe5754d8f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2088,7 +2088,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_writable_migration_entry(entry);
 		if (PageAnon(page))
 			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = false;
+		young = is_migration_entry_young(entry);
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2146,6 +2146,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			else
 				swp_entry = make_readable_migration_entry(
 							page_to_pfn(page + i));
+			if (young)
+				swp_entry = make_migration_entry_young(swp_entry);
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
@@ -3148,6 +3150,8 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
 	else
 		entry = make_readable_migration_entry(page_to_pfn(page));
+	if (pmd_young(pmdval))
+		entry = make_migration_entry_young(entry);
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3173,13 +3177,15 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	get_page(new);
-	pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+	if (!is_migration_entry_young(entry))
+		pmde = pmd_mkold(pmde);
 
 	if (PageAnon(new)) {
 		rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 1649270bc1a7..62cb3a9451de 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -199,7 +199,7 @@ static bool remove_migration_pte(struct folio *folio,
 #endif
 
 		folio_get(folio);
-		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
 		if (pte_swp_soft_dirty(*pvmw.pte))
 			pte = pte_mksoft_dirty(pte);
 
@@ -207,6 +207,8 @@ static bool remove_migration_pte(struct folio *folio,
 		 * Recheck VMA as permissions can change since migration started
 		 */
 		entry = pte_to_swp_entry(*pvmw.pte);
+		if (!is_migration_entry_young(entry))
+			pte = pte_mkold(pte);
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 7feeb447e3b9..fd8daf45c1a6 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(page));
+			if (pte_young(pte))
+				entry = make_migration_entry_young(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_present(pte)) {
 				if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..605fb37ae95e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-
+			if (pte_young(pteval))
+				entry = make_migration_entry_young(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-- 
2.32.0



* [PATCH RFC 3/4] mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
  2022-07-29  1:40 [PATCH RFC 0/4] mm: Remember young bit for migration entries Peter Xu
  2022-07-29  1:40 ` [PATCH RFC 1/4] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
  2022-07-29  1:40 ` [PATCH RFC 2/4] mm: Remember young bit for page migrations Peter Xu
@ 2022-07-29  1:40 ` Peter Xu
  2022-07-29  1:40 ` [PATCH RFC 4/4] mm/x86: Define __ARCH_SWP_OFFSET_BITS Peter Xu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-07-29  1:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Huang Ying, peterx, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

Replace all the magic "5"s with the SWP_TYPE_BITS macro.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable-3level.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24..28421a887209 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -256,10 +256,10 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
 /* We always extract/encode the offset by shifting it all the way up, and then down again */
 #define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
 
-#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
-#define __swp_type(x)			(((x).val) & 0x1f)
-#define __swp_offset(x)			((x).val >> 5)
-#define __swp_entry(type, offset)	((swp_entry_t){(type) | (offset) << 5})
+#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
+#define __swp_type(x)			(((x).val) & ((1UL << SWP_TYPE_BITS) - 1))
+#define __swp_offset(x)			((x).val >> SWP_TYPE_BITS)
+#define __swp_entry(type, offset)	((swp_entry_t){(type) | (offset) << SWP_TYPE_BITS})
 
 /*
  * Normally, __swp_entry() converts from arch-independent swp_entry_t to
-- 
2.32.0



* [PATCH RFC 4/4] mm/x86: Define __ARCH_SWP_OFFSET_BITS
  2022-07-29  1:40 [PATCH RFC 0/4] mm: Remember young bit for migration entries Peter Xu
                   ` (2 preceding siblings ...)
  2022-07-29  1:40 ` [PATCH RFC 3/4] mm/x86: Use SWP_TYPE_BITS in 3-level swap macros Peter Xu
@ 2022-07-29  1:40 ` Peter Xu
  2022-07-29 17:07 ` [PATCH RFC 0/4] mm: Remember young bit for migration entries Nadav Amit
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-07-29  1:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Huang Ying, peterx, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

This enables the new migration young bit on all x86 systems, both 32-bit
and 64-bit (including PAE).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable-2level.h | 6 ++++++
 arch/x86/include/asm/pgtable-3level.h | 7 +++++++
 arch/x86/include/asm/pgtable_64.h     | 5 +++++
 3 files changed, 18 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 60d0f9015317..6e70833feb69 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -95,6 +95,12 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
+/*
+ * This defines how many bits we have in the arch specific swp offset.
+ * On vanilla 32-bit systems the pte and the swap entry have the same size.
+ */
+#define __ARCH_SWP_OFFSET_BITS	(sizeof(swp_entry_t) * BITS_PER_BYTE - SWP_TYPE_BITS)
+
 /* No inverted PFNs on 2 level page tables */
 
 static inline u64 protnone_mask(u64 val)
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 28421a887209..8dbf29b51f8b 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -287,6 +287,13 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
 #define __pte_to_swp_entry(pte)	(__swp_entry(__pteval_swp_type(pte), \
 					     __pteval_swp_offset(pte)))
 
+/*
+ * This defines how many bits we have in the arch specific swp offset.
+ * Here, since we're putting a 32-bit swap entry into a 64-bit pte, the
+ * limit is the size of the 32-bit swap entry minus the swap type field.
+ */
+#define __ARCH_SWP_OFFSET_BITS	(sizeof(swp_entry_t) * BITS_PER_BYTE - SWP_TYPE_BITS)
+
 #include <asm/pgtable-invert.h>
 
 #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index e479491da8d5..1714f0ded1db 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -217,6 +217,11 @@ static inline void native_pgd_clear(pgd_t *pgd)
 
 /* We always extract/encode the offset by shifting it all the way up, and then down again */
 #define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
+/*
+ * This defines how many bits we have in the arch specific swp offset.
+ * 64-bit systems have both swp_entry_t and the pte at 64 bits.
+ */
+#define __ARCH_SWP_OFFSET_BITS	(BITS_PER_LONG - SWP_OFFSET_SHIFT)
 
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
 
-- 
2.32.0



* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-07-29  1:40 [PATCH RFC 0/4] mm: Remember young bit for migration entries Peter Xu
                   ` (3 preceding siblings ...)
  2022-07-29  1:40 ` [PATCH RFC 4/4] mm/x86: Define __ARCH_SWP_OFFSET_BITS Peter Xu
@ 2022-07-29 17:07 ` Nadav Amit
  2022-07-29 22:43   ` Peter Xu
  2022-08-01  3:20 ` Huang, Ying
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Nadav Amit @ 2022-07-29 17:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux MM, LKML, Huang Ying, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

On Jul 28, 2022, at 6:40 PM, Peter Xu <peterx@redhat.com> wrote:

> [Marking as RFC; only x86 is supported for now, plan to add a few more
> archs when there's a formal version]
> 
> Problem
> =======
> 
> When migrating a page, we currently always mark the migrated page as old.
> The reason is probably that we don't really know whether the page is hot or
> cold, so defaulting to old (cold) was assumed to be the safer choice.

Looks good to me.

I just wonder whether the order of the patches should be different. I always
understood that separating the “enabling” patch from the others is not a
good practice, since it complicates bisection. I guess it is more of a minor
issue for such a small patch-set…



* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-07-29 17:07 ` [PATCH RFC 0/4] mm: Remember young bit for migration entries Nadav Amit
@ 2022-07-29 22:43   ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-07-29 22:43 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Huang Ying, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

On Fri, Jul 29, 2022 at 10:07:02AM -0700, Nadav Amit wrote:
> On Jul 28, 2022, at 6:40 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > [Marking as RFC; only x86 is supported for now, plan to add a few more
> > archs when there's a formal version]
> > 
> > Problem
> > =======
> > 
> > When migrating a page, we currently always mark the migrated page as old.
> > The reason is probably that we don't really know whether the page is hot or
> > cold, so defaulting to old (cold) was assumed to be the safer choice.
> 
> Looks good to me.

Thanks for the quick review comment, Nadav.

> 
> I just wonder whether the order of the patches should be different. I always
> understood that separating the “enabling” patch from the others is not a
> good practice, since it complicates bisection. I guess it is more of a minor
> issue for such a small patch-set…

Yeah, I guess you mean that when a bunch of patches form one feature, we
may want to be able to know which part of the feature breaks something.
But as you mentioned, this feature is mostly implemented in patch 2 only.

I can squash the enablement patch into the same patch, but when it comes to
more archs it also means I'll squash all the archs into the same patch.
I'm just afraid that would complicate the patch too much - I'd expect the
swp offset calculation for each arch to be not entirely straightforward, so
it would be good if they could be reviewed separately and carefully.

-- 
Peter Xu



* Re: [PATCH RFC 1/4] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  2022-07-29  1:40 ` [PATCH RFC 1/4] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
@ 2022-08-01  3:13   ` Huang, Ying
  2022-08-01 22:29     ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2022-08-01  3:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

Peter Xu <peterx@redhat.com> writes:

> [...]
>
> + * This should only be called upon a pfn swap entry to get the PFN stored
> + * in the swap entry.  Please refer to is_pfn_swap_entry() for the
> + * definition of a pfn swap entry.
> + */
> +static inline unsigned long swp_offset_pfn(swp_entry_t entry)
> +{

Would it be good to call is_pfn_swap_entry() here as a debug check that
can be eliminated in the production kernel?

> +	return swp_offset(entry) & SWP_PFN_MASK;
> +}
> +
> [...]
>
>  static inline bool is_pfn_swap_entry(swp_entry_t entry)
>  {
> +	/* Make sure the swp offset can always store the needed fields */
> +	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);

	BUILD_BUG_ON(SWP_TYPE_SHIFT <= SWP_PFN_BITS);

?

Best Regards,
Huang, Ying


* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-07-29  1:40 [PATCH RFC 0/4] mm: Remember young bit for migration entries Peter Xu
                   ` (4 preceding siblings ...)
  2022-07-29 17:07 ` [PATCH RFC 0/4] mm: Remember young bit for migration entries Nadav Amit
@ 2022-08-01  3:20 ` Huang, Ying
  2022-08-01  5:33 ` Huang, Ying
  2022-08-01  8:21 ` David Hildenbrand
  7 siblings, 0 replies; 21+ messages in thread
From: Huang, Ying @ 2022-08-01  3:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

Peter Xu <peterx@redhat.com> writes:

> [...]
>
> Much recent upstream work has shown that (2) is not trivial and is actually
> very measurable.  In my test case, reading a 1G chunk of memory - jumping
> in page-size intervals - took 99ms on a generic x86_64 system just because
> of the extra work to set the young bit, compared to 4ms when the young bit
> was already set.

LKP has observed that before too, as in the following reports and
discussion.

https://lore.kernel.org/all/87bn35zcko.fsf@yhuang-dev.intel.com/t/

Best Regards,
Huang, Ying


* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-07-29  1:40 [PATCH RFC 0/4] mm: Remember young bit for migration entries Peter Xu
                   ` (5 preceding siblings ...)
  2022-08-01  3:20 ` Huang, Ying
@ 2022-08-01  5:33 ` Huang, Ying
  2022-08-01 22:25   ` Peter Xu
  2022-08-01  8:21 ` David Hildenbrand
  7 siblings, 1 reply; 21+ messages in thread
From: Huang, Ying @ 2022-08-01  5:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka, Andi Kleen

Peter Xu <peterx@redhat.com> writes:

> [...]
>
> One tricky thing here is that even though we're embedding the information
> into the swap entry, which looks like a very generic data structure, the
> number of free bits is still arch dependent - not only because the size of
> swp_entry_t differs, but also because of the different layouts of swap ptes
> on different archs.

If my understanding is correct, max_swapfile_size() provides a mechanism
to identify the available bits with both swp_entry_t and the swap PTE
considered.  We may be able to take advantage of that?

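The generic part of that mechanism round-trips an all-ones offset through
the arch's swap pte encoding to see how many offset bits survive - roughly
like below (a sketch, not the exact code of this tree):

	unsigned long generic_max_swapfile_size(void)
	{
		return swp_offset(pte_to_swp_entry(
				swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
	}

Archs (e.g. x86) can override it to restrict the result further.
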
And according to commit 377eeaa8e11f ("x86/speculation/l1tf: Limit swap
file size to MAX_PA/2"), the highest bit of swap offset needs to be 0 if
L1TF mitigation is enabled.

Cced Andi for confirmation.

Best Regards,
Huang, Ying


* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-07-29  1:40 [PATCH RFC 0/4] mm: Remember young bit for migration entries Peter Xu
                   ` (6 preceding siblings ...)
  2022-08-01  5:33 ` Huang, Ying
@ 2022-08-01  8:21 ` David Hildenbrand
  2022-08-01 22:35   ` Peter Xu
  7 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2022-08-01  8:21 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Huang Ying, Andrea Arcangeli, Andrew Morton, Kirill A . Shutemov,
	Nadav Amit, Hugh Dickins, Vlastimil Babka

On 29.07.22 03:40, Peter Xu wrote:
> [...]
>
> This series therefore requires each arch to define an extra macro called
> __ARCH_SWP_OFFSET_BITS, representing the size of the swp offset.  With this
> information, the swap logic knows whether there are spare bits to use, and
> if so it remembers the young bit when possible.  By default, it keeps the
> old behavior of marking all migrated pages cold.
> 


I played with a similar idea when working on pte_swp_exclusive() but
gave up, because it ended up looking too hacky. Looking at patch #2, I
get the same feeling again. Kind of hacky.


If we mostly only care about x86_64, and it's a performance improvement
after all, why not simply do it like
pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?


-- 
Thanks,

David / dhildenb



* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-08-01  5:33 ` Huang, Ying
@ 2022-08-01 22:25   ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-08-01 22:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka, Andi Kleen

On Mon, Aug 01, 2022 at 01:33:28PM +0800, Huang, Ying wrote:
> If my understanding is correct, max_swapfile_size() provides a mechanism
> to identify the available bits with both swp_entry_t and the swap PTE
> considered.  We may be able to take advantage of that?

That's an interesting trick, I'll have a closer look, thanks for the
pointer!

> 
> And according to commit 377eeaa8e11f ("x86/speculation/l1tf: Limit swap
> file size to MAX_PA/2"), the highest bit of swap offset needs to be 0 if
> L1TF mitigation is enabled.
> 
> Cced Andi for confirmation.

Yeah, it'll be great to have a confirmation; hopefully max_swapfile_size()
already covers that case.

-- 
Peter Xu



* Re: [PATCH RFC 1/4] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  2022-08-01  3:13   ` Huang, Ying
@ 2022-08-01 22:29     ` Peter Xu
  2022-08-02  1:22       ` Huang, Ying
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2022-08-01 22:29 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

On Mon, Aug 01, 2022 at 11:13:58AM +0800, Huang, Ying wrote:
> > +/*
> > + * This should only be called upon a pfn swap entry to get the PFN stored
> > + * in the swap entry.  Please refer to is_pfn_swap_entry() for the
> > + * definition of a pfn swap entry.
> > + */
> > +static inline unsigned long swp_offset_pfn(swp_entry_t entry)
> > +{
> 
> Would it be good to call is_pfn_swap_entry() here as a debug check that
> can be eliminated in the production kernel?

Sure, I'll add a VM_BUG_ON() there in the next spin.

> 
> > +	return swp_offset(entry) & SWP_PFN_MASK;
> > +}
> > +
> >  /* check whether a pte points to a swap entry */
> >  static inline int is_swap_pte(pte_t pte)
> >  {
> > @@ -369,7 +387,7 @@ static inline int pte_none_mostly(pte_t pte)
> >  
> >  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> >  {
> > -	struct page *p = pfn_to_page(swp_offset(entry));
> > +	struct page *p = pfn_to_page(swp_offset_pfn(entry));
> >  
> >  	/*
> >  	 * Any use of migration entries may only occur while the
> > @@ -387,6 +405,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> >   */
> >  static inline bool is_pfn_swap_entry(swp_entry_t entry)
> >  {
> > +	/* Make sure the swp offset can always store the needed fields */
> > +	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
> 
> 	BUILD_BUG_ON(SWP_TYPE_SHIFT <= SWP_PFN_BITS);

Logically it's okay to have SWP_TYPE_SHIFT == SWP_PFN_BITS?

Thanks,

-- 
Peter Xu



* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-08-01  8:21 ` David Hildenbrand
@ 2022-08-01 22:35   ` Peter Xu
  2022-08-02 12:06     ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2022-08-01 22:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Andrew Morton, Kirill A . Shutemov, Nadav Amit, Hugh Dickins,
	Vlastimil Babka

On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
> On 29.07.22 03:40, Peter Xu wrote:
> > [...]
> 
> 
> I played with a similar idea when working on pte_swp_exclusive() but
> gave up, because it ended up looking too hacky. Looking at patch #2, I
> get the same feeling again. Kind of hacky.

Could you explain which part you find "hacky"?

I used the swap entry to avoid per-arch operations. I failed to figure
out a common way to know the swp offset length, so unluckily this RFC
still needs one macro per arch.  Ying's suggestion seems like a good fit
here to remove that last arch-specific dependency.
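
For the archives, my understanding of that suggestion as a hypothetical
sketch (the helper name and the exact bound are mine, not from the
series): probe at runtime whether the swp offset has room above the PFN,
using the arch's maximum swapfile size as the bound of the offset field.

	static inline bool migration_entry_supports_young(void)
	{
		/*
		 * max_swapfile_size() reflects how wide the swp offset
		 * field really is on this arch; if it can hold one bit
		 * more than the PFN, there is room for the young bit.
		 */
		return max_swapfile_size() >= (1UL << (SWP_PFN_BITS + 1));
	}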

> 
> 
> If we mostly only care about x86_64, and it's a performance improvement
> after all, why not simply do it like
> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?

Page migration works for most archs, and I want to have it work for all
archs that can easily benefit from it.

Besides, I actually have a question on the anon exclusive bit in the swap
pte: since we have that anyway, why do we need a specific migration type
for anon exclusive pages?  Can they simply be read migration entries with
the anon exclusive bit set?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 1/4] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  2022-08-01 22:29     ` Peter Xu
@ 2022-08-02  1:22       ` Huang, Ying
  0 siblings, 0 replies; 21+ messages in thread
From: Huang, Ying @ 2022-08-02  1:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Andrew Morton,
	Kirill A . Shutemov, Nadav Amit, Hugh Dickins, David Hildenbrand,
	Vlastimil Babka

Peter Xu <peterx@redhat.com> writes:

> On Mon, Aug 01, 2022 at 11:13:58AM +0800, Huang, Ying wrote:
>> > +/*
>> > + * This should only be called upon a pfn swap entry to get the PFN stored
>> > + * in the swap entry.  Please refer to is_pfn_swap_entry() for definition
>> > + * of pfn swap entry.
>> > + */
>> > +static inline unsigned long swp_offset_pfn(swp_entry_t entry)
>> > +{
>> 
>> Is it good to call is_pfn_swap_entry() here for debug that can be
>> eliminated in the production kernel?
>
> Sure, I'll add a VM_BUG_ON() there in the next spin.
>
>> 
>> > +	return swp_offset(entry) & SWP_PFN_MASK;
>> > +}
>> > +
>> >  /* check whether a pte points to a swap entry */
>> >  static inline int is_swap_pte(pte_t pte)
>> >  {
>> > @@ -369,7 +387,7 @@ static inline int pte_none_mostly(pte_t pte)
>> >  
>> >  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
>> >  {
>> > -	struct page *p = pfn_to_page(swp_offset(entry));
>> > +	struct page *p = pfn_to_page(swp_offset_pfn(entry));
>> >  
>> >  	/*
>> >  	 * Any use of migration entries may only occur while the
>> > @@ -387,6 +405,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
>> >   */
>> >  static inline bool is_pfn_swap_entry(swp_entry_t entry)
>> >  {
>> > +	/* Make sure the swp offset can always store the needed fields */
>> > +	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
>> 
>> 	BUILD_BUG_ON(SWP_TYPE_SHIFT <= SWP_PFN_BITS);
>
> Logically it's okay to have SWP_TYPE_SHIFT==SWP_PFN_BITS?

Sorry, I misunderstood the original code.  Please ignore this comment.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-08-01 22:35   ` Peter Xu
@ 2022-08-02 12:06     ` David Hildenbrand
  2022-08-02 20:14       ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2022-08-02 12:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Andrew Morton, Kirill A . Shutemov, Nadav Amit, Hugh Dickins,
	Vlastimil Babka

On 02.08.22 00:35, Peter Xu wrote:
> On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
>> On 29.07.22 03:40, Peter Xu wrote:
>>> [Marking as RFC; only x86 is supported for now, plan to add a few more
>>>  archs when there's a formal version]
>>>
>>> Problem
>>> =======
>>>
>>> When migrate a page, right now we always mark the migrated page as old.
>>> The reason could be that we don't really know whether the page is hot or
>>> cold, so we could have taken it a default negative assuming that's safer.
>>>
>>> However that could lead to at least two problems:
>>>
>>>   (1) We lost the real hot/cold information while we could have persisted.
>>>       That information shouldn't change even if the backing page is changed
>>>       after the migration,
>>>
>>>   (2) There can be always extra overhead on the immediate next access to
>>>       any migrated page, because hardware MMU needs cycles to set the young
>>>       bit again (as long as the MMU supports).
>>>
>>> Many of the recent upstream works showed that (2) is not something trivial
>>> and actually very measurable.  In my test case, reading 1G chunk of memory
>>> - jumping in page size intervals - could take 99ms just because of the
>>> extra setting on the young bit on a generic x86_64 system, comparing to 4ms
>>> if young set.
>>>
>>> This issue is originally reported by Andrea Arcangeli.
>>>
>>> Solution
>>> ========
>>>
>>> To solve this problem, this patchset tries to remember the young bit in the
>>> migration entries and carry it over when recovering the ptes.
>>>
>>> We have the chance to do so because in many systems the swap offset is not
>>> really fully used.  Migration entries use swp offset to store PFN only,
>>> while the PFN is normally not as large as swp offset and normally smaller.
>>> It means we do have some free bits in swp offset that we can use to store
>>> things like young, and that's how this series tried to approach this
>>> problem.
>>>
>>> One tricky thing here is even though we're embedding the information into
>>> swap entry which seems to be a very generic data structure, the number of
>>> bits that are free is still arch dependent.  Not only because the size of
>>> swp_entry_t differs, but also due to the different layouts of swap ptes on
>>> different archs.
>>>
>>> Here, this series requires specific arch to define an extra macro called
>>> __ARCH_SWP_OFFSET_BITS represents the size of swp offset.  With this
>>> information, the swap logic can know whether there's extra bits to use,
>>> then it'll remember the young bits when possible.  By default, it'll keep
>>> the old behavior of keeping all migrated pages cold.
>>>
>>
>>
>> I played with a similar idea when working on pte_swp_exclusive() but
>> gave up, because it ended up looking too hacky. Looking at patch #2, I
>> get the same feeling again. Kind of hacky.
> 
> Could you explain which part you find "hacky"?

SWP_PFN_OFFSET_FREE_BITS :)

It's a PFN offset and we're mangling in random other bits. That's hacky
IMHO.

I played with the idea of converting all code to store bits in addition
to the type + offset. But that requires digging through a lot of arch
code to teach that code about additional flags, so I discarded that idea
when working on the COW fixes.

> 
> I used the swap entry to avoid per-arch operations. I failed to figure
> out a common way to know the swp offset length, so unluckily this RFC
> still needs one macro per arch.  Ying's suggestion seems like a good fit
> here to remove that last arch-specific dependency.

Instead of mangling this into the PFN offset and letting the arch tell you
which bits of the PFN offset are unused ... rather remove the bits from
the offset and define them manually to have a certain meaning. That's
exactly how pte_swp_mkexclusive/pte_swp_exclusive/ is supposed to be
handled on architectures that want to support it.

I hope I managed to make it clearer what the hacky part is IMHO :)

> 
>>
>>
>> If we mostly only care about x86_64, and it's a performance improvement
>> after all, why not simply do it like
>> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
> 
> Page migration works for most archs, and I want to have it work for all
> archs that can easily benefit from it.

Yet we only care about x86-64 IIUC regarding performance, just the way
the dirty bit is handled?

> 
> Besides, I actually have a question on the anon exclusive bit in the swap
> pte: since we have that anyway, why do we need a specific migration type
> for anon exclusive pages?  Can they simply be read migration entries with
> the anon exclusive bit set?

Not before all archs support pte_swp_mkexclusive/pte_swp_exclusive/.

As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
PTEs, you could even reuse that bit for migration entries and get at
least the most relevant 64bit architectures supported easily.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-08-02 12:06     ` David Hildenbrand
@ 2022-08-02 20:14       ` Peter Xu
  2022-08-02 20:23         ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2022-08-02 20:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Andrew Morton, Kirill A . Shutemov, Nadav Amit, Hugh Dickins,
	Vlastimil Babka

On Tue, Aug 02, 2022 at 02:06:37PM +0200, David Hildenbrand wrote:
> On 02.08.22 00:35, Peter Xu wrote:
> > On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
> >> On 29.07.22 03:40, Peter Xu wrote:
> >>> [Marking as RFC; only x86 is supported for now, plan to add a few more
> >>>  archs when there's a formal version]
> >>>
> >>> Problem
> >>> =======
> >>>
> >>> When migrate a page, right now we always mark the migrated page as old.
> >>> The reason could be that we don't really know whether the page is hot or
> >>> cold, so we could have taken it a default negative assuming that's safer.
> >>>
> >>> However that could lead to at least two problems:
> >>>
> >>>   (1) We lost the real hot/cold information while we could have persisted.
> >>>       That information shouldn't change even if the backing page is changed
> >>>       after the migration,
> >>>
> >>>   (2) There can be always extra overhead on the immediate next access to
> >>>       any migrated page, because hardware MMU needs cycles to set the young
> >>>       bit again (as long as the MMU supports).
> >>>
> >>> Many of the recent upstream works showed that (2) is not something trivial
> >>> and actually very measurable.  In my test case, reading 1G chunk of memory
> >>> - jumping in page size intervals - could take 99ms just because of the
> >>> extra setting on the young bit on a generic x86_64 system, comparing to 4ms
> >>> if young set.
> >>>
> >>> This issue is originally reported by Andrea Arcangeli.
> >>>
> >>> Solution
> >>> ========
> >>>
> >>> To solve this problem, this patchset tries to remember the young bit in the
> >>> migration entries and carry it over when recovering the ptes.
> >>>
> >>> We have the chance to do so because in many systems the swap offset is not
> >>> really fully used.  Migration entries use swp offset to store PFN only,
> >>> while the PFN is normally not as large as swp offset and normally smaller.
> >>> It means we do have some free bits in swp offset that we can use to store
> >>> things like young, and that's how this series tried to approach this
> >>> problem.
> >>>
> >>> One tricky thing here is even though we're embedding the information into
> >>> swap entry which seems to be a very generic data structure, the number of
> >>> bits that are free is still arch dependent.  Not only because the size of
> >>> swp_entry_t differs, but also due to the different layouts of swap ptes on
> >>> different archs.
> >>>
> >>> Here, this series requires specific arch to define an extra macro called
> >>> __ARCH_SWP_OFFSET_BITS represents the size of swp offset.  With this
> >>> information, the swap logic can know whether there's extra bits to use,
> >>> then it'll remember the young bits when possible.  By default, it'll keep
> >>> the old behavior of keeping all migrated pages cold.
> >>>
> >>
> >>
> >> I played with a similar idea when working on pte_swp_exclusive() but
> >> gave up, because it ended up looking too hacky. Looking at patch #2, I
> >> get the same feeling again. Kind of hacky.
> > 
> > Could you explain which part you find "hacky"?
> 
> SWP_PFN_OFFSET_FREE_BITS :)
> 
> It's a PFN offset and we're mangling in random other bits. That's hacky
> IMHO.
> 
> I played with the idea of converting all code to store bits in addition
> to the type + offset. But that requires digging through a lot of arch
> code to teach that code about additional flags, so I discarded that idea
> when working on the COW fixes.

Having SWP_PFN_OFFSET_FREE_BITS was the cleanest approach I could think of
before I knew about the max_swapfile_size() trick.  It only needs the arch
to define this one macro, and swapops.h will take action accordingly.

OTOH, I don't want to use swap pte bits, not only because there aren't a
lot left for some archs (x86_64 only has bit 4 left, afaict), but also
because this is a page migration issue, so it'd be nicer to me if it can
be solved at the swp entry level, not the pte level.
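
To make that concrete, here is roughly the shape of the encoding (a
sketch only; names like SWP_MIG_YOUNG_BIT are illustrative, details
differ in the actual patches):

	/* The PFN uses the low SWP_PFN_BITS of the offset; the young
	 * bit lives right above it when the arch has a spare bit. */
	#define SWP_MIG_YOUNG_BIT	SWP_PFN_BITS

	static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
	{
		return swp_entry(swp_type(entry),
				 swp_offset(entry) | BIT(SWP_MIG_YOUNG_BIT));
	}

	static inline bool is_migration_entry_young(swp_entry_t entry)
	{
		return swp_offset(entry) & BIT(SWP_MIG_YOUNG_BIT);
	}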

> 
> > 
> > I used the swap entry to avoid per-arch operations. I failed to figure
> > out a common way to know the swp offset length, so unluckily this RFC
> > still needs one macro per arch.  Ying's suggestion seems like a good fit
> > here to remove that last arch-specific dependency.
> 
> Instead of mangling this into the PFN offset and letting the arch tell you
> which bits of the PFN offset are unused ... rather remove the bits from
> the offset and define them manually to have a certain meaning. That's
> exactly how pte_swp_mkexclusive/pte_swp_exclusive/ is supposed to be
> handled on architectures that want to support it.
> 
> I hope I managed to make it clearer what the hacky part is IMHO :)
> 
> > 
> >>
> >>
> >> If we mostly only care about x86_64, and it's a performance improvement
> >> after all, why not simply do it like
> >> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
> > 
> > Page migration works for most archs, and I want to have it work for all
> > archs that can easily benefit from it.
> 
> Yet we only care about x86-64 IIUC regarding performance, just the way
> the dirty bit is handled?

I don't think we only care about x86_64?  Shouldn't other archs have the
same issue as long as there's a hardware young bit?

Even without it, it'll affect page reclaim logic too, and that's also not
x86 only.

> 
> > 
> > Besides, I actually have a question on the anon exclusive bit in the swap
> > pte: since we have that anyway, why do we need a specific migration type
> > for anon exclusive pages?  Can they simply be read migration entries with
> > the anon exclusive bit set?
> 
> Not before all archs support pte_swp_mkexclusive/pte_swp_exclusive/.
> 
> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> PTEs, you could even reuse that bit for migration entries and get at
> least the most relevant 64bit architectures supported easily.

Yes, but I think having two mechanisms for the single problem can confuse
people.

IIUC the swap bit is already defined in major archs anyway, and since anon
exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
only servicing that very minority.. which seems to be a pity to waste the
swp type on all archs even if the archs defined swp pte bits just for anon
exclusive.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-08-02 20:14       ` Peter Xu
@ 2022-08-02 20:23         ` David Hildenbrand
  2022-08-02 20:35           ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2022-08-02 20:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Andrew Morton, Kirill A . Shutemov, Nadav Amit, Hugh Dickins,
	Vlastimil Babka

> I don't think we only care about x86_64?  Shouldn't other archs have the
> same issue as long as there's a hardware young bit?
> 
> Even without it, it'll affect page reclaim logic too, and that's also not
> x86 only.

Okay, reading the cover letter and looking at the code my understanding
was that x86-64 is the real focus.

>>
>>>
>>> Besides, I actually have a question on the anon exclusive bit in the swap
>>> pte: since we have that anyway, why do we need a specific migration type
>>> for anon exclusive pages?  Can they simply be read migration entries with
>>> the anon exclusive bit set?
>>
>> Not before all archs support pte_swp_mkexclusive/pte_swp_exclusive/.
>>
>> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
>> PTEs, you could even reuse that bit for migration entries and get at
>> least the most relevant 64bit architectures supported easily.
> 
> Yes, but I think having two mechanisms for the single problem can confuse
> people.
> 

It would be one bit with two different meanings depending on the swp type.

> IIUC the swap bit is already defined in major archs anyway, and since anon
> exclusive bit is best-effort (or am I wrong?..), I won't worry too much on

It kind-of is best effort, but the goal is to have all archs support it.

... just like the young bit here?

> archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
> migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
> only servicing that very minority.. which seems to be a pity to waste the

I have a big item on my todo list to support all, but I have different
priorities right now.

If there is no free bit, simply steal one from the offset ... which is
the same thing your approach would do, just in a different way, no?

> swp type on all archs even if the archs defined swp pte bits just for anon
> exclusive.

Why do we care? We're talking about one type, not one bit.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-08-02 20:23         ` David Hildenbrand
@ 2022-08-02 20:35           ` Peter Xu
  2022-08-02 20:59             ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2022-08-02 20:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Andrew Morton, Kirill A . Shutemov, Nadav Amit, Hugh Dickins,
	Vlastimil Babka

On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
> > I don't think we only care about x86_64?  Shouldn't other archs have the
> > same issue as long as there's a hardware young bit?
> > 
> > Even without it, it'll affect page reclaim logic too, and that's also not
> > x86 only.
> 
> Okay, reading the cover letter and looking at the code my understanding
> was that x86-64 is the real focus.
> 
> >>
> >>>
> >>> Besides, I actually have a question on the anon exclusive bit in the swap
> >>> pte: since we have that anyway, why do we need a specific migration type
> >>> for anon exclusive pages?  Can they simply be read migration entries with
> >>> the anon exclusive bit set?
> >>
> >> Not before all archs support pte_swp_mkexclusive/pte_swp_exclusive/.
> >>
> >> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> >> PTEs, you could even reuse that bit for migration entries and get at
> >> least the most relevant 64bit architectures supported easily.
> > 
> > Yes, but I think having two mechanisms for the single problem can confuse
> > people.
> > 
> 
> It would be one bit with two different meanings depending on the swp type.
> 
> > IIUC the swap bit is already defined in major archs anyway, and since anon
> > exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
> 
> It kind-of is best effort, but the goal is to have all archs support it.
> 
> ... just like the young bit here?

Exactly, so I'm also wondering whether we can move the swp pte anon
exclusive bit into swp entry.  It just sounds weird to have them defined in
two ways.

> 
> > archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
> > migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
> > only servicing that very minority.. which seems to be a pity to waste the
> 
> I have a big item on my todo list to support all, but I have different
> priorities right now.
> 
> If there is no free bit, simply steal one from the offset ... which is
> the same thing your approach would do, just in a different way, no?
> 
> > swp type on all archs even if the archs defined swp pte bits just for anon
> > exclusive.
> 
> Why do we care? We're talking about one type, not one bit.

The swap type address space is still limited; I'd say we should save when
possible.  I believe people caring about swapping care about the limit of
swap devices too.  If the offset can keep it, I think it's better than the
swap type.  De-duping either the type or the swap pte bit would be nicer, imho.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-08-02 20:35           ` Peter Xu
@ 2022-08-02 20:59             ` David Hildenbrand
  2022-08-02 22:15               ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2022-08-02 20:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Andrew Morton, Kirill A . Shutemov, Nadav Amit, Hugh Dickins,
	Vlastimil Babka

On 02.08.22 22:35, Peter Xu wrote:
> On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
>>> I don't think we only care about x86_64?  Shouldn't other archs have the
>>> same issue as long as there's a hardware young bit?
>>>
>>> Even without it, it'll affect page reclaim logic too, and that's also not
>>> x86 only.
>>
>> Okay, reading the cover letter and looking at the code my understanding
>> was that x86-64 is the real focus.
>>
>>>>
>>>>>
>>>>> Besides, I actually have a question on the anon exclusive bit in the swap
>>>>> pte: since we have that anyway, why do we need a specific migration type
>>>>> for anon exclusive pages?  Can they simply be read migration entries with
>>>>> the anon exclusive bit set?
>>>>
>>>> Not before all archs support pte_swp_mkexclusive/pte_swp_exclusive/.
>>>>
>>>> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
>>>> PTEs, you could even reuse that bit for migration entries and get at
>>>> least the most relevant 64bit architectures supported easily.
>>>
>>> Yes, but I think having two mechanisms for the single problem can confuse
>>> people.
>>>
>>
>> It would be one bit with two different meanings depending on the swp type.
>>
>>> IIUC the swap bit is already defined in major archs anyway, and since anon
>>> exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
>>
>> It kind-of is best effort, but the goal is to have all archs support it.
>>
>> ... just like the young bit here?
> 
> Exactly, so I'm also wondering whether we can move the swp pte anon
> exclusive bit into swp entry.  It just sounds weird to have them defined in
> two ways.

I'd argue it's just that swp vs. nonswp are in fact two different
concepts (device+offset vs. type+pte), plus some dirty details of how
swp entries are actually used.

With swp entries you have to be very careful; for example, take a look
at radix_to_swp_entry() and swp_to_radix_entry(). That made me refrain
from touching anything inside actual swp entries and instead store it in
the pte.

> 
>>
>>> archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
>>> migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
>>> only servicing that very minority.. which seems to be a pity to waste the
>>
>> I have a big item on my todo list to support all, but I have different
>> priorities right now.
>>
>> If there is no free bit, simply steal one from the offset ... which is
>> the same thing your approach would do, just in a different way, no?
>>
>>> swp type on all archs even if the archs defined swp pte bits just for anon
>>> exclusive.
>>
>> Why do we care? We're talking about one type, not one bit.
> 
> The swap type address space is still limited; I'd say we should save when
> possible.  I believe people caring about swapping care about the limit of
> swap devices too.  If the offset can keep it, I think it's better than the

Ehm, last time I did the math I came to the conclusion that nobody
cares. Let me redo the math:

MAX_SWAPFILES = 1<<5 - 1 - 1 - 4 - 3 - 1 = 22

Which is the worst case right now with all kinds of oddity compiled in
(sorry CONFIG_DEVICE_PRIVATE).

So far nobody complained.
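
Spelling the subtractions out (my reading of the constants in this
kernel generation, worst case config):

	/*
	 *   (1 << MAX_SWAPFILES_SHIFT)     32
	 *   - SWP_SWAPIN_ERROR_NUM          1
	 *   - SWP_PTE_MARKER_NUM            1
	 *   - SWP_DEVICE_NUM                4   (CONFIG_DEVICE_PRIVATE)
	 *   - SWP_MIGRATION_NUM             3
	 *   - SWP_HWPOISON_NUM              1
	 *                                 ----
	 *   MAX_SWAPFILES                  22
	 */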

> swap type.  De-duping either the type or the swap pte bit would be nicer, imho.
> 

If you manage bits in the pte manually, you might be able to get a
better packing density, if bits are scattered around. Just take a look
at the x86_64 location of _PAGE_SWP_EXCLUSIVE.

What I'm rooting for is something like

#define pte_nonswp_mkyoung pte_swp_mkexclusive

Eventually with some VM_BUG_ONs to make sure people call it on the right
swp ptes.
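
I.e. something like this hypothetical sketch (pte_nonswp_mkyoung() does
not exist anywhere; this is just the aliasing idea spelled out):

	static inline pte_t pte_nonswp_mkyoung(pte_t pte)
	{
		/* The reused bit only means "young" on non-swap entries
		 * such as migration entries, hence the sanity check. */
		VM_BUG_ON(!is_swap_pte(pte) ||
			  !non_swap_entry(pte_to_swp_entry(pte)));
		return pte_swp_mkexclusive(pte);
	}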

If we ever want to get rid of SWP_MIGRATION_READ_EXCLUSIVE (so people
can have 23 swap devices), we can eventually have separate bits for both.
For now it's not necessary.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
  2022-08-02 20:59             ` David Hildenbrand
@ 2022-08-02 22:15               ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-08-02 22:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Huang Ying, Andrea Arcangeli,
	Andrew Morton, Kirill A . Shutemov, Nadav Amit, Hugh Dickins,
	Vlastimil Babka

On Tue, Aug 02, 2022 at 10:59:42PM +0200, David Hildenbrand wrote:
> On 02.08.22 22:35, Peter Xu wrote:
> > On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
> >>> I don't think we only care about x86_64?  Shouldn't other archs have the
> >>> same issue as long as there's a hardware young bit?
> >>>
> >>> Even without it, it'll affect page reclaim logic too, and that's also not
> >>> x86 only.
> >>
> >> Okay, reading the cover letter and looking at the code my understanding
> >> was that x86-64 is the real focus.
> >>
> >>>>
> >>>>>
> >>>>> Besides, I actually have a question on the anon exclusive bit in the swap
> >>>>> pte: since we have that anyway, why do we need a specific migration type
> >>>>> for anon exclusive pages?  Can they simply be read migration entries with
> >>>>> the anon exclusive bit set?
> >>>>
> >>>> Not before all archs support pte_swp_mkexclusive/pte_swp_exclusive/.
> >>>>
> >>>> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> >>>> PTEs, you could even reuse that bit for migration entries and get at
> >>>> least the most relevant 64bit architectures supported easily.
> >>>
> >>> Yes, but I think having two mechanisms for the single problem can confuse
> >>> people.
> >>>
> >>
> >> It would be one bit with two different meanings depending on the swp type.
> >>
> >>> IIUC the swap bit is already defined in major archs anyway, and since anon
> >>> exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
> >>
> >> It kind-of is best effort, but the goal is to have all archs support it.
> >>
> >> ... just like the young bit here?
> > 
> > Exactly, so I'm also wondering whether we can move the swp pte anon
> > exclusive bit into swp entry.  It just sounds weird to have them defined in
> > two ways.
> 
> I'd argue it's just that swp vs. nonswp are in fact two different
> concepts (device+offset vs. type+pte), plus some dirty details of how
> swp entries are actually used.
> 
> With swp entries you have to be very careful; for example, take a look
> at radix_to_swp_entry() and swp_to_radix_entry(). That made me refrain
> from touching anything inside actual swp entries and instead store it in
> the pte.

I don't really see any risk - it neither touches the swp entry nor does
anything complicated around it (shift 1 and set bit 0 to 1).  Please feel
free to share your concern in case I overlooked something, though.
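
For reference, the conversion in question looks roughly like this in the
kernels around this thread (xa_mk_value() being where the "shift 1, set
bit 0" happens):

	/* include/linux/swapops.h */
	static inline void *swp_to_radix_entry(swp_entry_t entry)
	{
		return xa_mk_value(entry.val);
	}

	/* include/linux/xarray.h */
	static inline void *xa_mk_value(unsigned long v)
	{
		WARN_ON((long)v < 0);
		return (void *)((v << 1) | 1);
	}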

> 
> > 
> >>
> >>> archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
> >>> migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
> >>> only servicing that very minority.. which seems to be a pity to waste the
> >>
> >> I have a big item on my todo list to support all, but I have different
> >> priorities right now.
> >>
> >> If there is no free bit, simply steal one from the offset ... which is
> >> the same thing your approach would do, just in a different way, no?
> >>
> >>> swp type on all archs even if the archs defined swp pte bits just for anon
> >>> exclusive.
> >>
> >> Why do we care? We're talking about one type, not one bit.
> > 
> > The swap type address space is still limited; I'd say we should save when
> > possible.  I believe people caring about swapping care about the limit of
> > swap devices too.  If the offset can keep it, I think it's better than the
> 
> Ehm, last time I did the math I came to the conclusion that nobody
> cares. Let me redo the math:
> 
> MAX_SWAPFILES = 1<<5 - 1 - 1 - 4 - 3 - 1 = 22
> 
> Which is the worst case right now with all kinds of oddity compiled in
> (sorry CONFIG_DEVICE_PRIVATE).
> 
> So far nobody complained.

Yeah.  To me using one bit of it is fine, especially if that's the best we
can do.  What confuses me here is that we have two ways to represent
"whether this page is anon exclusive" in a swap pte: either encoded in the
swap type, or via the pte bit.  The tricky part is that whether the swap
pte bit makes sense depends on the swp type in the first place, while the
swap type can itself be "anon exclusive read migration".

> 
> > swap type.  De-duping either the type or the swap pte bit would be nicer, imho.
> > 
> 
> If you manage bits in the pte manually, you might be able to get a
> better packing density, if bits are scattered around. Just take a look
> at the x86_64 location of _PAGE_SWP_EXCLUSIVE.
> 
> What I'm rooting for is something like
> 
> #define pte_nonswp_mkyoung pte_swp_mkexclusive
> 
> Eventually with some VM_BUG_ONs to make sure people call it on the right
> swp ptes.
> 
> If we ever want to get rid of SWP_MIGRATION_READ_EXCLUSIVE (so people
> can have 23 swap devices), we can eventually have separate bits for both.
> For now it's not necessary.

Okay, but that's probably the last thing I'd like to try - I don't want to
further complicate the anon exclusive bit in the swap pte.  I'd think
cleaning that up somehow would be nice, but as you mentioned, if no one
complains I have no strong opinion either.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-08-02 22:16 UTC | newest]

Thread overview: 21+ messages
2022-07-29  1:40 [PATCH RFC 0/4] mm: Remember young bit for migration entries Peter Xu
2022-07-29  1:40 ` [PATCH RFC 1/4] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry Peter Xu
2022-08-01  3:13   ` Huang, Ying
2022-08-01 22:29     ` Peter Xu
2022-08-02  1:22       ` Huang, Ying
2022-07-29  1:40 ` [PATCH RFC 2/4] mm: Remember young bit for page migrations Peter Xu
2022-07-29  1:40 ` [PATCH RFC 3/4] mm/x86: Use SWP_TYPE_BITS in 3-level swap macros Peter Xu
2022-07-29  1:40 ` [PATCH RFC 4/4] mm/x86: Define __ARCH_SWP_OFFSET_BITS Peter Xu
2022-07-29 17:07 ` [PATCH RFC 0/4] mm: Remember young bit for migration entries Nadav Amit
2022-07-29 22:43   ` Peter Xu
2022-08-01  3:20 ` Huang, Ying
2022-08-01  5:33 ` Huang, Ying
2022-08-01 22:25   ` Peter Xu
2022-08-01  8:21 ` David Hildenbrand
2022-08-01 22:35   ` Peter Xu
2022-08-02 12:06     ` David Hildenbrand
2022-08-02 20:14       ` Peter Xu
2022-08-02 20:23         ` David Hildenbrand
2022-08-02 20:35           ` Peter Xu
2022-08-02 20:59             ` David Hildenbrand
2022-08-02 22:15               ` Peter Xu
