* [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
@ 2022-11-09 20:30 Linus Torvalds
  2022-11-09 20:30 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Linus Torvalds @ 2022-11-09 20:30 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton
  Cc: linux-kernel, linux-mm, Alexander Gordeev

We already have this notion in parts of the MM code (see the mlock code
with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
case, and I refuse to do the same thing we've done before where we just
put bits in the raw pointer and say it's still a normal pointer.

So this introduces a 'struct encoded_page' pointer that cannot be used
for anything else than to encode a real page pointer and a couple of
extra bits in the low bits.  That way the compiler can trivially track
the state of the pointer and you just explicitly encode and decode the
extra bits.

Note that this makes the alignment of 'struct page' explicit even for
the case where CONFIG_HAVE_ALIGNED_STRUCT_PAGE is not set.  That is
entirely redundant in almost all cases, since the page structure already
contains several word-sized entries.

However, on m68k, the alignment of even 32-bit data is just 16 bits, and
as such in theory the alignment of 'struct page' could be too.  So let's
just make it very very explicit that the alignment needs to be at least
32 bits, giving us a guarantee of two unused low bits in the pointer.

Now, in practice, our page struct array is aligned much more than that
anyway, even on m68k, and our existing code in mm/mlock.c obviously
already depended on that.  But since the whole point of this change is
to be careful about the type system when hiding extra bits in the
pointer, let's also be explicit about the assumptions we make.
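
As a quick illustration of that guarantee (a hypothetical snippet, not
part of the patch): with at least 32-bit alignment, every 'struct page'
address is a multiple of 4, so the two low bits are always zero and
free to carry the encoded flags:

  ((unsigned long)page & 3) == 0   /* guaranteed by the 32-bit alignment */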

NOTE! This is being very careful in another way too: it has a build-time
assertion that the 'flags' added to the page pointer actually fit in the
two bits.  That means that this helper must be inlined, and can only be
used in contexts where the compiler can statically determine that the
value fits in the available bits.
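
As a minimal usage sketch (the flag value and the caller are
hypothetical, not part of this patch):

  /* constant flag value, so the BUILD_BUG_ON() is statically satisfied */
  struct encoded_page *enc = encode_page(page, 1);

  /* ... and later the two halves are recovered explicitly ... */
  unsigned long flags = encoded_page_flags(enc);  /* -> 1 */
  struct page *p = encoded_page_ptr(enc);         /* clean page pointer */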

Link: https://lore.kernel.org/all/Y2tKixpO4RO6DgW5@tuxmaker.boeblingen.de.ibm.com/
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm_types.h | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..0a38fcb08d85 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -67,7 +67,7 @@ struct mem_cgroup;
 #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
 #define _struct_page_alignment	__aligned(2 * sizeof(unsigned long))
 #else
-#define _struct_page_alignment
+#define _struct_page_alignment	__aligned(sizeof(unsigned long))
 #endif
 
 struct page {
@@ -241,6 +241,38 @@ struct page {
 #endif
 } _struct_page_alignment;
 
+/**
+ * struct encoded_page - a nonexistent type marking this pointer
+ *
+ * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
+ * with the low bits of the pointer indicating extra context-dependent
+ * information. Not super-common, but happens in mmu_gather and mlock
+ * handling, and this acts as a type system check on that use.
+ *
+ * We only really have two guaranteed bits in general, although you could
+ * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+ * for more.
+ *
+ * Use the supplied helper functions to encode/decode the pointer and bits.
+ */
+struct encoded_page;
+#define ENCODE_PAGE_BITS 3ul
+static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
+{
+	BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
+	return (struct encoded_page *)(flags | (unsigned long)page);
+}
+
+static inline unsigned long encoded_page_flags(struct encoded_page *page)
+{
+	return ENCODE_PAGE_BITS & (unsigned long)page;
+}
+
+static inline struct page *encoded_page_ptr(struct encoded_page *page)
+{
+	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+}
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
-- 
2.38.1.284.gfd9468d787



* [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too
  2022-11-09 20:30 [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
@ 2022-11-09 20:30 ` Linus Torvalds
  2022-11-16  9:24   ` David Hildenbrand
  2022-11-09 20:30 ` [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags Linus Torvalds
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2022-11-09 20:30 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton; +Cc: linux-kernel, linux-mm

release_pages() already could take either an array of page pointers, or
an array of folio pointers.  Expand it to also accept an array of
encoded page pointers, which is what both the existing mlock() use and
the upcoming mmu_gather use of encoded page pointers wants.

Note that release_pages() won't actually use, or react to, any extra
encoded bits.  Instead, this is very much a case of "I have walked the
array of encoded pages and done everything the extra bits tell me to do,
now release it all".

Also, while the "either page or folio pointers" dual use was handled
with a cast of the pointer in "release_folios()", this takes a slightly
different approach and uses the "transparent union" attribute to
describe the set of arguments to the function:

  https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html

which has been supported by gcc forever, but the kernel hasn't used
before.

That allows us to avoid using various wrappers with casts, and just use
the same function regardless of use.
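
As an illustration of what that buys us (hypothetical call sites, not
part of this patch), all three of these now compile without casts:

  void drop_them_all(struct page **pages, struct folio **folios,
                     struct encoded_page **encoded, int nr)
  {
          release_pages(pages, nr);
          release_pages(folios, nr);
          release_pages(encoded, nr);
  }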

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm.h | 21 +++++++++++++++++++--
 mm/swap.c          | 16 ++++++++++++----
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..d9fb5c3e3045 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1179,7 +1179,24 @@ static inline void folio_put_refs(struct folio *folio, int refs)
 		__folio_put(folio);
 }
 
-void release_pages(struct page **pages, int nr);
+/**
+ * release_pages - release an array of pages or folios
+ *
+ * This just releases a simple array of multiple pages, and
+ * accepts various different forms of said page array: either
+ * a regular old boring array of pages, an array of folios, or
+ * an array of encoded page pointers.
+ *
+ * The transparent union syntax for this kind of "any of these
+ * argument types" is all kinds of ugly, so look away.
+ */
+typedef union {
+	struct page **pages;
+	struct folio **folios;
+	struct encoded_page **encoded_pages;
+} release_pages_arg __attribute__ ((__transparent_union__));
+
+void release_pages(release_pages_arg, int nr);
 
 /**
  * folios_put - Decrement the reference count on an array of folios.
@@ -1195,7 +1212,7 @@ void release_pages(struct page **pages, int nr);
  */
 static inline void folios_put(struct folio **folios, unsigned int nr)
 {
-	release_pages((struct page **)folios, nr);
+	release_pages(folios, nr);
 }
 
 static inline void put_page(struct page *page)
diff --git a/mm/swap.c b/mm/swap.c
index 955930f41d20..596ed226ddb8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -968,22 +968,30 @@ void lru_cache_disable(void)
 
 /**
  * release_pages - batched put_page()
- * @pages: array of pages to release
+ * @arg: array of pages to release
  * @nr: number of pages
  *
- * Decrement the reference count on all the pages in @pages.  If it
+ * Decrement the reference count on all the pages in @arg.  If it
  * fell to zero, remove the page from the LRU and free it.
+ *
+ * Note that the argument can be an array of pages, encoded pages,
+ * or folio pointers. We ignore any encoded bits, and turn any of
+ * them into just a folio that gets free'd.
  */
-void release_pages(struct page **pages, int nr)
+void release_pages(release_pages_arg arg, int nr)
 {
 	int i;
+	struct encoded_page **encoded = arg.encoded_pages;
 	LIST_HEAD(pages_to_free);
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 	unsigned int lock_batch;
 
 	for (i = 0; i < nr; i++) {
-		struct folio *folio = page_folio(pages[i]);
+		struct folio *folio;
+
+		/* Turn any of the argument types into a folio */
+		folio = page_folio(encoded_page_ptr(encoded[i]));
 
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
-- 
2.38.1.284.gfd9468d787



* [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags
  2022-11-09 20:30 [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
  2022-11-09 20:30 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
@ 2022-11-09 20:30 ` Linus Torvalds
  2022-11-09 20:30 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
  2022-11-16  9:15 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits David Hildenbrand
  3 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2022-11-09 20:30 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton; +Cc: linux-kernel, linux-mm

This is purely a preparatory patch that makes all the data structures
ready for encoding flags with the mmu_gather page pointers.

The code currently always sets the flag to zero and doesn't use it yet,
but the type system now tracks that state along.  The next step will be
to actually start using it.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/s390/include/asm/tlb.h |  8 +++++---
 include/asm-generic/tlb.h   |  9 +++++----
 include/linux/swap.h        |  2 +-
 mm/mmu_gather.c             |  8 ++++----
 mm/swap_state.c             | 11 ++++-------
 5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 3a5c8fb590e5..05142226d65d 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,7 +25,8 @@
 void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
-					  struct page *page, int page_size);
+					  struct encoded_page *page,
+					  int page_size);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -42,9 +43,10 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
  * has already been freed, so just do free_page_and_swap_cache.
  */
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
-					  struct page *page, int page_size)
+					  struct encoded_page *page,
+					  int page_size)
 {
-	free_page_and_swap_cache(page);
+	free_page_and_swap_cache(encoded_page_ptr(page));
 	return false;
 }
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 492dce43236e..e5cd34393372 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -242,7 +242,7 @@ struct mmu_gather_batch {
 	struct mmu_gather_batch	*next;
 	unsigned int		nr;
 	unsigned int		max;
-	struct page		*pages[];
+	struct encoded_page	*encoded_pages[];
 };
 
 #define MAX_GATHER_BATCH	\
@@ -256,7 +256,8 @@ struct mmu_gather_batch {
  */
 #define MAX_GATHER_BATCH_COUNT	(10000UL/MAX_GATHER_BATCH)
 
-extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
+				   struct encoded_page *page,
 				   int page_size);
 #endif
 
@@ -431,13 +432,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 					struct page *page, int page_size)
 {
-	if (__tlb_remove_page_size(tlb, page, page_size))
+	if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
 		tlb_flush_mmu(tlb);
 }
 
 static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
-	return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, encode_page(page, 0), PAGE_SIZE);
 }
 
 /* tlb_remove_page
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a18cf4b7c724..40e418e3461b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -470,7 +470,7 @@ static inline unsigned long total_swapcache_pages(void)
 
 extern void free_swap_cache(struct page *page);
 extern void free_page_and_swap_cache(struct page *);
-extern void free_pages_and_swap_cache(struct page **, int);
+extern void free_pages_and_swap_cache(struct encoded_page **, int);
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index add4244e5790..f44cc8a5b581 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -48,7 +48,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 	struct mmu_gather_batch *batch;
 
 	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-		struct page **pages = batch->pages;
+		struct encoded_page **pages = batch->encoded_pages;
 
 		do {
 			/*
@@ -77,7 +77,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 	tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, int page_size)
 {
 	struct mmu_gather_batch *batch;
 
@@ -92,13 +92,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 	 * Add the page and check if we are full. If so
 	 * force a flush.
 	 */
-	batch->pages[batch->nr++] = page;
+	batch->encoded_pages[batch->nr++] = page;
 	if (batch->nr == batch->max) {
 		if (!tlb_next_batch(tlb))
 			return true;
 		batch = tlb->active;
 	}
-	VM_BUG_ON_PAGE(batch->nr > batch->max, page);
+	VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
 
 	return false;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 438d0676c5be..8bf08c313872 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -303,15 +303,12 @@ void free_page_and_swap_cache(struct page *page)
  * Passed an array of pages, drop them all from swapcache and then release
  * them.  They are removed from the LRU and freed if this is their last use.
  */
-void free_pages_and_swap_cache(struct page **pages, int nr)
+void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
 {
-	struct page **pagep = pages;
-	int i;
-
 	lru_add_drain();
-	for (i = 0; i < nr; i++)
-		free_swap_cache(pagep[i]);
-	release_pages(pagep, nr);
+	for (int i = 0; i < nr; i++)
+		free_swap_cache(encoded_page_ptr(pages[i]));
+	release_pages(pages, nr);
 }
 
 static inline bool swap_use_vma_readahead(void)
-- 
2.38.1.284.gfd9468d787



* [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-09 20:30 [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
  2022-11-09 20:30 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
  2022-11-09 20:30 ` [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags Linus Torvalds
@ 2022-11-09 20:30 ` Linus Torvalds
  2022-11-09 20:48   ` Linus Torvalds
  2022-11-16  9:15 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits David Hildenbrand
  3 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2022-11-09 20:30 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton
  Cc: linux-kernel, linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Peter Zijlstra,
	Gerald Schaefer

When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with
the page table lock because the page had already been made invisible to
rmap.

And that is a problem, because while the TLB entry exists, we could end
up with the following situation:

 (a) one CPU could come in and clean it, never seeing our mapping of the
     page

 (b) another CPU could continue to use the stale and dirty TLB entry and
     continue to write to said page

resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This extends our current TLB gather infrastructure to optionally track a
"should I do a delayed page_remove_rmap() for this page after flushing
the TLB".  It uses the newly introduced 'encoded page pointer' to do
that without having to keep separate data around.

Note, this is complicated by a couple of issues:

 - we want to delay the rmap removal, but not past the page table lock,
   because that simplifies the memcg accounting

 - only SMP configurations want to delay TLB flushing, since on UP
   there are obviously no remote TLBs to worry about, and the page
   table lock means there are no preemption issues either

 - s390 has its own mmu_gather model that doesn't delay TLB flushing,
   and as a result also does not want the delayed rmap. As such, we can
   treat S390 like the UP case and use a common fallback for the "no
   delays" case.

 - we can track an enormous number of pages in our mmu_gather structure,
   with MAX_GATHER_BATCH_COUNT batches of MAX_GATHER_BATCH pages each,
   all set up to be approximately 10k pending pages.

   We do not want to have a huge number of batched pages that we then
   need to check for delayed rmap handling inside the page table lock.

Particularly that last point results in a noteworthy detail, where the
normal page batch gathering is limited once we have delayed rmaps
pending, in such a way that only the last batch (the so-called "active
batch") in the mmu_gather structure can have any delayed entries.

NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly
even when user space goes off the rails.
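
To illustrate the resulting ordering in zap_pte_range(), here is a
simplified sketch of the flow (not the literal code):

  /* under the page table lock */
  set_page_dirty(page);                 /* keep the dirty bit           */
  __tlb_remove_page(tlb, page, 1);      /* queue page, rmap kept alive  */
  ...
  tlb_flush_mmu_tlbonly(tlb);           /* kill the stale TLB entries   */
  tlb_flush_rmaps(tlb, vma);            /* only now: page_remove_rmap() */
  pte_unmap_unlock(start_pte, ptl);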

Reported-and-tested-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Cc: Will Deacon <will@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/s390/include/asm/tlb.h |  3 +++
 include/asm-generic/tlb.h   | 31 +++++++++++++++++++++++++++++--
 mm/memory.c                 | 23 +++++++++++++++++------
 mm/mmu_gather.c             | 31 +++++++++++++++++++++++++++++++
 4 files changed, 80 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 05142226d65d..b91f4a9b044c 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -41,6 +41,9 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
  * Release the page cache reference for a pte removed by
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
+ *
+ * s390 doesn't delay rmap removal, so there is nothing encoded in
+ * the page pointer.
  */
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 					  struct encoded_page *page,
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index e5cd34393372..154c774d6307 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -259,6 +259,28 @@ struct mmu_gather_batch {
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
 				   struct encoded_page *page,
 				   int page_size);
+
+#ifdef CONFIG_SMP
+/*
+ * This both sets 'delayed_rmap', and returns true. It would be an inline
+ * function, except we define it before the 'struct mmu_gather'.
+ */
+#define tlb_delay_rmap(tlb) (((tlb)->delayed_rmap = 1), true)
+extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
+#endif
+
+#endif
+
+/*
+ * We have a no-op version of the rmap removal that doesn't
+ * delay anything. That is used on S390, which flushes remote
+ * TLBs synchronously, and on UP, which doesn't have any
+ * remote TLBs to flush and is not preemptible due to this
+ * all happening under the page table lock.
+ */
+#ifndef tlb_delay_rmap
+#define tlb_delay_rmap(tlb) (false)
+static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
 #endif
 
 /*
@@ -291,6 +313,11 @@ struct mmu_gather {
 	 */
 	unsigned int		freed_tables : 1;
 
+	/*
+	 * Do we have pending delayed rmap removals?
+	 */
+	unsigned int		delayed_rmap : 1;
+
 	/*
 	 * at which levels have we cleared entries?
 	 */
@@ -436,9 +463,9 @@ static inline void tlb_remove_page_size(struct mmu_gather *tlb,
 		tlb_flush_mmu(tlb);
 }
 
-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
 {
-	return __tlb_remove_page_size(tlb, encode_page(page, 0), PAGE_SIZE);
+	return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
 }
 
 /* tlb_remove_page
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..60a0f44f6e72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1432,6 +1432,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
+			unsigned int delay_rmap;
+
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
@@ -1443,20 +1445,26 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(!page))
 				continue;
 
+			delay_rmap = 0;
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
-					force_flush = 1;
 					set_page_dirty(page);
+					if (tlb_delay_rmap(tlb)) {
+						delay_rmap = 1;
+						force_flush = 1;
+					}
 				}
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, vma, false);
-			if (unlikely(page_mapcount(page) < 0))
-				print_bad_pte(vma, addr, ptent, page);
-			if (unlikely(__tlb_remove_page(tlb, page))) {
+			if (!delay_rmap) {
+				page_remove_rmap(page, vma, false);
+				if (unlikely(page_mapcount(page) < 0))
+					print_bad_pte(vma, addr, ptent, page);
+			}
+			if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
 				force_flush = 1;
 				addr += PAGE_SIZE;
 				break;
@@ -1513,8 +1521,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush)
+	if (force_flush) {
 		tlb_flush_mmu_tlbonly(tlb);
+		if (tlb->delayed_rmap)
+			tlb_flush_rmaps(tlb, vma);
+	}
 	pte_unmap_unlock(start_pte, ptl);
 
 	/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index f44cc8a5b581..38592fba3826 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,6 +9,7 @@
 #include <linux/rcupdate.h>
 #include <linux/smp.h>
 #include <linux/swap.h>
+#include <linux/rmap.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -19,6 +20,10 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
 
+	/* No more batching if we have delayed rmaps pending */
+	if (tlb->delayed_rmap)
+		return false;
+
 	batch = tlb->active;
 	if (batch->next) {
 		tlb->active = batch->next;
@@ -43,6 +48,31 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	return true;
 }
 
+/**
+ * tlb_flush_rmaps - do pending rmap removals after we have flushed the TLB
+ * @tlb: the current mmu_gather
+ *
+ * Note that because of how tlb_next_batch() above works, we will
+ * never start new batches with pending delayed rmaps, so we only
+ * need to walk through the current active batch.
+ */
+void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	struct mmu_gather_batch *batch;
+
+	batch = tlb->active;
+	for (int i = 0; i < batch->nr; i++) {
+		struct encoded_page *enc = batch->encoded_pages[i];
+
+		if (encoded_page_flags(enc)) {
+			struct page *page = encoded_page_ptr(enc);
+			page_remove_rmap(page, vma, false);
+		}
+	}
+
+	tlb->delayed_rmap = 0;
+}
+
 static void tlb_batch_pages_flush(struct mmu_gather *tlb)
 {
 	struct mmu_gather_batch *batch;
@@ -286,6 +316,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	tlb->active     = &tlb->local;
 	tlb->batch_count = 0;
 #endif
+	tlb->delayed_rmap = 0;
 
 	tlb_table_init(tlb);
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
-- 
2.38.1.284.gfd9468d787



* Re: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-09 20:30 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
@ 2022-11-09 20:48   ` Linus Torvalds
  2022-11-09 21:04     ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2022-11-09 20:48 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton
  Cc: linux-kernel, linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Peter Zijlstra,
	Gerald Schaefer

Bah, in carefully removing all the "let's send it as a reply to the
previous thread" command line flags, I cleverly also skipped adding a
cover letter, so this updated series got sent out without one.

I need more coffee.

But hey, it's not like the people cc'd haven't seen it before, and if
you want to see *all* the patches (I didn't want to patch-bomb people
with the prep-work), at least 'b4' is happy so you can get it all with
just

   b4 am 20221109203051.1835763-1-torvalds@linux-foundation.org

this time.

The main changes to the previously posted series are

 (a) try to move the s390 changes to generic code

 (b) build-time checking for the value range of the flags passed to
encode_page()

 (c) added comments both to code and commit messages

I'm sure I messed something up in the process, not just the lack of
cover letter which has now turned into this "tail letter" instead.

                 Linus


* Re: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-09 20:48   ` Linus Torvalds
@ 2022-11-09 21:04     ` Linus Torvalds
  2022-11-16  7:47       ` Alexander Gordeev
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2022-11-09 21:04 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton
  Cc: linux-kernel, linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar,
	Nick Piggin, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Peter Zijlstra,
	Gerald Schaefer

On Wed, Nov 9, 2022 at 12:48 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm sure I messed something up in the process [...]

I hate being right.

The UP build requires a

  #ifdef CONFIG_SMP
  ..
  #endif

around the tlb_flush_rmaps() implementation in mm/mmu_gather.c, since
the UP case now shares the empty "do nothing" implementation with
s390.

I'm not going to re-send the series for that trivial fix, since nobody
is likely to actually care about UP anyway, but since I noticed it
(after sending things out, sorry), I'll just mention it here.

And I was so happy about sharing the s390 and UP case, and avoiding
any code being specific to s390. Which is what introduced this thing.

Oh well. Easy fix. Just egg on my face. Again.

                Linus


* Re: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-09 21:04     ` Linus Torvalds
@ 2022-11-16  7:47       ` Alexander Gordeev
  2022-11-16  7:49         ` mm: mmu_gather: do not expose delayed_rmap flag Alexander Gordeev
                           ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Alexander Gordeev @ 2022-11-16  7:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel,
	linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

On Wed, Nov 09, 2022 at 01:04:07PM -0800, Linus Torvalds wrote:

Hi Linus,

[...]

> And I was so happy about sharing the s390 and UP case, and avoiding
> any code being specific to s390. Which is what introduced this thing.

Which actually brings up the question of whether
CONFIG_MMU_GATHER_NO_GATHER mode could be beneficial for UP?

But anyway, please find a follow-up series on top of mm-unstable,
with patches 1,2 aimed at avoiding the delayed_rmap flag on s390/UP
and patches 3,4 hopefully cleaning things up a bit (not so sure).

>                 Linus

Thanks!


* mm: mmu_gather: do not expose delayed_rmap flag
  2022-11-16  7:47       ` Alexander Gordeev
@ 2022-11-16  7:49         ` Alexander Gordeev
  2022-11-16 17:45           ` Linus Torvalds
  2022-11-16  7:50         ` mm: mmu_gather: do not define delayed_rmap if not used Alexander Gordeev
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Alexander Gordeev @ 2022-11-16  7:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel,
	linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

The delayed_rmap flag of 'struct mmu_gather' is really
a private member, but it is still accessed directly from
the outside.  Instead, let the TLB gather code itself
check the flag.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 mm/memory.c     | 3 +--
 mm/mmu_gather.c | 3 +++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 42f10cc1de58..38b58cd07b52 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1465,8 +1465,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	/* Do the actual TLB flush before dropping ptl */
 	if (force_flush) {
 		tlb_flush_mmu_tlbonly(tlb);
-		if (tlb->delayed_rmap)
-			tlb_flush_rmaps(tlb, vma);
+		tlb_flush_rmaps(tlb, vma);
 	}
 	pte_unmap_unlock(start_pte, ptl);
 
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 79de59136cd2..9f22309affee 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -60,6 +60,9 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	struct mmu_gather_batch *batch;
 
+	if (!tlb->delayed_rmap)
+		return;
+
 	batch = tlb->active;
 	for (int i = 0; i < batch->nr; i++) {
 		struct encoded_page *enc = batch->encoded_pages[i];
-- 
2.31.1



* mm: mmu_gather: do not define delayed_rmap if not used
  2022-11-16  7:47       ` Alexander Gordeev
  2022-11-16  7:49         ` mm: mmu_gather: do not expose delayed_rmap flag Alexander Gordeev
@ 2022-11-16  7:50         ` Alexander Gordeev
  2022-11-16 17:52           ` Linus Torvalds
  2022-11-16  7:52         ` [PATCH 3/4] mm: mmu_gather: turn delayed rmap macros into inlines Alexander Gordeev
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Alexander Gordeev @ 2022-11-16  7:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel,
	linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

In configurations where the delayed rmap removal is not
used (currently UP and s390), skip the delayed_rmap flag
entirely and make the related code paths no-ops.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 include/asm-generic/tlb.h | 32 +++++++++++++++++++-------------
 mm/mmu_gather.c           |  8 ++++----
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 154c774d6307..317bef9eee3c 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -265,24 +265,14 @@ extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
  * This both sets 'delayed_rmap', and returns true. It would be an inline
  * function, except we define it before the 'struct mmu_gather'.
  */
-#define tlb_delay_rmap(tlb) (((tlb)->delayed_rmap = 1), true)
+#define tlb_delay_rmap(tlb)		(((tlb)->delayed_rmap = 1), true)
+#define tlb_reset_delay_rmap(tlb)	((tlb)->delayed_rmap = 0)
+#define tlb_rmap_delayed(tlb)		((tlb)->delayed_rmap)
 extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
 #endif
 
 #endif
 
-/*
- * We have a no-op version of the rmap removal that doesn't
- * delay anything. That is used on S390, which flushes remote
- * TLBs synchronously, and on UP, which doesn't have any
- * remote TLBs to flush and is not preemptible due to this
- * all happening under the page table lock.
- */
-#ifndef tlb_delay_rmap
-#define tlb_delay_rmap(tlb) (false)
-static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
-#endif
-
 /*
  * struct mmu_gather is an opaque type used by the mm code for passing around
  * any data needed by arch specific code for tlb_remove_page.
@@ -313,10 +303,12 @@ struct mmu_gather {
 	 */
 	unsigned int		freed_tables : 1;
 
+#ifdef tlb_delay_rmap
 	/*
 	 * Do we have pending delayed rmap removals?
 	 */
 	unsigned int		delayed_rmap : 1;
+#endif
 
 	/*
 	 * at which levels have we cleared entries?
@@ -346,6 +338,20 @@ struct mmu_gather {
 #endif
 };
 
+/*
+ * We have a no-op version of the rmap removal that doesn't
+ * delay anything. That is used on S390, which flushes remote
+ * TLBs synchronously, and on UP, which doesn't have any
+ * remote TLBs to flush and is not preemptible due to this
+ * all happening under the page table lock.
+ */
+#ifndef tlb_delay_rmap
+#define tlb_delay_rmap(tlb)		(false)
+#define tlb_reset_delay_rmap(tlb)	do { } while (0)
+#define tlb_rmap_delayed(tlb)		(false)
+static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
+#endif
+
 void tlb_flush_mmu(struct mmu_gather *tlb);
 
 static inline void __tlb_adjust_range(struct mmu_gather *tlb,
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 9f22309affee..b0f1bd20af2f 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -20,7 +20,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	struct mmu_gather_batch *batch;
 
 	/* No more batching if we have delayed rmaps pending */
-	if (tlb->delayed_rmap)
+	if (tlb_rmap_delayed(tlb))
 		return false;
 
 	batch = tlb->active;
@@ -60,7 +60,7 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	struct mmu_gather_batch *batch;
 
-	if (!tlb->delayed_rmap)
+	if (!tlb_rmap_delayed(tlb))
 		return;
 
 	batch = tlb->active;
@@ -73,7 +73,7 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
 		}
 	}
 
-	tlb->delayed_rmap = 0;
+	tlb_reset_delay_rmap(tlb);
 }
 #endif
 
@@ -311,7 +311,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	tlb->active     = &tlb->local;
 	tlb->batch_count = 0;
 #endif
-	tlb->delayed_rmap = 0;
+	tlb_reset_delay_rmap(tlb);
 
 	tlb_table_init(tlb);
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
-- 
2.31.1



* [PATCH 3/4] mm: mmu_gather: turn delayed rmap macros into inlines
  2022-11-16  7:47       ` Alexander Gordeev
  2022-11-16  7:49         ` mm: mmu_gather: do not expose delayed_rmap flag Alexander Gordeev
  2022-11-16  7:50         ` mm: mmu_gather: do not define delayed_rmap if not used Alexander Gordeev
@ 2022-11-16  7:52         ` Alexander Gordeev
  2022-11-16  7:55         ` [PATCH 4/4] mm: mmu_gather: rename tlb_delay_rmap() function Alexander Gordeev
  2022-11-16 17:39         ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
  4 siblings, 0 replies; 22+ messages in thread
From: Alexander Gordeev @ 2022-11-16  7:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel,
	linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

Turn the tlb_delay_rmap() and friends macros into inline
functions by using forward declarations, which allows
defining them after the 'struct mmu_gather' definition.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 include/asm-generic/tlb.h | 56 ++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 317bef9eee3c..33943a4de5a7 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -261,13 +261,10 @@ extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
 				   int page_size);
 
 #ifdef CONFIG_SMP
-/*
- * This both sets 'delayed_rmap', and returns true. It would be an inline
- * function, except we define it before the 'struct mmu_gather'.
- */
-#define tlb_delay_rmap(tlb)		(((tlb)->delayed_rmap = 1), true)
-#define tlb_reset_delay_rmap(tlb)	((tlb)->delayed_rmap = 0)
-#define tlb_rmap_delayed(tlb)		((tlb)->delayed_rmap)
+#define tlb_delay_rmap tlb_delay_rmap
+static inline bool tlb_delay_rmap(struct mmu_gather *tlb);
+static inline void tlb_reset_delay_rmap(struct mmu_gather *tlb);
+static inline bool tlb_rmap_delayed(struct mmu_gather *tlb);
 extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
 #endif
 
@@ -338,6 +335,27 @@ struct mmu_gather {
 #endif
 };
 
+#ifdef tlb_delay_rmap
+
+static inline bool tlb_delay_rmap(struct mmu_gather *tlb)
+{
+	tlb->delayed_rmap = 1;
+
+	return true;
+}
+
+static inline void tlb_reset_delay_rmap(struct mmu_gather *tlb)
+{
+	tlb->delayed_rmap = 0;
+}
+
+static inline bool tlb_rmap_delayed(struct mmu_gather *tlb)
+{
+	return tlb->delayed_rmap;
+}
+
+#else
+
 /*
  * We have a no-op version of the rmap removal that doesn't
  * delay anything. That is used on S390, which flushes remote
@@ -345,11 +363,25 @@ struct mmu_gather {
  * remote TLBs to flush and is not preemptible due to this
  * all happening under the page table lock.
  */
-#ifndef tlb_delay_rmap
-#define tlb_delay_rmap(tlb)		(false)
-#define tlb_reset_delay_rmap(tlb)	do { } while (0)
-#define tlb_rmap_delayed(tlb)		(false)
-static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
+#define tlb_delay_rmap tlb_delay_rmap
+static inline bool tlb_delay_rmap(struct mmu_gather *tlb)
+{
+	return false;
+}
+
+static inline void tlb_reset_delay_rmap(struct mmu_gather *tlb)
+{
+}
+
+static inline bool tlb_rmap_delayed(struct mmu_gather *tlb)
+{
+	return false;
+}
+
+static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+}
+
 #endif
 
 void tlb_flush_mmu(struct mmu_gather *tlb);
-- 
2.31.1



* [PATCH 4/4] mm: mmu_gather: rename tlb_delay_rmap() function
  2022-11-16  7:47       ` Alexander Gordeev
                           ` (2 preceding siblings ...)
  2022-11-16  7:52         ` [PATCH 3/4] mm: mmu_gather: turn delayed rmap macros into inlines Alexander Gordeev
@ 2022-11-16  7:55         ` Alexander Gordeev
  2022-11-16 17:39         ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
  4 siblings, 0 replies; 22+ messages in thread
From: Alexander Gordeev @ 2022-11-16  7:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel,
	linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

The tlb_delay_rmap() function indicates to the TLB gather
infrastructure that the rmap removal for a particular page
should be delayed until after the TLB flush. Yet, the
function name and prototype suggest it concerns the whole
TLB gather state.

Rename tlb_delay_rmap() to tlb_delay_page_rmap(), along
with the delay_rmap local variable, to avoid the described
ambiguity.

Although unlikely to ever be used, add a 'struct page'
argument to the renamed function to emphasize the notion
of the page being delayed.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 include/asm-generic/tlb.h | 14 +++++++-------
 mm/memory.c               | 12 ++++++------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 33943a4de5a7..2c425acdd2be 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -261,8 +261,8 @@ extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
 				   int page_size);
 
 #ifdef CONFIG_SMP
-#define tlb_delay_rmap tlb_delay_rmap
-static inline bool tlb_delay_rmap(struct mmu_gather *tlb);
+#define tlb_delay_page_rmap tlb_delay_page_rmap
+static inline bool tlb_delay_page_rmap(struct mmu_gather *tlb, struct page *page);
 static inline void tlb_reset_delay_rmap(struct mmu_gather *tlb);
 static inline bool tlb_rmap_delayed(struct mmu_gather *tlb);
 extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
@@ -300,7 +300,7 @@ struct mmu_gather {
 	 */
 	unsigned int		freed_tables : 1;
 
-#ifdef tlb_delay_rmap
+#ifdef tlb_delay_page_rmap
 	/*
 	 * Do we have pending delayed rmap removals?
 	 */
@@ -335,9 +335,9 @@ struct mmu_gather {
 #endif
 };
 
-#ifdef tlb_delay_rmap
+#ifdef tlb_delay_page_rmap
 
-static inline bool tlb_delay_rmap(struct mmu_gather *tlb)
+static inline bool tlb_delay_page_rmap(struct mmu_gather *tlb, struct page *page)
 {
 	tlb->delayed_rmap = 1;
 
@@ -363,8 +363,8 @@ static inline bool tlb_rmap_delayed(struct mmu_gather *tlb)
  * remote TLBs to flush and is not preemptible due to this
  * all happening under the page table lock.
  */
-#define tlb_delay_rmap tlb_delay_rmap
-static inline bool tlb_delay_rmap(struct mmu_gather *tlb)
+#define tlb_delay_page_rmap tlb_delay_page_rmap
+static inline bool tlb_delay_page_rmap(struct mmu_gather *tlb, struct page *page)
 {
 	return false;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 38b58cd07b52..5ba4a1f94688 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1374,7 +1374,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
-			unsigned int delay_rmap;
+			unsigned int delay_page_rmap;
 
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
@@ -1387,12 +1387,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(!page))
 				continue;
 
-			delay_rmap = 0;
+			delay_page_rmap = 0;
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
 					set_page_dirty(page);
-					if (tlb_delay_rmap(tlb)) {
-						delay_rmap = 1;
+					if (tlb_delay_page_rmap(tlb, page)) {
+						delay_page_rmap = 1;
 						force_flush = 1;
 					}
 				}
@@ -1401,12 +1401,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			if (!delay_rmap) {
+			if (!delay_page_rmap) {
 				page_remove_rmap(page, vma, false);
 				if (unlikely(page_mapcount(page) < 0))
 					print_bad_pte(vma, addr, ptent, page);
 			}
-			if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+			if (unlikely(__tlb_remove_page(tlb, page, delay_page_rmap))) {
 				force_flush = 1;
 				addr += PAGE_SIZE;
 				break;
-- 
2.31.1



* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-09 20:30 [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
                   ` (2 preceding siblings ...)
  2022-11-09 20:30 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
@ 2022-11-16  9:15 ` David Hildenbrand
  3 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2022-11-16  9:15 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins, Johannes Weiner, Andrew Morton
  Cc: linux-kernel, linux-mm, Alexander Gordeev

On 09.11.22 21:30, Linus Torvalds wrote:
> We already have this notion in parts of the MM code (see the mlock code
> with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
> case, and I refuse to do the same thing we've done before where we just
> put bits in the raw pointer and say it's still a normal pointer.
> 
> So this introduces a 'struct encoded_page' pointer that cannot be used
> for anything else than to encode a real page pointer and a couple of
> extra bits in the low bits.  That way the compiler can trivially track
> the state of the pointer and you just explicitly encode and decode the
> extra bits.
> 
> Note that this makes the alignment of 'struct page' explicit even for
> the case where CONFIG_HAVE_ALIGNED_STRUCT_PAGE is not set.  That is
> entirely redundant in almost all cases, since the page structure already
> contains several word-sized entries.
> 
> However, on m68k, the alignment of even 32-bit data is just 16 bits, and
> as such in theory the alignment of 'struct page' could be too.  So let's
> just make it very very explicit that the alignment needs to be at least
> 32 bits, giving us a guarantee of two unused low bits in the pointer.
> 
> Now, in practice, our page struct array is aligned much more than that
> anyway, even on m68k, and our existing code in mm/mlock.c obviously
> already depended on that.  But since the whole point of this change is
> to be careful about the type system when hiding extra bits in the
> pointer, let's also be explicit about the assumptions we make.
> 
> NOTE! This is being very careful in another way too: it has a build-time
> assertion that the 'flags' added to the page pointer actually fit in the
> two bits.  That means that this helper must be inlined, and can only be
> used in contexts where the compiler can statically determine that the
> value fits in the available bits.
> 
> Link: https://lore.kernel.org/all/Y2tKixpO4RO6DgW5@tuxmaker.boeblingen.de.ibm.com/
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too
  2022-11-09 20:30 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
@ 2022-11-16  9:24   ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2022-11-16  9:24 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins, Johannes Weiner, Andrew Morton
  Cc: linux-kernel, linux-mm

On 09.11.22 21:30, Linus Torvalds wrote:
> release_pages() already could take either an array of page pointers, or
> an array of folio pointers.  Expand it to also accept an array of
> encoded page pointers, which is what both the existing mlock() use and
> the upcoming mmu_gather use of encoded page pointers wants.
> 
> Note that release_pages() won't actually use, or react to, any extra
> encoded bits.  Instead, this is very much a case of "I have walked the
> array of encoded pages and done everything the extra bits tell me to do,
> now release it all".
> 
> Also, while the "either page or folio pointers" dual use was handled
> with a cast of the pointer in "release_folios()", this takes a slightly
> different approach and uses the "transparent union" attribute to
> describe the set of arguments to the function:
> 
>    https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html
> 
> which has been supported by gcc forever, but the kernel hasn't used
> before.
> 
> That allows us to avoid using various wrappers with casts, and just use
> the same function regardless of use.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed
  2022-11-16  7:47       ` Alexander Gordeev
                           ` (3 preceding siblings ...)
  2022-11-16  7:55         ` [PATCH 4/4] mm: mmu_gather: rename tlb_delay_rmap() function Alexander Gordeev
@ 2022-11-16 17:39         ` Linus Torvalds
  4 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2022-11-16 17:39 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel,
	linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

On Tue, Nov 15, 2022 at 11:48 PM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> Which actually brings up the question of whether
> CONFIG_MMU_GATHER_NO_GATHER mode could be beneficial for UP?

No, the NO_GATHER case wouldn't work for UP in general, because once
we drop the page table lock, even on UP we end up possibly
re-scheduling due to preemption (and even without actual kernel
preemption, we have that explicit "cond_resched()" there).

And if we schedule to another thread that shares the same VM, that
other thread will continue to use the existing TLB entries.

And if those TLB entries then point to pages that were already free'd...

So the NO_GATHER case requires that you flush the TLB early even on
UP, although the requirements are a _bit_ less strict than on SMP.

And TLB flushes are not necessarily cheap, even on UP.

Now, we could possibly optimize this to the point where it *would* be
quite possible - instead of actually flushing the TLB, just set the
bit to "flush on next thread switch". So I think the UP case *could*
be made to be non-gathering.

But I don't think anybody cares about - or tests - UP enough for it to
make sense to worry about it.

           Linus


* Re: mm: mmu_gather: do not expose delayed_rmap flag
  2022-11-16  7:49         ` mm: mmu_gather: do not expose delayed_rmap flag Alexander Gordeev
@ 2022-11-16 17:45           ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2022-11-16 17:45 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel,
	linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

On Tue, Nov 15, 2022 at 11:49 PM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> The delayed_rmap flag of 'struct mmu_gather' is really
> a private member, but it is still accessed directly from
> the outside.  Instead, let the TLB gather code itself
> check the flag.

Now, I set it up so that if you don't use delayed_rmap, the
tlb_flush_rmaps() function ends up being an empty inline function, and
as such the compiler should already have done this for you - including
optimizing out the test that then doesn't even matter.

But this patch shouldn't *matter*, but it also isn't wrong, so..

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

Thanks,

                    Linus


* Re: mm: mmu_gather: do not define delayed_rmap if not used
  2022-11-16  7:50         ` mm: mmu_gather: do not define delayed_rmap if not used Alexander Gordeev
@ 2022-11-16 17:52           ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2022-11-16 17:52 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel,
	linux-mm, Nadav Amit, Will Deacon, Aneesh Kumar, Nick Piggin,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, Peter Zijlstra, Gerald Schaefer

On Tue, Nov 15, 2022 at 11:51 PM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> In configurations where the delayed rmap removal is not
> used (currently UP and s390), skip the delayed_rmap flag
> entirely and make the related code paths no-ops.

So I'm not convinced about this patch.

I particularly dislike adding even more #ifdef's around the data
structure - it already is pretty nasty, and it was hard to see where
things were initialized.

The only actual code impact of this is in tlb_next_batch(), which
tests for "do I have delayed rmaps pending, in which case I won't add
new batches". Everything else is already either optimized away, or
just "one bit declared in a structure that already has bitfields and
has room for several extra bits".

And that "I need to allocate new batches" case really doesn't matter
anyway - it's not even built at all on s390, and on UP where it's
there but technically pointless to have the test it really isn't
noticeable.

So with the previous patch I was going "this shouldn't actually
_matter_, but it does seem cleaner to do it this way".

But _this_ patch makes me go "it still doesn't matter, but now this
patch is actually adding extra infrastructure for the 'not-mattering'
case".

So I don't _hate_ this patch, but I think this actually makes the
current mess wrt our 'struct mmu_gather' worse rather than better.

That structure is already a pain, with horrendous initialization and
different bit-fields having different lifetimes. I'd rather have one
unconditional simple bitfield, than have another bitfield that has
conditional complications.

              Linus


* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-09 18:00     ` Linus Torvalds
@ 2022-11-09 20:02       ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2022-11-09 20:02 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Wed, Nov 9, 2022 at 10:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But while I think a BUILD_BUG_ON() would be the right thing to do
> here, I do not feel confident enough to really put that to the test.

Oh, what the hell.

Just writing that whole explanation out made me go "let's try to
re-organize it a bit so that we *can* inline everything, and see how
well it works".

And it does actually work to use BUILD_BUG_ON(), both with gcc and clang.

At least that's the case with the versions of gcc and clang _I_ use,
and in the configurations I tested.

So now I have a slightly massaged version of the patches (I did have
to move the 'encode_page()' around a bit), which has that
BUILD_BUG_ON() in it, and it passes for me.
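
The massaged helper presumably ends up looking something like this
(__always_inline so that the compiler can always see the flag value):

        static __always_inline struct encoded_page *encode_page(struct page *page,
                        unsigned long flags)
        {
                BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
                return (struct encoded_page *)(flags | (unsigned long)page);
        }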

And I find that I really like seeing that whole page pointer encoding
be so obviously much stricter. Type system checking was obviously the
point of the whole separate type; now it does bit value validity
checking too.

So I'll walk through my patches one more time to check for it, but
I'll post it as a git branch and send out a new series (and do it in a
separate thread with a cover letter, to not confuse the little mind of
'b4' again).

If it turns out that some other compiler version or configuration
doesn't deal with the BUILD_BUG_ON() gracefully, it's easy enough to
remove, and it will hopefully show up in linux-next when Andrew picks
it up.

                  Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-09  6:36   ` Alexander Gordeev
@ 2022-11-09 18:00     ` Linus Torvalds
  2022-11-09 20:02       ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2022-11-09 18:00 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Tue, Nov 8, 2022 at 10:38 PM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> On Tue, Nov 08, 2022 at 11:41:36AM -0800, Linus Torvalds wrote:
>
> > +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> > +{
>
> Any reaction in case ((flags & ~ENCODE_PAGE_BITS) != 0)?

Heh. I've actually had three different implementations for that during
the development series, and I think I even posted them all at one
point or another (although usually just as attachments). And none of
them are good.

Those three trivial versions are: (a) use VM_BUG_ON(), (b) just
silently mask the bits and (c) just silently add them.

And (c) is the least annoying option, and it's the one this latest
patch uses, because both (a) and (b) are just nasty.

Basically, all users are locally trivial to verify statically, so
VM_BUG_ON() is just conceptually wrong and generates extra pointless
code. And the silent masking - if it makes any difference - is just
another version of "just silently add the bits": regardless of whether
it clears them or not, it does the wrong thing if the bits don't fit.

So there are three bad options, I've gone back and forth between them
all, and I chose the least offensive one that is "invisible", in that
it at least doesn't do any extra pointless work.
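
For concreteness, the three bodies would be roughly (sketch only):

        /* (a) runtime check: extra code for something statically obvious */
        VM_BUG_ON((flags & ~ENCODE_PAGE_BITS) != 0);
        return (struct encoded_page *)(flags | (unsigned long)page);

        /* (b) silently mask: still wrong when the bits don't fit */
        return (struct encoded_page *)((flags & ENCODE_PAGE_BITS) | (unsigned long)page);

        /* (c) silently add: no extra work at all - what the patch does */
        return (struct encoded_page *)(flags | (unsigned long)page);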

Now, there are two non-offensive options too, which I actually
considered but never implemented. They both fix the problem properly,
by making it a *build-time* check, but they have other issues.

There are two ways to just make it a build-time check, and it's
annoyingly _close_ to being usable, but not quite there.

One is simply to require that the flags argument is always a plain
constant, and then just use BUILD_BUG_ON().

I actually almost went down that path - one of the things I considered
was to not add a 'flags' argument to __tlb_remove_page() at all, but
instead just have separate __tlb_remove_page() and
__tlb_remove_page_dirty() functions.

That would have meant that the argument to __tlb_remove_page_size()
would always have been a build-time constant, and then it would be
trivial to just have that BUILD_BUG_ON(). Problem solved.
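
That hypothetical - never implemented - variant would have looked
something like this:

        static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
        {
                return __tlb_remove_page_size(tlb, encode_page(page, 0), PAGE_SIZE);
        }

        static inline bool __tlb_remove_page_dirty(struct mmu_gather *tlb, struct page *page)
        {
                return __tlb_remove_page_size(tlb, encode_page(page, 1), PAGE_SIZE);
        }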

But it turns out that it's just nasty, particularly with different
configurations wanting different rules for what the dirty bit is. So
forcing it to some constant value was really not acceptable.

The thing that I actually *wanted* to do, but didn't dare, was to
just say "I will trust the compiler to do the value range tracking".

Because *technically* our BUILD_BUG_ON() doesn't need a compile-time
constant. Because our implementation of BUILD_BUG_ON() is not the
garbage that the compiler gives us in "_Static_assert()" that really
requires a syntactically pure integer constant expression.

So the kernel version of BUILD_BUG_ON() is actually something much
smarter: it depends on the compiler actually *optimizing* the
expression, and it's only that optimized value that needs to be
determined at compile-time to be either true or false. You can use
things like inline functions etc, just as long as the end result is
obvious enough that the compiler ends up saying "ok, that's never the
case".

And *if* the compiler does any kind of reasonable range analysis, then a

        BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);

should actually work. In theory.

In practice? Not so much.

Because while the argument isn't constant (not even in the caller),
the compiler *should* be smart enough to see that in the use in
mm/memory.c, 'flags' is always that

        unsigned int delay_rmap;

which then gets initialized to

        delay_rmap = 0;

and conditionally set to '1' later. So it's not a *constant*, but the
compiler can see that the value of flags is clearly never larger than
ENCODE_PAGE_BITS.
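
Abridged from the zap_pte_range() changes in this series (details may
differ), the caller pattern is:

        unsigned int delay_rmap = 0;

        if (!PageAnon(page)) {
                if (pte_dirty(ptent)) {
                        set_page_dirty(page);
                        delay_rmap = 1;  /* the only other value it ever gets */
                }
        }
        if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
                force_flush = 1;
                break;
        }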

But right now the compiler cannot track that over the non-inline
function in __tlb_remove_page_size().

Maybe if the 'encode_page()' was done in the caller, and
__tlb_remove_page_size() were to just take an encoded_page as the
argument, then the compiler would always only see this all through
inlined functions, and it would work.
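
That refactoring would look roughly like this - a sketch of the idea,
not necessarily what ends up committed:

        /* thin inline wrapper: flags get encoded (and checked) right here */
        static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
                        struct page *page, unsigned int flags)
        {
                return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
        }

        /* the out-of-line function only ever sees a valid encoded_page */
        bool __tlb_remove_page_size(struct mmu_gather *tlb,
                        struct encoded_page *page, int page_size);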

But even if it were to work for me (I never tried), I'd have been much
too worried that some other compiler version, with some other config
options, on some other architecture, wouldn't make the required
optimizations.

We do require compiler optimizations to be on for 'BUILD_BUG_ON()' to
do anything at all:

   #ifdef __OPTIMIZE__
   # define __compiletime_assert(condition, msg, prefix, suffix)           \
   ..
   #else
   # define __compiletime_assert(condition, msg, prefix, suffix) do { } while (0)
   #endif

and we have a lot of places where BUILD_BUG_ON() relies on just basic
constant folding and other fairly simple optimizations.
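
As a made-up illustration (not kernel code): the assertion below is
not a constant expression, but it compiles away whenever the optimizer
can prove the range at the call site:

        static inline unsigned long small_mask(unsigned int bits)
        {
                BUILD_BUG_ON(bits > 8);   /* depends on constant folding */
                return (1UL << bits) - 1;
        }

        /* after inlining, '4 > 8' folds to 0 and the assertion vanishes */
        unsigned long m = small_mask(4);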

But while I think a BUILD_BUG_ON() would be the right thing to do
here, I do not feel confident enough to really put that to the test.

              Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-08 19:41 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
  2022-11-08 20:37   ` Nadav Amit
@ 2022-11-09  6:36   ` Alexander Gordeev
  2022-11-09 18:00     ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: Alexander Gordeev @ 2022-11-09  6:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, linux-kernel, linux-mm

On Tue, Nov 08, 2022 at 11:41:36AM -0800, Linus Torvalds wrote:

Hi Linus,

[...]

> +struct encoded_page;
> +#define ENCODE_PAGE_BITS 3ul
> +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> +{

Any reaction in case ((flags & ~ENCODE_PAGE_BITS) != 0)?

> +	return (struct encoded_page *)(flags | (unsigned long)page);
> +}

Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-08 20:37   ` Nadav Amit
@ 2022-11-08 20:46     ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2022-11-08 20:46 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, kernel list, linux-mm

On Tue, Nov 8, 2022 at 12:37 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > +static inline bool encoded_page_flags(struct encoded_page *page)
> > +{
> > +     return ENCODE_PAGE_BITS & (unsigned long)page;
> > +}
>
> I think this one wants to return some unsigned type, as otherwise why
> have ENCODE_PAGE_BITS as 3ul?

Right you are. That came from my old old version where this was just
"bool dirty".

Will fix.

Doesn't matter for the TLB flushing case, but I really did hope that
we could use this for mlock too, and that case needs both bits.
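
So presumably the fixed version just returns the mask directly:

        static inline unsigned long encoded_page_flags(struct encoded_page *page)
        {
                return ENCODE_PAGE_BITS & (unsigned long)page;
        }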

I did look at converting mlock (and that's why I wanted to make
release_pages() take that whole encoded thing in general, rather than
make some special case for it), but the mlock code uses that "struct
pagevec" abstraction that seems entirely pointless ("pvec->nr" becomes
"pagevec_count(pvec)", which really doesn't seem to be any clearer at
all), but whatever.
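
For reference, the abstraction in question is just a small fixed-size
array (abridged, from the include/linux/pagevec.h of that era):

        struct pagevec {
                unsigned char nr;
                bool percpu_pvec_drained;
                struct page *pages[PAGEVEC_SIZE];
        };

        static inline unsigned pagevec_count(struct pagevec *pvec)
        {
                return pvec->nr;  /* the "clearer" accessor in question */
        }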

               Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-08 19:41 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
@ 2022-11-08 20:37   ` Nadav Amit
  2022-11-08 20:46     ` Linus Torvalds
  2022-11-09  6:36   ` Alexander Gordeev
  1 sibling, 1 reply; 22+ messages in thread
From: Nadav Amit @ 2022-11-08 20:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Johannes Weiner, Andrew Morton, kernel list, linux-mm

On Nov 8, 2022, at 11:41 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> We already have this notion in parts of the MM code (see the mlock code
> with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
> case, and I refuse to do the same thing we've done before where we just
> put bits in the raw pointer and say it's still a normal pointer.
> 
> So this introduces a 'struct encoded_page' pointer that cannot be used
> for anything else than to encode a real page pointer and a couple of
> extra bits in the low bits.  That way the compiler can trivially track
> the state of the pointer and you just explicitly encode and decode the
> extra bits.

I tested all of the patches again with the PoC. They pass.

> 
> +struct encoded_page;
> +#define ENCODE_PAGE_BITS 3ul
> +static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
> +{
> +	return (struct encoded_page *)(flags | (unsigned long)page);
> +}
> +
> +static inline bool encoded_page_flags(struct encoded_page *page)
> +{
> +	return ENCODE_PAGE_BITS & (unsigned long)page;
> +}

I think this one wants to return some unsigned type, as otherwise why
have ENCODE_PAGE_BITS as 3ul?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits
  2022-11-07 23:47 mm: delay rmap removal until after TLB flush Linus Torvalds
@ 2022-11-08 19:41 ` Linus Torvalds
  2022-11-08 20:37   ` Nadav Amit
  2022-11-09  6:36   ` Alexander Gordeev
  0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2022-11-08 19:41 UTC (permalink / raw)
  To: Hugh Dickins, Johannes Weiner, Andrew Morton; +Cc: linux-kernel, linux-mm

We already have this notion in parts of the MM code (see the mlock code
with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new
case, and I refuse to do the same thing we've done before where we just
put bits in the raw pointer and say it's still a normal pointer.

So this introduces a 'struct encoded_page' pointer that cannot be used
for anything else than to encode a real page pointer and a couple of
extra bits in the low bits.  That way the compiler can trivially track
the state of the pointer and you just explicitly encode and decode the
extra bits.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm_types.h | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..b5cffd250784 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -67,7 +67,7 @@ struct mem_cgroup;
 #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
 #define _struct_page_alignment	__aligned(2 * sizeof(unsigned long))
 #else
-#define _struct_page_alignment
+#define _struct_page_alignment	__aligned(sizeof(unsigned long))
 #endif
 
 struct page {
@@ -241,6 +241,37 @@ struct page {
 #endif
 } _struct_page_alignment;
 
+/**
+ * struct encoded_page - a nonexistent type marking this pointer
+ *
+ * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
+ * with the low bits of the pointer indicating extra context-dependent
+ * information. Not super-common, but happens in mmu_gather and mlock
+ * handling, and this acts as a type system check on that use.
+ *
+ * We only really have two guaranteed bits in general, although you could
+ * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
+ * for more.
+ *
+ * Use the supplied helper functions to encode/decode the pointer and bits.
+ */
+struct encoded_page;
+#define ENCODE_PAGE_BITS 3ul
+static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
+{
+	return (struct encoded_page *)(flags | (unsigned long)page);
+}
+
+static inline bool encoded_page_flags(struct encoded_page *page)
+{
+	return ENCODE_PAGE_BITS & (unsigned long)page;
+}
+
+static inline struct page *encoded_page_ptr(struct encoded_page *page)
+{
+	return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+}
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
-- 
2.38.1.284.gfd9468d787
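
A minimal usage sketch of the helpers above (not part of the patch;
'dirty' and do_release() are made-up names):

        unsigned long dirty = 1;  /* must fit in ENCODE_PAGE_BITS */
        struct encoded_page *ep = encode_page(page, dirty);

        if (encoded_page_flags(ep))               /* read the extra bits back */
                do_release(encoded_page_ptr(ep)); /* ... and the clean pointer */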


^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-11-16 17:52 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-09 20:30 [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
2022-11-09 20:30 ` [PATCH 2/4] mm: teach release_pages() to take an array of encoded page pointers too Linus Torvalds
2022-11-16  9:24   ` David Hildenbrand
2022-11-09 20:30 ` [PATCH 3/4] mm: mmu_gather: prepare to gather encoded page pointers with flags Linus Torvalds
2022-11-09 20:30 ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
2022-11-09 20:48   ` Linus Torvalds
2022-11-09 21:04     ` Linus Torvalds
2022-11-16  7:47       ` Alexander Gordeev
2022-11-16  7:49         ` mm: mmu_gather: do not expose delayed_rmap flag Alexander Gordeev
2022-11-16 17:45           ` Linus Torvalds
2022-11-16  7:50         ` mm: mmu_gather: do not define delayed_rmap if not used Alexander Gordeev
2022-11-16 17:52           ` Linus Torvalds
2022-11-16  7:52         ` [PATCH 3/4] mm: mmu_gather: turn delayed rmap macros into inlines Alexander Gordeev
2022-11-16  7:55         ` [PATCH 4/4] mm: mmu_gather: rename tlb_delay_rmap() function Alexander Gordeev
2022-11-16 17:39         ` [PATCH 4/4] mm: delay page_remove_rmap() until after the TLB has been flushed Linus Torvalds
2022-11-16  9:15 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits David Hildenbrand
  -- strict thread matches above, loose matches on Subject: below --
2022-11-07 23:47 mm: delay rmap removal until after TLB flush Linus Torvalds
2022-11-08 19:41 ` [PATCH 1/4] mm: introduce 'encoded' page pointers with embedded extra bits Linus Torvalds
2022-11-08 20:37   ` Nadav Amit
2022-11-08 20:46     ` Linus Torvalds
2022-11-09  6:36   ` Alexander Gordeev
2022-11-09 18:00     ` Linus Torvalds
2022-11-09 20:02       ` Linus Torvalds
