* [PATCH 0/3] mm,huge,rmap: unify and speed up compound mapcounts
@ 2022-11-03  1:44 Hugh Dickins
  2022-11-03  1:48 ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Hugh Dickins
                   ` (4 more replies)
  0 siblings, 5 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-03  1:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

I had intended to send just a self-explanatory 1/2 and 2/2 against
6.1-rc3 on Monday; but checked for clashes with linux-next (mainly
mm-unstable) just before sending, and fuzz in mm_types.h revealed
not just a clash with Sidhartha's series, but also that I had missed
the hugetlb usage of tail page->private, problematic for me on 32-bit.

So that series was slightly broken; and although it would probably
have been easy to fix with a "SUBPAGE_INDEX_SUBPOOL = 2" patch,
that would not have moved us forward very well.  So this series is
against next-20221102 (and hopefully later nexts), with a preparatory
1/3 to rejig the hugetlb tail private usage, on top of Sidhartha's.

1/3 mm,hugetlb: use folio fields in second tail page
2/3 mm,thp,rmap: simplify compound page mapcount handling
3/3 mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts

2/3 and 3/3 can almost be applied cleanly to 6.1-rc3 (okay when
you are on 64-bit): just a couple of easily resolved rejects.

 Documentation/mm/transhuge.rst |  34 +---
 include/linux/hugetlb.h        |  23 +--
 include/linux/hugetlb_cgroup.h |  31 +--
 include/linux/mm.h             |  85 ++++++---
 include/linux/mm_types.h       |  91 ++++++---
 include/linux/page-flags.h     |  21 --
 include/linux/rmap.h           |  12 +-
 mm/Kconfig                     |   2 +-
 mm/debug.c                     |   5 +-
 mm/folio-compat.c              |   6 -
 mm/huge_memory.c               |  37 +---
 mm/hugetlb.c                   |   2 +
 mm/khugepaged.c                |  11 +-
 mm/memory-failure.c            |   5 +-
 mm/page_alloc.c                |  27 +--
 mm/rmap.c                      | 359 ++++++++++++++++++++---------------
 mm/util.c                      |  79 --------
 17 files changed, 401 insertions(+), 429 deletions(-)

Hugh



* [PATCH 1/3] mm,hugetlb: use folio fields in second tail page
  2022-11-03  1:44 [PATCH 0/3] mm,huge,rmap: unify and speed up compound mapcounts Hugh Dickins
@ 2022-11-03  1:48 ` Hugh Dickins
  2022-11-03 21:18   ` Sidhartha Kumar
  2022-11-05 19:13   ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Kirill A. Shutemov
  2022-11-03  1:51 ` [PATCH 2/3] mm,thp,rmap: simplify compound page mapcount handling Hugh Dickins
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-03  1:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

We want to declare one more int in the first tail of a compound page:
that first tail page being valuable property, since every compound page
has a first tail, but perhaps no more than that.

No problem on 64-bit: there is already space for it.  No problem with
32-bit THPs: 5.18 commit 5232c63f46fd ("mm: Make compound_pincount always
available") kindly cleared the space for it, apparently not realizing
that only 64-bit architectures enable CONFIG_THP_SWAP (whose use of tail
page->private might conflict) - but make sure of that in its Kconfig.

But hugetlb pages use tail page->private of the first tail page for a
subpool pointer, which will conflict; and they also use page->private
of the 2nd, 3rd and 4th tails.

Undo "mm: add private field of first tail to struct page and struct
folio"'s recent addition of private_1 to the folio tail: instead add
hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison
to a second tail page of the folio; THP has long been using several
fields of that tail, so make better use of it for hugetlb too.
This is not how a generic folio should be declared in future,
but it is an effective transitional way to make use of it.

Delete the SUBPAGE_INDEX stuff, but keep __NR_USED_SUBPAGE: now 3.
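
For illustration: the new fields are laid out so that, for example,
folio->_hugetlb_subpool overlays page[2].hugetlb_subpool, which the
mm_types.h hunk below checks at build time with FOLIO_MATCH()
assertions equivalent to this sketch (field names as in this patch):

	static_assert(offsetof(struct folio, _hugetlb_subpool) ==
		      offsetof(struct page, hugetlb_subpool) +
		      2 * sizeof(struct page));

so a hugetlb accessor can simply dereference folio->_hugetlb_subpool,
instead of doing SUBPAGE_INDEX_* arithmetic on tail page->private.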

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/hugetlb.h        | 23 +++--------
 include/linux/hugetlb_cgroup.h | 31 +++++----------
 include/linux/mm_types.h       | 72 ++++++++++++++++++++++------------
 mm/Kconfig                     |  2 +-
 mm/memory-failure.c            |  5 +--
 5 files changed, 65 insertions(+), 68 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 65ea34022aa2..03ecf1c5e46f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -33,22 +33,9 @@ typedef struct { unsigned long pd; } hugepd_t;
 /*
  * For HugeTLB page, there are more metadata to save in the struct page. But
  * the head struct page cannot meet our needs, so we have to abuse other tail
- * struct page to store the metadata. In order to avoid conflicts caused by
- * subsequent use of more tail struct pages, we gather these discrete indexes
- * of tail struct page here.
+ * struct page to store the metadata.
  */
-enum {
-	SUBPAGE_INDEX_SUBPOOL = 1,	/* reuse page->private */
-#ifdef CONFIG_CGROUP_HUGETLB
-	SUBPAGE_INDEX_CGROUP,		/* reuse page->private */
-	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
-	__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
-#endif
-#ifdef CONFIG_MEMORY_FAILURE
-	SUBPAGE_INDEX_HWPOISON,
-#endif
-	__NR_USED_SUBPAGE,
-};
+#define __NR_USED_SUBPAGE 3
 
 struct hugepage_subpool {
 	spinlock_t lock;
@@ -722,11 +709,11 @@ extern unsigned int default_hstate_idx;
 
 static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio)
 {
-	return (void *)folio_get_private_1(folio);
+	return folio->_hugetlb_subpool;
 }
 
 /*
- * hugetlb page subpool pointer located in hpage[1].private
+ * hugetlb page subpool pointer located in hpage[2].hugetlb_subpool
  */
 static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
 {
@@ -736,7 +723,7 @@ static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
 static inline void hugetlb_set_folio_subpool(struct folio *folio,
 					struct hugepage_subpool *subpool)
 {
-	folio_set_private_1(folio, (unsigned long)subpool);
+	folio->_hugetlb_subpool = subpool;
 }
 
 static inline void hugetlb_set_page_subpool(struct page *hpage,
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index c70f92fe493e..f706626a8063 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -24,12 +24,10 @@ struct file_region;
 #ifdef CONFIG_CGROUP_HUGETLB
 /*
  * Minimum page order trackable by hugetlb cgroup.
- * At least 4 pages are necessary for all the tracking information.
- * The second tail page (hpage[SUBPAGE_INDEX_CGROUP]) is the fault
- * usage cgroup. The third tail page (hpage[SUBPAGE_INDEX_CGROUP_RSVD])
- * is the reservation usage cgroup.
+ * At least 3 pages are necessary for all the tracking information.
+ * The second tail page contains all of the hugetlb-specific fields.
  */
-#define HUGETLB_CGROUP_MIN_ORDER order_base_2(__MAX_CGROUP_SUBPAGE_INDEX + 1)
+#define HUGETLB_CGROUP_MIN_ORDER order_base_2(__NR_USED_SUBPAGE)
 
 enum hugetlb_memory_event {
 	HUGETLB_MAX,
@@ -69,21 +67,13 @@ struct hugetlb_cgroup {
 static inline struct hugetlb_cgroup *
 __hugetlb_cgroup_from_folio(struct folio *folio, bool rsvd)
 {
-	struct page *tail;
-
 	VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio);
 	if (folio_order(folio) < HUGETLB_CGROUP_MIN_ORDER)
 		return NULL;
-
-	if (rsvd) {
-		tail = folio_page(folio, SUBPAGE_INDEX_CGROUP_RSVD);
-		return (void *)page_private(tail);
-	}
-
-	else {
-		tail = folio_page(folio, SUBPAGE_INDEX_CGROUP);
-		return (void *)page_private(tail);
-	}
+	if (rsvd)
+		return folio->_hugetlb_cgroup_rsvd;
+	else
+		return folio->_hugetlb_cgroup;
 }
 
 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_folio(struct folio *folio)
@@ -101,15 +91,12 @@ static inline void __set_hugetlb_cgroup(struct folio *folio,
 				       struct hugetlb_cgroup *h_cg, bool rsvd)
 {
 	VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio);
-
 	if (folio_order(folio) < HUGETLB_CGROUP_MIN_ORDER)
 		return;
 	if (rsvd)
-		set_page_private(folio_page(folio, SUBPAGE_INDEX_CGROUP_RSVD),
-				 (unsigned long)h_cg);
+		folio->_hugetlb_cgroup_rsvd = h_cg;
 	else
-		set_page_private(folio_page(folio, SUBPAGE_INDEX_CGROUP),
-				 (unsigned long)h_cg);
+		folio->_hugetlb_cgroup = h_cg;
 }
 
 static inline void set_hugetlb_cgroup(struct folio *folio,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 834022721bc6..728eb6089bba 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -145,15 +145,22 @@ struct page {
 			atomic_t compound_pincount;
 #ifdef CONFIG_64BIT
 			unsigned int compound_nr; /* 1 << compound_order */
-			unsigned long _private_1;
 #endif
 		};
-		struct {	/* Second tail page of compound page */
+		struct {	/* Second tail page of transparent huge page */
 			unsigned long _compound_pad_1;	/* compound_head */
 			unsigned long _compound_pad_2;
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
+		struct {	/* Second tail page of hugetlb page */
+			unsigned long _hugetlb_pad_1;	/* compound_head */
+			void *hugetlb_subpool;
+			void *hugetlb_cgroup;
+			void *hugetlb_cgroup_rsvd;
+			void *hugetlb_hwpoison;
+			/* No more space on 32-bit: use third tail if more */
+		};
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
@@ -260,13 +267,16 @@ struct page {
  *    to find how many references there are to this folio.
  * @memcg_data: Memory Control Group data.
  * @_flags_1: For large folios, additional page flags.
- * @__head: Points to the folio.  Do not use.
+ * @_head_1: Points to the folio.  Do not use.
  * @_folio_dtor: Which destructor to use for this folio.
  * @_folio_order: Do not use directly, call folio_order().
  * @_total_mapcount: Do not use directly, call folio_entire_mapcount().
  * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
  * @_folio_nr_pages: Do not use directly, call folio_nr_pages().
- * @_private_1: Do not use directly, call folio_get_private_1().
+ * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
+ * @_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
+ * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
+ * @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head().
  *
  * A folio is a physically, virtually and logically contiguous set
  * of bytes.  It is a power-of-two in size, and it is aligned to that
@@ -305,16 +315,31 @@ struct folio {
 		};
 		struct page page;
 	};
-	unsigned long _flags_1;
-	unsigned long __head;
-	unsigned char _folio_dtor;
-	unsigned char _folio_order;
-	atomic_t _total_mapcount;
-	atomic_t _pincount;
+	union {
+		struct {
+			unsigned long _flags_1;
+			unsigned long _head_1;
+			unsigned char _folio_dtor;
+			unsigned char _folio_order;
+			atomic_t _total_mapcount;
+			atomic_t _pincount;
 #ifdef CONFIG_64BIT
-	unsigned int _folio_nr_pages;
+			unsigned int _folio_nr_pages;
 #endif
-	unsigned long _private_1;
+		};
+		struct page page_1;
+	};
+	union {
+		struct {
+			unsigned long _flags_2;
+			unsigned long _head_2;
+			void *_hugetlb_subpool;
+			void *_hugetlb_cgroup;
+			void *_hugetlb_cgroup_rsvd;
+			void *_hugetlb_hwpoison;
+		};
+		struct page page_2;
+	};
 };
 
 #define FOLIO_MATCH(pg, fl)						\
@@ -335,16 +360,25 @@ FOLIO_MATCH(memcg_data, memcg_data);
 	static_assert(offsetof(struct folio, fl) ==			\
 			offsetof(struct page, pg) + sizeof(struct page))
 FOLIO_MATCH(flags, _flags_1);
-FOLIO_MATCH(compound_head, __head);
+FOLIO_MATCH(compound_head, _head_1);
 FOLIO_MATCH(compound_dtor, _folio_dtor);
 FOLIO_MATCH(compound_order, _folio_order);
 FOLIO_MATCH(compound_mapcount, _total_mapcount);
 FOLIO_MATCH(compound_pincount, _pincount);
 #ifdef CONFIG_64BIT
 FOLIO_MATCH(compound_nr, _folio_nr_pages);
-FOLIO_MATCH(_private_1, _private_1);
 #endif
 #undef FOLIO_MATCH
+#define FOLIO_MATCH(pg, fl)						\
+	static_assert(offsetof(struct folio, fl) ==			\
+			offsetof(struct page, pg) + 2 * sizeof(struct page))
+FOLIO_MATCH(flags, _flags_2);
+FOLIO_MATCH(compound_head, _head_2);
+FOLIO_MATCH(hugetlb_subpool, _hugetlb_subpool);
+FOLIO_MATCH(hugetlb_cgroup, _hugetlb_cgroup);
+FOLIO_MATCH(hugetlb_cgroup_rsvd, _hugetlb_cgroup_rsvd);
+FOLIO_MATCH(hugetlb_hwpoison, _hugetlb_hwpoison);
+#undef FOLIO_MATCH
 
 static inline atomic_t *folio_mapcount_ptr(struct folio *folio)
 {
@@ -388,16 +422,6 @@ static inline void *folio_get_private(struct folio *folio)
 	return folio->private;
 }
 
-static inline void folio_set_private_1(struct folio *folio, unsigned long private)
-{
-	folio->_private_1 = private;
-}
-
-static inline unsigned long folio_get_private_1(struct folio *folio)
-{
-	return folio->_private_1;
-}
-
 struct page_frag_cache {
 	void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
diff --git a/mm/Kconfig b/mm/Kconfig
index 57e1d8c5b505..bc7e7dacfcd5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -775,7 +775,7 @@ endchoice
 
 config THP_SWAP
 	def_bool y
-	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP
+	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP && 64BIT
 	help
 	  Swap transparent huge pages in one piece, without splitting.
 	  XXX: For now, swap cluster backing transparent huge page
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 779a426d2cab..63d8501001c6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1687,8 +1687,7 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
 #ifdef CONFIG_HUGETLB_PAGE
 /*
  * Struct raw_hwp_page represents information about "raw error page",
- * constructing singly linked list originated from ->private field of
- * SUBPAGE_INDEX_HWPOISON-th tail page.
+ * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
  */
 struct raw_hwp_page {
 	struct llist_node node;
@@ -1697,7 +1696,7 @@ struct raw_hwp_page {
 
 static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
 {
-	return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
+	return (struct llist_head *)&page_folio(hpage)->_hugetlb_hwpoison;
 }
 
 static unsigned long __free_raw_hwp_pages(struct page *hpage, bool move_flag)
-- 
2.35.3




* [PATCH 2/3] mm,thp,rmap: simplify compound page mapcount handling
  2022-11-03  1:44 [PATCH 0/3] mm,huge,rmap: unify and speed up compound mapcounts Hugh Dickins
  2022-11-03  1:48 ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Hugh Dickins
@ 2022-11-03  1:51 ` Hugh Dickins
  2022-11-05 19:51   ` Kirill A. Shutemov
  2022-11-03  1:53 ` [PATCH 3/3] mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts Hugh Dickins
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-03  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

Compound page (folio) mapcount calculations have been different for
anon and file (or shmem) THPs, and involved the obscure PageDoubleMap
flag.  And each huge mapping and unmapping of a file (or shmem) THP
involved atomically incrementing and decrementing the mapcount of every
subpage of that huge page, dirtying many struct page cachelines.

Add subpages_mapcount field to the struct folio and first tail page,
so that the total of subpage mapcounts is available in one place near
the head: then page_mapcount() and total_mapcount() and page_mapped(),
and their folio equivalents, are so quick that anon and file and hugetlb
don't need to be optimized differently. Delete the unloved PageDoubleMap.
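
For reference, a condensed sketch of the resulting fast paths (the full
versions, including the 0-order and folio cases, are in the mm.h hunk
below; the sketch_ names are illustrative, the helpers are those this
patch adds):

	/* mapcount of one subpage of a compound page */
	static inline int sketch_page_mapcount(struct page *page)
	{
		int mapcount = atomic_read(&page->_mapcount) + 1;

		if (likely(!PageCompound(page)))
			return mapcount;
		/* add the pmd/pud mappings counted on the head */
		return head_compound_mapcount(compound_head(page)) + mapcount;
	}

	/* whole-page mappings plus the sum of subpage pte mapcounts */
	static inline int sketch_total_mapcount(struct page *head)
	{
		return head_compound_mapcount(head) + head_subpages_mapcount(head);
	}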

page_add and page_remove rmap functions must now maintain the
subpages_mapcount as well as the subpage _mapcount, when dealing with
pte mappings of huge pages; and correct maintenance of NR_ANON_MAPPED
and NR_FILE_MAPPED statistics still needs reading through the subpages,
using nr_subpages_unmapped() - but only when first or last pmd mapping
finds subpages_mapcount raised (double-map case, not the common case).
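
For example, here is roughly how the NR_FILE_MAPPED delta is worked out
when the first pmd mapping of a file THP arrives (a sketch of the
page_add_file_rmap() path in the rmap.c hunk below; the shmem versus
file distinction is omitted here):

	int nr, nr_pages;

	nr = nr_pages = thp_nr_pages(page);
	if (head_subpages_mapcount(page))
		/* discount subpages already counted via their ptes */
		nr = nr_subpages_unmapped(page, nr_pages);
	__mod_lruvec_page_state(page, NR_FILE_PMDMAPPED, nr_pages);
	__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);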

But are those counts (used to decide when to split an anon THP, and
in vmscan's pagecache_reclaimable heuristic) correctly maintained?
Not quite: since page_remove_rmap() (and also split_huge_pmd()) is
often called without page lock, there can be races when a subpage pte
mapcount goes 0<->1 while a compound pmd mapcount 0<->1 transition is
scanning the subpages - races which the previous implementation had
prevented. The statistics might become inaccurate, and even drift down
until they underflow through 0.
That is not good enough, but is better dealt with in a followup patch.

Update a few comments on first and second tail page overlaid fields.
hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
subpages_mapcount and compound_pincount are already correctly at 0,
so delete its reinitialization of compound_pincount.

A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB)
took 18 seconds on small pages, and used to take 1 second on huge pages,
but now takes 119 milliseconds on huge pages.  Mapping by pmds a second
time used to take 860ms and now takes 92ms; mapping by pmds after mapping
by ptes (when the scan is needed) used to take 870ms and now takes 495ms.
But there might be some benchmarks which would show a slowdown, because
tail struct pages now fall out of cache until final freeing checks them.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 Documentation/mm/transhuge.rst |  18 -----
 include/linux/mm.h             |  85 ++++++++++++++------
 include/linux/mm_types.h       |  21 ++++-
 include/linux/page-flags.h     |  21 -----
 include/linux/rmap.h           |   2 +
 mm/debug.c                     |   5 +-
 mm/folio-compat.c              |   6 --
 mm/huge_memory.c               |  36 ++-------
 mm/hugetlb.c                   |   2 +
 mm/khugepaged.c                |  11 +--
 mm/page_alloc.c                |  27 ++++---
 mm/rmap.c                      | 142 +++++++++++++++++++--------------
 mm/util.c                      |  79 ------------------
 13 files changed, 194 insertions(+), 261 deletions(-)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index 216db1d67d04..a560e0c01b16 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -125,24 +125,6 @@ pages:
     ->_mapcount of all sub-pages in order to have race-free detection of
     last unmap of subpages.
 
-PageDoubleMap() indicates that the page is *possibly* mapped with PTEs.
-
-For anonymous pages, PageDoubleMap() also indicates ->_mapcount in all
-subpages is offset up by one. This additional reference is required to
-get race-free detection of unmap of subpages when we have them mapped with
-both PMDs and PTEs.
-
-This optimization is required to lower the overhead of per-subpage mapcount
-tracking. The alternative is to alter ->_mapcount in all subpages on each
-map/unmap of the whole compound page.
-
-For anonymous pages, we set PG_double_map when a PMD of the page is split
-for the first time, but still have a PMD mapping. The additional references
-go away with the last compound_mapcount.
-
-File pages get PG_double_map set on the first map of the page with PTE and
-goes away when the page gets evicted from the page cache.
-
 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
 structures. It can be done easily for refcounts taken by page table
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25ff9a14a777..5b99e3216a23 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -818,8 +818,8 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 /*
  * How many times the entire folio is mapped as a single unit (eg by a
  * PMD or PUD entry).  This is probably not what you want, except for
- * debugging purposes; look at folio_mapcount() or page_mapcount()
- * instead.
+ * debugging purposes - it does not include PTE-mapped sub-pages; look
+ * at folio_mapcount() or page_mapcount() or total_mapcount() instead.
  */
 static inline int folio_entire_mapcount(struct folio *folio)
 {
@@ -829,12 +829,20 @@ static inline int folio_entire_mapcount(struct folio *folio)
 
 /*
  * Mapcount of compound page as a whole, does not include mapped sub-pages.
- *
- * Must be called only for compound pages.
+ * Must be called only on head of compound page.
  */
-static inline int compound_mapcount(struct page *page)
+static inline int head_compound_mapcount(struct page *head)
 {
-	return folio_entire_mapcount(page_folio(page));
+	return atomic_read(compound_mapcount_ptr(head)) + 1;
+}
+
+/*
+ * Sum of mapcounts of sub-pages, does not include compound mapcount.
+ * Must be called only on head of compound page.
+ */
+static inline int head_subpages_mapcount(struct page *head)
+{
+	return atomic_read(subpages_mapcount_ptr(head));
 }
 
 /*
@@ -847,11 +855,9 @@ static inline void page_mapcount_reset(struct page *page)
 	atomic_set(&(page)->_mapcount, -1);
 }
 
-int __page_mapcount(struct page *page);
-
 /*
  * Mapcount of 0-order page; when compound sub-page, includes
- * compound_mapcount().
+ * compound_mapcount of compound_head of page.
  *
  * Result is undefined for pages which cannot be mapped into userspace.
  * For example SLAB or special types of pages. See function page_has_type().
@@ -859,25 +865,61 @@ int __page_mapcount(struct page *page);
  */
 static inline int page_mapcount(struct page *page)
 {
-	if (unlikely(PageCompound(page)))
-		return __page_mapcount(page);
-	return atomic_read(&page->_mapcount) + 1;
-}
+	int mapcount = atomic_read(&page->_mapcount) + 1;
 
-int folio_mapcount(struct folio *folio);
+	if (likely(!PageCompound(page)))
+		return mapcount;
+	page = compound_head(page);
+	return head_compound_mapcount(page) + mapcount;
+}
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int total_mapcount(struct page *page)
 {
-	return folio_mapcount(page_folio(page));
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+	page = compound_head(page);
+	return head_compound_mapcount(page) + head_subpages_mapcount(page);
 }
 
-#else
-static inline int total_mapcount(struct page *page)
+/*
+ * Return true if this page is mapped into pagetables.
+ * For compound page it returns true if any subpage of compound page is mapped,
+ * even if this particular subpage is not itself mapped by any PTE or PMD.
+ */
+static inline bool page_mapped(struct page *page)
 {
-	return page_mapcount(page);
+	return total_mapcount(page) > 0;
+}
+
+/**
+ * folio_mapcount() - Calculate the number of mappings of this folio.
+ * @folio: The folio.
+ *
+ * A large folio tracks both how many times the entire folio is mapped,
+ * and how many times each individual page in the folio is mapped.
+ * This function calculates the total number of times the folio is
+ * mapped.
+ *
+ * Return: The number of times this folio is mapped.
+ */
+static inline int folio_mapcount(struct folio *folio)
+{
+	if (likely(!folio_test_large(folio)))
+		return atomic_read(&folio->_mapcount) + 1;
+	return atomic_read(folio_mapcount_ptr(folio)) + 1 +
+		atomic_read(folio_subpages_mapcount_ptr(folio));
+}
+
+/**
+ * folio_mapped - Is this folio mapped into userspace?
+ * @folio: The folio.
+ *
+ * Return: True if any page in this folio is referenced by user page tables.
+ */
+static inline bool folio_mapped(struct folio *folio)
+{
+	return folio_mapcount(folio) > 0;
 }
-#endif
 
 static inline struct page *virt_to_head_page(const void *x)
 {
@@ -1770,9 +1812,6 @@ static inline pgoff_t page_index(struct page *page)
 	return page->index;
 }
 
-bool page_mapped(struct page *page);
-bool folio_mapped(struct folio *folio);
-
 /*
  * Return true only if the page has been allocated with
  * ALLOC_NO_WATERMARKS and the low watermark was not
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 728eb6089bba..069620826a19 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -142,6 +142,7 @@ struct page {
 			unsigned char compound_dtor;
 			unsigned char compound_order;
 			atomic_t compound_mapcount;
+			atomic_t subpages_mapcount;
 			atomic_t compound_pincount;
 #ifdef CONFIG_64BIT
 			unsigned int compound_nr; /* 1 << compound_order */
@@ -270,7 +271,8 @@ struct page {
  * @_head_1: Points to the folio.  Do not use.
  * @_folio_dtor: Which destructor to use for this folio.
  * @_folio_order: Do not use directly, call folio_order().
- * @_total_mapcount: Do not use directly, call folio_entire_mapcount().
+ * @_compound_mapcount: Do not use directly, call folio_entire_mapcount().
+ * @_subpages_mapcount: Do not use directly, call folio_mapcount().
  * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
  * @_folio_nr_pages: Do not use directly, call folio_nr_pages().
  * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
@@ -321,7 +323,8 @@ struct folio {
 			unsigned long _head_1;
 			unsigned char _folio_dtor;
 			unsigned char _folio_order;
-			atomic_t _total_mapcount;
+			atomic_t _compound_mapcount;
+			atomic_t _subpages_mapcount;
 			atomic_t _pincount;
 #ifdef CONFIG_64BIT
 			unsigned int _folio_nr_pages;
@@ -363,7 +366,8 @@ FOLIO_MATCH(flags, _flags_1);
 FOLIO_MATCH(compound_head, _head_1);
 FOLIO_MATCH(compound_dtor, _folio_dtor);
 FOLIO_MATCH(compound_order, _folio_order);
-FOLIO_MATCH(compound_mapcount, _total_mapcount);
+FOLIO_MATCH(compound_mapcount, _compound_mapcount);
+FOLIO_MATCH(subpages_mapcount, _subpages_mapcount);
 FOLIO_MATCH(compound_pincount, _pincount);
 #ifdef CONFIG_64BIT
 FOLIO_MATCH(compound_nr, _folio_nr_pages);
@@ -386,11 +390,22 @@ static inline atomic_t *folio_mapcount_ptr(struct folio *folio)
 	return &tail->compound_mapcount;
 }
 
+static inline atomic_t *folio_subpages_mapcount_ptr(struct folio *folio)
+{
+	struct page *tail = &folio->page + 1;
+	return &tail->subpages_mapcount;
+}
+
 static inline atomic_t *compound_mapcount_ptr(struct page *page)
 {
 	return &page[1].compound_mapcount;
 }
 
+static inline atomic_t *subpages_mapcount_ptr(struct page *page)
+{
+	return &page[1].subpages_mapcount;
+}
+
 static inline atomic_t *compound_pincount_ptr(struct page *page)
 {
 	return &page[1].compound_pincount;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0b0ae5084e60..e42c55a7e012 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -176,9 +176,6 @@ enum pageflags {
 	/* SLOB */
 	PG_slob_free = PG_private,
 
-	/* Compound pages. Stored in first tail page's flags */
-	PG_double_map = PG_workingset,
-
 #ifdef CONFIG_MEMORY_FAILURE
 	/*
 	 * Compound pages. Stored in first tail page's flags.
@@ -874,29 +871,11 @@ static inline int PageTransTail(struct page *page)
 {
 	return PageTail(page);
 }
-
-/*
- * PageDoubleMap indicates that the compound page is mapped with PTEs as well
- * as PMDs.
- *
- * This is required for optimization of rmap operations for THP: we can postpone
- * per small page mapcount accounting (and its overhead from atomic operations)
- * until the first PMD split.
- *
- * For the page PageDoubleMap means ->_mapcount in all sub-pages is offset up
- * by one. This reference will go away with last compound_mapcount.
- *
- * See also __split_huge_pmd_locked() and page_remove_anon_compound_rmap().
- */
-PAGEFLAG(DoubleMap, double_map, PF_SECOND)
-	TESTSCFLAG(DoubleMap, double_map, PF_SECOND)
 #else
 TESTPAGEFLAG_FALSE(TransHuge, transhuge)
 TESTPAGEFLAG_FALSE(TransCompound, transcompound)
 TESTPAGEFLAG_FALSE(TransCompoundMap, transcompoundmap)
 TESTPAGEFLAG_FALSE(TransTail, transtail)
-PAGEFLAG_FALSE(DoubleMap, double_map)
-	TESTSCFLAG_FALSE(DoubleMap, double_map)
 #endif
 
 #if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..1973649e8f93 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -206,6 +206,8 @@ void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 
 static inline void __page_dup_rmap(struct page *page, bool compound)
 {
+	if (!compound && PageCompound(page))
+		atomic_inc(subpages_mapcount_ptr(compound_head(page)));
 	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
 }
 
diff --git a/mm/debug.c b/mm/debug.c
index 0fd15ba70d16..7f8e5f744e42 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -94,9 +94,10 @@ static void __dump_page(struct page *page)
 			page, page_ref_count(head), mapcount, mapping,
 			page_to_pgoff(page), page_to_pfn(page));
 	if (compound) {
-		pr_warn("head:%p order:%u compound_mapcount:%d compound_pincount:%d\n",
+		pr_warn("head:%p order:%u compound_mapcount:%d subpages_mapcount:%d compound_pincount:%d\n",
 				head, compound_order(head),
-				folio_entire_mapcount(folio),
+				head_compound_mapcount(head),
+				head_subpages_mapcount(head),
 				head_compound_pincount(head));
 	}
 
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index bac2a366aada..cbfe51091c39 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -39,12 +39,6 @@ void wait_for_stable_page(struct page *page)
 }
 EXPORT_SYMBOL_GPL(wait_for_stable_page);
 
-bool page_mapped(struct page *page)
-{
-	return folio_mapped(page_folio(page));
-}
-EXPORT_SYMBOL(page_mapped);
-
 void mark_page_accessed(struct page *page)
 {
 	folio_mark_accessed(page_folio(page));
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a524db74e9e6..23ff175768c3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2093,6 +2093,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 		VM_BUG_ON_PAGE(!page_count(page), page);
 		page_ref_add(page, HPAGE_PMD_NR - 1);
+		atomic_add(HPAGE_PMD_NR, subpages_mapcount_ptr(page));
 
 		/*
 		 * Without "freeze", we'll simply split the PMD, propagating the
@@ -2173,33 +2174,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		pte_unmap(pte);
 	}
 
-	if (!pmd_migration) {
-		/*
-		 * Set PG_double_map before dropping compound_mapcount to avoid
-		 * false-negative page_mapped().
-		 */
-		if (compound_mapcount(page) > 1 &&
-		    !TestSetPageDoubleMap(page)) {
-			for (i = 0; i < HPAGE_PMD_NR; i++)
-				atomic_inc(&page[i]._mapcount);
-		}
-
-		lock_page_memcg(page);
-		if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
-			/* Last compound_mapcount is gone. */
-			__mod_lruvec_page_state(page, NR_ANON_THPS,
-						-HPAGE_PMD_NR);
-			if (TestClearPageDoubleMap(page)) {
-				/* No need in mapcount reference anymore */
-				for (i = 0; i < HPAGE_PMD_NR; i++)
-					atomic_dec(&page[i]._mapcount);
-			}
-		}
-		unlock_page_memcg(page);
-
-		/* Above is effectively page_remove_rmap(page, vma, true) */
-		munlock_vma_page(page, vma, true);
-	}
+	if (!pmd_migration)
+		page_remove_rmap(page, vma, true);
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
@@ -2401,7 +2377,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_dirty) |
 			 LRU_GEN_MASK | LRU_REFS_MASK));
 
-	/* ->mapping in first tail page is compound_mapcount */
+	/* ->mapping in first and second tail page is replaced by other uses */
 	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
 			page_tail);
 	page_tail->mapping = head->mapping;
@@ -2411,6 +2387,10 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * page->private should not be set in tail pages with the exception
 	 * of swap cache pages that store the swp_entry_t in tail pages.
 	 * Fix up and warn once if private is unexpectedly set.
+	 *
+	 * What of 32-bit systems, on which head[1].compound_pincount overlays
+	 * head[1].private?  No problem: THP_SWAP is not enabled on 32-bit, and
+	 * compound_pincount must be 0 for folio_ref_freeze() to have succeeded.
 	 */
 	if (!folio_test_swapcache(page_folio(head))) {
 		VM_WARN_ON_ONCE_PAGE(page_tail->private != 0, page_tail);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b27caef538f9..f8355360b3cd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1333,6 +1333,7 @@ static void __destroy_compound_gigantic_page(struct page *page,
 	struct page *p;
 
 	atomic_set(compound_mapcount_ptr(page), 0);
+	atomic_set(subpages_mapcount_ptr(page), 0);
 	atomic_set(compound_pincount_ptr(page), 0);
 
 	for (i = 1; i < nr_pages; i++) {
@@ -1850,6 +1851,7 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
 			set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+	atomic_set(subpages_mapcount_ptr(page), 0);
 	atomic_set(compound_pincount_ptr(page), 0);
 	return true;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ea0d186bc9d4..564f996c388d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1242,15 +1242,8 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		/*
 		 * Check if the page has any GUP (or other external) pins.
 		 *
-		 * Here the check is racy it may see total_mapcount > refcount
-		 * in some cases.
-		 * For example, one process with one forked child process.
-		 * The parent has the PMD split due to MADV_DONTNEED, then
-		 * the child is trying unmap the whole PMD, but khugepaged
-		 * may be scanning the parent between the child has
-		 * PageDoubleMap flag cleared and dec the mapcount.  So
-		 * khugepaged may see total_mapcount > refcount.
-		 *
+		 * Here the check may be racy:
+		 * it may see total_mapcount > refcount in some cases?
 		 * But such case is ephemeral we could always retry collapse
 		 * later.  However it may report false positive if the page
 		 * has excessive GUP pins (i.e. 512).  Anyway the same check
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7192ded44ad0..f7a63684e6c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -798,6 +798,7 @@ static void prep_compound_head(struct page *page, unsigned int order)
 	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
 	set_compound_order(page, order);
 	atomic_set(compound_mapcount_ptr(page), -1);
+	atomic_set(subpages_mapcount_ptr(page), 0);
 	atomic_set(compound_pincount_ptr(page), 0);
 }
 
@@ -1324,11 +1325,19 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 	}
 	switch (page - head_page) {
 	case 1:
-		/* the first tail page: ->mapping may be compound_mapcount() */
-		if (unlikely(compound_mapcount(page))) {
+		/* the first tail page: these may be in place of ->mapping */
+		if (unlikely(head_compound_mapcount(head_page))) {
 			bad_page(page, "nonzero compound_mapcount");
 			goto out;
 		}
+		if (unlikely(head_subpages_mapcount(head_page))) {
+			bad_page(page, "nonzero subpages_mapcount");
+			goto out;
+		}
+		if (unlikely(head_compound_pincount(head_page))) {
+			bad_page(page, "nonzero compound_pincount");
+			goto out;
+		}
 		break;
 	case 2:
 		/*
@@ -1433,10 +1442,8 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
-		if (compound) {
-			ClearPageDoubleMap(page);
+		if (compound)
 			ClearPageHasHWPoisoned(page);
-		}
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_pages_check(page, page + i);
@@ -6871,13 +6878,11 @@ static void __ref memmap_init_compound(struct page *head,
 		set_page_count(page, 0);
 
 		/*
-		 * The first tail page stores compound_mapcount_ptr() and
-		 * compound_order() and the second tail page stores
-		 * compound_pincount_ptr(). Call prep_compound_head() after
-		 * the first and second tail pages have been initialized to
-		 * not have the data overwritten.
+		 * The first tail page stores important compound page info.
+		 * Call prep_compound_head() after the first tail page has
+		 * been initialized, to not have the data overwritten.
 		 */
-		if (pfn == head_pfn + 2)
+		if (pfn == head_pfn + 1)
 			prep_compound_head(head, order);
 	}
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 3b2d18bbdc44..f43339ea4970 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1085,6 +1085,24 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 	return page_vma_mkclean_one(&pvmw);
 }
 
+/*
+ * When mapping a THP's first pmd, or unmapping its last pmd, if that THP
+ * also has pte mappings, then those must be discounted: in order to maintain
+ * NR_ANON_MAPPED and NR_FILE_MAPPED statistics exactly, without any drift,
+ * and to decide when an anon THP should be put on the deferred split queue.
+ */
+static int nr_subpages_unmapped(struct page *head, int nr_subpages)
+{
+	int nr = nr_subpages;
+	int i;
+
+	/* Discount those subpages mapped by pte */
+	for (i = 0; i < nr_subpages; i++)
+		if (atomic_read(&head[i]._mapcount) >= 0)
+			nr--;
+	return nr;
+}
+
 /**
  * page_move_anon_rmap - move a page to our anon_vma
  * @page:	the page to move to our anon_vma
@@ -1194,6 +1212,7 @@ static void __page_check_anon_rmap(struct page *page,
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, rmap_t flags)
 {
+	int nr, nr_pages;
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
 
@@ -1202,28 +1221,32 @@ void page_add_anon_rmap(struct page *page,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	if (compound) {
+	if (compound && PageTransHuge(page)) {
 		atomic_t *mapcount;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		mapcount = compound_mapcount_ptr(page);
 		first = atomic_inc_and_test(mapcount);
+
+		nr = nr_pages = thp_nr_pages(page);
+		if (first && head_subpages_mapcount(page))
+			nr = nr_subpages_unmapped(page, nr_pages);
 	} else {
+		nr = 1;
+		if (PageTransCompound(page)) {
+			struct page *head = compound_head(page);
+
+			atomic_inc(subpages_mapcount_ptr(head));
+			nr = !head_compound_mapcount(head);
+		}
 		first = atomic_inc_and_test(&page->_mapcount);
 	}
+
 	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
 	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
 
 	if (first) {
-		int nr = compound ? thp_nr_pages(page) : 1;
-		/*
-		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
-		 * these counters are not modified in interrupt context, and
-		 * pte lock(a spinlock) is held, which implies preemption
-		 * disabled.
-		 */
 		if (compound)
-			__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
+			__mod_lruvec_page_state(page, NR_ANON_THPS, nr_pages);
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 	}
 
@@ -1265,8 +1288,6 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
-		atomic_set(compound_pincount_ptr(page), 0);
-
 		__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
 	} else {
 		/* increment count (starts at -1) */
@@ -1287,29 +1308,19 @@ void page_add_new_anon_rmap(struct page *page,
 void page_add_file_rmap(struct page *page,
 	struct vm_area_struct *vma, bool compound)
 {
-	int i, nr = 0;
+	int nr = 0;
 
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
 	if (compound && PageTransHuge(page)) {
-		int nr_pages = thp_nr_pages(page);
+		int nr_pages;
 
-		for (i = 0; i < nr_pages; i++) {
-			if (atomic_inc_and_test(&page[i]._mapcount))
-				nr++;
-		}
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
 
-		/*
-		 * It is racy to ClearPageDoubleMap in page_remove_file_rmap();
-		 * but page lock is held by all page_add_file_rmap() compound
-		 * callers, and SetPageDoubleMap below warns if !PageLocked:
-		 * so here is a place that DoubleMap can be safely cleared.
-		 */
-		VM_WARN_ON_ONCE(!PageLocked(page));
-		if (nr == nr_pages && PageDoubleMap(page))
-			ClearPageDoubleMap(page);
+		nr = nr_pages = thp_nr_pages(page);
+		if (head_subpages_mapcount(page))
+			nr = nr_subpages_unmapped(page, nr_pages);
 
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
@@ -1318,11 +1329,15 @@ void page_add_file_rmap(struct page *page,
 			__mod_lruvec_page_state(page, NR_FILE_PMDMAPPED,
 						nr_pages);
 	} else {
-		if (PageTransCompound(page) && page_mapping(page)) {
-			VM_WARN_ON_ONCE(!PageLocked(page));
-			SetPageDoubleMap(compound_head(page));
+		bool pmd_mapped = false;
+
+		if (PageTransCompound(page)) {
+			struct page *head = compound_head(page);
+
+			atomic_inc(subpages_mapcount_ptr(head));
+			pmd_mapped = head_compound_mapcount(head);
 		}
-		if (atomic_inc_and_test(&page->_mapcount))
+		if (atomic_inc_and_test(&page->_mapcount) && !pmd_mapped)
 			nr++;
 	}
 out:
@@ -1335,7 +1350,7 @@ void page_add_file_rmap(struct page *page,
 
 static void page_remove_file_rmap(struct page *page, bool compound)
 {
-	int i, nr = 0;
+	int nr = 0;
 
 	VM_BUG_ON_PAGE(compound && !PageHead(page), page);
 
@@ -1348,14 +1363,15 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 
 	/* page still mapped by someone else? */
 	if (compound && PageTransHuge(page)) {
-		int nr_pages = thp_nr_pages(page);
+		int nr_pages;
 
-		for (i = 0; i < nr_pages; i++) {
-			if (atomic_add_negative(-1, &page[i]._mapcount))
-				nr++;
-		}
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
-			goto out;
+			return;
+
+		nr = nr_pages = thp_nr_pages(page);
+		if (head_subpages_mapcount(page))
+			nr = nr_subpages_unmapped(page, nr_pages);
+
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						-nr_pages);
@@ -1363,17 +1379,25 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 			__mod_lruvec_page_state(page, NR_FILE_PMDMAPPED,
 						-nr_pages);
 	} else {
-		if (atomic_add_negative(-1, &page->_mapcount))
+		bool pmd_mapped = false;
+
+		if (PageTransCompound(page)) {
+			struct page *head = compound_head(page);
+
+			atomic_dec(subpages_mapcount_ptr(head));
+			pmd_mapped = head_compound_mapcount(head);
+		}
+		if (atomic_add_negative(-1, &page->_mapcount) && !pmd_mapped)
 			nr++;
 	}
-out:
+
 	if (nr)
 		__mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr);
 }
 
 static void page_remove_anon_compound_rmap(struct page *page)
 {
-	int i, nr;
+	int nr, nr_pages;
 
 	if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 		return;
@@ -1385,27 +1409,19 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	__mod_lruvec_page_state(page, NR_ANON_THPS, -thp_nr_pages(page));
+	nr = nr_pages = thp_nr_pages(page);
+	__mod_lruvec_page_state(page, NR_ANON_THPS, -nr);
 
-	if (TestClearPageDoubleMap(page)) {
-		/*
-		 * Subpages can be mapped with PTEs too. Check how many of
-		 * them are still mapped.
-		 */
-		for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
-			if (atomic_add_negative(-1, &page[i]._mapcount))
-				nr++;
-		}
+	if (head_subpages_mapcount(page)) {
+		nr = nr_subpages_unmapped(page, nr_pages);
 
 		/*
 		 * Queue the page for deferred split if at least one small
 		 * page of the compound page is unmapped, but at least one
 		 * small page is still mapped.
 		 */
-		if (nr && nr < thp_nr_pages(page))
+		if (nr && nr < nr_pages)
 			deferred_split_huge_page(page);
-	} else {
-		nr = thp_nr_pages(page);
 	}
 
 	if (nr)
@@ -1423,6 +1439,8 @@ static void page_remove_anon_compound_rmap(struct page *page)
 void page_remove_rmap(struct page *page,
 	struct vm_area_struct *vma, bool compound)
 {
+	bool pmd_mapped = false;
+
 	lock_page_memcg(page);
 
 	if (!PageAnon(page)) {
@@ -1435,15 +1453,17 @@ void page_remove_rmap(struct page *page,
 		goto out;
 	}
 
+	if (PageTransCompound(page)) {
+		struct page *head = compound_head(page);
+
+		atomic_dec(subpages_mapcount_ptr(head));
+		pmd_mapped = head_compound_mapcount(head);
+	}
+
 	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
+	if (!atomic_add_negative(-1, &page->_mapcount) || pmd_mapped)
 		goto out;
 
-	/*
-	 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
-	 * these counters are not modified in interrupt context, and
-	 * pte lock(a spinlock) is held, which implies preemption disabled.
-	 */
 	__dec_lruvec_page_state(page, NR_ANON_MAPPED);
 
 	if (PageTransCompound(page))
@@ -2569,8 +2589,8 @@ void hugepage_add_new_anon_rmap(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
+	/* increment count (starts at -1) */
 	atomic_set(compound_mapcount_ptr(page), 0);
-	atomic_set(compound_pincount_ptr(page), 0);
 	ClearHPageRestoreReserve(page);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
diff --git a/mm/util.c b/mm/util.c
index 12984e76767e..b56c92fb910f 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -717,32 +717,6 @@ void *page_rmapping(struct page *page)
 	return folio_raw_mapping(page_folio(page));
 }
 
-/**
- * folio_mapped - Is this folio mapped into userspace?
- * @folio: The folio.
- *
- * Return: True if any page in this folio is referenced by user page tables.
- */
-bool folio_mapped(struct folio *folio)
-{
-	long i, nr;
-
-	if (!folio_test_large(folio))
-		return atomic_read(&folio->_mapcount) >= 0;
-	if (atomic_read(folio_mapcount_ptr(folio)) >= 0)
-		return true;
-	if (folio_test_hugetlb(folio))
-		return false;
-
-	nr = folio_nr_pages(folio);
-	for (i = 0; i < nr; i++) {
-		if (atomic_read(&folio_page(folio, i)->_mapcount) >= 0)
-			return true;
-	}
-	return false;
-}
-EXPORT_SYMBOL(folio_mapped);
-
 struct anon_vma *folio_anon_vma(struct folio *folio)
 {
 	unsigned long mapping = (unsigned long)folio->mapping;
@@ -783,59 +757,6 @@ struct address_space *folio_mapping(struct folio *folio)
 }
 EXPORT_SYMBOL(folio_mapping);
 
-/* Slow path of page_mapcount() for compound pages */
-int __page_mapcount(struct page *page)
-{
-	int ret;
-
-	ret = atomic_read(&page->_mapcount) + 1;
-	/*
-	 * For file THP page->_mapcount contains total number of mapping
-	 * of the page: no need to look into compound_mapcount.
-	 */
-	if (!PageAnon(page) && !PageHuge(page))
-		return ret;
-	page = compound_head(page);
-	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
-	if (PageDoubleMap(page))
-		ret--;
-	return ret;
-}
-EXPORT_SYMBOL_GPL(__page_mapcount);
-
-/**
- * folio_mapcount() - Calculate the number of mappings of this folio.
- * @folio: The folio.
- *
- * A large folio tracks both how many times the entire folio is mapped,
- * and how many times each individual page in the folio is mapped.
- * This function calculates the total number of times the folio is
- * mapped.
- *
- * Return: The number of times this folio is mapped.
- */
-int folio_mapcount(struct folio *folio)
-{
-	int i, compound, nr, ret;
-
-	if (likely(!folio_test_large(folio)))
-		return atomic_read(&folio->_mapcount) + 1;
-
-	compound = folio_entire_mapcount(folio);
-	if (folio_test_hugetlb(folio))
-		return compound;
-	ret = compound;
-	nr = folio_nr_pages(folio);
-	for (i = 0; i < nr; i++)
-		ret += atomic_read(&folio_page(folio, i)->_mapcount) + 1;
-	/* File pages has compound_mapcount included in _mapcount */
-	if (!folio_test_anon(folio))
-		return ret - compound * nr;
-	if (folio_test_double_map(folio))
-		ret -= nr;
-	return ret;
-}
-
 /**
  * folio_copy - Copy the contents of one folio to another.
  * @dst: Folio to copy to.
-- 
2.35.3




* [PATCH 3/3] mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts
  2022-11-03  1:44 [PATCH 0/3] mm,huge,rmap: unify and speed up compound mapcounts Hugh Dickins
  2022-11-03  1:48 ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Hugh Dickins
  2022-11-03  1:51 ` [PATCH 2/3] mm,thp,rmap: simplify compound page mapcount handling Hugh Dickins
@ 2022-11-03  1:53 ` Hugh Dickins
  2022-11-05 20:06   ` Kirill A. Shutemov
  2022-11-10  2:18 ` [PATCH 4/3] mm,thp,rmap: handle the normal !PageCompound case first Hugh Dickins
  2022-11-18  9:08 ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Hugh Dickins
  4 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-03  1:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

Fix the races in maintaining compound_mapcount, subpages_mapcount and
subpage _mapcount by using PG_locked in the first tail of any compound
page for a bit_spin_lock() on such modifications; skipping the usual
atomic operations on those fields in this case.
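
For illustration, the pattern a caller now follows (a minimal sketch
using the helpers added in the rmap.c hunk below; the sketch_ function
itself is hypothetical), shown for adding one pte mapping of a subpage:

	static void sketch_add_pte_mapping_of_subpage(struct page *page)
	{
		struct compound_mapcounts mapcounts;
		struct page *head = compound_head(page);

		/* bit_spin_lock(PG_locked) on head[1], copy counts out */
		lock_compound_mapcounts(head, &mapcounts);
		mapcounts.subpages_mapcount++;
		/* plain, non-atomic update, serialized by the lock */
		subpage_mapcount_inc(page);
		/* copy counts back to head[1], then unlock */
		unlock_compound_mapcounts(head, &mapcounts);
	}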

Bring page_remove_file_rmap() and page_remove_anon_compound_rmap()
back into page_remove_rmap() itself.  Rearrange page_add_anon_rmap()
and page_add_file_rmap() and page_remove_rmap() to follow the same
"if (compound) {lock} else if (PageCompound) {lock} else {atomic}"
pattern (with a PageTransHuge in the compound test, like before, to
avoid BUG_ONs and optimize away that block when THP is not configured).
Move all the stats updates outside, after the bit_spin_locked section,
so that it is sure to be a leaf lock.

Add page_dup_compound_rmap() to manage compound locking versus atomics
in sync with the rest.  In particular, hugetlb pages are still using
the atomics: to avoid unnecessary interference there, and because they
never have subpage mappings; but this exception can easily be changed.
Conveniently, page_dup_compound_rmap() turns out to suit an anon THP's
__split_huge_pmd_locked() too.

bit_spin_lock() is not popular with PREEMPT_RT folks: but PREEMPT_RT
sensibly excludes TRANSPARENT_HUGEPAGE already, so its only exposure
is to the non-hugetlb non-THP pte-mapped compound pages (with large
folios being currently dependent on TRANSPARENT_HUGEPAGE).  There is
never any scan of subpages in this case; but we have chosen to use
PageCompound tests rather than PageTransCompound tests to gate the
use of lock_compound_mapcounts(), so that page_mapped() is correct on
all compound pages, whether or not TRANSPARENT_HUGEPAGE is enabled:
could that be a problem for PREEMPT_RT, when there is contention on
the lock - under heavy concurrent forking for example?  If so, then it
can be turned into a sleeping lock (like folio_lock()) when PREEMPT_RT.

A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB)
took 18 seconds on small pages, and used to take 1 second on huge pages,
but now takes 115 milliseconds on huge pages.  Mapping by pmds a second
time used to take 860ms and now takes 86ms; mapping by pmds after mapping
by ptes (when the scan is needed) used to take 870ms and now takes 495ms.
Mapping huge pages by ptes is largely unaffected but variable: between 5%
faster and 5% slower in what I've recorded.  Contention on the lock is
likely to behave worse than contention on the atomics behaved.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 Documentation/mm/transhuge.rst |  16 +-
 include/linux/rmap.h           |  14 +-
 mm/huge_memory.c               |   3 +-
 mm/rmap.c                      | 333 +++++++++++++++++++--------------
 4 files changed, 204 insertions(+), 162 deletions(-)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index a560e0c01b16..1e2a637cc607 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -117,13 +117,15 @@ pages:
   - ->_refcount in tail pages is always zero: get_page_unless_zero() never
     succeeds on tail pages.
 
-  - map/unmap of the pages with PTE entry increment/decrement ->_mapcount
-    on relevant sub-page of the compound page.
-
-  - map/unmap of the whole compound page is accounted for in compound_mapcount
-    (stored in first tail page). For file huge pages, we also increment
-    ->_mapcount of all sub-pages in order to have race-free detection of
-    last unmap of subpages.
+  - map/unmap of PMD entry for the whole compound page increment/decrement
+    ->compound_mapcount, stored in the first tail page of the compound page.
+
+  - map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount
+    on relevant sub-page of the compound page, and also increment/decrement
+    ->subpages_mapcount, stored in first tail page of the compound page.
+    In order to have race-free accounting of sub-pages mapped, changes to
+    sub-page ->_mapcount, ->subpages_mapcount and ->compound_mapcount are
+    all locked by bit_spin_lock of PG_locked in the first tail ->flags.
 
 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 1973649e8f93..011a7530dc76 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,16 +204,14 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 
-static inline void __page_dup_rmap(struct page *page, bool compound)
-{
-	if (!compound && PageCompound(page))
-		atomic_inc(subpages_mapcount_ptr(compound_head(page)));
-	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
-}
+void page_dup_compound_rmap(struct page *page, bool compound);
 
 static inline void page_dup_file_rmap(struct page *page, bool compound)
 {
-	__page_dup_rmap(page, compound);
+	if (PageCompound(page))
+		page_dup_compound_rmap(page, compound);
+	else
+		atomic_inc(&page->_mapcount);
 }
 
 /**
@@ -262,7 +260,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
 	 * the page R/O into both processes.
 	 */
 dup:
-	__page_dup_rmap(page, compound);
+	page_dup_file_rmap(page, compound);
 	return 0;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 23ff175768c3..2c4c668eee6c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2093,7 +2093,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 		VM_BUG_ON_PAGE(!page_count(page), page);
 		page_ref_add(page, HPAGE_PMD_NR - 1);
-		atomic_add(HPAGE_PMD_NR, subpages_mapcount_ptr(page));
 
 		/*
 		 * Without "freeze", we'll simply split the PMD, propagating the
@@ -2170,7 +2169,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
 		if (!pmd_migration)
-			atomic_inc(&page[i]._mapcount);
+			page_dup_compound_rmap(page + i, false);
 		pte_unmap(pte);
 	}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index f43339ea4970..512e53cae2ca 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1085,11 +1085,66 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 	return page_vma_mkclean_one(&pvmw);
 }
 
+struct compound_mapcounts {
+	unsigned int compound_mapcount;
+	unsigned int subpages_mapcount;
+};
+
+/*
+ * lock_compound_mapcounts() first locks, then copies subpages_mapcount and
+ * compound_mapcount from head[1].compound_mapcount and subpages_mapcount,
+ * converting from struct page's internal representation to logical count
+ * (that is, adding 1 to compound_mapcount to hide its offset by -1).
+ */
+static void lock_compound_mapcounts(struct page *head,
+		struct compound_mapcounts *local)
+{
+	bit_spin_lock(PG_locked, &head[1].flags);
+	local->compound_mapcount = atomic_read(compound_mapcount_ptr(head)) + 1;
+	local->subpages_mapcount = atomic_read(subpages_mapcount_ptr(head));
+}
+
+/*
+ * After caller has updated subpage._mapcount, local subpages_mapcount and
+ * local compound_mapcount, as necessary, unlock_compound_mapcounts() converts
+ * and copies them back to the compound head[1] fields, and then unlocks.
+ */
+static void unlock_compound_mapcounts(struct page *head,
+		struct compound_mapcounts *local)
+{
+	atomic_set(compound_mapcount_ptr(head), local->compound_mapcount - 1);
+	atomic_set(subpages_mapcount_ptr(head), local->subpages_mapcount);
+	bit_spin_unlock(PG_locked, &head[1].flags);
+}
+
+/*
+ * When acting on a compound page under lock_compound_mapcounts(), avoid the
+ * unnecessary overhead of an actual atomic operation on its subpage mapcount.
+ * Return true if this is the first increment or the last decrement
+ * (remembering that page->_mapcount -1 represents logical mapcount 0).
+ */
+static bool subpage_mapcount_inc(struct page *page)
+{
+	int orig_mapcount = atomic_read(&page->_mapcount);
+
+	atomic_set(&page->_mapcount, orig_mapcount + 1);
+	return orig_mapcount < 0;
+}
+
+static bool subpage_mapcount_dec(struct page *page)
+{
+	int orig_mapcount = atomic_read(&page->_mapcount);
+
+	atomic_set(&page->_mapcount, orig_mapcount - 1);
+	return orig_mapcount == 0;
+}
+
 /*
  * When mapping a THP's first pmd, or unmapping its last pmd, if that THP
  * also has pte mappings, then those must be discounted: in order to maintain
  * NR_ANON_MAPPED and NR_FILE_MAPPED statistics exactly, without any drift,
  * and to decide when an anon THP should be put on the deferred split queue.
+ * This function must be called between lock_ and unlock_compound_mapcounts().
  */
 static int nr_subpages_unmapped(struct page *head, int nr_subpages)
 {
@@ -1103,6 +1158,40 @@ static int nr_subpages_unmapped(struct page *head, int nr_subpages)
 	return nr;
 }
 
+/*
+ * page_dup_compound_rmap(), used when copying mm, or when splitting pmd,
+ * provides a simple example of using lock_ and unlock_compound_mapcounts().
+ */
+void page_dup_compound_rmap(struct page *page, bool compound)
+{
+	struct compound_mapcounts mapcounts;
+	struct page *head;
+
+	/*
+	 * Hugetlb pages could use lock_compound_mapcounts(), like THPs do;
+	 * but at present they are still being managed by atomic operations:
+	 * which are likely to be somewhat faster, so don't rush to convert
+	 * them over without evaluating the effect.
+	 *
+	 * Note that hugetlb does not call page_add_file_rmap():
+	 * here is where hugetlb shared page mapcount is raised.
+	 */
+	if (PageHuge(page)) {
+		atomic_inc(compound_mapcount_ptr(page));
+		return;
+	}
+
+	head = compound_head(page);
+	lock_compound_mapcounts(head, &mapcounts);
+	if (compound) {
+		mapcounts.compound_mapcount++;
+	} else {
+		mapcounts.subpages_mapcount++;
+		subpage_mapcount_inc(page);
+	}
+	unlock_compound_mapcounts(head, &mapcounts);
+}
+
 /**
  * page_move_anon_rmap - move a page to our anon_vma
  * @page:	the page to move to our anon_vma
@@ -1212,7 +1301,8 @@ static void __page_check_anon_rmap(struct page *page,
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, rmap_t flags)
 {
-	int nr, nr_pages;
+	struct compound_mapcounts mapcounts;
+	int nr = 0, nr_pmdmapped = 0;
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
 
@@ -1222,33 +1312,37 @@ void page_add_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	if (compound && PageTransHuge(page)) {
-		atomic_t *mapcount;
-		VM_BUG_ON_PAGE(!PageLocked(page), page);
-		mapcount = compound_mapcount_ptr(page);
-		first = atomic_inc_and_test(mapcount);
+		lock_compound_mapcounts(page, &mapcounts);
+		first = !mapcounts.compound_mapcount;
+		mapcounts.compound_mapcount++;
+		if (first) {
+			nr = nr_pmdmapped = thp_nr_pages(page);
+			if (mapcounts.subpages_mapcount)
+				nr = nr_subpages_unmapped(page, nr_pmdmapped);
+		}
+		unlock_compound_mapcounts(page, &mapcounts);
 
-		nr = nr_pages = thp_nr_pages(page);
-		if (first && head_subpages_mapcount(page))
-			nr = nr_subpages_unmapped(page, nr_pages);
-	} else {
-		nr = 1;
-		if (PageTransCompound(page)) {
-			struct page *head = compound_head(page);
+	} else if (PageCompound(page)) {
+		struct page *head = compound_head(page);
 
-			atomic_inc(subpages_mapcount_ptr(head));
-			nr = !head_compound_mapcount(head);
-		}
+		lock_compound_mapcounts(head, &mapcounts);
+		mapcounts.subpages_mapcount++;
+		first = subpage_mapcount_inc(page);
+		nr = first && !mapcounts.compound_mapcount;
+		unlock_compound_mapcounts(head, &mapcounts);
+
+	} else {
 		first = atomic_inc_and_test(&page->_mapcount);
+		nr = first;
 	}
 
 	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
 	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
 
-	if (first) {
-		if (compound)
-			__mod_lruvec_page_state(page, NR_ANON_THPS, nr_pages);
+	if (nr_pmdmapped)
+		__mod_lruvec_page_state(page, NR_ANON_THPS, nr_pmdmapped);
+	if (nr)
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
-	}
 
 	if (unlikely(PageKsm(page)))
 		unlock_page_memcg(page);
@@ -1308,39 +1402,41 @@ void page_add_new_anon_rmap(struct page *page,
 void page_add_file_rmap(struct page *page,
 	struct vm_area_struct *vma, bool compound)
 {
-	int nr = 0;
+	struct compound_mapcounts mapcounts;
+	int nr = 0, nr_pmdmapped = 0;
+	bool first;
 
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
+
 	if (compound && PageTransHuge(page)) {
-		int nr_pages;
+		lock_compound_mapcounts(page, &mapcounts);
+		first = !mapcounts.compound_mapcount;
+		mapcounts.compound_mapcount++;
+		if (first) {
+			nr = nr_pmdmapped = thp_nr_pages(page);
+			if (mapcounts.subpages_mapcount)
+				nr = nr_subpages_unmapped(page, nr_pmdmapped);
+		}
+		unlock_compound_mapcounts(page, &mapcounts);
 
-		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
-			goto out;
+	} else if (PageCompound(page)) {
+		struct page *head = compound_head(page);
 
-		nr = nr_pages = thp_nr_pages(page);
-		if (head_subpages_mapcount(page))
-			nr = nr_subpages_unmapped(page, nr_pages);
+		lock_compound_mapcounts(head, &mapcounts);
+		mapcounts.subpages_mapcount++;
+		first = subpage_mapcount_inc(page);
+		nr = first && !mapcounts.compound_mapcount;
+		unlock_compound_mapcounts(head, &mapcounts);
 
-		if (PageSwapBacked(page))
-			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
-						nr_pages);
-		else
-			__mod_lruvec_page_state(page, NR_FILE_PMDMAPPED,
-						nr_pages);
 	} else {
-		bool pmd_mapped = false;
-
-		if (PageTransCompound(page)) {
-			struct page *head = compound_head(page);
-
-			atomic_inc(subpages_mapcount_ptr(head));
-			pmd_mapped = head_compound_mapcount(head);
-		}
-		if (atomic_inc_and_test(&page->_mapcount) && !pmd_mapped)
-			nr++;
+		first = atomic_inc_and_test(&page->_mapcount);
+		nr = first;
 	}
-out:
+
+	if (nr_pmdmapped)
+		__mod_lruvec_page_state(page, PageSwapBacked(page) ?
+			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
 	if (nr)
 		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
 	unlock_page_memcg(page);
@@ -1348,137 +1444,84 @@ void page_add_file_rmap(struct page *page,
 	mlock_vma_page(page, vma, compound);
 }
 
-static void page_remove_file_rmap(struct page *page, bool compound)
+/**
+ * page_remove_rmap - take down pte mapping from a page
+ * @page:	page to remove mapping from
+ * @vma:	the vm area from which the mapping is removed
+ * @compound:	uncharge the page as compound or small page
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_remove_rmap(struct page *page,
+	struct vm_area_struct *vma, bool compound)
 {
-	int nr = 0;
+	struct compound_mapcounts mapcounts;
+	int nr = 0, nr_pmdmapped = 0;
+	bool last;
 
 	VM_BUG_ON_PAGE(compound && !PageHead(page), page);
 
-	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
+	/* Hugetlb pages are not counted in NR_*MAPPED */
 	if (unlikely(PageHuge(page))) {
 		/* hugetlb pages are always mapped with pmds */
 		atomic_dec(compound_mapcount_ptr(page));
 		return;
 	}
 
+	lock_page_memcg(page);
+
 	/* page still mapped by someone else? */
 	if (compound && PageTransHuge(page)) {
-		int nr_pages;
+		lock_compound_mapcounts(page, &mapcounts);
+		mapcounts.compound_mapcount--;
+		last = !mapcounts.compound_mapcount;
+		if (last) {
+			nr = nr_pmdmapped = thp_nr_pages(page);
+			if (mapcounts.subpages_mapcount)
+				nr = nr_subpages_unmapped(page, nr_pmdmapped);
+		}
+		unlock_compound_mapcounts(page, &mapcounts);
 
-		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
-			return;
+	} else if (PageCompound(page)) {
+		struct page *head = compound_head(page);
 
-		nr = nr_pages = thp_nr_pages(page);
-		if (head_subpages_mapcount(page))
-			nr = nr_subpages_unmapped(page, nr_pages);
+		lock_compound_mapcounts(head, &mapcounts);
+		mapcounts.subpages_mapcount--;
+		last = subpage_mapcount_dec(page);
+		nr = last && !mapcounts.compound_mapcount;
+		unlock_compound_mapcounts(head, &mapcounts);
 
-		if (PageSwapBacked(page))
-			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
-						-nr_pages);
-		else
-			__mod_lruvec_page_state(page, NR_FILE_PMDMAPPED,
-						-nr_pages);
 	} else {
-		bool pmd_mapped = false;
-
-		if (PageTransCompound(page)) {
-			struct page *head = compound_head(page);
-
-			atomic_dec(subpages_mapcount_ptr(head));
-			pmd_mapped = head_compound_mapcount(head);
-		}
-		if (atomic_add_negative(-1, &page->_mapcount) && !pmd_mapped)
-			nr++;
+		last = atomic_add_negative(-1, &page->_mapcount);
+		nr = last;
 	}
 
-	if (nr)
-		__mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr);
-}
-
-static void page_remove_anon_compound_rmap(struct page *page)
-{
-	int nr, nr_pages;
-
-	if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
-		return;
-
-	/* Hugepages are not counted in NR_ANON_PAGES for now. */
-	if (unlikely(PageHuge(page)))
-		return;
-
-	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
-		return;
-
-	nr = nr_pages = thp_nr_pages(page);
-	__mod_lruvec_page_state(page, NR_ANON_THPS, -nr);
-
-	if (head_subpages_mapcount(page)) {
-		nr = nr_subpages_unmapped(page, nr_pages);
-
+	if (nr_pmdmapped) {
+		__mod_lruvec_page_state(page, PageAnon(page) ? NR_ANON_THPS :
+				(PageSwapBacked(page) ? NR_SHMEM_PMDMAPPED :
+				NR_FILE_PMDMAPPED), -nr_pmdmapped);
+	}
+	if (nr) {
+		__mod_lruvec_page_state(page, PageAnon(page) ? NR_ANON_MAPPED :
+				NR_FILE_MAPPED, -nr);
 		/*
-		 * Queue the page for deferred split if at least one small
+		 * Queue anon THP for deferred split if at least one small
 		 * page of the compound page is unmapped, but at least one
 		 * small page is still mapped.
 		 */
-		if (nr && nr < nr_pages)
-			deferred_split_huge_page(page);
-	}
-
-	if (nr)
-		__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
-}
-
-/**
- * page_remove_rmap - take down pte mapping from a page
- * @page:	page to remove mapping from
- * @vma:	the vm area from which the mapping is removed
- * @compound:	uncharge the page as compound or small page
- *
- * The caller needs to hold the pte lock.
- */
-void page_remove_rmap(struct page *page,
-	struct vm_area_struct *vma, bool compound)
-{
-	bool pmd_mapped = false;
-
-	lock_page_memcg(page);
-
-	if (!PageAnon(page)) {
-		page_remove_file_rmap(page, compound);
-		goto out;
+		if (PageTransCompound(page) && PageAnon(page))
+			if (!compound || nr < nr_pmdmapped)
+				deferred_split_huge_page(compound_head(page));
 	}
 
-	if (compound) {
-		page_remove_anon_compound_rmap(page);
-		goto out;
-	}
-
-	if (PageTransCompound(page)) {
-		struct page *head = compound_head(page);
-
-		atomic_dec(subpages_mapcount_ptr(head));
-		pmd_mapped = head_compound_mapcount(head);
-	}
-
-	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount) || pmd_mapped)
-		goto out;
-
-	__dec_lruvec_page_state(page, NR_ANON_MAPPED);
-
-	if (PageTransCompound(page))
-		deferred_split_huge_page(compound_head(page));
-
 	/*
-	 * It would be tidy to reset the PageAnon mapping here,
+	 * It would be tidy to reset PageAnon mapping when fully unmapped,
 	 * but that might overwrite a racing page_add_anon_rmap
 	 * which increments mapcount after us but sets mapping
-	 * before us: so leave the reset to free_unref_page,
+	 * before us: so leave the reset to free_pages_prepare,
 	 * and remember that it's only reliable while mapped.
-	 * Leaving it set also helps swapoff to reinstate ptes
-	 * faster for those pages still in swapcache.
 	 */
-out:
+
 	unlock_page_memcg(page);
 
 	munlock_vma_page(page, vma, compound);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,hugetlb: use folio fields in second tail page
  2022-11-03  1:48 ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Hugh Dickins
@ 2022-11-03 21:18   ` Sidhartha Kumar
  2022-11-04  4:29     ` Hugh Dickins
  2022-11-05 19:13   ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Kirill A. Shutemov
  1 sibling, 1 reply; 54+ messages in thread
From: Sidhartha Kumar @ 2022-11-03 21:18 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm


On 11/2/22 6:48 PM, Hugh Dickins wrote:
> We want to declare one more int in the first tail of a compound page:
> that first tail page being valuable property, since every compound page
> has a first tail, but perhaps no more than that.
>
> No problem on 64-bit: there is already space for it.  No problem with
> 32-bit THPs: 5.18 commit 5232c63f46fd ("mm: Make compound_pincount always
> available") kindly cleared the space for it, apparently not realizing
> that only 64-bit architectures enable CONFIG_THP_SWAP (whose use of tail
> page->private might conflict) - but make sure of that in its Kconfig.
>
> But hugetlb pages use tail page->private of the first tail page for a
> subpool pointer, which will conflict; and they also use page->private
> of the 2nd, 3rd and 4th tails.
>
> Undo "mm: add private field of first tail to struct page and struct
> folio"'s recent addition of private_1 to the folio tail: instead add
> hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison
> to a second tail page of the folio: THP has long been using several
> fields of that tail, so make better use of it for hugetlb too.
> This is not how a generic folio should be declared in future,
> but it is an effective transitional way to make use of it.
>
> Delete the SUBPAGE_INDEX stuff, but keep __NR_USED_SUBPAGE: now 3.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   include/linux/hugetlb.h        | 23 +++--------
>   include/linux/hugetlb_cgroup.h | 31 +++++----------
>   include/linux/mm_types.h       | 72 ++++++++++++++++++++++------------
>   mm/Kconfig                     |  2 +-
>   mm/memory-failure.c            |  5 +--
>   5 files changed, 65 insertions(+), 68 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 65ea34022aa2..03ecf1c5e46f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -33,22 +33,9 @@ typedef struct { unsigned long pd; } hugepd_t;
>   /*
>    * For HugeTLB page, there are more metadata to save in the struct page. But
>    * the head struct page cannot meet our needs, so we have to abuse other tail
> - * struct page to store the metadata. In order to avoid conflicts caused by
> - * subsequent use of more tail struct pages, we gather these discrete indexes
> - * of tail struct page here.
> + * struct page to store the metadata.
>    */
> -enum {
> -	SUBPAGE_INDEX_SUBPOOL = 1,	/* reuse page->private */
> -#ifdef CONFIG_CGROUP_HUGETLB
> -	SUBPAGE_INDEX_CGROUP,		/* reuse page->private */
> -	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
> -	__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
> -#endif
> -#ifdef CONFIG_MEMORY_FAILURE
> -	SUBPAGE_INDEX_HWPOISON,
> -#endif
> -	__NR_USED_SUBPAGE,
> -};
> +#define __NR_USED_SUBPAGE 3
>   
>   struct hugepage_subpool {
>   	spinlock_t lock;
> @@ -722,11 +709,11 @@ extern unsigned int default_hstate_idx;
>   
>   static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio)
>   {
> -	return (void *)folio_get_private_1(folio);
> +	return folio->_hugetlb_subpool;
>   }
>   
>   /*
> - * hugetlb page subpool pointer located in hpage[1].private
> + * hugetlb page subpool pointer located in hpage[2].hugetlb_subpool
>    */
>   static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
>   {
> @@ -736,7 +723,7 @@ static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
>   static inline void hugetlb_set_folio_subpool(struct folio *folio,
>   					struct hugepage_subpool *subpool)
>   {
> -	folio_set_private_1(folio, (unsigned long)subpool);
> +	folio->_hugetlb_subpool = subpool;
>   }
>   
>   static inline void hugetlb_set_page_subpool(struct page *hpage,
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index c70f92fe493e..f706626a8063 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -24,12 +24,10 @@ struct file_region;
>   #ifdef CONFIG_CGROUP_HUGETLB
>   /*
>    * Minimum page order trackable by hugetlb cgroup.
> - * At least 4 pages are necessary for all the tracking information.
> - * The second tail page (hpage[SUBPAGE_INDEX_CGROUP]) is the fault
> - * usage cgroup. The third tail page (hpage[SUBPAGE_INDEX_CGROUP_RSVD])
> - * is the reservation usage cgroup.
> + * At least 3 pages are necessary for all the tracking information.
> + * The second tail page contains all of the hugetlb-specific fields.
>    */
> -#define HUGETLB_CGROUP_MIN_ORDER order_base_2(__MAX_CGROUP_SUBPAGE_INDEX + 1)
> +#define HUGETLB_CGROUP_MIN_ORDER order_base_2(__NR_USED_SUBPAGE)
>   
>   enum hugetlb_memory_event {
>   	HUGETLB_MAX,
> @@ -69,21 +67,13 @@ struct hugetlb_cgroup {
>   static inline struct hugetlb_cgroup *
>   __hugetlb_cgroup_from_folio(struct folio *folio, bool rsvd)
>   {
> -	struct page *tail;
> -
>   	VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio);
>   	if (folio_order(folio) < HUGETLB_CGROUP_MIN_ORDER)
>   		return NULL;
> -
> -	if (rsvd) {
> -		tail = folio_page(folio, SUBPAGE_INDEX_CGROUP_RSVD);
> -		return (void *)page_private(tail);
> -	}
> -
> -	else {
> -		tail = folio_page(folio, SUBPAGE_INDEX_CGROUP);
> -		return (void *)page_private(tail);
> -	}
> +	if (rsvd)
> +		return folio->_hugetlb_cgroup_rsvd;
> +	else
> +		return folio->_hugetlb_cgroup;
>   }
>   
>   static inline struct hugetlb_cgroup *hugetlb_cgroup_from_folio(struct folio *folio)
> @@ -101,15 +91,12 @@ static inline void __set_hugetlb_cgroup(struct folio *folio,
>   				       struct hugetlb_cgroup *h_cg, bool rsvd)
>   {
>   	VM_BUG_ON_FOLIO(!folio_test_hugetlb(folio), folio);
> -
>   	if (folio_order(folio) < HUGETLB_CGROUP_MIN_ORDER)
>   		return;
>   	if (rsvd)
> -		set_page_private(folio_page(folio, SUBPAGE_INDEX_CGROUP_RSVD),
> -				 (unsigned long)h_cg);
> +		folio->_hugetlb_cgroup_rsvd = h_cg;
>   	else
> -		set_page_private(folio_page(folio, SUBPAGE_INDEX_CGROUP),
> -				 (unsigned long)h_cg);
> +		folio->_hugetlb_cgroup = h_cg;
>   }
>   
>   static inline void set_hugetlb_cgroup(struct folio *folio,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 834022721bc6..728eb6089bba 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -145,15 +145,22 @@ struct page {
>   			atomic_t compound_pincount;
>   #ifdef CONFIG_64BIT
>   			unsigned int compound_nr; /* 1 << compound_order */
> -			unsigned long _private_1;
>   #endif
>   		};
> -		struct {	/* Second tail page of compound page */
> +		struct {	/* Second tail page of transparent huge page */
>   			unsigned long _compound_pad_1;	/* compound_head */
>   			unsigned long _compound_pad_2;
>   			/* For both global and memcg */
>   			struct list_head deferred_list;
>   		};
> +		struct {	/* Second tail page of hugetlb page */
> +			unsigned long _hugetlb_pad_1;	/* compound_head */
> +			void *hugetlb_subpool;
> +			void *hugetlb_cgroup;
> +			void *hugetlb_cgroup_rsvd;
> +			void *hugetlb_hwpoison;
> +			/* No more space on 32-bit: use third tail if more */
> +		};
>   		struct {	/* Page table pages */
>   			unsigned long _pt_pad_1;	/* compound_head */
>   			pgtable_t pmd_huge_pte; /* protected by page->ptl */
> @@ -260,13 +267,16 @@ struct page {
>    *    to find how many references there are to this folio.
>    * @memcg_data: Memory Control Group data.
>    * @_flags_1: For large folios, additional page flags.
> - * @__head: Points to the folio.  Do not use.
> + * @_head_1: Points to the folio.  Do not use.

Changes to my original patch set look good, this seems to be a cleaner 
implementation.

Should the usage of page_1 and page_2 also be documented here?

Thanks,

Sidhartha Kumar

>    * @_folio_dtor: Which destructor to use for this folio.
>    * @_folio_order: Do not use directly, call folio_order().
>    * @_total_mapcount: Do not use directly, call folio_entire_mapcount().
>    * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
>    * @_folio_nr_pages: Do not use directly, call folio_nr_pages().
> - * @_private_1: Do not use directly, call folio_get_private_1().
> + * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
> + * @_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
> + * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
> + * @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head().
>    *
>    * A folio is a physically, virtually and logically contiguous set
>    * of bytes.  It is a power-of-two in size, and it is aligned to that
> @@ -305,16 +315,31 @@ struct folio {
>   		};
>   		struct page page;
>   	};
> -	unsigned long _flags_1;
> -	unsigned long __head;
> -	unsigned char _folio_dtor;
> -	unsigned char _folio_order;
> -	atomic_t _total_mapcount;
> -	atomic_t _pincount;
> +	union {
> +		struct {
> +			unsigned long _flags_1;
> +			unsigned long _head_1;
> +			unsigned char _folio_dtor;
> +			unsigned char _folio_order;
> +			atomic_t _total_mapcount;
> +			atomic_t _pincount;
>   #ifdef CONFIG_64BIT
> -	unsigned int _folio_nr_pages;
> +			unsigned int _folio_nr_pages;
>   #endif
> -	unsigned long _private_1;
> +		};
> +		struct page page_1;
> +	};
> +	union {
> +		struct {
> +			unsigned long _flags_2;
> +			unsigned long _head_2;
> +			void *_hugetlb_subpool;
> +			void *_hugetlb_cgroup;
> +			void *_hugetlb_cgroup_rsvd;
> +			void *_hugetlb_hwpoison;
> +		};
> +		struct page page_2;
> +	};
>   };
>   
>   #define FOLIO_MATCH(pg, fl)						\
> @@ -335,16 +360,25 @@ FOLIO_MATCH(memcg_data, memcg_data);
>   	static_assert(offsetof(struct folio, fl) ==			\
>   			offsetof(struct page, pg) + sizeof(struct page))
>   FOLIO_MATCH(flags, _flags_1);
> -FOLIO_MATCH(compound_head, __head);
> +FOLIO_MATCH(compound_head, _head_1);
>   FOLIO_MATCH(compound_dtor, _folio_dtor);
>   FOLIO_MATCH(compound_order, _folio_order);
>   FOLIO_MATCH(compound_mapcount, _total_mapcount);
>   FOLIO_MATCH(compound_pincount, _pincount);
>   #ifdef CONFIG_64BIT
>   FOLIO_MATCH(compound_nr, _folio_nr_pages);
> -FOLIO_MATCH(_private_1, _private_1);
>   #endif
>   #undef FOLIO_MATCH
> +#define FOLIO_MATCH(pg, fl)						\
> +	static_assert(offsetof(struct folio, fl) ==			\
> +			offsetof(struct page, pg) + 2 * sizeof(struct page))
> +FOLIO_MATCH(flags, _flags_2);
> +FOLIO_MATCH(compound_head, _head_2);
> +FOLIO_MATCH(hugetlb_subpool, _hugetlb_subpool);
> +FOLIO_MATCH(hugetlb_cgroup, _hugetlb_cgroup);
> +FOLIO_MATCH(hugetlb_cgroup_rsvd, _hugetlb_cgroup_rsvd);
> +FOLIO_MATCH(hugetlb_hwpoison, _hugetlb_hwpoison);
> +#undef FOLIO_MATCH
>   
>   static inline atomic_t *folio_mapcount_ptr(struct folio *folio)
>   {
> @@ -388,16 +422,6 @@ static inline void *folio_get_private(struct folio *folio)
>   	return folio->private;
>   }
>   
> -static inline void folio_set_private_1(struct folio *folio, unsigned long private)
> -{
> -	folio->_private_1 = private;
> -}
> -
> -static inline unsigned long folio_get_private_1(struct folio *folio)
> -{
> -	return folio->_private_1;
> -}
> -
>   struct page_frag_cache {
>   	void * va;
>   #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 57e1d8c5b505..bc7e7dacfcd5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -775,7 +775,7 @@ endchoice
>   
>   config THP_SWAP
>   	def_bool y
> -	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP
> +	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP && 64BIT
>   	help
>   	  Swap transparent huge pages in one piece, without splitting.
>   	  XXX: For now, swap cluster backing transparent huge page
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 779a426d2cab..63d8501001c6 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1687,8 +1687,7 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
>   #ifdef CONFIG_HUGETLB_PAGE
>   /*
>    * Struct raw_hwp_page represents information about "raw error page",
> - * constructing singly linked list originated from ->private field of
> - * SUBPAGE_INDEX_HWPOISON-th tail page.
> + * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
>    */
>   struct raw_hwp_page {
>   	struct llist_node node;
> @@ -1697,7 +1696,7 @@ struct raw_hwp_page {
>   
>   static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
>   {
> -	return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
> +	return (struct llist_head *)&page_folio(hpage)->_hugetlb_hwpoison;
>   }
>   
>   static unsigned long __free_raw_hwp_pages(struct page *hpage, bool move_flag)


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,hugetlb: use folio fields in second tail page
  2022-11-03 21:18   ` Sidhartha Kumar
@ 2022-11-04  4:29     ` Hugh Dickins
  2022-11-10  0:11       ` Sidhartha Kumar
  0 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-04  4:29 UTC (permalink / raw)
  To: Sidhartha Kumar
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Muchun Song, Miaohe Lin,
	Naoya Horiguchi, Mina Almasry, James Houghton, Zach O'Keefe,
	linux-kernel, linux-mm

On Thu, 3 Nov 2022, Sidhartha Kumar wrote:
> On 11/2/22 6:48 PM, Hugh Dickins wrote:
...
> > Undo "mm: add private field of first tail to struct page and struct
> > folio"'s recent addition of private_1 to the folio tail: instead add
> > hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison
> > to a second tail page of the folio: THP has long been using several
> > fields of that tail, so make better use of it for hugetlb too.
> > This is not how a generic folio should be declared in future,
> > but it is an effective transitional way to make use of it.
...
> > @@ -260,13 +267,16 @@ struct page {
> >    *    to find how many references there are to this folio.
> >    * @memcg_data: Memory Control Group data.
> >    * @_flags_1: For large folios, additional page flags.
> > - * @__head: Points to the folio.  Do not use.
> > + * @_head_1: Points to the folio.  Do not use.
> 
> Changes to my original patch set look good, this seems to be a cleaner
> implementation.

Thanks a lot, Sidhartha, I'm glad to hear that it works for you too.

I expect that it will be done differently in the future: maybe generalizing
the additional fields to further "private"s as you did, letting different
subsystems accessorize them differently; or removing them completely from
struct folio, letting subsystems declare their own struct folio containers.
I don't know how that will end up, but this for now seems good and clear.
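
The second idea might end up with something of this shape (purely an
illustration, nothing like it exists in the series, the names are
invented here, and it glosses over how the extra fields would have to
line up with the tail pages):

	struct hugetlb_folio {
		struct folio folio;	/* must come first, for the casts */
		struct hugepage_subpool *subpool;
		struct hugetlb_cgroup *cgroup;
		struct hugetlb_cgroup *cgroup_rsvd;
		struct llist_head hwp_list;
	};

	static inline struct hugetlb_folio *folio_hugetlb(struct folio *folio)
	{
		return (struct hugetlb_folio *)folio;
	}

with hugetlb doing the cast at its entry points, instead of struct folio
itself carrying hugetlb-specific fields.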

> 
> Should the usage of page_1 and page_2 also be documented here?

You must have something interesting in mind to document about them,
but I cannot guess what! They are for field alignment, not for use.
(page_2 to help when/if someone needs to add another pageful.)

Do you mean that I should copy the 
	/* private: the union with struct page is transitional */
comment from above the original "struct page page;" line I copied?
Or give all three of them a few underscores to imply not for use?

Thanks,
Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,hugetlb: use folio fields in second tail page
  2022-11-03  1:48 ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Hugh Dickins
  2022-11-03 21:18   ` Sidhartha Kumar
@ 2022-11-05 19:13   ` Kirill A. Shutemov
  2022-11-10  1:58     ` Hugh Dickins
  1 sibling, 1 reply; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-11-05 19:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Wed, Nov 02, 2022 at 06:48:45PM -0700, Hugh Dickins wrote:
> @@ -260,13 +267,16 @@ struct page {
>   *    to find how many references there are to this folio.
>   * @memcg_data: Memory Control Group data.
>   * @_flags_1: For large folios, additional page flags.
> - * @__head: Points to the folio.  Do not use.
> + * @_head_1: Points to the folio.  Do not use.
>   * @_folio_dtor: Which destructor to use for this folio.
>   * @_folio_order: Do not use directly, call folio_order().
>   * @_total_mapcount: Do not use directly, call folio_entire_mapcount().
>   * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
>   * @_folio_nr_pages: Do not use directly, call folio_nr_pages().
> - * @_private_1: Do not use directly, call folio_get_private_1().

Looks like it misses

  + * @_flags_2: For large folios, additional page flags.
  + * @_head_2: Points to the folio.  Do not use.

to match the first tail page documentation.

Otherwise the patch looks good to me:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>


-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] mm,thp,rmap: simplify compound page mapcount handling
  2022-11-03  1:51 ` [PATCH 2/3] mm,thp,rmap: simplify compound page mapcount handling Hugh Dickins
@ 2022-11-05 19:51   ` Kirill A. Shutemov
  2022-11-10  2:49     ` Hugh Dickins
  0 siblings, 1 reply; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-11-05 19:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Wed, Nov 02, 2022 at 06:51:38PM -0700, Hugh Dickins wrote:
> Compound page (folio) mapcount calculations have been different for
> anon and file (or shmem) THPs, and involved the obscure PageDoubleMap
> flag.  And each huge mapping and unmapping of a file (or shmem) THP
> involved atomically incrementing and decrementing the mapcount of every
> subpage of that huge page, dirtying many struct page cachelines.
> 
> Add subpages_mapcount field to the struct folio and first tail page,
> so that the total of subpage mapcounts is available in one place near
> the head: then page_mapcount() and total_mapcount() and page_mapped(),
> and their folio equivalents, are so quick that anon and file and hugetlb
> don't need to be optimized differently. Delete the unloved PageDoubleMap.
> 
> page_add and page_remove rmap functions must now maintain the
> subpages_mapcount as well as the subpage _mapcount, when dealing with
> pte mappings of huge pages; and correct maintenance of NR_ANON_MAPPED
> and NR_FILE_MAPPED statistics still needs reading through the subpages,
> using nr_subpages_unmapped() - but only when first or last pmd mapping
> finds subpages_mapcount raised (double-map case, not the common case).
> 
> But are those counts (used to decide when to split an anon THP, and
> in vmscan's pagecache_reclaimable heuristic) correctly maintained?
> Not quite: since page_remove_rmap() (and also split_huge_pmd()) is
> often called without page lock, there can be races when a subpage pte
> mapcount 0<->1 while compound pmd mapcount 0<->1 is scanning - races
> which the previous implementation had prevented. The statistics might
> become inaccurate, and even drift down until they underflow through 0.
> That is not good enough, but is better dealt with in a followup patch.
> 
> Update a few comments on first and second tail page overlaid fields.
> hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
> subpages_mapcount and compound_pincount are already correctly at 0,
> so delete its reinitialization of compound_pincount.
> 
> A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB)
> took 18 seconds on small pages, and used to take 1 second on huge pages,
> but now takes 119 milliseconds on huge pages.  Mapping by pmds a second
> time used to take 860ms and now takes 92ms; mapping by pmds after mapping
> by ptes (when the scan is needed) used to take 870ms and now takes 495ms.
> But there might be some benchmarks which would show a slowdown, because
> tail struct pages now fall out of cache until final freeing checks them.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Thanks for doing this!

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

And sorry again for PageDoubleMap() :/

Minor nitpick and a question below.

> @@ -829,12 +829,20 @@ static inline int folio_entire_mapcount(struct folio *folio)
>  
>  /*
>   * Mapcount of compound page as a whole, does not include mapped sub-pages.
> - *
> - * Must be called only for compound pages.
> + * Must be called only on head of compound page.
>   */
> -static inline int compound_mapcount(struct page *page)
> +static inline int head_compound_mapcount(struct page *head)
>  {
> -	return folio_entire_mapcount(page_folio(page));
> +	return atomic_read(compound_mapcount_ptr(head)) + 1;
> +}
> +
> +/*
> + * Sum of mapcounts of sub-pages, does not include compound mapcount.
> + * Must be called only on head of compound page.
> + */
> +static inline int head_subpages_mapcount(struct page *head)
> +{
> +	return atomic_read(subpages_mapcount_ptr(head));
>  }
>  
>  /*

Any particular reason these two do not take struct folio as an input?
It would guarantee that it is non-tail page. It will not guarantee
large-folio, but it is something.

> @@ -1265,8 +1288,6 @@ void page_add_new_anon_rmap(struct page *page,
>  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>  		/* increment count (starts at -1) */
>  		atomic_set(compound_mapcount_ptr(page), 0);
> -		atomic_set(compound_pincount_ptr(page), 0);
> -

It has to be initialized to 0 on allocation, right?

>  		__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
>  	} else {
>  		/* increment count (starts at -1) */

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts
  2022-11-03  1:53 ` [PATCH 3/3] mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts Hugh Dickins
@ 2022-11-05 20:06   ` Kirill A. Shutemov
  2022-11-10  3:31     ` Hugh Dickins
  0 siblings, 1 reply; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-11-05 20:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Wed, Nov 02, 2022 at 06:53:45PM -0700, Hugh Dickins wrote:
> Fix the races in maintaining compound_mapcount, subpages_mapcount and
> subpage _mapcount by using PG_locked in the first tail of any compound
> page for a bit_spin_lock() on such modifications; skipping the usual
> atomic operations on those fields in this case.
> 
> Bring page_remove_file_rmap() and page_remove_anon_compound_rmap()
> back into page_remove_rmap() itself.  Rearrange page_add_anon_rmap()
> and page_add_file_rmap() and page_remove_rmap() to follow the same
> "if (compound) {lock} else if (PageCompound) {lock} else {atomic}"
> pattern (with a PageTransHuge in the compound test, like before, to
> avoid BUG_ONs and optimize away that block when THP is not configured).
> Move all the stats updates outside, after the bit_spin_locked section,
> so that it is sure to be a leaf lock.
> 
> Add page_dup_compound_rmap() to manage compound locking versus atomics
> in sync with the rest.  In particular, hugetlb pages are still using
> the atomics: to avoid unnecessary interference there, and because they
> never have subpage mappings; but this exception can easily be changed.
> Conveniently, page_dup_compound_rmap() turns out to suit an anon THP's
> __split_huge_pmd_locked() too.
> 
> bit_spin_lock() is not popular with PREEMPT_RT folks: but PREEMPT_RT
> sensibly excludes TRANSPARENT_HUGEPAGE already, so its only exposure
> is to the non-hugetlb non-THP pte-mapped compound pages (with large
> folios being currently dependent on TRANSPARENT_HUGEPAGE).  There is
> never any scan of subpages in this case; but we have chosen to use
> PageCompound tests rather than PageTransCompound tests to gate the
> use of lock_compound_mapcounts(), so that page_mapped() is correct on
> all compound pages, whether or not TRANSPARENT_HUGEPAGE is enabled:
> could that be a problem for PREEMPT_RT, when there is contention on
> the lock - under heavy concurrent forking for example?  If so, then it
> can be turned into a sleeping lock (like folio_lock()) when PREEMPT_RT.
> 
> A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB)
> took 18 seconds on small pages, and used to take 1 second on huge pages,
> but now takes 115 milliseconds on huge pages.  Mapping by pmds a second
> time used to take 860ms and now takes 86ms; mapping by pmds after mapping
> by ptes (when the scan is needed) used to take 870ms and now takes 495ms.
> Mapping huge pages by ptes is largely unaffected but variable: between 5%
> faster and 5% slower in what I've recorded.  Contention on the lock is
> likely to behave worse than contention on the atomics behaved.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,hugetlb: use folio fields in second tail page
  2022-11-04  4:29     ` Hugh Dickins
@ 2022-11-10  0:11       ` Sidhartha Kumar
  2022-11-10  2:10         ` Hugh Dickins
  0 siblings, 1 reply; 54+ messages in thread
From: Sidhartha Kumar @ 2022-11-10  0:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Muchun Song, Miaohe Lin,
	Naoya Horiguchi, Mina Almasry, James Houghton, Zach O'Keefe,
	linux-kernel, linux-mm


On 11/3/22 9:29 PM, Hugh Dickins wrote:
> On Thu, 3 Nov 2022, Sidhartha Kumar wrote:
>> On 11/2/22 6:48 PM, Hugh Dickins wrote:
> ...
>>> Undo "mm: add private field of first tail to struct page and struct
>>> folio"'s recent addition of private_1 to the folio tail: instead add
>>> hugetlb_subpool, hugetlb_cgroup, hugetlb_cgroup_rsvd, hugetlb_hwpoison
>>> to a second tail page of the folio: THP has long been using several
>>> fields of that tail, so make better use of it for hugetlb too.
>>> This is not how a generic folio should be declared in future,
>>> but it is an effective transitional way to make use of it.
> ...
>>> @@ -260,13 +267,16 @@ struct page {
>>>     *    to find how many references there are to this folio.
>>>     * @memcg_data: Memory Control Group data.
>>>     * @_flags_1: For large folios, additional page flags.
>>> - * @__head: Points to the folio.  Do not use.
>>> + * @_head_1: Points to the folio.  Do not use.
>> Changes to my original patch set look good, this seems to be a cleaner
>> implementation.
> Thanks a lot, Sidhartha, I'm glad to hear that it works for you too.
>
> I expect that it will be done differently in the future: maybe generalizing
> the additional fields to further "private"s as you did, letting different
> subsystems accessorize them differently; or removing them completely from
> struct folio, letting subsystems declare their own struct folio containers.
> I don't know how that will end up, but this for now seems good and clear.
>
>> Should the usage of page_1 and page_2 also be documented here?
> You must have something interesting in mind to document about them,
> but I cannot guess what! They are for field alignment, not for use.
> (page_2 to help when/if someone needs to add another pageful.)
>
> Do you mean that I should copy the
> 	/* private: the union with struct page is transitional */
> comment from above the original "struct page page;" line I copied?
> Or give all three of them a few underscores to imply not for use?

I think the underscores with a comment about not for use could be helpful.

Thanks,

Sidhartha Kumar

> Thanks,
> Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,hugetlb: use folio fields in second tail page
  2022-11-05 19:13   ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Kirill A. Shutemov
@ 2022-11-10  1:58     ` Hugh Dickins
  0 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-10  1:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Sat, 5 Nov 2022, Kirill A. Shutemov wrote:
> On Wed, Nov 02, 2022 at 06:48:45PM -0700, Hugh Dickins wrote:
> > @@ -260,13 +267,16 @@ struct page {
> >   *    to find how many references there are to this folio.
> >   * @memcg_data: Memory Control Group data.
> >   * @_flags_1: For large folios, additional page flags.
> > - * @__head: Points to the folio.  Do not use.
> > + * @_head_1: Points to the folio.  Do not use.
> >   * @_folio_dtor: Which destructor to use for this folio.
> >   * @_folio_order: Do not use directly, call folio_order().
> >   * @_total_mapcount: Do not use directly, call folio_entire_mapcount().
> >   * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
> >   * @_folio_nr_pages: Do not use directly, call folio_nr_pages().
> > - * @_private_1: Do not use directly, call folio_get_private_1().
> 
> Looks like it misses
> 
>   + * @_flags_2: For large folios, additional page flags.
>   + * @_head_2: Points to the folio.  Do not use.
> 
> to match the first tail page documentation.
> 
> Otherwise the patch looks good to me:
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Many thanks for all your encouragement and reviews, Kirill.

Okay, I've added a couple of lines on those fields; but did not
want to recommend the _flags_2 field for use (by the time we run
out in _flags_1, I hope all this will be gone or look different).

I'm sending an incremental fix, once I've responded to Sidhartha.

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,hugetlb: use folio fields in second tail page
  2022-11-10  0:11       ` Sidhartha Kumar
@ 2022-11-10  2:10         ` Hugh Dickins
  2022-11-10  2:13           ` [PATCH 1/3 fix] mm,hugetlb: use folio fields in second tail page: fix Hugh Dickins
  0 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-10  2:10 UTC (permalink / raw)
  To: Sidhartha Kumar
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Muchun Song, Miaohe Lin,
	Naoya Horiguchi, Mina Almasry, James Houghton, Zach O'Keefe,
	linux-kernel, linux-mm

On Wed, 9 Nov 2022, Sidhartha Kumar wrote:
> On 11/3/22 9:29 PM, Hugh Dickins wrote:
> >
> >> Should the usage of page_1 and page_2 also be documented here?
> > You must have something interesting in mind to document about them,
> > but I cannot guess what! They are for field alignment, not for use.
> > (page_2 to help when/if someone needs to add another pageful.)
> >
> > Do you mean that I should copy the
> > 	/* private: the union with struct page is transitional */
> > comment from above the original "struct page page;" line I copied?
> > Or give all three of them a few underscores to imply not for use?
> 
> I think the underscores with a comment about not for use could be helpful.

I've given them two underscores (but not to the original "struct page page",
since a build showed that it is used as "page" elsewhere, not just for alignment).

I'm sorry, but I've not given them any comment: I don't think they
belong in the commented fields section (_flags_1 etc), "page" is not
there; and I'm, let's be honest, terrified of dabbling in this kerneldoc
area - feel very fortunate to have escaped attack by a robot for my
additions so far.  I'll leave adding comment to you or other cognoscenti.

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 1/3 fix] mm,hugetlb: use folio fields in second tail page: fix
  2022-11-10  2:10         ` Hugh Dickins
@ 2022-11-10  2:13           ` Hugh Dickins
  0 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-10  2:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sidhartha Kumar, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Muchun Song, Miaohe Lin,
	Naoya Horiguchi, Mina Almasry, James Houghton, Zach O'Keefe,
	linux-kernel, linux-mm

Per review comment from Sidhartha: prefix folio's page_1 and page_2 with
double underscore, to underscore that they are fillers for alignment
rather than directly usable members of the union (whereas the first
"struct page page" is important for folio<->page conversions).

Per review comment from Kirill: give folio's _flags_2 and _head_2 a line
of documentation each, though both of them "Do not use" (I think _flags_1
should be enough for now, and shouldn't recommend spilling to _flags_2).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm_types.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5d28bbd19e3f..1b8db9b4a7e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -275,6 +275,8 @@ struct page {
  * @_subpages_mapcount: Do not use directly, call folio_mapcount().
  * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
  * @_folio_nr_pages: Do not use directly, call folio_nr_pages().
+ * @_flags_2: For alignment.  Do not use.
+ * @_head_2: Points to the folio.  Do not use.
  * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
  * @_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
  * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
@@ -330,7 +332,7 @@ struct folio {
 			unsigned int _folio_nr_pages;
 #endif
 		};
-		struct page page_1;
+		struct page __page_1;
 	};
 	union {
 		struct {
@@ -341,7 +343,7 @@ struct folio {
 			void *_hugetlb_cgroup_rsvd;
 			void *_hugetlb_hwpoison;
 		};
-		struct page page_2;
+		struct page __page_2;
 	};
 };
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 4/3] mm,thp,rmap: handle the normal !PageCompound case first
  2022-11-03  1:44 [PATCH 0/3] mm,huge,rmap: unify and speed up compound mapcounts Hugh Dickins
                   ` (2 preceding siblings ...)
  2022-11-03  1:53 ` [PATCH 3/3] mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts Hugh Dickins
@ 2022-11-10  2:18 ` Hugh Dickins
  2022-11-10  3:23   ` Linus Torvalds
  2022-11-18  9:08 ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Hugh Dickins
  4 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-10  2:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

Commit ("mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts")
propagated the "if (compound) {lock} else if (PageCompound) {lock} else
{atomic}" pattern throughout; but Linus hated the way that gives primacy
to the uncommon case: switch to "if (!PageCompound) {atomic} else if
(compound) {lock} else {lock}" throughout.  Linus has a bigger idea
for how to improve it all, but here just make that rearrangement.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/rmap.c | 54 +++++++++++++++++++++++++++---------------------------
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 512e53cae2ca..4833d28c5e1a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1311,7 +1311,11 @@ void page_add_anon_rmap(struct page *page,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	if (compound && PageTransHuge(page)) {
+	if (likely(!PageCompound(page))) {
+		first = atomic_inc_and_test(&page->_mapcount);
+		nr = first;
+
+	} else if (compound && PageTransHuge(page)) {
 		lock_compound_mapcounts(page, &mapcounts);
 		first = !mapcounts.compound_mapcount;
 		mapcounts.compound_mapcount++;
@@ -1321,8 +1325,7 @@ void page_add_anon_rmap(struct page *page,
 				nr = nr_subpages_unmapped(page, nr_pmdmapped);
 		}
 		unlock_compound_mapcounts(page, &mapcounts);
-
-	} else if (PageCompound(page)) {
+	} else {
 		struct page *head = compound_head(page);
 
 		lock_compound_mapcounts(head, &mapcounts);
@@ -1330,10 +1333,6 @@ void page_add_anon_rmap(struct page *page,
 		first = subpage_mapcount_inc(page);
 		nr = first && !mapcounts.compound_mapcount;
 		unlock_compound_mapcounts(head, &mapcounts);
-
-	} else {
-		first = atomic_inc_and_test(&page->_mapcount);
-		nr = first;
 	}
 
 	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
@@ -1373,20 +1372,23 @@ void page_add_anon_rmap(struct page *page,
 void page_add_new_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address)
 {
-	const bool compound = PageCompound(page);
-	int nr = compound ? thp_nr_pages(page) : 1;
+	int nr;
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	__SetPageSwapBacked(page);
-	if (compound) {
+
+	if (likely(!PageCompound(page))) {
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
+		nr = 1;
+	} else {
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
+		nr = thp_nr_pages(page);
 		__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
-	} else {
-		/* increment count (starts at -1) */
-		atomic_set(&page->_mapcount, 0);
 	}
+
 	__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
@@ -1409,7 +1411,11 @@ void page_add_file_rmap(struct page *page,
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
 
-	if (compound && PageTransHuge(page)) {
+	if (likely(!PageCompound(page))) {
+		first = atomic_inc_and_test(&page->_mapcount);
+		nr = first;
+
+	} else if (compound && PageTransHuge(page)) {
 		lock_compound_mapcounts(page, &mapcounts);
 		first = !mapcounts.compound_mapcount;
 		mapcounts.compound_mapcount++;
@@ -1419,8 +1425,7 @@ void page_add_file_rmap(struct page *page,
 				nr = nr_subpages_unmapped(page, nr_pmdmapped);
 		}
 		unlock_compound_mapcounts(page, &mapcounts);
-
-	} else if (PageCompound(page)) {
+	} else {
 		struct page *head = compound_head(page);
 
 		lock_compound_mapcounts(head, &mapcounts);
@@ -1428,10 +1433,6 @@ void page_add_file_rmap(struct page *page,
 		first = subpage_mapcount_inc(page);
 		nr = first && !mapcounts.compound_mapcount;
 		unlock_compound_mapcounts(head, &mapcounts);
-
-	} else {
-		first = atomic_inc_and_test(&page->_mapcount);
-		nr = first;
 	}
 
 	if (nr_pmdmapped)
@@ -1471,7 +1472,11 @@ void page_remove_rmap(struct page *page,
 	lock_page_memcg(page);
 
 	/* page still mapped by someone else? */
-	if (compound && PageTransHuge(page)) {
+	if (likely(!PageCompound(page))) {
+		last = atomic_add_negative(-1, &page->_mapcount);
+		nr = last;
+
+	} else if (compound && PageTransHuge(page)) {
 		lock_compound_mapcounts(page, &mapcounts);
 		mapcounts.compound_mapcount--;
 		last = !mapcounts.compound_mapcount;
@@ -1481,8 +1486,7 @@ void page_remove_rmap(struct page *page,
 				nr = nr_subpages_unmapped(page, nr_pmdmapped);
 		}
 		unlock_compound_mapcounts(page, &mapcounts);
-
-	} else if (PageCompound(page)) {
+	} else {
 		struct page *head = compound_head(page);
 
 		lock_compound_mapcounts(head, &mapcounts);
@@ -1490,10 +1494,6 @@ void page_remove_rmap(struct page *page,
 		last = subpage_mapcount_dec(page);
 		nr = last && !mapcounts.compound_mapcount;
 		unlock_compound_mapcounts(head, &mapcounts);
-
-	} else {
-		last = atomic_add_negative(-1, &page->_mapcount);
-		nr = last;
 	}
 
 	if (nr_pmdmapped) {
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] mm,thp,rmap: simplify compound page mapcount handling
  2022-11-05 19:51   ` Kirill A. Shutemov
@ 2022-11-10  2:49     ` Hugh Dickins
  0 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-10  2:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Sat, 5 Nov 2022, Kirill A. Shutemov wrote:
> On Wed, Nov 02, 2022 at 06:51:38PM -0700, Hugh Dickins wrote:
> 
> Thanks for doing this!
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks!

> 
> And sorry again for PageDoubleMap() :/

It did serve a real purpose, but I always found it hard to live with,
and I'm glad that you're happy it's gone too :)

> 
> Minor nitpick and a question below.
> 
> > @@ -829,12 +829,20 @@ static inline int folio_entire_mapcount(struct folio *folio)
> >  
> >  /*
> >   * Mapcount of compound page as a whole, does not include mapped sub-pages.
> > - *
> > - * Must be called only for compound pages.
> > + * Must be called only on head of compound page.
> >   */
> > -static inline int compound_mapcount(struct page *page)
> > +static inline int head_compound_mapcount(struct page *head)
> >  {
> > -	return folio_entire_mapcount(page_folio(page));
> > +	return atomic_read(compound_mapcount_ptr(head)) + 1;
> > +}
> > +
> > +/*
> > + * Sum of mapcounts of sub-pages, does not include compound mapcount.
> > + * Must be called only on head of compound page.
> > + */
> > +static inline int head_subpages_mapcount(struct page *head)
> > +{
> > +	return atomic_read(subpages_mapcount_ptr(head));
> >  }
> >  
> >  /*
> 
> Any particular reason these two do not take struct folio as an input?
> It would guarantee that it is non-tail page. It will not guarantee
> large-folio, but it is something.

The actual reason is that I first did this work in a pre-folio tree;
and even now I am much more at ease with compound pages than folios.

But when I looked to see if I ought to change them, found that the
only uses are below in this header file, or in __dump_page() or in
free_tail_pages_check() - low-level functions, page-oriented and
obviously on head.  So I wasn't tempted to change them at all.
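
If a folio version of the second were wanted, it would be a one-liner
anyway; just as a sketch (not in the patches, name invented here), using
the _subpages_mapcount field this series adds to struct folio -
folio_entire_mapcount() already covers the compound one:

	static inline int folio_subpages_mapcount(struct folio *folio)
	{
		return atomic_read(&folio->_subpages_mapcount);
	}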

> 
> > @@ -1265,8 +1288,6 @@ void page_add_new_anon_rmap(struct page *page,
> >  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> >  		/* increment count (starts at -1) */
> >  		atomic_set(compound_mapcount_ptr(page), 0);
> > -		atomic_set(compound_pincount_ptr(page), 0);
> > -
> 
> It has to be initialized to 0 on allocation, right?

That's right.  I was going to say that I'd commented on this in the
commit message, but no, it looks like I only commented on the instance
in hugepage_add_new_anon_rmap() (and added the "increment" comment
line from here to there).

I visited both those functions to add a matching subpages_mapcount
initialization; then realized that the pincount addition had missed
the point: initialization to 0 has already been done, and the
compound_mapcount line is about incrementing from -1 to 0,
not about initializing.
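
Spelled out, the convention is (illustrative comment only, not part of
any patch here):

	/*
	 * Like page->_mapcount, compound_mapcount is stored offset by -1:
	 * a freshly allocated compound page reads -1, "mapped 0 times".
	 */
	atomic_set(compound_mapcount_ptr(page), 0);	/* first (pmd) mapping */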

There are similar places in mm/hugetlb.c, where I did add the
subpages_mapcount initialization to the compound_pincount and
compound_mapcount initializations: that's because I'm on shaky ground
with hugetlb page lifecycle, and not so sure of their status there.

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/3] mm,thp,rmap: handle the normal !PageCompound case first
  2022-11-10  2:18 ` [PATCH 4/3] mm,thp,rmap: handle the normal !PageCompound case first Hugh Dickins
@ 2022-11-10  3:23   ` Linus Torvalds
  2022-11-10  4:21     ` Hugh Dickins
  2022-11-10 16:31     ` Matthew Wilcox
  0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2022-11-10  3:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Wed, Nov 9, 2022 at 6:18 PM Hugh Dickins <hughd@google.com> wrote:
>
> Commit ("mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts")
> propagated the "if (compound) {lock} else if (PageCompound) {lock} else
> {atomic}" pattern throughout; but Linus hated the way that gives primacy
> to the uncommon case: switch to "if (!PageCompound) {atomic} else if
> (compound) {lock} else {lock}" throughout.

Side note, that 'compound' naming is also on my list of "I'm _really_
not a fan".

We actually have a completely different meaning for PageCompound()
than the meaning of 'compound' in the rmap functions, and those
functions literally mix those meanings if  not on the same line, then
at least right next to each other.

What 'rmap' actually means with 'compound' in the add/remove functions
is basically 'not PAGE_SIZE' as far as I can tell.

So if I get the energy to do the rmap counts, I will *also* be
renaming that horrible thing completely.

In fact, I'd be inclined to just pass in the actual page size
(possibly a page shift order), which some of the functions want
anyway, and which would be a lot clearer than the horrid "compound"
name.

One reason I find the "compound" name so horrifying is that it is used
very much for HUGETLB pages, which I don't think end up ever being
marked as PageCompound(), and which are - for various historical
reasons - doubly confusing because they use a "pte_t" to describe
themselves, even when they are actually using a "pmd_t" or a "pud_t"
to actually map the page.

So a HUGETLB entry really is (for historical reasons) designed to look
like a single last-level pte_t entry, but from an rmap perspective it
is explicitly *not* made to look like that at all, completely
violating the HUGETLB design.

So the rmap code has several really REALLY confusing cases:

 - the common one: just a page mapped at a *real* pte_t level.

   To make that more confusing, it can actually be a single-page
_part_ of a compound page in the PageCompound() sense, but the rmap
'compound' argument will *not* be set, because from a *mmap*
standpoint it's mapped as a single page.

   This is generally recognized by the rmap code by 'compound' being zero.

 - a HUGETLB mapping, which uses '->pte' in the page walking (please
kill me now) and is *not* using a PageCompound() page, but 'compound'
is still set, because from a *mapping* standpoint it's not a final
pte_t entry (but from an MM design standpoint it _was_ supposed to be
designed like a single page).

   This is randomly recognized by looking at the vma flags (using
"is_vm_hugetlb_page(vma)") or just inherent in the code itself (ie the
'hugetlb()' functions are only called by code that has tested this
situation one way or another)

   To make things more confusing, some places use PageHeadHuge()
instead (but the folio version of said test is called
"folio_test_hugetlb()", just so that nobody could possibly ever accuse
the HUGETLB code of having consistency).

    You'd think that PageHeadHuge() is one of the functions that
checks the page flag bits. You'd be wrong. It's very very special.

 - an *actual* PageCompound() page, mapped as such as a THP page (ie
mapped by a pmd, not a pte).

   This may be the same page size as a HUGETLB mapping (and often is),
but it's a completely different case in every single way.

   But like the HUGETLB case, the 'compound' argument will be set, and
now it's actually a compound page (but hey, so could the single page
mapping case be too).

   Unlike the HUGETLB case, the page walker does not use ->pte for
this, and some of the walkers will very much use that, ie
folio_referenced_one() will do

                if (pvmw.pte) {

   to distinguish the "natively mapped PageCompound()" case (no pte)
from the "map a single page" or from the HUGETLB case (yes pte).

There may be more cases than those three, and I may have confused
myself and gotten some of the details wrong, but I did want to write
the above diatribe out to

 (a) see if somebody corrects me for any of the cases I enumerated

 (b) see if somebody can point to yet another odd case

 (c) see if somebody has suggestions for better and more obvious names
for that 'compound' argument in the rmap code

I do wish the HUGETLB case didn't use 'pte' for its notion of how
HUGETLB entries are mapped, but that's literally how HUGETLB is
designed: it started life as a larger last-level pte.

It just means that it ends up being very confusing when from a page
table walk perspective, you're walking a pud or a pmd entry, and then
you see a 'pte_t' instead.
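
Concretely, the hugetlb lookup hands back a pte_t pointer no matter
what level the entry really lives at - roughly (an illustrative
fragment, with mm/address/vma assumed from context):

        pte_t *ptep = huge_pte_offset(mm, address,
                                      huge_page_size(hstate_vma(vma)));
        /* on most architectures that "pte" is really a pmd or pud slot */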

An example of that confusion is visible in try_to_unmap_one(), which
can be called with a HUGEPTE page (well, folio), and that does

        while (page_vma_mapped_walk(&pvmw)) {

to find the rmap entries, but it can't do that

                if (pvmw.pte) {

test to see what mapping it's walking (since both regular pages and
HUGETLB pages use that), so then it just keeps testing what kind of
page was passed in.

Which really smells very odd to me, but apparently it works,
presumably because unlike THP there can be no splitting.  But it's a
case where the whole "was it a regular page or a HUGETLB page" is
really really confusing.

And mm/hugetlb.c (and places like mm/pagewalk.c too) has a fair number
of random casts as a result of this "it's not really a pte_t, but it's
designed to look like one" thing.

This all really is understandable from a historical context, and from
HUGETLB really being just kind of designed to be a bigger page (not a
collection of small pages that can be mapped as a bigger entity), but
it really does mean that 'rmap' calling those HUGETLB pages 'compound'
is conceptually very very wrong.

Oh well. That whole HUGETLB model isn't getting fixed, but I think the
naming confusion about 'compound' *can* be fixed fairly easily, and we
could try to at least avoid having 'compound' and 'PageCompound()'
mean totally different things in the same function.

I'm not going to do any of this cleanup now, but I wanted to at least
voice my concerns. Maybe I'll get around to actually trying to clarify
the code later.

Because this was all stuff that was *very* confusing when I did the
rmap simplification in that (now completely rewritten to explicitly
_not_ touch rmap at all) original version of the delayed rmap patch
series.

                 Linus


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts
  2022-11-05 20:06   ` Kirill A. Shutemov
@ 2022-11-10  3:31     ` Hugh Dickins
  0 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-10  3:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Sat, 5 Nov 2022, Kirill A. Shutemov wrote:
> On Wed, Nov 02, 2022 at 06:53:45PM -0700, Hugh Dickins wrote:
> > Fix the races in maintaining compound_mapcount, subpages_mapcount and
> > subpage _mapcount by using PG_locked in the first tail of any compound
> > page for a bit_spin_lock() on such modifications; skipping the usual
> > atomic operations on those fields in this case.
> > 
> > Bring page_remove_file_rmap() and page_remove_anon_compound_rmap()
> > back into page_remove_rmap() itself.  Rearrange page_add_anon_rmap()
> > and page_add_file_rmap() and page_remove_rmap() to follow the same
> > "if (compound) {lock} else if (PageCompound) {lock} else {atomic}"
> > pattern (with a PageTransHuge in the compound test, like before, to
> > avoid BUG_ONs and optimize away that block when THP is not configured).
> > Move all the stats updates outside, after the bit_spin_locked section,
> > so that it is sure to be a leaf lock.
> > 
> > Add page_dup_compound_rmap() to manage compound locking versus atomics
> > in sync with the rest.  In particular, hugetlb pages are still using
> > the atomics: to avoid unnecessary interference there, and because they
> > never have subpage mappings; but this exception can easily be changed.
> > Conveniently, page_dup_compound_rmap() turns out to suit an anon THP's
> > __split_huge_pmd_locked() too.
> > 
> > bit_spin_lock() is not popular with PREEMPT_RT folks: but PREEMPT_RT
> > sensibly excludes TRANSPARENT_HUGEPAGE already, so its only exposure
> > is to the non-hugetlb non-THP pte-mapped compound pages (with large
> > folios being currently dependent on TRANSPARENT_HUGEPAGE).  There is
> > never any scan of subpages in this case; but we have chosen to use
> > PageCompound tests rather than PageTransCompound tests to gate the
> > use of lock_compound_mapcounts(), so that page_mapped() is correct on
> > all compound pages, whether or not TRANSPARENT_HUGEPAGE is enabled:
> > could that be a problem for PREEMPT_RT, when there is contention on
> > the lock - under heavy concurrent forking for example?  If so, then it
> > can be turned into a sleeping lock (like folio_lock()) when PREEMPT_RT.
> > 
> > A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB)
> > took 18 seconds on small pages, and used to take 1 second on huge pages,
> > but now takes 115 milliseconds on huge pages.  Mapping by pmds a second
> > time used to take 860ms and now takes 86ms; mapping by pmds after mapping
> > by ptes (when the scan is needed) used to take 870ms and now takes 495ms.
> > Mapping huge pages by ptes is largely unaffected but variable: between 5%
> > faster and 5% slower in what I've recorded.  Contention on the lock is
> > likely to behave worse than contention on the atomics behaved.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks, Kirill; and there's a 4/3 posted to change around that
"if (compound) {lock} else if (PageCompound) {lock} else {atomic}"
ordering, which Linus hated.

But this might be a good place to mention that Linus (I'd sent private
mail to sort out mm-unstable instabilities in a hurry, and discussion
ensued from there) does not like this patch very much, and has a good
idea for improving it, but has let us move forward with this for now.

His idea is for subpages_mapcount not to count all the ptes of subpages,
but to count all the subpages which have ptes (or I think that's one way
of saying it, but not how he said it): count what the stats need counted.
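
Roughly - a sketch of the idea only, not of an actual patch - bump
subpages_mapcount just when a subpage's own _mapcount crosses from
unmapped to mapped, rather than on every pte map of it:

	if (atomic_inc_and_test(&page->_mapcount))	/* -1 -> 0 */
		atomic_inc(subpages_mapcount_ptr(compound_head(page)));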

I was sceptical at first, because that was indeed something I had tried
at one point, but decided against.  I am hoping that it will turn out
just to be my prejudice: that I embarked on this job, in large part,
to get rid of the scan lurking inside total_mapcount().  And Linus's
idea would appear to bring back the unlocked scan in total_mapcount():
but remove all the locked scans in page_add/remove_rmap() - which,
setting aside my prejudice, sounds like a big improvement (in the
double-mapped case; common cases unchanged).

I was not enthusiastic, in that discussion several days ago, but got
quite excited once I had a moment to consider (but I've not told him so
until now).  I'll try to pursue it this weekend: maybe I'll rediscover
a good reason why it had to be abandoned, but let's hope it works out.

Anyway, what's in mm-unstable is good, and an improvement over the old
scans; but I appreciate Linus's frustration that it could be much better.

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/3] mm,thp,rmap: handle the normal !PageCompound case first
  2022-11-10  3:23   ` Linus Torvalds
@ 2022-11-10  4:21     ` Hugh Dickins
  2022-11-10 16:31     ` Matthew Wilcox
  1 sibling, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-10  4:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Wed, 9 Nov 2022, Linus Torvalds wrote:
> On Wed, Nov 9, 2022 at 6:18 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > Commit ("mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts")
> > propagated the "if (compound) {lock} else if (PageCompound) {lock} else
> > {atomic}" pattern throughout; but Linus hated the way that gives primacy
> > to the uncommon case: switch to "if (!PageCompound) {atomic} else if
> > (compound) {lock} else {lock}" throughout.
> 
> Side note, that 'compound' naming is also on my list of "I'm _really_
> not a fan".
> 
> We actually have a completely different meaning for PageCompound()
> than the meaning of 'compound' in the rmap functions, and those
> functions literally mix those meanings if  not on the same line, then
> at least right next to each other.
> 
> What 'rmap' actually means with 'compound' in the add/remove functions
> is basically 'not PAGE_SIZE' as far as I can tell.
> 
> So if I get the energy to do the rmap counts,

See my other mail: I got some zest to try your idea on the counts.

> I will *also* be renaming that horrible thing completely.

But I don't suppose I'll spend time on that part; I don't really
see the problem.  "compound" might be better named, say, "large_rmap"
(I'd have said "pmd_mapped" or "pmd_rmap", but you raise the spectre
of hugetlb below, and powerpc as usual does hugetlb very differently),
but compound seems okay to me, and consistent with usage elsewhere.

> 
> In fact, I'd be inclined to just pass in the actual page size
> (possibly a page shift order), which some of the functions want
> anyway, and which would be a lot clearer than the horrid "compound"
> name.

But yes, I think that would be an improvement; yet you might find a
reason why so often we don't do that - there's often an awkward
BUILD_BUG when you build without CONFIG_TRANSPARENT_HUGEPAGE=y.
And much as I've often wanted to remove it, it does give some
assurance that we're not bloating THP-disabled configs.  Maybe the
steady growth of compound_nr() usage gets around that better now
(or will you be renaming that too ?-)
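
The pattern I mean is roughly this (a made-up fragment, sketch_nr()
is not a real function):

	static int sketch_nr(struct page *page)
	{
		if (PageTransHuge(page))
			/*
			 * With CONFIG_TRANSPARENT_HUGEPAGE=n, PageTransHuge()
			 * is constant false, so this branch - and the
			 * BUILD_BUG() hidden in HPAGE_PMD_NR's THP-disabled
			 * definition - gets optimized away.
			 */
			return HPAGE_PMD_NR;
		return 1;
	}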

> 
> One reason I find the "compound" name so horrifying is that it is used
> very much for HUGETLB pages, which I don't think end up ever being
> marked as PageCompound(), and which are - for various historical

hugetlb pages are always PageCompound.  Shoot me if they're not.

> reasons - doubly confusing because they use a "pte_t" to describe
> themselves, even when they are actually using a "pmd_t" or a "pud_t"
> to actually map the page.

Yes, I wish we would undo that hugetlb deception: it would probably
be much more (un)doable, were it not for powerpc (and ia64 iirc).

> 
> So a HUGETLB entry really is (for historical reasons) designed to look
> like a single last-level pte_t entry, but from an rmap perspective it
> is explicitly *not* made to look like that at all, completely
> violating the HUGETLB design.
> 
> So the rmap code has several really REALLY confusing cases:
> 
>  - the common one: just a page mapped at a *real* pte_t level.
> 
>    To make that more confusing, it can actually be a single-page
> _part_ of a compound page in the PageCompound() sense, but the rmap
> 'compound' argument will *not* be set, because from a *mmap*
> standpoint it's mapped as a single page.

Yes.  Most pages are unambiguous, but when a PageHead page arrives
at page_add/remove_rmap(), we have to do different things, according
to whether it's mapped with a large or a small entry.
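
That is, the very same head page can legitimately arrive both ways
(illustrative calls, using the current signatures):

	page_add_file_rmap(head, vma, true);	/* whole THP mapped by a pmd */
	page_add_file_rmap(head, vma, false);	/* just its first subpage, by a pte */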

But I'm going away at this point, you write much faster than I can
read and understand and respond.  I'm responding in part to "fix"
my stupid typo on Johannes's address.

Hugh

> 
>    This is generally recognized by the rmap code by 'compound' being zero.
> 
>  - a HUGETLB mapping, which uses '->pte' in the page walking (please
> kill me now) and is *not* using a PageCompound() page, but 'compound'
> is still set, because from a *mapping* standpoint it's not a final
> pte_t entry (but from an MM design standpoint it _was_ supposed to be
> designed like a single page).
> 
>    This is randomly recognized by looking at the vma flags (using
> "is_vm_hugetlb_page(vma)") or just inherent in the code itself (ie the
> 'hugetlb()' functions are only called by code that has tested this
> situation one way or another)
> 
>    To make things more confusing, some places use PageHeadHuge()
> instead (but the folio version of said test is called
> "folio_test_hugetlb()", just so that nobody could possibly ever accuse
> the HUGETLB code of having consistency).
> 
>     You'd think that PageHeadHuge() is one of the functions that
> checks the page flag bits. You'd be wrong. It's very very special.
> 
>  - an *actual* PageCompound() page, mapped as such as a THP page (ie
> mapped by a pmd, not a pte).
> 
>    This may be the same page size as a HUGETLB mapping (and often is),
> but it's a completely different case in every single way.
> 
>    But like the HUGETLB case, the 'compound' argument will be set, and
> now it's actually a compound page (but hey, so could the single page
> mapping case be too).
> 
>    Unlike the HUGETLB case, the page walker does not use ->pte for
> this, and some of the walkers will very much use that, ie
> folio_referenced_one() will do
> 
>                 if (pvmw.pte) {
> 
>    to distinguish the "natively mapped PageCompound()" case (no pte)
> from the "map a single page" or from the HUGETLB case (yes pte).
> 
> There may be more cases than those three, and I may have confused
> myself and gotten some of the details wrong, but I did want to write
> the above diatribe out to
> 
>  (a) see if somebody corrects me for any of the cases I enumerated
> 
>  (b) see if somebody can point to yet another odd case
> 
>  (c) see if somebody has suggestions for better and more obvious names
> for that 'compound' argument in the rmap code
> 
> I do wish the HUGETLB case didn't use 'pte' for its notion of how
> HUGETLB entries are mapped, but that's literally how HUGETLB is
> designed: it started life as a larger last-level pte.
> 
> It just means that it ends up being very confusing when from a page
> table walk perspective, you're walking a pud or a pmd entry, and then
> you see a 'pte_t' instead.
> 
> An example of that confusion is visible in try_to_unmap_one(), which
> can be called with a HUGEPTE page (well, folio), and that does
> 
>         while (page_vma_mapped_walk(&pvmw)) {
> 
> to find the rmap entries, but it can't do that
> 
>                 if (pvmw.pte) {
> 
> test to see what mapping it's walking (since both regular pages and
> HUGETLB pages use that), so then it just keeps testing what kind of
> page was passed in.
> 
> Which really smells very odd to me, but apparently it works,
> presumably because unlike THP there can be no splitting.  But it's a
> case where the whole "was it a regular page or a HUGETLB page" is
> really really confusing.
> 
> And mm/hugetlb.c (and places like mm/pagewalk.c too) has a fair number
> of random casts as a result of this "it's not really a pte_t, but it's
> designed to look like one" thing.
> 
> This all really is understandable from a historical context, and from
> HUGETLB really being just kind of designed to be a bigger page (not a
> collection of small pages that can be mapped as a bigger entity), but
> it really does mean that 'rmap' calling those HUGETLB pages 'compound'
> is conceptually very very wrong.
> 
> Oh well. That whole HUGETLB model isn't getting fixed, but I think the
> naming confusion about 'compound' *can* be fixed fairly easily, and we
> could try to at least avoid having 'compound' and 'PageCompound()'
> mean totally different things in the same function.
> 
> I'm not going to do any of this cleanup now, but I wanted to at least
> voice my concerns. Maybe I'll get around to actually trying to clarify
> the code later.
> 
> Because this was all stuff that was *very* confusing when I did the
> rmap simplification in that (now completely rewritten to explicitly
> _not_ touch rmap at all) original version of the delayed rmap patch
> series.
> 
>                  Linus


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/3] mm,thp,rmap: handle the normal !PageCompound case first
  2022-11-10  3:23   ` Linus Torvalds
  2022-11-10  4:21     ` Hugh Dickins
@ 2022-11-10 16:31     ` Matthew Wilcox
  2022-11-10 16:58       ` Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-11-10 16:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Johannes Weiner, Kirill A. Shutemov,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Sidhartha Kumar, Muchun Song,
	Miaohe Lin, Naoya Horiguchi, Mina Almasry, James Houghton,
	Zach O'Keefe, linux-kernel, linux-mm

On Wed, Nov 09, 2022 at 07:23:08PM -0800, Linus Torvalds wrote:
> On Wed, Nov 9, 2022 at 6:18 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > Commit ("mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts")
> > propagated the "if (compound) {lock} else if (PageCompound) {lock} else
> > {atomic}" pattern throughout; but Linus hated the way that gives primacy
> > to the uncommon case: switch to "if (!PageCompound) {atomic} else if
> > (compound) {lock} else {lock}" throughout.
> 
> Side note, that 'compound' naming is also on my list of "I'm _really_
> not a fan".
> 
> We actually have a completely different meaning for PageCompound()
> than the meaning of 'compound' in the rmap functions, and those
> functions literally mix those meanings if  not on the same line, then
> at least right next to each other.
> 
> What 'rmap' actually means with 'compound' in the add/remove functions
> is basically 'not PAGE_SIZE' as far as I can tell.

Ah.  I've been trying to understand what that 'compound' really means,
and what the difference is to 'PageCompound()' and why we need both.
Thanks!

> One reason I find the "compound" name so horrifying is that it is used
> very much for HUGETLB pages, which I don't think end up ever being
> marked as PageCompound(), and which are - for various historical
> reasons - doubly confusing because they use a "pte_t" to describe
> themselves, even when they are actually using a "pmd_t" or a "pud_t"
> to actually map the page.

HugeTLB pages _are_ marked as Compound.  There's some fairly horrific
code to manually make them compound when they have to be allocated
piecemeal (because they're 1GB and too large for the page allocator).

>    To make things more confusing, some places use PageHeadHuge()
> instead (but the folio version of said test is called
> "folio_test_hugetlb()", just so that nobody could possibly ever accuse
> the HUGETLB code of having consistency).

That one's my fault, but it's a reaction to all the times that I and
others have got confused between PageHuge and PageTransHuge.  I suppose
we could do a big sed s/PageHuge/PageHugeTLB/, but I'm hopeful the
entire hugetlb codebase will either be converted to folios or unified with
THP handling.

> I do wish the HUGETLB case didn't use 'pte' for its notion of how
> HUGETLB entries are mapped, but that's literally how HUGETLB is
> designed: it started life as a larger last-level pte.
> 
> It just means that it ends up being very confusing when from a page
> table walk perspective, you're walking a pud or a pmd entry, and then
> you see a 'pte_t' instead.

Yes, one of the long-term things I want to try is making the hugetlb
code use the pmd/pud types like the THP code does.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 4/3] mm,thp,rmap: handle the normal !PageCompound case first
  2022-11-10 16:31     ` Matthew Wilcox
@ 2022-11-10 16:58       ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2022-11-10 16:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, Johannes Weiner, Kirill A. Shutemov,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Sidhartha Kumar, Muchun Song,
	Miaohe Lin, Naoya Horiguchi, Mina Almasry, James Houghton,
	Zach O'Keefe, linux-kernel, linux-mm

On Thu, Nov 10, 2022 at 8:31 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Ah.  I've been trying to understand what that 'compound' really means,
> and what the difference is to 'PageCompound()' and why we need both.

Yeah, so the 'why' is:

 (a) to distinguish the case of "I'm mapping the first sub-page of a
compound page as a _single_ page entry in the pte" from "I'm
mapping the whole compound/THP/HUGETLB page as a pmd"

The actual 'page' pointer can be the same in both cases, so you can't
tell from that: PageCompound() will be true in both cases.

Of course, sometimes you *can* tell from the page pointer too (eg the
HUGETLB case can never be mapped as a small page), but not always.

 (b) because we do completely different things from a page locking and
statistics standpoint for the two cases.

That (b) is obviously related to (a), but it's effectively the main
reason why rmap needs to be able to tell the difference in the first
place.
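
Which is why the caller has to say which one it means - on the anon
side that's the RMAP_COMPOUND flag (illustrative calls only):

        page_add_anon_rmap(page, vma, addr, RMAP_COMPOUND);    /* whole thing by pmd */
        page_add_anon_rmap(page, vma, addr, RMAP_NONE);         /* one subpage by pte */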

> HugeTLB pages _are_ marked as Compound.

Oh, ok. It's not clear why they would be, and historically I don't
think they were, but I guess it's for random implementation details
(probably for the head page lookup logic).

>  There's some fairly horrific
> code to manually make them compound when they have to be allocated
> piecemeal (because they're 1GB and too large for the page allocator).

Yeah, the HUGETLB case is a mess these days, but it made sense
historically, because it was a much simpler thing than the THP pages
that have all the fragmentation cases.

Now that we handle pmd-sized pages anyway, the HUGETLB case is mostly
just a nasty oddity, but we obviously also do the pud case with
HUGETLB.

And who knows what ia64 did with its completely random page-size
thing. I don't even want to think about it, and thankfully these days
I don't feel like I need to care any more ;)

> >    To make things more confusing, some places use PageHeadHuge()
> > instead (but the folio version of said test is called
> > "folio_test_hugetlb()", just so that nobody could possibly ever accuse
> > the HUGETLB code of having consistency).
>
> That one's my fault, but it's a reaction to all the times that I and
> others have got confused between PageHuge and PageTransHuge.  I suppose
> we could do a big sed s/PageHuge/PageHugeTLB/, but I'm hopeful the
> entire hugetlb codebase will either be converted to folios or unified with
> THP handling.

Yeah, it would be lovely to make HUGETLB some THP special case some day.

                 Linus


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-03  1:44 [PATCH 0/3] mm,huge,rmap: unify and speed up compound mapcounts Hugh Dickins
                   ` (3 preceding siblings ...)
  2022-11-10  2:18 ` [PATCH 4/3] mm,thp,rmap: handle the normal !PageCompound case first Hugh Dickins
@ 2022-11-18  9:08 ` Hugh Dickins
  2022-11-18  9:12   ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
                     ` (5 more replies)
  4 siblings, 6 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-18  9:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

Linus was underwhelmed by the earlier compound mapcounts series:
this series builds on top of it (as in next-20221117) to follow
up on his suggestions - except rmap.c still using lock_page_memcg(),
since I hesitate to steal the pleasure of deletion from Johannes.

1/3 mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
2/3 mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped
3/3 mm,thp,rmap: clean up the end of __split_huge_pmd_locked()

 Documentation/mm/transhuge.rst |  10 +-
 include/linux/mm.h             |  65 +++++++----
 include/linux/rmap.h           |  12 +-
 mm/debug.c                     |   2 +-
 mm/huge_memory.c               |  15 +--
 mm/rmap.c                      | 213 ++++++++++-------------------------
 6 files changed, 119 insertions(+), 198 deletions(-)

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
  2022-11-18  9:08 ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Hugh Dickins
@ 2022-11-18  9:12   ` Hugh Dickins
  2022-11-19  0:12     ` Yu Zhao
  2022-11-21 12:36     ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Kirill A. Shutemov
  2022-11-18  9:14   ` [PATCH 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped Hugh Dickins
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-18  9:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

Following a suggestion from Linus, instead of counting every PTE map of a
compound page in subpages_mapcount, just count how many of its subpages
are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED
and NR_FILE_MAPPED stats, without any need for a locked scan of subpages;
and requires updating the count less often.

This does then revert total_mapcount() and folio_mapcount() to needing a
scan of subpages; but they are inherently racy, and need no locking, so
Linus is right that the scans are much better done there.  Plus (unlike
in 6.1 and previous) subpages_mapcount lets us avoid the scan in the
common case of no PTE maps.  And page_mapped() and folio_mapped() remain
scanless and just as efficient with the new meaning of subpages_mapcount:
those are the functions which I most wanted to remove the scan from.

The updated page_dup_compound_rmap() is no longer suitable for use by
anon THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be
used for that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.

Evidence is that this way goes slightly faster than the previous
implementation for most cases; but significantly faster in the (now
scanless) pmds after ptes case, which started out at 870ms, was
brought down to 495ms by the previous series, and now takes around 105ms.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 Documentation/mm/transhuge.rst |   3 +-
 include/linux/mm.h             |  52 ++++++-----
 include/linux/rmap.h           |   8 +-
 mm/huge_memory.c               |   2 +-
 mm/rmap.c                      | 155 ++++++++++++++-------------------
 5 files changed, 103 insertions(+), 117 deletions(-)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index 1e2a637cc607..af4c9d70321d 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -122,7 +122,8 @@ pages:
 
   - map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount
     on relevant sub-page of the compound page, and also increment/decrement
-    ->subpages_mapcount, stored in first tail page of the compound page.
+    ->subpages_mapcount, stored in first tail page of the compound page, when
+    _mapcount goes from -1 to 0 or 0 to -1: counting sub-pages mapped by PTE.
     In order to have race-free accounting of sub-pages mapped, changes to
     sub-page ->_mapcount, ->subpages_mapcount and ->compound_mapcount are
     are all locked by bit_spin_lock of PG_locked in the first tail ->flags.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8fe6276d8cc2..c9e46d4d46f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -828,7 +828,7 @@ static inline int head_compound_mapcount(struct page *head)
 }
 
 /*
- * Sum of mapcounts of sub-pages, does not include compound mapcount.
+ * Number of sub-pages mapped by PTE, does not include compound mapcount.
  * Must be called only on head of compound page.
  */
 static inline int head_subpages_mapcount(struct page *head)
@@ -864,23 +864,7 @@ static inline int page_mapcount(struct page *page)
 	return head_compound_mapcount(page) + mapcount;
 }
 
-static inline int total_mapcount(struct page *page)
-{
-	if (likely(!PageCompound(page)))
-		return atomic_read(&page->_mapcount) + 1;
-	page = compound_head(page);
-	return head_compound_mapcount(page) + head_subpages_mapcount(page);
-}
-
-/*
- * Return true if this page is mapped into pagetables.
- * For compound page it returns true if any subpage of compound page is mapped,
- * even if this particular subpage is not itself mapped by any PTE or PMD.
- */
-static inline bool page_mapped(struct page *page)
-{
-	return total_mapcount(page) > 0;
-}
+int total_compound_mapcount(struct page *head);
 
 /**
  * folio_mapcount() - Calculate the number of mappings of this folio.
@@ -897,8 +881,20 @@ static inline int folio_mapcount(struct folio *folio)
 {
 	if (likely(!folio_test_large(folio)))
 		return atomic_read(&folio->_mapcount) + 1;
-	return atomic_read(folio_mapcount_ptr(folio)) + 1 +
-		atomic_read(folio_subpages_mapcount_ptr(folio));
+	return total_compound_mapcount(&folio->page);
+}
+
+static inline int total_mapcount(struct page *page)
+{
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+	return total_compound_mapcount(compound_head(page));
+}
+
+static inline bool folio_large_is_mapped(struct folio *folio)
+{
+	return atomic_read(folio_mapcount_ptr(folio)) +
+		atomic_read(folio_subpages_mapcount_ptr(folio)) >= 0;
 }
 
 /**
@@ -909,7 +905,21 @@ static inline int folio_mapcount(struct folio *folio)
  */
 static inline bool folio_mapped(struct folio *folio)
 {
-	return folio_mapcount(folio) > 0;
+	if (likely(!folio_test_large(folio)))
+		return atomic_read(&folio->_mapcount) >= 0;
+	return folio_large_is_mapped(folio);
+}
+
+/*
+ * Return true if this page is mapped into pagetables.
+ * For compound page it returns true if any sub-page of compound page is mapped,
+ * even if this particular sub-page is not itself mapped by any PTE or PMD.
+ */
+static inline bool page_mapped(struct page *page)
+{
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) >= 0;
+	return folio_large_is_mapped(page_folio(page));
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 011a7530dc76..860f558126ac 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,14 +204,14 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 
-void page_dup_compound_rmap(struct page *page, bool compound);
+void page_dup_compound_rmap(struct page *page);
 
 static inline void page_dup_file_rmap(struct page *page, bool compound)
 {
-	if (PageCompound(page))
-		page_dup_compound_rmap(page, compound);
-	else
+	if (likely(!compound /* page is mapped by PTE */))
 		atomic_inc(&page->_mapcount);
+	else
+		page_dup_compound_rmap(page);
 }
 
 /**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 30056efc79ad..3dee8665c585 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2215,7 +2215,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
 		if (!pmd_migration)
-			page_dup_compound_rmap(page + i, false);
+			page_add_anon_rmap(page + i, vma, addr, false);
 		pte_unmap(pte);
 	}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 4833d28c5e1a..66be8cae640f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1117,55 +1117,36 @@ static void unlock_compound_mapcounts(struct page *head,
 	bit_spin_unlock(PG_locked, &head[1].flags);
 }
 
-/*
- * When acting on a compound page under lock_compound_mapcounts(), avoid the
- * unnecessary overhead of an actual atomic operation on its subpage mapcount.
- * Return true if this is the first increment or the last decrement
- * (remembering that page->_mapcount -1 represents logical mapcount 0).
- */
-static bool subpage_mapcount_inc(struct page *page)
-{
-	int orig_mapcount = atomic_read(&page->_mapcount);
-
-	atomic_set(&page->_mapcount, orig_mapcount + 1);
-	return orig_mapcount < 0;
-}
-
-static bool subpage_mapcount_dec(struct page *page)
-{
-	int orig_mapcount = atomic_read(&page->_mapcount);
-
-	atomic_set(&page->_mapcount, orig_mapcount - 1);
-	return orig_mapcount == 0;
-}
-
-/*
- * When mapping a THP's first pmd, or unmapping its last pmd, if that THP
- * also has pte mappings, then those must be discounted: in order to maintain
- * NR_ANON_MAPPED and NR_FILE_MAPPED statistics exactly, without any drift,
- * and to decide when an anon THP should be put on the deferred split queue.
- * This function must be called between lock_ and unlock_compound_mapcounts().
- */
-static int nr_subpages_unmapped(struct page *head, int nr_subpages)
+int total_compound_mapcount(struct page *head)
 {
-	int nr = nr_subpages;
+	int mapcount = head_compound_mapcount(head);
+	int nr_subpages;
 	int i;
 
-	/* Discount those subpages mapped by pte */
+	/* In the common case, avoid the loop when no subpages mapped by PTE */
+	if (head_subpages_mapcount(head) == 0)
+		return mapcount;
+	/*
+	 * Add all the PTE mappings of those subpages mapped by PTE.
+	 * Limit the loop, knowing that only subpages_mapcount are mapped?
+	 * Perhaps: given all the raciness, that may be a good or a bad idea.
+	 */
+	nr_subpages = thp_nr_pages(head);
 	for (i = 0; i < nr_subpages; i++)
-		if (atomic_read(&head[i]._mapcount) >= 0)
-			nr--;
-	return nr;
+		mapcount += atomic_read(&head[i]._mapcount);
+
+	/* But each of those _mapcounts was based on -1 */
+	mapcount += nr_subpages;
+	return mapcount;
 }
 
 /*
- * page_dup_compound_rmap(), used when copying mm, or when splitting pmd,
+ * page_dup_compound_rmap(), used when copying mm,
  * provides a simple example of using lock_ and unlock_compound_mapcounts().
  */
-void page_dup_compound_rmap(struct page *page, bool compound)
+void page_dup_compound_rmap(struct page *head)
 {
 	struct compound_mapcounts mapcounts;
-	struct page *head;
 
 	/*
 	 * Hugetlb pages could use lock_compound_mapcounts(), like THPs do;
@@ -1176,20 +1157,16 @@ void page_dup_compound_rmap(struct page *page, bool compound)
 	 * Note that hugetlb does not call page_add_file_rmap():
 	 * here is where hugetlb shared page mapcount is raised.
 	 */
-	if (PageHuge(page)) {
-		atomic_inc(compound_mapcount_ptr(page));
-		return;
-	}
+	if (PageHuge(head)) {
+		atomic_inc(compound_mapcount_ptr(head));
 
-	head = compound_head(page);
-	lock_compound_mapcounts(head, &mapcounts);
-	if (compound) {
+	} else if (PageTransHuge(head)) {
+		/* That test is redundant: it's for safety or to optimize out */
+
+		lock_compound_mapcounts(head, &mapcounts);
 		mapcounts.compound_mapcount++;
-	} else {
-		mapcounts.subpages_mapcount++;
-		subpage_mapcount_inc(page);
+		unlock_compound_mapcounts(head, &mapcounts);
 	}
-	unlock_compound_mapcounts(head, &mapcounts);
 }
 
 /**
@@ -1308,31 +1285,29 @@ void page_add_anon_rmap(struct page *page,
 
 	if (unlikely(PageKsm(page)))
 		lock_page_memcg(page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	if (likely(!PageCompound(page))) {
+	if (likely(!compound /* page is mapped by PTE */)) {
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
+		if (first && PageCompound(page)) {
+			struct page *head = compound_head(page);
+
+			lock_compound_mapcounts(head, &mapcounts);
+			mapcounts.subpages_mapcount++;
+			nr = !mapcounts.compound_mapcount;
+			unlock_compound_mapcounts(head, &mapcounts);
+		}
+	} else if (PageTransHuge(page)) {
+		/* That test is redundant: it's for safety or to optimize out */
 
-	} else if (compound && PageTransHuge(page)) {
 		lock_compound_mapcounts(page, &mapcounts);
 		first = !mapcounts.compound_mapcount;
 		mapcounts.compound_mapcount++;
 		if (first) {
-			nr = nr_pmdmapped = thp_nr_pages(page);
-			if (mapcounts.subpages_mapcount)
-				nr = nr_subpages_unmapped(page, nr_pmdmapped);
+			nr_pmdmapped = thp_nr_pages(page);
+			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
 		}
 		unlock_compound_mapcounts(page, &mapcounts);
-	} else {
-		struct page *head = compound_head(page);
-
-		lock_compound_mapcounts(head, &mapcounts);
-		mapcounts.subpages_mapcount++;
-		first = subpage_mapcount_inc(page);
-		nr = first && !mapcounts.compound_mapcount;
-		unlock_compound_mapcounts(head, &mapcounts);
 	}
 
 	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
@@ -1411,28 +1386,28 @@ void page_add_file_rmap(struct page *page,
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
 
-	if (likely(!PageCompound(page))) {
+	if (likely(!compound /* page is mapped by PTE */)) {
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
+		if (first && PageCompound(page)) {
+			struct page *head = compound_head(page);
+
+			lock_compound_mapcounts(head, &mapcounts);
+			mapcounts.subpages_mapcount++;
+			nr = !mapcounts.compound_mapcount;
+			unlock_compound_mapcounts(head, &mapcounts);
+		}
+	} else if (PageTransHuge(page)) {
+		/* That test is redundant: it's for safety or to optimize out */
 
-	} else if (compound && PageTransHuge(page)) {
 		lock_compound_mapcounts(page, &mapcounts);
 		first = !mapcounts.compound_mapcount;
 		mapcounts.compound_mapcount++;
 		if (first) {
-			nr = nr_pmdmapped = thp_nr_pages(page);
-			if (mapcounts.subpages_mapcount)
-				nr = nr_subpages_unmapped(page, nr_pmdmapped);
+			nr_pmdmapped = thp_nr_pages(page);
+			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
 		}
 		unlock_compound_mapcounts(page, &mapcounts);
-	} else {
-		struct page *head = compound_head(page);
-
-		lock_compound_mapcounts(head, &mapcounts);
-		mapcounts.subpages_mapcount++;
-		first = subpage_mapcount_inc(page);
-		nr = first && !mapcounts.compound_mapcount;
-		unlock_compound_mapcounts(head, &mapcounts);
 	}
 
 	if (nr_pmdmapped)
@@ -1472,28 +1447,28 @@ void page_remove_rmap(struct page *page,
 	lock_page_memcg(page);
 
 	/* page still mapped by someone else? */
-	if (likely(!PageCompound(page))) {
+	if (likely(!compound /* page is mapped by PTE */)) {
 		last = atomic_add_negative(-1, &page->_mapcount);
 		nr = last;
+		if (last && PageCompound(page)) {
+			struct page *head = compound_head(page);
+
+			lock_compound_mapcounts(head, &mapcounts);
+			mapcounts.subpages_mapcount--;
+			nr = !mapcounts.compound_mapcount;
+			unlock_compound_mapcounts(head, &mapcounts);
+		}
+	} else if (PageTransHuge(page)) {
+		/* That test is redundant: it's for safety or to optimize out */
 
-	} else if (compound && PageTransHuge(page)) {
 		lock_compound_mapcounts(page, &mapcounts);
 		mapcounts.compound_mapcount--;
 		last = !mapcounts.compound_mapcount;
 		if (last) {
-			nr = nr_pmdmapped = thp_nr_pages(page);
-			if (mapcounts.subpages_mapcount)
-				nr = nr_subpages_unmapped(page, nr_pmdmapped);
+			nr_pmdmapped = thp_nr_pages(page);
+			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
 		}
 		unlock_compound_mapcounts(page, &mapcounts);
-	} else {
-		struct page *head = compound_head(page);
-
-		lock_compound_mapcounts(head, &mapcounts);
-		mapcounts.subpages_mapcount--;
-		last = subpage_mapcount_dec(page);
-		nr = last && !mapcounts.compound_mapcount;
-		unlock_compound_mapcounts(head, &mapcounts);
 	}
 
 	if (nr_pmdmapped) {
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped
  2022-11-18  9:08 ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Hugh Dickins
  2022-11-18  9:12   ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
@ 2022-11-18  9:14   ` Hugh Dickins
  2022-11-21 13:09     ` Kirill A. Shutemov
  2022-11-18  9:16   ` [PATCH 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked() Hugh Dickins
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-18  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

Can the lock_compound_mapcounts() bit_spin_lock apparatus be removed now?
Yes.  Not by atomic64_t or cmpxchg games, which get difficult on 32-bit;
but if we slightly abuse subpages_mapcount by additionally demanding that
one bit be set there when the compound page is PMD-mapped, then a cascade
of two atomic ops is able to maintain the stats without bit_spin_lock.

This is harder to reason about than when bit_spin_locked, but I believe
safe; and no drift in stats detected when testing.  When there are racing
removes and adds, of course the sequence of operations is less well-
defined; but each operation on subpages_mapcount is atomically good.
What might be disastrous is if subpages_mapcount could ever fleetingly
appear negative: but the pte lock (or pmd lock) these rmap functions are
called under ensures that a last remove cannot race ahead of a first add.

Continue to make an exception for hugetlb (PageHuge) pages, though that
exception can be easily removed by a further commit if necessary: leave
subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
carry on checking compound_mapcount too in folio_mapped(), page_mapped().

Evidence is that this way goes slightly faster than the previous
implementation in all cases (pmds after ptes now taking around 103ms);
and relieves us of worrying about contention on the bit_spin_lock.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 Documentation/mm/transhuge.rst |   7 +-
 include/linux/mm.h             |  19 ++++-
 include/linux/rmap.h           |  12 ++--
 mm/debug.c                     |   2 +-
 mm/rmap.c                      | 124 +++++++--------------------------
 5 files changed, 52 insertions(+), 112 deletions(-)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index af4c9d70321d..ec3dc5b04226 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -118,15 +118,14 @@ pages:
     succeeds on tail pages.
 
   - map/unmap of PMD entry for the whole compound page increment/decrement
-    ->compound_mapcount, stored in the first tail page of the compound page.
+    ->compound_mapcount, stored in the first tail page of the compound page;
+    and also increment/decrement ->subpages_mapcount (also in the first tail)
+    by COMPOUND_MAPPED when compound_mapcount goes from -1 to 0 or 0 to -1.
 
   - map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount
     on relevant sub-page of the compound page, and also increment/decrement
     ->subpages_mapcount, stored in first tail page of the compound page, when
     _mapcount goes from -1 to 0 or 0 to -1: counting sub-pages mapped by PTE.
-    In order to have race-free accounting of sub-pages mapped, changes to
-    sub-page ->_mapcount, ->subpages_mapcount and ->compound_mapcount are
-    are all locked by bit_spin_lock of PG_locked in the first tail ->flags.
 
 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9e46d4d46f2..a2bfb5e4be62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -828,7 +828,16 @@ static inline int head_compound_mapcount(struct page *head)
 }
 
 /*
- * Number of sub-pages mapped by PTE, does not include compound mapcount.
+ * If a 16GB hugetlb page were mapped by PTEs of all of its 4kB sub-pages,
+ * its subpages_mapcount would be 0x400000: choose the COMPOUND_MAPPED bit
+ * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
+ * leaves subpages_mapcount at 0, but avoid surprise if it participates later.
+ */
+#define COMPOUND_MAPPED	0x800000
+#define SUBPAGES_MAPPED	(COMPOUND_MAPPED - 1)
+
+/*
+ * Number of sub-pages mapped by PTE, plus COMPOUND_MAPPED if compound mapped.
  * Must be called only on head of compound page.
  */
 static inline int head_subpages_mapcount(struct page *head)
@@ -893,8 +902,12 @@ static inline int total_mapcount(struct page *page)
 
 static inline bool folio_large_is_mapped(struct folio *folio)
 {
-	return atomic_read(folio_mapcount_ptr(folio)) +
-		atomic_read(folio_subpages_mapcount_ptr(folio)) >= 0;
+	/*
+	 * Reading folio_mapcount_ptr() below could be omitted if hugetlb
+	 * participated in incrementing subpages_mapcount when compound mapped.
+	 */
+	return atomic_read(folio_mapcount_ptr(folio)) >= 0 ||
+		atomic_read(folio_subpages_mapcount_ptr(folio)) > 0;
 }
 
 /**
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 860f558126ac..bd3504d11b15 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,14 +204,14 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 
-void page_dup_compound_rmap(struct page *page);
+static inline void __page_dup_rmap(struct page *page, bool compound)
+{
+	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
+}
 
 static inline void page_dup_file_rmap(struct page *page, bool compound)
 {
-	if (likely(!compound /* page is mapped by PTE */))
-		atomic_inc(&page->_mapcount);
-	else
-		page_dup_compound_rmap(page);
+	__page_dup_rmap(page, compound);
 }
 
 /**
@@ -260,7 +260,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
 	 * the page R/O into both processes.
 	 */
 dup:
-	page_dup_file_rmap(page, compound);
+	__page_dup_rmap(page, compound);
 	return 0;
 }
 
diff --git a/mm/debug.c b/mm/debug.c
index 7f8e5f744e42..1ef2ff6a05cb 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -97,7 +97,7 @@ static void __dump_page(struct page *page)
 		pr_warn("head:%p order:%u compound_mapcount:%d subpages_mapcount:%d compound_pincount:%d\n",
 				head, compound_order(head),
 				head_compound_mapcount(head),
-				head_subpages_mapcount(head),
+				head_subpages_mapcount(head) & SUBPAGES_MAPPED,
 				head_compound_pincount(head));
 	}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 66be8cae640f..5e4ce0a6d6f1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1085,38 +1085,6 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 	return page_vma_mkclean_one(&pvmw);
 }
 
-struct compound_mapcounts {
-	unsigned int compound_mapcount;
-	unsigned int subpages_mapcount;
-};
-
-/*
- * lock_compound_mapcounts() first locks, then copies subpages_mapcount and
- * compound_mapcount from head[1].compound_mapcount and subpages_mapcount,
- * converting from struct page's internal representation to logical count
- * (that is, adding 1 to compound_mapcount to hide its offset by -1).
- */
-static void lock_compound_mapcounts(struct page *head,
-		struct compound_mapcounts *local)
-{
-	bit_spin_lock(PG_locked, &head[1].flags);
-	local->compound_mapcount = atomic_read(compound_mapcount_ptr(head)) + 1;
-	local->subpages_mapcount = atomic_read(subpages_mapcount_ptr(head));
-}
-
-/*
- * After caller has updated subpage._mapcount, local subpages_mapcount and
- * local compound_mapcount, as necessary, unlock_compound_mapcounts() converts
- * and copies them back to the compound head[1] fields, and then unlocks.
- */
-static void unlock_compound_mapcounts(struct page *head,
-		struct compound_mapcounts *local)
-{
-	atomic_set(compound_mapcount_ptr(head), local->compound_mapcount - 1);
-	atomic_set(subpages_mapcount_ptr(head), local->subpages_mapcount);
-	bit_spin_unlock(PG_locked, &head[1].flags);
-}
-
 int total_compound_mapcount(struct page *head)
 {
 	int mapcount = head_compound_mapcount(head);
@@ -1124,7 +1092,7 @@ int total_compound_mapcount(struct page *head)
 	int i;
 
 	/* In the common case, avoid the loop when no subpages mapped by PTE */
-	if (head_subpages_mapcount(head) == 0)
+	if ((head_subpages_mapcount(head) & SUBPAGES_MAPPED) == 0)
 		return mapcount;
 	/*
 	 * Add all the PTE mappings of those subpages mapped by PTE.
@@ -1140,35 +1108,6 @@ int total_compound_mapcount(struct page *head)
 	return mapcount;
 }
 
-/*
- * page_dup_compound_rmap(), used when copying mm,
- * provides a simple example of using lock_ and unlock_compound_mapcounts().
- */
-void page_dup_compound_rmap(struct page *head)
-{
-	struct compound_mapcounts mapcounts;
-
-	/*
-	 * Hugetlb pages could use lock_compound_mapcounts(), like THPs do;
-	 * but at present they are still being managed by atomic operations:
-	 * which are likely to be somewhat faster, so don't rush to convert
-	 * them over without evaluating the effect.
-	 *
-	 * Note that hugetlb does not call page_add_file_rmap():
-	 * here is where hugetlb shared page mapcount is raised.
-	 */
-	if (PageHuge(head)) {
-		atomic_inc(compound_mapcount_ptr(head));
-
-	} else if (PageTransHuge(head)) {
-		/* That test is redundant: it's for safety or to optimize out */
-
-		lock_compound_mapcounts(head, &mapcounts);
-		mapcounts.compound_mapcount++;
-		unlock_compound_mapcounts(head, &mapcounts);
-	}
-}
-
 /**
  * page_move_anon_rmap - move a page to our anon_vma
  * @page:	the page to move to our anon_vma
@@ -1278,7 +1217,7 @@ static void __page_check_anon_rmap(struct page *page,
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, rmap_t flags)
 {
-	struct compound_mapcounts mapcounts;
+	atomic_t *mapped;
 	int nr = 0, nr_pmdmapped = 0;
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
@@ -1290,24 +1229,20 @@ void page_add_anon_rmap(struct page *page,
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
 		if (first && PageCompound(page)) {
-			struct page *head = compound_head(page);
-
-			lock_compound_mapcounts(head, &mapcounts);
-			mapcounts.subpages_mapcount++;
-			nr = !mapcounts.compound_mapcount;
-			unlock_compound_mapcounts(head, &mapcounts);
+			mapped = subpages_mapcount_ptr(compound_head(page));
+			nr = atomic_inc_return_relaxed(mapped);
+			nr = !(nr & COMPOUND_MAPPED);
 		}
 	} else if (PageTransHuge(page)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
-		lock_compound_mapcounts(page, &mapcounts);
-		first = !mapcounts.compound_mapcount;
-		mapcounts.compound_mapcount++;
+		first = atomic_inc_and_test(compound_mapcount_ptr(page));
 		if (first) {
+			mapped = subpages_mapcount_ptr(page);
+			nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
 			nr_pmdmapped = thp_nr_pages(page);
-			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
+			nr = nr_pmdmapped - (nr & SUBPAGES_MAPPED);
 		}
-		unlock_compound_mapcounts(page, &mapcounts);
 	}
 
 	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
@@ -1360,6 +1295,7 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
+		atomic_set(subpages_mapcount_ptr(page), COMPOUND_MAPPED);
 		nr = thp_nr_pages(page);
 		__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
 	}
@@ -1379,7 +1315,7 @@ void page_add_new_anon_rmap(struct page *page,
 void page_add_file_rmap(struct page *page,
 	struct vm_area_struct *vma, bool compound)
 {
-	struct compound_mapcounts mapcounts;
+	atomic_t *mapped;
 	int nr = 0, nr_pmdmapped = 0;
 	bool first;
 
@@ -1390,24 +1326,20 @@ void page_add_file_rmap(struct page *page,
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
 		if (first && PageCompound(page)) {
-			struct page *head = compound_head(page);
-
-			lock_compound_mapcounts(head, &mapcounts);
-			mapcounts.subpages_mapcount++;
-			nr = !mapcounts.compound_mapcount;
-			unlock_compound_mapcounts(head, &mapcounts);
+			mapped = subpages_mapcount_ptr(compound_head(page));
+			nr = atomic_inc_return_relaxed(mapped);
+			nr = !(nr & COMPOUND_MAPPED);
 		}
 	} else if (PageTransHuge(page)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
-		lock_compound_mapcounts(page, &mapcounts);
-		first = !mapcounts.compound_mapcount;
-		mapcounts.compound_mapcount++;
+		first = atomic_inc_and_test(compound_mapcount_ptr(page));
 		if (first) {
+			mapped = subpages_mapcount_ptr(page);
+			nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
 			nr_pmdmapped = thp_nr_pages(page);
-			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
+			nr = nr_pmdmapped - (nr & SUBPAGES_MAPPED);
 		}
-		unlock_compound_mapcounts(page, &mapcounts);
 	}
 
 	if (nr_pmdmapped)
@@ -1431,7 +1363,7 @@ void page_add_file_rmap(struct page *page,
 void page_remove_rmap(struct page *page,
 	struct vm_area_struct *vma, bool compound)
 {
-	struct compound_mapcounts mapcounts;
+	atomic_t *mapped;
 	int nr = 0, nr_pmdmapped = 0;
 	bool last;
 
@@ -1451,24 +1383,20 @@ void page_remove_rmap(struct page *page,
 		last = atomic_add_negative(-1, &page->_mapcount);
 		nr = last;
 		if (last && PageCompound(page)) {
-			struct page *head = compound_head(page);
-
-			lock_compound_mapcounts(head, &mapcounts);
-			mapcounts.subpages_mapcount--;
-			nr = !mapcounts.compound_mapcount;
-			unlock_compound_mapcounts(head, &mapcounts);
+			mapped = subpages_mapcount_ptr(compound_head(page));
+			nr = atomic_dec_return_relaxed(mapped);
+			nr = !(nr & COMPOUND_MAPPED);
 		}
 	} else if (PageTransHuge(page)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
-		lock_compound_mapcounts(page, &mapcounts);
-		mapcounts.compound_mapcount--;
-		last = !mapcounts.compound_mapcount;
+		last = atomic_add_negative(-1, compound_mapcount_ptr(page));
 		if (last) {
+			mapped = subpages_mapcount_ptr(page);
+			nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
 			nr_pmdmapped = thp_nr_pages(page);
-			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
+			nr = nr_pmdmapped - (nr & SUBPAGES_MAPPED);
 		}
-		unlock_compound_mapcounts(page, &mapcounts);
 	}
 
 	if (nr_pmdmapped) {
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked()
  2022-11-18  9:08 ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Hugh Dickins
  2022-11-18  9:12   ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
  2022-11-18  9:14   ` [PATCH 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped Hugh Dickins
@ 2022-11-18  9:16   ` Hugh Dickins
  2022-11-21 13:24     ` Kirill A. Shutemov
  2022-11-18 20:18   ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Linus Torvalds
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-18  9:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

It's hard to add a page_add_anon_rmap() into __split_huge_pmd_locked()'s
HPAGE_PMD_NR set_pte_at() loop, without wincing at the "freeze" case's
HPAGE_PMD_NR page_remove_rmap() loop below it.

It's just a mistake to add rmaps in the "freeze" (insert migration entries
prior to splitting huge page) case: the pmd_migration case already avoids
doing that, so just follow its lead.  page_ref_add() versus put_page()
likewise.  But why is one more put_page() needed in the "freeze" case?
Because it's removing the pmd rmap, already removed when pmd_migration
(and freeze and pmd_migration are mutually exclusive cases).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/huge_memory.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3dee8665c585..ab5ab1a013e1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2135,7 +2135,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		uffd_wp = pmd_uffd_wp(old_pmd);
 
 		VM_BUG_ON_PAGE(!page_count(page), page);
-		page_ref_add(page, HPAGE_PMD_NR - 1);
 
 		/*
 		 * Without "freeze", we'll simply split the PMD, propagating the
@@ -2155,6 +2154,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
 		if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
 			freeze = false;
+		if (!freeze)
+			page_ref_add(page, HPAGE_PMD_NR - 1);
 	}
 
 	/*
@@ -2210,27 +2211,21 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_mksoft_dirty(entry);
 			if (uffd_wp)
 				entry = pte_mkuffd_wp(entry);
+			page_add_anon_rmap(page + i, vma, addr, false);
 		}
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
-		if (!pmd_migration)
-			page_add_anon_rmap(page + i, vma, addr, false);
 		pte_unmap(pte);
 	}
 
 	if (!pmd_migration)
 		page_remove_rmap(page, vma, true);
+	if (freeze)
+		put_page(page);
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
-
-	if (freeze) {
-		for (i = 0; i < HPAGE_PMD_NR; i++) {
-			page_remove_rmap(page + i, vma, false);
-			put_page(page + i);
-		}
-	}
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-18  9:08 ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Hugh Dickins
                     ` (2 preceding siblings ...)
  2022-11-18  9:16   ` [PATCH 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked() Hugh Dickins
@ 2022-11-18 20:18   ` Linus Torvalds
  2022-11-18 20:42     ` Johannes Weiner
  2022-11-18 20:51     ` Hugh Dickins
  2022-11-21 16:59   ` Shakeel Butt
  2022-11-22  9:38   ` [PATCH v2 " Hugh Dickins
  5 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2022-11-18 20:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Fri, Nov 18, 2022 at 1:08 AM Hugh Dickins <hughd@google.com> wrote:
>
> Linus was underwhelmed by the earlier compound mapcounts series:
> this series builds on top of it (as in next-20221117) to follow
> up on his suggestions - except rmap.c still using lock_page_memcg(),
> since I hesitate to steal the pleasure of deletion from Johannes.

This looks good to me. Particularly 2/3 made me go "Aww, yes" but the
overall line removal stats look good too.

That said, I only looked at the patches, and not the end result
itself. But not having the bit spin lock is, I think, a huge
improvement.

I do wonder if this should be now just merged with your previous
series - it looks a bit odd how your previous series adds that
bitlock, only for it to be immediately removed.

But if you think the logic ends up being easier to follow this way as
two separate patch series, I guess I don't care.

And the memcg locking is entirely a separate issue, and I hope
Johannes will deal with that.

Thanks,
              Linus


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-18 20:18   ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Linus Torvalds
@ 2022-11-18 20:42     ` Johannes Weiner
  2022-11-18 20:51     ` Hugh Dickins
  1 sibling, 0 replies; 54+ messages in thread
From: Johannes Weiner @ 2022-11-18 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Matthew Wilcox,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Sidhartha Kumar, Muchun Song,
	Miaohe Lin, Naoya Horiguchi, Mina Almasry, James Houghton,
	Zach O'Keefe, linux-kernel, linux-mm

On Fri, Nov 18, 2022 at 12:18:42PM -0800, Linus Torvalds wrote:
> On Fri, Nov 18, 2022 at 1:08 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > Linus was underwhelmed by the earlier compound mapcounts series:
> > this series builds on top of it (as in next-20221117) to follow
> > up on his suggestions - except rmap.c still using lock_page_memcg(),
> > since I hesitate to steal the pleasure of deletion from Johannes.
> 
> This looks good to me. Particularly 2/3 made me go "Aww, yes" but the
> overall line removal stats look good too.
> 
> That said, I only looked at the patches, and not the end result
> itself. But not having the bit spin lock is, I think, a huge
> improvement.
> 
> I do wonder if this should be now just merged with your previous
> series - it looks a bit odd how your previous series adds that
> bitlock, only for it to be immediately removed.
> 
> But if you think the logic ends up being easier to follow this way as
> two separate patch series, I guess I don't care.
> 
> And the memcg locking is entirely a separate issue, and I hope
> Johannes will deal with that.

Yeah, I'll redo the removal on top of this series and resend it.

Thanks


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-18 20:18   ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Linus Torvalds
  2022-11-18 20:42     ` Johannes Weiner
@ 2022-11-18 20:51     ` Hugh Dickins
  2022-11-18 22:03       ` Andrew Morton
  1 sibling, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-18 20:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Fri, 18 Nov 2022, Linus Torvalds wrote:
> On Fri, Nov 18, 2022 at 1:08 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > Linus was underwhelmed by the earlier compound mapcounts series:
> > this series builds on top of it (as in next-20221117) to follow
> > up on his suggestions - except rmap.c still using lock_page_memcg(),
> > since I hesitate to steal the pleasure of deletion from Johannes.
> 
> This looks good to me. Particularly 2/3 made me go "Aww, yes" but the
> overall line removal stats look good too.
> 
> That said, I only looked at the patches, and not the end result
> itself. But not having the bit spin lock is, I think, a huge
> improvement.

Great, thanks a lot for looking through.

> 
> I do wonder if this should be now just merged with your previous
> series - it looks a bit odd how your previous series adds that
> bitlock, only for it to be immediately removed.
> 
> But if you think the logic ends up being easier to follow this way as
> two separate patch series, I guess I don't care.

I rather like having its evolution on record there, but that might just
be my sentimentality + laziness.  Kirill did a grand job of reviewing
the first series: I think that, at least for now, it would be easier
for people to review the changes if the two series are not recombined.

But the first series has not yet graduated from mm-unstable,
so if Andrew and/or Kirill also prefer to have them combined into one
bit_spin_lock-less series, that I can do.  (And the end result should be
identical, so would not complicate Johannes's lock_page_memcg() excision.)

Hugh

> 
> And the memcg locking is entirely a separate issue, and I hope
> Johannes will deal with that.
> 
> Thanks,
>               Linus


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-18 20:51     ` Hugh Dickins
@ 2022-11-18 22:03       ` Andrew Morton
  2022-11-18 22:07         ` Linus Torvalds
  2022-11-18 22:10         ` Hugh Dickins
  0 siblings, 2 replies; 54+ messages in thread
From: Andrew Morton @ 2022-11-18 22:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Fri, 18 Nov 2022 12:51:09 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:

> But the first series has not yet graduated from mm-unstable,
> so if Andrew and/or Kirill also prefer to have them combined into one
> bit_spin_lock-less series, that I can do.  (And the end result should be
> identical, so would not complicate Johannes's lock_page_memcg() excision.)

I'd prefer that approach.  It's -rc5 and the earlier "mm,huge,rmap:
unify and speed up compound mapcounts" series has had some testing. 
I'd prefer not to toss it all out and start again.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-18 22:03       ` Andrew Morton
@ 2022-11-18 22:07         ` Linus Torvalds
  2022-11-18 22:10         ` Hugh Dickins
  1 sibling, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2022-11-18 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Fri, Nov 18, 2022 at 2:03 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> I'd prefer that approach.

The "that approach" is a bit ambiguous here, particularly considering
how you quoted things.

But I think from the context you meant "keep them as two separate
series, even if the second undoes part of the first and does it
differently".

And that's fine. Even if it's maybe a bit odd to introduce that
locking that then goes away, I can't argue with "the first series was
already reviewed and has gone through a fair amount of testing".

             Linus


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-18 22:03       ` Andrew Morton
  2022-11-18 22:07         ` Linus Torvalds
@ 2022-11-18 22:10         ` Hugh Dickins
  2022-11-18 22:23           ` Andrew Morton
  1 sibling, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-18 22:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Fri, 18 Nov 2022, Andrew Morton wrote:
> On Fri, 18 Nov 2022 12:51:09 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:
> 
> > But the first series has not yet graduated from mm-unstable,
> > so if Andrew and/or Kirill also prefer to have them combined into one
> > bit_spin_lock-less series, that I can do.  (And the end result should be
> > identical, so would not complicate Johannes's lock_page_memcg() excision.)
> 
> I'd prefer that approach.

I think you're saying that you prefer the other approach, to keep the
two series separate (second immediately after the first, or not, doesn't
matter), rather than combined into one bit_spin_lock-less series.
Please clarify! Thanks,

Hugh

> It's -rc5 and the earlier "mm,huge,rmap:
> unify and speed up compound mapcounts" series has had some testing. 
> I'd prefer not to toss it all out and start again.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-18 22:10         ` Hugh Dickins
@ 2022-11-18 22:23           ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2022-11-18 22:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Fri, 18 Nov 2022 14:10:32 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:

> On Fri, 18 Nov 2022, Andrew Morton wrote:
> > On Fri, 18 Nov 2022 12:51:09 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:
> > 
> > > But the first series has not yet graduated from mm-unstable,
> > > so if Andrew and/or Kirill also prefer to have them combined into one
> > > bit_spin_lock-less series, that I can do.  (And the end result should be
> > > identical, so would not complicate Johannes's lock_page_memcg() excision.)
> > 
> > I'd prefer that approach.
> 
> I think you're saying that you prefer the other approach, to keep the
> two series separate (second immediately after the first, or not, doesn't
> matter), rather than combined into one bit_spin_lock-less series.
> Please clarify! Thanks,

Yes, two separate series.   Apologies for the confuddling.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
  2022-11-18  9:12   ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
@ 2022-11-19  0:12     ` Yu Zhao
  2022-11-19  0:37       ` Hugh Dickins
  2022-11-21 12:36     ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Kirill A. Shutemov
  1 sibling, 1 reply; 54+ messages in thread
From: Yu Zhao @ 2022-11-19  0:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Fri, Nov 18, 2022 at 2:12 AM Hugh Dickins <hughd@google.com> wrote:

...

> @@ -1308,31 +1285,29 @@ void page_add_anon_rmap(struct page *page,
>
>         if (unlikely(PageKsm(page)))
>                 lock_page_memcg(page);
> -       else
> -               VM_BUG_ON_PAGE(!PageLocked(page), page);
>
> -       if (likely(!PageCompound(page))) {
> +       if (likely(!compound /* page is mapped by PTE */)) {
>                 first = atomic_inc_and_test(&page->_mapcount);
>                 nr = first;
> +               if (first && PageCompound(page)) {
> +                       struct page *head = compound_head(page);
> +
> +                       lock_compound_mapcounts(head, &mapcounts);
> +                       mapcounts.subpages_mapcount++;
> +                       nr = !mapcounts.compound_mapcount;
> +                       unlock_compound_mapcounts(head, &mapcounts);
> +               }
> +       } else if (PageTransHuge(page)) {
> +               /* That test is redundant: it's for safety or to optimize out */
>
> -       } else if (compound && PageTransHuge(page)) {
>                 lock_compound_mapcounts(page, &mapcounts);
>                 first = !mapcounts.compound_mapcount;
>                 mapcounts.compound_mapcount++;
>                 if (first) {
> -                       nr = nr_pmdmapped = thp_nr_pages(page);
> -                       if (mapcounts.subpages_mapcount)
> -                               nr = nr_subpages_unmapped(page, nr_pmdmapped);
> +                       nr_pmdmapped = thp_nr_pages(page);
> +                       nr = nr_pmdmapped - mapcounts.subpages_mapcount;
>                 }
>                 unlock_compound_mapcounts(page, &mapcounts);
> -       } else {
> -               struct page *head = compound_head(page);
> -
> -               lock_compound_mapcounts(head, &mapcounts);
> -               mapcounts.subpages_mapcount++;
> -               first = subpage_mapcount_inc(page);
> -               nr = first && !mapcounts.compound_mapcount;
> -               unlock_compound_mapcounts(head, &mapcounts);
>         }
>
>         VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);

Hi Hugh, I got the following warning from the removed "else" branch.
Is it legit? Thanks.

mm/rmap.c:1236:13: warning: variable 'first' is used uninitialized
whenever 'if' condition is false [-Wsometimes-uninitialized]
        } else if (PageTransHuge(page)) {
                   ^~~~~~~~~~~~~~~~~~~
mm/rmap.c:1248:18: note: uninitialized use occurs here
        VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
                        ^~~~~


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
  2022-11-19  0:12     ` Yu Zhao
@ 2022-11-19  0:37       ` Hugh Dickins
  2022-11-19  1:35         ` [PATCH 1/3 fix] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages: fix Hugh Dickins
  0 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-19  0:37 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Fri, 18 Nov 2022, Yu Zhao wrote:
> On Fri, Nov 18, 2022 at 2:12 AM Hugh Dickins <hughd@google.com> wrote:
> 
> ...
> 
> > @@ -1308,31 +1285,29 @@ void page_add_anon_rmap(struct page *page,
...
> >
> >         VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
> 
> Hi Hugh, I got the following warning from the removed "else" branch.
> Is it legit? Thanks.
> 
> mm/rmap.c:1236:13: warning: variable 'first' is used uninitialized
> whenever 'if' condition is false [-Wsometimes-uninitialized]
>         } else if (PageTransHuge(page)) {
>                    ^~~~~~~~~~~~~~~~~~~
> mm/rmap.c:1248:18: note: uninitialized use occurs here
>         VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
>                         ^~~~~

Thanks a lot for that.  From the compiler's point of view, it is
certainly a legit warning.  From our point of view, it's unimportant,
because we know that page_add_anon_rmap() should only ever be called
with compound true when PageTransHuge(page) (and should never be
called with compound true when TRANSPARENT_HUGEPAGE is disabled):
so it's a "system error" if first is uninitialized there.

But none of us want a compiler warning: I'll follow up with a fix
patch, when I've decided whether it's better initialized to true
or to false in the impossible case...

Although the same pattern is used in other functions, this is the
only one of them which goes on to use "first" or "last" afterwards.

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 1/3 fix] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages: fix
  2022-11-19  0:37       ` Hugh Dickins
@ 2022-11-19  1:35         ` Hugh Dickins
  2022-11-21 12:38           ` Kirill A. Shutemov
  0 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-19  1:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yu Zhao, Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

Yu Zhao reports compiler warning in page_add_anon_rmap():

mm/rmap.c:1236:13: warning: variable 'first' is used uninitialized
whenever 'if' condition is false [-Wsometimes-uninitialized]
        } else if (PageTransHuge(page)) {
                   ^~~~~~~~~~~~~~~~~~~
mm/rmap.c:1248:18: note: uninitialized use occurs here
        VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
                        ^~~~~

We do need to fix that, even though it's only uninitialized in an
impossible condition: I've chosen to initialize "first" true, to
minimize the BUGs it might then hit; but you could just as well
choose to initialize it false, to maximize the BUGs it might hit.

Reported-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 66be8cae640f..25b720d5ba17 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1281,7 +1281,7 @@ void page_add_anon_rmap(struct page *page,
 	struct compound_mapcounts mapcounts;
 	int nr = 0, nr_pmdmapped = 0;
 	bool compound = flags & RMAP_COMPOUND;
-	bool first;
+	bool first = true;
 
 	if (unlikely(PageKsm(page)))
 		lock_page_memcg(page);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
  2022-11-18  9:12   ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
  2022-11-19  0:12     ` Yu Zhao
@ 2022-11-21 12:36     ` Kirill A. Shutemov
  2022-11-22  9:03       ` Hugh Dickins
  1 sibling, 1 reply; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-11-21 12:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Matthew Wilcox,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Sidhartha Kumar, Muchun Song,
	Miaohe Lin, Naoya Horiguchi, Mina Almasry, James Houghton,
	Zach O'Keefe, linux-kernel, linux-mm

On Fri, Nov 18, 2022 at 01:12:03AM -0800, Hugh Dickins wrote:
> Following suggestion from Linus, instead of counting every PTE map of a
> compound page in subpages_mapcount, just count how many of its subpages
> are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED
> and NR_FILE_MAPPED stats, without any need for a locked scan of subpages;
> and requires updating the count less often.
> 
> This does then revert total_mapcount() and folio_mapcount() to needing a
> scan of subpages; but they are inherently racy, and need no locking, so
> Linus is right that the scans are much better done there.  Plus (unlike
> in 6.1 and previous) subpages_mapcount lets us avoid the scan in the
> common case of no PTE maps.  And page_mapped() and folio_mapped() remain
> scanless and just as efficient with the new meaning of subpages_mapcount:
> those are the functions which I most wanted to remove the scan from.
> 
> The updated page_dup_compound_rmap() is no longer suitable for use by
> anon THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be
> used for that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.
> 
> Evidence is that this way goes slightly faster than the previous
> implementation for most cases; but significantly faster in the (now
> scanless) pmds after ptes case, which started out at 870ms and was
> brought down to 495ms by the previous series, now takes around 105ms.
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

A few minor nitpicks below.

> ---
>  Documentation/mm/transhuge.rst |   3 +-
>  include/linux/mm.h             |  52 ++++++-----
>  include/linux/rmap.h           |   8 +-
>  mm/huge_memory.c               |   2 +-
>  mm/rmap.c                      | 155 ++++++++++++++-------------------
>  5 files changed, 103 insertions(+), 117 deletions(-)
> 
> diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
> index 1e2a637cc607..af4c9d70321d 100644
> --- a/Documentation/mm/transhuge.rst
> +++ b/Documentation/mm/transhuge.rst
> @@ -122,7 +122,8 @@ pages:
>  
>    - map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount
>      on relevant sub-page of the compound page, and also increment/decrement
> -    ->subpages_mapcount, stored in first tail page of the compound page.
> +    ->subpages_mapcount, stored in first tail page of the compound page, when
> +    _mapcount goes from -1 to 0 or 0 to -1: counting sub-pages mapped by PTE.
>      In order to have race-free accounting of sub-pages mapped, changes to
>      sub-page ->_mapcount, ->subpages_mapcount and ->compound_mapcount are
>      are all locked by bit_spin_lock of PG_locked in the first tail ->flags.
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8fe6276d8cc2..c9e46d4d46f2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -828,7 +828,7 @@ static inline int head_compound_mapcount(struct page *head)
>  }
>  
>  /*
> - * Sum of mapcounts of sub-pages, does not include compound mapcount.
> + * Number of sub-pages mapped by PTE, does not include compound mapcount.
>   * Must be called only on head of compound page.
>   */
>  static inline int head_subpages_mapcount(struct page *head)
> @@ -864,23 +864,7 @@ static inline int page_mapcount(struct page *page)
>  	return head_compound_mapcount(page) + mapcount;
>  }
>  
> -static inline int total_mapcount(struct page *page)
> -{
> -	if (likely(!PageCompound(page)))
> -		return atomic_read(&page->_mapcount) + 1;
> -	page = compound_head(page);
> -	return head_compound_mapcount(page) + head_subpages_mapcount(page);
> -}
> -
> -/*
> - * Return true if this page is mapped into pagetables.
> - * For compound page it returns true if any subpage of compound page is mapped,
> - * even if this particular subpage is not itself mapped by any PTE or PMD.
> - */
> -static inline bool page_mapped(struct page *page)
> -{
> -	return total_mapcount(page) > 0;
> -}
> +int total_compound_mapcount(struct page *head);
>  
>  /**
>   * folio_mapcount() - Calculate the number of mappings of this folio.
> @@ -897,8 +881,20 @@ static inline int folio_mapcount(struct folio *folio)
>  {
>  	if (likely(!folio_test_large(folio)))
>  		return atomic_read(&folio->_mapcount) + 1;
> -	return atomic_read(folio_mapcount_ptr(folio)) + 1 +
> -		atomic_read(folio_subpages_mapcount_ptr(folio));
> +	return total_compound_mapcount(&folio->page);
> +}
> +
> +static inline int total_mapcount(struct page *page)
> +{
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) + 1;
> +	return total_compound_mapcount(compound_head(page));
> +}
> +
> +static inline bool folio_large_is_mapped(struct folio *folio)
> +{
> +	return atomic_read(folio_mapcount_ptr(folio)) +
> +		atomic_read(folio_subpages_mapcount_ptr(folio)) >= 0;
>  }
>  
>  /**
> @@ -909,7 +905,21 @@ static inline int folio_mapcount(struct folio *folio)
>   */
>  static inline bool folio_mapped(struct folio *folio)
>  {
> -	return folio_mapcount(folio) > 0;
> +	if (likely(!folio_test_large(folio)))
> +		return atomic_read(&folio->_mapcount) >= 0;
> +	return folio_large_is_mapped(folio);
> +}
> +
> +/*
> + * Return true if this page is mapped into pagetables.
> + * For compound page it returns true if any sub-page of compound page is mapped,
> + * even if this particular sub-page is not itself mapped by any PTE or PMD.
> + */
> +static inline bool page_mapped(struct page *page)
> +{
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) >= 0;
> +	return folio_large_is_mapped(page_folio(page));
>  }
>  
>  static inline struct page *virt_to_head_page(const void *x)
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 011a7530dc76..860f558126ac 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -204,14 +204,14 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>  void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>  		unsigned long address);
>  
> -void page_dup_compound_rmap(struct page *page, bool compound);
> +void page_dup_compound_rmap(struct page *page);
>  
>  static inline void page_dup_file_rmap(struct page *page, bool compound)
>  {
> -	if (PageCompound(page))
> -		page_dup_compound_rmap(page, compound);
> -	else
> +	if (likely(!compound /* page is mapped by PTE */))

I'm not a fan of this kind of comment.

Maybe move it above the line (here and below)?

>  		atomic_inc(&page->_mapcount);
> +	else
> +		page_dup_compound_rmap(page);
>  }
>  
>  /**
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 30056efc79ad..3dee8665c585 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2215,7 +2215,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		BUG_ON(!pte_none(*pte));
>  		set_pte_at(mm, addr, pte, entry);
>  		if (!pmd_migration)
> -			page_dup_compound_rmap(page + i, false);
> +			page_add_anon_rmap(page + i, vma, addr, false);
>  		pte_unmap(pte);
>  	}
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4833d28c5e1a..66be8cae640f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1117,55 +1117,36 @@ static void unlock_compound_mapcounts(struct page *head,
>  	bit_spin_unlock(PG_locked, &head[1].flags);
>  }
>  
> -/*
> - * When acting on a compound page under lock_compound_mapcounts(), avoid the
> - * unnecessary overhead of an actual atomic operation on its subpage mapcount.
> - * Return true if this is the first increment or the last decrement
> - * (remembering that page->_mapcount -1 represents logical mapcount 0).
> - */
> -static bool subpage_mapcount_inc(struct page *page)
> -{
> -	int orig_mapcount = atomic_read(&page->_mapcount);
> -
> -	atomic_set(&page->_mapcount, orig_mapcount + 1);
> -	return orig_mapcount < 0;
> -}
> -
> -static bool subpage_mapcount_dec(struct page *page)
> -{
> -	int orig_mapcount = atomic_read(&page->_mapcount);
> -
> -	atomic_set(&page->_mapcount, orig_mapcount - 1);
> -	return orig_mapcount == 0;
> -}
> -
> -/*
> - * When mapping a THP's first pmd, or unmapping its last pmd, if that THP
> - * also has pte mappings, then those must be discounted: in order to maintain
> - * NR_ANON_MAPPED and NR_FILE_MAPPED statistics exactly, without any drift,
> - * and to decide when an anon THP should be put on the deferred split queue.
> - * This function must be called between lock_ and unlock_compound_mapcounts().
> - */
> -static int nr_subpages_unmapped(struct page *head, int nr_subpages)
> +int total_compound_mapcount(struct page *head)
>  {
> -	int nr = nr_subpages;
> +	int mapcount = head_compound_mapcount(head);
> +	int nr_subpages;
>  	int i;
>  
> -	/* Discount those subpages mapped by pte */
> +	/* In the common case, avoid the loop when no subpages mapped by PTE */
> +	if (head_subpages_mapcount(head) == 0)
> +		return mapcount;
> +	/*
> +	 * Add all the PTE mappings of those subpages mapped by PTE.
> +	 * Limit the loop, knowing that only subpages_mapcount are mapped?
> +	 * Perhaps: given all the raciness, that may be a good or a bad idea.
> +	 */
> +	nr_subpages = thp_nr_pages(head);
>  	for (i = 0; i < nr_subpages; i++)
> -		if (atomic_read(&head[i]._mapcount) >= 0)
> -			nr--;
> -	return nr;
> +		mapcount += atomic_read(&head[i]._mapcount);
> +
> +	/* But each of those _mapcounts was based on -1 */
> +	mapcount += nr_subpages;
> +	return mapcount;
>  }
>  
>  /*
> - * page_dup_compound_rmap(), used when copying mm, or when splitting pmd,
> + * page_dup_compound_rmap(), used when copying mm,
>   * provides a simple example of using lock_ and unlock_compound_mapcounts().
>   */
> -void page_dup_compound_rmap(struct page *page, bool compound)
> +void page_dup_compound_rmap(struct page *head)
>  {
>  	struct compound_mapcounts mapcounts;
> -	struct page *head;
>  
>  	/*
>  	 * Hugetlb pages could use lock_compound_mapcounts(), like THPs do;
> @@ -1176,20 +1157,16 @@ void page_dup_compound_rmap(struct page *page, bool compound)
>  	 * Note that hugetlb does not call page_add_file_rmap():
>  	 * here is where hugetlb shared page mapcount is raised.
>  	 */
> -	if (PageHuge(page)) {
> -		atomic_inc(compound_mapcount_ptr(page));
> -		return;
> -	}
> +	if (PageHuge(head)) {
> +		atomic_inc(compound_mapcount_ptr(head));
>  

Remove the newline?

> -	head = compound_head(page);
> -	lock_compound_mapcounts(head, &mapcounts);
> -	if (compound) {
> +	} else if (PageTransHuge(head)) {
> +		/* That test is redundant: it's for safety or to optimize out */
> +
> +		lock_compound_mapcounts(head, &mapcounts);
>  		mapcounts.compound_mapcount++;
> -	} else {
> -		mapcounts.subpages_mapcount++;
> -		subpage_mapcount_inc(page);
> +		unlock_compound_mapcounts(head, &mapcounts);
>  	}
> -	unlock_compound_mapcounts(head, &mapcounts);
>  }
>  
>  /**
> @@ -1308,31 +1285,29 @@ void page_add_anon_rmap(struct page *page,
>  
>  	if (unlikely(PageKsm(page)))
>  		lock_page_memcg(page);
> -	else
> -		VM_BUG_ON_PAGE(!PageLocked(page), page);
>  
> -	if (likely(!PageCompound(page))) {
> +	if (likely(!compound /* page is mapped by PTE */)) {
>  		first = atomic_inc_and_test(&page->_mapcount);
>  		nr = first;
> +		if (first && PageCompound(page)) {
> +			struct page *head = compound_head(page);
> +
> +			lock_compound_mapcounts(head, &mapcounts);
> +			mapcounts.subpages_mapcount++;
> +			nr = !mapcounts.compound_mapcount;
> +			unlock_compound_mapcounts(head, &mapcounts);
> +		}
> +	} else if (PageTransHuge(page)) {
> +		/* That test is redundant: it's for safety or to optimize out */
>  
> -	} else if (compound && PageTransHuge(page)) {
>  		lock_compound_mapcounts(page, &mapcounts);
>  		first = !mapcounts.compound_mapcount;
>  		mapcounts.compound_mapcount++;
>  		if (first) {
> -			nr = nr_pmdmapped = thp_nr_pages(page);
> -			if (mapcounts.subpages_mapcount)
> -				nr = nr_subpages_unmapped(page, nr_pmdmapped);
> +			nr_pmdmapped = thp_nr_pages(page);
> +			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
>  		}
>  		unlock_compound_mapcounts(page, &mapcounts);
> -	} else {
> -		struct page *head = compound_head(page);
> -
> -		lock_compound_mapcounts(head, &mapcounts);
> -		mapcounts.subpages_mapcount++;
> -		first = subpage_mapcount_inc(page);
> -		nr = first && !mapcounts.compound_mapcount;
> -		unlock_compound_mapcounts(head, &mapcounts);
>  	}
>  
>  	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
> @@ -1411,28 +1386,28 @@ void page_add_file_rmap(struct page *page,
>  	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
>  	lock_page_memcg(page);
>  
> -	if (likely(!PageCompound(page))) {
> +	if (likely(!compound /* page is mapped by PTE */)) {
>  		first = atomic_inc_and_test(&page->_mapcount);
>  		nr = first;
> +		if (first && PageCompound(page)) {
> +			struct page *head = compound_head(page);
> +
> +			lock_compound_mapcounts(head, &mapcounts);
> +			mapcounts.subpages_mapcount++;
> +			nr = !mapcounts.compound_mapcount;
> +			unlock_compound_mapcounts(head, &mapcounts);
> +		}
> +	} else if (PageTransHuge(page)) {
> +		/* That test is redundant: it's for safety or to optimize out */
>  
> -	} else if (compound && PageTransHuge(page)) {
>  		lock_compound_mapcounts(page, &mapcounts);
>  		first = !mapcounts.compound_mapcount;
>  		mapcounts.compound_mapcount++;
>  		if (first) {
> -			nr = nr_pmdmapped = thp_nr_pages(page);
> -			if (mapcounts.subpages_mapcount)
> -				nr = nr_subpages_unmapped(page, nr_pmdmapped);
> +			nr_pmdmapped = thp_nr_pages(page);
> +			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
>  		}
>  		unlock_compound_mapcounts(page, &mapcounts);
> -	} else {
> -		struct page *head = compound_head(page);
> -
> -		lock_compound_mapcounts(head, &mapcounts);
> -		mapcounts.subpages_mapcount++;
> -		first = subpage_mapcount_inc(page);
> -		nr = first && !mapcounts.compound_mapcount;
> -		unlock_compound_mapcounts(head, &mapcounts);
>  	}
>  
>  	if (nr_pmdmapped)
> @@ -1472,28 +1447,28 @@ void page_remove_rmap(struct page *page,
>  	lock_page_memcg(page);
>  
>  	/* page still mapped by someone else? */
> -	if (likely(!PageCompound(page))) {
> +	if (likely(!compound /* page is mapped by PTE */)) {
>  		last = atomic_add_negative(-1, &page->_mapcount);
>  		nr = last;
> +		if (last && PageCompound(page)) {
> +			struct page *head = compound_head(page);
> +
> +			lock_compound_mapcounts(head, &mapcounts);
> +			mapcounts.subpages_mapcount--;
> +			nr = !mapcounts.compound_mapcount;
> +			unlock_compound_mapcounts(head, &mapcounts);
> +		}
> +	} else if (PageTransHuge(page)) {
> +		/* That test is redundant: it's for safety or to optimize out */
>  
> -	} else if (compound && PageTransHuge(page)) {
>  		lock_compound_mapcounts(page, &mapcounts);
>  		mapcounts.compound_mapcount--;
>  		last = !mapcounts.compound_mapcount;
>  		if (last) {
> -			nr = nr_pmdmapped = thp_nr_pages(page);
> -			if (mapcounts.subpages_mapcount)
> -				nr = nr_subpages_unmapped(page, nr_pmdmapped);
> +			nr_pmdmapped = thp_nr_pages(page);
> +			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
>  		}
>  		unlock_compound_mapcounts(page, &mapcounts);
> -	} else {
> -		struct page *head = compound_head(page);
> -
> -		lock_compound_mapcounts(head, &mapcounts);
> -		mapcounts.subpages_mapcount--;
> -		last = subpage_mapcount_dec(page);
> -		nr = last && !mapcounts.compound_mapcount;
> -		unlock_compound_mapcounts(head, &mapcounts);
>  	}
>  
>  	if (nr_pmdmapped) {
> -- 
> 2.35.3
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3 fix] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages: fix
  2022-11-19  1:35         ` [PATCH 1/3 fix] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages: fix Hugh Dickins
@ 2022-11-21 12:38           ` Kirill A. Shutemov
  2022-11-22  9:13             ` Hugh Dickins
  0 siblings, 1 reply; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-11-21 12:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Yu Zhao, Linus Torvalds, Johannes Weiner,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Fri, Nov 18, 2022 at 05:35:05PM -0800, Hugh Dickins wrote:
> Yu Zhao reports compiler warning in page_add_anon_rmap():
> 
> mm/rmap.c:1236:13: warning: variable 'first' is used uninitialized
> whenever 'if' condition is false [-Wsometimes-uninitialized]
>         } else if (PageTransHuge(page)) {
>                    ^~~~~~~~~~~~~~~~~~~
> mm/rmap.c:1248:18: note: uninitialized use occurs here
>         VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
>                         ^~~~~
> 
> We do need to fix that, even though it's only uninitialized in an
> impossible condition: I've chosen to initialize "first" true, to
> minimize the BUGs it might then hit; but you could just as well
> choose to initialize it false, to maximize the BUGs it might hit.
> 
> Reported-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/rmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 66be8cae640f..25b720d5ba17 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1281,7 +1281,7 @@ void page_add_anon_rmap(struct page *page,
>  	struct compound_mapcounts mapcounts;
>  	int nr = 0, nr_pmdmapped = 0;
>  	bool compound = flags & RMAP_COMPOUND;
> -	bool first;
> +	bool first = true;
>  
>  	if (unlikely(PageKsm(page)))
>  		lock_page_memcg(page);

Another option is to drop the PageTransHuge() check that you already claim
is redundant.

Or have an else BUG() to catch cases where the helper is called with
compound=true on a non-THP page.
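
To illustrate that second option, here is a minimal userspace sketch (not
kernel code: add_rmap(), is_thp and abort() are stand-ins invented for the
example, with abort() playing the role of BUG()):

#include <stdbool.h>
#include <stdlib.h>

/*
 * Sketch only: both reachable branches set "first", and the trailing
 * else traps the caller bug (compound mapping requested for a non-THP
 * page) instead of leaving "first" uninitialized.
 */
static bool add_rmap(bool compound, bool is_thp)
{
	bool first;

	if (!compound)		/* page is mapped by PTE */
		first = true;	/* stands in for atomic_inc_and_test() */
	else if (is_thp)	/* compound mapping of a THP */
		first = true;	/* stands in for the compound mapcount */
	else
		abort();	/* the kernel would BUG() here */

	return first;
}

int main(void)
{
	return add_rmap(true, true) ? 0 : 1;
}

Since abort() is noreturn, the compiler can see that "first" is never used
uninitialized, so the warning goes away without needing the "= true"
initializer; BUG() would typically have the same effect.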

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped
  2022-11-18  9:14   ` [PATCH 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped Hugh Dickins
@ 2022-11-21 13:09     ` Kirill A. Shutemov
  2022-11-22  9:33       ` Hugh Dickins
  0 siblings, 1 reply; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-11-21 13:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Matthew Wilcox,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Sidhartha Kumar, Muchun Song,
	Miaohe Lin, Naoya Horiguchi, Mina Almasry, James Houghton,
	Zach O'Keefe, linux-kernel, linux-mm

On Fri, Nov 18, 2022 at 01:14:17AM -0800, Hugh Dickins wrote:
> Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now?
> Yes.  Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
> but if we slightly abuse subpages_mapcount by additionally demanding that
> one bit be set there when the compound page is PMD-mapped, then a cascade
> of two atomic ops is able to maintain the stats without bit_spin_lock.

Yay! New home for PageDoubleMap()! :P

> This is harder to reason about than when bit_spin_locked, but I believe
> safe; and no drift in stats detected when testing.  When there are racing
> removes and adds, of course the sequence of operations is less well-
> defined; but each operation on subpages_mapcount is atomically good.
> What might be disastrous, is if subpages_mapcount could ever fleetingly
> appear negative: but the pte lock (or pmd lock) these rmap functions are
> called under, ensures that a last remove cannot race ahead of a first add.
> 
> Continue to make an exception for hugetlb (PageHuge) pages, though that
> exception can be easily removed by a further commit if necessary: leave
> subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
> carry on checking compound_mapcount too in folio_mapped(), page_mapped().
> 
> Evidence is that this way goes slightly faster than the previous
> implementation in all cases (pmds after ptes now taking around 103ms);
> and relieves us of worrying about contention on the bit_spin_lock.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Jokes aside, looks neat.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

As always, a few minor nits below.
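
Before the nits, an aside for readers following the arithmetic: the counting
scheme quoted above can be modelled with plain C11 atomics.  This is a
userspace sketch only, not the kernel implementation: the COMPOUND_MAPPED /
SUBPAGES_MAPPED values are taken from the patch, 512 is thp_nr_pages() for a
2MB THP with 4kB base pages, and it assumes a single PMD mapping rather than
tracking compound_mapcount itself.

#include <stdatomic.h>
#include <stdio.h>

#define COMPOUND_MAPPED	0x800000
#define SUBPAGES_MAPPED	(COMPOUND_MAPPED - 1)
#define NR_SUBPAGES	512	/* thp_nr_pages() for a 2MB THP */

static atomic_int subpages_mapcount;	/* the single per-THP counter */

/* PTE-map one sub-page: returns how much NR_*_MAPPED should grow */
static int pte_map_one(void)
{
	int nr = atomic_fetch_add_explicit(&subpages_mapcount, 1,
					   memory_order_relaxed) + 1;
	return !(nr & COMPOUND_MAPPED);	/* counts only if not PMD-mapped */
}

/* First (and here only) PMD mapping of the whole THP */
static int pmd_map(void)
{
	int nr = atomic_fetch_add_explicit(&subpages_mapcount,
					   COMPOUND_MAPPED,
					   memory_order_relaxed) + COMPOUND_MAPPED;
	return NR_SUBPAGES - (nr & SUBPAGES_MAPPED);
}

int main(void)
{
	int nr_mapped = 0;

	nr_mapped += pte_map_one();	/* total 1 */
	nr_mapped += pte_map_one();	/* total 2 */
	nr_mapped += pmd_map();		/* adds 510: total 512 */
	nr_mapped += pte_map_one();	/* adds 0: PMD already covers it */
	printf("nr_mapped = %d\n", nr_mapped);	/* prints 512 */
	return 0;
}

Each update is one relaxed atomic op, and the NR_ANON_MAPPED-style adjustment
falls out of the returned value, which is the point of the patch: no
bit_spin_lock needed.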

> ---
>  Documentation/mm/transhuge.rst |   7 +-
>  include/linux/mm.h             |  19 ++++-
>  include/linux/rmap.h           |  12 ++--
>  mm/debug.c                     |   2 +-
>  mm/rmap.c                      | 124 +++++++--------------------------
>  5 files changed, 52 insertions(+), 112 deletions(-)
> 
> diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
> index af4c9d70321d..ec3dc5b04226 100644
> --- a/Documentation/mm/transhuge.rst
> +++ b/Documentation/mm/transhuge.rst
> @@ -118,15 +118,14 @@ pages:
>      succeeds on tail pages.
>  
>    - map/unmap of PMD entry for the whole compound page increment/decrement
> -    ->compound_mapcount, stored in the first tail page of the compound page.
> +    ->compound_mapcount, stored in the first tail page of the compound page;
> +    and also increment/decrement ->subpages_mapcount (also in the first tail)
> +    by COMPOUND_MAPPED when compound_mapcount goes from -1 to 0 or 0 to -1.
>  
>    - map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount
>      on relevant sub-page of the compound page, and also increment/decrement
>      ->subpages_mapcount, stored in first tail page of the compound page, when
>      _mapcount goes from -1 to 0 or 0 to -1: counting sub-pages mapped by PTE.
> -    In order to have race-free accounting of sub-pages mapped, changes to
> -    sub-page ->_mapcount, ->subpages_mapcount and ->compound_mapcount are
> -    are all locked by bit_spin_lock of PG_locked in the first tail ->flags.
>  
>  split_huge_page internally has to distribute the refcounts in the head
>  page to the tail pages before clearing all PG_head/tail bits from the page
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c9e46d4d46f2..a2bfb5e4be62 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -828,7 +828,16 @@ static inline int head_compound_mapcount(struct page *head)
>  }
>  
>  /*
> - * Number of sub-pages mapped by PTE, does not include compound mapcount.
> + * If a 16GB hugetlb page were mapped by PTEs of all of its 4kB sub-pages,
> + * its subpages_mapcount would be 0x400000: choose the COMPOUND_MAPPED bit
> + * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
> + * leaves subpages_mapcount at 0, but avoid surprise if it participates later.
> + */
> +#define COMPOUND_MAPPED	0x800000
> +#define SUBPAGES_MAPPED	(COMPOUND_MAPPED - 1)
> +
> +/*
> + * Number of sub-pages mapped by PTE, plus COMPOUND_MAPPED if compound mapped.
>   * Must be called only on head of compound page.
>   */
>  static inline int head_subpages_mapcount(struct page *head)
> @@ -893,8 +902,12 @@ static inline int total_mapcount(struct page *page)
>  
>  static inline bool folio_large_is_mapped(struct folio *folio)
>  {
> -	return atomic_read(folio_mapcount_ptr(folio)) +
> -		atomic_read(folio_subpages_mapcount_ptr(folio)) >= 0;
> +	/*
> +	 * Reading folio_mapcount_ptr() below could be omitted if hugetlb
> +	 * participated in incrementing subpages_mapcount when compound mapped.
> +	 */
> +	return atomic_read(folio_mapcount_ptr(folio)) >= 0 ||
> +		atomic_read(folio_subpages_mapcount_ptr(folio)) > 0;

Maybe check folio_subpages_mapcount_ptr() first? It would avoid the
folio_mapcount_ptr() read for everything but hugetlb.
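
A sketch of that reordering, for illustration (untested; it just swaps the
operands so the || short-circuits on the subpages count):

static inline bool folio_large_is_mapped(struct folio *folio)
{
	/*
	 * With COMPOUND_MAPPED folded into subpages_mapcount, the second
	 * read is only reached for hugetlb (or a fully unmapped folio).
	 */
	return atomic_read(folio_subpages_mapcount_ptr(folio)) > 0 ||
		atomic_read(folio_mapcount_ptr(folio)) >= 0;
}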

>  }
>  
>  /**
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 860f558126ac..bd3504d11b15 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -204,14 +204,14 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>  void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>  		unsigned long address);
>  
> -void page_dup_compound_rmap(struct page *page);
> +static inline void __page_dup_rmap(struct page *page, bool compound)
> +{
> +	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
> +}
>  
>  static inline void page_dup_file_rmap(struct page *page, bool compound)
>  {
> -	if (likely(!compound /* page is mapped by PTE */))
> -		atomic_inc(&page->_mapcount);
> -	else
> -		page_dup_compound_rmap(page);
> +	__page_dup_rmap(page, compound);
>  }
>  
>  /**
> @@ -260,7 +260,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
>  	 * the page R/O into both processes.
>  	 */
>  dup:
> -	page_dup_file_rmap(page, compound);
> +	__page_dup_rmap(page, compound);
>  	return 0;
>  }
>  
> diff --git a/mm/debug.c b/mm/debug.c
> index 7f8e5f744e42..1ef2ff6a05cb 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -97,7 +97,7 @@ static void __dump_page(struct page *page)
>  		pr_warn("head:%p order:%u compound_mapcount:%d subpages_mapcount:%d compound_pincount:%d\n",
>  				head, compound_order(head),
>  				head_compound_mapcount(head),
> -				head_subpages_mapcount(head),
> +				head_subpages_mapcount(head) & SUBPAGES_MAPPED,

Looks like applying the SUBPAGES_MAPPED mask belongs in the
head_subpages_mapcount() helper, not in the caller.
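
For instance (sketch only, folding the mask into the existing helper):

static inline int head_subpages_mapcount(struct page *head)
{
	return atomic_read(subpages_mapcount_ptr(head)) & SUBPAGES_MAPPED;
}

with __dump_page() then printing head_subpages_mapcount(head) directly.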

>  				head_compound_pincount(head));
>  	}
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 66be8cae640f..5e4ce0a6d6f1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1085,38 +1085,6 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
>  	return page_vma_mkclean_one(&pvmw);
>  }
>  
> -struct compound_mapcounts {
> -	unsigned int compound_mapcount;
> -	unsigned int subpages_mapcount;
> -};
> -
> -/*
> - * lock_compound_mapcounts() first locks, then copies subpages_mapcount and
> - * compound_mapcount from head[1].compound_mapcount and subpages_mapcount,
> - * converting from struct page's internal representation to logical count
> - * (that is, adding 1 to compound_mapcount to hide its offset by -1).
> - */
> -static void lock_compound_mapcounts(struct page *head,
> -		struct compound_mapcounts *local)
> -{
> -	bit_spin_lock(PG_locked, &head[1].flags);
> -	local->compound_mapcount = atomic_read(compound_mapcount_ptr(head)) + 1;
> -	local->subpages_mapcount = atomic_read(subpages_mapcount_ptr(head));
> -}
> -
> -/*
> - * After caller has updated subpage._mapcount, local subpages_mapcount and
> - * local compound_mapcount, as necessary, unlock_compound_mapcounts() converts
> - * and copies them back to the compound head[1] fields, and then unlocks.
> - */
> -static void unlock_compound_mapcounts(struct page *head,
> -		struct compound_mapcounts *local)
> -{
> -	atomic_set(compound_mapcount_ptr(head), local->compound_mapcount - 1);
> -	atomic_set(subpages_mapcount_ptr(head), local->subpages_mapcount);
> -	bit_spin_unlock(PG_locked, &head[1].flags);
> -}
> -
>  int total_compound_mapcount(struct page *head)
>  {
>  	int mapcount = head_compound_mapcount(head);
> @@ -1124,7 +1092,7 @@ int total_compound_mapcount(struct page *head)
>  	int i;
>  
>  	/* In the common case, avoid the loop when no subpages mapped by PTE */
> -	if (head_subpages_mapcount(head) == 0)
> +	if ((head_subpages_mapcount(head) & SUBPAGES_MAPPED) == 0)
>  		return mapcount;
>  	/*
>  	 * Add all the PTE mappings of those subpages mapped by PTE.
> @@ -1140,35 +1108,6 @@ int total_compound_mapcount(struct page *head)
>  	return mapcount;
>  }
>  
> -/*
> - * page_dup_compound_rmap(), used when copying mm,
> - * provides a simple example of using lock_ and unlock_compound_mapcounts().
> - */
> -void page_dup_compound_rmap(struct page *head)
> -{
> -	struct compound_mapcounts mapcounts;
> -
> -	/*
> -	 * Hugetlb pages could use lock_compound_mapcounts(), like THPs do;
> -	 * but at present they are still being managed by atomic operations:
> -	 * which are likely to be somewhat faster, so don't rush to convert
> -	 * them over without evaluating the effect.
> -	 *
> -	 * Note that hugetlb does not call page_add_file_rmap():
> -	 * here is where hugetlb shared page mapcount is raised.
> -	 */
> -	if (PageHuge(head)) {
> -		atomic_inc(compound_mapcount_ptr(head));
> -
> -	} else if (PageTransHuge(head)) {
> -		/* That test is redundant: it's for safety or to optimize out */
> -
> -		lock_compound_mapcounts(head, &mapcounts);
> -		mapcounts.compound_mapcount++;
> -		unlock_compound_mapcounts(head, &mapcounts);
> -	}
> -}
> -
>  /**
>   * page_move_anon_rmap - move a page to our anon_vma
>   * @page:	the page to move to our anon_vma
> @@ -1278,7 +1217,7 @@ static void __page_check_anon_rmap(struct page *page,
>  void page_add_anon_rmap(struct page *page,
>  	struct vm_area_struct *vma, unsigned long address, rmap_t flags)
>  {
> -	struct compound_mapcounts mapcounts;
> +	atomic_t *mapped;
>  	int nr = 0, nr_pmdmapped = 0;
>  	bool compound = flags & RMAP_COMPOUND;
>  	bool first;
> @@ -1290,24 +1229,20 @@ void page_add_anon_rmap(struct page *page,
>  		first = atomic_inc_and_test(&page->_mapcount);
>  		nr = first;
>  		if (first && PageCompound(page)) {
> -			struct page *head = compound_head(page);
> -
> -			lock_compound_mapcounts(head, &mapcounts);
> -			mapcounts.subpages_mapcount++;
> -			nr = !mapcounts.compound_mapcount;
> -			unlock_compound_mapcounts(head, &mapcounts);
> +			mapped = subpages_mapcount_ptr(compound_head(page));
> +			nr = atomic_inc_return_relaxed(mapped);
> +			nr = !(nr & COMPOUND_MAPPED);
>  		}
>  	} else if (PageTransHuge(page)) {
>  		/* That test is redundant: it's for safety or to optimize out */
>  
> -		lock_compound_mapcounts(page, &mapcounts);
> -		first = !mapcounts.compound_mapcount;
> -		mapcounts.compound_mapcount++;
> +		first = atomic_inc_and_test(compound_mapcount_ptr(page));
>  		if (first) {
> +			mapped = subpages_mapcount_ptr(page);
> +			nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
>  			nr_pmdmapped = thp_nr_pages(page);
> -			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
> +			nr = nr_pmdmapped - (nr & SUBPAGES_MAPPED);
>  		}
> -		unlock_compound_mapcounts(page, &mapcounts);
>  	}
>  
>  	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
> @@ -1360,6 +1295,7 @@ void page_add_new_anon_rmap(struct page *page,
>  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>  		/* increment count (starts at -1) */
>  		atomic_set(compound_mapcount_ptr(page), 0);
> +		atomic_set(subpages_mapcount_ptr(page), COMPOUND_MAPPED);
>  		nr = thp_nr_pages(page);
>  		__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
>  	}
> @@ -1379,7 +1315,7 @@ void page_add_new_anon_rmap(struct page *page,
>  void page_add_file_rmap(struct page *page,
>  	struct vm_area_struct *vma, bool compound)
>  {
> -	struct compound_mapcounts mapcounts;
> +	atomic_t *mapped;
>  	int nr = 0, nr_pmdmapped = 0;
>  	bool first;
>  
> @@ -1390,24 +1326,20 @@ void page_add_file_rmap(struct page *page,
>  		first = atomic_inc_and_test(&page->_mapcount);
>  		nr = first;
>  		if (first && PageCompound(page)) {
> -			struct page *head = compound_head(page);
> -
> -			lock_compound_mapcounts(head, &mapcounts);
> -			mapcounts.subpages_mapcount++;
> -			nr = !mapcounts.compound_mapcount;
> -			unlock_compound_mapcounts(head, &mapcounts);
> +			mapped = subpages_mapcount_ptr(compound_head(page));
> +			nr = atomic_inc_return_relaxed(mapped);
> +			nr = !(nr & COMPOUND_MAPPED);
>  		}
>  	} else if (PageTransHuge(page)) {
>  		/* That test is redundant: it's for safety or to optimize out */
>  
> -		lock_compound_mapcounts(page, &mapcounts);
> -		first = !mapcounts.compound_mapcount;
> -		mapcounts.compound_mapcount++;
> +		first = atomic_inc_and_test(compound_mapcount_ptr(page));
>  		if (first) {
> +			mapped = subpages_mapcount_ptr(page);
> +			nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
>  			nr_pmdmapped = thp_nr_pages(page);
> -			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
> +			nr = nr_pmdmapped - (nr & SUBPAGES_MAPPED);
>  		}
> -		unlock_compound_mapcounts(page, &mapcounts);
>  	}
>  
>  	if (nr_pmdmapped)
> @@ -1431,7 +1363,7 @@ void page_add_file_rmap(struct page *page,
>  void page_remove_rmap(struct page *page,
>  	struct vm_area_struct *vma, bool compound)
>  {
> -	struct compound_mapcounts mapcounts;
> +	atomic_t *mapped;
>  	int nr = 0, nr_pmdmapped = 0;
>  	bool last;
>  
> @@ -1451,24 +1383,20 @@ void page_remove_rmap(struct page *page,
>  		last = atomic_add_negative(-1, &page->_mapcount);
>  		nr = last;
>  		if (last && PageCompound(page)) {
> -			struct page *head = compound_head(page);
> -
> -			lock_compound_mapcounts(head, &mapcounts);
> -			mapcounts.subpages_mapcount--;
> -			nr = !mapcounts.compound_mapcount;
> -			unlock_compound_mapcounts(head, &mapcounts);
> +			mapped = subpages_mapcount_ptr(compound_head(page));
> +			nr = atomic_dec_return_relaxed(mapped);
> +			nr = !(nr & COMPOUND_MAPPED);
>  		}
>  	} else if (PageTransHuge(page)) {
>  		/* That test is redundant: it's for safety or to optimize out */
>  
> -		lock_compound_mapcounts(page, &mapcounts);
> -		mapcounts.compound_mapcount--;
> -		last = !mapcounts.compound_mapcount;
> +		last = atomic_add_negative(-1, compound_mapcount_ptr(page));
>  		if (last) {
> +			mapped = subpages_mapcount_ptr(page);
> +			nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
>  			nr_pmdmapped = thp_nr_pages(page);
> -			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
> +			nr = nr_pmdmapped - (nr & SUBPAGES_MAPPED);
>  		}
> -		unlock_compound_mapcounts(page, &mapcounts);
>  	}
>  
>  	if (nr_pmdmapped) {
> -- 
> 2.35.3
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked()
  2022-11-18  9:16   ` [PATCH 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked() Hugh Dickins
@ 2022-11-21 13:24     ` Kirill A. Shutemov
  0 siblings, 0 replies; 54+ messages in thread
From: Kirill A. Shutemov @ 2022-11-21 13:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Matthew Wilcox,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Sidhartha Kumar, Muchun Song,
	Miaohe Lin, Naoya Horiguchi, Mina Almasry, James Houghton,
	Zach O'Keefe, linux-kernel, linux-mm

On Fri, Nov 18, 2022 at 01:16:20AM -0800, Hugh Dickins wrote:
> It's hard to add a page_add_anon_rmap() into __split_huge_pmd_locked()'s
> HPAGE_PMD_NR set_pte_at() loop, without wincing at the "freeze" case's
> HPAGE_PMD_NR page_remove_rmap() loop below it.
> 
> It's just a mistake to add rmaps in the "freeze" (insert migration entries
> prior to splitting huge page) case: the pmd_migration case already avoids
> doing that, so just follow its lead.  page_add_ref() versus put_page()
> likewise.  But why is one more put_page() needed in the "freeze" case?
> Because it's removing the pmd rmap, already removed when pmd_migration
> (and freeze and pmd_migration are mutually exclusive cases).
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-18  9:08 ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Hugh Dickins
                     ` (3 preceding siblings ...)
  2022-11-18 20:18   ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Linus Torvalds
@ 2022-11-21 16:59   ` Shakeel Butt
  2022-11-21 17:16     ` Linus Torvalds
  2022-11-21 18:52     ` Johannes Weiner
  2022-11-22  9:38   ` [PATCH v2 " Hugh Dickins
  5 siblings, 2 replies; 54+ messages in thread
From: Shakeel Butt @ 2022-11-21 16:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Fri, Nov 18, 2022 at 01:08:13AM -0800, Hugh Dickins wrote:
> Linus was underwhelmed by the earlier compound mapcounts series:
> this series builds on top of it (as in next-20221117) to follow
> up on his suggestions - except rmap.c still using lock_page_memcg(),
> since I hesitate to steal the pleasure of deletion from Johannes.
> 

Is there a plan to remove lock_page_memcg() altogether which I missed? I
am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
up in the perf profile on exit path) but if we are removing it then I
should just wait.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-21 16:59   ` Shakeel Butt
@ 2022-11-21 17:16     ` Linus Torvalds
  2022-11-22 16:27       ` Shakeel Butt
  2022-11-21 18:52     ` Johannes Weiner
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2022-11-21 17:16 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Hugh Dickins, Andrew Morton, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Mon, Nov 21, 2022 at 8:59 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Is there a plan to remove lock_page_memcg() altogether which I missed? I
> am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
> up in the perf profile on exit path)

Yay. It seems I'm not the only one hating it.

> but if we are removing it then I should just wait.

Well, I think Johannes was saying that at least the case I disliked
(the rmap removal from the page table tear-down - I strongly suspect
it's the one you're seeing on your perf profile too) can be removed
entirely as long as it's done under the page table lock (which my
final version of the rmap delaying still was).

See

    https://lore.kernel.org/all/Y2llcRiDLHc2kg%2FN@cmpxchg.org/

for his preliminary patch.

That said, if you have some patch to make it a no-op for _other_
reasons, and it could be done away with _entirely_ (not just for rmap),
then that would be even better. I am not a fan of that lock in
general, but in the teardown rmap path it's actively horrifying
because it is taken one page at a time. So it's taken a *lot*
(although you might not see it if all you run is long-running
benchmarks - it's mainly the "run lots of small scripts" workload that
really hits it).

The reason it seems to be so horrifyingly noticeable on the exit path
is that the fork() side already does the rmap stuff (mainly
__page_dup_rmap()) _without_ having to do the lock_page_memcg() dance.

So I really hate that lock. It's completely inconsistent, and it all
feels very wrong. It seemed entirely pointless when I was looking at
the rmap removal path for a single page. The fact that both you and
Johannes seem to be more than ready to just remove it makes me much
happier, because I've never actually known the memcg code enough to do
anything about my simmering hatred.

              Linus


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-21 16:59   ` Shakeel Butt
  2022-11-21 17:16     ` Linus Torvalds
@ 2022-11-21 18:52     ` Johannes Weiner
  2022-11-22  1:32       ` Hugh Dickins
  2022-11-22  5:57       ` Matthew Wilcox
  1 sibling, 2 replies; 54+ messages in thread
From: Johannes Weiner @ 2022-11-21 18:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Mon, Nov 21, 2022 at 04:59:38PM +0000, Shakeel Butt wrote:
> On Fri, Nov 18, 2022 at 01:08:13AM -0800, Hugh Dickins wrote:
> > Linus was underwhelmed by the earlier compound mapcounts series:
> > this series builds on top of it (as in next-20221117) to follow
> > up on his suggestions - except rmap.c still using lock_page_memcg(),
> > since I hesitate to steal the pleasure of deletion from Johannes.
> 
> Is there a plan to remove lock_page_memcg() altogether which I missed? I
> am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
> up in the perf profile on exit path) but if we are removing it then I
> should just wait.

We can remove it for rmap at least, but we might be able to do more.

Besides rmap, we're left with the dirty and writeback page transitions,
which wrt cgroups need to be atomic with NR_FILE_DIRTY and NR_WRITEBACK.

Looking through the various callsites, I think we can delete it from
setting and clearing dirty state, as we always hold the page lock (or
the pte lock in some instances of folio_mark_dirty). Both of these are
taken from the cgroup side, so we're good there.

I think we can also remove it when setting writeback, because those
sites have the page locked as well.

That leaves clearing writeback. This can't hold the page lock due to
the atomic context, so currently we need to take lock_page_memcg() as
the lock of last resort.

I wonder if we can have cgroup take the xalock instead: writeback
ending on file pages always acquires the xarray lock. Swap writeback
currently doesn't, but we could make it so (swap_address_space).

The only thing that gives me pause is the !mapping check in
__folio_end_writeback. File and swapcache pages usually have mappings,
and truncation waits for writeback to finish before axing
page->mapping. So AFAICS this can only happen if we call end_writeback
on something that isn't under writeback - in which case the test_clear
will fail and we don't update the stats anyway. But I want to be sure.

Does anybody know from the top of their heads if a page under
writeback could be without a mapping in some weird cornercase?

If we could ensure that the NR_WRITEBACK decs are always protected by
the xalock, we could grab it from mem_cgroup_move_account(), and then
kill lock_page_memcg() altogether.
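
A minimal sketch of that direction, purely illustrative - the helper
name and the exact stat plumbing below are my assumptions, not existing
kernel code; the point is only that once every NR_WRITEBACK update runs
under the mapping's xarray lock, charge moving could take that same
lock in place of lock_page_memcg():

/*
 * Illustrative sketch, not the kernel's implementation: assumes every
 * NR_WRITEBACK decrement (file pages, and swap via swap_address_space())
 * already runs under mapping->i_pages xa_lock, so moving the writeback
 * stat between memcgs can serialize on that lock too.
 */
static void sketch_move_writeback_stat(struct folio *folio,
		struct mem_cgroup *from, struct mem_cgroup *to)
{
	struct address_space *mapping = folio_mapping(folio);
	long nr = folio_nr_pages(folio);

	xa_lock_irq(&mapping->i_pages);	/* excludes __folio_end_writeback() */
	if (folio_test_writeback(folio)) {
		__mod_lruvec_state(mem_cgroup_lruvec(from, folio_pgdat(folio)),
				   NR_WRITEBACK, -nr);
		__mod_lruvec_state(mem_cgroup_lruvec(to, folio_pgdat(folio)),
				   NR_WRITEBACK, nr);
	}
	xa_unlock_irq(&mapping->i_pages);
}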


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-21 18:52     ` Johannes Weiner
@ 2022-11-22  1:32       ` Hugh Dickins
  2022-11-22  5:57       ` Matthew Wilcox
  1 sibling, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-22  1:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shakeel Butt, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Kirill A. Shutemov, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Mon, 21 Nov 2022, Johannes Weiner wrote:
> On Mon, Nov 21, 2022 at 04:59:38PM +0000, Shakeel Butt wrote:
> > On Fri, Nov 18, 2022 at 01:08:13AM -0800, Hugh Dickins wrote:
> > > Linus was underwhelmed by the earlier compound mapcounts series:
> > > this series builds on top of it (as in next-20221117) to follow
> > > up on his suggestions - except rmap.c still using lock_page_memcg(),
> > > since I hesitate to steal the pleasure of deletion from Johannes.
> > 
> > Is there a plan to remove lock_page_memcg() altogether which I missed? I
> > am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
> > up in the perf profile on exit path) but if we are removing it then I
> > should just wait.
> 
> We can remove it for rmap at least, but we might be able to do more.

I hope the calls from mm/rmap.c can be deleted before deciding the
bigger picture for lock_page_memcg() itself; getting rid of it would
be very nice, but it has always had a difficult job to do (and you've
devoted lots of good effort to minimizing it).

> 
> Besides rmap, we're left with the dirty and writeback page transitions
> that wrt cgroups need to be atomic with NR_FILE_DIRTY and NR_WRITEBACK.
> 
> Looking through the various callsites, I think we can delete it from
> setting and clearing dirty state, as we always hold the page lock (or
> the pte lock in some instances of folio_mark_dirty). Both of these are
> taken from the cgroup side, so we're good there.
> 
> I think we can also remove it when setting writeback, because those
> sites have the page locked as well.
> 
> That leaves clearing writeback. This can't hold the page lock due to
> the atomic context, so currently we need to take lock_page_memcg() as
> the lock of last resort.
> 
> I wonder if we can have cgroup take the xalock instead: writeback
> ending on file pages always acquires the xarray lock. Swap writeback
> currently doesn't, but we could make it so (swap_address_space).

It's a little bit of a regression to have to take that lock when
ending writeback on swap (compared with the rcu_read_lock() of almost
every lock_page_memcg()); but I suppose if swap had been doing that
all along, like the normal page cache case, I would not be complaining.

> 
> The only thing that gives me pause is the !mapping check in
> __folio_end_writeback. File and swapcache pages usually have mappings,
> and truncation waits for writeback to finish before axing
> page->mapping. So AFAICS this can only happen if we call end_writeback
> on something that isn't under writeback - in which case the test_clear
> will fail and we don't update the stats anyway. But I want to be sure.
> 
> Does anybody know from the top of their heads if a page under
> writeback could be without a mapping in some weird cornercase?

End of writeback has been a persistent troublemaker, in several ways;
I forget whether we are content with it now or not.

I would not trust whatever I think offhand of that !mapping case, but I
was deeper into it two years ago, and find myself saying "Can mapping be
NULL? I don't see how, but allow for that with a WARN_ON_ONCE()" in a
patch I posted then (but it didn't go in, we went in another direction).

I'm pretty sure it never warned once for me, but I probably wasn't doing
enough to test it.  And IIRC I did also think that the !mapping check had
perhaps been copied from a related function, one where it made more sense.

It's also worth noting that the two stats which get decremented there,
NR_WRITEBACK and NR_ZONE_WRITE_PENDING, are two of the three which we
have commented "Skip checking stats known to go negative occasionally"
in mm/vmstat.c: I never did come up with a convincing explanation for
that (Roman had his explanation, but I wasn't quite convinced).
Maybe it would just be wrong to touch them if mapping were NULL.

> 
> If we could ensure that the NR_WRITEBACK decs are always protected by
> the xalock, we could grab it from mem_cgroup_move_account(), and then
> kill lock_page_memcg() altogether.

I suppose so (but I still feel grudging about the xalock for swap).

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-21 18:52     ` Johannes Weiner
  2022-11-22  1:32       ` Hugh Dickins
@ 2022-11-22  5:57       ` Matthew Wilcox
  2022-11-22  6:55         ` Johannes Weiner
  1 sibling, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2022-11-22  5:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shakeel Butt, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Kirill A. Shutemov, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Mon, Nov 21, 2022 at 01:52:23PM -0500, Johannes Weiner wrote:
> That leaves clearing writeback. This can't hold the page lock due to
> the atomic context, so currently we need to take lock_page_memcg() as
> the lock of last resort.
> 
> I wonder if we can have cgroup take the xalock instead: writeback
> ending on file pages always acquires the xarray lock. Swap writeback
> currently doesn't, but we could make it so (swap_address_space).
> 
> The only thing that gives me pause is the !mapping check in
> __folio_end_writeback. File and swapcache pages usually have mappings,
> and truncation waits for writeback to finish before axing
> page->mapping. So AFAICS this can only happen if we call end_writeback
> on something that isn't under writeback - in which case the test_clear
> will fail and we don't update the stats anyway. But I want to be sure.
> 
> Does anybody know from the top of their heads if a page under
> writeback could be without a mapping in some weird cornercase?

I can't think of such a corner case.  We should always wait for
writeback to finish before removing the page from the page cache;
the writeback bit used to be (and kind of still is) an implicit
reference to the page, which means that we can't remove the page
cache's reference to the page without waiting for writeback.

> If we could ensure that the NR_WRITEBACK decs are always protected by
> the xalock, we could grab it from mem_cgroup_move_account(), and then
> kill lock_page_memcg() altogether.

I'm not thrilled by this idea, but I'm not going to veto it.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-22  5:57       ` Matthew Wilcox
@ 2022-11-22  6:55         ` Johannes Weiner
  2022-11-22 16:30           ` Shakeel Butt
  0 siblings, 1 reply; 54+ messages in thread
From: Johannes Weiner @ 2022-11-22  6:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Shakeel Butt, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Kirill A. Shutemov, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Tue, Nov 22, 2022 at 05:57:42AM +0000, Matthew Wilcox wrote:
> On Mon, Nov 21, 2022 at 01:52:23PM -0500, Johannes Weiner wrote:
> > That leaves clearing writeback. This can't hold the page lock due to
> > the atomic context, so currently we need to take lock_page_memcg() as
> > the lock of last resort.
> > 
> > I wonder if we can have cgroup take the xalock instead: writeback
> > ending on file pages always acquires the xarray lock. Swap writeback
> > currently doesn't, but we could make it so (swap_address_space).
> > 
> > The only thing that gives me pause is the !mapping check in
> > __folio_end_writeback. File and swapcache pages usually have mappings,
> > and truncation waits for writeback to finish before axing
> > page->mapping. So AFAICS this can only happen if we call end_writeback
> > on something that isn't under writeback - in which case the test_clear
> > will fail and we don't update the stats anyway. But I want to be sure.
> > 
> > Does anybody know from the top of their heads if a page under
> > writeback could be without a mapping in some weird cornercase?
> 
> I can't think of such a corner case.  We should always wait for
> writeback to finish before removing the page from the page cache;
> the writeback bit used to be (and kind of still is) an implicit
> reference to the page, which means that we can't remove the page
> cache's reference to the page without waiting for writeback.

Great, thanks!

> > If we could ensure that the NR_WRITEBACK decs are always protected by
> > the xalock, we could grab it from mem_cgroup_move_account(), and then
> > kill lock_page_memcg() altogether.
> 
> I'm not thrilled by this idea, but I'm not going to veto it.

Ok, I'm also happy to drop this one.

Certainly, the rmap one is the lowest-hanging fruit. I have the patch
rebased against Hugh's series in mm-unstable; I'll wait for that to
settle down, and then send an updated version to Andrew.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
  2022-11-21 12:36     ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Kirill A. Shutemov
@ 2022-11-22  9:03       ` Hugh Dickins
  0 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-22  9:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Mon, 21 Nov 2022, Kirill A. Shutemov wrote:
> On Fri, Nov 18, 2022 at 01:12:03AM -0800, Hugh Dickins wrote:
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks a lot for all these, Kirill.

> 
> Few minor nitpicks below.
> 
...
> >  static inline void page_dup_file_rmap(struct page *page, bool compound)
> >  {
> > -	if (PageCompound(page))
> > -		page_dup_compound_rmap(page, compound);
> > -	else
> > +	if (likely(!compound /* page is mapped by PTE */))
> 
> I'm not a fan of this kind of comments.
> 
> Maybe move above the line (here and below)?

Okay, done throughout.  I wouldn't have added those comments, but Linus
had another "simmering hatred", of the "compound" arguments: he found
them very confusing.

The real fix is to rename them, probably to pmd_mapped; or better, pass
down an int nr_pages as he suggested; but I'm wary of the HPAGE_NR_PAGES
build bug, and it wants to be propagated through various other files
(headers and mlock.c, maybe more) - not a cleanup to get into now.

> 
> >  		atomic_inc(&page->_mapcount);
> > +	else
> > +		page_dup_compound_rmap(page);
> >  }
...
> > @@ -1176,20 +1157,16 @@ void page_dup_compound_rmap(struct page *page, bool compound)
> >  	 * Note that hugetlb does not call page_add_file_rmap():
> >  	 * here is where hugetlb shared page mapcount is raised.
> >  	 */
> > -	if (PageHuge(page)) {
> > -		atomic_inc(compound_mapcount_ptr(page));
> > -		return;
> > -	}
> > +	if (PageHuge(head)) {
> > +		atomic_inc(compound_mapcount_ptr(head));
> >  
> 
> Remove the newline?

It was intentional there, I thought it was easier to read that way;
but since this gets reverted in the next patch, I've no reason to
fight over it - removed.

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 1/3 fix] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages: fix
  2022-11-21 12:38           ` Kirill A. Shutemov
@ 2022-11-22  9:13             ` Hugh Dickins
  0 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-22  9:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Yu Zhao, Linus Torvalds,
	Johannes Weiner, Matthew Wilcox, David Hildenbrand,
	Vlastimil Babka, Peter Xu, Yang Shi, John Hubbard, Mike Kravetz,
	Sidhartha Kumar, Muchun Song, Miaohe Lin, Naoya Horiguchi,
	Mina Almasry, James Houghton, Zach O'Keefe, linux-kernel,
	linux-mm

On Mon, 21 Nov 2022, Kirill A. Shutemov wrote:
> On Fri, Nov 18, 2022 at 05:35:05PM -0800, Hugh Dickins wrote:
...
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1281,7 +1281,7 @@ void page_add_anon_rmap(struct page *page,
> >  	struct compound_mapcounts mapcounts;
> >  	int nr = 0, nr_pmdmapped = 0;
> >  	bool compound = flags & RMAP_COMPOUND;
> > -	bool first;
> > +	bool first = true;
> >  
> >  	if (unlikely(PageKsm(page)))
> >  		lock_page_memcg(page);
> 
> Other option is to drop PageTransHuge() check that you already claim to be
> redundant.
> 
> Or have else BUG() to catch cases where the helper called with
> compound=true on non-THP page.

I'm sticking with the "first = true".  I did receive a report of bloating
some tiny config a little, on the very first series, so I've been on guard
since then about adding THP code where the optimizer cannot see to remove
it: so do want to keep the PageTransHuge check in there.  Could be done
in other ways, yes, but I'd ended up feeling this was the best compromise.
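
For reference, the pattern being relied on - condensed here, with
details elided, from include/linux/page-flags.h: when
CONFIG_TRANSPARENT_HUGEPAGE is not set, PageTransHuge() is a
compile-time 0, so the optimizer can discard the whole
"else if (PageTransHuge(page))" branch and the THP-only code inside it.

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline int PageTransHuge(struct page *page)
{
	VM_BUG_ON_PAGE(PageTail(page), page);
	return PageHead(page);
}
#else
/* THP disabled: constant 0, so callers' THP branches become dead code */
static inline int PageTransHuge(struct page *page)
{
	return 0;
}
#endif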

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped
  2022-11-21 13:09     ` Kirill A. Shutemov
@ 2022-11-22  9:33       ` Hugh Dickins
  0 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-22  9:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Mon, 21 Nov 2022, Kirill A. Shutemov wrote:
> On Fri, Nov 18, 2022 at 01:14:17AM -0800, Hugh Dickins wrote:
> > Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now?
> > Yes.  Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
> > but if we slightly abuse subpages_mapcount by additionally demanding that
> > one bit be set there when the compound page is PMD-mapped, then a cascade
> > of two atomic ops is able to maintain the stats without bit_spin_lock.
> 
> Yay! New home for PageDoubleMap()! :P

:) You only asked for one bit for PageDoubleMap, I've been greedier;
so it's not surprising if it has worked out better now.

...

> Jokes aside, looks neat.
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks; but I'm very glad that Linus expressed his dissatisfaction
with the first implementation, this one does feel much better.

> 
> As always few minor nits below.
...
> > @@ -893,8 +902,12 @@ static inline int total_mapcount(struct page *page)
> >  
> >  static inline bool folio_large_is_mapped(struct folio *folio)
> >  {
> > -	return atomic_read(folio_mapcount_ptr(folio)) +
> > -		atomic_read(folio_subpages_mapcount_ptr(folio)) >= 0;
> > +	/*
> > +	 * Reading folio_mapcount_ptr() below could be omitted if hugetlb
> > +	 * participated in incrementing subpages_mapcount when compound mapped.
> > +	 */
> > +	return atomic_read(folio_mapcount_ptr(folio)) >= 0 ||
> > +		atomic_read(folio_subpages_mapcount_ptr(folio)) > 0;
> 
> Maybe check folio_subpages_mapcount_ptr() first? It would avoid
> folio_mapcount_ptr() read for everything, but hugetlb.

Okay: I'm not convinced, but don't mind switching those around: done.

> > --- a/mm/debug.c
> > +++ b/mm/debug.c
> > @@ -97,7 +97,7 @@ static void __dump_page(struct page *page)
> >  		pr_warn("head:%p order:%u compound_mapcount:%d subpages_mapcount:%d compound_pincount:%d\n",
> >  				head, compound_order(head),
> >  				head_compound_mapcount(head),
> > -				head_subpages_mapcount(head),
> > +				head_subpages_mapcount(head) & SUBPAGES_MAPPED,
> 
> Looks like applying the SUBPAGES_MAPPED mask belong to the
> head_subpages_mapcount() helper, not to the caller.

Yes, that would be more consistent, helper function doing the massage.
Done.  __dump_page() then remains unchanged, but free_tail_pages_check()
uses subpages_mapcount_ptr(head_page) to check the whole field is zero.

v2 coming up - thanks.

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-18  9:08 ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Hugh Dickins
                     ` (4 preceding siblings ...)
  2022-11-21 16:59   ` Shakeel Butt
@ 2022-11-22  9:38   ` Hugh Dickins
  2022-11-22  9:42     ` [PATCH v2 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
                       ` (2 more replies)
  5 siblings, 3 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-22  9:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, Yu Zhao, Dan Carpenter,
	linux-kernel, linux-mm

Andrew, please replace the 1/3, 1/3 fix, 2/3, 3/3 in mm-unstable
by these three v2 patches, which incorporate the uninitialized warning fix,
and adjustments according to Kirill's review comments, plus his
Acks - I couldn't quite manage them just by -fixes.
No functional change from the v1 series.

1/3 mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
2/3 mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped
3/3 mm,thp,rmap: clean up the end of __split_huge_pmd_locked()

 Documentation/mm/transhuge.rst |  10 +-
 include/linux/mm.h             |  67 +++++++----
 include/linux/rmap.h           |  12 +-
 mm/huge_memory.c               |  15 +--
 mm/page_alloc.c                |   2 +-
 mm/rmap.c                      | 219 ++++++++++-------------------------
 6 files changed, 124 insertions(+), 201 deletions(-)

Thanks!
Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages
  2022-11-22  9:38   ` [PATCH v2 " Hugh Dickins
@ 2022-11-22  9:42     ` Hugh Dickins
  2022-11-22  9:49     ` [PATCH v2 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped Hugh Dickins
  2022-11-22  9:51     ` [PATCH v2 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked() Hugh Dickins
  2 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-22  9:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, Yu Zhao, Dan Carpenter,
	linux-kernel, linux-mm

Following suggestion from Linus, instead of counting every PTE map of a
compound page in subpages_mapcount, just count how many of its subpages
are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED
and NR_FILE_MAPPED stats, without any need for a locked scan of subpages;
and requires updating the count less often.

This does then revert total_mapcount() and folio_mapcount() to needing a
scan of subpages; but they are inherently racy, and need no locking, so
Linus is right that the scans are much better done there.  Plus (unlike
in 6.1 and previous) subpages_mapcount lets us avoid the scan in the
common case of no PTE maps.  And page_mapped() and folio_mapped() remain
scanless and just as efficient with the new meaning of subpages_mapcount:
those are the functions which I most wanted to remove the scan from.

The updated page_dup_compound_rmap() is no longer suitable for use by
anon THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be
used for that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.

Evidence is that this way goes slightly faster than the previous
implementation for most cases; but significantly faster in the (now
scanless) pmds after ptes case, which started out at 870ms and was
brought down to 495ms by the previous series, now takes around 105ms.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
v2: fix uninitialized 'first', reported by Yu Zhao and Dan Carpenter
    moved "mapped by PTE" comments above the !compound tests, per Kirill
    removed a newline (which goes away in the next patch), per Kirill

 Documentation/mm/transhuge.rst |   3 +-
 include/linux/mm.h             |  52 ++++++-----
 include/linux/rmap.h           |   9 +-
 mm/huge_memory.c               |   2 +-
 mm/rmap.c                      | 160 ++++++++++++++-------------------
 5 files changed, 107 insertions(+), 119 deletions(-)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index 1e2a637cc607..af4c9d70321d 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -122,7 +122,8 @@ pages:
 
   - map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount
     on relevant sub-page of the compound page, and also increment/decrement
-    ->subpages_mapcount, stored in first tail page of the compound page.
+    ->subpages_mapcount, stored in first tail page of the compound page, when
+    _mapcount goes from -1 to 0 or 0 to -1: counting sub-pages mapped by PTE.
     In order to have race-free accounting of sub-pages mapped, changes to
     sub-page ->_mapcount, ->subpages_mapcount and ->compound_mapcount are
     are all locked by bit_spin_lock of PG_locked in the first tail ->flags.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8fe6276d8cc2..c9e46d4d46f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -828,7 +828,7 @@ static inline int head_compound_mapcount(struct page *head)
 }
 
 /*
- * Sum of mapcounts of sub-pages, does not include compound mapcount.
+ * Number of sub-pages mapped by PTE, does not include compound mapcount.
  * Must be called only on head of compound page.
  */
 static inline int head_subpages_mapcount(struct page *head)
@@ -864,23 +864,7 @@ static inline int page_mapcount(struct page *page)
 	return head_compound_mapcount(page) + mapcount;
 }
 
-static inline int total_mapcount(struct page *page)
-{
-	if (likely(!PageCompound(page)))
-		return atomic_read(&page->_mapcount) + 1;
-	page = compound_head(page);
-	return head_compound_mapcount(page) + head_subpages_mapcount(page);
-}
-
-/*
- * Return true if this page is mapped into pagetables.
- * For compound page it returns true if any subpage of compound page is mapped,
- * even if this particular subpage is not itself mapped by any PTE or PMD.
- */
-static inline bool page_mapped(struct page *page)
-{
-	return total_mapcount(page) > 0;
-}
+int total_compound_mapcount(struct page *head);
 
 /**
  * folio_mapcount() - Calculate the number of mappings of this folio.
@@ -897,8 +881,20 @@ static inline int folio_mapcount(struct folio *folio)
 {
 	if (likely(!folio_test_large(folio)))
 		return atomic_read(&folio->_mapcount) + 1;
-	return atomic_read(folio_mapcount_ptr(folio)) + 1 +
-		atomic_read(folio_subpages_mapcount_ptr(folio));
+	return total_compound_mapcount(&folio->page);
+}
+
+static inline int total_mapcount(struct page *page)
+{
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+	return total_compound_mapcount(compound_head(page));
+}
+
+static inline bool folio_large_is_mapped(struct folio *folio)
+{
+	return atomic_read(folio_mapcount_ptr(folio)) +
+		atomic_read(folio_subpages_mapcount_ptr(folio)) >= 0;
 }
 
 /**
@@ -909,7 +905,21 @@ static inline int folio_mapcount(struct folio *folio)
  */
 static inline bool folio_mapped(struct folio *folio)
 {
-	return folio_mapcount(folio) > 0;
+	if (likely(!folio_test_large(folio)))
+		return atomic_read(&folio->_mapcount) >= 0;
+	return folio_large_is_mapped(folio);
+}
+
+/*
+ * Return true if this page is mapped into pagetables.
+ * For compound page it returns true if any sub-page of compound page is mapped,
+ * even if this particular sub-page is not itself mapped by any PTE or PMD.
+ */
+static inline bool page_mapped(struct page *page)
+{
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) >= 0;
+	return folio_large_is_mapped(page_folio(page));
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 011a7530dc76..5dadb9a3e010 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,14 +204,15 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 
-void page_dup_compound_rmap(struct page *page, bool compound);
+void page_dup_compound_rmap(struct page *page);
 
 static inline void page_dup_file_rmap(struct page *page, bool compound)
 {
-	if (PageCompound(page))
-		page_dup_compound_rmap(page, compound);
-	else
+	/* Is page being mapped by PTE? */
+	if (likely(!compound))
 		atomic_inc(&page->_mapcount);
+	else
+		page_dup_compound_rmap(page);
 }
 
 /**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 30056efc79ad..3dee8665c585 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2215,7 +2215,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
 		if (!pmd_migration)
-			page_dup_compound_rmap(page + i, false);
+			page_add_anon_rmap(page + i, vma, addr, false);
 		pte_unmap(pte);
 	}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 4833d28c5e1a..e813785da613 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1117,55 +1117,36 @@ static void unlock_compound_mapcounts(struct page *head,
 	bit_spin_unlock(PG_locked, &head[1].flags);
 }
 
-/*
- * When acting on a compound page under lock_compound_mapcounts(), avoid the
- * unnecessary overhead of an actual atomic operation on its subpage mapcount.
- * Return true if this is the first increment or the last decrement
- * (remembering that page->_mapcount -1 represents logical mapcount 0).
- */
-static bool subpage_mapcount_inc(struct page *page)
-{
-	int orig_mapcount = atomic_read(&page->_mapcount);
-
-	atomic_set(&page->_mapcount, orig_mapcount + 1);
-	return orig_mapcount < 0;
-}
-
-static bool subpage_mapcount_dec(struct page *page)
-{
-	int orig_mapcount = atomic_read(&page->_mapcount);
-
-	atomic_set(&page->_mapcount, orig_mapcount - 1);
-	return orig_mapcount == 0;
-}
-
-/*
- * When mapping a THP's first pmd, or unmapping its last pmd, if that THP
- * also has pte mappings, then those must be discounted: in order to maintain
- * NR_ANON_MAPPED and NR_FILE_MAPPED statistics exactly, without any drift,
- * and to decide when an anon THP should be put on the deferred split queue.
- * This function must be called between lock_ and unlock_compound_mapcounts().
- */
-static int nr_subpages_unmapped(struct page *head, int nr_subpages)
+int total_compound_mapcount(struct page *head)
 {
-	int nr = nr_subpages;
+	int mapcount = head_compound_mapcount(head);
+	int nr_subpages;
 	int i;
 
-	/* Discount those subpages mapped by pte */
+	/* In the common case, avoid the loop when no subpages mapped by PTE */
+	if (head_subpages_mapcount(head) == 0)
+		return mapcount;
+	/*
+	 * Add all the PTE mappings of those subpages mapped by PTE.
+	 * Limit the loop, knowing that only subpages_mapcount are mapped?
+	 * Perhaps: given all the raciness, that may be a good or a bad idea.
+	 */
+	nr_subpages = thp_nr_pages(head);
 	for (i = 0; i < nr_subpages; i++)
-		if (atomic_read(&head[i]._mapcount) >= 0)
-			nr--;
-	return nr;
+		mapcount += atomic_read(&head[i]._mapcount);
+
+	/* But each of those _mapcounts was based on -1 */
+	mapcount += nr_subpages;
+	return mapcount;
 }
 
 /*
- * page_dup_compound_rmap(), used when copying mm, or when splitting pmd,
+ * page_dup_compound_rmap(), used when copying mm,
  * provides a simple example of using lock_ and unlock_compound_mapcounts().
  */
-void page_dup_compound_rmap(struct page *page, bool compound)
+void page_dup_compound_rmap(struct page *head)
 {
 	struct compound_mapcounts mapcounts;
-	struct page *head;
 
 	/*
 	 * Hugetlb pages could use lock_compound_mapcounts(), like THPs do;
@@ -1176,20 +1157,15 @@ void page_dup_compound_rmap(struct page *page, bool compound)
 	 * Note that hugetlb does not call page_add_file_rmap():
 	 * here is where hugetlb shared page mapcount is raised.
 	 */
-	if (PageHuge(page)) {
-		atomic_inc(compound_mapcount_ptr(page));
-		return;
-	}
+	if (PageHuge(head)) {
+		atomic_inc(compound_mapcount_ptr(head));
+	} else if (PageTransHuge(head)) {
+		/* That test is redundant: it's for safety or to optimize out */
 
-	head = compound_head(page);
-	lock_compound_mapcounts(head, &mapcounts);
-	if (compound) {
+		lock_compound_mapcounts(head, &mapcounts);
 		mapcounts.compound_mapcount++;
-	} else {
-		mapcounts.subpages_mapcount++;
-		subpage_mapcount_inc(page);
+		unlock_compound_mapcounts(head, &mapcounts);
 	}
-	unlock_compound_mapcounts(head, &mapcounts);
 }
 
 /**
@@ -1304,35 +1280,34 @@ void page_add_anon_rmap(struct page *page,
 	struct compound_mapcounts mapcounts;
 	int nr = 0, nr_pmdmapped = 0;
 	bool compound = flags & RMAP_COMPOUND;
-	bool first;
+	bool first = true;
 
 	if (unlikely(PageKsm(page)))
 		lock_page_memcg(page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	if (likely(!PageCompound(page))) {
+	/* Is page being mapped by PTE? Is this its first map to be added? */
+	if (likely(!compound)) {
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
+		if (first && PageCompound(page)) {
+			struct page *head = compound_head(page);
+
+			lock_compound_mapcounts(head, &mapcounts);
+			mapcounts.subpages_mapcount++;
+			nr = !mapcounts.compound_mapcount;
+			unlock_compound_mapcounts(head, &mapcounts);
+		}
+	} else if (PageTransHuge(page)) {
+		/* That test is redundant: it's for safety or to optimize out */
 
-	} else if (compound && PageTransHuge(page)) {
 		lock_compound_mapcounts(page, &mapcounts);
 		first = !mapcounts.compound_mapcount;
 		mapcounts.compound_mapcount++;
 		if (first) {
-			nr = nr_pmdmapped = thp_nr_pages(page);
-			if (mapcounts.subpages_mapcount)
-				nr = nr_subpages_unmapped(page, nr_pmdmapped);
+			nr_pmdmapped = thp_nr_pages(page);
+			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
 		}
 		unlock_compound_mapcounts(page, &mapcounts);
-	} else {
-		struct page *head = compound_head(page);
-
-		lock_compound_mapcounts(head, &mapcounts);
-		mapcounts.subpages_mapcount++;
-		first = subpage_mapcount_inc(page);
-		nr = first && !mapcounts.compound_mapcount;
-		unlock_compound_mapcounts(head, &mapcounts);
 	}
 
 	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
@@ -1411,28 +1386,29 @@ void page_add_file_rmap(struct page *page,
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
 
-	if (likely(!PageCompound(page))) {
+	/* Is page being mapped by PTE? Is this its first map to be added? */
+	if (likely(!compound)) {
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
+		if (first && PageCompound(page)) {
+			struct page *head = compound_head(page);
+
+			lock_compound_mapcounts(head, &mapcounts);
+			mapcounts.subpages_mapcount++;
+			nr = !mapcounts.compound_mapcount;
+			unlock_compound_mapcounts(head, &mapcounts);
+		}
+	} else if (PageTransHuge(page)) {
+		/* That test is redundant: it's for safety or to optimize out */
 
-	} else if (compound && PageTransHuge(page)) {
 		lock_compound_mapcounts(page, &mapcounts);
 		first = !mapcounts.compound_mapcount;
 		mapcounts.compound_mapcount++;
 		if (first) {
-			nr = nr_pmdmapped = thp_nr_pages(page);
-			if (mapcounts.subpages_mapcount)
-				nr = nr_subpages_unmapped(page, nr_pmdmapped);
+			nr_pmdmapped = thp_nr_pages(page);
+			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
 		}
 		unlock_compound_mapcounts(page, &mapcounts);
-	} else {
-		struct page *head = compound_head(page);
-
-		lock_compound_mapcounts(head, &mapcounts);
-		mapcounts.subpages_mapcount++;
-		first = subpage_mapcount_inc(page);
-		nr = first && !mapcounts.compound_mapcount;
-		unlock_compound_mapcounts(head, &mapcounts);
 	}
 
 	if (nr_pmdmapped)
@@ -1471,29 +1447,29 @@ void page_remove_rmap(struct page *page,
 
 	lock_page_memcg(page);
 
-	/* page still mapped by someone else? */
-	if (likely(!PageCompound(page))) {
+	/* Is page being unmapped by PTE? Is this its last map to be removed? */
+	if (likely(!compound)) {
 		last = atomic_add_negative(-1, &page->_mapcount);
 		nr = last;
+		if (last && PageCompound(page)) {
+			struct page *head = compound_head(page);
+
+			lock_compound_mapcounts(head, &mapcounts);
+			mapcounts.subpages_mapcount--;
+			nr = !mapcounts.compound_mapcount;
+			unlock_compound_mapcounts(head, &mapcounts);
+		}
+	} else if (PageTransHuge(page)) {
+		/* That test is redundant: it's for safety or to optimize out */
 
-	} else if (compound && PageTransHuge(page)) {
 		lock_compound_mapcounts(page, &mapcounts);
 		mapcounts.compound_mapcount--;
 		last = !mapcounts.compound_mapcount;
 		if (last) {
-			nr = nr_pmdmapped = thp_nr_pages(page);
-			if (mapcounts.subpages_mapcount)
-				nr = nr_subpages_unmapped(page, nr_pmdmapped);
+			nr_pmdmapped = thp_nr_pages(page);
+			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
 		}
 		unlock_compound_mapcounts(page, &mapcounts);
-	} else {
-		struct page *head = compound_head(page);
-
-		lock_compound_mapcounts(head, &mapcounts);
-		mapcounts.subpages_mapcount--;
-		last = subpage_mapcount_dec(page);
-		nr = last && !mapcounts.compound_mapcount;
-		unlock_compound_mapcounts(head, &mapcounts);
 	}
 
 	if (nr_pmdmapped) {
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v2 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped
  2022-11-22  9:38   ` [PATCH v2 " Hugh Dickins
  2022-11-22  9:42     ` [PATCH v2 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
@ 2022-11-22  9:49     ` Hugh Dickins
  2022-11-22  9:51     ` [PATCH v2 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked() Hugh Dickins
  2 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-11-22  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, Yu Zhao, Dan Carpenter,
	linux-kernel, linux-mm

Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now?
Yes.  Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
but if we slightly abuse subpages_mapcount by additionally demanding that
one bit be set there when the compound page is PMD-mapped, then a cascade
of two atomic ops is able to maintain the stats without bit_spin_lock.

This is harder to reason about than when bit_spin_locked, but I believe
safe; and no drift in stats detected when testing.  When there are racing
removes and adds, of course the sequence of operations is less well-
defined; but each operation on subpages_mapcount is atomically good.
What might be disastrous, is if subpages_mapcount could ever fleetingly
appear negative: but the pte lock (or pmd lock) these rmap functions are
called under, ensures that a last remove cannot race ahead of a first add.

Continue to make an exception for hugetlb (PageHuge) pages, though that
exception can be easily removed by a further commit if necessary: leave
subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
carry on checking compound_mapcount too in folio_mapped(), page_mapped().

Evidence is that this way goes slightly faster than the previous
implementation in all cases (pmds after ptes now taking around 103ms);
and relieves us of worrying about contention on the bit_spin_lock.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
v2: head_subpages_mapcount() apply the SUBPAGES_MAPPED mask, per Kirill
    (which consequently modifies mm/page_alloc.c instead of mm/debug.c)
    reverse order of reads in folio_large_is_mapped(), per Kirill

 Documentation/mm/transhuge.rst |   7 +-
 include/linux/mm.h             |  19 +++++-
 include/linux/rmap.h           |  13 ++--
 mm/page_alloc.c                |   2 +-
 mm/rmap.c                      | 121 +++++++--------------------------
 5 files changed, 51 insertions(+), 111 deletions(-)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index af4c9d70321d..ec3dc5b04226 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -118,15 +118,14 @@ pages:
     succeeds on tail pages.
 
   - map/unmap of PMD entry for the whole compound page increment/decrement
-    ->compound_mapcount, stored in the first tail page of the compound page.
+    ->compound_mapcount, stored in the first tail page of the compound page;
+    and also increment/decrement ->subpages_mapcount (also in the first tail)
+    by COMPOUND_MAPPED when compound_mapcount goes from -1 to 0 or 0 to -1.
 
   - map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount
     on relevant sub-page of the compound page, and also increment/decrement
     ->subpages_mapcount, stored in first tail page of the compound page, when
     _mapcount goes from -1 to 0 or 0 to -1: counting sub-pages mapped by PTE.
-    In order to have race-free accounting of sub-pages mapped, changes to
-    sub-page ->_mapcount, ->subpages_mapcount and ->compound_mapcount are
-    are all locked by bit_spin_lock of PG_locked in the first tail ->flags.
 
 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9e46d4d46f2..d8de9f63c376 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -827,13 +827,22 @@ static inline int head_compound_mapcount(struct page *head)
 	return atomic_read(compound_mapcount_ptr(head)) + 1;
 }
 
+/*
+ * If a 16GB hugetlb page were mapped by PTEs of all of its 4kB sub-pages,
+ * its subpages_mapcount would be 0x400000: choose the COMPOUND_MAPPED bit
+ * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
+ * leaves subpages_mapcount at 0, but avoid surprise if it participates later.
+ */
+#define COMPOUND_MAPPED	0x800000
+#define SUBPAGES_MAPPED	(COMPOUND_MAPPED - 1)
+
 /*
  * Number of sub-pages mapped by PTE, does not include compound mapcount.
  * Must be called only on head of compound page.
  */
 static inline int head_subpages_mapcount(struct page *head)
 {
-	return atomic_read(subpages_mapcount_ptr(head));
+	return atomic_read(subpages_mapcount_ptr(head)) & SUBPAGES_MAPPED;
 }
 
 /*
@@ -893,8 +902,12 @@ static inline int total_mapcount(struct page *page)
 
 static inline bool folio_large_is_mapped(struct folio *folio)
 {
-	return atomic_read(folio_mapcount_ptr(folio)) +
-		atomic_read(folio_subpages_mapcount_ptr(folio)) >= 0;
+	/*
+	 * Reading folio_mapcount_ptr() below could be omitted if hugetlb
+	 * participated in incrementing subpages_mapcount when compound mapped.
+	 */
+	return atomic_read(folio_subpages_mapcount_ptr(folio)) > 0 ||
+		atomic_read(folio_mapcount_ptr(folio)) >= 0;
 }
 
 /**
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 5dadb9a3e010..bd3504d11b15 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,15 +204,14 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long address);
 
-void page_dup_compound_rmap(struct page *page);
+static inline void __page_dup_rmap(struct page *page, bool compound)
+{
+	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
+}
 
 static inline void page_dup_file_rmap(struct page *page, bool compound)
 {
-	/* Is page being mapped by PTE? */
-	if (likely(!compound))
-		atomic_inc(&page->_mapcount);
-	else
-		page_dup_compound_rmap(page);
+	__page_dup_rmap(page, compound);
 }
 
 /**
@@ -261,7 +260,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
 	 * the page R/O into both processes.
 	 */
 dup:
-	page_dup_file_rmap(page, compound);
+	__page_dup_rmap(page, compound);
 	return 0;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f7a63684e6c4..400c51d06939 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1330,7 +1330,7 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 			bad_page(page, "nonzero compound_mapcount");
 			goto out;
 		}
-		if (unlikely(head_subpages_mapcount(head_page))) {
+		if (unlikely(atomic_read(subpages_mapcount_ptr(head_page)))) {
 			bad_page(page, "nonzero subpages_mapcount");
 			goto out;
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index e813785da613..459dc1c44d8a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1085,38 +1085,6 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 	return page_vma_mkclean_one(&pvmw);
 }
 
-struct compound_mapcounts {
-	unsigned int compound_mapcount;
-	unsigned int subpages_mapcount;
-};
-
-/*
- * lock_compound_mapcounts() first locks, then copies subpages_mapcount and
- * compound_mapcount from head[1].compound_mapcount and subpages_mapcount,
- * converting from struct page's internal representation to logical count
- * (that is, adding 1 to compound_mapcount to hide its offset by -1).
- */
-static void lock_compound_mapcounts(struct page *head,
-		struct compound_mapcounts *local)
-{
-	bit_spin_lock(PG_locked, &head[1].flags);
-	local->compound_mapcount = atomic_read(compound_mapcount_ptr(head)) + 1;
-	local->subpages_mapcount = atomic_read(subpages_mapcount_ptr(head));
-}
-
-/*
- * After caller has updated subpage._mapcount, local subpages_mapcount and
- * local compound_mapcount, as necessary, unlock_compound_mapcounts() converts
- * and copies them back to the compound head[1] fields, and then unlocks.
- */
-static void unlock_compound_mapcounts(struct page *head,
-		struct compound_mapcounts *local)
-{
-	atomic_set(compound_mapcount_ptr(head), local->compound_mapcount - 1);
-	atomic_set(subpages_mapcount_ptr(head), local->subpages_mapcount);
-	bit_spin_unlock(PG_locked, &head[1].flags);
-}
-
 int total_compound_mapcount(struct page *head)
 {
 	int mapcount = head_compound_mapcount(head);
@@ -1140,34 +1108,6 @@ int total_compound_mapcount(struct page *head)
 	return mapcount;
 }
 
-/*
- * page_dup_compound_rmap(), used when copying mm,
- * provides a simple example of using lock_ and unlock_compound_mapcounts().
- */
-void page_dup_compound_rmap(struct page *head)
-{
-	struct compound_mapcounts mapcounts;
-
-	/*
-	 * Hugetlb pages could use lock_compound_mapcounts(), like THPs do;
-	 * but at present they are still being managed by atomic operations:
-	 * which are likely to be somewhat faster, so don't rush to convert
-	 * them over without evaluating the effect.
-	 *
-	 * Note that hugetlb does not call page_add_file_rmap():
-	 * here is where hugetlb shared page mapcount is raised.
-	 */
-	if (PageHuge(head)) {
-		atomic_inc(compound_mapcount_ptr(head));
-	} else if (PageTransHuge(head)) {
-		/* That test is redundant: it's for safety or to optimize out */
-
-		lock_compound_mapcounts(head, &mapcounts);
-		mapcounts.compound_mapcount++;
-		unlock_compound_mapcounts(head, &mapcounts);
-	}
-}
-
 /**
  * page_move_anon_rmap - move a page to our anon_vma
  * @page:	the page to move to our anon_vma
@@ -1277,7 +1217,7 @@ static void __page_check_anon_rmap(struct page *page,
 void page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, rmap_t flags)
 {
-	struct compound_mapcounts mapcounts;
+	atomic_t *mapped;
 	int nr = 0, nr_pmdmapped = 0;
 	bool compound = flags & RMAP_COMPOUND;
 	bool first = true;
@@ -1290,24 +1230,20 @@ void page_add_anon_rmap(struct page *page,
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
 		if (first && PageCompound(page)) {
-			struct page *head = compound_head(page);
-
-			lock_compound_mapcounts(head, &mapcounts);
-			mapcounts.subpages_mapcount++;
-			nr = !mapcounts.compound_mapcount;
-			unlock_compound_mapcounts(head, &mapcounts);
+			mapped = subpages_mapcount_ptr(compound_head(page));
+			nr = atomic_inc_return_relaxed(mapped);
+			nr = !(nr & COMPOUND_MAPPED);
 		}
 	} else if (PageTransHuge(page)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
-		lock_compound_mapcounts(page, &mapcounts);
-		first = !mapcounts.compound_mapcount;
-		mapcounts.compound_mapcount++;
+		first = atomic_inc_and_test(compound_mapcount_ptr(page));
 		if (first) {
+			mapped = subpages_mapcount_ptr(page);
+			nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
 			nr_pmdmapped = thp_nr_pages(page);
-			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
+			nr = nr_pmdmapped - (nr & SUBPAGES_MAPPED);
 		}
-		unlock_compound_mapcounts(page, &mapcounts);
 	}
 
 	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
@@ -1360,6 +1296,7 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
+		atomic_set(subpages_mapcount_ptr(page), COMPOUND_MAPPED);
 		nr = thp_nr_pages(page);
 		__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
 	}
@@ -1379,7 +1316,7 @@ void page_add_new_anon_rmap(struct page *page,
 void page_add_file_rmap(struct page *page,
 	struct vm_area_struct *vma, bool compound)
 {
-	struct compound_mapcounts mapcounts;
+	atomic_t *mapped;
 	int nr = 0, nr_pmdmapped = 0;
 	bool first;
 
@@ -1391,24 +1328,20 @@ void page_add_file_rmap(struct page *page,
 		first = atomic_inc_and_test(&page->_mapcount);
 		nr = first;
 		if (first && PageCompound(page)) {
-			struct page *head = compound_head(page);
-
-			lock_compound_mapcounts(head, &mapcounts);
-			mapcounts.subpages_mapcount++;
-			nr = !mapcounts.compound_mapcount;
-			unlock_compound_mapcounts(head, &mapcounts);
+			mapped = subpages_mapcount_ptr(compound_head(page));
+			nr = atomic_inc_return_relaxed(mapped);
+			nr = !(nr & COMPOUND_MAPPED);
 		}
 	} else if (PageTransHuge(page)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
-		lock_compound_mapcounts(page, &mapcounts);
-		first = !mapcounts.compound_mapcount;
-		mapcounts.compound_mapcount++;
+		first = atomic_inc_and_test(compound_mapcount_ptr(page));
 		if (first) {
+			mapped = subpages_mapcount_ptr(page);
+			nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
 			nr_pmdmapped = thp_nr_pages(page);
-			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
+			nr = nr_pmdmapped - (nr & SUBPAGES_MAPPED);
 		}
-		unlock_compound_mapcounts(page, &mapcounts);
 	}
 
 	if (nr_pmdmapped)
@@ -1432,7 +1365,7 @@ void page_add_file_rmap(struct page *page,
 void page_remove_rmap(struct page *page,
 	struct vm_area_struct *vma, bool compound)
 {
-	struct compound_mapcounts mapcounts;
+	atomic_t *mapped;
 	int nr = 0, nr_pmdmapped = 0;
 	bool last;
 
@@ -1452,24 +1385,20 @@ void page_remove_rmap(struct page *page,
 		last = atomic_add_negative(-1, &page->_mapcount);
 		nr = last;
 		if (last && PageCompound(page)) {
-			struct page *head = compound_head(page);
-
-			lock_compound_mapcounts(head, &mapcounts);
-			mapcounts.subpages_mapcount--;
-			nr = !mapcounts.compound_mapcount;
-			unlock_compound_mapcounts(head, &mapcounts);
+			mapped = subpages_mapcount_ptr(compound_head(page));
+			nr = atomic_dec_return_relaxed(mapped);
+			nr = !(nr & COMPOUND_MAPPED);
 		}
 	} else if (PageTransHuge(page)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
-		lock_compound_mapcounts(page, &mapcounts);
-		mapcounts.compound_mapcount--;
-		last = !mapcounts.compound_mapcount;
+		last = atomic_add_negative(-1, compound_mapcount_ptr(page));
 		if (last) {
+			mapped = subpages_mapcount_ptr(page);
+			nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
 			nr_pmdmapped = thp_nr_pages(page);
-			nr = nr_pmdmapped - mapcounts.subpages_mapcount;
+			nr = nr_pmdmapped - (nr & SUBPAGES_MAPPED);
 		}
-		unlock_compound_mapcounts(page, &mapcounts);
 	}
 
 	if (nr_pmdmapped) {
-- 
2.35.3
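
A user-space sketch of the subpages_mapcount encoding used in the hunks
above: the low bits count PTE-mapped subpages, while a single high bit
records that the whole THP is PMD-mapped, so one relaxed atomic op both
updates the count and reports whether the other kind of mapping exists.
The constant values, helper names and NR_PAGES below are illustrative
assumptions modelled on the series, not copies of kernel code.

/* Toy model: subpages_mapcount stands in for head[1]'s field. */
#include <stdio.h>
#include <stdatomic.h>

#define COMPOUND_MAPPED  0x800000               /* "PMD-mapped" flag bit */
#define SUBPAGES_MAPPED  (COMPOUND_MAPPED - 1)  /* mask: PTE-mapped count */
#define NR_PAGES         512                    /* subpages per THP here */

static atomic_int subpages_mapcount;

/* PTE-map one subpage: it is charged to NR_*_MAPPED only if the THP is
 * not also PMD-mapped - mirrors the !(nr & COMPOUND_MAPPED) test above. */
static int map_subpage(void)
{
	int nr = atomic_fetch_add_explicit(&subpages_mapcount, 1,
					   memory_order_relaxed) + 1;
	return !(nr & COMPOUND_MAPPED);
}

/* PMD-map the whole THP: set the flag, then charge only those subpages
 * which were not already PTE-mapped - mirrors the
 * nr_pmdmapped - (nr & SUBPAGES_MAPPED) arithmetic above. */
static int map_whole_thp(void)
{
	int nr = atomic_fetch_add_explicit(&subpages_mapcount, COMPOUND_MAPPED,
					   memory_order_relaxed) + COMPOUND_MAPPED;
	return NR_PAGES - (nr & SUBPAGES_MAPPED);
}

int main(void)
{
	printf("first PTE map charges %d page\n", map_subpage());      /* 1 */
	printf("PMD map charges %d pages\n", map_whole_thp());         /* 511 */
	printf("PTE map after PMD charges %d pages\n", map_subpage()); /* 0 */
	return 0;
}

Compiled with any C11 compiler, the three printfs report 1, 511 and 0,
matching the nr arithmetic in the page_add_*_rmap() hunks above.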



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v2 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked()
  2022-11-22  9:38   ` [PATCH v2 " Hugh Dickins
  2022-11-22  9:42     ` [PATCH v2 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
  2022-11-22  9:49     ` [PATCH v2 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped Hugh Dickins
@ 2022-11-22  9:51     ` Hugh Dickins
  2022-12-05  1:38       ` Hugh Dickins
  2 siblings, 1 reply; 54+ messages in thread
From: Hugh Dickins @ 2022-11-22  9:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, Yu Zhao, Dan Carpenter,
	linux-kernel, linux-mm

It's hard to add a page_add_anon_rmap() into __split_huge_pmd_locked()'s
HPAGE_PMD_NR set_pte_at() loop, without wincing at the "freeze" case's
HPAGE_PMD_NR page_remove_rmap() loop below it.

It's just a mistake to add rmaps in the "freeze" (insert migration entries
prior to splitting huge page) case: the pmd_migration case already avoids
doing that, so just follow its lead.  page_ref_add() versus put_page()
likewise.  But why is one more put_page() needed in the "freeze" case?
Because it is removing the pmd rmap, which has already been removed in
the pmd_migration case (freeze and pmd_migration are mutually exclusive
cases).

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
v2: same as v1, plus Ack from Kirill

 mm/huge_memory.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3dee8665c585..ab5ab1a013e1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2135,7 +2135,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		uffd_wp = pmd_uffd_wp(old_pmd);
 
 		VM_BUG_ON_PAGE(!page_count(page), page);
-		page_ref_add(page, HPAGE_PMD_NR - 1);
 
 		/*
 		 * Without "freeze", we'll simply split the PMD, propagating the
@@ -2155,6 +2154,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
 		if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
 			freeze = false;
+		if (!freeze)
+			page_ref_add(page, HPAGE_PMD_NR - 1);
 	}
 
 	/*
@@ -2210,27 +2211,21 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_mksoft_dirty(entry);
 			if (uffd_wp)
 				entry = pte_mkuffd_wp(entry);
+			page_add_anon_rmap(page + i, vma, addr, false);
 		}
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
-		if (!pmd_migration)
-			page_add_anon_rmap(page + i, vma, addr, false);
 		pte_unmap(pte);
 	}
 
 	if (!pmd_migration)
 		page_remove_rmap(page, vma, true);
+	if (freeze)
+		put_page(page);
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
-
-	if (freeze) {
-		for (i = 0; i < HPAGE_PMD_NR; i++) {
-			page_remove_rmap(page + i, vma, false);
-			put_page(page + i);
-		}
-	}
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-- 
2.35.3
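
A toy model of the counting argument in the message above: with "freeze",
the migration entries take neither rmaps nor page references, so nothing
is added and the reference held for the pmd mapping is dropped by the one
put_page(); without "freeze", that reference is instead topped up to one
per new pte mapping.  Pure arithmetic under that illustrative "one
reference per mapping" assumption, not kernel code.

#include <assert.h>
#include <stdio.h>

#define HPAGE_PMD_NR 512	/* subpages per THP assumed here */

int main(void)
{
	int refs;

	/* !freeze: page_ref_add(page, HPAGE_PMD_NR - 1) turns the single
	 * reference held for the pmd mapping into one per pte mapping. */
	refs = 1;
	refs += HPAGE_PMD_NR - 1;
	assert(refs == HPAGE_PMD_NR);

	/* freeze: no page_ref_add(), no page_add_anon_rmap(); the pmd rmap
	 * is removed and its reference dropped by the single put_page()
	 * (the pmd_migration case had already dropped both earlier). */
	refs = 1;
	refs -= 1;
	assert(refs == 0);

	puts("references balance in both cases");
	return 0;
}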



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-21 17:16     ` Linus Torvalds
@ 2022-11-22 16:27       ` Shakeel Butt
  0 siblings, 0 replies; 54+ messages in thread
From: Shakeel Butt @ 2022-11-22 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, Johannes Weiner, Kirill A. Shutemov,
	Matthew Wilcox, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Mon, Nov 21, 2022 at 09:16:58AM -0800, Linus Torvalds wrote:
> On Mon, Nov 21, 2022 at 8:59 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > Is there a plan to remove lock_page_memcg() altogether which I missed? I
> > am planning to make lock_page_memcg() a nop for cgroup-v2 (as it shows
> > up in the perf profile on the exit path)
> 
> Yay. It seems I'm not the only one hating it.
> 
> > but if we are removing it then I should just wait.
> 
> Well, I think Johannes was saying that at least the case I disliked
> (the rmap removal from the page table tear-down - I strongly suspect
> it's the one you're seeing on your perf profile too)

Yes indeed that is the one.

-   99.89%     0.00%  fork-large-mmap  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     entry_SYSCALL_64_after_hwframe                               
   - do_syscall_64                                                
      - 48.94% __x64_sys_exit_group                               
           do_group_exit                                          
         - do_exit                                                
            - 48.94% exit_mm                                      
                 mmput                                            
               - __mmput                                          
                  - exit_mmap                                     
                     - 48.61% unmap_vmas                          
                        - 48.61% unmap_single_vma                 
                           - unmap_page_range                     
                              - 48.60% zap_p4d_range              
                                 - 44.66% zap_pte_range           
                                    + 12.61% tlb_flush_mmu        
                                    - 9.38% page_remove_rmap      
                                         2.50% lock_page_memcg    
                                         2.37% unlock_page_memcg  
                                         0.61% PageHuge           
                                      4.80% vm_normal_page        
                                      2.56% __tlb_remove_page_size
                                      0.85% lock_page_memcg       
                                      0.53% PageHuge              
                                   2.22% __tlb_remove_page_size   
                                   0.93% vm_normal_page           
                                   0.72% page_remove_rmap

> can be removed
> entirely as long as it's done under the page table lock (which my
> final version of the rmap delaying still was).
> 
> See
> 
>     https://lore.kernel.org/all/Y2llcRiDLHc2kg%2FN@cmpxchg.org/
> 
> for his preliminary patch.
> 
> That said, if you have some patch to make it a no-op for _other_
> reasons, and could be done away with _entirely_ (not just for rmap),
> then that would be even better.

I am actually looking at deprecating the whole "move charge"
functionality of cgroup-v1, i.e. the underlying reason lock_page_memcg
exists. That already does not work for a couple of cases, like
partially mapped THPs and madv_free'd pages. That deprecation process
will take some time, though; in the meantime I was looking at whether
we can make these functions a nop for cgroup-v2.

thanks,
Shakeel


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount
  2022-11-22  6:55         ` Johannes Weiner
@ 2022-11-22 16:30           ` Shakeel Butt
  0 siblings, 0 replies; 54+ messages in thread
From: Shakeel Butt @ 2022-11-22 16:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Matthew Wilcox, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Kirill A. Shutemov, David Hildenbrand, Vlastimil Babka, Peter Xu,
	Yang Shi, John Hubbard, Mike Kravetz, Sidhartha Kumar,
	Muchun Song, Miaohe Lin, Naoya Horiguchi, Mina Almasry,
	James Houghton, Zach O'Keefe, linux-kernel, linux-mm

On Tue, Nov 22, 2022 at 01:55:39AM -0500, Johannes Weiner wrote:
> On Tue, Nov 22, 2022 at 05:57:42AM +0000, Matthew Wilcox wrote:
> > On Mon, Nov 21, 2022 at 01:52:23PM -0500, Johannes Weiner wrote:
> > > That leaves clearing writeback. This can't hold the page lock due to
> > > the atomic context, so currently we need to take lock_page_memcg() as
> > > the lock of last resort.
> > > 
> > > I wonder if we can have cgroup take the xalock instead: writeback
> > > ending on file pages always acquires the xarray lock. Swap writeback
> > > currently doesn't, but we could make it so (swap_address_space).
> > > 
> > > The only thing that gives me pause is the !mapping check in
> > > __folio_end_writeback. File and swapcache pages usually have mappings,
> > > and truncation waits for writeback to finish before axing
> > > page->mapping. So AFAICS this can only happen if we call end_writeback
> > > on something that isn't under writeback - in which case the test_clear
> > > will fail and we don't update the stats anyway. But I want to be sure.
> > > 
> > > Does anybody know from the top of their heads if a page under
> > > writeback could be without a mapping in some weird cornercase?
> > 
> > I can't think of such a corner case.  We should always wait for
> > writeback to finish before removing the page from the page cache;
> > the writeback bit used to be (and kind of still is) an implicit
> > reference to the page, which means that we can't remove the page
> > cache's reference to the page without waiting for writeback.
> 
> Great, thanks!
> 
> > > If we could ensure that the NR_WRITEBACK decs are always protected by
> > > the xalock, we could grab it from mem_cgroup_move_account(), and then
> > > kill lock_page_memcg() altogether.
> > 
> > I'm not thrilled by this idea, but I'm not going to veto it.
> 
> Ok, I'm also happy to drop this one.
> 
> Certainly, the rmap one is the lowest-hanging fruit. I have the patch
> rebased against Hugh's series in mm-unstable; I'll wait for that to
> settle down, and then send an updated version to Andrew.

I am planning to initiate the deprecation of the move charge
functionality of cgroup-v1. So I would say let's go with the low-hanging
fruit for now and let the slow process of deprecation remove the
remaining cases.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked()
  2022-11-22  9:51     ` [PATCH v2 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked() Hugh Dickins
@ 2022-12-05  1:38       ` Hugh Dickins
  0 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2022-12-05  1:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Matthew Wilcox,
	David Hildenbrand, Vlastimil Babka, Peter Xu, Yang Shi,
	John Hubbard, Mike Kravetz, Sidhartha Kumar, Muchun Song,
	Miaohe Lin, Naoya Horiguchi, Mina Almasry, James Houghton,
	Zach O'Keefe, Yu Zhao, Dan Carpenter, linux-kernel, linux-mm

On Tue, 22 Nov 2022, Hugh Dickins wrote:

> It's hard to add a page_add_anon_rmap() into __split_huge_pmd_locked()'s
> HPAGE_PMD_NR set_pte_at() loop, without wincing at the "freeze" case's
> HPAGE_PMD_NR page_remove_rmap() loop below it.

No problem here, but I did later learn something worth sharing.

Comparing before and after vmstats for the series, I was worried to find
the thp_deferred_split_page count consistently much lower afterwards
(10%, perhaps even 1%, of what it had been), and thought maybe the
COMPOUND_MAPPED patch had messed up the accounting for when to call
deferred_split_huge_page().

But no: that's as before.  We can debate sometime whether it could do a
better job - the vast majority of calls to deferred_split_huge_page() are
just repeats - but that's a different story, one I'm not keen to get into
at the moment.

> -	if (freeze) {
> -		for (i = 0; i < HPAGE_PMD_NR; i++) {
> -			page_remove_rmap(page + i, vma, false);
> -			put_page(page + i);
> -		}
> -	}

The reason for the lower thp_deferred_split_page (at least in the kind
of testing I was doing) was a very good thing: those page_remove_rmap()
calls from __split_huge_pmd_locked() had very often been adding the page
to the deferred split queue, precisely while it was already being split.

The list management is such that there was no corruption, and splitting
calls from the split queue itself did not reach the point of bumping up
the thp_deferred_split_page count; but off-queue splits would add the
page before deleting it again, adding lots of noise to the count and,
I presume, unnecessary contention on the queue lock.

Hugh


^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2022-12-05  1:38 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-03  1:44 [PATCH 0/3] mm,huge,rmap: unify and speed up compound mapcounts Hugh Dickins
2022-11-03  1:48 ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Hugh Dickins
2022-11-03 21:18   ` Sidhartha Kumar
2022-11-04  4:29     ` Hugh Dickins
2022-11-10  0:11       ` Sidhartha Kumar
2022-11-10  2:10         ` Hugh Dickins
2022-11-10  2:13           ` [PATCH 1/3 fix] mm,hugetlb: use folio fields in second tail page: fix Hugh Dickins
2022-11-05 19:13   ` [PATCH 1/3] mm,hugetlb: use folio fields in second tail page Kirill A. Shutemov
2022-11-10  1:58     ` Hugh Dickins
2022-11-03  1:51 ` [PATCH 2/3] mm,thp,rmap: simplify compound page mapcount handling Hugh Dickins
2022-11-05 19:51   ` Kirill A. Shutemov
2022-11-10  2:49     ` Hugh Dickins
2022-11-03  1:53 ` [PATCH 3/3] mm,thp,rmap: lock_compound_mapcounts() on THP mapcounts Hugh Dickins
2022-11-05 20:06   ` Kirill A. Shutemov
2022-11-10  3:31     ` Hugh Dickins
2022-11-10  2:18 ` [PATCH 4/3] mm,thp,rmap: handle the normal !PageCompound case first Hugh Dickins
2022-11-10  3:23   ` Linus Torvalds
2022-11-10  4:21     ` Hugh Dickins
2022-11-10 16:31     ` Matthew Wilcox
2022-11-10 16:58       ` Linus Torvalds
2022-11-18  9:08 ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Hugh Dickins
2022-11-18  9:12   ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
2022-11-19  0:12     ` Yu Zhao
2022-11-19  0:37       ` Hugh Dickins
2022-11-19  1:35         ` [PATCH 1/3 fix] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages: fix Hugh Dickins
2022-11-21 12:38           ` Kirill A. Shutemov
2022-11-22  9:13             ` Hugh Dickins
2022-11-21 12:36     ` [PATCH 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Kirill A. Shutemov
2022-11-22  9:03       ` Hugh Dickins
2022-11-18  9:14   ` [PATCH 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped Hugh Dickins
2022-11-21 13:09     ` Kirill A. Shutemov
2022-11-22  9:33       ` Hugh Dickins
2022-11-18  9:16   ` [PATCH 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked() Hugh Dickins
2022-11-21 13:24     ` Kirill A. Shutemov
2022-11-18 20:18   ` [PATCH 0/3] mm,thp,rmap: rework the use of subpages_mapcount Linus Torvalds
2022-11-18 20:42     ` Johannes Weiner
2022-11-18 20:51     ` Hugh Dickins
2022-11-18 22:03       ` Andrew Morton
2022-11-18 22:07         ` Linus Torvalds
2022-11-18 22:10         ` Hugh Dickins
2022-11-18 22:23           ` Andrew Morton
2022-11-21 16:59   ` Shakeel Butt
2022-11-21 17:16     ` Linus Torvalds
2022-11-22 16:27       ` Shakeel Butt
2022-11-21 18:52     ` Johannes Weiner
2022-11-22  1:32       ` Hugh Dickins
2022-11-22  5:57       ` Matthew Wilcox
2022-11-22  6:55         ` Johannes Weiner
2022-11-22 16:30           ` Shakeel Butt
2022-11-22  9:38   ` [PATCH v2 " Hugh Dickins
2022-11-22  9:42     ` [PATCH v2 1/3] mm,thp,rmap: subpages_mapcount of PTE-mapped subpages Hugh Dickins
2022-11-22  9:49     ` [PATCH v2 2/3] mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped Hugh Dickins
2022-11-22  9:51     ` [PATCH v2 3/3] mm,thp,rmap: clean up the end of __split_huge_pmd_locked() Hugh Dickins
2022-12-05  1:38       ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).