* [PATCH v21 00/19] per memcg lru lock
@ 2020-11-05  8:55 ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

This version is rebased on next/master 20201104, with many of Johannes's
Acks and some changes made according to his comments. It also adds a new
patch, v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch, to
support v21-0007.

This patchset follows the 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
added to the -mm tree yesterday.
 
Many thanks to Hugh Dickins, Alexander Duyck and Johannes Weiner for their
line-by-line reviews.

So now this patchset consists of 3 parts:
1, some code cleanup and minor optimization as preparation.
2, use TestClearPageLRU as the precondition for page isolation.
3, replace the per-node lru_lock with a per-memcg, per-node lru_lock.

The current lru_lock is a single per-node lock, pgdat->lru_lock, guarding
the lru lists, even though the lru lists themselves were moved into memcg
long ago. Keeping a per-node lru_lock is clearly unscalable: pages of all
memcgs on a node have to compete with each other for one lru_lock. This
patchset uses a per-lruvec (per-memcg, per-node) lru_lock to guard the lru
lists instead, making the locking scale with memcgs and gaining performance.
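
Conceptually the lock simply moves from the node into the lruvec. A minimal
sketch of the resulting layout, for illustration only (the exact field
placement is in the mmzone.h hunk of the final patch of the series):

	struct lruvec {
		struct list_head	lists[NR_LRU_LISTS];
		/* per-lruvec lru_lock, replacing pgdat->lru_lock */
		spinlock_t		lru_lock;
		/* other fields unchanged */
	};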

Currently lru_lock still guards both the lru list and the page's lru bit;
that is fine. But if we want to take the page's specific lruvec lock, we
must pin down the page's lruvec/memcg while locking, since simply taking a
lruvec lock first can be undermined by memcg charge/migration of the page.
To fix this, we clear the page's lru bit first and use that as the pin that
blocks memcg changes. That is the reason for the new atomic function
TestClearPageLRU. So isolating a page now requires both actions:
TestClearPageLRU and holding the lru_lock.
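
For illustration only, the isolation sequence that falls out of this is
roughly the following; lock_page_lruvec_irq()/unlock_page_lruvec_irq() are
assumed names for the per-lruvec lock helpers introduced later in the
series:

	if (!TestClearPageLRU(page))
		return;		/* lost the race: someone else isolated it */

	/*
	 * Clearing the lru bit is the pin described above: it blocks memcg
	 * charge/migration, so the page's lruvec cannot change under us.
	 */
	lruvec = lock_page_lruvec_irq(page);	/* assumed helper */
	del_page_from_lru_list(page, lruvec, page_lru(page));
	unlock_page_lruvec_irq(lruvec);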

The typical user of this is isolate_migratepages_block() in compaction.c:
we have to clear the lru bit before taking the lru lock, which serializes
page isolation against memcg page charge/migration, since those can change
the page's lruvec and therefore the lru_lock guarding it.

The above solution was suggested by Johannes Weiner and builds on his new
memcg charge path, resulting in this patchset. (Hugh Dickins tested and
contributed much code, from the compaction fix to general code polish,
thanks a lot!)

Daniel Jordan's testing showed a 62% improvement on a modified readtwice
case on his 2-socket * 10-core * 2-HT Broadwell box with v18, which is not
much different from this version:
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up this
idea 8 years ago, and to the others who gave comments as well: Daniel Jordan,
Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc.

Thanks for the testing support from Intel 0day, Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!


Alex Shi (16):
  mm/thp: move lru_add_page_tail func to huge_memory.c
  mm/thp: use head for head page in lru_add_page_tail
  mm/thp: Simplify lru_add_page_tail()
  mm/thp: narrow lru locking
  mm/vmscan: remove unnecessary lruvec adding
  mm/rmap: stop store reordering issue on page->mapping
  mm/memcg: add debug checking in lock_page_memcg
  mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  mm/lru: move lock into lru_note_cost
  mm/vmscan: remove lruvec reget in move_pages_to_lru
  mm/mlock: remove lru_lock on TestClearPageMlocked
  mm/mlock: remove __munlock_isolate_lru_page
  mm/lru: introduce TestClearPageLRU
  mm/compaction: do page isolation first in compaction
  mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  mm/lru: replace pgdat lru_lock with lruvec lock

Alexander Duyck (1):
  mm/lru: introduce the relock_page_lruvec function

Hugh Dickins (2):
  mm: page_idle_get_page() does not need lru_lock
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
 Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 +--
 include/linux/memcontrol.h                         | 110 +++++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmzone.h                             |   6 +-
 include/linux/page-flags.h                         |   1 +
 include/linux/swap.h                               |   4 +-
 mm/compaction.c                                    |  94 +++++++---
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  45 +++--
 mm/memcontrol.c                                    |  79 +++++++-
 mm/mlock.c                                         |  63 ++-----
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   4 -
 mm/rmap.c                                          |  11 +-
 mm/swap.c                                          | 208 ++++++++-------------
 mm/vmscan.c                                        | 207 ++++++++++----------
 mm/workingset.c                                    |   2 -
 21 files changed, 530 insertions(+), 372 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH v21 01/19] mm/thp: move lru_add_page_tail func to huge_memory.c
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

The function is only used in huge_memory.c; defining it in another file
under a CONFIG_TRANSPARENT_HUGEPAGE restriction just looks weird.

Let's move it into the THP code, and make it static as Hugh Dickins suggested.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 --
 mm/huge_memory.c     | 30 ++++++++++++++++++++++++++++++
 mm/swap.c            | 33 ---------------------------------
 3 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 667935c0dbd4..5e1e967c225f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
 			  unsigned int nr_pages);
 extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
-extern void lru_add_page_tail(struct page *page, struct page *page_tail,
-			 struct lruvec *lruvec, struct list_head *head);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 08a183f6c3ab..8f16e991f7cc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2348,6 +2348,36 @@ static void remap_page(struct page *page, unsigned int nr)
 	}
 }
 
+static void lru_add_page_tail(struct page *page, struct page *page_tail,
+		struct lruvec *lruvec, struct list_head *list)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+
+	if (!list)
+		SetPageLRU(page_tail);
+
+	if (likely(PageLRU(page)))
+		list_add_tail(&page_tail->lru, &page->lru);
+	else if (list) {
+		/* page reclaim is reclaiming a huge page */
+		get_page(page_tail);
+		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Head page has not yet been counted, as an hpage,
+		 * so we must account for each subpage individually.
+		 *
+		 * Put page_tail on the list at the correct position
+		 * so they all end up in order.
+		 */
+		add_page_to_lru_list_tail(page_tail, lruvec,
+					  page_lru(page_tail));
+	}
+}
+
 static void __split_huge_page_tail(struct page *head, int tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
diff --git a/mm/swap.c b/mm/swap.c
index 29220174433b..8a578381c2fc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -977,39 +977,6 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct page *page, struct page *page_tail,
-		       struct lruvec *lruvec, struct list_head *list)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
-
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
-	else if (list) {
-		/* page reclaim is reclaiming a huge page */
-		get_page(page_tail);
-		list_add_tail(&page_tail->lru, list);
-	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
-	}
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 				 void *arg)
 {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 02/19] mm/thp: use head for head page in lru_add_page_tail
  2020-11-05  8:55 ` Alex Shi
@ 2020-11-05  8:55 ` Alex Shi
  -1 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Since the first parameter is only ever the head page, it's better to make
that explicit in its name.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f16e991f7cc..60726eb26840 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2348,33 +2348,32 @@ static void remap_page(struct page *page, unsigned int nr)
 	}
 }
 
-static void lru_add_page_tail(struct page *page, struct page *page_tail,
+static void lru_add_page_tail(struct page *head, struct page *tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	VM_BUG_ON_PAGE(!PageHead(head), head);
+	VM_BUG_ON_PAGE(PageCompound(tail), head);
+	VM_BUG_ON_PAGE(PageLRU(tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
 	if (!list)
-		SetPageLRU(page_tail);
+		SetPageLRU(tail);
 
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
+	if (likely(PageLRU(head)))
+		list_add_tail(&tail->lru, &head->lru);
 	else if (list) {
 		/* page reclaim is reclaiming a huge page */
-		get_page(page_tail);
-		list_add_tail(&page_tail->lru, list);
+		get_page(tail);
+		list_add_tail(&tail->lru, list);
 	} else {
 		/*
 		 * Head page has not yet been counted, as an hpage,
 		 * so we must account for each subpage individually.
 		 *
-		 * Put page_tail on the list at the correct position
+		 * Put tail on the list at the correct position
 		 * so they all end up in order.
 		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
+		add_page_to_lru_list_tail(tail, lruvec, page_lru(tail));
 	}
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 03/19] mm/thp: Simplify lru_add_page_tail()
  2020-11-05  8:55 ` Alex Shi
@ 2020-11-05  8:55   ` Alex Shi
  -1 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Mika Penttilä

Simplify lru_add_page_tail(), there are actually only two cases possible:
split_huge_page_to_list(), with list supplied and head isolated from lru
by its caller; or split_huge_page(), with NULL list and head on lru -
because when head is racily isolated from lru, the isolator's reference
will stop the split from getting any further than its page_ref_freeze().

So decide between the two cases by "list", but add VM_WARN_ON()s to
verify that they match our lru expectations.

[Hugh Dickins: rewrite commit log]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 60726eb26840..79318d7f7d5d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2356,24 +2356,16 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	VM_BUG_ON_PAGE(PageLRU(tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-	if (!list)
-		SetPageLRU(tail);
-
-	if (likely(PageLRU(head)))
-		list_add_tail(&tail->lru, &head->lru);
-	else if (list) {
+	if (list) {
 		/* page reclaim is reclaiming a huge page */
+		VM_WARN_ON(PageLRU(head));
 		get_page(tail);
 		list_add_tail(&tail->lru, list);
 	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(tail, lruvec, page_lru(tail));
+		/* head is still on lru (and we have it frozen) */
+		VM_WARN_ON(!PageLRU(head));
+		SetPageLRU(tail);
+		list_add_tail(&tail->lru, &head->lru);
 	}
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 04/19] mm/thp: narrow lru locking
  2020-11-05  8:55 ` Alex Shi
@ 2020-11-05  8:55 ` Alex Shi
  -1 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Andrea Arcangeli

lru_lock and page cache xa_lock have no obvious reason to be taken
one way round or the other: until now, lru_lock has been taken before
page cache xa_lock, when splitting a THP; but nothing else takes them
together.  Reverse that ordering: let's narrow the lru locking - but
leave local_irq_disable to block interrupts throughout, like before.

Hugh Dickins' point: split_huge_page_to_list() was already silly to be
using the _irqsave variant: it has just been taking sleeping locks, so it
would already be broken if entered with interrupts disabled.  So we can
avoid passing the flags argument down to __split_huge_page().

Why change the lock ordering here? That was hard to decide. One reason:
when this series reaches per-memcg lru locking, it relies on the THP's
memcg to be stable when taking the lru_lock: that is now done after the
THP's refcount has been frozen, which ensures page memcg cannot change.

Another reason: previously, lock_page_memcg()'s move_lock was presumed
to nest inside lru_lock; but now lru_lock must nest inside (page cache
lock inside) move_lock, so it becomes possible to use lock_page_memcg()
to stabilize page memcg before taking its lru_lock.  That is not the
mechanism used in this series, but it is an option we want to keep open.
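
So the lock ordering in the split path after this patch is, in condensed
form (matching the diff below; the swap cache case is analogous):

	local_irq_disable();			/* irqs stay off throughout */
	xa_lock(&mapping->i_pages);		/* page cache lock first */
	spin_lock(&pgdat->lru_lock);		/* lru_lock now nests inside */
	/* ... __split_huge_page_tail() for each tail page ... */
	spin_unlock(&pgdat->lru_lock);
	xa_unlock(&mapping->i_pages);
	local_irq_enable();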

[Hugh Dickins: rewrite commit log]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 79318d7f7d5d..b70ec0c6076b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2435,7 +2435,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+		pgoff_t end)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
@@ -2445,8 +2445,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned int nr = thp_nr_pages(head);
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2458,6 +2456,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock(&pgdat->lru_lock);
+
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
@@ -2477,6 +2480,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
+	spin_unlock(&pgdat->lru_lock);
+	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
 
@@ -2494,8 +2499,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		page_ref_add(head, 2);
 		xa_unlock(&head->mapping->i_pages);
 	}
-
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	local_irq_enable();
 
 	remap_page(head, nr);
 
@@ -2641,12 +2645,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
-	unsigned long flags;
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2707,9 +2709,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
-
+	/* block interrupt reentry in xa_lock and spinlock */
+	local_irq_disable();
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
 
@@ -2739,7 +2740,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 				__dec_lruvec_page_state(head, NR_FILE_THPS);
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end);
 		ret = 0;
 	} else {
 		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
@@ -2753,7 +2754,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		local_irq_enable();
 		remap_page(head, thp_nr_pages(head));
 		ret = -EBUSY;
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 05/19] mm/vmscan: remove unnecessary lruvec adding
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

We don't have to add a freeable page to the lru and then remove it again.
This change saves a couple of actions and makes the code movement clearer.

SetPageLRU needs to stay before put_page_testzero for list integrity,
otherwise:

  #0 move_pages_to_lru             #1 release_pages
  if !put_page_testzero
     			           if (put_page_testzero())
     			              !PageLRU //skip lru_lock
     SetPageLRU()
     list_add(&page->lru,)
                                         list_add(&page->lru,)

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmscan.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 12a4873942e2..b9935668d121 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1852,26 +1852,30 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
+		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			list_del(&page->lru);
 			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
 			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
+		/*
+		 * The SetPageLRU needs to be kept here for list integrity.
+		 * Otherwise:
+		 *   #0 move_pages_to_lru             #1 release_pages
+		 *   if !put_page_testzero
+		 *				      if (put_page_testzero())
+		 *				        !PageLRU //skip lru_lock
+		 *     SetPageLRU()
+		 *     list_add(&page->lru,)
+		 *                                        list_add(&page->lru,)
+		 */
 		SetPageLRU(page);
-		lru = page_lru(page);
 
-		nr_pages = thp_nr_pages(page);
-		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
-		list_move(&page->lru, &lruvec->lists[lru]);
-
-		if (put_page_testzero(page)) {
+		if (unlikely(put_page_testzero(page))) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&pgdat->lru_lock);
@@ -1879,11 +1883,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
-		} else {
-			nr_moved += nr_pages;
-			if (PageActive(page))
-				workingset_age_nonresident(lruvec, nr_pages);
+
+			continue;
 		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lru = page_lru(page);
+		nr_pages = thp_nr_pages(page);
+
+		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
+		list_add(&page->lru, &lruvec->lists[lru]);
+		nr_moved += nr_pages;
+		if (PageActive(page))
+			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
 	/*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 06/19] mm/rmap: stop store reordering issue on page->mapping
  2020-11-05  8:55 ` Alex Shi
@ 2020-11-05  8:55 ` Alex Shi
  2020-11-06  1:20     ` Alex Shi
  -1 siblings, 1 reply; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Minchan Kim

Hugh Dickins and Minchan Kim observed a long-standing issue, discussed at
https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
but the fix mentioned there was never actually applied.
Store reordering may cause a problem in the following scenario:

	CPU 0						CPU1
   do_anonymous_page
	page_add_new_anon_rmap()
	  page->mapping = anon_vma + PAGE_MAPPING_ANON
	lru_cache_add_inactive_or_unevictable()
	  spin_lock(lruvec->lock)
	  SetPageLRU()
	  spin_unlock(lruvec->lock)
						/* idle tracking judged it as an
						 * LRU page, so it passes the page
						 * into page_idle_clear_pte_refs
						 */
						page_idle_clear_pte_refs
						  rmap_walk
						    if PageAnon(page)

Johannes gave a detailed example of how the store reordering could cause
trouble: the concern is that SetPageLRU may get reordered before the
'page->mapping' store, which would let CPU 1 observe page->mapping after
observing PageLRU set on the page, seeing one of:

1. anon_vma + PAGE_MAPPING_ANON

   That's the in-order scenario and is fine.

2. NULL

   That's possible if the page->mapping store gets reordered to occur
   after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

   That would be a problem and could lead to all kinds of undesirable
   behavior including crashes and data corruption.

   Is it possible? AFAICT the compiler is allowed to tear the store to
   page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment.
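
To make case 3 concrete, the kind of transformation a compiler is allowed to
make on a plain store looks roughly like this (an illustration, not actual
generated code):

	/* the source, effectively: */
	page->mapping = (struct address_space *)((void *)anon_vma + PAGE_MAPPING_ANON);

	/* what the compiler may legally emit without WRITE_ONCE(): */
	page->mapping = (struct address_space *)anon_vma;	/* intermediate */
	page->mapping = (struct address_space *)((void *)anon_vma + PAGE_MAPPING_ANON);

A racing reader that sees the intermediate value gets an anon_vma pointer
without the PAGE_MAPPING_ANON bit, which is exactly case 3. WRITE_ONCE()
forces a single store and rules that out.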

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/rmap.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 1b84945d655c..078d54da59d4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,8 +1054,13 @@ static void __page_set_anon_rmap(struct page *page,
 	if (!exclusive)
 		anon_vma = anon_vma->root;
 
+	/*
+	 * Prevent page->mapping from pointing to an anon_vma without
+	 * the PAGE_MAPPING_ANON bit set.  This could happen if the
+	 * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
+	 */
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	page->mapping = (struct address_space *) anon_vma;
+	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 	page->index = linear_page_index(vma, address);
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 07/19] mm: page_idle_get_page() does not need lru_lock
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Vlastimil Babka, Minchan Kim

From: Hugh Dickins <hughd@google.com>

It is necessary for page_idle_get_page() to recheck PageLRU() after
get_page_unless_zero(), but holding lru_lock around that serves no
useful purpose, and adds to lru_lock contention: delete it.

See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
discussion that led to lru_lock there; but __page_set_anon_rmap() now
uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs()
using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly
but not entirely prevented by page_count() check in ksm.c's
write_protect_page(): that risk being shared with page_referenced() and
not helped by lru_lock).

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/page_idle.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 057c61df12db..64e5344a992c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,19 +32,15 @@
 static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page = pfn_to_online_page(pfn);
-	pg_data_t *pgdat;
 
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
 
-	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
 	return page;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 08/19] mm/memcg: add debug checking in lock_page_memcg
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

Add debug checking in lock_page_memcg(), so that we get a warning if
anything is wrong here.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2aa3b73ab82..157b745031a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2121,6 +2121,12 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 	if (unlikely(!memcg))
 		return NULL;
 
+#ifdef CONFIG_PROVE_LOCKING
+	local_irq_save(flags);
+	might_lock(&memcg->move_lock);
+	local_irq_restore(flags);
+#endif
+
 	if (atomic_read(&memcg->moving_account) <= 0)
 		return memcg;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 09/19] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  2020-11-05  8:55 ` Alex Shi
@ 2020-11-05  8:55 ` Alex Shi
  -1 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Fold the PGROTATED event collection into the pagevec_move_tail_fn callback,
as the other callbacks of pagevec_lru_move_fn already do, so we can drop the
pagevec_move_tail() wrapper.
Now all users of pagevec_lru_move_fn are the same, and there is no need for
its 3rd parameter.

This only simplifies the calling convention; no functional change.

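For reference, after the fold the callback itself bumps the PGROTATED
count; the result, condensed from the diff below, looks like:

	static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
	{
		if (PageLRU(page) && !PageUnevictable(page)) {
			del_page_from_lru_list(page, lruvec, page_lru(page));
			ClearPageActive(page);
			add_page_to_lru_list_tail(page, lruvec, page_lru(page));
			__count_vm_events(PGROTATED, thp_nr_pages(page));
		}
	}
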
[lkp@intel.com: found a build issue in the original patch, thanks]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 65 ++++++++++++++++++++++-----------------------------------------
 1 file changed, 23 insertions(+), 42 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 8a578381c2fc..ce8c97146e0d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -204,8 +204,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
-	void *arg)
+	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
 	struct pglist_data *pgdat = NULL;
@@ -224,7 +223,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		(*move_fn)(page, lruvec, arg);
+		(*move_fn)(page, lruvec);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -232,35 +231,22 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	int *pgmoved = arg;
-
 	if (PageLRU(page) && !PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
-		(*pgmoved) += thp_nr_pages(page);
+		__count_vm_events(PGROTATED, thp_nr_pages(page));
 	}
 }
 
 /*
- * pagevec_move_tail() must be called with IRQ disabled.
- * Otherwise this may cause nasty races.
- */
-static void pagevec_move_tail(struct pagevec *pvec)
-{
-	int pgmoved = 0;
-
-	pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
-	__count_vm_events(PGROTATED, pgmoved);
-}
-
-/*
  * Writeback is about to end against a page which has been marked for immediate
  * reclaim.  If it still appears to be reclaimable, move it to the tail of the
  * inactive list.
+ *
+ * rotate_reclaimable_page() must disable IRQs, to prevent nasty races.
  */
 void rotate_reclaimable_page(struct page *page)
 {
@@ -273,7 +259,7 @@ void rotate_reclaimable_page(struct page *page)
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_move_tail(pvec);
+			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 }
@@ -315,8 +301,7 @@ void lru_note_cost_page(struct page *page)
 		      page_is_file_lru(page), thp_nr_pages(page));
 }
 
-static void __activate_page(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -340,7 +325,7 @@ static void activate_page_drain(int cpu)
 	struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
 
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, __activate_page, NULL);
+		pagevec_lru_move_fn(pvec, __activate_page);
 }
 
 static bool need_activate_page_drain(int cpu)
@@ -358,7 +343,7 @@ static void activate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, __activate_page, NULL);
+			pagevec_lru_move_fn(pvec, __activate_page);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -374,7 +359,7 @@ static void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
+	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -525,8 +510,7 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
-			      void *arg)
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 {
 	int lru;
 	bool active;
@@ -573,8 +557,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -591,8 +574,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
@@ -636,21 +618,21 @@ void lru_add_drain_cpu(int cpu)
 
 		/* No harm done if a racing interrupt already did this */
 		local_lock_irqsave(&lru_rotate.lock, flags);
-		pagevec_move_tail(pvec);
+		pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
 	activate_page_drain(cpu);
 }
@@ -679,7 +661,7 @@ void deactivate_file_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -701,7 +683,7 @@ void deactivate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -723,7 +705,7 @@ void mark_page_lazyfree(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -977,8 +959,7 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 {
 	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
@@ -1037,7 +1018,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
 }
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 10/19] mm/lru: move lock into lru_note_cost
  2020-11-05  8:55 ` Alex Shi
                   ` (9 preceding siblings ...)
  (?)
@ 2020-11-05  8:55 ` Alex Shi
  -1 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Move the lru_lock into lru_note_cost, since the function walks up the
memcg tree; this is needed for the later switch to a per-lruvec
lru_lock. It is a bit ugly and may cost slightly more locking, but the
benefit of finer-grained per-memcg locking should outweigh the loss.

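A condensed sketch of the resulting loop (see the diff below), with the
node's lru_lock taken and dropped around each level's update:

	void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
	{
		do {
			struct pglist_data *pgdat = lruvec_pgdat(lruvec);

			spin_lock_irq(&pgdat->lru_lock);
			/* record the cost event and rebalance file/anon cost;
			 * body elided here, see the diff */
			spin_unlock_irq(&pgdat->lru_lock);
		} while ((lruvec = parent_lruvec(lruvec)));
	}
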
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c       | 3 +++
 mm/vmscan.c     | 4 +---
 mm/workingset.c | 2 --
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index ce8c97146e0d..2681d9023998 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -268,7 +268,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+		spin_lock_irq(&pgdat->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -292,6 +294,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
+		spin_unlock_irq(&pgdat->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b9935668d121..d771f812e983 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1973,19 +1973,17 @@ static int current_may_throttle(void)
 				&stat, false);
 
 	spin_lock_irq(&pgdat->lru_lock);
-
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	lru_note_cost(lruvec, file, stat.nr_pageout);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-
 	spin_unlock_irq(&pgdat->lru_lock);
 
+	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
 	free_unref_page_list(&page_list);
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 130348cbf40a..a915a812c363 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -381,9 +381,7 @@ void workingset_refault(struct page *page, void *shadow)
 	if (workingset) {
 		SetPageWorkingset(page);
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
-		spin_lock_irq(&page_pgdat(page)->lru_lock);
 		lru_note_cost_page(page);
-		spin_unlock_irq(&page_pgdat(page)->lru_lock);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file);
 	}
 out:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 11/19] mm/vmscan: remove lruvec reget in move_pages_to_lru
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Michal Hocko

An isolated page cannot be recharged to another memcg, since memcg
migration is not possible while the page is isolated. All pages here
were isolated from the same lruvec (and isolation inhibits memcg
migration), so there is no need to look the lruvec up again. Remove
the unnecessary reget.

Thanks to Alexander Duyck for pointing this out.

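Instead of looking the lruvec up again, the loop now only asserts the
invariant (as in the diff below):

	/*
	 * All pages were isolated from the same lruvec (and isolation
	 * inhibits memcg migration).
	 */
	VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page))
						!= lruvec, page);
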
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 mm/vmscan.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d771f812e983..cb2f6256a7d6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1887,7 +1887,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		/*
+		 * All pages were isolated from the same lruvec (and isolation
+		 * inhibits memcg migration).
+		 */
+		VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page))
+							!= lruvec, page);
 		lru = page_lru(page);
 		nr_pages = thp_nr_pages(page);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 12/19] mm/mlock: remove lru_lock on TestClearPageMlocked
  2020-11-05  8:55 ` Alex Shi
                   ` (11 preceding siblings ...)
  (?)
@ 2020-11-05  8:55 ` Alex Shi
  2020-11-11 13:03     ` Vlastimil Babka
  -1 siblings, 1 reply; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Kirill A. Shutemov, Vlastimil Babka

In munlock_vma_page(), the comments said the lru_lock was needed to
serialize with split_huge_page. But the page must be PageLocked here,
just as it is in the split_huge_page path, so PageLocked alone is
enough to serialize the two.

Furthermore, Hugh Dickins pointed out that before splitting in
split_huge_page_to_list(), the page is unmapped by unmap_page() to
remove its pmds/ptes, which already protects it from munlock. Thus
there is no need to guard __split_huge_page_tail() for the mlock
clearing; the lru_lock is kept there only for isolation.

LKP found a preemption issue with __mod_zone_page_state, which is why
it is changed to mod_zone_page_state. Thanks!

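With the lru_lock gone, the function reduces to roughly the following
(a condensed view of the result of the diff below):

	unsigned int munlock_vma_page(struct page *page)
	{
		int nr_pages;

		/* For try_to_munlock() and to serialize with page migration */
		BUG_ON(!PageLocked(page));
		VM_BUG_ON_PAGE(PageTail(page), page);

		if (!TestClearPageMlocked(page)) {
			/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
			return 0;
		}

		nr_pages = thp_nr_pages(page);
		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);

		if (!isolate_lru_page(page))
			__munlock_isolated_page(page);
		else
			__munlock_isolation_failed(page);

		return nr_pages - 1;
	}
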
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 26 +++++---------------------
 1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 884b1216da6a..796c726a0407 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -187,40 +187,24 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	pg_data_t *pgdat = page_pgdat(page);
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
-
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
-	/*
-	 * Serialize with any parallel __split_huge_page_refcount() which
-	 * might otherwise copy PageMlocked to part of the tail pages before
-	 * we clear it in the head page. It also stabilizes thp_nr_pages().
-	 */
-	spin_lock_irq(&pgdat->lru_lock);
-
 	if (!TestClearPageMlocked(page)) {
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
+		return 0;
 	}
 
 	nr_pages = thp_nr_pages(page);
-	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+	mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&pgdat->lru_lock);
+	if (!isolate_lru_page(page))
 		__munlock_isolated_page(page);
-		goto out;
-	}
-	__munlock_isolation_failed(page);
-
-unlock_out:
-	spin_unlock_irq(&pgdat->lru_lock);
+	else
+		__munlock_isolation_failed(page);
 
-out:
 	return nr_pages - 1;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 13/19] mm/mlock: remove __munlock_isolate_lru_page
  2020-11-05  8:55 ` Alex Shi
                   ` (12 preceding siblings ...)
  (?)
@ 2020-11-05  8:55 ` Alex Shi
  2020-11-11 13:07   ` Vlastimil Babka
  -1 siblings, 1 reply; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Kirill A. Shutemov, Vlastimil Babka

The function has only one caller; open-code it there to clean up and
simplify the code.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 31 +++++++++----------------------
 1 file changed, 9 insertions(+), 22 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 796c726a0407..d487aa864e86 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,26 +106,6 @@ void mlock_vma_page(struct page *page)
 }
 
 /*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
-{
-	if (PageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (getpage)
-			get_page(page);
-		ClearPageLRU(page);
-		del_page_from_lru_list(page, lruvec, page_lru(page));
-		return true;
-	}
-
-	return false;
-}
-
-/*
  * Finish munlock after successful page isolation
  *
  * Page must be locked. This is a wrapper for try_to_munlock()
@@ -296,9 +276,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, false))
+			if (PageLRU(page)) {
+				struct lruvec *lruvec;
+
+				ClearPageLRU(page);
+				lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+				del_page_from_lru_list(page, lruvec,
+							page_lru(page));
 				continue;
-			else
+			} else
 				__munlock_isolation_failed(page);
 		} else {
 			delta_munlocked++;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 14/19] mm/lru: introduce TestClearPageLRU
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

Currently lru_lock guards both the lru list and the page's lru bit,
which is fine. But if we want to take a specific lruvec lock for a
page, we need to pin down the page's lruvec/memcg while locking. Just
taking the lruvec lock first can be undermined by a concurrent memcg
charge/migration of the page. To fix this, we clear the lru bit
outside of the lock and use that as the pin-down action that blocks
page isolation during a memcg change.

So the standard steps of page isolation become:
	1, get_page();	          #pin the page so it cannot be freed
	2, TestClearPageLRU();    #block other isolation, e.g. memcg change
	3, spin_lock on lru_lock; #serialize lru list access
	4, delete page from lru list;

This patch starts with the first part: TestClearPageLRU, which
combines the PageLRU check and ClearPageLRU into one atomic operation.
It will be used as the page isolation precondition, to exclude other
isolation attempts elsewhere. As a consequence there may be !PageLRU
pages on an lru list, so the corresponding BUG() checks have to be
removed.

There are now 2 rules for the lru bit:
1, The lru bit still indicates whether a page is on an lru list,
   except for the temporary window while it is being isolated: the
   page may have the bit cleared while still on the list, but whenever
   the bit is set, the page must be on an lru list.
2, The lru bit has to be cleared before the page is deleted from the
   lru list.

As Andrew Morton mentioned, this change dirties the cacheline even for
a page that isn't on the LRU. But the cost is acceptable according to
Rong Chen <rong.a.chen@intel.com>'s report:
https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/

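For clarity: the TESTCLEARFLAG(LRU, lru, PF_HEAD) line in the diff
below generates an atomic test-and-clear helper. Roughly (the PF_HEAD
policy resolves to the compound head page; this expansion is an
illustration, not the literal macro output):

	static inline int TestClearPageLRU(struct page *page)
	{
		return test_and_clear_bit(PG_lru, &compound_head(page)->flags);
	}
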
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/page-flags.h |  1 +
 mm/mlock.c                 |  3 +--
 mm/vmscan.c                | 39 +++++++++++++++++++--------------------
 3 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 291dc247dc79..6426f2f03611 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -335,6 +335,7 @@ static inline void page_init_poison(struct page *page, size_t size)
 PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+	TESTCLEARFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
 PAGEFLAG(Workingset, workingset, PF_HEAD)
diff --git a/mm/mlock.c b/mm/mlock.c
index d487aa864e86..7b0e6334be6f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -276,10 +276,9 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (PageLRU(page)) {
+			if (TestClearPageLRU(page)) {
 				struct lruvec *lruvec;
 
-				ClearPageLRU(page);
 				lruvec = mem_cgroup_page_lruvec(page,
 							page_pgdat(page));
 				del_page_from_lru_list(page, lruvec,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb2f6256a7d6..ab7a0104d1e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1542,7 +1542,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
  */
 int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 {
-	int ret = -EINVAL;
+	int ret = -EBUSY;
 
 	/* Only take pages on the LRU. */
 	if (!PageLRU(page))
@@ -1552,8 +1552,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
 		return ret;
 
-	ret = -EBUSY;
-
 	/*
 	 * To minimise LRU disruption, the caller can indicate that it only
 	 * wants to isolate pages it will be able to operate on without
@@ -1600,8 +1598,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 		 * sure the page is not being freed elsewhere -- the
 		 * page release code relies on it.
 		 */
-		ClearPageLRU(page);
-		ret = 0;
+		if (TestClearPageLRU(page))
+			ret = 0;
+		else
+			put_page(page);
 	}
 
 	return ret;
@@ -1667,8 +1667,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
 
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-
 		nr_pages = compound_nr(page);
 		total_scan += nr_pages;
 
@@ -1765,21 +1763,18 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&pgdat->lru_lock);
+		get_page(page);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		if (PageLRU(page)) {
-			int lru = page_lru(page);
-			get_page(page);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, lru);
-			ret = 0;
-		}
+		spin_lock_irq(&pgdat->lru_lock);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
 		spin_unlock_irq(&pgdat->lru_lock);
+		ret = 0;
 	}
+
 	return ret;
 }
 
@@ -4293,6 +4288,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		nr_pages = thp_nr_pages(page);
 		pgscanned += nr_pages;
 
+		/* block memcg migration during page moving between lru */
+		if (!TestClearPageLRU(page))
+			continue;
+
 		if (pagepgdat != pgdat) {
 			if (pgdat)
 				spin_unlock_irq(&pgdat->lru_lock);
@@ -4301,10 +4300,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		}
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-		if (!PageLRU(page) || !PageUnevictable(page))
-			continue;
-
-		if (page_evictable(page)) {
+		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
 
 			VM_BUG_ON_PAGE(PageActive(page), page);
@@ -4313,12 +4309,15 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 			add_page_to_lru_list(page, lruvec, lru);
 			pgrescued += nr_pages;
 		}
+		SetPageLRU(page);
 	}
 
 	if (pgdat) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 		spin_unlock_irq(&pgdat->lru_lock);
+	} else if (pgscanned) {
+		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 15/19] mm/compaction: do page isolation first in compaction
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Currently, compaction takes the lru_lock and then does page isolation,
which works fine with pgdat->lru_lock since every isolation competes
for that one lock. If we want to change to a memcg lru_lock, we have
to isolate the page before taking the lock, so that isolation blocks
the page's memcg change, which in turn relies on page isolation. Then
we can safely use the per-memcg lru_lock later.

The new page isolation uses the previously introduced
TestClearPageLRU() plus pgdat lru locking, which will be changed to
the memcg lru lock later.

Hugh Dickins <hughd@google.com> fixed the following bugs in an early
version of this patch:

Fix lots of crashes under compaction load: isolate_migratepages_block()
must clean up appropriately when rejecting a page, setting PageLRU again
if it had been cleared; and a put_page() after get_page_unless_zero()
cannot safely be done while holding locked_lruvec - it may turn out to
be the final put_page(), which will take an lruvec lock when PageLRU.
And move __isolate_lru_page_prepare back after get_page_unless_zero to
make trylock_page() safe:
trylock_page() is not safe to use at this time: its setting PG_locked
can race with the page being freed or allocated ("Bad page"), and can
also erase flags being set by one of those "sole owners" of a freshly
allocated page who use non-atomic __SetPageFlag().

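Condensed, the new order of operations in isolate_migratepages_block()
(see the compaction.c hunk below) is:

	/* pin the page first, so it cannot be freed under us */
	if (unlikely(!get_page_unless_zero(page)))
		goto isolate_fail;

	if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
		goto isolate_fail_put;

	/* then claim the lru bit; this is what blocks memcg changes */
	if (!TestClearPageLRU(page))
		goto isolate_fail_put;

	/* only now take (or keep holding) the lru lock and isolate */
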
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 +-
 mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 43 ++++++++++++++++++++++---------------------
 3 files changed, 56 insertions(+), 31 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5e1e967c225f..596bc2f4d9b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -356,7 +356,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
+extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index ee1f8439369e..7b1cf48884dd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -886,6 +886,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
 			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
 				low_pfn = end_pfn;
+				page = NULL;
 				goto isolate_abort;
 			}
 			valid_page = page;
@@ -967,6 +968,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
 			goto isolate_fail;
 
+		/*
+		 * Be careful not to clear PageLRU until after we're
+		 * sure the page is not being freed elsewhere -- the
+		 * page release code relies on it.
+		 */
+		if (unlikely(!get_page_unless_zero(page)))
+			goto isolate_fail;
+
+		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
+			goto isolate_fail_put;
+
+		/* Try isolate the page */
+		if (!TestClearPageLRU(page))
+			goto isolate_fail_put;
+
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
 			locked = compact_lock_irqsave(&pgdat->lru_lock,
@@ -979,10 +995,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 					goto isolate_abort;
 			}
 
-			/* Recheck PageLRU and PageCompound under lock */
-			if (!PageLRU(page))
-				goto isolate_fail;
-
 			/*
 			 * Page become compound since the non-locked check,
 			 * and it's on LRU. It can only be a THP so the order
@@ -990,16 +1002,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
 				low_pfn += compound_nr(page) - 1;
-				goto isolate_fail;
+				SetPageLRU(page);
+				goto isolate_fail_put;
 			}
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-		/* Try isolate the page */
-		if (__isolate_lru_page(page, isolate_mode) != 0)
-			goto isolate_fail;
-
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
 			low_pfn += compound_nr(page) - 1;
@@ -1028,6 +1037,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		}
 
 		continue;
+
+isolate_fail_put:
+		/* Avoid potential deadlock in freeing page under lru_lock */
+		if (locked) {
+			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			locked = false;
+		}
+		put_page(page);
+
 isolate_fail:
 		if (!skip_on_failure)
 			continue;
@@ -1064,9 +1082,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	if (unlikely(low_pfn > end_pfn))
 		low_pfn = end_pfn;
 
+	page = NULL;
+
 isolate_abort:
 	if (locked)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (page) {
+		SetPageLRU(page);
+		put_page(page);
+	}
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ab7a0104d1e1..0be55d875fde 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1540,7 +1540,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, isolate_mode_t mode)
+int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
 {
 	int ret = -EBUSY;
 
@@ -1592,22 +1592,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
 		return ret;
 
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * Be careful not to clear PageLRU until after we're
-		 * sure the page is not being freed elsewhere -- the
-		 * page release code relies on it.
-		 */
-		if (TestClearPageLRU(page))
-			ret = 0;
-		else
-			put_page(page);
-	}
-
-	return ret;
+	return 0;
 }
 
-
 /*
  * Update LRU sizes after isolating pages. The LRU size updates must
  * be complete before mem_cgroup_update_lru_size due to a sanity check.
@@ -1687,20 +1674,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		 * only when the page is being freed somewhere else.
 		 */
 		scan += nr_pages;
-		switch (__isolate_lru_page(page, mode)) {
+		switch (__isolate_lru_page_prepare(page, mode)) {
 		case 0:
+			/*
+			 * Be careful not to clear PageLRU until after we're
+			 * sure the page is not being freed elsewhere -- the
+			 * page release code relies on it.
+			 */
+			if (unlikely(!get_page_unless_zero(page)))
+				goto busy;
+
+			if (!TestClearPageLRU(page)) {
+				/*
+				 * This page may in other isolation path,
+				 * but we still hold lru_lock.
+				 */
+				put_page(page);
+				goto busy;
+			}
+
 			nr_taken += nr_pages;
 			nr_zone_taken[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
 			break;
 
-		case -EBUSY:
+		default:
+busy:
 			/* else it is being freed elsewhere */
 			list_move(&page->lru, src);
-			continue;
-
-		default:
-			BUG();
 		}
 	}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 16/19] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  2020-11-05  8:55 ` Alex Shi
                   ` (15 preceding siblings ...)
  (?)
@ 2020-11-05  8:55 ` Alex Shi
  2020-11-11 18:00     ` Vlastimil Babka
  -1 siblings, 1 reply; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Hugh Dickins found a memcg-change bug in the original version: if we
want to change pgdat->lru_lock to the memcg's lruvec lock, we have to
serialize against mem_cgroup_move_account during pagevec_lru_move_fn.
The possible bad scenario looks like:

	cpu 0					cpu 1
lruvec = mem_cgroup_page_lruvec()
					if (!isolate_lru_page())
						mem_cgroup_move_account

spin_lock_irqsave(&lruvec->lru_lock)	<== wrong lock.

So we need TestClearPageLRU to block isolate_lru_page(); that
serializes the memcg change, and as a consequence the PageLRU check in
each move_fn callee is removed.

__pagevec_lru_add_fn() is different from the others, because the pages
it deals with are, by definition, not yet on the lru.  TestClearPageLRU
is not needed and would not work, so __pagevec_lru_add() goes its own
way.

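The per-page critical section of pagevec_lru_move_fn thus becomes
(condensed from the diff below):

	/* block memcg migration during page moving between lru */
	if (!TestClearPageLRU(page))
		continue;

	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	(*move_fn)(page, lruvec);

	SetPageLRU(page);
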
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 44 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 2681d9023998..1838a9535703 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -222,8 +222,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(&pgdat->lru_lock, flags);
 		}
 
+		/* block memcg migration during page moving between lru */
+		if (!TestClearPageLRU(page))
+			continue;
+
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec);
+
+		SetPageLRU(page);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -233,7 +239,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
@@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page)
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+	if (!PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = thp_nr_pages(page);
 
@@ -362,7 +368,8 @@ static void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
+	if (PageLRU(page))
+		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -519,9 +526,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 	bool active;
 	int nr_pages = thp_nr_pages(page);
 
-	if (!PageLRU(page))
-		return;
-
 	if (PageUnevictable(page))
 		return;
 
@@ -562,7 +566,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = thp_nr_pages(page);
 
@@ -579,7 +583,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	if (PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		bool active = PageActive(page);
 		int nr_pages = thp_nr_pages(page);
@@ -1021,7 +1025,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
+	int i;
+	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec;
+	unsigned long flags = 0;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct pglist_data *pagepgdat = page_pgdat(page);
+
+		if (pagepgdat != pgdat) {
+			if (pgdat)
+				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			pgdat = pagepgdat;
+			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		__pagevec_lru_add_fn(page, lruvec);
+	}
+	if (pgdat)
+		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	release_pages(pvec->pages, pvec->nr);
+	pagevec_reinit(pvec);
 }
 
 /**
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi

This patch moves the per-node lru_lock into the lruvec, thus bringing an
lru_lock for each memcg on each node. So on a large machine, each memcg
no longer has to suffer from per-node pgdat->lru_lock contention; it can
go fast with its own lru_lock.

After moving the memcg charge before lru insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and
can replace the per-node lru lock.
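
The resulting rule for the new lock_page_lruvec*() helpers is: pin the
page's memcg first (clear PageLRU, or hold the page lock), then take the
page's lruvec lock. As a rough sketch of the shape this takes, the
isolate_lru_page() path in the mm/vmscan.c hunk below becomes:

	if (TestClearPageLRU(page)) {
		struct lruvec *lruvec;

		get_page(page);
		/* PageLRU cleared above, so page->mem_cgroup is stable */
		lruvec = lock_page_lruvec_irq(page);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		unlock_page_lruvec_irq(lruvec);
	}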

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function to the locking paths, which may give some
clues if something gets out of hand.
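
Condensed from the mm/compaction.c hunk below, that open-coded locking
boils down to the following; the RCU read lock keeps the looked-up
lruvec valid until its lock is taken:

	/* PageLRU already cleared via TestClearPageLRU(page) */
	rcu_read_lock();
	lruvec = mem_cgroup_page_lruvec(page, pgdat);

	/* If we already hold this lruvec's lock, skip the relock */
	if (lruvec != locked) {
		if (locked)
			unlock_page_lruvec_irqrestore(locked, flags);

		compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
		locked = lruvec;
		rcu_read_unlock();

		lruvec_memcg_debug(lruvec, page);
	} else
		rcu_read_unlock();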

Daniel Jordan's testing shows a 62% improvement on a modified readtwice
case on his 2P * 10 core * 2 HT Broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

On a large machine with memcg enabled but not used, looking up the
page's lruvec goes through a few extra pointers, which may increase
lru_lock holding time and cause a slight regression.

Hugh Dickins helped polish this patch, thanks!

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++
 include/linux/mmzone.h     |   3 +-
 mm/compaction.c            |  56 ++++++++++++++--------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  73 ++++++++++++++++++++++++++--
 mm/mlock.c                 |  22 ++++++---
 mm/mmzone.c                |   1 +
 mm/page_alloc.c            |   1 -
 mm/swap.c                  | 116 ++++++++++++++++++++++-----------------------
 mm/vmscan.c                |  55 ++++++++++-----------
 10 files changed, 270 insertions(+), 126 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0f4dd7829fb2..6ecb08ff4ad1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -666,6 +666,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1233,6 +1246,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1476,6 +1514,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1605,6 +1647,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..0afba4ea2a21 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,8 @@ enum lruvec_flags {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
 	 * over the other. As the observed cost of reclaiming one LRU
@@ -796,7 +798,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 7b1cf48884dd..9cfe90961493 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -804,7 +804,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -864,11 +864,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked) {
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -940,9 +949,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+					unlock_page_lruvec_irqrestore(locked, flags);
+					locked = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -983,10 +991,19 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked) {
+			if (locked)
+				unlock_page_lruvec_irqrestore(locked, flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -1005,9 +1022,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1041,8 +1057,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+			unlock_page_lruvec_irqrestore(locked, flags);
+			locked = NULL;
 		}
 		put_page(page);
 
@@ -1057,8 +1073,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1086,7 +1102,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_abort:
 	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(locked, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b70ec0c6076b..94e42dba052a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,7 +2354,7 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(tail), head);
 	VM_BUG_ON_PAGE(PageLRU(tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2438,7 +2438,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2456,10 +2455,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2480,7 +2477,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 157b745031a4..91226af58ce8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,9 @@
  * Lockless page tracking & accounting
  * Unified hierarchy configuration model
  * Copyright (C) 2015 Red Hat, Inc., Johannes Weiner
+ *
+ * Per memcg lru locking
+ * Copyright (C) 2020 Alibaba, Inc, Alex Shi
  */
 
 #include <linux/page_counter.h>
@@ -1305,6 +1308,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1343,6 +1359,59 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 }
 
 /**
+ * lock_page_lruvec - lock and return lruvec for a given page.
+ * @page: the page
+ *
+ * These functions should be used under one of two conditions:
+ * the page's PageLRU flag is cleared (or not yet set),
+ * or the page is locked.
+ */
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+/**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
  * @lru: index of lru list the page is sitting on
@@ -3245,10 +3314,8 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
 /*
- * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * Because page->mem_cgroup is not set on compound tails, set it now.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 7b0e6334be6f..ab164a675c25 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -262,12 +262,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -277,10 +277,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *lruvec;
+				struct lruvec *new_lruvec;
+
+				new_lruvec = mem_cgroup_page_lruvec(page,
+						page_pgdat(page));
+				if (new_lruvec != lruvec) {
+					if (lruvec)
+						unlock_page_lruvec_irq(lruvec);
+					lruvec = lock_page_lruvec_irq(page);
+				}
 
-				lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
@@ -299,8 +305,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	} else if (delta_munlocked) {
+		mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77220615fd5..74bf7f4c6317 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6808,7 +6808,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 1838a9535703..ed033f7c4f2d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,16 +79,14 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -207,32 +205,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,15 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		/*
+		 * Hold lruvec->lru_lock is safe here, since
+		 * 1) The pinned lruvec in reclaim, or
+		 * 2) From a pre-LRU page during refault (which also holds the
+		 *    rcu lock, so would be safe even if the page was on the LRU
+		 *    and could move simultaneously to a new lruvec).
+		 */
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +302,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -364,13 +366,15 @@ static inline void activate_page_drain(int cpu)
 
 static void activate_page(struct page *page)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+	if (TestClearPageLRU(page)) {
+		lruvec = lock_page_lruvec_irq(page);
+		__activate_page(page, lruvec);
+		unlock_page_lruvec_irq(lruvec);
+		SetPageLRU(page);
+	}
 }
 #endif
 
@@ -860,8 +864,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags;
 	unsigned int lock_batch;
 
@@ -871,11 +874,11 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		page = compound_head(page);
@@ -883,10 +886,9 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -907,27 +909,27 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -937,8 +939,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -1026,26 +1028,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0be55d875fde..2953ddec88a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1765,14 +1765,12 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1839,7 +1837,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
@@ -1850,9 +1847,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1874,9 +1871,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
@@ -1953,7 +1950,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1965,7 +1962,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1973,7 +1970,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1982,7 +1979,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2035,7 +2032,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2046,7 +2043,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2092,7 +2089,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2103,7 +2100,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2693,10 +2690,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4272,16 +4269,15 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
 		int nr_pages;
+		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4293,13 +4289,12 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
@@ -4313,10 +4308,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		SetPageLRU(page);
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	} else if (pgscanned) {
 		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt,
	tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA,
	khlebnikov-XoJtRXgx1JseBXzfvpsJ4g,
	daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA,
	willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w,
	lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA,
	iamjoonsoo.kim-Hm3cg6mZ9cc,
	richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w,
	kirill-oKw7cIdHH8eLwutG50LtGA,
	alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w,
	rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, mhocko-IBi9RG/b67k,
	vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w,
	shy828301-Re5JQEeQqe8AvxtiuMwx3w
  Cc: Michal Hocko, Yang Shi

This patch moves the per-node lru_lock into the lruvec, thus bringing an
lru_lock for each memcg on each node. So on a large machine, each memcg
no longer has to suffer from per-node pgdat->lru_lock contention; it can
go fast with its own lru_lock.

After moving the memcg charge before lru insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and
can replace the per-node lru lock.

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function to the locking paths, which may give some
clues if something gets out of hand.

Daniel Jordan's testing shows a 62% improvement on a modified readtwice
case on his 2P * 10 core * 2 HT Broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu-S51bK0XF4qpuJJETbFA3a0B3C2bhBk7L0E9HWUfgJXw@public.gmane.org/

On a large machine with memcg enabled but not used, looking up the
page's lruvec goes through a few extra pointers, which may increase
lru_lock holding time and cause a slight regression.

Hugh Dickins helped polish this patch, thanks!

Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
Acked-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Rong Chen <rong.a.chen-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Yang Shi <yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
Cc: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++
 include/linux/mmzone.h     |   3 +-
 mm/compaction.c            |  56 ++++++++++++++--------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  73 ++++++++++++++++++++++++++--
 mm/mlock.c                 |  22 ++++++---
 mm/mmzone.c                |   1 +
 mm/page_alloc.c            |   1 -
 mm/swap.c                  | 116 ++++++++++++++++++++++-----------------------
 mm/vmscan.c                |  55 ++++++++++-----------
 10 files changed, 270 insertions(+), 126 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0f4dd7829fb2..6ecb08ff4ad1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -666,6 +666,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1233,6 +1246,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1476,6 +1514,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1605,6 +1647,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..0afba4ea2a21 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,8 @@ enum lruvec_flags {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
 	 * over the other. As the observed cost of reclaiming one LRU
@@ -796,7 +798,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 7b1cf48884dd..9cfe90961493 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -804,7 +804,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -864,11 +864,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked) {
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -940,9 +949,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+					unlock_page_lruvec_irqrestore(locked, flags);
+					locked = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -983,10 +991,19 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked) {
+			if (locked)
+				unlock_page_lruvec_irqrestore(locked, flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -1005,9 +1022,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1041,8 +1057,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+			unlock_page_lruvec_irqrestore(locked, flags);
+			locked = NULL;
 		}
 		put_page(page);
 
@@ -1057,8 +1073,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1086,7 +1102,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_abort:
 	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(locked, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b70ec0c6076b..94e42dba052a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,7 +2354,7 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(tail), head);
 	VM_BUG_ON_PAGE(PageLRU(tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2438,7 +2438,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2456,10 +2455,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2480,7 +2477,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 157b745031a4..91226af58ce8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,9 @@
  * Lockless page tracking & accounting
  * Unified hierarchy configuration model
  * Copyright (C) 2015 Red Hat, Inc., Johannes Weiner
+ *
+ * Per memcg lru locking
+ * Copyright (C) 2020 Alibaba, Inc, Alex Shi
  */
 
 #include <linux/page_counter.h>
@@ -1305,6 +1308,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1343,6 +1359,59 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 }
 
 /**
+ * lock_page_lruvec - lock and return lruvec for a given page.
+ * @page: the page
+ *
+ * These functions should be used under one of two conditions:
+ * the page's PageLRU flag is cleared (or not yet set),
+ * or the page is locked.
+ */
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+/**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
  * @lru: index of lru list the page is sitting on
@@ -3245,10 +3314,8 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
 /*
- * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * Because page->mem_cgroup is not set on compound tails, set it now.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 7b0e6334be6f..ab164a675c25 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -262,12 +262,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -277,10 +277,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *lruvec;
+				struct lruvec *new_lruvec;
+
+				new_lruvec = mem_cgroup_page_lruvec(page,
+						page_pgdat(page));
+				if (new_lruvec != lruvec) {
+					if (lruvec)
+						unlock_page_lruvec_irq(lruvec);
+					lruvec = lock_page_lruvec_irq(page);
+				}
 
-				lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
@@ -299,8 +305,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	} else if (delta_munlocked) {
+		mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77220615fd5..74bf7f4c6317 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6808,7 +6808,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 1838a9535703..ed033f7c4f2d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,16 +79,14 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -207,32 +205,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,15 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		/*
+		 * Hold lruvec->lru_lock is safe here, since
+		 * 1) The pinned lruvec in reclaim, or
+		 * 2) From a pre-LRU page during refault (which also holds the
+		 *    rcu lock, so would be safe even if the page was on the LRU
+		 *    and could move simultaneously to a new lruvec).
+		 */
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +302,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -364,13 +366,15 @@ static inline void activate_page_drain(int cpu)
 
 static void activate_page(struct page *page)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+	if (TestClearPageLRU(page)) {
+		lruvec = lock_page_lruvec_irq(page);
+		__activate_page(page, lruvec);
+		unlock_page_lruvec_irq(lruvec);
+		SetPageLRU(page);
+	}
 }
 #endif
 
@@ -860,8 +864,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags;
 	unsigned int lock_batch;
 
@@ -871,11 +874,11 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		page = compound_head(page);
@@ -883,10 +886,9 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -907,27 +909,27 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -937,8 +939,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -1026,26 +1028,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0be55d875fde..2953ddec88a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1765,14 +1765,12 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1839,7 +1837,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
@@ -1850,9 +1847,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1874,9 +1871,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
@@ -1953,7 +1950,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1965,7 +1962,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1973,7 +1970,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1982,7 +1979,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2035,7 +2032,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2046,7 +2043,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2092,7 +2089,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2103,7 +2100,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2693,10 +2690,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4272,16 +4269,15 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
 		int nr_pages;
+		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4293,13 +4289,12 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
@@ -4313,10 +4308,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		SetPageLRU(page);
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	} else if (pgscanned) {
 		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 18/19] mm/lru: introduce the relock_page_lruvec function
@ 2020-11-05  8:55   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Thomas Gleixner, Andrey Ryabinin

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Use this new function to replace the same code repeated in several places; no functional change.

When testing for relock we can avoid the need for RCU locking if we simply
compare the page's pgdat and memcg pointers against those that the lruvec is
holding. By doing this we can avoid the extra pointer walks and accesses of
the memory cgroup.

In addition we can avoid the checks entirely if lruvec is currently NULL.
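
For illustration only (not part of the patch): with these helpers, a caller
that walks a pagevec ends up with roughly the pattern below. The function
name example_walk is made up, and the loop body is only a sketch; the helper
names are the ones introduced here.

	static void example_walk(struct pagevec *pvec)
	{
		struct lruvec *lruvec = NULL;
		unsigned long flags;
		int i;

		for (i = 0; i < pagevec_count(pvec); i++) {
			struct page *page = pvec->pages[i];

			/* reuse the held lock when the page maps to the same lruvec */
			lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
			/* ... work on page under lruvec->lru_lock ... */
		}
		if (lruvec)
			unlock_page_lruvec_irqrestore(lruvec, flags);
	}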

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/mlock.c                 | 11 +---------
 mm/swap.c                  | 33 +++++++----------------------
 mm/vmscan.c                | 12 ++---------
 4 files changed, 62 insertions(+), 46 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6ecb08ff4ad1..ba4050154fea 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -660,6 +660,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	const struct mem_cgroup *memcg;
+	struct mem_cgroup_per_node *mz;
+
+	if (mem_cgroup_disabled())
+		return lruvec == &pgdat->__lruvec;
+
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+	memcg = page->mem_cgroup ? : root_mem_cgroup;
+
+	return lruvec->pgdat == pgdat && mz->memcg == memcg;
+}
+
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -1221,6 +1237,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->__lruvec;
 }
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+
+	return lruvec == &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
@@ -1663,6 +1687,34 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 }
 
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+		struct lruvec *locked_lruvec)
+{
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
+
+		unlock_page_lruvec_irq(locked_lruvec);
+	}
+
+	return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+		struct lruvec *locked_lruvec, unsigned long *flags)
+{
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
+
+		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+	}
+
+	return lock_page_lruvec_irqsave(page, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/mlock.c b/mm/mlock.c
index ab164a675c25..55b3b3672977 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -277,16 +277,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *new_lruvec;
-
-				new_lruvec = mem_cgroup_page_lruvec(page,
-						page_pgdat(page));
-				if (new_lruvec != lruvec) {
-					if (lruvec)
-						unlock_page_lruvec_irq(lruvec);
-					lruvec = lock_page_lruvec_irq(page);
-				}
-
+				lruvec = relock_page_lruvec_irq(page, lruvec);
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
diff --git a/mm/swap.c b/mm/swap.c
index ed033f7c4f2d..c593ba596dea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -210,19 +210,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
-
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
@@ -918,17 +911,12 @@ void release_pages(struct page **pages, int nr)
 		}
 
 		if (PageLRU(page)) {
-			struct lruvec *new_lruvec;
-
-			new_lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
-			if (new_lruvec != lruvec) {
-				if (lruvec)
-					unlock_page_lruvec_irqrestore(lruvec,
-									flags);
+			struct lruvec *prev_lruvec = lruvec;
+
+			lruvec = relock_page_lruvec_irqsave(page, lruvec,
+									&flags);
+			if (prev_lruvec != lruvec)
 				lock_batch = 0;
-				lruvec = lock_page_lruvec_irqsave(page, &flags);
-			}
 
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
@@ -1033,15 +1021,8 @@ void __pagevec_lru_add(struct pagevec *pvec)
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
-
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
 
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
 	if (lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2953ddec88a0..3b09a39de8cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1884,8 +1884,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		 * All pages were isolated from the same lruvec (and isolation
 		 * inhibits memcg migration).
 		 */
-		VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page))
-							!= lruvec, page);
+		VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
 		lru = page_lru(page);
 		nr_pages = thp_nr_pages(page);
 
@@ -4277,7 +4276,6 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
 		int nr_pages;
-		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4289,13 +4287,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
-
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v21 19/19] mm/lru: revise the comments of lru_lock
  2020-11-05  8:55 ` Alex Shi
                   ` (18 preceding siblings ...)
  (?)
@ 2020-11-05  8:55 ` Alex Shi
  2020-11-12 12:37     ` Vlastimil Babka
  -1 siblings, 1 reply; 111+ messages in thread
From: Alex Shi @ 2020-11-05  8:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Andrey Ryabinin, Jann Horn

From: Hugh Dickins <hughd@google.com>

Since we changed pgdat->lru_lock to lruvec->lru_lock, it's time to fix the
now-incorrect comments in the code. Also fix some ancient zone->lru_lock
comment errors, etc.

I struggled to understand the comment above move_pages_to_lru() (surely
it never calls page_referenced()), and eventually realized that most of
it had got separated from shrink_active_list(): move that comment back.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 ++------
 Documentation/admin-guide/cgroup-v1/memory.rst     | 21 +++++------
 Documentation/trace/events-kmem.rst                |  2 +-
 Documentation/vm/unevictable-lru.rst               | 22 +++++-------
 include/linux/mm_types.h                           |  2 +-
 include/linux/mmzone.h                             |  3 +-
 mm/filemap.c                                       |  4 +--
 mm/rmap.c                                          |  4 +--
 mm/vmscan.c                                        | 41 ++++++++++++----------
 9 files changed, 50 insertions(+), 64 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 ======
-        Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global pgdat->lru_lock).
-	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under pgdat->lru_lock.
-
-	A special function is mem_cgroup_isolate_pages(). This scans
-	memcg's private LRU and call __isolate_lru_page() to extract a page
-	from LRU.
-
-	(By __isolate_lru_page(), the page is removed from both of global and
-	private LRU.)
-
+	Each memcg has its own vector of LRUs (inactive anon, active anon,
+	inactive file, active file, unevictable) of pages from each node,
+	each LRU handled under a single lru_lock for that memcg and node.
 
 9. Typical Tests.
 =================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 12757e63b26c..24450696579f 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered.
 2.6 Locking
 -----------
 
-   lock_page_cgroup()/unlock_page_cgroup() should not be called under
-   the i_pages lock.
+Lock order is as follows:
 
-   Other lock order is following:
+  Page lock (PG_locked bit of page->flags)
+    mm->page_table_lock or split pte_lock
+      lock_page_memcg (memcg->move_lock)
+        mapping->i_pages lock
+          lruvec->lru_lock.
 
-   PG_locked.
-     mm->page_table_lock
-         pgdat->lru_lock
-	   lock_page_cgroup.
-
-  In many cases, just lock_page_cgroup() is called.
-
-  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  pgdat->lru_lock, it has no lock of its own.
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 
 4. Per-CPU Allocator Activity
 =============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux.  The problems have been observed at customer sites on large
 memory x86_64 systems.
 
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone.  When a large
+main memory will have over 32 million 4k pages in a single node.  When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable.  This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
 The Unevictable Page List
 -------------------------
 
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
 
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
 swap-backed pages.  This differentiation is only important while the pages are,
 in fact, evictable.
 
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
 
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock.  This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
 
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
 
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element).  The memory controller tracks the movement of pages to
 and from the unevictable list.
 
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
 active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked.  Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list.  If
+is now mlocked and divert the page to the node's unevictable list.  If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
 
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
      unevictable list in mlock_vma_page().
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a6b078b9d6a..82c788917319 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock.  Sometimes used as a generic list
+			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0afba4ea2a21..1299b8ce64d3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
- * So add a wild amount of padding here to ensure that they fall into separate
+ * Add a wild amount of padding here to ensure datas fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index d90614f501da..426d547cf19e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -102,8 +102,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 078d54da59d4..73788505aa0a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -28,12 +28,12 @@
  *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
  *                     i_pages lock (widely used)
+ *                       lruvec->lru_lock (in lock_page_lruvec_irq)
  *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                   sb_lock (within inode_lock in fs/fs-writeback.c)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b09a39de8cd..1c343adbbbe3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1614,14 +1614,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended.  Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan:	The number of eligible pages to look through on the list.
  * @lruvec:	The LRU vector to pull pages from.
@@ -1815,25 +1817,11 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 }
 
 /*
- * This moves pages from @list to corresponding LRU list.
- *
- * We move them the other way if the page is referenced by one or more
- * processes, from rmap.
- *
- * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation.  But if
- * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page.  It's impossible to balance
- * this, so instead we remove the pages from the LRU while processing them.
- * It is safe to rely on PG_active against the non-LRU pages in here because
- * nobody will play with that bit on a non-LRU page.
- *
- * The downside is that we have to touch page->_refcount against each page.
- * But we had to alter page->flags anyway.
+ * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
+ * On return, @list is reused as a list of pages to be freed by the caller.
  *
  * Returns the number of pages moved to the given lruvec.
  */
-
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
@@ -2012,6 +2000,23 @@ static int current_may_throttle(void)
 	return nr_reclaimed;
 }
 
+/*
+ * shrink_active_list() moves pages from the active LRU to the inactive LRU.
+ *
+ * We move them the other way if the page is referenced by one or more
+ * processes.
+ *
+ * If the pages are mostly unmapped, the processing is fast and it is
+ * appropriate to hold lru_lock across the whole operation.  But if
+ * the pages are mapped, the processing is slow (page_referenced()), so
+ * we should drop lru_lock around each page.  It's impossible to balance
+ * this, so instead we remove the pages from the LRU while processing them.
+ * It is safe to rely on PG_active against the non-LRU pages in here because
+ * nobody will play with that bit on a non-LRU page.
+ *
+ * The downside is that we have to touch page->_refcount against each page.
+ * But we had to alter page->flags anyway.
+ */
 static void shrink_active_list(unsigned long nr_to_scan,
 			       struct lruvec *lruvec,
 			       struct scan_control *sc,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-11-05 13:43     ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05 13:43 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi



On 2020/11/5 at 4:55 PM, Alex Shi wrote:
>  /**
> + * lock_page_lruvec - lock and return lruvec for a given page.
> + * @page: the page
> + *
> + * This series functions should be used in either conditions:
> + * PageLRU is cleared or unset
> + * or page is locked.
      or page->_refcount is zero.
Oops, there is a typo here: the line above got dropped and needs to be added back.
So the updated patch is below:


From 9f187d04c7ba62bb7e07c07733b2848f155961f6 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 18 Aug 2020 16:44:21 +0800
Subject: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock

This patch moves the per-node lru_lock into the lruvec, thus providing a
lru_lock for each memcg per node. So on a large machine, memcgs no longer
have to suffer from contention on the per-node pgdat->lru_lock; each can
proceed quickly under its own lru_lock.

After moving the memcg charge before lru insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and can
replace the per-node lru lock, as sketched below.
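
For illustration only (a sketch of what isolate_lru_page() becomes with this
patch, not extra code being added): with the lru bit acting as the pin,
isolating a single page now looks roughly like:

	if (TestClearPageLRU(page)) {
		struct lruvec *lruvec;

		get_page(page);
		/* page's memcg is stable once PG_lru has been cleared */
		lruvec = lock_page_lruvec_irq(page);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		unlock_page_lruvec_irq(lruvec);
	}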

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function to the locking path which may give some clues
if something gets out of hand.

Daniel Jordan's testing shows a 62% improvement on a modified readtwice case
on his 2P * 10 core * 2 HT Broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

On a large machine with memcg enabled but not used, looking up the page's
lruvec chases a few extra pointers, which may increase lru_lock hold time
and cause a slight regression.

Hugh Dickins helped polish the patch, thanks!

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++
 include/linux/mmzone.h     |   3 +-
 mm/compaction.c            |  56 ++++++++++++++--------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  74 +++++++++++++++++++++++++++--
 mm/mlock.c                 |  22 ++++++---
 mm/mmzone.c                |   1 +
 mm/page_alloc.c            |   1 -
 mm/swap.c                  | 116 ++++++++++++++++++++++-----------------------
 mm/vmscan.c                |  55 ++++++++++-----------
 10 files changed, 271 insertions(+), 126 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0f4dd7829fb2..6ecb08ff4ad1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -666,6 +666,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1233,6 +1246,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1476,6 +1514,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1605,6 +1647,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..0afba4ea2a21 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,8 @@ enum lruvec_flags {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
 	 * over the other. As the observed cost of reclaiming one LRU
@@ -796,7 +798,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 7b1cf48884dd..9cfe90961493 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -804,7 +804,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -864,11 +864,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked) {
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -940,9 +949,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+					unlock_page_lruvec_irqrestore(locked, flags);
+					locked = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -983,10 +991,19 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked) {
+			if (locked)
+				unlock_page_lruvec_irqrestore(locked, flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -1005,9 +1022,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1041,8 +1057,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+			unlock_page_lruvec_irqrestore(locked, flags);
+			locked = NULL;
 		}
 		put_page(page);
 
@@ -1057,8 +1073,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1086,7 +1102,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_abort:
 	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(locked, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b70ec0c6076b..94e42dba052a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,7 +2354,7 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(tail), head);
 	VM_BUG_ON_PAGE(PageLRU(tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2438,7 +2438,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2456,10 +2455,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2480,7 +2477,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 157b745031a4..591f9f9ca8b2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,9 @@
  * Lockless page tracking & accounting
  * Unified hierarchy configuration model
  * Copyright (C) 2015 Red Hat, Inc., Johannes Weiner
+ *
+ * Per memcg lru locking
+ * Copyright (C) 2020 Alibaba, Inc, Alex Shi
  */
 
 #include <linux/page_counter.h>
@@ -1305,6 +1308,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1343,6 +1359,60 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 }
 
 /**
+ * lock_page_lruvec - lock and return lruvec for a given page.
+ * @page: the page
+ *
+ * This series functions should be used in either conditions:
+ * PageLRU is cleared or unset
+ * or page->_refcount is zero
+ * or page is locked.
+ */
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+/**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
  * @lru: index of lru list the page is sitting on
@@ -3245,10 +3315,8 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
 /*
- * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * Because page->mem_cgroup is not set on compound tails, set it now.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 7b0e6334be6f..ab164a675c25 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -262,12 +262,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -277,10 +277,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *lruvec;
+				struct lruvec *new_lruvec;
+
+				new_lruvec = mem_cgroup_page_lruvec(page,
+						page_pgdat(page));
+				if (new_lruvec != lruvec) {
+					if (lruvec)
+						unlock_page_lruvec_irq(lruvec);
+					lruvec = lock_page_lruvec_irq(page);
+				}
 
-				lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
@@ -299,8 +305,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	} else if (delta_munlocked) {
+		mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77220615fd5..74bf7f4c6317 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6808,7 +6808,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 1838a9535703..ed033f7c4f2d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,16 +79,14 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -207,32 +205,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,15 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		/*
+		 * Holding lruvec->lru_lock is safe here, since we are either
+		 * 1) holding a lruvec pinned by reclaim, or
+		 * 2) coming from a pre-LRU page during refault (which also holds
+		 *    the rcu lock, so it would be safe even if the page was on
+		 *    the LRU and could move simultaneously to a new lruvec).
+		 */
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +302,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -364,13 +366,15 @@ static inline void activate_page_drain(int cpu)
 
 static void activate_page(struct page *page)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+	if (TestClearPageLRU(page)) {
+		lruvec = lock_page_lruvec_irq(page);
+		__activate_page(page, lruvec);
+		unlock_page_lruvec_irq(lruvec);
+		SetPageLRU(page);
+	}
 }
 #endif
 
@@ -860,8 +864,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags;
 	unsigned int lock_batch;
 
@@ -871,11 +874,11 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		page = compound_head(page);
@@ -883,10 +886,9 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -907,27 +909,27 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -937,8 +939,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -1026,26 +1028,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0be55d875fde..2953ddec88a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1765,14 +1765,12 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1839,7 +1837,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
@@ -1850,9 +1847,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1874,9 +1871,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
@@ -1953,7 +1950,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1965,7 +1962,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1973,7 +1970,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1982,7 +1979,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2035,7 +2032,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2046,7 +2043,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2092,7 +2089,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2103,7 +2100,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2693,10 +2690,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4272,16 +4269,15 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
 		int nr_pages;
+		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4293,13 +4289,12 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
@@ -4313,10 +4308,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		SetPageLRU(page);
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	} else if (pgscanned) {
 		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-11-05 13:43     ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-05 13:43 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt,
	tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA,
	khlebnikov-XoJtRXgx1JseBXzfvpsJ4g,
	daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA,
	willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w,
	lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA,
	iamjoonsoo.kim-Hm3cg6mZ9cc,
	richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w,
	kirill-oKw7cIdHH8eLwutG50LtGA,
	alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w,
	rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, mhocko-IBi9RG/b67k,
	vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w,
	shy828301-Re5JQEeQqe8AvxtiuMwx3w
  Cc: Michal Hocko, Yang Shi



On 2020/11/5 4:55 PM, Alex Shi wrote:
>  /**
> + * lock_page_lruvec - lock and return lruvec for a given page.
> + * @page: the page
> + *
> + * This series functions should be used in either conditions:
> + * PageLRU is cleared or unset
> + * or page is locked.
      or page->_refcount is zero.
Oops, here is a typo: we need to add back the above line.
So the updated patch is here:
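
As an aside, a condensed example (modeled on the isolate_lru_page() hunk in
the patch below, so nothing new is introduced) of how a caller satisfies that
precondition before taking the lock:

	if (TestClearPageLRU(page)) {
		struct lruvec *lruvec;

		/*
		 * PageLRU is now clear, so the page cannot switch memcg
		 * under us and the lruvec returned below stays valid.
		 */
		get_page(page);
		lruvec = lock_page_lruvec_irq(page);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		unlock_page_lruvec_irq(lruvec);
	}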


From 9f187d04c7ba62bb7e07c07733b2848f155961f6 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
Date: Tue, 18 Aug 2020 16:44:21 +0800
Subject: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock

This patch moves the per-node lru_lock into the lruvec, thus bringing a
lru_lock for each memcg per node. So on a large machine, each memcg no
longer has to suffer from per-node pgdat->lru_lock contention; it can go
fast with its own lru_lock.

After moving the memcg charge before lru insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and
can replace the per-node lru lock.
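
For reference, the helper added in mm/memcontrol.c further down resolves the
lruvec under RCU and only then takes the new lock; condensed here (the
lruvec_memcg_debug() call is dropped), it relies exactly on that
serialization: with PageLRU cleared, the page locked, or its refcount at
zero, the page cannot move to another memcg between the lookup and the lock.

	struct lruvec *lock_page_lruvec_irq(struct page *page)
	{
		struct lruvec *lruvec;

		rcu_read_lock();
		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irq(&lruvec->lru_lock);
		rcu_read_unlock();

		return lruvec;
	}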

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function to the locking which may give some clues if
something gets out of hand.
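
The open-coded version follows the same relock-only-when-the-lruvec-changes
idiom that the pagevec walkers in mm/swap.c use after this patch; a minimal
sketch of the pattern (names taken from the hunks below):

	int i;
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		struct lruvec *new_lruvec;

		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		if (new_lruvec != lruvec) {
			/* pay for a relock only when the lruvec changes */
			if (lruvec)
				unlock_page_lruvec_irqrestore(lruvec, flags);
			lruvec = lock_page_lruvec_irqsave(page, &flags);
		}
		/* ... operate on the page under lruvec->lru_lock ... */
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);

Consecutive pages from the same memcg and node keep the lock held, so the
batching behaviour of the old per-node lock is preserved.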

Daniel Jordan's testing shows a 62% improvement on a modified readtwice
case on his 2P * 10 core * 2 HT broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

On a large machine with memcg enabled but not used, looking up the
page's lruvec chases a few extra pointers, which may increase the
lru_lock holding time and cause a slight regression.
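
Roughly, the extra pointer chasing referred to here is (a sketch only, with
the root/disabled-memcg handling omitted and local variables left
undeclared):

	/* memcg enabled: a few dependent loads before the lock can be taken */
	memcg  = page->mem_cgroup;
	mz     = memcg->nodeinfo[page_to_nid(page)];
	lruvec = &mz->lruvec;
	spin_lock(&lruvec->lru_lock);

	/* memcg disabled: the lruvec is embedded in the node itself */
	lruvec = &page_pgdat(page)->__lruvec;
	spin_lock(&lruvec->lru_lock);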

Hugh Dickins helped polish the patch, thanks!

Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
Acked-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Rong Chen <rong.a.chen-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vladimir Davydov <vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Yang Shi <yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
Cc: Matthew Wilcox <willy-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++
 include/linux/mmzone.h     |   3 +-
 mm/compaction.c            |  56 ++++++++++++++--------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  74 +++++++++++++++++++++++++++--
 mm/mlock.c                 |  22 ++++++---
 mm/mmzone.c                |   1 +
 mm/page_alloc.c            |   1 -
 mm/swap.c                  | 116 ++++++++++++++++++++++-----------------------
 mm/vmscan.c                |  55 ++++++++++-----------
 10 files changed, 271 insertions(+), 126 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0f4dd7829fb2..6ecb08ff4ad1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -666,6 +666,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1233,6 +1246,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1476,6 +1514,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1605,6 +1647,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..0afba4ea2a21 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,8 @@ enum lruvec_flags {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
 	 * over the other. As the observed cost of reclaiming one LRU
@@ -796,7 +798,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 7b1cf48884dd..9cfe90961493 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -804,7 +804,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -864,11 +864,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked) {
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -940,9 +949,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+					unlock_page_lruvec_irqrestore(locked, flags);
+					locked = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -983,10 +991,19 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked) {
+			if (locked)
+				unlock_page_lruvec_irqrestore(locked, flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -1005,9 +1022,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1041,8 +1057,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+			unlock_page_lruvec_irqrestore(locked, flags);
+			locked = NULL;
 		}
 		put_page(page);
 
@@ -1057,8 +1073,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1086,7 +1102,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_abort:
 	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(locked, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b70ec0c6076b..94e42dba052a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,7 +2354,7 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(tail), head);
 	VM_BUG_ON_PAGE(PageLRU(tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2438,7 +2438,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2456,10 +2455,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2480,7 +2477,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 157b745031a4..591f9f9ca8b2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,9 @@
  * Lockless page tracking & accounting
  * Unified hierarchy configuration model
  * Copyright (C) 2015 Red Hat, Inc., Johannes Weiner
+ *
+ * Per memcg lru locking
+ * Copyright (C) 2020 Alibaba, Inc, Alex Shi
  */
 
 #include <linux/page_counter.h>
@@ -1305,6 +1308,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1343,6 +1359,60 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 }
 
 /**
+ * lock_page_lruvec - lock and return lruvec for a given page.
+ * @page: the page
+ *
+ * This series functions should be used in either conditions:
+ * PageLRU is cleared or unset
+ * or page->_refcount is zero
+ * or page is locked.
+ */
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+/**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
  * @lru: index of lru list the page is sitting on
@@ -3245,10 +3315,8 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
 /*
- * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * Because page->mem_cgroup is not set on compound tails, set it now.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 7b0e6334be6f..ab164a675c25 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -262,12 +262,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -277,10 +277,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *lruvec;
+				struct lruvec *new_lruvec;
+
+				new_lruvec = mem_cgroup_page_lruvec(page,
+						page_pgdat(page));
+				if (new_lruvec != lruvec) {
+					if (lruvec)
+						unlock_page_lruvec_irq(lruvec);
+					lruvec = lock_page_lruvec_irq(page);
+				}
 
-				lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
@@ -299,8 +305,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	} else if (delta_munlocked) {
+		mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77220615fd5..74bf7f4c6317 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6808,7 +6808,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 1838a9535703..ed033f7c4f2d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,16 +79,14 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -207,32 +205,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,15 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		/*
+		 * Holding lruvec->lru_lock is safe here, since we are either
+		 * 1) holding a lruvec pinned by reclaim, or
+		 * 2) coming from a pre-LRU page during refault (which also holds
+		 *    the rcu lock, so it would be safe even if the page was on
+		 *    the LRU and could move simultaneously to a new lruvec).
+		 */
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +302,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -364,13 +366,15 @@ static inline void activate_page_drain(int cpu)
 
 static void activate_page(struct page *page)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+	if (TestClearPageLRU(page)) {
+		lruvec = lock_page_lruvec_irq(page);
+		__activate_page(page, lruvec);
+		unlock_page_lruvec_irq(lruvec);
+		SetPageLRU(page);
+	}
 }
 #endif
 
@@ -860,8 +864,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags;
 	unsigned int lock_batch;
 
@@ -871,11 +874,11 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		page = compound_head(page);
@@ -883,10 +886,9 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -907,27 +909,27 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -937,8 +939,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -1026,26 +1028,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0be55d875fde..2953ddec88a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1765,14 +1765,12 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1839,7 +1837,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
@@ -1850,9 +1847,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1874,9 +1871,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
@@ -1953,7 +1950,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1965,7 +1962,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1973,7 +1970,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1982,7 +1979,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2035,7 +2032,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2046,7 +2043,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2092,7 +2089,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2103,7 +2100,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2693,10 +2690,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4272,16 +4269,15 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
 		int nr_pages;
+		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4293,13 +4289,12 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
@@ -4313,10 +4308,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		SetPageLRU(page);
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	} else if (pgscanned) {
 		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 06/19] mm/rmap: stop store reordering issue on page->mapping
@ 2020-11-06  1:20     ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-06  1:20 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Minchan Kim



Updated to reflect the comment changes from Johannes.


From 2fd278b1ca6c3e260ad249808b62f671d8db5a7b Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Thu, 5 Nov 2020 11:38:24 +0800
Subject: [PATCH v21 06/19] mm/rmap: stop store reordering issue on
 page->mapping

Hugh Dickins and Minchan Kim observed a long-standing issue which was
discussed here, but the fix mentioned there was actually never applied:
https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
The store reordering may cause a problem in this scenario:

	CPU 0						CPU1
   do_anonymous_page
	page_add_new_anon_rmap()
	  page->mapping = anon_vma + PAGE_MAPPING_ANON
	lru_cache_add_inactive_or_unevictable()
	  spin_lock(lruvec->lock)
	  SetPageLRU()
	  spin_unlock(lruvec->lock)
						/* idle tracking judged it as an LRU
						 * page, so it passes the page to
						 * page_idle_clear_pte_refs
						 */
						page_idle_clear_pte_refs
						  rmap_walk
						    if PageAnon(page)

Johannes gave a detailed example of how the store reordering could
cause trouble:
"The concern is that the SetPageLRU may get reordered before the
'page->mapping' setting. That would make CPU 1 observe page->mapping
after observing PageLRU set on the page.

1. anon_vma + PAGE_MAPPING_ANON

   That's the in-order scenario and is fine.

2. NULL

   That's possible if the page->mapping store gets reordered to occur
   after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

   That would be a problem and could lead to all kinds of undesirable
   behavior including crashes and data corruption.

   Is it possible? AFAICT the compiler is allowed to tear the store to
   page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment."
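
To make that concrete, a toy sketch (not part of the patch; the real reader
side goes through PageAnon(), which loads page->mapping directly, and
READ_ONCE() is used below just to make the pairing explicit):

	struct address_space *mapping;

	/* writer, as in __page_set_anon_rmap() after this patch */
	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);

	/* lockless reader, e.g. the page_idle rmap walk */
	mapping = READ_ONCE(page->mapping);
	if ((unsigned long)mapping & PAGE_MAPPING_ANON) {
		/*
		 * With WRITE_ONCE() the reader sees either the old value
		 * (possibly NULL) or the complete new value, never an
		 * anon_vma pointer missing its PAGE_MAPPING_ANON bit.
		 */
	}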

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/rmap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 1b84945d655c..380c6b9956c2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,8 +1054,14 @@ static void __page_set_anon_rmap(struct page *page,
 	if (!exclusive)
 		anon_vma = anon_vma->root;
 
+	/*
+	 * page_idle does a lockless/optimistic rmap scan on page->mapping.
+	 * Make sure the compiler doesn't split the stores of anon_vma and
+	 * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
+	 * could mistake the mapping for a struct address_space and crash.
+	 */
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	page->mapping = (struct address_space *) anon_vma;
+	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 	page->index = linear_page_index(vma, address);
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-11-06  7:48       ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-06  7:48 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi



On 2020/11/5 9:43 PM, Alex Shi wrote:
> 
> On 2020/11/5 4:55 PM, Alex Shi wrote:
>>  /**
>> + * lock_page_lruvec - lock and return lruvec for a given page.
>> + * @page: the page
>> + *
>> + * This series functions should be used in either conditions:
>> + * PageLRU is cleared or unset
>> + * or page is locked.
>       or page->_refcount is zero.
> Oops, here is a typo: we need to add back the above line.
> So the updated patch is here:

Sorry for another miss: linux-next removed page.mem_cgroup, replacing
it with memcg_data, which is accessed via page_memcg(). Hence this and
the next patch need an update for this:
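
Roughly, that conversion means the lruvec helpers and the debug check read
the memcg through page_memcg() rather than dereferencing page->mem_cgroup; a
sketch of the debug helper after the change (assuming the linux-next
page_memcg() accessor, not the exact hunk):

	void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
	{
		struct mem_cgroup *memcg = page_memcg(page);

		if (mem_cgroup_disabled())
			return;

		if (!memcg)
			VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
		else
			VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != memcg, page);
	}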


From 84e69f892119d99612e9668e3fe47a3922bafff1 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 18 Aug 2020 16:44:21 +0800
Subject: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock

This patch moves the per-node lru_lock into the lruvec, thus bringing a
lru_lock for each memcg per node. So on a large machine, each memcg no
longer has to suffer from per-node pgdat->lru_lock contention; it can go
fast with its own lru_lock.

After moving the memcg charge before lru insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and
can replace the per-node lru lock.

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function to the locking which may give some clues if
something gets out of hand.

Daniel Jordan's testing shows a 62% improvement on a modified readtwice
case on his 2P * 10 core * 2 HT broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

On a large machine with memcg enabled but not used, looking up the
page's lruvec chases a few extra pointers, which may increase the
lru_lock holding time and cause a slight regression.

Hugh Dickins helped polish the patch, thanks!

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++
 include/linux/mmzone.h     |   3 +-
 mm/compaction.c            |  56 ++++++++++++++--------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  78 ++++++++++++++++++++++++++++--
 mm/mlock.c                 |  22 ++++++---
 mm/mmzone.c                |   1 +
 mm/page_alloc.c            |   1 -
 mm/swap.c                  | 116 ++++++++++++++++++++++-----------------------
 mm/vmscan.c                |  55 ++++++++++-----------
 10 files changed, 275 insertions(+), 126 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0f4dd7829fb2..6ecb08ff4ad1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -666,6 +666,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1233,6 +1246,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1476,6 +1514,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1605,6 +1647,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..0afba4ea2a21 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,8 @@ enum lruvec_flags {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
 	 * over the other. As the observed cost of reclaiming one LRU
@@ -796,7 +798,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 7b1cf48884dd..9cfe90961493 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -804,7 +804,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -864,11 +864,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked) {
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -940,9 +949,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+					unlock_page_lruvec_irqrestore(locked, flags);
+					locked = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -983,10 +991,19 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked) {
+			if (locked)
+				unlock_page_lruvec_irqrestore(locked, flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -1005,9 +1022,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1041,8 +1057,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+			unlock_page_lruvec_irqrestore(locked, flags);
+			locked = NULL;
 		}
 		put_page(page);
 
@@ -1057,8 +1073,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1086,7 +1102,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_abort:
 	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(locked, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b70ec0c6076b..94e42dba052a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,7 +2354,7 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(tail), head);
 	VM_BUG_ON_PAGE(PageLRU(tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2438,7 +2438,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2456,10 +2455,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2480,7 +2477,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 157b745031a4..7657f16cf992 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,9 @@
  * Lockless page tracking & accounting
  * Unified hierarchy configuration model
  * Copyright (C) 2015 Red Hat, Inc., Johannes Weiner
+ *
+ * Per memcg lru locking
+ * Copyright (C) 2020 Alibaba, Inc, Alex Shi
  */
 
 #include <linux/page_counter.h>
@@ -1305,6 +1308,23 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	memcg = page_memcg(page);
+
+	if (!memcg)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != memcg, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1343,6 +1363,60 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 }
 
 /**
+ * lock_page_lruvec - lock and return lruvec for a given page.
+ * @page: the page
+ *
+ * These functions should be used in one of the following conditions:
+ * PageLRU is cleared or unset
+ * or page->_refcount is zero
+ * or page is locked.
+ */
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+/**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
  * @lru: index of lru list the page is sitting on
@@ -3245,10 +3319,8 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
 /*
- * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * Because page_memcg(head) is not set on compound tails, set it now.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 7b0e6334be6f..ab164a675c25 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -262,12 +262,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -277,10 +277,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *lruvec;
+				struct lruvec *new_lruvec;
+
+				new_lruvec = mem_cgroup_page_lruvec(page,
+						page_pgdat(page));
+				if (new_lruvec != lruvec) {
+					if (lruvec)
+						unlock_page_lruvec_irq(lruvec);
+					lruvec = lock_page_lruvec_irq(page);
+				}
 
-				lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
@@ -299,8 +305,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	} else if (delta_munlocked) {
+		mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77220615fd5..74bf7f4c6317 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6808,7 +6808,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 1838a9535703..ed033f7c4f2d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,16 +79,14 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -207,32 +205,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,15 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		/*
+		 * Holding lruvec->lru_lock is safe here, since we are either
+		 * 1) on the pinned lruvec in reclaim, or
+		 * 2) coming from a pre-LRU page during refault (which also holds
+		 *    the rcu lock, so it would be safe even if the page was on
+		 *    the LRU and could move simultaneously to a new lruvec).
+		 */
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +302,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -364,13 +366,15 @@ static inline void activate_page_drain(int cpu)
 
 static void activate_page(struct page *page)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+	if (TestClearPageLRU(page)) {
+		lruvec = lock_page_lruvec_irq(page);
+		__activate_page(page, lruvec);
+		unlock_page_lruvec_irq(lruvec);
+		SetPageLRU(page);
+	}
 }
 #endif
 
@@ -860,8 +864,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags;
 	unsigned int lock_batch;
 
@@ -871,11 +874,11 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		page = compound_head(page);
@@ -883,10 +886,9 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -907,27 +909,27 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -937,8 +939,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -1026,26 +1028,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0be55d875fde..2953ddec88a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1765,14 +1765,12 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1839,7 +1837,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
@@ -1850,9 +1847,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1874,9 +1871,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
@@ -1953,7 +1950,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1965,7 +1962,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1973,7 +1970,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1982,7 +1979,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2035,7 +2032,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2046,7 +2043,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2092,7 +2089,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2103,7 +2100,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2693,10 +2690,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4272,16 +4269,15 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
 		int nr_pages;
+		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4293,13 +4289,12 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
@@ -4313,10 +4308,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		SetPageLRU(page);
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	} else if (pgscanned) {
 		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 18/19] mm/lru: introduce the relock_page_lruvec function
  2020-11-05  8:55   ` Alex Shi
@ 2020-11-06  7:50     ` Alex Shi
  -1 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-06  7:50 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Thomas Gleixner, Andrey Ryabinin



On 2020/11/5 at 4:55 PM, Alex Shi wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>


Here is the updated patch after the page_memcg() change:


From 6c142eb582e7d0dbf473572ad092eca07ab75221 Mon Sep 17 00:00:00 2001
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Date: Tue, 26 May 2020 17:31:15 +0800
Subject: [PATCH v21 18/19] mm/lru: introduce the relock_page_lruvec function

Use this new function to replace the repeated code; there is no functional
change.

When testing whether a relock is needed, we can avoid RCU locking by simply
comparing the page's pgdat and memcg pointers against those held by the
lruvec. This avoids the extra pointer walks and accesses of the memory
cgroup.

In addition, the checks can be skipped entirely if lruvec is currently NULL.
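
As a rough illustration, condensed from the mm/swap.c hunks below (the
pagevec walk stands in for any caller; nothing here goes beyond what this
patch itself adds), the intended caller pattern is:

	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* keep holding the lock if it already covers this page */
		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
		/* ... operate on the page under lruvec->lru_lock ... */
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);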

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/mlock.c                 | 11 +---------
 mm/swap.c                  | 33 +++++++----------------------
 mm/vmscan.c                | 12 ++---------
 4 files changed, 62 insertions(+), 46 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6ecb08ff4ad1..8c57d6335ee4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -660,6 +660,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	const struct mem_cgroup *memcg;
+	struct mem_cgroup_per_node *mz;
+
+	if (mem_cgroup_disabled())
+		return lruvec == &pgdat->__lruvec;
+
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+	memcg = page_memcg(page) ? : root_mem_cgroup;
+
+	return lruvec->pgdat == pgdat && mz->memcg == memcg;
+}
+
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -1221,6 +1237,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->__lruvec;
 }
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+
+	return lruvec == &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
@@ -1663,6 +1687,34 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 }
 
+/* Don't lock again if the page's lruvec is already locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+		struct lruvec *locked_lruvec)
+{
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
+
+		unlock_page_lruvec_irq(locked_lruvec);
+	}
+
+	return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again if the page's lruvec is already locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+		struct lruvec *locked_lruvec, unsigned long *flags)
+{
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
+
+		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+	}
+
+	return lock_page_lruvec_irqsave(page, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/mlock.c b/mm/mlock.c
index ab164a675c25..55b3b3672977 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -277,16 +277,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *new_lruvec;
-
-				new_lruvec = mem_cgroup_page_lruvec(page,
-						page_pgdat(page));
-				if (new_lruvec != lruvec) {
-					if (lruvec)
-						unlock_page_lruvec_irq(lruvec);
-					lruvec = lock_page_lruvec_irq(page);
-				}
-
+				lruvec = relock_page_lruvec_irq(page, lruvec);
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
diff --git a/mm/swap.c b/mm/swap.c
index ed033f7c4f2d..c593ba596dea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -210,19 +210,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
-
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
@@ -918,17 +911,12 @@ void release_pages(struct page **pages, int nr)
 		}
 
 		if (PageLRU(page)) {
-			struct lruvec *new_lruvec;
-
-			new_lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
-			if (new_lruvec != lruvec) {
-				if (lruvec)
-					unlock_page_lruvec_irqrestore(lruvec,
-									flags);
+			struct lruvec *prev_lruvec = lruvec;
+
+			lruvec = relock_page_lruvec_irqsave(page, lruvec,
+									&flags);
+			if (prev_lruvec != lruvec)
 				lock_batch = 0;
-				lruvec = lock_page_lruvec_irqsave(page, &flags);
-			}
 
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
@@ -1033,15 +1021,8 @@ void __pagevec_lru_add(struct pagevec *pvec)
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
-
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
 
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
 	if (lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2953ddec88a0..3b09a39de8cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1884,8 +1884,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		 * All pages were isolated from the same lruvec (and isolation
 		 * inhibits memcg migration).
 		 */
-		VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page))
-							!= lruvec, page);
+		VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
 		lru = page_lru(page);
 		nr_pages = thp_nr_pages(page);
 
@@ -4277,7 +4276,6 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
 		int nr_pages;
-		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4289,13 +4287,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
-
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
@ 2020-11-10 12:14   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-10 12:14 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Hi All,

Are there any more comments on this version?

Thanks
Alex

On 2020/11/5 at 4:55 PM, Alex Shi wrote:
> This version rebase on next/master 20201104, with much of Johannes's
> Acks and some changes according to Johannes comments. And add a new patch
> v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> v21-0007.
> 
> This patchset followed 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
> added to -mm tree yesterday.
>  
> Many thanks for line by line review by Hugh Dickins, Alexander Duyck and
> Johannes Weiner.
> 
> So now this patchset includes 3 parts:
> 1, some code cleanup and minimum optimization as a preparation. 
> 2, use TestCleanPageLRU as page isolation's precondition.
> 3, replace per node lru_lock with per memcg per node lru_lock.
> 
> Current lru_lock is one for each of node, pgdat->lru_lock, that guard for
> lru lists, but now we had moved the lru lists into memcg for long time. Still
> using per node lru_lock is clearly unscalable, pages on each of memcgs have
> to compete each others for a whole lru_lock. This patchset try to use per
> lruvec/memcg lru_lock to repleace per node lru lock to guard lru lists, make
> it scalable for memcgs and get performance gain.
> 
> Currently lru_lock still guards both lru list and page's lru bit, that's ok.
> but if we want to use specific lruvec lock on the page, we need to pin down
> the page's lruvec/memcg during locking. Just taking lruvec lock first may be
> undermined by the page's memcg charge/migration. To fix this problem, we could
> take out the page's lru bit clear and use it as pin down action to block the
> memcg changes. That's the reason for new atomic func TestClearPageLRU.
> So now isolating a page need both actions: TestClearPageLRU and hold the
> lru_lock.
> 
> The typical usage of this is isolate_migratepages_block() in compaction.c
> we have to take lru bit before lru lock, that serialized the page isolation
> in memcg page charge/migration which will change page's lruvec and new 
> lru_lock in it.
> 
> The above solution suggested by Johannes Weiner, and based on his new memcg 
> charge path, then have this patchset. (Hugh Dickins tested and contributed much
> code from compaction fix to general code polish, thanks a lot!).
> 
> Daniel Jordan's testing show 62% improvement on modified readtwice case
> on his 2P * 10 core * 2 HT broadwell box on v18, which has no much different
> with this v20.
> https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
> 
> Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought this
> idea 8 years ago, and others who give comments as well: Daniel Jordan, 
> Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc.
> 
> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
> 
> 
> Alex Shi (16):
>   mm/thp: move lru_add_page_tail func to huge_memory.c
>   mm/thp: use head for head page in lru_add_page_tail
>   mm/thp: Simplify lru_add_page_tail()
>   mm/thp: narrow lru locking
>   mm/vmscan: remove unnecessary lruvec adding
>   mm/rmap: stop store reordering issue on page->mapping
>   mm/memcg: add debug checking in lock_page_memcg
>   mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
>   mm/lru: move lock into lru_note_cost
>   mm/vmscan: remove lruvec reget in move_pages_to_lru
>   mm/mlock: remove lru_lock on TestClearPageMlocked
>   mm/mlock: remove __munlock_isolate_lru_page
>   mm/lru: introduce TestClearPageLRU
>   mm/compaction: do page isolation first in compaction
>   mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
>   mm/lru: replace pgdat lru_lock with lruvec lock
> 
> Alexander Duyck (1):
>   mm/lru: introduce the relock_page_lruvec function
> 
> Hugh Dickins (2):
>   mm: page_idle_get_page() does not need lru_lock
>   mm/lru: revise the comments of lru_lock
> 
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
>  Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
>  Documentation/trace/events-kmem.rst                |   2 +-
>  Documentation/vm/unevictable-lru.rst               |  22 +--
>  include/linux/memcontrol.h                         | 110 +++++++++++
>  include/linux/mm_types.h                           |   2 +-
>  include/linux/mmzone.h                             |   6 +-
>  include/linux/page-flags.h                         |   1 +
>  include/linux/swap.h                               |   4 +-
>  mm/compaction.c                                    |  94 +++++++---
>  mm/filemap.c                                       |   4 +-
>  mm/huge_memory.c                                   |  45 +++--
>  mm/memcontrol.c                                    |  79 +++++++-
>  mm/mlock.c                                         |  63 ++-----
>  mm/mmzone.c                                        |   1 +
>  mm/page_alloc.c                                    |   1 -
>  mm/page_idle.c                                     |   4 -
>  mm/rmap.c                                          |  11 +-
>  mm/swap.c                                          | 208 ++++++++-------------
>  mm/vmscan.c                                        | 207 ++++++++++----------
>  mm/workingset.c                                    |   2 -
>  21 files changed, 530 insertions(+), 372 deletions(-)
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-11-10 18:54         ` Johannes Weiner
  0 siblings, 0 replies; 111+ messages in thread
From: Johannes Weiner @ 2020-11-10 18:54 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko, Yang Shi

On Fri, Nov 06, 2020 at 03:48:16PM +0800, Alex Shi wrote:
> From 84e69f892119d99612e9668e3fe47a3922bafff1 Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linux.alibaba.com>
> Date: Tue, 18 Aug 2020 16:44:21 +0800
> Subject: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
> 
> This patch moves the per-node lru_lock into the lruvec, thus bringing one
> lru_lock for each memcg on each node. So on a large machine, memcgs no
> longer have to contend for a single per-node pgdat->lru_lock; each can go
> fast with its own lru_lock.
> 
> After moving the memcg charge before lru insertion, page isolation can
> serialize the page's memcg, so the per-memcg lruvec lock is stable and can
> replace the per-node lru lock.
> 
> In isolate_migratepages_block(), compact_unlock_should_abort() and
> lock_page_lruvec_irqsave() are open coded to work with compact_control.
> Also add a debug function in the locking path which may give some clues if
> something gets out of hand.
> 
> Daniel Jordan's testing shows a 62% improvement on a modified readtwice
> case on his 2P * 10 core * 2 HT broadwell box.
> https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
> 
> On a large machine with memcg enabled but not used, looking up the page's
> lruvec passes through a few extra pointers, which may increase lru_lock
> hold time and cause a slight regression.
> 
> Hugh Dickins helped on the patch polish, thanks!
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Rong Chen <rong.a.chen@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: cgroups@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 18/19] mm/lru: introduce the relock_page_lruvec function
@ 2020-11-10 18:59       ` Johannes Weiner
  0 siblings, 0 replies; 111+ messages in thread
From: Johannes Weiner @ 2020-11-10 18:59 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Alexander Duyck, Thomas Gleixner,
	Andrey Ryabinin

On Fri, Nov 06, 2020 at 03:50:22PM +0800, Alex Shi wrote:
> From 6c142eb582e7d0dbf473572ad092eca07ab75221 Mon Sep 17 00:00:00 2001
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Date: Tue, 26 May 2020 17:31:15 +0800
> Subject: [PATCH v21 18/19] mm/lru: introduce the relock_page_lruvec function
> 
> Use this new function to replace the repeated code; there is no functional
> change.
> 
> When testing whether a relock is needed, we can avoid RCU locking by simply
> comparing the page's pgdat and memcg pointers against those held by the
> lruvec. This avoids the extra pointer walks and accesses of the memory
> cgroup.
> 
> In addition, the checks can be skipped entirely if lruvec is currently NULL.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 07/19] mm: page_idle_get_page() does not need lru_lock
  2020-11-05  8:55   ` Alex Shi
  (?)
@ 2020-11-10 19:01   ` Johannes Weiner
  -1 siblings, 0 replies; 111+ messages in thread
From: Johannes Weiner @ 2020-11-10 19:01 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Vlastimil Babka, Minchan Kim

On Thu, Nov 05, 2020 at 04:55:37PM +0800, Alex Shi wrote:
> From: Hugh Dickins <hughd@google.com>
> 
> It is necessary for page_idle_get_page() to recheck PageLRU() after
> get_page_unless_zero(), but holding lru_lock around that serves no
> useful purpose, and adds to lru_lock contention: delete it.
> 
> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> discussion that led to lru_lock there; but __page_set_anon_rmap() now
> uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs()
> using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly
> but not entirely prevented by page_count() check in ksm.c's
> write_protect_page(): that risk being shared with page_referenced() and
> not helped by lru_lock).
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 06/19] mm/rmap: stop store reordering issue on page->mapping
  2020-11-06  1:20     ` Alex Shi
  (?)
@ 2020-11-10 19:06     ` Johannes Weiner
  -1 siblings, 0 replies; 111+ messages in thread
From: Johannes Weiner @ 2020-11-10 19:06 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Minchan Kim

On Fri, Nov 06, 2020 at 09:20:04AM +0800, Alex Shi wrote:
> From 2fd278b1ca6c3e260ad249808b62f671d8db5a7b Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linux.alibaba.com>
> Date: Thu, 5 Nov 2020 11:38:24 +0800
> Subject: [PATCH v21 06/19] mm/rmap: stop store reordering issue on
>  page->mapping
> 
> Hugh Dickins and Minchan Kim observed a long-standing issue that was
> discussed here, but the fix mentioned there was never applied:
> https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
> The store reordering may cause a problem in this scenario:
> 
> 	CPU 0						CPU1
>    do_anonymous_page
> 	page_add_new_anon_rmap()
> 	  page->mapping = anon_vma + PAGE_MAPPING_ANON
> 	lru_cache_add_inactive_or_unevictable()
> 	  spin_lock(lruvec->lock)
> 	  SetPageLRU()
> 	  spin_unlock(lruvec->lock)
> 						/* idle tracking judged it as LRU
> 						 * page so pass the page in
> 						 * page_idle_clear_pte_refs
> 						 */
> 						page_idle_clear_pte_refs
> 						  rmap_walk
> 						    if PageAnon(page)
> 
> Johannes gave detailed examples of how the store reordering could cause
> trouble:
> "The concern is that the SetPageLRU may get reordered before the
> 'page->mapping' store, which would let CPU 1 observe page->mapping after
> observing PageLRU set on the page.
> 
> 1. anon_vma + PAGE_MAPPING_ANON
> 
>    That's the in-order scenario and is fine.
> 
> 2. NULL
> 
>    That's possible if the page->mapping store gets reordered to occur
>    after SetPageLRU. That's fine too because we check for it.
> 
> 3. anon_vma without the PAGE_MAPPING_ANON bit
> 
>    That would be a problem and could lead to all kinds of undesirable
>    behavior including crashes and data corruption.
> 
>    Is it possible? AFAICT the compiler is allowed to tear the store to
>    page->mapping and I don't see anything that would prevent it.
> 
> That said, I also don't see how the reader testing PageLRU under the
> lru_lock would prevent that in the first place. AFAICT we need that
> WRITE_ONCE() around the page->mapping assignment."
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks Alex!

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 06/19] mm/rmap: stop store reordering issue on page->mapping
  2020-11-06  1:20     ` Alex Shi
@ 2020-11-11  7:41       ` Hugh Dickins
  -1 siblings, 0 replies; 111+ messages in thread
From: Hugh Dickins @ 2020-11-11  7:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Minchan Kim

On Fri, 6 Nov 2020, Alex Shi wrote:
> 
> updated for comments change from Johannes
> 
> 
> From 2fd278b1ca6c3e260ad249808b62f671d8db5a7b Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linux.alibaba.com>
> Date: Thu, 5 Nov 2020 11:38:24 +0800
> Subject: [PATCH v21 06/19] mm/rmap: stop store reordering issue on
>  page->mapping
> 
> Hugh Dickins and Minchan Kim observed a long-standing issue that was
> discussed here, but the fix mentioned there was never applied:
> https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
> The store reordering may cause a problem in this scenario:
> 
> 	CPU 0						CPU1
>    do_anonymous_page
> 	page_add_new_anon_rmap()
> 	  page->mapping = anon_vma + PAGE_MAPPING_ANON
> 	lru_cache_add_inactive_or_unevictable()
> 	  spin_lock(lruvec->lock)
> 	  SetPageLRU()
> 	  spin_unlock(lruvec->lock)
> 						/* idle tracking judged it as LRU
> 						 * page so pass the page in
> 						 * page_idle_clear_pte_refs
> 						 */
> 						page_idle_clear_pte_refs
> 						  rmap_walk
> 						    if PageAnon(page)
> 
> Johannes gave detailed examples of how the store reordering could cause
> trouble:
> "The concern is that the SetPageLRU may get reordered before the
> 'page->mapping' store, which would let CPU 1 observe page->mapping after
> observing PageLRU set on the page.
> 
> 1. anon_vma + PAGE_MAPPING_ANON
> 
>    That's the in-order scenario and is fine.
> 
> 2. NULL
> 
>    That's possible if the page->mapping store gets reordered to occur
>    after SetPageLRU. That's fine too because we check for it.
> 
> 3. anon_vma without the PAGE_MAPPING_ANON bit
> 
>    That would be a problem and could lead to all kinds of undesirable
>    behavior including crashes and data corruption.
> 
>    Is it possible? AFAICT the compiler is allowed to tear the store to
>    page->mapping and I don't see anything that would prevent it.
> 
> That said, I also don't see how the reader testing PageLRU under the
> lru_lock would prevent that in the first place. AFAICT we need that
> WRITE_ONCE() around the page->mapping assignment."
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>

Acked-by: Hugh Dickins <hughd@google.com>

Many thanks to Johannes for spotting my falsehood in the next patch,
and to Alex for making it true with this patch.  As I just remarked
against the v20, I do have some more of these WRITE_ONCEs, but consider
them merely theoretical: so please don't let me hold this series up.

Andrew, I am hoping that Alex's v21 will appear in the next mmotm?

Thanks,
Hugh


> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  mm/rmap.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1b84945d655c..380c6b9956c2 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1054,8 +1054,14 @@ static void __page_set_anon_rmap(struct page *page,
>  	if (!exclusive)
>  		anon_vma = anon_vma->root;
>  
> +	/*
> +	 * page_idle does a lockless/optimistic rmap scan on page->mapping.
> +	 * Make sure the compiler doesn't split the stores of anon_vma and
> +	 * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
> +	 * could mistake the mapping for a struct address_space and crash.
> +	 */
>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> -	page->mapping = (struct address_space *) anon_vma;
> +	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
>  	page->index = linear_page_index(vma, address);
>  }
>  
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 07/19] mm: page_idle_get_page() does not need lru_lock
  2020-11-05  8:55   ` Alex Shi
  (?)
@ 2020-11-11  8:17     ` huang ying
  -1 siblings, 0 replies; 111+ messages in thread
From: huang ying @ 2020-11-11  8:17 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, khlebnikov,
	daniel.m.jordan, willy, Johannes Weiner, lkp, linux-mm, LKML,
	cgroups, Shakeel Butt, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, kernel test robot, Michal Hocko,
	Vladimir Davydov, shy828301, Vlastimil Babka, Minchan Kim

On Thu, Nov 5, 2020 at 4:56 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> From: Hugh Dickins <hughd@google.com>
>
> It is necessary for page_idle_get_page() to recheck PageLRU() after
> get_page_unless_zero(), but holding lru_lock around that serves no
> useful purpose, and adds to lru_lock contention: delete it.
>
> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> discussion that led to lru_lock there; but __page_set_anon_rmap() now
> uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs()
> using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly
> but not entirely prevented by page_count() check in ksm.c's
> write_protect_page(): that risk being shared with page_referenced() and
> not helped by lru_lock).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/page_idle.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index 057c61df12db..64e5344a992c 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -32,19 +32,15 @@
>  static struct page *page_idle_get_page(unsigned long pfn)
>  {
>         struct page *page = pfn_to_online_page(pfn);
> -       pg_data_t *pgdat;
>
>         if (!page || !PageLRU(page) ||
>             !get_page_unless_zero(page))
>                 return NULL;
>
> -       pgdat = page_pgdat(page);
> -       spin_lock_irq(&pgdat->lru_lock);

get_page_unless_zero() is a full memory barrier.  But do we need a
compiler barrier here to prevent the compiler from caching the PageLRU()
result?  Otherwise looks OK to me,

Acked-by: "Huang, Ying" <ying.huang@intel.com>

Best Regards,
Huang, Ying

>         if (unlikely(!PageLRU(page))) {
>                 put_page(page);
>                 page = NULL;
>         }
> -       spin_unlock_irq(&pgdat->lru_lock);
>         return page;
>  }
>
> --
> 1.8.3.1
>
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 05/19] mm/vmscan: remove unnecessary lruvec adding
@ 2020-11-11 12:36     ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 12:36 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On 11/5/20 9:55 AM, Alex Shi wrote:
> We don't have to add a freeable page to the lru and then remove it again.
> This change saves a couple of actions and makes the page moving clearer.
> 
> The SetPageLRU needs to be kept before put_page_testzero for list
> integrity, otherwise:
> 
>    #0 move_pages_to_lru             #1 release_pages
>    if !put_page_testzero
>       			           if (put_page_testzero())
>       			              !PageLRU //skip lru_lock
>       SetPageLRU()
>       list_add(&page->lru,)
>                                           list_add(&page->lru,)
> 
> [akpm@linux-foundation.org: coding style fixes]
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Nice cleanup!

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 07/19] mm: page_idle_get_page() does not need lru_lock
@ 2020-11-11 12:52       ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 12:52 UTC (permalink / raw)
  To: huang ying, Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, khlebnikov,
	daniel.m.jordan, willy, Johannes Weiner, lkp, linux-mm, LKML,
	cgroups, Shakeel Butt, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, kernel test robot, Michal Hocko,
	Vladimir Davydov, shy828301, Minchan Kim

On 11/11/20 9:17 AM, huang ying wrote:
> On Thu, Nov 5, 2020 at 4:56 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>> From: Hugh Dickins <hughd@google.com>
>>
>> It is necessary for page_idle_get_page() to recheck PageLRU() after
>> get_page_unless_zero(), but holding lru_lock around that serves no
>> useful purpose, and adds to lru_lock contention: delete it.
>>
>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
>> uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs()
>> using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly
>> but not entirely prevented by page_count() check in ksm.c's
>> write_protect_page(): that risk being shared with page_referenced() and
>> not helped by lru_lock).
>>
>> Signed-off-by: Hugh Dickins <hughd@google.com>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Alex Shi <alex.shi@linux.alibaba.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  mm/page_idle.c | 4 ----
>>  1 file changed, 4 deletions(-)
>>
>> diff --git a/mm/page_idle.c b/mm/page_idle.c
>> index 057c61df12db..64e5344a992c 100644
>> --- a/mm/page_idle.c
>> +++ b/mm/page_idle.c
>> @@ -32,19 +32,15 @@
>>  static struct page *page_idle_get_page(unsigned long pfn)
>>  {
>>         struct page *page = pfn_to_online_page(pfn);
>> -       pg_data_t *pgdat;
>>
>>         if (!page || !PageLRU(page) ||
>>             !get_page_unless_zero(page))
>>                 return NULL;
>>
>> -       pgdat = page_pgdat(page);
>> -       spin_lock_irq(&pgdat->lru_lock);
> 
> get_page_unless_zero() is a full memory barrier.  But do we need a
> compiler barrier here to prevent the compiler from caching the PageLRU()
> result?  Otherwise looks OK to me,

I think the compiler barrier is also implied by the full memory barrier and 
prevents the caching.
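
A small illustrative sketch of that point, using a GCC-style asm barrier (the
x86 mnemonic is picked only for concreteness; this is not the kernel's actual
definition): the "memory" clobber carried by a full barrier is what forbids
the compiler from reusing a PageLRU() value read before it.

    /* toy_barrier.c: the "memory" clobber forces a fresh load after it */
    #define toy_full_mb()  __asm__ __volatile__("mfence" ::: "memory")

    int toy_recheck(unsigned long *flags, unsigned long lru_bit)
    {
            int was_lru = !!(*flags & lru_bit);  /* PageLRU()-like test */

            toy_full_mb();  /* stands in for the full barrier implied by a
                               successful get_page_unless_zero() */

            /* the clobber means this is a new load, not a cached value */
            return was_lru && !!(*flags & lru_bit);
    }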

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> Acked-by: "Huang, Ying" <ying.huang@intel.com>
> 
> Best Regards,
> Huang, Ying
> 
>>         if (unlikely(!PageLRU(page))) {
>>                 put_page(page);
>>                 page = NULL;
>>         }
>> -       spin_unlock_irq(&pgdat->lru_lock);
>>         return page;
>>  }
>>
>> --
>> 1.8.3.1
>>
>>
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 12/19] mm/mlock: remove lru_lock on TestClearPageMlocked
@ 2020-11-11 13:03     ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 13:03 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Kirill A. Shutemov

On 11/5/20 9:55 AM, Alex Shi wrote:
> In the function munlock_vma_page, a comment claimed that lru_lock was needed
> for serialization with split_huge_pages. But the page must be PageLocked
> here, as must the pages in the split_huge_page series of functions, so
> PageLocked is enough to serialize both.
> 
> Furthermore, Hugh Dickins pointed out: before splitting in
> split_huge_page_to_list, the page goes through unmap_page() to remove the
> pmd/ptes which protect the page from munlock. Thus there is no need to guard
> __split_huge_page_tail for the mlock clearing; just keep the lru_lock there
> for isolation purposes.
> 
> LKP found a preempt issue with __mod_zone_page_state, which needs to be
> changed to mod_zone_page_state. Thanks!
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Nit below:

> ---
>   mm/mlock.c | 26 +++++---------------------
>   1 file changed, 5 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 884b1216da6a..796c726a0407 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -187,40 +187,24 @@ static void __munlock_isolation_failed(struct page *page)
>   unsigned int munlock_vma_page(struct page *page)
>   {
>   	int nr_pages;
> -	pg_data_t *pgdat = page_pgdat(page);
>   
>   	/* For try_to_munlock() and to serialize with page migration */

Now the reasons for locking are expanded?

>   	BUG_ON(!PageLocked(page));
> -
>   	VM_BUG_ON_PAGE(PageTail(page), page);
>   
> -	/*
> -	 * Serialize with any parallel __split_huge_page_refcount() which
> -	 * might otherwise copy PageMlocked to part of the tail pages before
> -	 * we clear it in the head page. It also stabilizes thp_nr_pages().
> -	 */
> -	spin_lock_irq(&pgdat->lru_lock);
> -
>   	if (!TestClearPageMlocked(page)) {
>   		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
> -		nr_pages = 1;
> -		goto unlock_out;
> +		return 0;
>   	}
>   
>   	nr_pages = thp_nr_pages(page);
> -	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
> +	mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
>   
> -	if (__munlock_isolate_lru_page(page, true)) {
> -		spin_unlock_irq(&pgdat->lru_lock);
> +	if (!isolate_lru_page(page))
>   		__munlock_isolated_page(page);
> -		goto out;
> -	}
> -	__munlock_isolation_failed(page);
> -
> -unlock_out:
> -	spin_unlock_irq(&pgdat->lru_lock);
> +	else
> +		__munlock_isolation_failed(page);
>   
> -out:
>   	return nr_pages - 1;
>   }
>   
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 13/19] mm/mlock: remove __munlock_isolate_lru_page
  2020-11-05  8:55 ` [PATCH v21 13/19] mm/mlock: remove __munlock_isolate_lru_page Alex Shi
@ 2020-11-11 13:07   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 13:07 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Kirill A. Shutemov

On 11/5/20 9:55 AM, Alex Shi wrote:
> The function only has one caller; remove it to clean up and simplify the
> code.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Nit below:

> ---
>   mm/mlock.c | 31 +++++++++----------------------
>   1 file changed, 9 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 796c726a0407..d487aa864e86 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -106,26 +106,6 @@ void mlock_vma_page(struct page *page)
>   }
>   
>   /*
> - * Isolate a page from LRU with optional get_page() pin.
> - * Assumes lru_lock already held and page already pinned.
> - */
> -static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
> -{
> -	if (PageLRU(page)) {
> -		struct lruvec *lruvec;
> -
> -		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -		if (getpage)
> -			get_page(page);
> -		ClearPageLRU(page);
> -		del_page_from_lru_list(page, lruvec, page_lru(page));
> -		return true;
> -	}
> -
> -	return false;
> -}
> -
> -/*
>    * Finish munlock after successful page isolation
>    *
>    * Page must be locked. This is a wrapper for try_to_munlock()
> @@ -296,9 +276,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>   			 * We already have pin from follow_page_mask()
>   			 * so we can spare the get_page() here.
>   			 */
> -			if (__munlock_isolate_lru_page(page, false))
> +			if (PageLRU(page)) {
> +				struct lruvec *lruvec;
> +
> +				ClearPageLRU(page);
> +				lruvec = mem_cgroup_page_lruvec(page,
> +							page_pgdat(page));
> +				del_page_from_lru_list(page, lruvec,
> +							page_lru(page));
>   				continue;
> -			else
> +			} else
>   				__munlock_isolation_failed(page);

IIRC the coding style says that once the if () part uses braces, the else
part should too, even if it's a single line.
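
That is, bracing both branches of the hunk above, roughly:

			if (PageLRU(page)) {
				struct lruvec *lruvec;

				ClearPageLRU(page);
				lruvec = mem_cgroup_page_lruvec(page,
							page_pgdat(page));
				del_page_from_lru_list(page, lruvec,
							page_lru(page));
				continue;
			} else {
				__munlock_isolation_failed(page);
			}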

>   		} else {
>   			delta_munlocked++;
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 14/19] mm/lru: introduce TestClearPageLRU
  2020-11-05  8:55   ` Alex Shi
  (?)
@ 2020-11-11 13:36   ` Vlastimil Babka
  2020-11-12  2:03       ` Hugh Dickins
  -1 siblings, 1 reply; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 13:36 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

On 11/5/20 9:55 AM, Alex Shi wrote:
> Currently lru_lock still guards both the lru list and the page's lru bit,
> which is fine. But if we want to use a specific lruvec lock for the page, we
> need to pin down the page's lruvec/memcg while locking. Just taking the
> lruvec lock first may be undermined by the page's memcg charge/migration. To
> fix this problem, we clear the lru bit outside of the lock and use it as a
> pin-down action to block page isolation during a memcg change.
> 
> So now a standard steps of page isolation is following:
> 	1, get_page(); 	       #pin the page avoid to be free
> 	2, TestClearPageLRU(); #block other isolation like memcg change
> 	3, spin_lock on lru_lock; #serialize lru list access
> 	4, delete page from lru list;
> 
> This patch starts with the first part: TestClearPageLRU, which combines
> the PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This
> function will be used as a page isolation precondition to prevent other
> isolations somewhere else. Then there are may !PageLRU page on lru
 > list, need to remove BUG() checking accordingly.

As there now may be !PageLRU pages on lru list, we need to ...

> 
> There 2 rules for lru bit now:
> 1, the lru bit still indicate if a page on lru list, just in some
>     temporary moment(isolating), the page may have no lru bit when
>     it's on lru list.  but the page still must be on lru list when the
>     lru bit set.
> 2, have to remove lru bit before delete it from lru list.

2. we have to remove the lru bit before deleting page from lru list

> 
> As Andrew Morton mentioned, this change would dirty the cacheline for a
> page that isn't on the LRU. But the cost should be acceptable according to
> Rong Chen <rong.a.chen@intel.com>'s report:
> https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/

AFAIK these places generally expect PageLRU to be true, and if it's false, it's 
because of a race, so that effect should be negligible?

> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---

...

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1542,7 +1542,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>    */
>   int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>   {
> -	int ret = -EINVAL;
> +	int ret = -EBUSY;
>   
>   	/* Only take pages on the LRU. */
>   	if (!PageLRU(page))
> @@ -1552,8 +1552,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>   	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>   		return ret;
>   
> -	ret = -EBUSY;

I'm not sure why this change is here, looks unrelated to the patch?

Oh I see, you want to prevent the BUG() in isolate_lru_pages().

But due to that, the PageUnevictable check was also affected unintentionally. 
But I don't think it's that important to BUG() when we run into PageUnevictable 
unexpectedly, so that's probably ok.

But with that, we can just make __isolate_lru_page() a bool function and remove 
the ugly switch in  isolate_lru_pages()?
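
A rough sketch of that suggestion (illustrative only, not a tested patch):
the checks are the ones from the hunk above, each returning false instead of
an errno, and the final TestClearPageLRU()/put_page() step likewise reports
true/false.

    /* sketch: __isolate_lru_page() as a bool predicate */
    static bool __isolate_lru_page(struct page *page, isolate_mode_t mode)
    {
            /* Only take pages on the LRU. */
            if (!PageLRU(page))
                    return false;

            if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
                    return false;

            /* ... the ISOLATE_ASYNC_MIGRATE / ISOLATE_UNMAPPED checks,
             * each returning false on failure ... */

            if (likely(get_page_unless_zero(page))) {
                    if (TestClearPageLRU(page))
                            return true;
                    put_page(page);
            }
            return false;
    }

    /* caller in isolate_lru_pages(), instead of the switch: */
    if (__isolate_lru_page(page, mode)) {
            /* take the page */
    } else {
            /* put it back / skip, as the -EBUSY case does today */
    }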

> -
>   	/*
>   	 * To minimise LRU disruption, the caller can indicate that it only
>   	 * wants to isolate pages it will be able to operate on without
> @@ -1600,8 +1598,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>   		 * sure the page is not being freed elsewhere -- the
>   		 * page release code relies on it.
>   		 */
> -		ClearPageLRU(page);
> -		ret = 0;
> +		if (TestClearPageLRU(page))
> +			ret = 0;
> +		else
> +			put_page(page);
>   	}
>   
>   	return ret;
> @@ -1667,8 +1667,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>   		page = lru_to_page(src);
>   		prefetchw_prev_lru_page(page, src, flags);
>   
> -		VM_BUG_ON_PAGE(!PageLRU(page), page);
> -
>   		nr_pages = compound_nr(page);
>   		total_scan += nr_pages;
>   
> @@ -1765,21 +1763,18 @@ int isolate_lru_page(struct page *page)
>   	VM_BUG_ON_PAGE(!page_count(page), page);
>   	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>   
> -	if (PageLRU(page)) {
> +	if (TestClearPageLRU(page)) {
>   		pg_data_t *pgdat = page_pgdat(page);
>   		struct lruvec *lruvec;
>   
> -		spin_lock_irq(&pgdat->lru_lock);
> +		get_page(page);
>   		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -		if (PageLRU(page)) {
> -			int lru = page_lru(page);
> -			get_page(page);
> -			ClearPageLRU(page);
> -			del_page_from_lru_list(page, lruvec, lru);
> -			ret = 0;
> -		}
> +		spin_lock_irq(&pgdat->lru_lock);
> +		del_page_from_lru_list(page, lruvec, page_lru(page));
>   		spin_unlock_irq(&pgdat->lru_lock);
> +		ret = 0;
>   	}
> +
>   	return ret;
>   }
>   
> @@ -4293,6 +4288,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>   		nr_pages = thp_nr_pages(page);
>   		pgscanned += nr_pages;
>   
> +		/* block memcg migration during page moving between lru */
> +		if (!TestClearPageLRU(page))
> +			continue;
> +
>   		if (pagepgdat != pgdat) {
>   			if (pgdat)
>   				spin_unlock_irq(&pgdat->lru_lock);
> @@ -4301,10 +4300,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>   		}
>   		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>   
> -		if (!PageLRU(page) || !PageUnevictable(page))
> -			continue;
> -
> -		if (page_evictable(page)) {
> +		if (page_evictable(page) && PageUnevictable(page)) {

Doing PageUnevictable() test first should be cheaper?
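
That is, swapping the order so the single-bit PageUnevictable() test
short-circuits before the costlier page_evictable() call (sketch of the
condition in the hunk below):

		if (PageUnevictable(page) && page_evictable(page)) {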

>   			enum lru_list lru = page_lru_base_type(page);
>   
>   			VM_BUG_ON_PAGE(PageActive(page), page);
> @@ -4313,12 +4309,15 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>   			add_page_to_lru_list(page, lruvec, lru);
>   			pgrescued += nr_pages;
>   		}
> +		SetPageLRU(page);
>   	}
>   
>   	if (pgdat) {
>   		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
>   		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
>   		spin_unlock_irq(&pgdat->lru_lock);
> +	} else if (pgscanned) {
> +		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
>   	}
>   }
>   EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 15/19] mm/compaction: do page isolation first in compaction
@ 2020-11-11 17:12     ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 17:12 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On 11/5/20 9:55 AM, Alex Shi wrote:
> Currently, compaction takes the lru_lock and then does page isolation,
> which works fine with pgdat->lru_lock, since any page isolation would
> compete for the lru_lock. If we want to change to the memcg lru_lock, we
> have to isolate the page before taking the lru_lock, so that isolation
> blocks the page's memcg change, which relies on page isolation too. Then we
> can safely use the per-memcg lru_lock later.
> 
> The new page isolation uses the previously introduced TestClearPageLRU() +
> pgdat lru locking, which will be changed to the memcg lru lock later.
> 
> Hugh Dickins <hughd@google.com> fixed the following bugs in this patch's
> early version:
> 
> Fix lots of crashes under compaction load: isolate_migratepages_block()
> must clean up appropriately when rejecting a page, setting PageLRU again
> if it had been cleared; and a put_page() after get_page_unless_zero()
> cannot safely be done while holding locked_lruvec - it may turn out to
> be the final put_page(), which will take an lruvec lock when PageLRU.
> And move __isolate_lru_page_prepare back after get_page_unless_zero to
> make trylock_page() safe:
> trylock_page() is not safe to use at this time: its setting PG_locked
> can race with the page being freed or allocated ("Bad page"), and can
> also erase flags being set by one of those "sole owners" of a freshly
> allocated page who use non-atomic __SetPageFlag().
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Vlastimil Babka <vbabka@suse.cz>

A question below:

> @@ -979,10 +995,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>   					goto isolate_abort;
>   			}
>   
> -			/* Recheck PageLRU and PageCompound under lock */
> -			if (!PageLRU(page))
> -				goto isolate_fail;
> -
>   			/*
>   			 * Page become compound since the non-locked check,
>   			 * and it's on LRU. It can only be a THP so the order
> @@ -990,16 +1002,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>   			 */
>   			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>   				low_pfn += compound_nr(page) - 1;
> -				goto isolate_fail;
> +				SetPageLRU(page);
> +				goto isolate_fail_put;
>   			}

IIUC the danger here is that khugepaged will collapse a THP. For that, 
__collapse_huge_page_isolate() has to succeed in isolate_lru_page(). Under the new 
scheme, it shouldn't be possible, right? If that's correct, we can remove this part?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-11-11 17:46     ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 17:46 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi

On 11/5/20 9:55 AM, Alex Shi wrote:
> This patch moves the per-node lru_lock into the lruvec, thus bringing a
> lru_lock for each memcg per node. So on a large machine, each memcg no
> longer has to suffer from per-node pgdat->lru_lock competition; they can
> go fast with their own lru_lock.
> 
> After moving the memcg charge before lru insertion, page isolation can
> serialize the page's memcg, so the per-memcg lruvec lock is stable and can
> replace the per-node lru lock.
> 
> In the function isolate_migratepages_block, compact_unlock_should_abort
> and lock_page_lruvec_irqsave are open coded to work with compact_control.
> Also add a debug function to the locking which may give some clues if
> something gets out of hand.
> 
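
For reference, a sketch of the intended usage of the lock_page_lruvec_irqsave()
helper named above (isolate_migratepages_block open codes the equivalent so it
can cooperate with compact_control):

	struct lruvec *lruvec;
	unsigned long flags;

	/* looks up the page's memcg lruvec and takes its lru_lock */
	lruvec = lock_page_lruvec_irqsave(page, &flags);

	/* ... add/delete/move the page on that lruvec's lru lists ... */

	unlock_page_lruvec_irqrestore(lruvec, flags);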
> Daniel Jordan's testing show 62% improvement on modified readtwice case
> on his 2P * 10 core * 2 HT broadwell box.
> https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
> 
> On a large machine with memcg enabled but not used, looking up the page's
> lruvec chases a few extra pointers, which may increase lru_lock holding
> time and cause a slight regression.
> 
> Hugh Dickins helped on the patch polish, thanks!
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Rong Chen <rong.a.chen@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: cgroups@vger.kernel.org

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-11-11 17:46     ` Vlastimil Babka
  (?)
@ 2020-11-11 17:59     ` Vlastimil Babka
  -1 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 17:59 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi

On 11/11/20 6:46 PM, Vlastimil Babka wrote:
> Acked-by: Vlastimil Babka<vbabka@suse.cz>

Err, not yet, that was supposed to be for patch 16/17

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 16/19] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
@ 2020-11-11 18:00     ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 18:00 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On 11/5/20 9:55 AM, Alex Shi wrote:
> Hugh Dickins found a memcg change bug in the original version:
> if we want to change the pgdat->lru_lock to the memcg's lruvec lock, we have
> to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The
> possible bad scenario would look like:
> 
> 	cpu 0					cpu 1
> lruvec = mem_cgroup_page_lruvec()
> 					if (!isolate_lru_page())
> 						mem_cgroup_move_account
> 
> spin_lock_irqsave(&lruvec->lru_lock <== wrong lock.
> 
> So we need TestClearPageLRU to block isolate_lru_page(), which serializes
> the memcg change, and then we remove the PageLRU check in the move_fn
> callees as a consequence.
> 
> __pagevec_lru_add_fn() is different from the others, because the pages
> it deals with are, by definition, not yet on the lru.  TestClearPageLRU
> is not needed and would not work, so __pagevec_lru_add() goes its own
> way.
> 
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/swap.c | 44 +++++++++++++++++++++++++++++++++++---------
>   1 file changed, 35 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 2681d9023998..1838a9535703 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -222,8 +222,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>   			spin_lock_irqsave(&pgdat->lru_lock, flags);
>   		}
>   
> +		/* block memcg migration during page moving between lru */
> +		if (!TestClearPageLRU(page))
> +			continue;
> +
>   		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>   		(*move_fn)(page, lruvec);
> +
> +		SetPageLRU(page);
>   	}
>   	if (pgdat)
>   		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> @@ -233,7 +239,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>   
>   static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
>   {
> -	if (PageLRU(page) && !PageUnevictable(page)) {
> +	if (!PageUnevictable(page)) {
>   		del_page_from_lru_list(page, lruvec, page_lru(page));
>   		ClearPageActive(page);
>   		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
> @@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page)
>   
>   static void __activate_page(struct page *page, struct lruvec *lruvec)
>   {
> -	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> +	if (!PageActive(page) && !PageUnevictable(page)) {
>   		int lru = page_lru_base_type(page);
>   		int nr_pages = thp_nr_pages(page);
>   
> @@ -362,7 +368,8 @@ static void activate_page(struct page *page)
>   
>   	page = compound_head(page);
>   	spin_lock_irq(&pgdat->lru_lock);
> -	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
> +	if (PageLRU(page))
> +		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
>   	spin_unlock_irq(&pgdat->lru_lock);
>   }
>   #endif
> @@ -519,9 +526,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>   	bool active;
>   	int nr_pages = thp_nr_pages(page);
>   
> -	if (!PageLRU(page))
> -		return;
> -
>   	if (PageUnevictable(page))
>   		return;
>   
> @@ -562,7 +566,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>   
>   static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>   {
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +	if (PageActive(page) && !PageUnevictable(page)) {
>   		int lru = page_lru_base_type(page);
>   		int nr_pages = thp_nr_pages(page);
>   
> @@ -579,7 +583,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>   
>   static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
>   {
> -	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	if (PageAnon(page) && PageSwapBacked(page) &&
>   	    !PageSwapCache(page) && !PageUnevictable(page)) {
>   		bool active = PageActive(page);
>   		int nr_pages = thp_nr_pages(page);
> @@ -1021,7 +1025,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>    */
>   void __pagevec_lru_add(struct pagevec *pvec)
>   {
> -	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
> +	int i;
> +	struct pglist_data *pgdat = NULL;
> +	struct lruvec *lruvec;
> +	unsigned long flags = 0;
> +
> +	for (i = 0; i < pagevec_count(pvec); i++) {
> +		struct page *page = pvec->pages[i];
> +		struct pglist_data *pagepgdat = page_pgdat(page);
> +
> +		if (pagepgdat != pgdat) {
> +			if (pgdat)
> +				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +			pgdat = pagepgdat;
> +			spin_lock_irqsave(&pgdat->lru_lock, flags);
> +		}
> +
> +		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +		__pagevec_lru_add_fn(page, lruvec);
> +	}
> +	if (pgdat)
> +		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +	release_pages(pvec->pages, pvec->nr);
> +	pagevec_reinit(pvec);
>   }
>   
>   /**
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 16/19] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
@ 2020-11-11 18:00     ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-11 18:00 UTC (permalink / raw)
  To: Alex Shi, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mgorman-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt,
	tj-DgEjT+Ai2ygdnm+yROfE0A, hughd-hpIqsD4AKlfQT0dZR+AlfA,
	khlebnikov-XoJtRXgx1JseBXzfvpsJ4g,
	daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA,
	willy-wEGCiKHe2LqWVfeAwA7xHQ, hannes-druUgvl0LCNAfugRpC6u6w,
	lkp-ral2JQCrhuEAvxtiuMwx3w, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, shakeelb-hpIqsD4AKlfQT0dZR+AlfA,
	iamjoonsoo.kim-Hm3cg6mZ9cc,
	richard.weiyang-Re5JQEeQqe8AvxtiuMwx3w,
	kirill-oKw7cIdHH8eLwutG50LtGA,
	alexander.duyck-Re5JQEeQqe8AvxtiuMwx3w,
	rong.a.chen-ral2JQCrhuEAvxtiuMwx3w, mhocko-IBi9RG/b67k,
	vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w,
	shy828301-Re5JQEeQqe8AvxtiuMwx3w

On 11/5/20 9:55 AM, Alex Shi wrote:
> Hugh Dickins found a memcg change bug in the original version:
> if we want to change the pgdat->lru_lock to the memcg's lruvec lock, we have
> to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The
> possible bad scenario would look like:
> 
> 	cpu 0					cpu 1
> lruvec = mem_cgroup_page_lruvec()
> 					if (!isolate_lru_page())
> 						mem_cgroup_move_account
> 
> spin_lock_irqsave(&lruvec->lru_lock <== wrong lock.
> 
> So we need TestClearPageLRU to block isolate_lru_page(), which serializes
> the memcg change, and then we remove the PageLRU check in the move_fn
> callees as a consequence.
> 
> __pagevec_lru_add_fn() is different from the others, because the pages
> it deals with are, by definition, not yet on the lru.  TestClearPageLRU
> is not needed and would not work, so __pagevec_lru_add() goes its own
> way.
> 
> Reported-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Alex Shi <alex.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>
> Acked-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Acked-by: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>

> ---
>   mm/swap.c | 44 +++++++++++++++++++++++++++++++++++---------
>   1 file changed, 35 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 2681d9023998..1838a9535703 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -222,8 +222,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>   			spin_lock_irqsave(&pgdat->lru_lock, flags);
>   		}
>   
> +		/* block memcg migration during page moving between lru */
> +		if (!TestClearPageLRU(page))
> +			continue;
> +
>   		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>   		(*move_fn)(page, lruvec);
> +
> +		SetPageLRU(page);
>   	}
>   	if (pgdat)
>   		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> @@ -233,7 +239,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>   
>   static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
>   {
> -	if (PageLRU(page) && !PageUnevictable(page)) {
> +	if (!PageUnevictable(page)) {
>   		del_page_from_lru_list(page, lruvec, page_lru(page));
>   		ClearPageActive(page);
>   		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
> @@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page)
>   
>   static void __activate_page(struct page *page, struct lruvec *lruvec)
>   {
> -	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> +	if (!PageActive(page) && !PageUnevictable(page)) {
>   		int lru = page_lru_base_type(page);
>   		int nr_pages = thp_nr_pages(page);
>   
> @@ -362,7 +368,8 @@ static void activate_page(struct page *page)
>   
>   	page = compound_head(page);
>   	spin_lock_irq(&pgdat->lru_lock);
> -	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
> +	if (PageLRU(page))
> +		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
>   	spin_unlock_irq(&pgdat->lru_lock);
>   }
>   #endif
> @@ -519,9 +526,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>   	bool active;
>   	int nr_pages = thp_nr_pages(page);
>   
> -	if (!PageLRU(page))
> -		return;
> -
>   	if (PageUnevictable(page))
>   		return;
>   
> @@ -562,7 +566,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>   
>   static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>   {
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +	if (PageActive(page) && !PageUnevictable(page)) {
>   		int lru = page_lru_base_type(page);
>   		int nr_pages = thp_nr_pages(page);
>   
> @@ -579,7 +583,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>   
>   static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
>   {
> -	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	if (PageAnon(page) && PageSwapBacked(page) &&
>   	    !PageSwapCache(page) && !PageUnevictable(page)) {
>   		bool active = PageActive(page);
>   		int nr_pages = thp_nr_pages(page);
> @@ -1021,7 +1025,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>    */
>   void __pagevec_lru_add(struct pagevec *pvec)
>   {
> -	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
> +	int i;
> +	struct pglist_data *pgdat = NULL;
> +	struct lruvec *lruvec;
> +	unsigned long flags = 0;
> +
> +	for (i = 0; i < pagevec_count(pvec); i++) {
> +		struct page *page = pvec->pages[i];
> +		struct pglist_data *pagepgdat = page_pgdat(page);
> +
> +		if (pagepgdat != pgdat) {
> +			if (pgdat)
> +				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +			pgdat = pagepgdat;
> +			spin_lock_irqsave(&pgdat->lru_lock, flags);
> +		}
> +
> +		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +		__pagevec_lru_add_fn(page, lruvec);
> +	}
> +	if (pgdat)
> +		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +	release_pages(pvec->pages, pvec->nr);
> +	pagevec_reinit(pvec);
>   }
>   
>   /**
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 14/19] mm/lru: introduce TestClearPageLRU
  2020-11-11 13:36   ` Vlastimil Babka
@ 2020-11-12  2:03       ` Hugh Dickins
  0 siblings, 0 replies; 111+ messages in thread
From: Hugh Dickins @ 2020-11-12  2:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Michal Hocko

On Wed, 11 Nov 2020, Vlastimil Babka wrote:
> On 11/5/20 9:55 AM, Alex Shi wrote:
> 
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1542,7 +1542,7 @@ unsigned int reclaim_clean_pages_from_list(struct
> > zone *zone,
> >    */
> >   int __isolate_lru_page(struct page *page, isolate_mode_t mode)
> >   {
> > -	int ret = -EINVAL;
> > +	int ret = -EBUSY;
> >     	/* Only take pages on the LRU. */
> >   	if (!PageLRU(page))
> > @@ -1552,8 +1552,6 @@ int __isolate_lru_page(struct page *page,
> > isolate_mode_t mode)
> >   	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
> >   		return ret;
> >   -	ret = -EBUSY;
> 
> I'm not sure why this change is here, looks unrelated to the patch?
> 
> Oh I see, you want to prevent the BUG() in isolate_lru_pages().

Yes, I suggested this part of the patch to Alex, when I hit that BUG().

> 
> But due to that, the PageUnevictable check was also affected unintentionally.
> But I don't think it's that important to BUG() when we run into
> PageUnevictable unexpectedly, so that's probably ok.

Not unintentional.  __isolate_lru_page(), or __isolate_lru_page_prepare(),
is a silly function, used by two callers whose requirements are almost
entirely disjoint.  The ISOLATE_UNEVICTABLE case is only for compaction.c,
which takes no interest in -EINVAL versus -EBUSY, and has no such BUG().

I think it dates back to lumpy reclaim days, and it probably made more
sense back then.

> 
> But with that, we can just make __isolate_lru_page() a bool function and
> remove the ugly switch in  isolate_lru_pages()?

I agree that the switch statement in isolate_lru_pages() seems pointless
now, and can be turned into an if{}else{}.  But that cleanup is a
diversion from this particular TestClearPageLRU patch, and I think from
the whole series (checking final state of the patchset, yes, the switch
is still there - though I think there have been variant series which
removed it).
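
For illustration only (a sketch, not a patch from this series), with
__isolate_lru_page() reduced to a bool the switch could become something
like this, reusing the variable names of the then-current
isolate_lru_pages():

	if (__isolate_lru_page(page, mode)) {
		/* taken: move it to the list we are filling */
		nr_taken += nr_pages;
		nr_zone_taken[page_zonenum(page)] += nr_pages;
		list_move(&page->lru, dst);
	} else {
		/* busy, or no longer on the LRU: put it back where it was */
		list_move(&page->lru, src);
	}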

Can we please leave that cleanup until after the series has gone in?

I think several of us have cleanups or optimizations that we want to
follow (I had one that inlines what isolate_migratepages_block() wanted
of __isolate_lru_page() into that function, so simplifying what vmscan.c
needs; perhaps that can now eliminate it completely, I've not tried
recently).  But there was a point at which the series was growing
ten patches per release as we all added our bits and pieces on top,
it got harder and harder to review the whole, and further from
getting the basics in: I do push back against that tendency.

Hugh

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 15/19] mm/compaction: do page isolation first in compaction
  2020-11-11 17:12     ` Vlastimil Babka
@ 2020-11-12  2:28       ` Hugh Dickins
  -1 siblings, 0 replies; 111+ messages in thread
From: Hugh Dickins @ 2020-11-12  2:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Wed, 11 Nov 2020, Vlastimil Babka wrote:
> On 11/5/20 9:55 AM, Alex Shi wrote:
> 
> > @@ -979,10 +995,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >   					goto isolate_abort;
> >   			}
> >   -			/* Recheck PageLRU and PageCompound under lock */
> > -			if (!PageLRU(page))
> > -				goto isolate_fail;
> > -
> >   			/*
> >   			 * Page become compound since the non-locked check,
> >   			 * and it's on LRU. It can only be a THP so the order
> > @@ -990,16 +1002,13 @@ static bool too_many_isolated(pg_data_t *pgdat)

Completely off-topic, and won't matter at all when Andrew rediffs into
mmotm: but isn't it weird that this is showing "too_many_isolated(",
when actually the function is isolate_migratepages_block()?

> >   			 */
> >   			if (unlikely(PageCompound(page) &&
> > !cc->alloc_contig)) {
> >   				low_pfn += compound_nr(page) - 1;
> > -				goto isolate_fail;
> > +				SetPageLRU(page);
> > +				goto isolate_fail_put;
> >   			}
> 
> IIUC the danger here is khugepaged will collapse a THP. For that,
> __collapse_huge_page_isolate() has to succeed isolate_lru_page(). Under the
> new scheme, it shouldn't be possible, right? If that's correct, we can remove
> this part?

I don't think so.  A preliminary check for PageCompound was made much
higher up, before taking a reference on the page, but it can easily have
become PageCompound since then (when racing prep_new_page() calls
prep_compound_page()).

And __collapse_huge_page_isolate() does not turn a non-compound page
into a compound page: it isolates small pages before copying them into
the compound page (in the usual case: I can see there's also allowance
for PageCompound there too, which will do something different).

Hugh

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 15/19] mm/compaction: do page isolation first in compaction
  2020-11-12  2:28       ` Hugh Dickins
@ 2020-11-12  3:35         ` Alex Shi
  -1 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-12  3:35 UTC (permalink / raw)
  To: Hugh Dickins, Vlastimil Babka
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/11/12 10:28 AM, Hugh Dickins wrote:
>>>   			 * Page become compound since the non-locked check,
>>>   			 * and it's on LRU. It can only be a THP so the order
>>> @@ -990,16 +1002,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
> Completely off-topic, and won't matter at all when Andrew rediffs into
> mmotm: but isn't it weird that this is showing "too_many_isolated(",
> when actually the function is isolate_migratepages_block()?
> 

My git version is too old for this. Thanks for the reminder; the latest git
gets the function name right.

Thanks

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 14/19] mm/lru: introduce TestClearPageLRU
  2020-11-12  2:03       ` Hugh Dickins
  (?)
@ 2020-11-12 11:24       ` Vlastimil Babka
  -1 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-12 11:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alex Shi, akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Michal Hocko

On 11/12/20 3:03 AM, Hugh Dickins wrote:
> On Wed, 11 Nov 2020, Vlastimil Babka wrote:
>> On 11/5/20 9:55 AM, Alex Shi wrote:
>> 
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1542,7 +1542,7 @@ unsigned int reclaim_clean_pages_from_list(struct
>> > zone *zone,
>> >    */
>> >   int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>> >   {
>> > -	int ret = -EINVAL;
>> > +	int ret = -EBUSY;
>> >     	/* Only take pages on the LRU. */
>> >   	if (!PageLRU(page))
>> > @@ -1552,8 +1552,6 @@ int __isolate_lru_page(struct page *page,
>> > isolate_mode_t mode)
>> >   	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>> >   		return ret;
>> >   -	ret = -EBUSY;
>> 
>> I'm not sure why this change is here, looks unrelated to the patch?
>> 
>> Oh I see, you want to prevent the BUG() in isolate_lru_pages().
> 
> Yes, I suggested this part of the patch to Alex, when I hit that BUG().
> 
>> 
>> But due to that, the PageUnevictable check was also affected unintentionally.
>> But I don't think it's that important to BUG() when we run into
>> PageUnevictable unexpectedly, so that's probably ok.
> 
> Not unintentional.  __isolate_lru_page(), or __isolate_lru_page_prepare(),
> is a silly function, used by two callers whose requirements are almost
> entirely disjoint.  The ISOLATE_UNEVICTABLE case is only for compaction.c,
> which takes no interest in -EINVAL versus -EBUSY, and has no such BUG().
> 
> I think it dates back to lumpy reclaim days, and it probably made more
> sense back then.

Ah, thanks for explaining.


>> 
>> But with that, we can just make __isolate_lru_page() a bool function and
>> remove the ugly switch in  isolate_lru_pages()?
> 
> I agree that the switch statement in isolate_lru_pages() seems pointless
> now, and can be turned into an if{}else{}.  But that cleanup is a
> diversion from this particular TestClearPageLRU patch, and I think from
> the whole series (checking final state of the patchset, yes, the switch
> is still there - though I think there have been variant series which
> removed it).
> 
> Can we please leave that cleanup until after the series has gone in?

Sure thing!

The patch seems functionally fine, so

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> I think several of us have cleanups or optimization that we want to
> follow (I had one that inlines what isolate_migratepages_block() wanted
> of __isolate_lru_page() into that function, so simplifying what vmscan.c
> needs; perhaps that can now eliminate it completely, I've not tried
> recently).  But there was a point at which the series was growing
> ten patches per release as we all added our bits and pieces on top,
> it got harder and harder to review the whole, and further from
> getting the basics in: I do push back against that tendency.
> 
> Hugh
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 15/19] mm/compaction: do page isolation first in compaction
@ 2020-11-12 11:25         ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-12 11:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alex Shi, akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On 11/12/20 3:28 AM, Hugh Dickins wrote:
> On Wed, 11 Nov 2020, Vlastimil Babka wrote:
>> On 11/5/20 9:55 AM, Alex Shi wrote:
>> 
>> > @@ -979,10 +995,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>> >   					goto isolate_abort;
>> >   			}
>> >   -			/* Recheck PageLRU and PageCompound under lock */
>> > -			if (!PageLRU(page))
>> > -				goto isolate_fail;
>> > -
>> >   			/*
>> >   			 * Page become compound since the non-locked check,
>> >   			 * and it's on LRU. It can only be a THP so the order
>> > @@ -990,16 +1002,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
> 
> Completely off-topic, and won't matter at all when Andrew rediffs into
> mmotm: but isn't it weird that this is showing "too_many_isolated(",
> when actually the function is isolate_migratepages_block()?
> 
>> >   			 */
>> >   			if (unlikely(PageCompound(page) &&
>> > !cc->alloc_contig)) {
>> >   				low_pfn += compound_nr(page) - 1;
>> > -				goto isolate_fail;
>> > +				SetPageLRU(page);
>> > +				goto isolate_fail_put;
>> >   			}
>> 
>> IIUC the danger here is khugepaged will collapse a THP. For that,
>> __collapse_huge_page_isolate() has to succeed isolate_lru_page(). Under the
>> new scheme, it shouldn't be possible, right? If that's correct, we can remove
>> this part?
> 
> I don't think so.  A preliminary check for PageCompound was made much
> higher up, before taking a reference on the page, but it can easily have
> become PageCompound since then (when racing prep_new_page() calls
> prep_compound_page()).
> 
> And __collapse_huge_page_isolate() does not turn a non-compound page
> into a compound page: it isolates small pages before copying them into
> the compound page (in the usual case: I can see there's also allowance
> for PageCompound there too, which will do something different).

Right, on both points, got too confused.

> Hugh
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-11-12 12:19     ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-12 12:19 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi

On 11/5/20 9:55 AM, Alex Shi wrote:
> This patch moves per node lru_lock into lruvec, thus bring a lru_lock for
> each of memcg per node. So on a large machine, each of memcg don't
> have to suffer from per node pgdat->lru_lock competition. They could go
> fast with their self lru_lock.
> 
> After move memcg charge before lru inserting, page isolation could
> serialize page's memcg, then per memcg lruvec lock is stable and could
> replace per node lru lock.
> 
> In func isolate_migratepages_block, compact_unlock_should_abort and
> lock_page_lruvec_irqsave are open coded to work with compact_control.
> Also add a debug func in locking which may give some clues if there are
> sth out of hands.
> 
> Daniel Jordan's testing show 62% improvement on modified readtwice case
> on his 2P * 10 core * 2 HT broadwell box.
> https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
> 
> On a large machine with memcg enabled but not used, the page's lruvec
> seeking pass a few pointers, that may lead to lru_lock holding time
> increase and a bit regression.
> 
> Hugh Dickins helped on the patch polish, thanks!
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Rong Chen <rong.a.chen@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: cgroups@vger.kernel.org

I think I need some explanation about the rcu_read_lock() usage in 
lock_page_lruvec*() (and the places effectively open-coding it).
Preferably in the form of a code comment, but that can also be added as an
additional patch later; I don't want to block the series.

mem_cgroup_page_lruvec() comment says

  * This function relies on page->mem_cgroup being stable - see the
  * access rules in commit_charge().

commit_charge() comment:

          * Any of the following ensures page->mem_cgroup stability:
          *
          * - the page lock
          * - LRU isolation
          * - lock_page_memcg()
          * - exclusive reference

"LRU isolation" used to be quite clear, but now is it after 
TestClearPageLRU(page) or after deleting from the lru list as well?
Also it doesn't mention rcu_read_lock(), should it?

So what exactly are we protecting by rcu_read_lock() in e.g. lock_page_lruvec()?

         rcu_read_lock();
         lruvec = mem_cgroup_page_lruvec(page, pgdat);
         spin_lock(&lruvec->lru_lock);
         rcu_read_unlock();

Looks like we are protecting the lruvec from going away and it can't go away 
anymore after we take the lru_lock?

But then e.g. in __munlock_pagevec() we are doing this without an rcu_read_lock():

	new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

where new_lruvec is potentially not the one that we have locked

And the last thing mem_cgroup_page_lruvec() is doing is:

         if (unlikely(lruvec->pgdat != pgdat))
                 lruvec->pgdat = pgdat;
         return lruvec;

So without the rcu_read_lock(), is this potentially accessing the pgdat field 
of a lruvec that might have just gone away?

Thanks,
Vlastimil

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 18/19] mm/lru: introduce the relock_page_lruvec function
@ 2020-11-12 12:31     ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-12 12:31 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Thomas Gleixner, Andrey Ryabinin

On 11/5/20 9:55 AM, Alex Shi wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Use this new function to replace repeated same code, no func change.
> 
> When testing for relock we can avoid the need for RCU locking if we simply
> compare the page pgdat and memcg pointers versus those that the lruvec is
> holding. By doing this we can avoid the extra pointer walks and accesses of
> the memory cgroup.

Ah, clever. Seems to address my worry from the previous patch (except for the 
potential to improve the documenting comments).
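
Roughly, the idea is the sketch below (helper names as I recall them from
this series, so treat it as an approximation rather than the exact hunk):

	static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
			struct lruvec *locked_lruvec, unsigned long *flags)
	{
		if (locked_lruvec) {
			/* same memcg and same node: the lock we already hold
			 * is the right one, no RCU lookup needed */
			if (lruvec_holds_page_lru_lock(page, locked_lruvec))
				return locked_lruvec;

			unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
		}

		return lock_page_lruvec_irqsave(page, flags);
	}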

> In addition we can avoid the checks entirely if lruvec is currently NULL.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 19/19] mm/lru: revise the comments of lru_lock
@ 2020-11-12 12:37     ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2020-11-12 12:37 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Andrey Ryabinin, Jann Horn

On 11/5/20 9:55 AM, Alex Shi wrote:
> From: Hugh Dickins <hughd@google.com>
> 
> Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to
> fix the incorrect comments in code. Also fixed some zone->lru_lock comment
> error from ancient time. etc.
> 
> I struggled to understand the comment above move_pages_to_lru() (surely
> it never calls page_referenced()), and eventually realized that most of
> it had got separated from shrink_active_list(): move that comment back.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-11-12 14:19       ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-12 14:19 UTC (permalink / raw)
  To: Vlastimil Babka, akpm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, willy, hannes, lkp, linux-mm, linux-kernel,
	cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi



On 2020/11/12 8:19 PM, Vlastimil Babka wrote:
> On 11/5/20 9:55 AM, Alex Shi wrote:
>> This patch moves per node lru_lock into lruvec, thus bring a lru_lock for
>> each of memcg per node. So on a large machine, each of memcg don't
>> have to suffer from per node pgdat->lru_lock competition. They could go
>> fast with their self lru_lock.
>>
>> After move memcg charge before lru inserting, page isolation could
>> serialize page's memcg, then per memcg lruvec lock is stable and could
>> replace per node lru lock.
>>
>> In func isolate_migratepages_block, compact_unlock_should_abort and
>> lock_page_lruvec_irqsave are open coded to work with compact_control.
>> Also add a debug func in locking which may give some clues if there are
>> sth out of hands.
>>
>> Daniel Jordan's testing show 62% improvement on modified readtwice case
>> on his 2P * 10 core * 2 HT broadwell box.
>> https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
>>
>> On a large machine with memcg enabled but not used, the page's lruvec
>> seeking pass a few pointers, that may lead to lru_lock holding time
>> increase and a bit regression.
>>
>> Hugh Dickins helped on the patch polish, thanks!
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Acked-by: Hugh Dickins <hughd@google.com>
>> Cc: Rong Chen <rong.a.chen@intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Cc: Yang Shi <yang.shi@linux.alibaba.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> Cc: cgroups@vger.kernel.org
> 
> I think I need some explanation about the rcu_read_lock() usage in lock_page_lruvec*() (and places effectively opencoding it).
> Preferably in form of some code comment, but that can be also added as a additional patch later, I don't want to block the series.
> 

Hi Vlastimil, 

Thanks for the comments!

Oh, we did talk about the rcu_read_lock(): it is used to block memcg destruction
during locking, and the spin_lock actually includes an rcu_read_lock(). Yes, we
could add such a comment later.
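
One possible wording, just to illustrate what such a comment could say (this
only paraphrases the explanation above, it is not text from the series):

	struct lruvec *lock_page_lruvec(struct page *page)
	{
		struct lruvec *lruvec;
		struct pglist_data *pgdat = page_pgdat(page);

		/*
		 * rcu_read_lock() only needs to cover the lruvec lookup: it
		 * keeps a racing memcg destruction from freeing the lruvec
		 * until we have taken its lru_lock.
		 */
		rcu_read_lock();
		lruvec = mem_cgroup_page_lruvec(page, pgdat);
		spin_lock(&lruvec->lru_lock);
		rcu_read_unlock();

		return lruvec;
	}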

> mem_cgroup_page_lruvec() comment says
> 
>  * This function relies on page->mem_cgroup being stable - see the
>  * access rules in commit_charge().
> 
> commit_charge() comment:
> 
>          * Any of the following ensures page->mem_cgroup stability:
>          *
>          * - the page lock
>          * - LRU isolation
>          * - lock_page_memcg()
>          * - exclusive reference
> 
> "LRU isolation" used to be quite clear, but now is it after TestClearPageLRU(page) or after deleting from the lru list as well?
> Also it doesn't mention rcu_read_lock(), should it?

LRU isolation is still the same concept as before: the set of actions that take a
page off an lru list; and commit_charge does need the page to be isolated.

But the conditions for page_memcg stability could change, since we don't rely on
lru isolation for it. The comments could be updated later.
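
For example, the commit_charge() comment quoted above might later be adjusted
to something like this (only a suggested wording, not from the series):

	 * Any of the following ensures page->mem_cgroup stability:
	 *
	 * - the page lock
	 * - LRU isolation (i.e. TestClearPageLRU succeeded for the page)
	 * - lock_page_memcg()
	 * - exclusive reference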

> 
> So what exactly are we protecting by rcu_read_lock() in e.g. lock_page_lruvec()?
> 
>         rcu_read_lock();
>         lruvec = mem_cgroup_page_lruvec(page, pgdat);
>         spin_lock(&lruvec->lru_lock);
>         rcu_read_unlock();
> 
> Looks like we are protecting the lruvec from going away and it can't go away anymore after we take the lru_lock?
> 
> But then e.g. in __munlock_pagevec() we are doing this without an rcu_read_lock():
> 
>     new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

TestClearPageLRU blocks memcg migration/destruction for the page.
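
That is, the isolation pattern looks roughly like the sketch below (a simplified
illustration, not a specific hunk from the series): once the lru bit is cleared,
isolate_lru_page() fails, so mem_cgroup_move_account() cannot race with us and
page->mem_cgroup stays stable while we take the lruvec lock.

	struct lruvec *lruvec;

	if (TestClearPageLRU(page)) {
		/* lru bit cleared: page->mem_cgroup cannot change under us */
		lruvec = lock_page_lruvec_irq(page);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		unlock_page_lruvec_irq(lruvec);
	}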

Thanks
Alex

> 
> where new_lruvec is potentionally not the one that we have locked
> 
> And the last thing mem_cgroup_page_lruvec() is doing is:
> 
>         if (unlikely(lruvec->pgdat != pgdat))
>                 lruvec->pgdat = pgdat;
>         return lruvec;
> 
> So without the rcu_read_lock() is this potentionally accessing the pgdat field of lruvec that might have just gone away?
> 
> Thanks,
> Vlastimil

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
@ 2020-11-16  3:45   ` Alex Shi
  0 siblings, 0 replies; 111+ messages in thread
From: Alex Shi @ 2020-11-16  3:45 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Hi Andrew,

With all patches acked by Hugh and Johannes, and full testing from LKP,
is this patchset ready for more testing in linux-next? Or does anything still
need to be improved?

Thanks
Alex


On 2020/11/5 4:55 PM, Alex Shi wrote:
> This version rebase on next/master 20201104, with much of Johannes's
> Acks and some changes according to Johannes comments. And add a new patch
> v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> v21-0007.
> 
> This patchset followed 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
> added to -mm tree yesterday.
>  
> Many thanks for line by line review by Hugh Dickins, Alexander Duyck and
> Johannes Weiner.
> 
> So now this patchset includes 3 parts:
> 1, some code cleanup and minimum optimization as a preparation. 
> 2, use TestCleanPageLRU as page isolation's precondition.
> 3, replace per node lru_lock with per memcg per node lru_lock.
> 
> Current lru_lock is one for each of node, pgdat->lru_lock, that guard for
> lru lists, but now we had moved the lru lists into memcg for long time. Still
> using per node lru_lock is clearly unscalable, pages on each of memcgs have
> to compete each others for a whole lru_lock. This patchset try to use per
> lruvec/memcg lru_lock to repleace per node lru lock to guard lru lists, make
> it scalable for memcgs and get performance gain.
> 
> Currently lru_lock still guards both lru list and page's lru bit, that's ok.
> but if we want to use specific lruvec lock on the page, we need to pin down
> the page's lruvec/memcg during locking. Just taking lruvec lock first may be
> undermined by the page's memcg charge/migration. To fix this problem, we could
> take out the page's lru bit clear and use it as pin down action to block the
> memcg changes. That's the reason for new atomic func TestClearPageLRU.
> So now isolating a page need both actions: TestClearPageLRU and hold the
> lru_lock.
> 
> The typical usage of this is isolate_migratepages_block() in compaction.c
> we have to take lru bit before lru lock, that serialized the page isolation
> in memcg page charge/migration which will change page's lruvec and new 
> lru_lock in it.
> 
> The above solution suggested by Johannes Weiner, and based on his new memcg 
> charge path, then have this patchset. (Hugh Dickins tested and contributed much
> code from compaction fix to general code polish, thanks a lot!).
> 
> Daniel Jordan's testing show 62% improvement on modified readtwice case
> on his 2P * 10 core * 2 HT broadwell box on v18, which has no much different
> with this v20.
> https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
> 
> Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought this
> idea 8 years ago, and others who give comments as well: Daniel Jordan, 
> Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc.
> 
> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
@ 2020-12-15  0:47   ` Andrew Morton
  0 siblings, 0 replies; 111+ messages in thread
From: Andrew Morton @ 2020-12-15  0:47 UTC (permalink / raw)
  To: Alex Shi
  Cc: mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Thu,  5 Nov 2020 16:55:30 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:

> This version rebase on next/master 20201104, with much of Johannes's
> Acks and some changes according to Johannes comments. And add a new patch
> v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> v21-0007.

I assume the consensus on this series is 'not yet"?

Also, did
https://lkml.kernel.org/r/0000000000000340a105b49441d3@google.com get
resolved?


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
  2020-12-15  0:47   ` Andrew Morton
  (?)
@ 2020-12-15  2:16     ` Hugh Dickins
  -1 siblings, 0 replies; 111+ messages in thread
From: Hugh Dickins @ 2020-12-15  2:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alex Shi, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Mon, 14 Dec 2020, Andrew Morton wrote:
> On Thu,  5 Nov 2020 16:55:30 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> 
> > This version rebase on next/master 20201104, with much of Johannes's
> > Acks and some changes according to Johannes comments. And add a new patch
> > v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> > v21-0007.
> 
> I assume the consensus on this series is 'not yet"?

Speaking for my part in the consensus: I don't share that assumption,
the series by now is well-baked and well reviewed by enough people over
more than enough versions, has been completely untroublesome since it
entered mmotm/linux-next a month ago, not even any performance bleats
from 0day, and has nothing to gain from any further delay.

I think it was my fault that v20 didn't get into 5.10: I'd said "not yet"
when you first tried a part of v19 or earlier in mmotm, and by the time
I'd completed review it was too late in the cycle; Johannes and Vlastimil
have gone over it since then, and I'd be glad to see it go ahead into
5.11 very soon. Silence on v21 meaning that it's good.

Various of us have improvements or cleanups in mind or in private tree,
but nothing to hold back what's already there.

> 
> Also, did
> https://lkml.kernel.org/r/0000000000000340a105b49441d3@google.com get
> resolved?

Alex found enough precedents for that, before inclusion of his series,
so it should not discourage us from moving his series forward.  I have
ignored that syzreport until now, but will take a quick try at the
repro now, to see if I'm inspired - probably not, but we'll see.

Hugh

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
@ 2020-12-15  2:28       ` Andrew Morton
  0 siblings, 0 replies; 111+ messages in thread
From: Andrew Morton @ 2020-12-15  2:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alex Shi, mgorman, tj, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Mon, 14 Dec 2020 18:16:34 -0800 (PST) Hugh Dickins <hughd@google.com> wrote:

> On Mon, 14 Dec 2020, Andrew Morton wrote:
> > On Thu,  5 Nov 2020 16:55:30 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> > 
> > > This version rebase on next/master 20201104, with much of Johannes's
> > > Acks and some changes according to Johannes comments. And add a new patch
> > > v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> > > v21-0007.
> > 
> > I assume the consensus on this series is 'not yet"?
> 
> Speaking for my part in the consensus: I don't share that assumption,

OK, thanks, I'll include it in patch-bomb #2, Tues or Weds.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
  2020-11-05  8:55 ` Alex Shi
  (?)
@ 2021-01-05 19:30   ` Qian Cai
  -1 siblings, 0 replies; 111+ messages in thread
From: Qian Cai @ 2021-01-05 19:30 UTC (permalink / raw)
  To: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Thu, 2020-11-05 at 16:55 +0800, Alex Shi wrote:
> This version rebase on next/master 20201104, with much of Johannes's
> Acks and some changes according to Johannes comments. And add a new patch
> v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> v21-0007.
> 
> This patchset followed 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
> added to -mm tree yesterday.
>  
> Many thanks for line by line review by Hugh Dickins, Alexander Duyck and
> Johannes Weiner.

Given the troublesome history of this patchset, the fact that it only went into
linux-next recently, and that it touches both THP and mlock, is it reasonable to
suspect this patchset of introducing some races, and this spontaneous crash with
what I presume is some mlocked memory?

[10392.154328][T23803] huge_memory: total_mapcount: 5, page_count(): 6
[10392.154835][T23803] page:00000000eb7725ad refcount:6 mapcount:0 mapping:0000000000000000 index:0x7fff72a0 pfn:0x20023760
[10392.154865][T23803] head:00000000eb7725ad order:5 compound_mapcount:0 compound_pincount:0
[10392.154889][T23803] anon flags: 0x87fff800009000d(locked|uptodate|dirty|head|swapbacked)
[10392.154908][T23803] raw: 087fff800009000d 5deadbeef0000100 5deadbeef0000122 c0002016ff5e0849
[10392.154933][T23803] raw: 000000007fff72a0 0000000000000000 00000006ffffffff c0002014eb676000
[10392.154965][T23803] page dumped because: total_mapcount(head) > 0
[10392.154987][T23803] pages's memcg:c0002014eb676000
[10392.155023][T23803] ------------[ cut here ]------------
[10392.155042][T23803] kernel BUG at mm/huge_memory.c:2767!
[10392.155064][T23803] Oops: Exception in kernel mode, sig: 5 [#1]
[10392.155084][T23803] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 NUMA PowerNV
[10392.155114][T23803] Modules linked in: loop kvm_hv kvm ip_tables x_tables sd_mod bnx2x ahci tg3 libahci mdio libphy libata firmware_class dm_mirror dm_region_hash dm_log dm_mod
[10392.155185][T23803] CPU: 44 PID: 23803 Comm: ranbug Not tainted 5.11.0-rc2-next-20210105 #2
[10392.155217][T23803] NIP:  c0000000003b5218 LR: c0000000003b5214 CTR: 0000000000000000
[10392.155247][T23803] REGS: c00000001a8d6ee0 TRAP: 0700   Not tainted  (5.11.0-rc2-next-20210105)
[10392.155279][T23803] MSR:  9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 28422222  XER: 00000000
[10392.155314][T23803] CFAR: c0000000003135ac IRQMASK: 1 
[10392.155314][T23803] GPR00: c0000000003b5214 c00000001a8d7180 c000000007f70b00 000000000000001e 
[10392.155314][T23803] GPR04: c000000000eacd38 0000000000000004 0000000000000027 c000001ffe8a7218 
[10392.155314][T23803] GPR08: 0000000000000023 0000000000000000 0000000000000000 c000000007eacfc8 
[10392.155314][T23803] GPR12: 0000000000002000 c000001ffffcda00 0000000000000000 0000000000000001 
[10392.155314][T23803] GPR16: c00c0008008dd808 0000000000040000 0000000000000000 0000000000000020 
[10392.155314][T23803] GPR20: c00c0008008dd800 0000000000000020 0000000000000006 0000000000000001 
[10392.155314][T23803] GPR24: 0000000000000005 ffffffffffffffff c0002016ff5e0848 0000000000000000 
[10392.155314][T23803] GPR28: c0002014eb676e60 c00c0008008dd800 c00000001a8d73a8 c00c0008008dd800 
[10392.155533][T23803] NIP [c0000000003b5218] split_huge_page_to_list+0xa38/0xa40
[10392.155558][T23803] LR [c0000000003b5214] split_huge_page_to_list+0xa34/0xa40
[10392.155579][T23803] Call Trace:
[10392.155595][T23803] [c00000001a8d7180] [c0000000003b5214] split_huge_page_to_list+0xa34/0xa40 (unreliable)
[10392.155630][T23803] [c00000001a8d7270] [c0000000002dd378] shrink_page_list+0x1568/0x1b00
shrink_page_list at mm/vmscan.c:1251 (discriminator 1)
[10392.155655][T23803] [c00000001a8d7380] [c0000000002df798] shrink_inactive_list+0x228/0x5e0
[10392.155678][T23803] [c00000001a8d7450] [c0000000002e0858] shrink_lruvec+0x2b8/0x6f0
shrink_lruvec at mm/vmscan.c:2462
[10392.155710][T23803] [c00000001a8d7590] [c0000000002e0fd8] shrink_node+0x348/0x970
[10392.155742][T23803] [c00000001a8d7660] [c0000000002e1728] do_try_to_free_pages+0x128/0x560
[10392.155765][T23803] [c00000001a8d7710] [c0000000002e3b78] try_to_free_pages+0x198/0x500
[10392.155780][T23803] [c00000001a8d77e0] [c000000000356f5c] __alloc_pages_slowpath.constprop.112+0x64c/0x1380
[10392.155795][T23803] [c00000001a8d79c0] [c000000000358170] __alloc_pages_nodemask+0x4e0/0x590
[10392.155830][T23803] [c00000001a8d7a50] [c000000000381fb8] alloc_pages_vma+0xb8/0x340
[10392.155854][T23803] [c00000001a8d7ac0] [c000000000324fe8] handle_mm_fault+0xf38/0x1bd0
[10392.155887][T23803] [c00000001a8d7ba0] [c000000000316cd4] __get_user_pages+0x434/0x7d0
[10392.155920][T23803] [c00000001a8d7cb0] [c0000000003197d0] __mm_populate+0xe0/0x290
__mm_populate at mm/gup.c:1459
[10392.155952][T23803] [c00000001a8d7d20] [c00000000032d5a0] do_mlock+0x180/0x360
do_mlock at mm/mlock.c:688
[10392.155975][T23803] [c00000001a8d7d90] [c00000000032d954] sys_mlock+0x24/0x40
[10392.155999][T23803] [c00000001a8d7db0] [c00000000002f510] system_call_exception+0x170/0x280
[10392.156032][T23803] [c00000001a8d7e10] [c00000000000d7c8] system_call_common+0xe8/0x218
[10392.156065][T23803] Instruction dump:
[10392.156082][T23803] e93d0008 71290001 41820014 7fe3fb78 38800000 4bf5e36d 60000000 3c82f8b7 
[10392.156121][T23803] 7fa3eb78 38846b70 4bf5e359 60000000 <0fe00000> 60000000 3c4c07bc 3842b8e0 
[10392.156160][T23803] ---[ end trace 2e3423677d4f91f3 ]---
[10392.312793][T23803] 
[10393.312808][T23803] Kernel panic - not syncing: Fatal exception
[10394.723608][T23803] ---[ end Kernel panic - not syncing: Fatal exception ]---
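
For anyone decoding the oops: the "kernel BUG at mm/huge_memory.c:2767" is, roughly,
the CONFIG_DEBUG_VM check at the tail of split_huge_page_to_list() that fires when the
head page still reports mappings after the split path has tried to unmap it. A hedged
paraphrase (not the exact upstream code) of that check:

	/* Hedged paraphrase of the debug check that produced the dump above. */
	static void sketch_split_failed_check(struct page *head, int mapcount, int count)
	{
		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
			pr_alert("total_mapcount: %u, page_count(): %u\n",
				 mapcount, count);
			dump_page(head, "total_mapcount(head) > 0");
			BUG();
		}
	}

So the inconsistency visible in the dump is in the THP's mapcount/refcount at split time.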

> 
> So now this patchset includes 3 parts:
> 1, some code cleanup and minimum optimization as a preparation. 
> 2, use TestCleanPageLRU as page isolation's precondition.
> 3, replace per node lru_lock with per memcg per node lru_lock.
> 
> Current lru_lock is one for each of node, pgdat->lru_lock, that guard for
> lru lists, but now we had moved the lru lists into memcg for long time. Still
> using per node lru_lock is clearly unscalable, pages on each of memcgs have
> to compete each others for a whole lru_lock. This patchset try to use per
> lruvec/memcg lru_lock to repleace per node lru lock to guard lru lists, make
> it scalable for memcgs and get performance gain.
> 
> Currently lru_lock still guards both lru list and page's lru bit, that's ok.
> but if we want to use specific lruvec lock on the page, we need to pin down
> the page's lruvec/memcg during locking. Just taking lruvec lock first may be
> undermined by the page's memcg charge/migration. To fix this problem, we could
> take out the page's lru bit clear and use it as pin down action to block the
> memcg changes. That's the reason for new atomic func TestClearPageLRU.
> So now isolating a page need both actions: TestClearPageLRU and hold the
> lru_lock.
> 
> The typical usage of this is isolate_migratepages_block() in compaction.c
> we have to take lru bit before lru lock, that serialized the page isolation
> in memcg page charge/migration which will change page's lruvec and new 
> lru_lock in it.
> 
> The above solution suggested by Johannes Weiner, and based on his new memcg 
> charge path, then have this patchset. (Hugh Dickins tested and contributed
> much
> code from compaction fix to general code polish, thanks a lot!).
> 
> Daniel Jordan's testing show 62% improvement on modified readtwice case
> on his 2P * 10 core * 2 HT broadwell box on v18, which has no much different
> with this v20.
> https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
> 
> Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought this
> idea 8 years ago, and others who give comments as well: Daniel Jordan, 
> Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc.
> 
> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
> 
> 
> Alex Shi (16):
>   mm/thp: move lru_add_page_tail func to huge_memory.c
>   mm/thp: use head for head page in lru_add_page_tail
>   mm/thp: Simplify lru_add_page_tail()
>   mm/thp: narrow lru locking
>   mm/vmscan: remove unnecessary lruvec adding
>   mm/rmap: stop store reordering issue on page->mapping
>   mm/memcg: add debug checking in lock_page_memcg
>   mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
>   mm/lru: move lock into lru_note_cost
>   mm/vmscan: remove lruvec reget in move_pages_to_lru
>   mm/mlock: remove lru_lock on TestClearPageMlocked
>   mm/mlock: remove __munlock_isolate_lru_page
>   mm/lru: introduce TestClearPageLRU
>   mm/compaction: do page isolation first in compaction
>   mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
>   mm/lru: replace pgdat lru_lock with lruvec lock
> 
> Alexander Duyck (1):
>   mm/lru: introduce the relock_page_lruvec function
> 
> Hugh Dickins (2):
>   mm: page_idle_get_page() does not need lru_lock
>   mm/lru: revise the comments of lru_lock
> 
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
>  Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
>  Documentation/trace/events-kmem.rst                |   2 +-
>  Documentation/vm/unevictable-lru.rst               |  22 +--
>  include/linux/memcontrol.h                         | 110 +++++++++++
>  include/linux/mm_types.h                           |   2 +-
>  include/linux/mmzone.h                             |   6 +-
>  include/linux/page-flags.h                         |   1 +
>  include/linux/swap.h                               |   4 +-
>  mm/compaction.c                                    |  94 +++++++---
>  mm/filemap.c                                       |   4 +-
>  mm/huge_memory.c                                   |  45 +++--
>  mm/memcontrol.c                                    |  79 +++++++-
>  mm/mlock.c                                         |  63 ++-----
>  mm/mmzone.c                                        |   1 +
>  mm/page_alloc.c                                    |   1 -
>  mm/page_idle.c                                     |   4 -
>  mm/rmap.c                                          |  11 +-
>  mm/swap.c                                          | 208 ++++++++-------------
>  mm/vmscan.c                                        | 207 ++++++++++----------
>  mm/workingset.c                                    |   2 -
>  21 files changed, 530 insertions(+), 372 deletions(-)
> 


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
  2021-01-05 19:30   ` Qian Cai
  (?)
@ 2021-01-05 19:42     ` Shakeel Butt
  -1 siblings, 0 replies; 111+ messages in thread
From: Shakeel Butt @ 2021-01-05 19:42 UTC (permalink / raw)
  To: Qian Cai
  Cc: Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Matthew Wilcox,
	Johannes Weiner, kernel test robot, Linux MM, LKML, Cgroups,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov, alexander.duyck,
	kernel test robot, Michal Hocko, Vladimir Davydov, Yang Shi

On Tue, Jan 5, 2021 at 11:30 AM Qian Cai <qcai@redhat.com> wrote:
>
> On Thu, 2020-11-05 at 16:55 +0800, Alex Shi wrote:
> > This version rebase on next/master 20201104, with much of Johannes's
> > Acks and some changes according to Johannes comments. And add a new patch
> > v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> > v21-0007.
> >
> > This patchset followed 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
> > added to -mm tree yesterday.
> >
> > Many thanks for line by line review by Hugh Dickins, Alexander Duyck and
> > Johannes Weiner.
>
> Given the troublesome history of this patchset, and had been put into linux-next
> recently, as well as it touched both THP and mlock. Is it a good idea to suspect
> this patchset introducing some races and a spontaneous crash with some mlock
> memory presume?

This has already been merged into Linus' tree. Were you able to get
a similar crash on the latest upstream kernel as well?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
  2021-01-05 19:42     ` Shakeel Butt
  (?)
@ 2021-01-05 20:11       ` Qian Cai
  -1 siblings, 0 replies; 111+ messages in thread
From: Qian Cai @ 2021-01-05 20:11 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Matthew Wilcox,
	Johannes Weiner, kernel test robot, Linux MM, LKML, Cgroups,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov, alexander.duyck,
	kernel test robot, Michal Hocko, Vladimir Davydov, Yang Shi

On Tue, 2021-01-05 at 11:42 -0800, Shakeel Butt wrote:
> On Tue, Jan 5, 2021 at 11:30 AM Qian Cai <qcai@redhat.com> wrote:
> > On Thu, 2020-11-05 at 16:55 +0800, Alex Shi wrote:
> > > This version rebase on next/master 20201104, with much of Johannes's
> > > Acks and some changes according to Johannes comments. And add a new patch
> > > v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> > > v21-0007.
> > > 
> > > This patchset followed 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
> > > added to -mm tree yesterday.
> > > 
> > > Many thanks for line by line review by Hugh Dickins, Alexander Duyck and
> > > Johannes Weiner.
> > 
> > Given the troublesome history of this patchset, and had been put into linux-
> > next
> > recently, as well as it touched both THP and mlock. Is it a good idea to
> > suspect
> > this patchset introducing some races and a spontaneous crash with some mlock
> > memory presume?
> 
> This has already been merged into the linus tree. Were you able to get
> a similar crash on the latest upstream kernel as well?

No, I seldom test mainline these days. Before the vacation, I had tested
linux-next up to something like 12/10, which did not include this patchset IIRC,
and I never saw any crash like this. I am still trying to figure out how to
reproduce it quickly, so that I can try a revert to confirm.


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
  2021-01-05 20:11       ` Qian Cai
@ 2021-01-05 21:35         ` Hugh Dickins
  -1 siblings, 0 replies; 111+ messages in thread
From: Hugh Dickins @ 2021-01-05 21:35 UTC (permalink / raw)
  To: Qian Cai
  Cc: Shakeel Butt, Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
	Matthew Wilcox, Johannes Weiner, kernel test robot, Linux MM,
	LKML, Cgroups, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	alexander.duyck, kernel test robot, Michal Hocko,
	Vladimir Davydov, Yang Shi

On Tue, 5 Jan 2021, Qian Cai wrote:
> On Tue, 2021-01-05 at 11:42 -0800, Shakeel Butt wrote:
> > On Tue, Jan 5, 2021 at 11:30 AM Qian Cai <qcai@redhat.com> wrote:
> > > On Thu, 2020-11-05 at 16:55 +0800, Alex Shi wrote:
> > > > This version rebase on next/master 20201104, with much of Johannes's
> > > > Acks and some changes according to Johannes comments. And add a new patch
> > > > v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to support
> > > > v21-0007.
> > > > 
> > > > This patchset followed 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
> > > > added to -mm tree yesterday.
> > > > 
> > > > Many thanks for line by line review by Hugh Dickins, Alexander Duyck and
> > > > Johannes Weiner.
> > > 
> > > Given the troublesome history of this patchset, that it went into linux-next
> > > only recently, and that it touches both THP and mlock, is it a good idea to
> > > suspect this patchset of introducing some races and a spontaneous crash
> > > involving some mlocked memory?
> > 
> > This has already been merged into the linus tree. Were you able to get
> > a similar crash on the latest upstream kernel as well?
> 
> No, I seldom test the mainline these days. Before the holidays, I had tested
> linux-next up to something like 12/10, which did not include this patchset IIRC,
> and never saw any crash like this. I am still trying to figure out how to
> reproduce it quickly, so I can try a revert to confirm.

This patchset went into mmotm 2020-11-16-16-23, so probably linux-next
on 2020-11-17: you'll have had three trouble-free weeks testing with it
in, so it's not a likely suspect.  I haven't looked yet at your report,
to think of a more likely suspect: will do.

Hugh

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
  2021-01-05 21:35         ` Hugh Dickins
@ 2021-01-05 22:01           ` Qian Cai
  -1 siblings, 0 replies; 111+ messages in thread
From: Qian Cai @ 2021-01-05 22:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Shakeel Butt, Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo,
	Konstantin Khlebnikov, Daniel Jordan, Matthew Wilcox,
	Johannes Weiner, kernel test robot, Linux MM, LKML, Cgroups,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov, alexander.duyck,
	kernel test robot, Michal Hocko, Vladimir Davydov, Yang Shi

On Tue, 2021-01-05 at 13:35 -0800, Hugh Dickins wrote:
> This patchset went into mmotm 2020-11-16-16-23, so probably linux-next
> on 2020-11-17: you'll have had three trouble-free weeks testing with it
> in, so it's not a likely suspect.  I haven't looked yet at your report,
> to think of a more likely suspect: will do.

Probably my memory was bad, then. Unfortunately, I also had two weeks of
holidays before Thanksgiving. I have tried a few times so far and have only
been able to reproduce it once. Looks nasty...


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v21 00/19] per memcg lru lock
  2021-01-05 22:01           ` Qian Cai
@ 2021-01-06  3:10             ` Hugh Dickins
  -1 siblings, 0 replies; 111+ messages in thread
From: Hugh Dickins @ 2021-01-06  3:10 UTC (permalink / raw)
  To: Qian Cai
  Cc: Hugh Dickins, Shakeel Butt, Alex Shi, Andrew Morton, Mel Gorman,
	Tejun Heo, Konstantin Khlebnikov, Daniel Jordan, Matthew Wilcox,
	Johannes Weiner, kernel test robot, Linux MM, LKML, Cgroups,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov, alexander.duyck,
	kernel test robot, Michal Hocko, Vladimir Davydov, Yang Shi

On Tue, 5 Jan 2021, Qian Cai wrote:
> On Tue, 2021-01-05 at 13:35 -0800, Hugh Dickins wrote:
> > This patchset went into mmotm 2020-11-16-16-23, so probably linux-next
> > on 2020-11-17: you'll have had three trouble-free weeks testing with it
> > in, so it's not a likely suspect.  I haven't looked yet at your report,
> > to think of a more likely suspect: will do.
> 
> Probably my memory was bad, then. Unfortunately, I also had two weeks of
> holidays before Thanksgiving. I have tried a few times so far and have only
> been able to reproduce it once. Looks nasty...

I have not found a likely suspect.

What it smells like is a defect in cloning anon_vma during fork,
such that mappings of the THP can get added even after all that
could be found were unmapped (tree lookup ordering should prevent
that).  But I've not seen any recent change there.

It would be very easily fixed by deleting the whole BUG() block,
which is only there as a sanity check for developers: but we would
not want to delete it without understanding why it has gone wrong
(and would also have to reconsider two related VM_BUG_ON_PAGEs).

It is possible that b6769834aac1 ("mm/thp: narrow lru locking") of this
patchset has changed the timing and made a pre-existing bug more likely
in some situations: it used to hold an lru_lock before that BUG() on
total_mapcount(), and now does not; but that's not a lock which should
be relevant to the check.
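
For anyone looking at this cold: the check in question is the developer sanity
check at the tail of split_huge_page_to_list(), which fires if total_mapcount()
of the head page is still non-zero after unmap_page() should have removed every
mapping. Roughly, in the code before this patchset (a condensed sketch
paraphrased from mm/huge_memory.c of that era, not an exact quote; local names
and the elided branches are approximate):

	/* sketch only: paraphrased, not the exact upstream code */
	/* before b6769834aac1 the per-node lru_lock was held across this: */
	spin_lock_irqsave(&pgdat->lru_lock, flags);
	...
	/* Prevent deferred_split_scan() touching ->_refcount */
	spin_lock(&ds_queue->split_queue_lock);
	count = page_count(head);
	mapcount = total_mapcount(head);
	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
		/* no mappings left and refcount frozen: safe to split */
		__split_huge_page(page, list, end, flags);
		ret = 0;
	} else {
		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
			/* a mapping reappeared after unmap_page() */
			pr_alert("total_mapcount: %u, page_count(): %u\n",
					mapcount, count);
			dump_page(page, "total_mapcount(head) > 0");
			BUG();	/* <-- the BUG() discussed above */
		}
		spin_unlock(&ds_queue->split_queue_lock);
		...
		ret = -EBUSY;
	}

(After b6769834aac1, if I'm reading it right, only interrupts are disabled
across this region and the lruvec lock is taken inside __split_huge_page()
instead, which is the timing change mentioned above.)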

When you get more info (or not), please repost the bugstack in a
new email thread: this thread is not really useful for pursuing it.

Hugh

^ permalink raw reply	[flat|nested] 111+ messages in thread

end of thread, other threads:[~2021-01-06  3:11 UTC | newest]

Thread overview: 111+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-05  8:55 [PATCH v21 00/19] per memcg lru lock Alex Shi
2020-11-05  8:55 ` Alex Shi
2020-11-05  8:55 ` [PATCH v21 01/19] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-05  8:55 ` [PATCH v21 02/19] mm/thp: use head for head page in lru_add_page_tail Alex Shi
2020-11-05  8:55 ` [PATCH v21 03/19] mm/thp: Simplify lru_add_page_tail() Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-05  8:55 ` [PATCH v21 04/19] mm/thp: narrow lru locking Alex Shi
2020-11-05  8:55 ` [PATCH v21 05/19] mm/vmscan: remove unnecessary lruvec adding Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-11 12:36   ` Vlastimil Babka
2020-11-11 12:36     ` Vlastimil Babka
2020-11-05  8:55 ` [PATCH v21 06/19] mm/rmap: stop store reordering issue on page->mapping Alex Shi
2020-11-06  1:20   ` Alex Shi
2020-11-06  1:20     ` Alex Shi
2020-11-10 19:06     ` Johannes Weiner
2020-11-11  7:41     ` Hugh Dickins
2020-11-11  7:41       ` Hugh Dickins
2020-11-05  8:55 ` [PATCH v21 07/19] mm: page_idle_get_page() does not need lru_lock Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-10 19:01   ` Johannes Weiner
2020-11-11  8:17   ` huang ying
2020-11-11  8:17     ` huang ying
2020-11-11  8:17     ` huang ying
2020-11-11 12:52     ` Vlastimil Babka
2020-11-11 12:52       ` Vlastimil Babka
2020-11-05  8:55 ` [PATCH v21 08/19] mm/memcg: add debug checking in lock_page_memcg Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-05  8:55 ` [PATCH v21 09/19] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
2020-11-05  8:55 ` [PATCH v21 10/19] mm/lru: move lock into lru_note_cost Alex Shi
2020-11-05  8:55 ` [PATCH v21 11/19] mm/vmscan: remove lruvec reget in move_pages_to_lru Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-05  8:55 ` [PATCH v21 12/19] mm/mlock: remove lru_lock on TestClearPageMlocked Alex Shi
2020-11-11 13:03   ` Vlastimil Babka
2020-11-11 13:03     ` Vlastimil Babka
2020-11-05  8:55 ` [PATCH v21 13/19] mm/mlock: remove __munlock_isolate_lru_page Alex Shi
2020-11-11 13:07   ` Vlastimil Babka
2020-11-05  8:55 ` [PATCH v21 14/19] mm/lru: introduce TestClearPageLRU Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-11 13:36   ` Vlastimil Babka
2020-11-12  2:03     ` Hugh Dickins
2020-11-12  2:03       ` Hugh Dickins
2020-11-12 11:24       ` Vlastimil Babka
2020-11-05  8:55 ` [PATCH v21 15/19] mm/compaction: do page isolation first in compaction Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-11 17:12   ` Vlastimil Babka
2020-11-11 17:12     ` Vlastimil Babka
2020-11-12  2:28     ` Hugh Dickins
2020-11-12  2:28       ` Hugh Dickins
2020-11-12  3:35       ` Alex Shi
2020-11-12  3:35         ` Alex Shi
2020-11-12 11:25       ` Vlastimil Babka
2020-11-12 11:25         ` Vlastimil Babka
2020-11-05  8:55 ` [PATCH v21 16/19] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn Alex Shi
2020-11-11 18:00   ` Vlastimil Babka
2020-11-11 18:00     ` Vlastimil Babka
2020-11-05  8:55 ` [PATCH v21 17/19] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-05 13:43   ` Alex Shi
2020-11-05 13:43     ` Alex Shi
2020-11-06  7:48     ` Alex Shi
2020-11-06  7:48       ` Alex Shi
2020-11-10 18:54       ` Johannes Weiner
2020-11-10 18:54         ` Johannes Weiner
2020-11-11 17:46   ` Vlastimil Babka
2020-11-11 17:46     ` Vlastimil Babka
2020-11-11 17:59     ` Vlastimil Babka
2020-11-12 12:19   ` Vlastimil Babka
2020-11-12 12:19     ` Vlastimil Babka
2020-11-12 14:19     ` Alex Shi
2020-11-12 14:19       ` Alex Shi
2020-11-05  8:55 ` [PATCH v21 18/19] mm/lru: introduce the relock_page_lruvec function Alex Shi
2020-11-05  8:55   ` Alex Shi
2020-11-06  7:50   ` Alex Shi
2020-11-06  7:50     ` Alex Shi
2020-11-10 18:59     ` Johannes Weiner
2020-11-10 18:59       ` Johannes Weiner
2020-11-12 12:31   ` Vlastimil Babka
2020-11-12 12:31     ` Vlastimil Babka
2020-11-05  8:55 ` [PATCH v21 19/19] mm/lru: revise the comments of lru_lock Alex Shi
2020-11-12 12:37   ` Vlastimil Babka
2020-11-12 12:37     ` Vlastimil Babka
2020-11-10 12:14 ` [PATCH v21 00/19] per memcg lru lock Alex Shi
2020-11-10 12:14   ` Alex Shi
2020-11-16  3:45 ` Alex Shi
2020-11-16  3:45   ` Alex Shi
2020-12-15  0:47 ` Andrew Morton
2020-12-15  0:47   ` Andrew Morton
2020-12-15  2:16   ` Hugh Dickins
2020-12-15  2:16     ` Hugh Dickins
2020-12-15  2:16     ` Hugh Dickins
2020-12-15  2:28     ` Andrew Morton
2020-12-15  2:28       ` Andrew Morton
2021-01-05 19:30 ` Qian Cai
2021-01-05 19:30   ` Qian Cai
2021-01-05 19:30   ` Qian Cai
2021-01-05 19:42   ` Shakeel Butt
2021-01-05 19:42     ` Shakeel Butt
2021-01-05 19:42     ` Shakeel Butt
2021-01-05 20:11     ` Qian Cai
2021-01-05 20:11       ` Qian Cai
2021-01-05 20:11       ` Qian Cai
2021-01-05 21:35       ` Hugh Dickins
2021-01-05 21:35         ` Hugh Dickins
2021-01-05 21:35         ` Hugh Dickins
2021-01-05 22:01         ` Qian Cai
2021-01-05 22:01           ` Qian Cai
2021-01-05 22:01           ` Qian Cai
2021-01-06  3:10           ` Hugh Dickins
2021-01-06  3:10             ` Hugh Dickins
2021-01-06  3:10             ` Hugh Dickins
