linux-mm.kvack.org archive mirror
* [PATCH v20 00/20] per memcg lru lock
@ 2020-10-29 10:44 Alex Shi
  2020-10-29 10:44 ` [PATCH v20 01/20] mm/memcg: warning on !memcg after readahead page charged Alex Shi
                   ` (20 more replies)
  0 siblings, 21 replies; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

This version is just a rebase onto v5.10-rc1, and moves the lru_lock position
down below lists[] in struct lruvec, which resolves a fio.read regression
revealed by Rong Chen -- Intel LKP.
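
For reference, a rough sketch of where lru_lock ends up in struct lruvec after
the series (only the lists[]/lru_lock ordering is the point here; the other
fields are only indicated by a comment, see the final patch for the exact
layout):

	struct lruvec {
		struct list_head	lists[NR_LRU_LISTS];
		/* per-lruvec lru_lock for memcg, placed below lists[] as noted above */
		spinlock_t		lru_lock;
		/* the existing cost/refault fields and (with CONFIG_MEMCG)
		 * the pgdat back-pointer follow here */
	};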

Many thanks to Hugh Dickins and Alexander Duyck for their line-by-line review.

So now this patchset consists of 3 parts:
1, some code cleanup and minor optimization, as preparation.
2, use TestClearPageLRU as the precondition for page isolation.
3, replace the per-node lru_lock with a per-memcg, per-node lru_lock.

Currently there is one lru_lock per node, pgdat->lru_lock, guarding the lru
lists, even though the lru lists were moved into memcg (the lruvec) long ago.
Keeping a per-node lru_lock is clearly unscalable: pages in different memcgs
have to compete with each other for one big lock. This patchset replaces the
per-node lru lock with a per-lruvec (per-memcg, per-node) lru_lock to guard
the lru lists, making the locking scale with memcgs and gaining performance.

Currently lru_lock guards both the lru list and the page's lru bit, which is
fine. But if we want to take the page's specific lruvec lock, we need to pin
down the page's lruvec/memcg while locking: just taking the lruvec lock first
can be undermined by a concurrent memcg charge/migration of the page. To fix
this, we clear the page's lru bit up front and use that as the pinning action
that blocks memcg changes. That is the reason for the new atomic helper
TestClearPageLRU. Isolating a page now requires both actions:
TestClearPageLRU and holding the lru_lock.

The typical user of this is isolate_migratepages_block() in compaction.c:
we have to clear the lru bit before taking the lru lock, which serializes
page isolation against memcg charge/migration, since those change the page's
lruvec and hence which lru_lock applies to it.
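
For illustration, page isolation after the whole series looks roughly like
this (a sketch only, assuming the lock_page_lruvec_irq()/
unlock_page_lruvec_irq() helpers added by the lru_lock replacement patch;
names may differ in the final code):

	/* Sketch of isolating one page under the per-memcg lock: */
	if (TestClearPageLRU(page)) {
		struct lruvec *lruvec;

		get_page(page);
		/*
		 * With the lru bit cleared, memcg charge/migration is blocked,
		 * so the page's lruvec cannot change while we take its lock.
		 */
		lruvec = lock_page_lruvec_irq(page);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		unlock_page_lruvec_irq(lruvec);
	}

Compare with today's code, which takes pgdat->lru_lock first and only then
tests PageLRU.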

The above solution was suggested by Johannes Weiner and builds on his new
memcg charge path, which led to this patchset. (Hugh Dickins tested and
contributed much code, from the compaction fix to general code polish --
thanks a lot!)

Daniel Jordan's testing showed a 62% improvement on a modified readtwice
case on his 2-socket * 10-core * 2-HT Broadwell box with v18, which does
not differ much from this v20.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed this
idea 8 years ago, and to the others who gave comments as well: Daniel Jordan,
Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc.

Thanks for the testing support from Intel 0day and Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

Alex Shi (17):
  mm/memcg: warning on !memcg after readahead page charged
  mm/memcg: bail early from swap accounting if memcg disabled
  mm/thp: move lru_add_page_tail func to huge_memory.c
  mm/thp: use head for head page in lru_add_page_tail
  mm/thp: Simplify lru_add_page_tail()
  mm/thp: narrow lru locking
  mm/vmscan: remove unnecessary lruvec adding
  mm/memcg: add debug checking in lock_page_memcg
  mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  mm/lru: move lock into lru_note_cost
  mm/vmscan: remove lruvec reget in move_pages_to_lru
  mm/mlock: remove lru_lock on TestClearPageMlocked
  mm/mlock: remove __munlock_isolate_lru_page
  mm/lru: introduce TestClearPageLRU
  mm/compaction: do page isolation first in compaction
  mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  mm/lru: replace pgdat lru_lock with lruvec lock

Alexander Duyck (1):
  mm/lru: introduce the relock_page_lruvec function

Hugh Dickins (2):
  mm: page_idle_get_page() does not need lru_lock
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
 Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 +--
 include/linux/memcontrol.h                         | 110 +++++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmdebug.h                            |  13 ++
 include/linux/mmzone.h                             |   6 +-
 include/linux/page-flags.h                         |   1 +
 include/linux/swap.h                               |   4 +-
 mm/compaction.c                                    |  94 +++++++---
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  45 +++--
 mm/memcontrol.c                                    |  85 ++++++++-
 mm/mlock.c                                         |  63 ++-----
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   4 -
 mm/rmap.c                                          |   4 +-
 mm/swap.c                                          | 199 ++++++++------------
 mm/vmscan.c                                        | 203 +++++++++++----------
 mm/workingset.c                                    |   2 -
 22 files changed, 523 insertions(+), 378 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v20 01/20] mm/memcg: warning on !memcg after readahead page charged
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-10-29 13:43   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled Alex Shi
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

Add a VM_WARN_ON_ONCE_PAGE() macro.

Since readahead pages are now charged to a memcg too, in theory we no longer
have to check for this exception. Before safely removing all those checks,
add a warning for the unexpected !memcg case.
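
For reference, the macro is a statement expression that evaluates to the
condition, so a caller could also use it to gate a fallback (a hypothetical
example only; the real call sites below just warn and keep the existing
checks):

	memcg = page->mem_cgroup;
	if (VM_WARN_ON_ONCE_PAGE(!memcg, page))
		memcg = root_mem_cgroup;	/* warn once, dump the page, then fall back */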

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmdebug.h | 13 +++++++++++++
 mm/memcontrol.c         | 11 ++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h
index 2ad72d2c8cc5..5d0767cb424a 100644
--- a/include/linux/mmdebug.h
+++ b/include/linux/mmdebug.h
@@ -37,6 +37,18 @@
 			BUG();						\
 		}							\
 	} while (0)
+#define VM_WARN_ON_ONCE_PAGE(cond, page)	({			\
+	static bool __section(".data.once") __warned;			\
+	int __ret_warn_once = !!(cond);					\
+									\
+	if (unlikely(__ret_warn_once && !__warned)) {			\
+		dump_page(page, "VM_WARN_ON_ONCE_PAGE(" __stringify(cond)")");\
+		__warned = true;					\
+		WARN_ON(1);						\
+	}								\
+	unlikely(__ret_warn_once);					\
+})
+
 #define VM_WARN_ON(cond) (void)WARN_ON(cond)
 #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
 #define VM_WARN_ONCE(cond, format...) (void)WARN_ONCE(cond, format)
@@ -48,6 +60,7 @@
 #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond)
 #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
+#define VM_WARN_ON_ONCE_PAGE(cond, page)  BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond)
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a24e3b619f5..6b67da305958 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1350,10 +1350,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	}
 
 	memcg = page->mem_cgroup;
-	/*
-	 * Swapcache readahead pages are added to the LRU - and
-	 * possibly migrated - before they are charged.
-	 */
+	VM_WARN_ON_ONCE_PAGE(!memcg, page);
 	if (!memcg)
 		memcg = root_mem_cgroup;
 
@@ -6979,8 +6976,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	if (newpage->mem_cgroup)
 		return;
 
-	/* Swapcache readahead pages can get replaced before being charged */
 	memcg = oldpage->mem_cgroup;
+	VM_WARN_ON_ONCE_PAGE(!memcg, oldpage);
 	if (!memcg)
 		return;
 
@@ -7177,7 +7174,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
 	memcg = page->mem_cgroup;
 
-	/* Readahead page, never charged */
+	VM_WARN_ON_ONCE_PAGE(!memcg, page);
 	if (!memcg)
 		return;
 
@@ -7241,7 +7238,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 
 	memcg = page->mem_cgroup;
 
-	/* Readahead page, never charged */
+	VM_WARN_ON_ONCE_PAGE(!memcg, page);
 	if (!memcg)
 		return 0;
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
  2020-10-29 10:44 ` [PATCH v20 01/20] mm/memcg: warning on !memcg after readahead page charged Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-10-29 13:46   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

If memcg is disabled by cgroup_disable=memory, the charge is skipped and
page->memcg stays NULL, which triggers a warning like the one below. Let's
return from these functions earlier when memcg is disabled.

 anon flags:0x5005b48008000d(locked|uptodate|dirty|swapbacked)
 raw: 005005b48008000d dead000000000100 dead000000000122 ffff8897c7c76ad1
 raw: 0000000000000022 0000000000000000 0000000200000000 0000000000000000
 page dumped because: VM_WARN_ON_ONCE_PAGE(!memcg)
...
 RIP: 0010:vprintk_emit+0x1f7/0x260
 Code: 00 84 d2 74 72 0f b6 15 27 58 64 01 48 c7 c0 00 d4 72 82 84 d2 74 09 f3 90 0f b6 10 84 d2 75 f7 e8 de 0d 00 00 4c 89 e7 57 9d <0f> 1f 44 00 00 e9 62 ff ff ff 80 3d 88 c9 3a 01 00 0f 85 54 fe ff
 RSP: 0018:ffffc9000faab358 EFLAGS: 00000202
 RAX: ffffffff8272d400 RBX: 000000000000005e RCX: ffff88afd80d0040
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000202
 RBP: ffffc9000faab3a8 R08: ffffffff8272d440 R09: 0000000000022480
 R10: 00120c77be68bfac R11: 0000000000cd7568 R12: 0000000000000202
 R13: 0057ffffc0080005 R14: ffffffff820a0130 R15: ffffc9000faab3e8
 ? vprintk_emit+0x140/0x260
 vprintk_default+0x1a/0x20
 vprintk_func+0x4f/0xc4
 ? vprintk_func+0x4f/0xc4
 printk+0x53/0x6a
 ? xas_load+0xc/0x80
 __dump_page.cold.6+0xff/0x4ee
 ? xas_init_marks+0x23/0x50
 ? xas_store+0x30/0x40
 ? free_swap_slot+0x43/0xd0
 ? put_swap_page+0x119/0x320
 ? update_load_avg+0x82/0x580
 dump_page+0x9/0xb
 mem_cgroup_try_charge_swap+0x16e/0x1d0
 get_swap_page+0x130/0x210
 add_to_swap+0x41/0xc0
 shrink_page_list+0x99e/0xdf0
 shrink_inactive_list+0x199/0x360
 shrink_lruvec+0x40d/0x650
 ? _cond_resched+0x14/0x30
 ? _cond_resched+0x14/0x30
 shrink_node+0x226/0x6e0
 do_try_to_free_pages+0xd0/0x400
 try_to_free_pages+0xef/0x130
 __alloc_pages_slowpath.constprop.127+0x38d/0xbd0
 ? ___slab_alloc+0x31d/0x6f0
 __alloc_pages_nodemask+0x27f/0x2c0
 alloc_pages_vma+0x75/0x220
 shmem_alloc_page+0x46/0x90
 ? release_pages+0x1ae/0x410
 shmem_alloc_and_acct_page+0x77/0x1c0
 shmem_getpage_gfp+0x162/0x910
 shmem_fault+0x74/0x210
 ? filemap_map_pages+0x29c/0x410
 __do_fault+0x37/0x190
 handle_mm_fault+0x120a/0x1770
 exc_page_fault+0x251/0x450
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6b67da305958..e46b9f9501c2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7169,6 +7169,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
 
+	if (mem_cgroup_disabled())
+		return;
+
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
@@ -7233,6 +7236,9 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
 
+	if (mem_cgroup_disabled())
+		return 0;
+
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return 0;
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
  2020-10-29 10:44 ` [PATCH v20 01/20] mm/memcg: warning on !memcg after readahead page charged Alex Shi
  2020-10-29 10:44 ` [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-10-29 13:47   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail Alex Shi
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

The function is only used in huge_memory.c; defining it in another file
under a CONFIG_TRANSPARENT_HUGEPAGE guard just looks odd.

Let's move it into huge_memory.c and make it static, as Hugh Dickins
suggested.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 --
 mm/huge_memory.c     | 30 ++++++++++++++++++++++++++++++
 mm/swap.c            | 33 ---------------------------------
 3 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 667935c0dbd4..5e1e967c225f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
 			  unsigned int nr_pages);
 extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
-extern void lru_add_page_tail(struct page *page, struct page *page_tail,
-			 struct lruvec *lruvec, struct list_head *head);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9474dbc150ed..038db815ebba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2346,6 +2346,36 @@ static void remap_page(struct page *page, unsigned int nr)
 	}
 }
 
+static void lru_add_page_tail(struct page *page, struct page *page_tail,
+		struct lruvec *lruvec, struct list_head *list)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+
+	if (!list)
+		SetPageLRU(page_tail);
+
+	if (likely(PageLRU(page)))
+		list_add_tail(&page_tail->lru, &page->lru);
+	else if (list) {
+		/* page reclaim is reclaiming a huge page */
+		get_page(page_tail);
+		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Head page has not yet been counted, as an hpage,
+		 * so we must account for each subpage individually.
+		 *
+		 * Put page_tail on the list at the correct position
+		 * so they all end up in order.
+		 */
+		add_page_to_lru_list_tail(page_tail, lruvec,
+					  page_lru(page_tail));
+	}
+}
+
 static void __split_huge_page_tail(struct page *head, int tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
diff --git a/mm/swap.c b/mm/swap.c
index 47a47681c86b..05bc9ff6d8c0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -974,39 +974,6 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct page *page, struct page *page_tail,
-		       struct lruvec *lruvec, struct list_head *list)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
-
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
-	else if (list) {
-		/* page reclaim is reclaiming a huge page */
-		get_page(page_tail);
-		list_add_tail(&page_tail->lru, list);
-	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
-	}
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 				 void *arg)
 {
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (2 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-10-29 13:50   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail() Alex Shi
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Since the first parameter is only ever the head page, it's better to make
that explicit by renaming it "head".

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 038db815ebba..93c0b73eb8c6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2346,19 +2346,19 @@ static void remap_page(struct page *page, unsigned int nr)
 	}
 }
 
-static void lru_add_page_tail(struct page *page, struct page *page_tail,
+static void lru_add_page_tail(struct page *head, struct page *page_tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	VM_BUG_ON_PAGE(!PageHead(head), head);
+	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
 	if (!list)
 		SetPageLRU(page_tail);
 
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
+	if (likely(PageLRU(head)))
+		list_add_tail(&page_tail->lru, &head->lru);
 	else if (list) {
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail()
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (3 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-10-29 14:00   ` Johannes Weiner
  2020-10-30  2:48   ` Alex Shi
  2020-10-29 10:44 ` [PATCH v20 06/20] mm/thp: narrow lru locking Alex Shi
                   ` (15 subsequent siblings)
  20 siblings, 2 replies; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Mika Penttilä

Simplify lru_add_page_tail(): there are actually only two possible cases:
split_huge_page_to_list(), with list supplied and head isolated from lru
by its caller; or split_huge_page(), with NULL list and head on lru -
because when head is racily isolated from lru, the isolator's reference
will stop the split from getting any further than its page_ref_freeze().

So decide between the two cases by "list", but add VM_WARN_ON()s to
verify that they match our lru expectations.

[Hugh Dickins: rewrite commit log]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 93c0b73eb8c6..4b72dd7b8b34 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,25 +2354,16 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(head)))
-		list_add_tail(&page_tail->lru, &head->lru);
-	else if (list) {
+	if (list) {
 		/* page reclaim is reclaiming a huge page */
+		VM_WARN_ON(PageLRU(head));
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
 	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
+		/* head is still on lru (and we have it frozen) */
+		VM_WARN_ON(!PageLRU(head));
+		SetPageLRU(page_tail);
+		list_add_tail(&page_tail->lru, &head->lru);
 	}
 }
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 06/20] mm/thp: narrow lru locking
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (4 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail() Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-10-29 10:44 ` [PATCH v20 07/20] mm/vmscan: remove unnecessary lruvec adding Alex Shi
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Andrea Arcangeli

lru_lock and page cache xa_lock have no obvious reason to be taken
one way round or the other: until now, lru_lock has been taken before
page cache xa_lock, when splitting a THP; but nothing else takes them
together.  Reverse that ordering: let's narrow the lru locking - but
leave local_irq_disable to block interrupts throughout, like before.

Hugh Dickins' point: split_huge_page_to_list() was already silly to be
using the _irqsave variant: it has just been taking sleeping locks, so it
can only have been entered with interrupts enabled (anything else would
already be broken). So we can avoid passing the flags argument down to
__split_huge_page().

Why change the lock ordering here? That was hard to decide. One reason:
when this series reaches per-memcg lru locking, it relies on the THP's
memcg to be stable when taking the lru_lock: that is now done after the
THP's refcount has been frozen, which ensures page memcg cannot change.

Another reason: previously, lock_page_memcg()'s move_lock was presumed
to nest inside lru_lock; but now lru_lock must nest inside (page cache
lock inside) move_lock, so it becomes possible to use lock_page_memcg()
to stabilize page memcg before taking its lru_lock.  That is not the
mechanism used in this series, but it is an option we want to keep open.
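
To summarize, the ordering after this patch is roughly as follows (a
condensed sketch of the two functions changed below, not a literal quote):

	/* split_huge_page_to_list() */
	local_irq_disable();			/* irqs off for the whole split */
	if (mapping)
		xa_lock(&mapping->i_pages);	/* page cache lock taken first */
	/* ... refcount frozen here, so the page's memcg is stable ... */

	/* __split_huge_page() */
	spin_lock(&pgdat->lru_lock);		/* lru_lock now nests inside */
	/* __split_huge_page_tail() for each tail page */
	spin_unlock(&pgdat->lru_lock);
	/* ... i_pages unlocked, refcount released ... */
	local_irq_enable();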

[Hugh Dickins: rewrite commit log]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b72dd7b8b34..5fa890e26975 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2433,7 +2433,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+		pgoff_t end)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
@@ -2443,8 +2443,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned int nr = thp_nr_pages(head);
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2456,6 +2454,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock(&pgdat->lru_lock);
+
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
@@ -2475,6 +2478,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
+	spin_unlock(&pgdat->lru_lock);
+	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
 
@@ -2492,8 +2497,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		page_ref_add(head, 2);
 		xa_unlock(&head->mapping->i_pages);
 	}
-
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	local_irq_enable();
 
 	remap_page(head, nr);
 
@@ -2639,12 +2643,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
-	unsigned long flags;
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2705,9 +2707,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
-
+	/* block interrupt reentry in xa_lock and spinlock */
+	local_irq_disable();
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
 
@@ -2737,7 +2738,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end);
 		ret = 0;
 	} else {
 		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
@@ -2751,7 +2752,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		local_irq_enable();
 		remap_page(head, thp_nr_pages(head));
 		ret = -EBUSY;
 	}
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 07/20] mm/vmscan: remove unnecessary lruvec adding
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (5 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 06/20] mm/thp: narrow lru locking Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-11-02 14:20   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock Alex Shi
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

We don't have to add a freeable page to the lru and then remove it again.
This change saves a couple of actions and makes the movement clearer.

The SetPageLRU needs to be kept before put_page_testzero for list
integrity; otherwise:

  #0 move_pages_to_lru             #1 release_pages
  if !put_page_testzero
     			           if (put_page_testzero())
     			              !PageLRU //skip lru_lock
     SetPageLRU()
     list_add(&page->lru,)
                                         list_add(&page->lru,)
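
For context, the release_pages() side of the diagram looks roughly like this
in mm/swap.c (condensed, not part of this patch): it only takes the lru_lock
and touches page->lru if it still sees PageLRU after winning
put_page_testzero, which is why the lru bit must be set before the extra
reference is dropped here.

	if (!put_page_testzero(page))
		continue;			/* someone else still holds it */
	if (PageLRU(page)) {
		/* only now is the lru_lock taken and page->lru touched */
		spin_lock_irqsave(&page_pgdat(page)->lru_lock, flags);
		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		__ClearPageLRU(page);
		del_page_from_lru_list(page, lruvec, page_off_lru(page));
		spin_unlock_irqrestore(&page_pgdat(page)->lru_lock, flags);
	}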

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmscan.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1b8f0e059767..6c6965cbbdef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1852,26 +1852,30 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
+		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			list_del(&page->lru);
 			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
 			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
+		/*
+		 * The SetPageLRU needs to be kept here for list integrity.
+		 * Otherwise:
+		 *   #0 move_pages_to_lru             #1 release_pages
+		 *   if !put_page_testzero
+		 *				      if (put_page_testzero())
+		 *				        !PageLRU //skip lru_lock
+		 *     SetPageLRU()
+		 *     list_add(&page->lru,)
+		 *                                        list_add(&page->lru,)
+		 */
 		SetPageLRU(page);
-		lru = page_lru(page);
 
-		nr_pages = thp_nr_pages(page);
-		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
-		list_move(&page->lru, &lruvec->lists[lru]);
-
-		if (put_page_testzero(page)) {
+		if (unlikely(put_page_testzero(page))) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&pgdat->lru_lock);
@@ -1879,11 +1883,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
-		} else {
-			nr_moved += nr_pages;
-			if (PageActive(page))
-				workingset_age_nonresident(lruvec, nr_pages);
+
+			continue;
 		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lru = page_lru(page);
+		nr_pages = thp_nr_pages(page);
+
+		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
+		list_add(&page->lru, &lruvec->lists[lru]);
+		nr_moved += nr_pages;
+		if (PageActive(page))
+			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
 	/*
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (6 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 07/20] mm/vmscan: remove unnecessary lruvec adding Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-11-02 14:41   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 09/20] mm/memcg: add debug checking in lock_page_memcg Alex Shi
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Vlastimil Babka, Minchan Kim

From: Hugh Dickins <hughd@google.com>

It is necessary for page_idle_get_page() to recheck PageLRU() after
get_page_unless_zero(), but holding lru_lock around that serves no
useful purpose, and adds to lru_lock contention: delete it.

See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
discussion that led to lru_lock there; but __page_set_anon_rmap() now
uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs()
using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly
but not entirely prevented by page_count() check in ksm.c's
write_protect_page(): that risk being shared with page_referenced() and
not helped by lru_lock).

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/page_idle.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 057c61df12db..64e5344a992c 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,19 +32,15 @@
 static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page = pfn_to_online_page(pfn);
-	pg_data_t *pgdat;
 
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
 
-	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
 	return page;
 }
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 09/20] mm/memcg: add debug checking in lock_page_memcg
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (7 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-11-02 14:45   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 10/20] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

Add a debug check in lock_page_memcg, so we get an alarm if anything goes
wrong here.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e46b9f9501c2..599aa8863111 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2142,6 +2142,12 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 	if (unlikely(!memcg))
 		return NULL;
 
+#ifdef CONFIG_PROVE_LOCKING
+	local_irq_save(flags);
+	might_lock(&memcg->move_lock);
+	local_irq_restore(flags);
+#endif
+
 	if (atomic_read(&memcg->moving_account) <= 0)
 		return memcg;
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 10/20] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (8 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 09/20] mm/memcg: add debug checking in lock_page_memcg Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-11-02 14:48   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 11/20] mm/lru: move lock into lru_note_cost Alex Shi
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Fold the PGROTATED event accounting into the pagevec_move_tail_fn callback,
as the other callbacks of pagevec_lru_move_fn already do. That lets us drop
the pagevec_move_tail() wrapper.
Now all users of pagevec_lru_move_fn are alike and its 3rd parameter is no
longer needed.

This only simplifies the calling convention. No functional change.

[lkp@intel.com: found a build issue in the original patch, thanks]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 65 ++++++++++++++++++++++-----------------------------------------
 1 file changed, 23 insertions(+), 42 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 05bc9ff6d8c0..31fc3ebc1079 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -204,8 +204,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
-	void *arg)
+	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
 	struct pglist_data *pgdat = NULL;
@@ -224,7 +223,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		(*move_fn)(page, lruvec, arg);
+		(*move_fn)(page, lruvec);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -232,35 +231,22 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	int *pgmoved = arg;
-
 	if (PageLRU(page) && !PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
-		(*pgmoved) += thp_nr_pages(page);
+		__count_vm_events(PGROTATED, thp_nr_pages(page));
 	}
 }
 
 /*
- * pagevec_move_tail() must be called with IRQ disabled.
- * Otherwise this may cause nasty races.
- */
-static void pagevec_move_tail(struct pagevec *pvec)
-{
-	int pgmoved = 0;
-
-	pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
-	__count_vm_events(PGROTATED, pgmoved);
-}
-
-/*
  * Writeback is about to end against a page which has been marked for immediate
  * reclaim.  If it still appears to be reclaimable, move it to the tail of the
  * inactive list.
+ *
+ * rotate_reclaimable_page() must disable IRQs, to prevent nasty races.
  */
 void rotate_reclaimable_page(struct page *page)
 {
@@ -273,7 +259,7 @@ void rotate_reclaimable_page(struct page *page)
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_move_tail(pvec);
+			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 }
@@ -315,8 +301,7 @@ void lru_note_cost_page(struct page *page)
 		      page_is_file_lru(page), thp_nr_pages(page));
 }
 
-static void __activate_page(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -340,7 +325,7 @@ static void activate_page_drain(int cpu)
 	struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
 
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, __activate_page, NULL);
+		pagevec_lru_move_fn(pvec, __activate_page);
 }
 
 static bool need_activate_page_drain(int cpu)
@@ -358,7 +343,7 @@ static void activate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, __activate_page, NULL);
+			pagevec_lru_move_fn(pvec, __activate_page);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -374,7 +359,7 @@ static void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
+	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -525,8 +510,7 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
-			      void *arg)
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 {
 	int lru;
 	bool active;
@@ -573,8 +557,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -591,8 +574,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
@@ -636,21 +618,21 @@ void lru_add_drain_cpu(int cpu)
 
 		/* No harm done if a racing interrupt already did this */
 		local_lock_irqsave(&lru_rotate.lock, flags);
-		pagevec_move_tail(pvec);
+		pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
 	activate_page_drain(cpu);
 }
@@ -679,7 +661,7 @@ void deactivate_file_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -701,7 +683,7 @@ void deactivate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -723,7 +705,7 @@ void mark_page_lazyfree(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -974,8 +956,7 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 {
 	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
@@ -1034,7 +1015,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
 }
 
 /**
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 11/20] mm/lru: move lock into lru_note_cost
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (9 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 10/20] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-10-29 13:42   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru Alex Shi
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

We have to move the lru_lock into lru_note_cost, since it walks up the
memcg tree, in preparation for the coming per-lruvec lru_lock replacement.
It's a bit ugly and may cost a bit more locking, but the benefit from
finer-grained per-memcg locking should cover that loss.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c       | 3 +++
 mm/vmscan.c     | 4 +---
 mm/workingset.c | 2 --
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 31fc3ebc1079..8798bd899db0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -268,7 +268,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+		spin_lock_irq(&pgdat->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -292,6 +294,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
+		spin_unlock_irq(&pgdat->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6c6965cbbdef..42bac12aacb4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1973,19 +1973,17 @@ static int current_may_throttle(void)
 				&stat, false);
 
 	spin_lock_irq(&pgdat->lru_lock);
-
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	lru_note_cost(lruvec, file, stat.nr_pageout);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-
 	spin_unlock_irq(&pgdat->lru_lock);
 
+	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
 	free_unref_page_list(&page_list);
 
diff --git a/mm/workingset.c b/mm/workingset.c
index 975a4d2dd02e..d8d2fdf70c24 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -381,9 +381,7 @@ void workingset_refault(struct page *page, void *shadow)
 	if (workingset) {
 		SetPageWorkingset(page);
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
-		spin_lock_irq(&page_pgdat(page)->lru_lock);
 		lru_note_cost_page(page);
-		spin_unlock_irq(&page_pgdat(page)->lru_lock);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file);
 	}
 out:
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (10 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 11/20] mm/lru: move lock into lru_note_cost Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-11-02 14:52   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 13/20] mm/mlock: remove lru_lock on TestClearPageMlocked Alex Shi
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Michal Hocko

An isolated page shouldn't be recharged to another memcg, since memcg
migration isn't possible at that point.
So remove the unnecessary re-fetch of the lruvec.

Thanks to Alexander Duyck for pointing this out.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42bac12aacb4..845733afccde 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1887,7 +1887,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page))
+							!= lruvec, page);
 		lru = page_lru(page);
 		nr_pages = thp_nr_pages(page);
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 13/20] mm/mlock: remove lru_lock on TestClearPageMlocked
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (11 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-11-02 14:55   ` Johannes Weiner
  2020-10-29 10:44 ` [PATCH v20 14/20] mm/mlock: remove __munlock_isolate_lru_page Alex Shi
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Kirill A. Shutemov, Vlastimil Babka

In munlock_vma_page(), the comments maintained that lru_lock was needed for
serialization with split_huge_page. But the page must be PageLocked here, as
it is in the split_huge_page series of functions, so PageLocked is enough to
serialize both.

Furthermore, Hugh Dickins pointed out that before splitting in
split_huge_page_to_list(), the page is unmap_page()ed to remove the
pmd/ptes, which protects it from munlock. Thus there is no need to guard
__split_huge_page_tail for the mlock cleanup; the lru_lock there is kept
only for isolation purposes.

LKP found a preemption issue with __mod_zone_page_state, which needs to be
changed to mod_zone_page_state. Thanks!

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 26 +++++---------------------
 1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 884b1216da6a..796c726a0407 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -187,40 +187,24 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	pg_data_t *pgdat = page_pgdat(page);
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
-
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
-	/*
-	 * Serialize with any parallel __split_huge_page_refcount() which
-	 * might otherwise copy PageMlocked to part of the tail pages before
-	 * we clear it in the head page. It also stabilizes thp_nr_pages().
-	 */
-	spin_lock_irq(&pgdat->lru_lock);
-
 	if (!TestClearPageMlocked(page)) {
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
+		return 0;
 	}
 
 	nr_pages = thp_nr_pages(page);
-	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+	mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&pgdat->lru_lock);
+	if (!isolate_lru_page(page))
 		__munlock_isolated_page(page);
-		goto out;
-	}
-	__munlock_isolation_failed(page);
-
-unlock_out:
-	spin_unlock_irq(&pgdat->lru_lock);
+	else
+		__munlock_isolation_failed(page);
 
-out:
 	return nr_pages - 1;
 }
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 14/20] mm/mlock: remove __munlock_isolate_lru_page
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (12 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 13/20] mm/mlock: remove lru_lock on TestClearPageMlocked Alex Shi
@ 2020-10-29 10:44 ` Alex Shi
  2020-11-02 14:56   ` Johannes Weiner
  2020-10-29 10:45 ` [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU Alex Shi
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:44 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Kirill A. Shutemov, Vlastimil Babka

The function has only one caller; open-code it there to clean up and
simplify the code.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 31 +++++++++----------------------
 1 file changed, 9 insertions(+), 22 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 796c726a0407..d487aa864e86 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,26 +106,6 @@ void mlock_vma_page(struct page *page)
 }
 
 /*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
-{
-	if (PageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (getpage)
-			get_page(page);
-		ClearPageLRU(page);
-		del_page_from_lru_list(page, lruvec, page_lru(page));
-		return true;
-	}
-
-	return false;
-}
-
-/*
  * Finish munlock after successful page isolation
  *
  * Page must be locked. This is a wrapper for try_to_munlock()
@@ -296,9 +276,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, false))
+			if (PageLRU(page)) {
+				struct lruvec *lruvec;
+
+				ClearPageLRU(page);
+				lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+				del_page_from_lru_list(page, lruvec,
+							page_lru(page));
 				continue;
-			else
+			} else
 				__munlock_isolation_failed(page);
 		} else {
 			delta_munlocked++;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (13 preceding siblings ...)
  2020-10-29 10:44 ` [PATCH v20 14/20] mm/mlock: remove __munlock_isolate_lru_page Alex Shi
@ 2020-10-29 10:45 ` Alex Shi
  2020-11-02 15:10   ` Johannes Weiner
  2020-10-29 10:45 ` [PATCH v20 16/20] mm/compaction: do page isolation first in compaction Alex Shi
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:45 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

Currently lru_lock still guards both the lru list and the page's lru
bit, which is fine. But if we want to take a lruvec-specific lock for a
page, we need to pin down the page's lruvec/memcg while locking. Just
taking the lruvec lock first may be undermined by the page's memcg
charge/migration. To fix this, we can clear the lru bit outside of the
lock and use that as the pinning action which blocks page isolation
across a memcg change.

So the standard steps of page isolation become:
	1, get_page();		#pin the page so it cannot be freed
	2, TestClearPageLRU();	#block other isolation, e.g. memcg change
	3, spin_lock on lru_lock; #serialize lru list access
	4, delete page from lru list;
Step 2 can be optimized/replaced in scenarios where the page is unlikely
to be accessed or moved between memcgs.
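
Purely for illustration (a minimal sketch, not code from this patch;
example_isolate() is a hypothetical helper), the steps above map onto
the kernel helpers used later in this series roughly like this:

	static int example_isolate(struct page *page, struct list_head *dst)
	{
		pg_data_t *pgdat = page_pgdat(page);
		struct lruvec *lruvec;

		if (unlikely(!get_page_unless_zero(page)))	/* step 1 */
			return -EBUSY;
		if (!TestClearPageLRU(page)) {			/* step 2 */
			put_page(page);	/* someone else is isolating it */
			return -EBUSY;
		}
		lruvec = mem_cgroup_page_lruvec(page, pgdat);
		spin_lock_irq(&pgdat->lru_lock);		/* step 3 */
		del_page_from_lru_list(page, lruvec, page_lru(page)); /* step 4 */
		list_add(&page->lru, dst);
		spin_unlock_irq(&pgdat->lru_lock);
		return 0;
	}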

This patch starts with the first part: TestClearPageLRU, which combines
the PageLRU check and ClearPageLRU into one macro function,
TestClearPageLRU. It will be used as the page isolation precondition, to
prevent other isolations elsewhere. As a result there may be !PageLRU
pages on an lru list, so the corresponding BUG() checks need to be
removed.

There are 2 rules for the lru bit now:
1, the lru bit still indicates whether a page is on an lru list, except
   for the temporary moment of isolation: a page may have its lru bit
   cleared while still on the lru list, but a page with the lru bit set
   must be on an lru list.
2, the lru bit has to be cleared before the page is deleted from the
   lru list.

As Andrew Morton mentioned, this change dirties the cacheline for a page
that isn't on the LRU. But the loss is acceptable according to Rong Chen
<rong.a.chen@intel.com>'s report:
https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/page-flags.h |  1 +
 mm/mlock.c                 |  3 +--
 mm/vmscan.c                | 39 +++++++++++++++++++--------------------
 3 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4f6ba9379112..14a0cac9e099 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -335,6 +335,7 @@ static inline void page_init_poison(struct page *page, size_t size)
 PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+	TESTCLEARFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
 PAGEFLAG(Workingset, workingset, PF_HEAD)
diff --git a/mm/mlock.c b/mm/mlock.c
index d487aa864e86..7b0e6334be6f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -276,10 +276,9 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (PageLRU(page)) {
+			if (TestClearPageLRU(page)) {
 				struct lruvec *lruvec;
 
-				ClearPageLRU(page);
 				lruvec = mem_cgroup_page_lruvec(page,
 							page_pgdat(page));
 				del_page_from_lru_list(page, lruvec,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 845733afccde..ce4ab932805c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1542,7 +1542,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
  */
 int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 {
-	int ret = -EINVAL;
+	int ret = -EBUSY;
 
 	/* Only take pages on the LRU. */
 	if (!PageLRU(page))
@@ -1552,8 +1552,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
 		return ret;
 
-	ret = -EBUSY;
-
 	/*
 	 * To minimise LRU disruption, the caller can indicate that it only
 	 * wants to isolate pages it will be able to operate on without
@@ -1600,8 +1598,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 		 * sure the page is not being freed elsewhere -- the
 		 * page release code relies on it.
 		 */
-		ClearPageLRU(page);
-		ret = 0;
+		if (TestClearPageLRU(page))
+			ret = 0;
+		else
+			put_page(page);
 	}
 
 	return ret;
@@ -1667,8 +1667,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
 
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-
 		nr_pages = compound_nr(page);
 		total_scan += nr_pages;
 
@@ -1765,21 +1763,18 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&pgdat->lru_lock);
+		get_page(page);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		if (PageLRU(page)) {
-			int lru = page_lru(page);
-			get_page(page);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, lru);
-			ret = 0;
-		}
+		spin_lock_irq(&pgdat->lru_lock);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
 		spin_unlock_irq(&pgdat->lru_lock);
+		ret = 0;
 	}
+
 	return ret;
 }
 
@@ -4289,6 +4284,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		nr_pages = thp_nr_pages(page);
 		pgscanned += nr_pages;
 
+		/* block memcg migration during page moving between lru */
+		if (!TestClearPageLRU(page))
+			continue;
+
 		if (pagepgdat != pgdat) {
 			if (pgdat)
 				spin_unlock_irq(&pgdat->lru_lock);
@@ -4297,10 +4296,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		}
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-		if (!PageLRU(page) || !PageUnevictable(page))
-			continue;
-
-		if (page_evictable(page)) {
+		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
 
 			VM_BUG_ON_PAGE(PageActive(page), page);
@@ -4309,12 +4305,15 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 			add_page_to_lru_list(page, lruvec, lru);
 			pgrescued += nr_pages;
 		}
+		SetPageLRU(page);
 	}
 
 	if (pgdat) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 		spin_unlock_irq(&pgdat->lru_lock);
+	} else if (pgscanned) {
+		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 16/20] mm/compaction: do page isolation first in compaction
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (14 preceding siblings ...)
  2020-10-29 10:45 ` [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU Alex Shi
@ 2020-10-29 10:45 ` Alex Shi
  2020-11-02 15:18   ` Johannes Weiner
  2020-10-29 10:45 ` [PATCH v20 17/20] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn Alex Shi
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:45 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Currently, compaction takes the lru_lock and then does page isolation,
which works fine with pgdat->lru_lock since every page isolation
competes for that one lock. If we want to change to a memcg lru_lock, we
have to isolate the page before taking the lru_lock, so that isolation
blocks the page's memcg change, which in turn relies on page isolation.
Then we can safely use the per-memcg lru_lock later.

The new page isolation uses the previously introduced TestClearPageLRU()
plus pgdat lru locking, which will be switched to the memcg lru lock
later.

Hugh Dickins <hughd@google.com> fixed the following bugs in an early
version of this patch:

Fix lots of crashes under compaction load: isolate_migratepages_block()
must clean up appropriately when rejecting a page, setting PageLRU again
if it had been cleared; and a put_page() after get_page_unless_zero()
cannot safely be done while holding locked_lruvec - it may turn out to
be the final put_page(), which will take an lruvec lock when PageLRU.
And move __isolate_lru_page_prepare back after get_page_unless_zero to
make trylock_page() safe:
trylock_page() is not safe to use at this time: its setting of PG_locked
can race with the page being freed or allocated ("Bad page"), and can
also erase flags being set by one of those "sole owners" of a freshly
allocated page who use the non-atomic __SetPageFlag().
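
Condensed, the new ordering inside isolate_migratepages_block() is
roughly the following (a sketch of the hunks below, not the complete
function; on failure, isolate_fail_put drops the lru_lock if held and
puts the page reference):

	if (unlikely(!get_page_unless_zero(page)))	/* pin the page first */
		goto isolate_fail;
	if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
		goto isolate_fail_put;
	if (!TestClearPageLRU(page))			/* then claim the lru bit */
		goto isolate_fail_put;
	if (!locked)					/* only now take lru_lock */
		locked = compact_lock_irqsave(&pgdat->lru_lock, &flags, cc);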

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 +-
 mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 43 ++++++++++++++++++++++---------------------
 3 files changed, 56 insertions(+), 31 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5e1e967c225f..596bc2f4d9b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -356,7 +356,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
+extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index 6e0ee5641788..75f7973605f4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -886,6 +886,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
 			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
 				low_pfn = end_pfn;
+				page = NULL;
 				goto isolate_abort;
 			}
 			valid_page = page;
@@ -967,6 +968,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
 			goto isolate_fail;
 
+		/*
+		 * Be careful not to clear PageLRU until after we're
+		 * sure the page is not being freed elsewhere -- the
+		 * page release code relies on it.
+		 */
+		if (unlikely(!get_page_unless_zero(page)))
+			goto isolate_fail;
+
+		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
+			goto isolate_fail_put;
+
+		/* Try isolate the page */
+		if (!TestClearPageLRU(page))
+			goto isolate_fail_put;
+
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
 			locked = compact_lock_irqsave(&pgdat->lru_lock,
@@ -979,10 +995,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 					goto isolate_abort;
 			}
 
-			/* Recheck PageLRU and PageCompound under lock */
-			if (!PageLRU(page))
-				goto isolate_fail;
-
 			/*
 			 * Page become compound since the non-locked check,
 			 * and it's on LRU. It can only be a THP so the order
@@ -990,16 +1002,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
 				low_pfn += compound_nr(page) - 1;
-				goto isolate_fail;
+				SetPageLRU(page);
+				goto isolate_fail_put;
 			}
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-		/* Try isolate the page */
-		if (__isolate_lru_page(page, isolate_mode) != 0)
-			goto isolate_fail;
-
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
 			low_pfn += compound_nr(page) - 1;
@@ -1028,6 +1037,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		}
 
 		continue;
+
+isolate_fail_put:
+		/* Avoid potential deadlock in freeing page under lru_lock */
+		if (locked) {
+			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			locked = false;
+		}
+		put_page(page);
+
 isolate_fail:
 		if (!skip_on_failure)
 			continue;
@@ -1064,9 +1082,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	if (unlikely(low_pfn > end_pfn))
 		low_pfn = end_pfn;
 
+	page = NULL;
+
 isolate_abort:
 	if (locked)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (page) {
+		SetPageLRU(page);
+		put_page(page);
+	}
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ce4ab932805c..e28df9cb5be3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1540,7 +1540,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, isolate_mode_t mode)
+int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
 {
 	int ret = -EBUSY;
 
@@ -1592,22 +1592,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
 		return ret;
 
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * Be careful not to clear PageLRU until after we're
-		 * sure the page is not being freed elsewhere -- the
-		 * page release code relies on it.
-		 */
-		if (TestClearPageLRU(page))
-			ret = 0;
-		else
-			put_page(page);
-	}
-
-	return ret;
+	return 0;
 }
 
-
 /*
  * Update LRU sizes after isolating pages. The LRU size updates must
  * be complete before mem_cgroup_update_lru_size due to a sanity check.
@@ -1687,20 +1674,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		 * only when the page is being freed somewhere else.
 		 */
 		scan += nr_pages;
-		switch (__isolate_lru_page(page, mode)) {
+		switch (__isolate_lru_page_prepare(page, mode)) {
 		case 0:
+			/*
+			 * Be careful not to clear PageLRU until after we're
+			 * sure the page is not being freed elsewhere -- the
+			 * page release code relies on it.
+			 */
+			if (unlikely(!get_page_unless_zero(page)))
+				goto busy;
+
+			if (!TestClearPageLRU(page)) {
+				/*
+				 * This page may in other isolation path,
+				 * but we still hold lru_lock.
+				 */
+				put_page(page);
+				goto busy;
+			}
+
 			nr_taken += nr_pages;
 			nr_zone_taken[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
 			break;
 
-		case -EBUSY:
+		default:
+busy:
 			/* else it is being freed elsewhere */
 			list_move(&page->lru, src);
-			continue;
-
-		default:
-			BUG();
 		}
 	}
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 17/20] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (15 preceding siblings ...)
  2020-10-29 10:45 ` [PATCH v20 16/20] mm/compaction: do page isolation first in compaction Alex Shi
@ 2020-10-29 10:45 ` Alex Shi
  2020-11-02 15:20   ` Johannes Weiner
  2020-10-29 10:45 ` [PATCH v20 18/20] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:45 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Hugh Dickins found a memcg-change bug in the original version: if we
want to change pgdat->lru_lock to the memcg's lruvec lock, we have to
serialize against mem_cgroup_move_account() during pagevec_lru_move_fn().
The possible bad scenario looks like:

	cpu 0					cpu 1
lruvec = mem_cgroup_page_lruvec()
					if (!isolate_lru_page())
						mem_cgroup_move_account

spin_lock_irqsave(&lruvec->lru_lock)	<== wrong lock.

So we need TestClearPageLRU to block isolate_lru_page(), which
serializes the memcg change; the PageLRU check in each move_fn callee is
then removed as a consequence.

__pagevec_lru_add_fn() is different from the others, because the pages
it deals with are, by definition, not yet on the lru.  TestClearPageLRU
is not needed and would not work, so __pagevec_lru_add() goes its own
way.
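
Condensed from the pagevec_lru_move_fn() hunk below, the loop body
becomes (sketch only):

	/* cpu 1's isolate_lru_page() now fails, so the memcg cannot move */
	if (!TestClearPageLRU(page))
		continue;

	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	(*move_fn)(page, lruvec);	/* page's memcg is stable here */
	SetPageLRU(page);		/* release the isolation "pin" */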

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 44 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 8798bd899db0..9e30f096309b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -222,8 +222,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(&pgdat->lru_lock, flags);
 		}
 
+		/* block memcg migration during page moving between lru */
+		if (!TestClearPageLRU(page))
+			continue;
+
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec);
+
+		SetPageLRU(page);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -233,7 +239,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
@@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page)
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+	if (!PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = thp_nr_pages(page);
 
@@ -362,7 +368,8 @@ static void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
+	if (PageLRU(page))
+		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -519,9 +526,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 	bool active;
 	int nr_pages = thp_nr_pages(page);
 
-	if (!PageLRU(page))
-		return;
-
 	if (PageUnevictable(page))
 		return;
 
@@ -562,7 +566,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = thp_nr_pages(page);
 
@@ -579,7 +583,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	if (PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		bool active = PageActive(page);
 		int nr_pages = thp_nr_pages(page);
@@ -1018,7 +1022,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
+	int i;
+	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec;
+	unsigned long flags = 0;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct pglist_data *pagepgdat = page_pgdat(page);
+
+		if (pagepgdat != pgdat) {
+			if (pgdat)
+				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			pgdat = pagepgdat;
+			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		__pagevec_lru_add_fn(page, lruvec);
+	}
+	if (pgdat)
+		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	release_pages(pvec->pages, pvec->nr);
+	pagevec_reinit(pvec);
 }
 
 /**
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 18/20] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (16 preceding siblings ...)
  2020-10-29 10:45 ` [PATCH v20 17/20] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn Alex Shi
@ 2020-10-29 10:45 ` Alex Shi
  2020-10-30  2:49   ` Alex Shi
  2020-10-29 10:45 ` [PATCH v20 19/20] mm/lru: introduce the relock_page_lruvec function Alex Shi
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:45 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi

This patch moves the per-node lru_lock into the lruvec, thus providing
one lru_lock for each memcg per node. So on a large machine, memcgs no
longer have to suffer contention on the per-node pgdat->lru_lock; they
can go fast with their own lru_lock.

Since the memcg charge was moved to before lru insertion, page isolation
can serialize the page's memcg, so the per-memcg lruvec lock is stable
and can replace the per-node lru lock.

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function to the locking which may give some clues if
something gets out of hand.

Daniel Jordan's testing shows a 62% improvement on a modified readtwice
case on his 2P * 10 core * 2 HT Broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

On a large machine with memcg enabled but not used, looking up the
page's lruvec walks a few extra pointers, which may increase lru_lock
hold time and cause a slight regression.

Hugh Dickins helped with the patch polish, thanks!
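
The typical conversion at call sites is small; taken from the
__page_cache_release() hunk below (shown here just as a usage sketch of
the new lock_page_lruvec_irqsave()/unlock_page_lruvec_irqrestore() API):

	struct lruvec *lruvec;
	unsigned long flags;

	/* look up the page's lruvec and lock it in one step */
	lruvec = lock_page_lruvec_irqsave(page, &flags);
	VM_BUG_ON_PAGE(!PageLRU(page), page);
	__ClearPageLRU(page);
	del_page_from_lru_list(page, lruvec, page_off_lru(page));
	unlock_page_lruvec_irqrestore(lruvec, flags);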

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++++
 include/linux/mmzone.h     |   3 +-
 mm/compaction.c            |  56 +++++++++++++++---------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  62 ++++++++++++++++++++++++--
 mm/mlock.c                 |  22 +++++++---
 mm/mmzone.c                |   1 +
 mm/page_alloc.c            |   1 -
 mm/swap.c                  | 105 +++++++++++++++++++++------------------------
 mm/vmscan.c                |  55 +++++++++++-------------
 10 files changed, 249 insertions(+), 125 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e391e3c56de5..f447a1bfa654 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -497,6 +497,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1038,6 +1051,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1285,6 +1323,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1414,6 +1456,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..0afba4ea2a21 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,8 @@ enum lruvec_flags {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
 	 * over the other. As the observed cost of reclaiming one LRU
@@ -796,7 +798,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 75f7973605f4..a69784820324 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -804,7 +804,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -864,11 +864,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked) {
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -940,9 +949,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+					unlock_page_lruvec_irqrestore(locked, flags);
+					locked = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -983,10 +991,19 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked) {
+			if (locked)
+				unlock_page_lruvec_irqrestore(locked, flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -1005,9 +1022,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1041,8 +1057,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+			unlock_page_lruvec_irqrestore(locked, flags);
+			locked = NULL;
 		}
 		put_page(page);
 
@@ -1057,8 +1073,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1086,7 +1102,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_abort:
 	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(locked, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5fa890e26975..9b3e6479c0c4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2352,7 +2352,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2436,7 +2436,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2454,10 +2453,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2478,7 +2475,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 599aa8863111..0c97292834fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1330,6 +1330,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1367,6 +1380,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
@@ -3270,10 +3328,8 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
 /*
- * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * Because page->mem_cgroup is not set on compound tails, set it now.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 7b0e6334be6f..ab164a675c25 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -262,12 +262,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -277,10 +277,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *lruvec;
+				struct lruvec *new_lruvec;
+
+				new_lruvec = mem_cgroup_page_lruvec(page,
+						page_pgdat(page));
+				if (new_lruvec != lruvec) {
+					if (lruvec)
+						unlock_page_lruvec_irq(lruvec);
+					lruvec = lock_page_lruvec_irq(page);
+				}
 
-				lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
@@ -299,8 +305,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	} else if (delta_munlocked) {
+		mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23f5066bd4a5..713e6554becd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6791,7 +6791,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 9e30f096309b..580ea18a9596 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,16 +79,14 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -207,32 +205,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -364,13 +359,13 @@ static inline void activate_page_drain(int cpu)
 
 static void activate_page(struct page *page)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+		__activate_page(page, lruvec);
+	unlock_page_lruvec_irq(lruvec);
 }
 #endif
 
@@ -860,8 +855,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags;
 	unsigned int lock_batch;
 
@@ -871,11 +865,11 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		page = compound_head(page);
@@ -883,10 +877,9 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -904,27 +897,27 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -934,8 +927,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -1023,26 +1016,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e28df9cb5be3..9e726b587d74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1765,14 +1765,12 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1839,7 +1837,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
@@ -1850,9 +1847,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1874,9 +1871,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
@@ -1949,7 +1946,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1961,7 +1958,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1969,7 +1966,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1978,7 +1975,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2031,7 +2028,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2042,7 +2039,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2088,7 +2085,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2099,7 +2096,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2689,10 +2686,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4268,16 +4265,15 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
 		int nr_pages;
+		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4289,13 +4285,12 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
@@ -4309,10 +4304,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		SetPageLRU(page);
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	} else if (pgscanned) {
 		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 19/20] mm/lru: introduce the relock_page_lruvec function
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (17 preceding siblings ...)
  2020-10-29 10:45 ` [PATCH v20 18/20] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2020-10-29 10:45 ` Alex Shi
  2020-11-02 20:44   ` Johannes Weiner
  2020-10-29 10:45 ` [PATCH v20 20/20] mm/lru: revise the comments of lru_lock Alex Shi
  2020-11-04 11:55 ` [PATCH v20 00/20] per memcg lru lock Alex Shi
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:45 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Thomas Gleixner, Andrey Ryabinin

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Use this new function to replace the repeated, identical code; no
functional change.

When testing for relock we can avoid the need for RCU locking if we
simply compare the page's pgdat and memcg pointers against those the
lruvec is holding. By doing this we avoid the extra pointer walks and
accesses of the memory cgroup.

In addition, we can skip the checks entirely if lruvec is currently
NULL.
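
A typical walk over a batch of pages then collapses to the following
pattern (an illustrative sketch based on the hunks below, not a complete
caller):

	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		if (!TestClearPageLRU(page))
			continue;

		/* relocks only when the page belongs to another lruvec */
		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
		(*move_fn)(page, lruvec);
		SetPageLRU(page);
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);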

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/mlock.c                 | 11 +---------
 mm/swap.c                  | 33 +++++++----------------------
 mm/vmscan.c                | 12 ++---------
 4 files changed, 62 insertions(+), 46 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f447a1bfa654..3c5c5c433167 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -491,6 +491,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	const struct mem_cgroup *memcg;
+	struct mem_cgroup_per_node *mz;
+
+	if (mem_cgroup_disabled())
+		return lruvec == &pgdat->__lruvec;
+
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+	memcg = page->mem_cgroup ? : root_mem_cgroup;
+
+	return lruvec->pgdat == pgdat && mz->memcg == memcg;
+}
+
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -1026,6 +1042,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->__lruvec;
 }
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+
+	return lruvec == &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
@@ -1472,6 +1496,34 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 }
 
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+		struct lruvec *locked_lruvec)
+{
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
+
+		unlock_page_lruvec_irq(locked_lruvec);
+	}
+
+	return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+		struct lruvec *locked_lruvec, unsigned long *flags)
+{
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
+
+		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+	}
+
+	return lock_page_lruvec_irqsave(page, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/mlock.c b/mm/mlock.c
index ab164a675c25..55b3b3672977 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -277,16 +277,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *new_lruvec;
-
-				new_lruvec = mem_cgroup_page_lruvec(page,
-						page_pgdat(page));
-				if (new_lruvec != lruvec) {
-					if (lruvec)
-						unlock_page_lruvec_irq(lruvec);
-					lruvec = lock_page_lruvec_irq(page);
-				}
-
+				lruvec = relock_page_lruvec_irq(page, lruvec);
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
diff --git a/mm/swap.c b/mm/swap.c
index 580ea18a9596..9fe5ff9a8111 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -210,19 +210,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
-
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
@@ -906,17 +899,12 @@ void release_pages(struct page **pages, int nr)
 		}
 
 		if (PageLRU(page)) {
-			struct lruvec *new_lruvec;
-
-			new_lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
-			if (new_lruvec != lruvec) {
-				if (lruvec)
-					unlock_page_lruvec_irqrestore(lruvec,
-									flags);
+			struct lruvec *prev_lruvec = lruvec;
+
+			lruvec = relock_page_lruvec_irqsave(page, lruvec,
+									&flags);
+			if (prev_lruvec != lruvec)
 				lock_batch = 0;
-				lruvec = lock_page_lruvec_irqsave(page, &flags);
-			}
 
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
@@ -1021,15 +1009,8 @@ void __pagevec_lru_add(struct pagevec *pvec)
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
-
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
 
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
 	if (lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9e726b587d74..ee0b08a67d2d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1880,8 +1880,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			continue;
 		}
 
-		VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page))
-							!= lruvec, page);
+		VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
 		lru = page_lru(page);
 		nr_pages = thp_nr_pages(page);
 
@@ -4273,7 +4272,6 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
 		int nr_pages;
-		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4285,13 +4283,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
-
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v20 20/20] mm/lru: revise the comments of lru_lock
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (18 preceding siblings ...)
  2020-10-29 10:45 ` [PATCH v20 19/20] mm/lru: introduce the relock_page_lruvec function Alex Shi
@ 2020-10-29 10:45 ` Alex Shi
  2020-11-02 20:46   ` Johannes Weiner
  2020-11-04 11:55 ` [PATCH v20 00/20] per memcg lru lock Alex Shi
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-29 10:45 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Andrey Ryabinin, Jann Horn

From: Hugh Dickins <hughd@google.com>

Since we changed pgdat->lru_lock to lruvec->lru_lock, it's time to fix the
comments in the code that are now incorrect. Also fix some stale
zone->lru_lock comments left over from even older times.

I struggled to understand the comment above move_pages_to_lru() (surely
it never calls page_referenced()), and eventually realized that most of
it had got separated from shrink_active_list(): move that comment back.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 ++------
 Documentation/admin-guide/cgroup-v1/memory.rst     | 21 +++++------
 Documentation/trace/events-kmem.rst                |  2 +-
 Documentation/vm/unevictable-lru.rst               | 22 +++++-------
 include/linux/mm_types.h                           |  2 +-
 include/linux/mmzone.h                             |  3 +-
 mm/filemap.c                                       |  4 +--
 mm/rmap.c                                          |  4 +--
 mm/vmscan.c                                        | 41 ++++++++++++----------
 9 files changed, 50 insertions(+), 64 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 ======
-        Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global pgdat->lru_lock).
-	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under pgdat->lru_lock.
-
-	A special function is mem_cgroup_isolate_pages(). This scans
-	memcg's private LRU and call __isolate_lru_page() to extract a page
-	from LRU.
-
-	(By __isolate_lru_page(), the page is removed from both of global and
-	private LRU.)
-
+	Each memcg has its own vector of LRUs (inactive anon, active anon,
+	inactive file, active file, unevictable) of pages from each node,
+	each LRU handled under a single lru_lock for that memcg and node.
 
 9. Typical Tests.
 =================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 12757e63b26c..24450696579f 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered.
 2.6 Locking
 -----------
 
-   lock_page_cgroup()/unlock_page_cgroup() should not be called under
-   the i_pages lock.
+Lock order is as follows:
 
-   Other lock order is following:
+  Page lock (PG_locked bit of page->flags)
+    mm->page_table_lock or split pte_lock
+      lock_page_memcg (memcg->move_lock)
+        mapping->i_pages lock
+          lruvec->lru_lock.
 
-   PG_locked.
-     mm->page_table_lock
-         pgdat->lru_lock
-	   lock_page_cgroup.
-
-  In many cases, just lock_page_cgroup() is called.
-
-  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  pgdat->lru_lock, it has no lock of its own.
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 
 4. Per-CPU Allocator Activity
 =============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux.  The problems have been observed at customer sites on large
 memory x86_64 systems.
 
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone.  When a large
+main memory will have over 32 million 4k pages in a single node.  When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable.  This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
 The Unevictable Page List
 -------------------------
 
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
 
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
 swap-backed pages.  This differentiation is only important while the pages are,
 in fact, evictable.
 
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
 
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock.  This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
 
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
 
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element).  The memory controller tracks the movement of pages to
 and from the unevictable list.
 
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
 active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked.  Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list.  If
+is now mlocked and divert the page to the node's unevictable list.  If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
 
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
      unevictable list in mlock_vma_page().
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5a9238f6caad..c3fdd8638a6f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock.  Sometimes used as a generic list
+			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0afba4ea2a21..1299b8ce64d3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
- * So add a wild amount of padding here to ensure that they fall into separate
+ * Add a wild amount of padding here to ensure datas fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index d5e7c2029d16..5d81946d6873 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -102,8 +102,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 1b84945d655c..c050dab2ae65 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -28,12 +28,12 @@
  *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
  *                     i_pages lock (widely used)
+ *                       lruvec->lru_lock (in lock_page_lruvec_irq)
  *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                   sb_lock (within inode_lock in fs/fs-writeback.c)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee0b08a67d2d..7ed10ade548d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1614,14 +1614,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended.  Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan:	The number of eligible pages to look through on the list.
  * @lruvec:	The LRU vector to pull pages from.
@@ -1815,25 +1817,11 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 }
 
 /*
- * This moves pages from @list to corresponding LRU list.
- *
- * We move them the other way if the page is referenced by one or more
- * processes, from rmap.
- *
- * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation.  But if
- * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page.  It's impossible to balance
- * this, so instead we remove the pages from the LRU while processing them.
- * It is safe to rely on PG_active against the non-LRU pages in here because
- * nobody will play with that bit on a non-LRU page.
- *
- * The downside is that we have to touch page->_refcount against each page.
- * But we had to alter page->flags anyway.
+ * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
+ * On return, @list is reused as a list of pages to be freed by the caller.
  *
  * Returns the number of pages moved to the given lruvec.
  */
-
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
@@ -2008,6 +1996,23 @@ static int current_may_throttle(void)
 	return nr_reclaimed;
 }
 
+/*
+ * shrink_active_list() moves pages from the active LRU to the inactive LRU.
+ *
+ * We move them the other way if the page is referenced by one or more
+ * processes.
+ *
+ * If the pages are mostly unmapped, the processing is fast and it is
+ * appropriate to hold lru_lock across the whole operation.  But if
+ * the pages are mapped, the processing is slow (page_referenced()), so
+ * we should drop lru_lock around each page.  It's impossible to balance
+ * this, so instead we remove the pages from the LRU while processing them.
+ * It is safe to rely on PG_active against the non-LRU pages in here because
+ * nobody will play with that bit on a non-LRU page.
+ *
+ * The downside is that we have to touch page->_refcount against each page.
+ * But we had to alter page->flags anyway.
+ */
 static void shrink_active_list(unsigned long nr_to_scan,
 			       struct lruvec *lruvec,
 			       struct scan_control *sc,
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 11/20] mm/lru: move lock into lru_note_cost
  2020-10-29 10:44 ` [PATCH v20 11/20] mm/lru: move lock into lru_note_cost Alex Shi
@ 2020-10-29 13:42   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-10-29 13:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Thu, Oct 29, 2020 at 06:44:56PM +0800, Alex Shi wrote:
> We have to move lru_lock into lru_note_cost, since it cycle up on memcg
> tree, for future per lruvec lru_lock replace. It's a bit ugly and may
> cost a bit more locking, but benefit from multiple memcg locking could
> cover the lost.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 01/20] mm/memcg: warning on !memcg after readahead page charged
  2020-10-29 10:44 ` [PATCH v20 01/20] mm/memcg: warning on !memcg after readahead page charged Alex Shi
@ 2020-10-29 13:43   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-10-29 13:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko

On Thu, Oct 29, 2020 at 06:44:46PM +0800, Alex Shi wrote:
> Add VM_WARN_ON_ONCE_PAGE() macro.
> 
> Since readahead page is charged on memcg too, in theory we don't have to
> check this exception now. Before safely remove them all, add a warning
> for the unexpected !memcg.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled
  2020-10-29 10:44 ` [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled Alex Shi
@ 2020-10-29 13:46   ` Johannes Weiner
  2020-10-30  2:27     ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-10-29 13:46 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko

On Thu, Oct 29, 2020 at 06:44:47PM +0800, Alex Shi wrote:
> If we disabled memcg by cgroup_disable=memory, page->memcg will be NULL
> and so the charge is skipped and that will trigger a warning like below.
> Let's return from the funcs earlier.
> 
>  anon flags:0x5005b48008000d(locked|uptodate|dirty|swapbacked)
>  raw: 005005b48008000d dead000000000100 dead000000000122 ffff8897c7c76ad1
>  raw: 0000000000000022 0000000000000000 0000000200000000 0000000000000000
>  page dumped because: VM_WARN_ON_ONCE_PAGE(!memcg)
> ...
>  RIP: 0010:vprintk_emit+0x1f7/0x260
>  Code: 00 84 d2 74 72 0f b6 15 27 58 64 01 48 c7 c0 00 d4 72 82 84 d2 74 09 f3 90 0f b6 10 84 d2 75 f7 e8 de 0d 00 00 4c 89 e7 57 9d <0f> 1f 44 00 00 e9 62 ff ff ff 80 3d 88 c9 3a 01 00 0f 85 54 fe ff
>  RSP: 0018:ffffc9000faab358 EFLAGS: 00000202
>  RAX: ffffffff8272d400 RBX: 000000000000005e RCX: ffff88afd80d0040
>  RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000202
>  RBP: ffffc9000faab3a8 R08: ffffffff8272d440 R09: 0000000000022480
>  R10: 00120c77be68bfac R11: 0000000000cd7568 R12: 0000000000000202
>  R13: 0057ffffc0080005 R14: ffffffff820a0130 R15: ffffc9000faab3e8
>  ? vprintk_emit+0x140/0x260
>  vprintk_default+0x1a/0x20
>  vprintk_func+0x4f/0xc4
>  ? vprintk_func+0x4f/0xc4
>  printk+0x53/0x6a
>  ? xas_load+0xc/0x80
>  __dump_page.cold.6+0xff/0x4ee
>  ? xas_init_marks+0x23/0x50
>  ? xas_store+0x30/0x40
>  ? free_swap_slot+0x43/0xd0
>  ? put_swap_page+0x119/0x320
>  ? update_load_avg+0x82/0x580
>  dump_page+0x9/0xb
>  mem_cgroup_try_charge_swap+0x16e/0x1d0
>  get_swap_page+0x130/0x210
>  add_to_swap+0x41/0xc0
>  shrink_page_list+0x99e/0xdf0
>  shrink_inactive_list+0x199/0x360
>  shrink_lruvec+0x40d/0x650
>  ? _cond_resched+0x14/0x30
>  ? _cond_resched+0x14/0x30
>  shrink_node+0x226/0x6e0
>  do_try_to_free_pages+0xd0/0x400
>  try_to_free_pages+0xef/0x130
>  __alloc_pages_slowpath.constprop.127+0x38d/0xbd0
>  ? ___slab_alloc+0x31d/0x6f0
>  __alloc_pages_nodemask+0x27f/0x2c0
>  alloc_pages_vma+0x75/0x220
>  shmem_alloc_page+0x46/0x90
>  ? release_pages+0x1ae/0x410
>  shmem_alloc_and_acct_page+0x77/0x1c0
>  shmem_getpage_gfp+0x162/0x910
>  shmem_fault+0x74/0x210
>  ? filemap_map_pages+0x29c/0x410
>  __do_fault+0x37/0x190
>  handle_mm_fault+0x120a/0x1770
>  exc_page_fault+0x251/0x450
>  ? asm_exc_page_fault+0x8/0x30
>  asm_exc_page_fault+0x1e/0x30
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Reviewed-by: Roman Gushchin <guro@fb.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

This should go in before the previous patch that adds the WARN for it.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c
  2020-10-29 10:44 ` [PATCH v20 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
@ 2020-10-29 13:47   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-10-29 13:47 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Thu, Oct 29, 2020 at 06:44:48PM +0800, Alex Shi wrote:
> The func is only used in huge_memory.c, defining it in other file with a
> CONFIG_TRANSPARENT_HUGEPAGE macro restrict just looks weird.
> 
> Let's move it THP. And make it static as Hugh Dickin suggested.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail
  2020-10-29 10:44 ` [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail Alex Shi
@ 2020-10-29 13:50   ` Johannes Weiner
  2020-10-30  2:46     ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-10-29 13:50 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Thu, Oct 29, 2020 at 06:44:49PM +0800, Alex Shi wrote:
> Since the first parameter is only used by head page, it's better to make
> it explicit.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/huge_memory.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 038db815ebba..93c0b73eb8c6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2346,19 +2346,19 @@ static void remap_page(struct page *page, unsigned int nr)
>  	}
>  }
>  
> -static void lru_add_page_tail(struct page *page, struct page *page_tail,
> +static void lru_add_page_tail(struct page *head, struct page *page_tail,

It may be better to pick either
	head and tail
or
	page_head and page_tail

?


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail()
  2020-10-29 10:44 ` [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail() Alex Shi
@ 2020-10-29 14:00   ` Johannes Weiner
  2020-10-30  2:48   ` Alex Shi
  1 sibling, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-10-29 14:00 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Mika Penttilä

On Thu, Oct 29, 2020 at 06:44:50PM +0800, Alex Shi wrote:
> Simplify lru_add_page_tail(), there are actually only two cases possible:
> split_huge_page_to_list(), with list supplied and head isolated from lru
> by its caller; or split_huge_page(), with NULL list and head on lru -
> because when head is racily isolated from lru, the isolator's reference
> will stop the split from getting any further than its page_ref_freeze().
> 
> So decide between the two cases by "list", but add VM_WARN_ON()s to
> verify that they match our lru expectations.
> 
> [Hugh Dickins: rewrite commit log]
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Mika Penttilä <mika.penttila@nextfour.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled
  2020-10-29 13:46   ` Johannes Weiner
@ 2020-10-30  2:27     ` Alex Shi
  2020-10-30 14:04       ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-30  2:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko



On 2020/10/29 9:46 PM, Johannes Weiner wrote:
>>  ? release_pages+0x1ae/0x410
>>  shmem_alloc_and_acct_page+0x77/0x1c0
>>  shmem_getpage_gfp+0x162/0x910
>>  shmem_fault+0x74/0x210
>>  ? filemap_map_pages+0x29c/0x410
>>  __do_fault+0x37/0x190
>>  handle_mm_fault+0x120a/0x1770
>>  exc_page_fault+0x251/0x450
>>  ? asm_exc_page_fault+0x8/0x30
>>  asm_exc_page_fault+0x1e/0x30
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Reviewed-by: Roman Gushchin <guro@fb.com>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>> Acked-by: Hugh Dickins <hughd@google.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: cgroups@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> This should go in before the previous patch that adds the WARN for it.

Right, but then the long oops in the changelog may look odd, since the
warning it shows isn't added until the following patch. Should I remove
the oops and resend the whole patchset?

Which way is more convenient for you?

Thanks
Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail
  2020-10-29 13:50   ` Johannes Weiner
@ 2020-10-30  2:46     ` Alex Shi
  2020-10-30 13:52       ` Johannes Weiner
  2020-11-02 16:03       ` Matthew Wilcox
  0 siblings, 2 replies; 67+ messages in thread
From: Alex Shi @ 2020-10-30  2:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/10/29 9:50 PM, Johannes Weiner wrote:
> On Thu, Oct 29, 2020 at 06:44:49PM +0800, Alex Shi wrote:
>> Since the first parameter is only used by head page, it's better to make
>> it explicit.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Acked-by: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  mm/huge_memory.c | 12 ++++++------
>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 038db815ebba..93c0b73eb8c6 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2346,19 +2346,19 @@ static void remap_page(struct page *page, unsigned int nr)
>>  	}
>>  }
>>  
>> -static void lru_add_page_tail(struct page *page, struct page *page_tail,
>> +static void lru_add_page_tail(struct page *head, struct page *page_tail,
> 
> It may be better to pick either
> 	head and tail

Hi Johannes,

Thanks for comments!

Right. Considering that functions in this file mostly use head/tail as
parameter names, I will change this one to head/tail too. The 04th, 05th,
and 18th patches will then be updated accordingly.

Thanks
Alex

> or
> 	page_head and page_tail
> 
> ?
> 

From a9ee63a213f40eb4d5a69b52fbb348ff9cd7cf6c Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 26 May 2020 16:49:22 +0800
Subject: [PATCH v21 04/20] mm/thp: use head for head page in lru_add_page_tail

Since the first parameter is only used for the head page, it's better to
make that explicit.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 038db815ebba..32a4bf5b80c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2346,33 +2346,32 @@ static void remap_page(struct page *page, unsigned int nr)
 	}
 }
 
-static void lru_add_page_tail(struct page *page, struct page *page_tail,
+static void lru_add_page_tail(struct page *head, struct page *tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	VM_BUG_ON_PAGE(!PageHead(head), head);
+	VM_BUG_ON_PAGE(PageCompound(tail), head);
+	VM_BUG_ON_PAGE(PageLRU(tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
 	if (!list)
-		SetPageLRU(page_tail);
+		SetPageLRU(tail);
 
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
+	if (likely(PageLRU(head)))
+		list_add_tail(&tail->lru, &head->lru);
 	else if (list) {
 		/* page reclaim is reclaiming a huge page */
-		get_page(page_tail);
-		list_add_tail(&page_tail->lru, list);
+		get_page(tail);
+		list_add_tail(&tail->lru, list);
 	} else {
 		/*
 		 * Head page has not yet been counted, as an hpage,
 		 * so we must account for each subpage individually.
 		 *
-		 * Put page_tail on the list at the correct position
+		 * Put tail on the list at the correct position
 		 * so they all end up in order.
 		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
+		add_page_to_lru_list_tail(tail, lruvec, page_lru(tail));
 	}
 }
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail()
  2020-10-29 10:44 ` [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail() Alex Shi
  2020-10-29 14:00   ` Johannes Weiner
@ 2020-10-30  2:48   ` Alex Shi
  1 sibling, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-10-30  2:48 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Mika Penttilä

Patch changed since the variable rename in the 04th patch:

From 5014c78418284f70be232a37fa3a4660a54e83c0 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Fri, 10 Jul 2020 12:53:22 +0800
Subject: [PATCH v21 05/20] mm/thp: Simplify lru_add_page_tail()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Simplify lru_add_page_tail(), there are actually only two cases possible:
split_huge_page_to_list(), with list supplied and head isolated from lru
by its caller; or split_huge_page(), with NULL list and head on lru -
because when head is racily isolated from lru, the isolator's reference
will stop the split from getting any further than its page_ref_freeze().

So decide between the two cases by "list", but add VM_WARN_ON()s to
verify that they match our lru expectations.

[Hugh Dickins: rewrite commit log]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 32a4bf5b80c8..cedcdbeb98b4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,24 +2354,16 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	VM_BUG_ON_PAGE(PageLRU(tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-	if (!list)
-		SetPageLRU(tail);
-
-	if (likely(PageLRU(head)))
-		list_add_tail(&tail->lru, &head->lru);
-	else if (list) {
+	if (list) {
 		/* page reclaim is reclaiming a huge page */
+		VM_WARN_ON(PageLRU(head));
 		get_page(tail);
 		list_add_tail(&tail->lru, list);
 	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(tail, lruvec, page_lru(tail));
+		/* head is still on lru (and we have it frozen) */
+		VM_WARN_ON(!PageLRU(head));
+		SetPageLRU(tail);
+		list_add_tail(&tail->lru, &head->lru);
 	}
 }
 
-- 
1.8.3.1




^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 18/20] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-10-29 10:45 ` [PATCH v20 18/20] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2020-10-30  2:49   ` Alex Shi
  2020-11-02 20:41     ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-10-30  2:49 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi



Patch changed since the variable renaming in the 04th patch:

From e892e74a35c27e69bebb73d2e4cff54e438f8d7d Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 18 Aug 2020 16:44:21 +0800
Subject: [PATCH v21 18/20] mm/lru: replace pgdat lru_lock with lruvec lock

This patch moves the per-node lru_lock into the lruvec, giving each memcg
its own lru_lock on each node. So on a large machine, memcgs no longer
have to suffer from contention on the per-node pgdat->lru_lock; each can
go fast with its own lru_lock.

Now that memcg charging happens before the page is inserted on the lru,
page isolation can serialize against changes of the page's memcg, so the
per-memcg lruvec lock is stable and can replace the per-node lru lock.
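
(Illustration only, not from the patch: with the lru bit acting as the pin,
a page isolation site boils down to the pattern below, as in the mm/mlock.c
hunk further down.)

	if (TestClearPageLRU(page)) {
		struct lruvec *lruvec;

		/* lru bit cleared: the page's memcg can't change under us */
		lruvec = lock_page_lruvec_irq(page);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		unlock_page_lruvec_irq(lruvec);
	}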

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function to the locking paths which may give some clues
if something gets out of hand.

Daniel Jordan's testing shows a 62% improvement on a modified readtwice
case on his 2P * 10 core * 2 HT Broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

On a large machine with memcg enabled but not used, looking up the page's
lruvec walks a few extra pointers, which may increase lru_lock holding
time and cause a slight regression.
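
(Rough sketch of that extra pointer chase, not the exact
mem_cgroup_page_lruvec() body: with memcg compiled in but unused, the
lookup is roughly

	memcg  = page->mem_cgroup ? : root_mem_cgroup;
	mz     = memcg->nodeinfo[page_to_nid(page)];
	lruvec = &mz->lruvec;

instead of going straight to &pgdat->__lruvec as the !CONFIG_MEMCG stubs
do.)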

Hugh Dickins helped polish the patch, thanks!

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rong Chen <rong.a.chen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++++
 include/linux/mmzone.h     |   3 +-
 mm/compaction.c            |  56 +++++++++++++++---------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  62 ++++++++++++++++++++++++--
 mm/mlock.c                 |  22 +++++++---
 mm/mmzone.c                |   1 +
 mm/page_alloc.c            |   1 -
 mm/swap.c                  | 105 +++++++++++++++++++++------------------------
 mm/vmscan.c                |  55 +++++++++++-------------
 10 files changed, 249 insertions(+), 125 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e391e3c56de5..f447a1bfa654 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -497,6 +497,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1038,6 +1051,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1285,6 +1323,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1414,6 +1456,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..0afba4ea2a21 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,8 @@ enum lruvec_flags {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
 	 * over the other. As the observed cost of reclaiming one LRU
@@ -796,7 +798,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 75f7973605f4..a69784820324 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -804,7 +804,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -864,11 +864,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked) {
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -940,9 +949,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+					unlock_page_lruvec_irqrestore(locked, flags);
+					locked = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -983,10 +991,19 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked) {
+			if (locked)
+				unlock_page_lruvec_irqrestore(locked, flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -1005,9 +1022,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1041,8 +1057,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+			unlock_page_lruvec_irqrestore(locked, flags);
+			locked = NULL;
 		}
 		put_page(page);
 
@@ -1057,8 +1073,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1086,7 +1102,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_abort:
 	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(locked, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7811a30080fb..7b5da37895bd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2352,7 +2352,7 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(tail), head);
 	VM_BUG_ON_PAGE(PageLRU(tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2436,7 +2436,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2454,10 +2453,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2478,7 +2475,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 599aa8863111..0c97292834fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1330,6 +1330,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1367,6 +1380,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
@@ -3270,10 +3328,8 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
 /*
- * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * Because page->mem_cgroup is not set on compound tails, set it now.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 7b0e6334be6f..ab164a675c25 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -262,12 +262,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -277,10 +277,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * so we can spare the get_page() here.
 			 */
 			if (TestClearPageLRU(page)) {
-				struct lruvec *lruvec;
+				struct lruvec *new_lruvec;
+
+				new_lruvec = mem_cgroup_page_lruvec(page,
+						page_pgdat(page));
+				if (new_lruvec != lruvec) {
+					if (lruvec)
+						unlock_page_lruvec_irq(lruvec);
+					lruvec = lock_page_lruvec_irq(page);
+				}
 
-				lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
 				del_page_from_lru_list(page, lruvec,
 							page_lru(page));
 				continue;
@@ -299,8 +305,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	} else if (delta_munlocked) {
+		mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23f5066bd4a5..713e6554becd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6791,7 +6791,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 9e30f096309b..580ea18a9596 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,16 +79,14 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -207,32 +205,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -364,13 +359,13 @@ static inline void activate_page_drain(int cpu)
 
 static void activate_page(struct page *page)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+		__activate_page(page, lruvec);
+	unlock_page_lruvec_irq(lruvec);
 }
 #endif
 
@@ -860,8 +855,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags;
 	unsigned int lock_batch;
 
@@ -871,11 +865,11 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		page = compound_head(page);
@@ -883,10 +877,9 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -904,27 +897,27 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -934,8 +927,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -1023,26 +1016,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e28df9cb5be3..9e726b587d74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1765,14 +1765,12 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1839,7 +1837,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
@@ -1850,9 +1847,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1874,9 +1871,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
@@ -1949,7 +1946,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1961,7 +1958,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1969,7 +1966,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1978,7 +1975,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2031,7 +2028,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2042,7 +2039,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2088,7 +2085,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2099,7 +2096,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2689,10 +2686,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4268,16 +4265,15 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
 		int nr_pages;
+		struct lruvec *new_lruvec;
 
 		if (PageTransTail(page))
 			continue;
@@ -4289,13 +4285,12 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		if (!TestClearPageLRU(page))
 			continue;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (page_evictable(page) && PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
@@ -4309,10 +4304,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		SetPageLRU(page);
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	} else if (pgscanned) {
 		count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
 	}
-- 
1.8.3.1




* Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail
  2020-10-30  2:46     ` Alex Shi
@ 2020-10-30 13:52       ` Johannes Weiner
  2020-10-31  1:14         ` Alex Shi
  2020-11-02 16:03       ` Matthew Wilcox
  1 sibling, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-10-30 13:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Fri, Oct 30, 2020 at 10:46:54AM +0800, Alex Shi wrote:
> On 2020/10/29 at 9:50 PM, Johannes Weiner wrote:
> > It may be better to pick either
> > 	head and tail
> 
> Hi Johannes,
> 
> Thanks for comments!
> 
> Right. Considering the functions in this file mostly use head/tail as their
> parameter names, I will change to head/tail too. The 04th, 05th, and 18th
> patches will then be changed accordingly.

That's great, thank you!

> From a9ee63a213f40eb4d5a69b52fbb348ff9cd7cf6c Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linux.alibaba.com>
> Date: Tue, 26 May 2020 16:49:22 +0800
> Subject: [PATCH v21 04/20] mm/thp: use head for head page in lru_add_page_tail
> 
> Since the first parameter is only ever a head page, it's better to make
> that explicit.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled
  2020-10-30  2:27     ` Alex Shi
@ 2020-10-30 14:04       ` Johannes Weiner
  2020-10-31  1:13         ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-10-30 14:04 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko

On Fri, Oct 30, 2020 at 10:27:51AM +0800, Alex Shi wrote:
> 
> 
> On 2020/10/29 at 9:46 PM, Johannes Weiner wrote:
> >>  ? release_pages+0x1ae/0x410
> >>  shmem_alloc_and_acct_page+0x77/0x1c0
> >>  shmem_getpage_gfp+0x162/0x910
> >>  shmem_fault+0x74/0x210
> >>  ? filemap_map_pages+0x29c/0x410
> >>  __do_fault+0x37/0x190
> >>  handle_mm_fault+0x120a/0x1770
> >>  exc_page_fault+0x251/0x450
> >>  ? asm_exc_page_fault+0x8/0x30
> >>  asm_exc_page_fault+0x1e/0x30
> >>
> >> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> >> Reviewed-by: Roman Gushchin <guro@fb.com>
> >> Acked-by: Michal Hocko <mhocko@suse.com>
> >> Acked-by: Hugh Dickins <hughd@google.com>
> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >> Cc: Michal Hocko <mhocko@kernel.org>
> >> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: cgroups@vger.kernel.org
> >> Cc: linux-mm@kvack.org
> >> Cc: linux-kernel@vger.kernel.org
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > This should go in before the previous patch that adds the WARN for it.
> 
> Right, but then the long oops in the changelog may look out of place. Should I remove the oops and resend the whole patchset?

You mean the warning in the changelog? I think that's alright. You can
just say that you're about to remove the !page->memcg check in the
next patch because the original reasons for having it are gone, and
memcg being disabled is the only remaining exception, so this patch
makes that check explicit in preparation for the next.

Sorry, it's all a bit of a hassle, I just wouldn't want to introduce a
known warning into the kernel between those two patches (could confuse
bisection runs, complicates partial reverts etc.)



* Re: [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled
  2020-10-30 14:04       ` Johannes Weiner
@ 2020-10-31  1:13         ` Alex Shi
  0 siblings, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-10-31  1:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko



On 2020/10/30 at 10:04 PM, Johannes Weiner wrote:
>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>
>>> This should go in before the previous patch that adds the WARN for it.
>> Right, but then the long oops in the changelog may look out of place. Should I remove the oops and resend the whole patchset?
> You mean the warning in the changelog? I think that's alright. You can
> just say that you're about to remove the !page->memcg check in the
> next patch because the original reasons for having it are gone, and
> memcg being disabled is the only remaining exception, so this patch
> makes that check explicit in preparation for the next.
> 
> Sorry, it's all a bit of a hassle, I just wouldn't want to introduce a
> known warning into the kernel between those two patches (could confuse
> bisection runs, complicates partial reverts etc.)

Hi Johannes,

I see. I will swap the order of the 1st and 2nd patches and fold the above
explanation into the commit log. I guess you may have more comments on the
other patches, so I am going to wait for those and then send v21 as a whole. :)

Many thanks!
Alex 



* Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail
  2020-10-30 13:52       ` Johannes Weiner
@ 2020-10-31  1:14         ` Alex Shi
  0 siblings, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-10-31  1:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/10/30 at 9:52 PM, Johannes Weiner wrote:
> 
>> From a9ee63a213f40eb4d5a69b52fbb348ff9cd7cf6c Mon Sep 17 00:00:00 2001
>> From: Alex Shi <alex.shi@linux.alibaba.com>
>> Date: Tue, 26 May 2020 16:49:22 +0800
>> Subject: [PATCH v21 04/20] mm/thp: use head for head page in lru_add_page_tail
>>
>> Since the first parameter is only ever a head page, it's better to make
>> that explicit.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Acked-by: Hugh Dickins <hughd@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>


Thanks a lot!
Alex



* Re: [PATCH v20 07/20] mm/vmscan: remove unnecessary lruvec adding
  2020-10-29 10:44 ` [PATCH v20 07/20] mm/vmscan: remove unnecessary lruvec adding Alex Shi
@ 2020-11-02 14:20   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 14:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Thu, Oct 29, 2020 at 06:44:52PM +0800, Alex Shi wrote:
> We don't have to add a freeable page to the lru and then remove it again.
> This change saves a couple of actions and makes the page movement clearer.
> 
> The SetPageLRU needs to be kept before put_page_testzero for list
> integrity, otherwise:
> 
>   #0 move_pages_to_lru             #1 release_pages
>   if !put_page_testzero
>      			           if (put_page_testzero())
>      			              !PageLRU //skip lru_lock
>      SetPageLRU()
>      list_add(&page->lru,)
>                                          list_add(&page->lru,)
> 
> [akpm@linux-foundation.org: coding style fixes]
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-10-29 10:44 ` [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock Alex Shi
@ 2020-11-02 14:41   ` Johannes Weiner
  2020-11-02 14:49     ` Matthew Wilcox
  2020-11-11  7:27     ` Hugh Dickins
  0 siblings, 2 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 14:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Vlastimil Babka, Minchan Kim

On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> From: Hugh Dickins <hughd@google.com>
> 
> It is necessary for page_idle_get_page() to recheck PageLRU() after
> get_page_unless_zero(), but holding lru_lock around that serves no
> useful purpose, and adds to lru_lock contention: delete it.
> 
> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> discussion that led to lru_lock there; but __page_set_anon_rmap() now
> uses WRITE_ONCE(),

That doesn't seem to be the case in Linus's or Andrew's tree. Am I
missing a dependent patch series?

> and I see no other risk in page_idle_clear_pte_refs() using
> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> not entirely prevented by page_count() check in ksm.c's
> write_protect_page(): that risk being shared with page_referenced()
> and not helped by lru_lock).

Isn't it possible, as per Minchan's description, for page->mapping to
point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
thinking it's looking at a struct address_space?



* Re: [PATCH v20 09/20] mm/memcg: add debug checking in lock_page_memcg
  2020-10-29 10:44 ` [PATCH v20 09/20] mm/memcg: add debug checking in lock_page_memcg Alex Shi
@ 2020-11-02 14:45   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 14:45 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko

On Thu, Oct 29, 2020 at 06:44:54PM +0800, Alex Shi wrote:
> Add a debug check in lock_page_memcg, so we get a warning
> if anything goes wrong here.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v20 10/20] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  2020-10-29 10:44 ` [PATCH v20 10/20] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
@ 2020-11-02 14:48   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 14:48 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Thu, Oct 29, 2020 at 06:44:55PM +0800, Alex Shi wrote:
> Fold the PGROTATED event collection into the pagevec_move_tail_fn callback,
> like the other callbacks passed to pagevec_lru_move_fn. This lets us drop
> the pagevec_move_tail() wrapper.
> Now all users of pagevec_lru_move_fn look the same, and its 3rd parameter
> is no longer needed.
> 
> It only simplifies the calling convention. No functional change.
> 
> [lkp@intel.com: found a build issue in the original patch, thanks]
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Nice.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-02 14:41   ` Johannes Weiner
@ 2020-11-02 14:49     ` Matthew Wilcox
  2020-11-02 20:20       ` Johannes Weiner
  2020-11-11  7:27     ` Hugh Dickins
  1 sibling, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2020-11-02 14:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Vlastimil Babka, Minchan Kim

On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> > From: Hugh Dickins <hughd@google.com>
> > 
> > It is necessary for page_idle_get_page() to recheck PageLRU() after
> > get_page_unless_zero(), but holding lru_lock around that serves no
> > useful purpose, and adds to lru_lock contention: delete it.
> > 
> > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> > discussion that led to lru_lock there; but __page_set_anon_rmap() now
> > uses WRITE_ONCE(),
> 
> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> missing a dependent patch series?
> 
> > and I see no other risk in page_idle_clear_pte_refs() using
> > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> > not entirely prevented by page_count() check in ksm.c's
> > write_protect_page(): that risk being shared with page_referenced()
> > and not helped by lru_lock).
> 
> Isn't it possible, as per Minchan's description, for page->mapping to
> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
> thinking it's looking at a struct address_space?

I don't think it can point to an anon_vma without the ANON bit set.
Minchan's concern in that email was that it might still be NULL.
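
For reference, the store in question looks roughly like this (paraphrased
from mm/rmap.c's __page_set_anon_rmap(), not a verbatim quote):

	/* fold the ANON bit into the pointer value ... */
	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
	/* ... and publish it with a single assignment */
	page->mapping = (struct address_space *) anon_vma;
	page->index = linear_page_index(vma, address);

i.e. in the source, the ANON bit is already folded into the pointer value
before the one assignment to page->mapping.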



* Re: [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru
  2020-10-29 10:44 ` [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru Alex Shi
@ 2020-11-02 14:52   ` Johannes Weiner
  2020-11-03  2:51     ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 14:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Alexander Duyck, Michal Hocko

On Thu, Oct 29, 2020 at 06:44:57PM +0800, Alex Shi wrote:
> An isolated page can't be recharged to another memcg, since memcg
> migration isn't possible at that time.
> So remove the unnecessary re-lookup of the lruvec.
> 
> Thanks to Alexander Duyck for pointing this out.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

A brief comment in the code could be good: all pages were isolated
from the same lruvec (and isolation inhibits memcg migration).
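
Something like this, perhaps (just a sketch of the wording, to be placed
above the loop in move_pages_to_lru()):

	/*
	 * All pages were isolated from the same lruvec (and isolation
	 * inhibits memcg migration), so the lruvec passed in by the
	 * caller is still the right one; no need to look it up again
	 * for each page.
	 */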



* Re: [PATCH v20 13/20] mm/mlock: remove lru_lock on TestClearPageMlocked
  2020-10-29 10:44 ` [PATCH v20 13/20] mm/mlock: remove lru_lock on TestClearPageMlocked Alex Shi
@ 2020-11-02 14:55   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 14:55 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Kirill A. Shutemov, Vlastimil Babka

On Thu, Oct 29, 2020 at 06:44:58PM +0800, Alex Shi wrote:
> In munlock_vma_page, the comments said lru_lock was needed for
> serialization with split_huge_page. But the page must be PageLocked
> there, just as it is in the split_huge_page series of functions. Thus
> PageLocked alone is enough to serialize both.
> 
> Furthermore, Hugh Dickins pointed out: before splitting in
> split_huge_page_to_list, the page goes through unmap_page() to remove
> the pmds/ptes, which protects it from munlock. Thus there is no need to
> guard __split_huge_page_tail for mlock clearing; the lru_lock is kept
> there only for isolation purposes.
> 
> LKP found a preemption issue with __mod_zone_page_state, which needed a
> change to mod_zone_page_state. Thanks!
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v20 14/20] mm/mlock: remove __munlock_isolate_lru_page
  2020-10-29 10:44 ` [PATCH v20 14/20] mm/mlock: remove __munlock_isolate_lru_page Alex Shi
@ 2020-11-02 14:56   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 14:56 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Kirill A. Shutemov, Vlastimil Babka

On Thu, Oct 29, 2020 at 06:44:59PM +0800, Alex Shi wrote:
> The function only has one caller; remove it to clean up and simplify
> the code.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU
  2020-10-29 10:45 ` [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU Alex Shi
@ 2020-11-02 15:10   ` Johannes Weiner
  2020-11-03  3:02     ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 15:10 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko

On Thu, Oct 29, 2020 at 06:45:00PM +0800, Alex Shi wrote:
> Currently lru_lock still guards both lru list and page's lru bit, that's
> ok. but if we want to use specific lruvec lock on the page, we need to
> pin down the page's lruvec/memcg during locking. Just taking lruvec
> lock first may be undermined by the page's memcg charge/migration. To
> fix this problem, we could clear the lru bit out of locking and use
> it as pin down action to block the page isolation in memcg changing.

Small nit, but the use of "could" in this sentence sounds like you're
describing one possible solution that isn't being taken, when in fact
you are describing the chosen locking mechanism.

Replacing "could" with "will" would make things a bit clearer IMO.

> So now the standard steps of page isolation are as follows:
> 	1, get_page(); 	       #pin the page avoid to be free
> 	2, TestClearPageLRU(); #block other isolation like memcg change
> 	3, spin_lock on lru_lock; #serialize lru list access
> 	4, delete page from lru list;
> The step 2 could be optimized/replaced in scenarios where the page is
> unlikely to be accessed or moved between memcgs.

This is a bit ominous. I'd either elaborate / provide an example /
clarify why some sites can deal with races - or just remove that
sentence altogether from this part of the changelog.

> This patch starts with the first part: TestClearPageLRU, which combines
> the PageLRU check and ClearPageLRU into one macro func, TestClearPageLRU.
> This function will be used as the page isolation precondition to prevent
> other isolations elsewhere. Then there may be !PageLRU pages on the lru
> list, so the BUG() checks need to be removed accordingly.
> 
> There are 2 rules for the lru bit now:
> 1, the lru bit still indicates whether a page is on an lru list, except
>    for a brief moment (isolation), when the page may have no lru bit
>    while it is on the lru list. But a page must be on the lru list when
>    its lru bit is set.
> 2, the lru bit has to be cleared before the page is deleted from the lru list.
> 
> As Andrew Morton mentioned, this change would dirty the cacheline of a
> page that isn't on the LRU. But the cost is acceptable according to Rong
> Chen's <rong.a.chen@intel.com> report:
> https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v20 16/20] mm/compaction: do page isolation first in compaction
  2020-10-29 10:45 ` [PATCH v20 16/20] mm/compaction: do page isolation first in compaction Alex Shi
@ 2020-11-02 15:18   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 15:18 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Thu, Oct 29, 2020 at 06:45:01PM +0800, Alex Shi wrote:
> Currently, compaction takes the lru_lock and then does page isolation,
> which works fine with pgdat->lru_lock, since any page isolation would
> compete for that lru_lock. If we want to change to the memcg lru_lock, we
> have to isolate the page before taking the lru_lock, so that isolation
> blocks the page's memcg from changing, which in turn relies on page
> isolation too. Then we can safely use the per-memcg lru_lock later.
> 
> The new page isolation uses the previously introduced TestClearPageLRU() +
> pgdat lru locking, which will be changed to the memcg lru lock later.
> 
> Hugh Dickins <hughd@google.com> fixed the following bugs in an early
> version of this patch:
> 
> Fix lots of crashes under compaction load: isolate_migratepages_block()
> must clean up appropriately when rejecting a page, setting PageLRU again
> if it had been cleared; and a put_page() after get_page_unless_zero()
> cannot safely be done while holding locked_lruvec - it may turn out to
> be the final put_page(), which will take an lruvec lock when PageLRU.
> And move __isolate_lru_page_prepare back after get_page_unless_zero to
> make trylock_page() safe:
> trylock_page() is not safe to use at this time: its setting PG_locked
> can race with the page being freed or allocated ("Bad page"), and can
> also erase flags being set by one of those "sole owners" of a freshly
> allocated page who use non-atomic __SetPageFlag().
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v20 17/20] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  2020-10-29 10:45 ` [PATCH v20 17/20] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn Alex Shi
@ 2020-11-02 15:20   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 15:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Thu, Oct 29, 2020 at 06:45:02PM +0800, Alex Shi wrote:
> Hugh Dickins found a memcg change bug in the original version:
> If we want to change the pgdat->lru_lock to memcg's lruvec lock, we have
> to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The
> possible bad scenario would look like:
> 
> 	cpu 0					cpu 1
> lruvec = mem_cgroup_page_lruvec()
> 					if (!isolate_lru_page())
> 						mem_cgroup_move_account
> 
> spin_lock_irqsave(&lruvec->lru_lock <== wrong lock.
> 
> So we need TestClearPageLRU to block isolate_lru_page(), which serializes
> the memcg change; and then the PageLRU check in the move_fn callee is
> removed as a consequence.
> 
> __pagevec_lru_add_fn() is different from the others, because the pages
> it deals with are, by definition, not yet on the lru.  TestClearPageLRU
> is not needed and would not work, so __pagevec_lru_add() goes its own
> way.
> 
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail
  2020-10-30  2:46     ` Alex Shi
  2020-10-30 13:52       ` Johannes Weiner
@ 2020-11-02 16:03       ` Matthew Wilcox
  2020-11-03  2:43         ` Alex Shi
  1 sibling, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2020-11-02 16:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: Johannes Weiner, akpm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Fri, Oct 30, 2020 at 10:46:54AM +0800, Alex Shi wrote:
> -static void lru_add_page_tail(struct page *page, struct page *page_tail,
> +static void lru_add_page_tail(struct page *head, struct page *tail,
>  		struct lruvec *lruvec, struct list_head *list)
>  {
> -	VM_BUG_ON_PAGE(!PageHead(page), page);
> -	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
> -	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
> +	VM_BUG_ON_PAGE(!PageHead(head), head);
> +	VM_BUG_ON_PAGE(PageCompound(tail), head);
> +	VM_BUG_ON_PAGE(PageLRU(tail), head);

These last two should surely have been
	VM_BUG_ON_PAGE(PageCompound(tail), tail);
	VM_BUG_ON_PAGE(PageLRU(tail), tail);

Also, what do people think about converting these to VM_BUG_ON_PGFLAGS?

Either way:

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>



* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-02 14:49     ` Matthew Wilcox
@ 2020-11-02 20:20       ` Johannes Weiner
  2020-11-04 11:27         ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 20:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Vlastimil Babka, Minchan Kim

On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
> > On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> > > From: Hugh Dickins <hughd@google.com>
> > > 
> > > It is necessary for page_idle_get_page() to recheck PageLRU() after
> > > get_page_unless_zero(), but holding lru_lock around that serves no
> > > useful purpose, and adds to lru_lock contention: delete it.
> > > 
> > > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> > > discussion that led to lru_lock there; but __page_set_anon_rmap() now
> > > uses WRITE_ONCE(),
> > 
> > That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> > missing a dependent patch series?
> > 
> > > and I see no other risk in page_idle_clear_pte_refs() using
> > > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> > > not entirely prevented by page_count() check in ksm.c's
> > > write_protect_page(): that risk being shared with page_referenced()
> > > and not helped by lru_lock).
> > 
> > Isn't it possible, as per Minchan's description, for page->mapping to
> > point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
> > thinking it's looking at a struct address_space?
> 
> I don't think it can point to an anon_vma without the ANON bit set.
> Minchan's concern in that email was that it might still be NULL.

Hm, no, the thread is a lengthy discussion about whether the store
could be split such that page->mapping is actually pointing to
something invalid (anon_vma without the PageAnon bit).

From his email:

        CPU 0                                                                           CPU 1

do_anonymous_page
  __page_set_anon_rmap
  /* out of order happened so SetPageLRU is done ahead */
  SetPageLRU(page)
  /* Compiler changed the store operation like below */
  page->mapping = (struct address_space *) anon_vma;
  /* Big stall happens */
                                                                /* idle tracking judged it as an LRU page so passes the page
                                                                   to page_referenced */
                                                                page_referenced
                                                                        page_rmapping returns true because
                                                                        page->mapping has some value but is not complete,
                                                                        so it calls rmap_walk_file.
                                                                        Is it okay to pass a non-completed anon page to rmap_walk_file?



* Re: [PATCH v20 18/20] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-10-30  2:49   ` Alex Shi
@ 2020-11-02 20:41     ` Johannes Weiner
  2020-11-03  4:58       ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 20:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko, Yang Shi

On Fri, Oct 30, 2020 at 10:49:41AM +0800, Alex Shi wrote:
> 
> 
> The patch changed due to the variable renaming in the 04th patch:
> 
> From e892e74a35c27e69bebb73d2e4cff54e438f8d7d Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linux.alibaba.com>
> Date: Tue, 18 Aug 2020 16:44:21 +0800
> Subject: [PATCH v21 18/20] mm/lru: replace pgdat lru_lock with lruvec lock
> 
> This patch moves the per-node lru_lock into the lruvec, thus providing a
> lru_lock for each memcg on each node. So on a large machine, each memcg
> no longer has to suffer from per-node pgdat->lru_lock contention. They
> can go faster with their own lru_lock.
> 
> After moving the memcg charge before lru insertion, page isolation can
> serialize the page's memcg, so the per-memcg lruvec lock is stable and
> can replace the per-node lru lock.
> 
> In isolate_migratepages_block, compact_unlock_should_abort and
> lock_page_lruvec_irqsave are open coded to work with compact_control.
> Also add a debug function to the locking which may give some clues if
> something gets out of hand.
> 
> Daniel Jordan's testing shows a 62% improvement on the modified readtwice
> case on his 2P * 10 core * 2 HT Broadwell box.
> https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
> 
> On a large machine with memcg enabled but not used, looking up the
> page's lruvec chases a few extra pointers, which may increase lru_lock
> hold time and cause a slight regression.
> 
> Hugh Dickins helped polish the patch, thanks!
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Rong Chen <rong.a.chen@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: cgroups@vger.kernel.org
> ---
>  include/linux/memcontrol.h |  58 +++++++++++++++++++++++++
>  include/linux/mmzone.h     |   3 +-
>  mm/compaction.c            |  56 +++++++++++++++---------
>  mm/huge_memory.c           |  11 ++---
>  mm/memcontrol.c            |  62 ++++++++++++++++++++++++--
>  mm/mlock.c                 |  22 +++++++---
>  mm/mmzone.c                |   1 +
>  mm/page_alloc.c            |   1 -
>  mm/swap.c                  | 105 +++++++++++++++++++++------------------------
>  mm/vmscan.c                |  55 +++++++++++-------------
>  10 files changed, 249 insertions(+), 125 deletions(-)

This came out really well. Thanks for persisting!

A few inline comments:

> @@ -1367,6 +1380,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>  	return lruvec;
>  }
>  
> +struct lruvec *lock_page_lruvec(struct page *page)
> +{
> +	struct lruvec *lruvec;
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	rcu_read_lock();
> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +	spin_lock(&lruvec->lru_lock);
> +	rcu_read_unlock();
> +
> +	lruvec_memcg_debug(lruvec, page);
> +
> +	return lruvec;
> +}
> +
> +struct lruvec *lock_page_lruvec_irq(struct page *page)
> +{
> +	struct lruvec *lruvec;
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	rcu_read_lock();
> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +	spin_lock_irq(&lruvec->lru_lock);
> +	rcu_read_unlock();
> +
> +	lruvec_memcg_debug(lruvec, page);
> +
> +	return lruvec;
> +}
> +
> +struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
> +{
> +	struct lruvec *lruvec;
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	rcu_read_lock();
> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +	spin_lock_irqsave(&lruvec->lru_lock, *flags);
> +	rcu_read_unlock();
> +
> +	lruvec_memcg_debug(lruvec, page);
> +
> +	return lruvec;
> +}

As these are used quite widely throughout the VM code now, it would be
good to give them kerneldoc comments that explain the interface.

In particular, I think it's necessary to explain the contexts from
which this is safe to use (in particular wrt pages moving between
memcgs - see the comment in commit_charge()).
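
For example, a rough sketch of what such a kerneldoc could say (wording
entirely tentative):

	/**
	 * lock_page_lruvec - lock the lruvec the page belongs to
	 * @page: the page
	 *
	 * The caller must ensure the page cannot be moved to another
	 * memcg while the lock is held, e.g. because PageLRU has been
	 * cleared, because the page is not yet on the LRU, or because
	 * the page's refcount has already dropped to zero.
	 *
	 * Returns the locked lruvec.
	 */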

I'm going to go through the callsites that follow and try to identify
what makes them safe. It's mostly an exercise to double check our
thinking here.

Most of them are straight-forward, and I don't think they warrant
individual comments. But some do, IMO. And it appears at least one
actually isn't safe yet:

> @@ -277,10 +277,16 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  			 * so we can spare the get_page() here.
>  			 */
>  			if (TestClearPageLRU(page)) {
> -				struct lruvec *lruvec;
> +				struct lruvec *new_lruvec;
> +
> +				new_lruvec = mem_cgroup_page_lruvec(page,
> +						page_pgdat(page));
> +				if (new_lruvec != lruvec) {
> +					if (lruvec)
> +						unlock_page_lruvec_irq(lruvec);
> +					lruvec = lock_page_lruvec_irq(page);

This is safe because PageLRU has been cleared.

> @@ -79,16 +79,14 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
>  static void __page_cache_release(struct page *page)
>  {
>  	if (PageLRU(page)) {
> -		pg_data_t *pgdat = page_pgdat(page);
>  		struct lruvec *lruvec;
>  		unsigned long flags;
>  
> -		spin_lock_irqsave(&pgdat->lru_lock, flags);
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +		lruvec = lock_page_lruvec_irqsave(page, &flags);
>  		VM_BUG_ON_PAGE(!PageLRU(page), page);
>  		__ClearPageLRU(page);
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> -		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +		unlock_page_lruvec_irqrestore(lruvec, flags);

This is safe because the page refcount is 0.

> @@ -207,32 +205,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>  	void (*move_fn)(struct page *page, struct lruvec *lruvec))
>  {
>  	int i;
> -	struct pglist_data *pgdat = NULL;
> -	struct lruvec *lruvec;
> +	struct lruvec *lruvec = NULL;
>  	unsigned long flags = 0;
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
> -		struct pglist_data *pagepgdat = page_pgdat(page);
> -
> -		if (pagepgdat != pgdat) {
> -			if (pgdat)
> -				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -			pgdat = pagepgdat;
> -			spin_lock_irqsave(&pgdat->lru_lock, flags);
> -		}
> +		struct lruvec *new_lruvec;
>  
>  		/* block memcg migration during page moving between lru */
>  		if (!TestClearPageLRU(page))
>  			continue;
>  
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +		if (lruvec != new_lruvec) {
> +			if (lruvec)
> +				unlock_page_lruvec_irqrestore(lruvec, flags);
> +			lruvec = lock_page_lruvec_irqsave(page, &flags);

This is safe because PageLRU has been cleared.

> @@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  {
>  	do {
>  		unsigned long lrusize;
> -		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  
> -		spin_lock_irq(&pgdat->lru_lock);
> +		spin_lock_irq(&lruvec->lru_lock);
>  		/* Record cost event */
>  		if (file)
>  			lruvec->file_cost += nr_pages;
> @@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  			lruvec->file_cost /= 2;
>  			lruvec->anon_cost /= 2;
>  		}
> -		spin_unlock_irq(&pgdat->lru_lock);
> +		spin_unlock_irq(&lruvec->lru_lock);
>  	} while ((lruvec = parent_lruvec(lruvec)));
>  }

This is safe because it either comes from

	1) the pinned lruvec in reclaim, or

	2) from a pre-LRU page during refault (which also holds the
	   rcu lock, so would be safe even if the page was on the LRU
	   and could move simultaneously to a new lruvec).

The second one seems a bit tricky. It could be good to add a comment
to lru_note_cost_page() that explains why its mem_cgroup_page_lruvec()
is safe.
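
Maybe along these lines (again, only a sketch of the wording):

	/*
	 * lru_note_cost_page() is called on refault, before the page
	 * is added to the LRU; the caller also holds the rcu read
	 * lock, so mem_cgroup_page_lruvec() is safe to use here
	 * without holding the lru lock.
	 */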

> @@ -364,13 +359,13 @@ static inline void activate_page_drain(int cpu)
>  
>  static void activate_page(struct page *page)
>  {
> -	pg_data_t *pgdat = page_pgdat(page);
> +	struct lruvec *lruvec;
>  
>  	page = compound_head(page);
> -	spin_lock_irq(&pgdat->lru_lock);
> +	lruvec = lock_page_lruvec_irq(page);

I don't see what makes this safe. There is nothing that appears to
lock out a concurrent page move between memcgs/lruvecs, which means
the following could manipulate an unlocked lru list:

>  	if (PageLRU(page))
> -		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
> -	spin_unlock_irq(&pgdat->lru_lock);
> +		__activate_page(page, lruvec);
> +	unlock_page_lruvec_irq(lruvec);
>  }
>  #endif

Shouldn't this be something like this?

	if (TestClearPageLRU()) {
		lruvec = lock_page_lruvec_irq(page);
		__activate_page(page, lruvec);
		unlock_page_lruvec_irq(lruvec);
		SetPageLRU(page);
	}

> @@ -904,27 +897,27 @@ void release_pages(struct page **pages, int nr)
>  			continue;
>  
>  		if (PageCompound(page)) {
> -			if (locked_pgdat) {
> -				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> -				locked_pgdat = NULL;
> +			if (lruvec) {
> +				unlock_page_lruvec_irqrestore(lruvec, flags);
> +				lruvec = NULL;
>  			}
>  			__put_compound_page(page);
>  			continue;
>  		}
>  
>  		if (PageLRU(page)) {
> -			struct pglist_data *pgdat = page_pgdat(page);
> +			struct lruvec *new_lruvec;
>  
> -			if (pgdat != locked_pgdat) {
> -				if (locked_pgdat)
> -					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> +			new_lruvec = mem_cgroup_page_lruvec(page,
> +							page_pgdat(page));
> +			if (new_lruvec != lruvec) {
> +				if (lruvec)
> +					unlock_page_lruvec_irqrestore(lruvec,
>  									flags);
>  				lock_batch = 0;
> -				locked_pgdat = pgdat;
> -				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
> +				lruvec = lock_page_lruvec_irqsave(page, &flags);

Safe because page refcount=0.

> @@ -1023,26 +1016,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
>  	int i;
> -	struct pglist_data *pgdat = NULL;
> -	struct lruvec *lruvec;
> +	struct lruvec *lruvec = NULL;
>  	unsigned long flags = 0;
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
> -		struct pglist_data *pagepgdat = page_pgdat(page);
> +		struct lruvec *new_lruvec;
>  
> -		if (pagepgdat != pgdat) {
> -			if (pgdat)
> -				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -			pgdat = pagepgdat;
> -			spin_lock_irqsave(&pgdat->lru_lock, flags);
> +		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +		if (lruvec != new_lruvec) {
> +			if (lruvec)
> +				unlock_page_lruvec_irqrestore(lruvec, flags);
> +			lruvec = lock_page_lruvec_irqsave(page, &flags);

Safe because PageLRU hasn't been set yet.

> @@ -1765,14 +1765,12 @@ int isolate_lru_page(struct page *page)
>  	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>  
>  	if (TestClearPageLRU(page)) {
> -		pg_data_t *pgdat = page_pgdat(page);
>  		struct lruvec *lruvec;
>  
>  		get_page(page);
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -		spin_lock_irq(&pgdat->lru_lock);
> +		lruvec = lock_page_lruvec_irq(page);

Safe because PageLRU is cleared.

> @@ -1839,7 +1837,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  						     struct list_head *list)
>  {
> -	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	int nr_pages, nr_moved = 0;
>  	LIST_HEAD(pages_to_free);
>  	struct page *page;
> @@ -1850,9 +1847,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  		VM_BUG_ON_PAGE(PageLRU(page), page);
>  		list_del(&page->lru);
>  		if (unlikely(!page_evictable(page))) {
> -			spin_unlock_irq(&pgdat->lru_lock);
> +			spin_unlock_irq(&lruvec->lru_lock);

[snipped all the reclaim lock sites as they start with lruvec]

> @@ -4289,13 +4285,12 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>  		if (!TestClearPageLRU(page))
>  			continue;
>  
> -		if (pagepgdat != pgdat) {
> -			if (pgdat)
> -				spin_unlock_irq(&pgdat->lru_lock);
> -			pgdat = pagepgdat;
> -			spin_lock_irq(&pgdat->lru_lock);
> +		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +		if (lruvec != new_lruvec) {
> +			if (lruvec)
> +				unlock_page_lruvec_irq(lruvec);
> +			lruvec = lock_page_lruvec_irq(page);

Safe because PageLRU is clear.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 19/20] mm/lru: introduce the relock_page_lruvec function
  2020-10-29 10:45 ` [PATCH v20 19/20] mm/lru: introduce the relock_page_lruvec function Alex Shi
@ 2020-11-02 20:44   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 20:44 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Alexander Duyck, Thomas Gleixner,
	Andrey Ryabinin

On Thu, Oct 29, 2020 at 06:45:04PM +0800, Alex Shi wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Use this new function to replace the repeated same code; no functional change.
> 
> When testing for relock we can avoid the need for RCU locking if we simply
> compare the page pgdat and memcg pointers versus those that the lruvec is
> holding. By doing this we can avoid the extra pointer walks and accesses of
> the memory cgroup.
> 
> In addition we can avoid the checks entirely if lruvec is currently NULL.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
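
For reference, a minimal sketch of the shape such a relock helper can take,
per the description above (illustrative only; the in-tree naming and details
may differ, and lruvec_holds_page_lru_lock() is the comparison helper used
elsewhere in this series):

	static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
						struct lruvec *locked_lruvec)
	{
		if (locked_lruvec) {
			/* Same memcg and node: the held lock is still the right one */
			if (lruvec_holds_page_lru_lock(page, locked_lruvec))
				return locked_lruvec;

			unlock_page_lruvec_irq(locked_lruvec);
		}

		/* Different (or no) lruvec held: take the page's lruvec lock */
		return lock_page_lruvec_irq(page);
	}

A pagevec walker can then replace the open-coded "compare lruvec, unlock,
relock" sequence seen in the earlier hunks with one relock call per page.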


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 20/20] mm/lru: revise the comments of lru_lock
  2020-10-29 10:45 ` [PATCH v20 20/20] mm/lru: revise the comments of lru_lock Alex Shi
@ 2020-11-02 20:46   ` Johannes Weiner
  0 siblings, 0 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-02 20:46 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Andrey Ryabinin, Jann Horn

On Thu, Oct 29, 2020 at 06:45:05PM +0800, Alex Shi wrote:
> From: Hugh Dickins <hughd@google.com>
> 
> Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to
> fix the incorrect comments in the code. Also fixed some zone->lru_lock
> comment errors left over from ancient times, etc.
> 
> I struggled to understand the comment above move_pages_to_lru() (surely
> it never calls page_referenced()), and eventually realized that most of
> it had got separated from shrink_active_list(): move that comment back.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail
  2020-11-02 16:03       ` Matthew Wilcox
@ 2020-11-03  2:43         ` Alex Shi
  0 siblings, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-11-03  2:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Johannes Weiner, akpm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301



On 2020/11/3 12:03 AM, Matthew Wilcox wrote:
> On Fri, Oct 30, 2020 at 10:46:54AM +0800, Alex Shi wrote:
>> -static void lru_add_page_tail(struct page *page, struct page *page_tail,
>> +static void lru_add_page_tail(struct page *head, struct page *tail,
>>  		struct lruvec *lruvec, struct list_head *list)
>>  {
>> -	VM_BUG_ON_PAGE(!PageHead(page), page);
>> -	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>> -	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>> +	VM_BUG_ON_PAGE(!PageHead(head), head);
>> +	VM_BUG_ON_PAGE(PageCompound(tail), head);
>> +	VM_BUG_ON_PAGE(PageLRU(tail), head);
> 
> These last two should surely have been
> 	VM_BUG_ON_PAGE(PageCompound(tail), tail);
> 	VM_BUG_ON_PAGE(PageLRU(tail), tail);
> 
> Also, what do people think about converting these to VM_BUG_ON_PGFLAGS?

Hi Matthew,

Thanks for the reminder! Looks like these changes are worth another patch.

> 
> Either way:
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 


I will take this option this time. :)

Thanks!
Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru
  2020-11-02 14:52   ` Johannes Weiner
@ 2020-11-03  2:51     ` Alex Shi
  0 siblings, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-11-03  2:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Alexander Duyck, Michal Hocko



On 2020/11/2 10:52 PM, Johannes Weiner wrote:
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> A brief comment in the code could be good: all pages were isolated
> from the same lruvec (and isolation inhibits memcg migration).

Yes, I will add the words to both the code and the commit log.

Thanks


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU
  2020-11-02 15:10   ` Johannes Weiner
@ 2020-11-03  3:02     ` Alex Shi
  0 siblings, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-11-03  3:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko



On 2020/11/2 11:10 PM, Johannes Weiner wrote:
> On Thu, Oct 29, 2020 at 06:45:00PM +0800, Alex Shi wrote:
>> Currently lru_lock still guards both lru list and page's lru bit, that's
>> ok. but if we want to use specific lruvec lock on the page, we need to
>> pin down the page's lruvec/memcg during locking. Just taking lruvec
>> lock first may be undermined by the page's memcg charge/migration. To
>> fix this problem, we could clear the lru bit out of locking and use
>> it as pin down action to block the page isolation in memcg changing.
> 
> Small nit, but the use of "could" in this sentence sounds like you're
> describing one possible solution that isn't being taken, when in fact
> you are describing the chosen locking mechanism.
> 
> Replacing "could" with "will" would make things a bit clearer IMO.
> 

Yes, 'will' is better here. Thanks!

>> So now the standard steps of page isolation are as follows:
>> 	1, get_page();	       #pin the page to avoid it being freed
>> 	2, TestClearPageLRU(); #block other isolation like memcg change
>> 	3, spin_lock on lru_lock; #serialize lru list access
>> 	4, delete page from lru list;
>> The step 2 could be optimzed/replaced in scenarios which page is
>> unlikely be accessed or be moved between memcgs.
> 
> This is a bit ominous. I'd either elaborate / provide an example /
> clarify why some sites can deal with races - or just remove that
> sentence altogether from this part of the changelog.
> 

There are a few scenarios here, so examples would look verbose or couldn't
describe the whole picture. Maybe removing the above 2 lines "The step 2
could be optimzed/replaced in scenarios which page is unlikely be accessed
or be moved between memcgs." is better.

Thanks!

>> This patch starts with the first part: TestClearPageLRU, which combines
>> the PageLRU check and ClearPageLRU into a macro func TestClearPageLRU.
>> This function will be used as a page isolation precondition to prevent
>> other isolations somewhere else. Then there may be !PageLRU pages on the
>> lru list, and the BUG() checks need to be removed accordingly.
>>
>> There are 2 rules for the lru bit now:
>> 1, the lru bit still indicates whether a page is on the lru list; just
>>    for some temporary moments (isolation), the page may have no lru bit
>>    while it is on the lru list.  But the page must be on the lru list
>>    when the lru bit is set.
>> 2, the lru bit has to be cleared before the page is deleted from the
>>    lru list.
>>
>> As Andrew Morton mentioned, this change would dirty the cacheline for a
>> page that isn't on the LRU. But the loss would be acceptable according
>> to Rong Chen's <rong.a.chen@intel.com> report:
>> https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/
>>
>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Acked-by: Hugh Dickins <hughd@google.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: linux-kernel@vger.kernel.org
>> Cc: cgroups@vger.kernel.org
>> Cc: linux-mm@kvack.org
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 

Thanks!
Alex
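
For reference, the standard isolation steps quoted above can be sketched
roughly as follows (illustrative only; the real isolate_lru_page() differs
in detail, e.g. it uses a plain get_page() because its caller already holds
a reference):

	static bool isolate_page_sketch(struct page *page)
	{
		struct lruvec *lruvec;

		if (!get_page_unless_zero(page))	/* 1, pin the page */
			return false;
		if (!TestClearPageLRU(page)) {		/* 2, block other isolation */
			put_page(page);
			return false;
		}
		lruvec = lock_page_lruvec_irq(page);	/* 3, serialize lru list access */
		del_page_from_lru_list(page, lruvec, page_lru(page));	/* 4, delete from lru */
		unlock_page_lruvec_irq(lruvec);
		return true;
	}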


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 18/20] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-11-02 20:41     ` Johannes Weiner
@ 2020-11-03  4:58       ` Alex Shi
  0 siblings, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-11-03  4:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko, Yang Shi



On 2020/11/3 4:41 AM, Johannes Weiner wrote:
> On Fri, Oct 30, 2020 at 10:49:41AM +0800, Alex Shi wrote:
>>
>>
>> @@ -1367,6 +1380,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>>  	return lruvec;
>>  }
>>  
>> +struct lruvec *lock_page_lruvec(struct page *page)
>> +{
>> +	struct lruvec *lruvec;
>> +	struct pglist_data *pgdat = page_pgdat(page);
>> +
>> +	rcu_read_lock();
>> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> +	spin_lock(&lruvec->lru_lock);
>> +	rcu_read_unlock();
>> +
>> +	lruvec_memcg_debug(lruvec, page);
>> +
>> +	return lruvec;
>> +}
>> +
>> +struct lruvec *lock_page_lruvec_irq(struct page *page)
>> +{
>> +	struct lruvec *lruvec;
>> +	struct pglist_data *pgdat = page_pgdat(page);
>> +
>> +	rcu_read_lock();
>> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> +	spin_lock_irq(&lruvec->lru_lock);
>> +	rcu_read_unlock();
>> +
>> +	lruvec_memcg_debug(lruvec, page);
>> +
>> +	return lruvec;
>> +}
>> +
>> +struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
>> +{
>> +	struct lruvec *lruvec;
>> +	struct pglist_data *pgdat = page_pgdat(page);
>> +
>> +	rcu_read_lock();
>> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> +	spin_lock_irqsave(&lruvec->lru_lock, *flags);
>> +	rcu_read_unlock();
>> +
>> +	lruvec_memcg_debug(lruvec, page);
>> +
>> +	return lruvec;
>> +}
> 
> As these are used quite widely throughout the VM code now, it would be
> good to give them kerneldoc comments that explain the interface.
> 
> In particular, I think it's necessary to explain the contexts from
> which this is safe to use (in particular wrt pages moving between
> memcgs - see the comment in commit_charge()).
> 
> I'm going to go through the callsites that follow and try to identify
> what makes them safe. It's mostly an exercise to double check our
> thinking here.
> 
> Most of them are straight-forward, and I don't think they warrant
> individual comments. But some do, IMO. And it appears at least one
> actually isn't safe yet:

Thanks a lot for the reminder. Is the following comment fine?

/**
 * lock_page_lruvec - return lruvec for the locked page.
 * @page: the page
 *
 * This series functions should be used in either conditions:
 * PageLRU is cleared or unset
 * or, page->_refcount is zero
 */
struct lruvec *lock_page_lruvec(struct page *page)
{

....

>> @@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>>  {
>>  	do {
>>  		unsigned long lrusize;
>> -		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>  
>> -		spin_lock_irq(&pgdat->lru_lock);
>> +		spin_lock_irq(&lruvec->lru_lock);
>>  		/* Record cost event */
>>  		if (file)
>>  			lruvec->file_cost += nr_pages;
>> @@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>>  			lruvec->file_cost /= 2;
>>  			lruvec->anon_cost /= 2;
>>  		}
>> -		spin_unlock_irq(&pgdat->lru_lock);
>> +		spin_unlock_irq(&lruvec->lru_lock);
>>  	} while ((lruvec = parent_lruvec(lruvec)));
>>  }
> 
> This is safe because it either comes from
> 
> 	1) the pinned lruvec in reclaim, or
> 
> 	2) from a pre-LRU page during refault (which also holds the
> 	   rcu lock, so would be safe even if the page was on the LRU
> 	   and could move simultaneously to a new lruvec).
> 
> The second one seems a bit tricky. It could be good to add a comment
> to lru_note_cost_page() that explains why its mem_cgroup_page_lruvec()
> is safe.

Thanks for pointing that out. Is the following comment fine?
diff --git a/mm/swap.c b/mm/swap.c
index 9fe5ff9a8111..55ccc93ffb49 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -264,6 +264,13 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
        do {
                unsigned long lrusize;

+               /*
+                * Holding lruvec->lru_lock is safe here, since it comes from
+                * 1) the pinned lruvec in reclaim, or
+                * 2) a pre-LRU page during refault (which also holds the
+                *    rcu lock, so would be safe even if the page was on the LRU
+                *    and could move simultaneously to a new lruvec).
+                */
                spin_lock_irq(&lruvec->lru_lock);
                /* Record cost event */
> 
>> @@ -364,13 +359,13 @@ static inline void activate_page_drain(int cpu)
>>  
>>  static void activate_page(struct page *page)
>>  {
>> -	pg_data_t *pgdat = page_pgdat(page);
>> +	struct lruvec *lruvec;
>>  
>>  	page = compound_head(page);
>> -	spin_lock_irq(&pgdat->lru_lock);
>> +	lruvec = lock_page_lruvec_irq(page);
> 
> I don't see what makes this safe. There is nothing that appears to
> lock out a concurrent page move between memcgs/lruvecs, which means
> the following could manipulate an unlocked lru list:
> 

This function is for !CONFIG_SMP; could the cpu be preempted with an RT kernel?

>>  	if (PageLRU(page))
>> -		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
>> -	spin_unlock_irq(&pgdat->lru_lock);
>> +		__activate_page(page, lruvec);
>> +	unlock_page_lruvec_irq(lruvec);
>>  }
>>  #endif
> 
> Shouldn't this be something like this?
> 
> 	if (TestClearPageLRU()) {
> 		lruvec = lock_page_lruvec_irq(page);
> 		__activate_page(page, lruvec);
> 		unlock_page_lruvec_irq(lruvec);
> 		SetPageLRU(page);
> 	}

But anyway, your new changes are more beautiful and logical. I will change
to this.

Thanks a lot!
Alex


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-02 20:20       ` Johannes Weiner
@ 2020-11-04 11:27         ` Alex Shi
  2020-11-04 17:46           ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-11-04 11:27 UTC (permalink / raw)
  To: Johannes Weiner, Matthew Wilcox
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, lkp,
	linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Vlastimil Babka, Minchan Kim



On 2020/11/3 4:20 AM, Johannes Weiner wrote:
> On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
>> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
>>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
>>>> From: Hugh Dickins <hughd@google.com>
>>>>
>>>> It is necessary for page_idle_get_page() to recheck PageLRU() after
>>>> get_page_unless_zero(), but holding lru_lock around that serves no
>>>> useful purpose, and adds to lru_lock contention: delete it.
>>>>
>>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
>>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
>>>> uses WRITE_ONCE(),
>>>
>>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
>>> missing a dependent patch series?
>>>
>>>> and I see no other risk in page_idle_clear_pte_refs() using
>>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
>>>> not entirely prevented by page_count() check in ksm.c's
>>>> write_protect_page(): that risk being shared with page_referenced()
>>>> and not helped by lru_lock).
>>>
>>> Isn't it possible, as per Minchan's description, for page->mapping to
>>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
>>> thinking it's looking at a struct address_space?
>>
>> I don't think it can point to an anon_vma without the ANON bit set.
>> Minchan's concern in that email was that it might still be NULL.
> 
> Hm, no, the thread is a lengthy discussion about whether the store
> could be split such that page->mapping is actually pointing to
> something invalid (anon_vma without the PageAnon bit).
> 
> From his email:
> 
>         CPU 0                                                                           CPU 1
> 
> do_anonymous_page
>   __page_set_anon_rmap
>   /* out of order happened so SetPageLRU is done ahead */
>   SetPageLRU(page)

This SetPageLRU is done in __pagevec_lru_add_fn(), which is under lru_lock
protection, so the original memory barrier or lock concern isn't
correct. That means the SetPageLRU cannot be reordered to happen there,
and then there is no worry about the 'CPU 1' problem on the right side.

>   /* Compiler changed store operation like below */
>   page->mapping = (struct address_space *) anon_vma;
>   /* Big stall happens */
>                                                                 /* idle tracking judged it as LRU page so passes the page
>                                                                    to page_referenced */
>                                                                 page_referenced
								  (page_referenced should be page_idle_clear_pte_refs_one here?)
>                                                                         page_rmapping returns true because
>                                                                         page->mapping has some value but is not complete,
>                                                                         so it calls rmap_walk_file.
>                                                                         is it okay to pass a non-completed anon page to rmap_walk_file?
> 


For this patch, according to the comments of page_idle_get_page(),
PageLRU is just used to judge whether the page is in use by userspace.
For this purpose, an unguarded PageLRU check is good enough. So this
patch should be fine.

Thanks
Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 00/20] per memcg lru lock
  2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
                   ` (19 preceding siblings ...)
  2020-10-29 10:45 ` [PATCH v20 20/20] mm/lru: revise the comments of lru_lock Alex Shi
@ 2020-11-04 11:55 ` Alex Shi
  2020-11-04 16:59   ` Johannes Weiner
  20 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-11-04 11:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Hi Johannes & all,

Thanks for all the comments and suggestions. Here is a patch based on v20, as a summary of everything you suggested:
Is this OK?

Many thanks!
Alex

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c97292834fa..0fe4172c8c14 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,9 @@
  * Lockless page tracking & accounting
  * Unified hierarchy configuration model
  * Copyright (C) 2015 Red Hat, Inc., Johannes Weiner
+ *
+ * Per memcg lru locking
+ * Copyright (C) 2020 Alibaba, Inc, Alex Shi
  */

 #include <linux/page_counter.h>
@@ -1380,6 +1383,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
        return lruvec;
 }

+/**
+ * lock_page_lruvec - return lruvec for the locked page.
+ * @page: the page
+ *
+ * This series functions should be used in either conditions:
+ * PageLRU is cleared or unset
+ * or, page->_refcount is zero
+ */
 struct lruvec *lock_page_lruvec(struct page *page)
 {
        struct lruvec *lruvec;
diff --git a/mm/swap.c b/mm/swap.c
index 9fe5ff9a8111..bcc814de35c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -264,6 +264,13 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
        do {
                unsigned long lrusize;

+               /*
+                * Holding lruvec->lru_lock is safe here, since it comes from
+                * 1) the pinned lruvec in reclaim, or
+                * 2) a pre-LRU page during refault (which also holds the
+                *    rcu lock, so would be safe even if the page was on the LRU
+                *    and could move simultaneously to a new lruvec).
+                */
                spin_lock_irq(&lruvec->lru_lock);
                /* Record cost event */
                if (file)
@@ -355,10 +362,12 @@ static void activate_page(struct page *page)
        struct lruvec *lruvec;

        page = compound_head(page);
-       lruvec = lock_page_lruvec_irq(page);
-       if (PageLRU(page))
+       if (TestClearPageLRU(page)) {
+               lruvec = lock_page_lruvec_irq(page);
                __activate_page(page, lruvec);
-       unlock_page_lruvec_irq(lruvec);
+               unlock_page_lruvec_irq(lruvec);
+               SetPageLRU(page);
+       }
 }
 #endif

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7ed10ade548d..af03a7f2e1b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1868,6 +1868,10 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
                        continue;
                }

+               /*
+                * All pages were isolated from the same lruvec (and isolation
+                * inhibits memcg migration).
+                */
                VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
                lru = page_lru(page);
                nr_pages = thp_nr_pages(page);



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 00/20] per memcg lru lock
  2020-11-04 11:55 ` [PATCH v20 00/20] per memcg lru lock Alex Shi
@ 2020-11-04 16:59   ` Johannes Weiner
  2020-11-05  5:07     ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-11-04 16:59 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Wed, Nov 04, 2020 at 07:55:39PM +0800, Alex Shi wrote:
> @@ -1380,6 +1383,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>         return lruvec;
>  }
> 
> +/**
> + * lock_page_lruvec - return lruvec for the locked page.

I would say "lock and return the lruvec for a given page"

> + * @page: the page
> + *
> + * This series functions should be used in either conditions:
> + * PageLRU is cleared or unset
> + * or, page->_refcount is zero

or page is locked

The other changes look good to me, thanks!
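
Folding both suggestions in, the kerneldoc could read roughly as follows
(a sketch of the wording only, not necessarily the final text):

	/**
	 * lock_page_lruvec - lock and return the lruvec for a given page.
	 * @page: the page
	 *
	 * These functions are safe to use under any of the following conditions:
	 * - the page is locked
	 * - PageLRU is cleared or not yet set
	 * - page->_refcount is zero
	 */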


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-04 11:27         ` Alex Shi
@ 2020-11-04 17:46           ` Johannes Weiner
  2020-11-05  4:52             ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Johannes Weiner @ 2020-11-04 17:46 UTC (permalink / raw)
  To: Alex Shi
  Cc: Matthew Wilcox, akpm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Vlastimil Babka,
	Minchan Kim

On Wed, Nov 04, 2020 at 07:27:21PM +0800, Alex Shi wrote:
> On 2020/11/3 4:20 AM, Johannes Weiner wrote:
> > On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
> >> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
> >>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> >>>> From: Hugh Dickins <hughd@google.com>
> >>>>
> >>>> It is necessary for page_idle_get_page() to recheck PageLRU() after
> >>>> get_page_unless_zero(), but holding lru_lock around that serves no
> >>>> useful purpose, and adds to lru_lock contention: delete it.
> >>>>
> >>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> >>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
> >>>> uses WRITE_ONCE(),
> >>>
> >>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> >>> missing a dependent patch series?
> >>>
> >>>> and I see no other risk in page_idle_clear_pte_refs() using
> >>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
> >>>> not entirely prevented by page_count() check in ksm.c's
> >>>> write_protect_page(): that risk being shared with page_referenced()
> >>>> and not helped by lru_lock).
> >>>
> >>> Isn't it possible, as per Minchan's description, for page->mapping to
> >>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
> >>> thinking it's looking at a struct address_space?
> >>
> >> I don't think it can point to an anon_vma without the ANON bit set.
> >> Minchan's concern in that email was that it might still be NULL.
> > 
> > Hm, no, the thread is a lengthy discussion about whether the store
> > could be split such that page->mapping is actually pointing to
> > something invalid (anon_vma without the PageAnon bit).
> > 
> > From his email:
> > 
> >         CPU 0                                                                           CPU 1
> > 
> > do_anonymous_page
> >   __page_set_anon_rmap
> >   /* out of order happened so SetPageLRU is done ahead */
> >   SetPageLRU(page)
> 
> This SetPageLRU is done in __pagevec_lru_add_fn(), which is under lru_lock
> protection, so the original memory barrier or lock concern isn't
> correct. That means the SetPageLRU cannot be reordered to happen there,
> and then there is no worry about the 'CPU 1' problem on the right side.

The SetPageLRU is done under lru_lock, but the store to page->mapping
is not, so what ensures ordering between them? And what prevents the
compiler from tearing the store to page->mapping?

The writer does this:

	CPU 0
	page_add_new_anon_rmap()
	  page->mapping = anon_vma + PAGE_MAPPING_ANON
	lru_cache_add_inactive_or_unevictable()
	  spin_lock(lruvec->lock)
	  SetPageLRU()
	  spin_unlock(lruvec->lock)

The concern is what CPU 1 will observe at page->mapping after
observing PageLRU set on the page.

1. anon_vma + PAGE_MAPPING_ANON

   That's the in-order scenario and is fine.

2. NULL

   That's possible if the page->mapping store gets reordered to occur
   after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

   That would be a problem and could lead to all kinds of undesirable
   behavior including crashes and data corruption.

   Is it possible? AFAICT the compiler is allowed to tear the store to
   page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment that's referenced in
the changelog of this patch but I cannot find in any tree or patch.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-04 17:46           ` Johannes Weiner
@ 2020-11-05  4:52             ` Alex Shi
  2020-11-05  4:57               ` Matthew Wilcox
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-11-05  4:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Matthew Wilcox, akpm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Vlastimil Babka,
	Minchan Kim



On 2020/11/5 1:46 AM, Johannes Weiner wrote:
> On Wed, Nov 04, 2020 at 07:27:21PM +0800, Alex Shi wrote:
>> On 2020/11/3 4:20 AM, Johannes Weiner wrote:
>>> On Mon, Nov 02, 2020 at 02:49:27PM +0000, Matthew Wilcox wrote:
>>>> On Mon, Nov 02, 2020 at 09:41:10AM -0500, Johannes Weiner wrote:
>>>>> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
>>>>>> From: Hugh Dickins <hughd@google.com>
>>>>>>
>>>>>> It is necessary for page_idle_get_page() to recheck PageLRU() after
>>>>>> get_page_unless_zero(), but holding lru_lock around that serves no
>>>>>> useful purpose, and adds to lru_lock contention: delete it.
>>>>>>
>>>>>> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
>>>>>> discussion that led to lru_lock there; but __page_set_anon_rmap() now
>>>>>> uses WRITE_ONCE(),
>>>>>
>>>>> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
>>>>> missing a dependent patch series?
>>>>>
>>>>>> and I see no other risk in page_idle_clear_pte_refs() using
>>>>>> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but
>>>>>> not entirely prevented by page_count() check in ksm.c's
>>>>>> write_protect_page(): that risk being shared with page_referenced()
>>>>>> and not helped by lru_lock).
>>>>>
>>>>> Isn't it possible, as per Minchan's description, for page->mapping to
>>>>> point to a struct anon_vma without PAGE_MAPPING_ANON set, and rmap
>>>>> thinking it's looking at a struct address_space?
>>>>
>>>> I don't think it can point to an anon_vma without the ANON bit set.
>>>> Minchan's concern in that email was that it might still be NULL.
>>>
>>> Hm, no, the thread is a lengthy discussion about whether the store
>>> could be split such that page->mapping is actually pointing to
>>> something invalid (anon_vma without the PageAnon bit).
>>>
>>> From his email:
>>>
>>>         CPU 0                                                                           CPU 1
>>>
>>> do_anonymous_page
>>>   __page_set_anon_rmap
>>>   /* out of order happened so SetPageLRU is done ahead */
>>>   SetPageLRU(page)
>>
>> This SetPageLRU is done in __pagevec_lru_add_fn(), which is under lru_lock
>> protection, so the original memory barrier or lock concern isn't
>> correct. That means the SetPageLRU cannot be reordered to happen there,
>> and then there is no worry about the 'CPU 1' problem on the right side.
> 
> The SetPageLRU is done under lru_lock, but the store to page->mapping
> is not, so what ensures ordering between them? And what prevents the
> compiler from tearing the store to page->mapping?
> 

Right, I misunderstood the memory barrier implications of the spin_lock.
Thanks a lot for pointing this out.
So, is the following patch fine to fix the problem?

From 8427121da195a6a386d1ce71abcb41b31211e21f Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Thu, 5 Nov 2020 11:38:24 +0800
Subject: [PATCH] mm/rmap: stop store reordering issue on page->mapping

Hugh Dickins and Minchan Kim observed a long-standing issue that was
discussed here, but the fix mentioned there was actually never applied:
https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
The store reordering may cause a problem in the following scenario:

	CPU 0						CPU1
   do_anonymous_page
	page_add_new_anon_rmap()
	  page->mapping = anon_vma + PAGE_MAPPING_ANON
	lru_cache_add_inactive_or_unevictable()
	  spin_lock(lruvec->lock)
	  SetPageLRU()
	  spin_unlock(lruvec->lock)
						/* idle tracking judged it as an
						 * LRU page so passes the page
						 * to page_idle_clear_pte_refs
						 */
						page_idle_clear_pte_refs
						  rmap_walk
						    if PageAnon(page)

Johannes gave a detailed example of how the store reordering could cause
trouble:
The concern is that the SetPageLRU may get reordered before the
'page->mapping' store; the question is then what CPU 1 will observe at
page->mapping after observing PageLRU set on the page.

1. anon_vma + PAGE_MAPPING_ANON

   That's the in-order scenario and is fine.

2. NULL

   That's possible if the page->mapping store gets reordered to occur
   after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

   That would be a problem and could lead to all kinds of undesirable
   behavior including crashes and data corruption.

   Is it possible? AFAICT the compiler is allowed to tear the store to
   page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/rmap.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c050dab2ae65..56af18aa57de 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
 	if (!exclusive)
 		anon_vma = anon_vma->root;
 
+	/*
+	 * w/o the WRITE_ONCE here the following scenario may happens due to
+	 * store reordering.
+	 *
+	 *      CPU 0                                          CPU 1
+	 *
+	 * do_anonymous_page				page_idle_clear_pte_refs
+	 *   __page_set_anon_rmap
+	 *     page->mapping = anon_vma + PAGE_MAPPING_ANON
+	 *   lru_cache_add_inactive_or_unevictable()
+	 *     SetPageLRU(page)
+	 *                                               rmap_walk
+	 *                                                if PageAnon(page)
+	 *
+	 *  The 'SetPageLRU' may reordered before page->mapping setting, and
+	 *  page->mapping may set with anon_vma, w/o anon bit, then rmap_walk
+	 *  may goes to rmap_walk_file() for a anon page.
+	 */
+
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	page->mapping = (struct address_space *) anon_vma;
+	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 	page->index = linear_page_index(vma, address);
 }
 
-- 
1.8.3.1


> The writer does this:
> 
> 	CPU 0
> 	page_add_new_anon_rmap()
> 	  page->mapping = anon_vma + PAGE_MAPPING_ANON
> 	lru_cache_add_inactive_or_unevictable()
> 	  spin_lock(lruvec->lock)
> 	  SetPageLRU()
> 	  spin_unlock(lruvec->lock)
> 
> The concern is what CPU 1 will observe at page->mapping after
> observing PageLRU set on the page.
> 
> 1. anon_vma + PAGE_MAPPING_ANON
> 
>    That's the in-order scenario and is fine.
> 
> 2. NULL
> 
>    That's possible if the page->mapping store gets reordered to occur
>    after SetPageLRU. That's fine too because we check for it.
> 
> 3. anon_vma without the PAGE_MAPPING_ANON bit
> 
>    That would be a problem and could lead to all kinds of undesirable
>    behavior including crashes and data corruption.
> 
>    Is it possible? AFAICT the compiler is allowed to tear the store to
>    page->mapping and I don't see anything that would prevent it.
> 
> That said, I also don't see how the reader testing PageLRU under the
> lru_lock would prevent that in the first place. AFAICT we need that
> WRITE_ONCE() around the page->mapping assignment that's referenced in
> the changelog of this patch but I cannot find in any tree or patch.
> 


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-05  4:52             ` Alex Shi
@ 2020-11-05  4:57               ` Matthew Wilcox
  2020-11-05  5:03                 ` Alex Shi
  0 siblings, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2020-11-05  4:57 UTC (permalink / raw)
  To: Alex Shi
  Cc: Johannes Weiner, akpm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Vlastimil Babka,
	Minchan Kim

On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote:
> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
>  	if (!exclusive)
>  		anon_vma = anon_vma->root;
>  
> +	/*
> +	 * w/o the WRITE_ONCE here the following scenario may happens due to
> +	 * store reordering.
> +	 *
> +	 *      CPU 0                                          CPU 1
> +	 *
> +	 * do_anonymous_page				page_idle_clear_pte_refs
> +	 *   __page_set_anon_rmap
> +	 *     page->mapping = anon_vma + PAGE_MAPPING_ANON
> +	 *   lru_cache_add_inactive_or_unevictable()
> +	 *     SetPageLRU(page)
> +	 *                                               rmap_walk
> +	 *                                                if PageAnon(page)
> +	 *
> +	 *  The 'SetPageLRU' may reordered before page->mapping setting, and
> +	 *  page->mapping may set with anon_vma, w/o anon bit, then rmap_walk
> +	 *  may goes to rmap_walk_file() for a anon page.
> +	 */
> +
>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> -	page->mapping = (struct address_space *) anon_vma;
> +	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
>  	page->index = linear_page_index(vma, address);
>  }

I don't like these verbose comments with detailed descriptions in
the source code.  They're fine in changelogs, but they clutter the
code, and they get outdated really quickly.  My preference is for
something more brief:

	/*
	 * Prevent page->mapping from pointing to an anon_vma without
	 * the PAGE_MAPPING_ANON bit set.  This could happen if the
	 * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
	 */



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-05  4:57               ` Matthew Wilcox
@ 2020-11-05  5:03                 ` Alex Shi
  2020-11-05 15:36                   ` Johannes Weiner
  0 siblings, 1 reply; 67+ messages in thread
From: Alex Shi @ 2020-11-05  5:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Johannes Weiner, akpm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Vlastimil Babka,
	Minchan Kim



On 2020/11/5 12:57 PM, Matthew Wilcox wrote:
> On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote:
>> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
>>  	if (!exclusive)
>>  		anon_vma = anon_vma->root;
>>  
>> +	/*
>> +	 * w/o the WRITE_ONCE here the following scenario may happens due to
>> +	 * store reordering.
>> +	 *
>> +	 *      CPU 0                                          CPU 1
>> +	 *
>> +	 * do_anonymous_page				page_idle_clear_pte_refs
>> +	 *   __page_set_anon_rmap
>> +	 *     page->mapping = anon_vma + PAGE_MAPPING_ANON
>> +	 *   lru_cache_add_inactive_or_unevictable()
>> +	 *     SetPageLRU(page)
>> +	 *                                               rmap_walk
>> +	 *                                                if PageAnon(page)
>> +	 *
>> +	 *  The 'SetPageLRU' may reordered before page->mapping setting, and
>> +	 *  page->mapping may set with anon_vma, w/o anon bit, then rmap_walk
>> +	 *  may goes to rmap_walk_file() for a anon page.
>> +	 */
>> +
>>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
>> -	page->mapping = (struct address_space *) anon_vma;
>> +	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
>>  	page->index = linear_page_index(vma, address);
>>  }
> 
> I don't like these verbose comments with detailed descriptions in
> the source code.  They're fine in changelogs, but they clutter the
> code, and they get outdated really quickly.  My preference is for
> something more brief:
> 
> 	/*
> 	 * Prevent page->mapping from pointing to an anon_vma without
> 	 * the PAGE_MAPPING_ANON bit set.  This could happen if the
> 	 * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
> 	 */
> 

Yes, it's reasonable. So is the following fine?

From f166f0d5df350c5eae1218456b9e6e1bd43434e7 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Thu, 5 Nov 2020 11:38:24 +0800
Subject: [PATCH] mm/rmap: stop store reordering issue on page->mapping

Hugh Dickins and Minchan Kim observed a long-standing issue that was
discussed here, but the fix mentioned there was actually never applied:
https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/
The store reordering may cause a problem in the following scenario:

	CPU 0						CPU1
   do_anonymous_page
	page_add_new_anon_rmap()
	  page->mapping = anon_vma + PAGE_MAPPING_ANON
	lru_cache_add_inactive_or_unevictable()
	  spin_lock(lruvec->lock)
	  SetPageLRU()
	  spin_unlock(lruvec->lock)
						/* idle tracking judged it as an
						 * LRU page so passes the page
						 * to page_idle_clear_pte_refs
						 */
						page_idle_clear_pte_refs
						  rmap_walk
						    if PageAnon(page)

Johannes gave a detailed example of how the store reordering could cause
trouble:
The concern is that the SetPageLRU may get reordered before the
'page->mapping' store; the question is then what CPU 1 will observe at
page->mapping after observing PageLRU set on the page.

1. anon_vma + PAGE_MAPPING_ANON

   That's the in-order scenario and is fine.

2. NULL

   That's possible if the page->mapping store gets reordered to occur
   after SetPageLRU. That's fine too because we check for it.

3. anon_vma without the PAGE_MAPPING_ANON bit

   That would be a problem and could lead to all kinds of undesirable
   behavior including crashes and data corruption.

   Is it possible? AFAICT the compiler is allowed to tear the store to
   page->mapping and I don't see anything that would prevent it.

That said, I also don't see how the reader testing PageLRU under the
lru_lock would prevent that in the first place. AFAICT we need that
WRITE_ONCE() around the page->mapping assignment.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/rmap.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c050dab2ae65..73788505aa0a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1054,8 +1054,13 @@ static void __page_set_anon_rmap(struct page *page,
 	if (!exclusive)
 		anon_vma = anon_vma->root;
 
+	/*
+	 * Prevent page->mapping from pointing to an anon_vma without
+	 * the PAGE_MAPPING_ANON bit set.  This could happen if the
+	 * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
+	 */
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	page->mapping = (struct address_space *) anon_vma;
+	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 	page->index = linear_page_index(vma, address);
 }
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 00/20] per memcg lru lock
  2020-11-04 16:59   ` Johannes Weiner
@ 2020-11-05  5:07     ` Alex Shi
  0 siblings, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-11-05  5:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/11/5 12:59 AM, Johannes Weiner wrote:
> On Wed, Nov 04, 2020 at 07:55:39PM +0800, Alex Shi wrote:
>> @@ -1380,6 +1383,14 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>>         return lruvec;
>>  }
>>
>> +/**
>> + * lock_page_lruvec - return lruvec for the locked page.
> 
> I would say "lock and return the lruvec for a given page"
> 
>> + * @page: the page
>> + *
>> + * This series functions should be used in either conditions:
>> + * PageLRU is cleared or unset
>> + * or, page->_refcount is zero
> 
> or page is locked
> 
> The other changes look good to me, thanks!
> 

Thanks a lot for both comments!
I will pick them up and send them out in v21.

Thanks!
Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-05  5:03                 ` Alex Shi
@ 2020-11-05 15:36                   ` Johannes Weiner
  2020-11-05 15:43                     ` Matthew Wilcox
  2020-11-06  1:11                     ` Alex Shi
  0 siblings, 2 replies; 67+ messages in thread
From: Johannes Weiner @ 2020-11-05 15:36 UTC (permalink / raw)
  To: Alex Shi
  Cc: Matthew Wilcox, akpm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Vlastimil Babka,
	Minchan Kim

On Thu, Nov 05, 2020 at 01:03:18PM +0800, Alex Shi wrote:
> 
> 
> On 2020/11/5 12:57 PM, Matthew Wilcox wrote:
> > On Thu, Nov 05, 2020 at 12:52:05PM +0800, Alex Shi wrote:
> >> @@ -1054,8 +1054,27 @@ static void __page_set_anon_rmap(struct page *page,
> >>  	if (!exclusive)
> >>  		anon_vma = anon_vma->root;
> >>  
> >> +	/*
> >> +	 * w/o the WRITE_ONCE here the following scenario may happens due to
> >> +	 * store reordering.
> >> +	 *
> >> +	 *      CPU 0                                          CPU 1
> >> +	 *
> >> +	 * do_anonymous_page				page_idle_clear_pte_refs
> >> +	 *   __page_set_anon_rmap
> >> +	 *     page->mapping = anon_vma + PAGE_MAPPING_ANON
> >> +	 *   lru_cache_add_inactive_or_unevictable()
> >> +	 *     SetPageLRU(page)
> >> +	 *                                               rmap_walk
> >> +	 *                                                if PageAnon(page)
> >> +	 *
> >> +	 *  The 'SetPageLRU' may reordered before page->mapping setting, and
> >> +	 *  page->mapping may set with anon_vma, w/o anon bit, then rmap_walk
> >> +	 *  may goes to rmap_walk_file() for a anon page.
> >> +	 */
> >> +
> >>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> >> -	page->mapping = (struct address_space *) anon_vma;
> >> +	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
> >>  	page->index = linear_page_index(vma, address);
> >>  }
> > 
> > I don't like these verbose comments with detailed descriptions in
> > the source code.  They're fine in changelogs, but they clutter the
> > code, and they get outdated really quickly.  My preference is for
> > something more brief:
> > 
> > 	/*
> > 	 * Prevent page->mapping from pointing to an anon_vma without
> > 	 * the PAGE_MAPPING_ANON bit set.  This could happen if the
> > 	 * compiler stores anon_vma and then adds PAGE_MAPPING_ANON to it.
> > 	 */
> > 

Yeah, I don't think this scenario warrants the full race diagram in
the code itself.

But the code is highly specific - synchronizing one struct page member
for one particular use case. Let's keep at least a reference to what
we are synchronizing against. There is a non-zero chance that if the
comment goes out of date, so does the code. How about this?

	/*
	 * page_idle does a lockless/optimistic rmap scan on page->mapping.
	 * Make sure the compiler doesn't split the stores of anon_vma and
	 * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
	 * could mistake the mapping for a struct address_space and crash.
	 */


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-05 15:36                   ` Johannes Weiner
@ 2020-11-05 15:43                     ` Matthew Wilcox
  2020-11-06  1:11                     ` Alex Shi
  1 sibling, 0 replies; 67+ messages in thread
From: Matthew Wilcox @ 2020-11-05 15:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Vlastimil Babka, Minchan Kim

On Thu, Nov 05, 2020 at 10:36:49AM -0500, Johannes Weiner wrote:
> But the code is highly specific - synchronizing one struct page member
> for one particular use case. Let's keep at least a reference to what
> we are synchronizing against. There is a non-zero chance that if the
> comment goes out of date, so does the code. How about this?
> 
> 	/*
> 	 * page_idle does a lockless/optimistic rmap scan on page->mapping.
> 	 * Make sure the compiler doesn't split the stores of anon_vma and
> 	 * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
> 	 * could mistake the mapping for a struct address_space and crash.
> 	 */

Fine by me!  There may be other cases where seeing a split store would
be bad, so I didn't want to call out page_idle explicitly.  But if you
want to, I'm happy with this comment.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-05 15:36                   ` Johannes Weiner
  2020-11-05 15:43                     ` Matthew Wilcox
@ 2020-11-06  1:11                     ` Alex Shi
  1 sibling, 0 replies; 67+ messages in thread
From: Alex Shi @ 2020-11-06  1:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Matthew Wilcox, akpm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Vlastimil Babka,
	Minchan Kim



On 2020/11/5 11:36 PM, Johannes Weiner wrote:
>>> 	 */
>>>
> Yeah, I don't think this scenario warrants the full race diagram in
> the code itself.
> 
> But the code is highly specific - synchronizing one struct page member
> for one particular use case. Let's keep at least a reference to what
> we are synchronizing against. There is a non-zero chance that if the
> comment goes out of date, so does the code. How about this?
> 
> 	/*
> 	 * page_idle does a lockless/optimistic rmap scan on page->mapping.
> 	 * Make sure the compiler doesn't split the stores of anon_vma and
> 	 * the PAGE_MAPPING_ANON type identifier, otherwise the rmap code
> 	 * could mistake the mapping for a struct address_space and crash.
> 	 */

Thanks a lot to you all. I will update this in the v21 patch.

Alex


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock
  2020-11-02 14:41   ` Johannes Weiner
  2020-11-02 14:49     ` Matthew Wilcox
@ 2020-11-11  7:27     ` Hugh Dickins
  1 sibling, 0 replies; 67+ messages in thread
From: Hugh Dickins @ 2020-11-11  7:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	willy, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Vlastimil Babka,
	Minchan Kim

On Mon, 2 Nov 2020, Johannes Weiner wrote:
> On Thu, Oct 29, 2020 at 06:44:53PM +0800, Alex Shi wrote:
> > From: Hugh Dickins <hughd@google.com>
> > 
> > It is necessary for page_idle_get_page() to recheck PageLRU() after
> > get_page_unless_zero(), but holding lru_lock around that serves no
> > useful purpose, and adds to lru_lock contention: delete it.
> > 
> > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> > discussion that led to lru_lock there; but __page_set_anon_rmap() now
> > uses WRITE_ONCE(),
> 
> That doesn't seem to be the case in Linus's or Andrew's tree. Am I
> missing a dependent patch series?

Sorry, I was out of action, then slower than ever, for a while.

Many thanks for calling out my falsehood there, Johannes.

What led me to write that?  It has baffled me, but at last I see:
this patch to page_idle_get_page() was 0002 in my lru_lock patchset
against v5.3 last year, and 0001 was the patch which made it true.
Then when I checked against mainline, I must have got confused by
the similar WRITE_ONCE in page_move_anon_rmap().

Appended below, but not rediffed, and let's not hold up Alex's set
for the rest of it: it is all theoretical until the kernel gets to
be built with a suitably malicious compiler; but I'll follow up
with a fresh version of the below after his set is safely in.

From a1abcbc2aac70c6ba47b8991992bb85b86b4a160 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hughd@google.com>
Date: Thu, 22 Aug 2019 15:49:44 -0700
Subject: [PATCH 1/9] mm: more WRITE_ONCE and READ_ONCE on page->mapping

v4.2 commit 414e2fb8ce5a ("rmap: fix theoretical race between do_wp_page
and shrink_active_list") added a WRITE_ONCE() where page_move_anon_rmap()
composes page->mapping from anon_vma pointer and PAGE_MAPPING_ANON.

Now do the same where __page_set_anon_rmap() does the same, and where
compaction.c applies PAGE_MAPPING_MOVABLE, and ksm.c PAGE_MAPPING_KSM.

rmap.c already uses READ_ONCE(page->mapping), but util.c should too:
add READ_ONCE() in page_rmapping(), page_anon_vma() and page_mapping().
Delete the then unused helper __page_rmapping().

I doubt that this commit fixes anything, but it's harmless and
unintrusive, and makes reasoning about page mapping flags easier.

What if a compiler implements "page->mapping = mapping" in other places
by, say, first assigning the odd bits of mapping, then adding in the
even bits?  Then we shall not build the kernel with such a compiler.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
---
 mm/compaction.c |  7 ++++---
 mm/ksm.c        |  2 +-
 mm/rmap.c       |  7 ++++++-
 mm/util.c       | 24 ++++++++++--------------
 4 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..c405f4362624 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -113,7 +113,8 @@ void __SetPageMovable(struct page *page, struct address_space *mapping)
 {
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page);
-	page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE);
+	WRITE_ONCE(page->mapping,
+		   (unsigned long)mapping | PAGE_MAPPING_MOVABLE);
 }
 EXPORT_SYMBOL(__SetPageMovable);
 
@@ -126,8 +127,8 @@ void __ClearPageMovable(struct page *page)
 	 * flag so that VM can catch up released page by driver after isolation.
 	 * With it, VM migration doesn't try to put it back.
 	 */
-	page->mapping = (void *)((unsigned long)page->mapping &
-				PAGE_MAPPING_MOVABLE);
+	WRITE_ONCE(page->mapping,
+		   (unsigned long)page->mapping & PAGE_MAPPING_MOVABLE);
 }
 EXPORT_SYMBOL(__ClearPageMovable);
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 3dc4346411e4..426b6a40ea41 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -865,7 +865,7 @@ static inline struct stable_node *page_stable_node(struct page *page)
 static inline void set_page_stable_node(struct page *page,
 					struct stable_node *stable_node)
 {
-	page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
+	WRITE_ONCE(page->mapping, (unsigned long)stable_node | PAGE_MAPPING_KSM);
 }
 
 #ifdef CONFIG_SYSFS
diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..9480df437edc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1044,7 +1044,12 @@ static void __page_set_anon_rmap(struct page *page,
 		anon_vma = anon_vma->root;
 
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	page->mapping = (struct address_space *) anon_vma;
+	/*
+	 * Ensure that anon_vma and the PAGE_MAPPING_ANON bit are written
+	 * simultaneously, so a concurrent reader (eg page_referenced()'s
+	 * PageAnon()) will not see one without the other.
+	 */
+	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 	page->index = linear_page_index(vma, address);
 }
 
diff --git a/mm/util.c b/mm/util.c
index e6351a80f248..09b9fcbedac3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -489,21 +489,14 @@ void kvfree(const void *addr)
 }
 EXPORT_SYMBOL(kvfree);
 
-static inline void *__page_rmapping(struct page *page)
-{
-	unsigned long mapping;
-
-	mapping = (unsigned long)page->mapping;
-	mapping &= ~PAGE_MAPPING_FLAGS;
-
-	return (void *)mapping;
-}
-
 /* Neutral page->mapping pointer to address_space or anon_vma or other */
 void *page_rmapping(struct page *page)
 {
+	unsigned long mapping;
+
 	page = compound_head(page);
-	return __page_rmapping(page);
+	mapping = (unsigned long)READ_ONCE(page->mapping);
+	return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
 }
 
 /*
@@ -534,10 +527,11 @@ struct anon_vma *page_anon_vma(struct page *page)
 	unsigned long mapping;
 
 	page = compound_head(page);
-	mapping = (unsigned long)page->mapping;
+	mapping = (unsigned long)READ_ONCE(page->mapping);
+	/* Return NULL if file or PageMovable or PageKsm */
 	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		return NULL;
-	return __page_rmapping(page);
+	return (struct anon_vma *)(mapping & ~PAGE_MAPPING_FLAGS);
 }
 
 struct address_space *page_mapping(struct page *page)
@@ -557,10 +551,12 @@ struct address_space *page_mapping(struct page *page)
 		return swap_address_space(entry);
 	}
 
-	mapping = page->mapping;
+	mapping = READ_ONCE(page->mapping);
+	/* Return NULL if PageAnon (including PageKsm) */
 	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
 		return NULL;
 
+	/* Return struct address_space pointer if file or PageMovable */
 	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
 }
 EXPORT_SYMBOL(page_mapping);
-- 
2.23.0.187.g17f5b7556c-goog
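
As a footnote on the encoding the util.c hunks above decode: the low two bits
of page->mapping select what the remaining bits point to. A standalone sketch
(flag values as defined in page-flags.h, to the best of my reading; the sample
pointer values are made up):

#include <stdio.h>

#define PAGE_MAPPING_ANON	0x1UL
#define PAGE_MAPPING_MOVABLE	0x2UL
#define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

static const char *classify(unsigned long mapping)
{
	switch (mapping & PAGE_MAPPING_FLAGS) {
	case 0:
		return "file (struct address_space *)";
	case PAGE_MAPPING_ANON:
		return "anon (struct anon_vma *)";
	case PAGE_MAPPING_MOVABLE:
		return "movable (driver's address_space *)";
	default:
		return "ksm (struct stable_node *)";
	}
}

int main(void)
{
	unsigned long samples[] = {
		0x1000,				/* file-backed */
		0x2000 | PAGE_MAPPING_ANON,	/* anonymous */
		0x3000 | PAGE_MAPPING_MOVABLE,	/* movable */
		0x4000 | PAGE_MAPPING_KSM,	/* KSM */
	};
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("%#06lx is %s, payload %#lx\n", samples[i],
		       classify(samples[i]),
		       samples[i] & ~PAGE_MAPPING_FLAGS);
	return 0;
}

page_anon_vma() above returns NULL unless the low bits are exactly
PAGE_MAPPING_ANON (so KSM pages are excluded), and page_mapping() returns NULL
whenever PAGE_MAPPING_ANON is set (KSM included) -- the READ_ONCE() just makes
sure each helper bases that decision on a single, untorn sample of the field.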



Thread overview: 67+ messages
2020-10-29 10:44 [PATCH v20 00/20] per memcg lru lock Alex Shi
2020-10-29 10:44 ` [PATCH v20 01/20] mm/memcg: warning on !memcg after readahead page charged Alex Shi
2020-10-29 13:43   ` Johannes Weiner
2020-10-29 10:44 ` [PATCH v20 02/20] mm/memcg: bail early from swap accounting if memcg disabled Alex Shi
2020-10-29 13:46   ` Johannes Weiner
2020-10-30  2:27     ` Alex Shi
2020-10-30 14:04       ` Johannes Weiner
2020-10-31  1:13         ` Alex Shi
2020-10-29 10:44 ` [PATCH v20 03/20] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
2020-10-29 13:47   ` Johannes Weiner
2020-10-29 10:44 ` [PATCH v20 04/20] mm/thp: use head for head page in lru_add_page_tail Alex Shi
2020-10-29 13:50   ` Johannes Weiner
2020-10-30  2:46     ` Alex Shi
2020-10-30 13:52       ` Johannes Weiner
2020-10-31  1:14         ` Alex Shi
2020-11-02 16:03       ` Matthew Wilcox
2020-11-03  2:43         ` Alex Shi
2020-10-29 10:44 ` [PATCH v20 05/20] mm/thp: Simplify lru_add_page_tail() Alex Shi
2020-10-29 14:00   ` Johannes Weiner
2020-10-30  2:48   ` Alex Shi
2020-10-29 10:44 ` [PATCH v20 06/20] mm/thp: narrow lru locking Alex Shi
2020-10-29 10:44 ` [PATCH v20 07/20] mm/vmscan: remove unnecessary lruvec adding Alex Shi
2020-11-02 14:20   ` Johannes Weiner
2020-10-29 10:44 ` [PATCH v20 08/20] mm: page_idle_get_page() does not need lru_lock Alex Shi
2020-11-02 14:41   ` Johannes Weiner
2020-11-02 14:49     ` Matthew Wilcox
2020-11-02 20:20       ` Johannes Weiner
2020-11-04 11:27         ` Alex Shi
2020-11-04 17:46           ` Johannes Weiner
2020-11-05  4:52             ` Alex Shi
2020-11-05  4:57               ` Matthew Wilcox
2020-11-05  5:03                 ` Alex Shi
2020-11-05 15:36                   ` Johannes Weiner
2020-11-05 15:43                     ` Matthew Wilcox
2020-11-06  1:11                     ` Alex Shi
2020-11-11  7:27     ` Hugh Dickins
2020-10-29 10:44 ` [PATCH v20 09/20] mm/memcg: add debug checking in lock_page_memcg Alex Shi
2020-11-02 14:45   ` Johannes Weiner
2020-10-29 10:44 ` [PATCH v20 10/20] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
2020-11-02 14:48   ` Johannes Weiner
2020-10-29 10:44 ` [PATCH v20 11/20] mm/lru: move lock into lru_note_cost Alex Shi
2020-10-29 13:42   ` Johannes Weiner
2020-10-29 10:44 ` [PATCH v20 12/20] mm/vmscan: remove lruvec reget in move_pages_to_lru Alex Shi
2020-11-02 14:52   ` Johannes Weiner
2020-11-03  2:51     ` Alex Shi
2020-10-29 10:44 ` [PATCH v20 13/20] mm/mlock: remove lru_lock on TestClearPageMlocked Alex Shi
2020-11-02 14:55   ` Johannes Weiner
2020-10-29 10:44 ` [PATCH v20 14/20] mm/mlock: remove __munlock_isolate_lru_page Alex Shi
2020-11-02 14:56   ` Johannes Weiner
2020-10-29 10:45 ` [PATCH v20 15/20] mm/lru: introduce TestClearPageLRU Alex Shi
2020-11-02 15:10   ` Johannes Weiner
2020-11-03  3:02     ` Alex Shi
2020-10-29 10:45 ` [PATCH v20 16/20] mm/compaction: do page isolation first in compaction Alex Shi
2020-11-02 15:18   ` Johannes Weiner
2020-10-29 10:45 ` [PATCH v20 17/20] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn Alex Shi
2020-11-02 15:20   ` Johannes Weiner
2020-10-29 10:45 ` [PATCH v20 18/20] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
2020-10-30  2:49   ` Alex Shi
2020-11-02 20:41     ` Johannes Weiner
2020-11-03  4:58       ` Alex Shi
2020-10-29 10:45 ` [PATCH v20 19/20] mm/lru: introduce the relock_page_lruvec function Alex Shi
2020-11-02 20:44   ` Johannes Weiner
2020-10-29 10:45 ` [PATCH v20 20/20] mm/lru: revise the comments of lru_lock Alex Shi
2020-11-02 20:46   ` Johannes Weiner
2020-11-04 11:55 ` [PATCH v20 00/20] per memcg lru lock Alex Shi
2020-11-04 16:59   ` Johannes Weiner
2020-11-05  5:07     ` Alex Shi
