* [PATCH v18 00/32] per memcg lru_lock
@ 2020-08-24 12:54 Alex Shi
  2020-08-24 12:54 ` [PATCH v18 01/32] mm/memcg: warning on !memcg after readahead page charged Alex Shi
                   ` (32 more replies)
  0 siblings, 33 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

This new version is based on v5.9-rc2. The first 6 patches have already been
picked into linux-mm, and patches 25-32 are new additions that do some further
post optimization.

The patchset consists of 4 parts:
1, some code cleanup and minimal optimization as preparation.	    patches 1-15
2, use TestClearPageLRU as page isolation's precondition.	    patches 16-19
3, replace the per-node lru_lock with per-memcg per-node locks.    patch 20
4, some post optimization.					    patches 21-32

The current lru_lock is a single per-node lock, pgdat->lru_lock, guarding the
LRU lists, even though the LRU lists themselves were moved into memcg a long
time ago. Keeping a per-node lru_lock is clearly unscalable: pages in all the
memcgs on a node have to compete with each other for one lock. This patchset
replaces the per-node lru lock with a per-lruvec (per-memcg, per-node)
lru_lock to guard the LRU lists, making them scalable across memcgs and
gaining performance.
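
As a rough before/after sketch of the locking change (a minimal illustration
only; lock_page_lruvec_irq()/unlock_page_lruvec_irq() are the per-lruvec
helpers that patch 20 of this series introduces, so treat those names as
assumed, not the exact final code):

	/* Before: one lock per node */
	spin_lock_irq(&pgdat->lru_lock);
	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	add_page_to_lru_list(page, lruvec, page_lru(page));
	spin_unlock_irq(&pgdat->lru_lock);

	/* After: one lock per lruvec, i.e. per memcg per node */
	lruvec = lock_page_lruvec_irq(page);
	add_page_to_lru_list(page, lruvec, page_lru(page));
	unlock_page_lruvec_irq(lruvec);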

Currently lru_lock guards both the LRU list and the page's LRU bit; that's
fine. But if we want to take a specific lruvec lock for a page, we need to
pin down the page's lruvec/memcg while locking, and just taking the lruvec
lock first can be undermined by a concurrent memcg charge or migration of the
page. To fix this, we pull out the clearing of the page's LRU bit and use it
as the pinning action that blocks memcg changes. That's the reason for the
new atomic helper TestClearPageLRU. Isolating a page now requires both
actions: TestClearPageLRU plus holding the lru_lock.

The typical user of this is isolate_migratepages_block() in mm/compaction.c:
we have to clear the LRU bit before taking the lru lock, which serializes
page isolation against memcg charge/migration, either of which would change
the page's lruvec and hence which lru_lock applies.
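
A minimal sketch of the resulting isolation rule (illustrative only: the
function name is made up, and lock_page_lruvec_irq()/unlock_page_lruvec_irq()
are the per-lruvec helpers introduced later in this series):

	static bool sketch_isolate_lru_page(struct page *page)
	{
		struct lruvec *lruvec;

		/* Clearing the LRU bit pins the page's lruvec/memcg... */
		if (!TestClearPageLRU(page))
			return false;	/* already isolated by someone else */

		/* ...so the page's lruvec is stable while we hold its lock */
		lruvec = lock_page_lruvec_irq(page);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		unlock_page_lruvec_irq(lruvec);

		return true;
	}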

The above solution was suggested by Johannes Weiner and builds on his new
memcg charge path, which made this patchset possible. (Hugh Dickins tested it
and contributed much code, from compaction fixes to general polish, thanks a
lot!)

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104 containers
on a 2-socket * 26-core * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
With this patchset, readtwice performance increased by about 80% with
concurrent containers.

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed this idea
8 years ago, and to the others who gave comments as well: Daniel Jordan,
Mel Gorman, Shakeel Butt, Matthew Wilcox, etc.

Thanks for the testing support from Intel 0day, Rong Chen, Fengguang Wu, and
Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

Alex Shi (23):
  mm/memcg: warning on !memcg after readahead page charged
  mm/memcg: bail out early from swap accounting when memcg is disabled
  mm/thp: move lru_add_page_tail func to huge_memory.c
  mm/thp: clean up lru_add_page_tail
  mm/thp: remove code path which never got into
  mm/thp: narrow lru locking
  mm/swap.c: stop deactivate_file_page if page not on lru
  mm/vmscan: remove unnecessary lruvec adding
  mm/page_idle: no unlikely double check for idle page counting
  mm/compaction: rename compact_deferred as compact_should_defer
  mm/memcg: add debug checking in lock_page_memcg
  mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  mm/lru: move lru_lock holding in func lru_note_cost_page
  mm/lru: move lock into lru_note_cost
  mm/lru: introduce TestClearPageLRU
  mm/compaction: do page isolation first in compaction
  mm/thp: add tail pages into lru anyway in split_huge_page()
  mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/pgdat: remove pgdat lru_lock
  mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
  mm/mlock: remove __munlock_isolate_lru_page
  mm/swap.c: optimizing __pagevec_lru_add lru_lock

Alexander Duyck (6):
  mm/lru: introduce the relock_page_lruvec function
  mm/compaction: Drop locked from isolate_migratepages_block
  mm: Identify compound pages sooner in isolate_migratepages_block
  mm: Drop use of test_and_set_skip in favor of just setting skip
  mm: Add explicit page decrement in exception path for
    isolate_lru_pages
  mm: Split release_pages work into 3 passes

Hugh Dickins (3):
  mm/memcg: optimize mem_cgroup_page_lruvec
  mm/vmscan: use relock for move_pages_to_lru
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
 Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +-
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 +-
 include/linux/compaction.h                         |   4 +-
 include/linux/memcontrol.h                         | 110 ++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmdebug.h                            |  13 +
 include/linux/mmzone.h                             |   6 +-
 include/linux/page-flags.h                         |   1 +
 include/linux/swap.h                               |   4 +-
 include/trace/events/compaction.h                  |   2 +-
 mm/compaction.c                                    | 166 +++++------
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  48 +++-
 mm/memcontrol.c                                    |  92 +++++-
 mm/mlock.c                                         |  76 ++---
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   8 -
 mm/rmap.c                                          |   4 +-
 mm/swap.c                                          | 307 +++++++++++----------
 mm/vmscan.c                                        | 178 ++++++------
 mm/workingset.c                                    |   2 -
 24 files changed, 646 insertions(+), 443 deletions(-)

-- 
1.8.3.1




* [PATCH v18 01/32] mm/memcg: warning on !memcg after readahead page charged
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 02/32] mm/memcg: bail out early from swap accounting when memcg is disabled Alex Shi
                   ` (31 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

Since readahead pages are now charged to a memcg too, in theory we no longer
need to check for this exception. Before safely removing all of these checks,
add a warning for the unexpected !memcg case.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmdebug.h | 13 +++++++++++++
 mm/memcontrol.c         | 15 ++++++++-------
 2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h
index 2ad72d2c8cc5..4ed52879ce55 100644
--- a/include/linux/mmdebug.h
+++ b/include/linux/mmdebug.h
@@ -37,6 +37,18 @@
 			BUG();						\
 		}							\
 	} while (0)
+#define VM_WARN_ON_ONCE_PAGE(cond, page)	({			\
+	static bool __section(.data.once) __warned;			\
+	int __ret_warn_once = !!(cond);					\
+									\
+	if (unlikely(__ret_warn_once && !__warned)) {			\
+		dump_page(page, "VM_WARN_ON_ONCE_PAGE(" __stringify(cond)")");\
+		__warned = true;					\
+		WARN_ON(1);						\
+	}								\
+	unlikely(__ret_warn_once);					\
+})
+
 #define VM_WARN_ON(cond) (void)WARN_ON(cond)
 #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
 #define VM_WARN_ONCE(cond, format...) (void)WARN_ONCE(cond, format)
@@ -48,6 +60,7 @@
 #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond)
 #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
+#define VM_WARN_ON_ONCE_PAGE(cond, page)  BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond)
 #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond)
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b807952b4d43..ffdc622e5828 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1322,10 +1322,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	}
 
 	memcg = page->mem_cgroup;
-	/*
-	 * Swapcache readahead pages are added to the LRU - and
-	 * possibly migrated - before they are charged.
-	 */
+	/* Readahead page is charged too, to see if other page uncharged */
+	VM_WARN_ON_ONCE_PAGE(!memcg, page);
 	if (!memcg)
 		memcg = root_mem_cgroup;
 
@@ -6906,8 +6904,9 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	if (newpage->mem_cgroup)
 		return;
 
-	/* Swapcache readahead pages can get replaced before being charged */
 	memcg = oldpage->mem_cgroup;
+	/* Readahead page is charged too, to see if other page uncharged */
+	VM_WARN_ON_ONCE_PAGE(!memcg, oldpage);
 	if (!memcg)
 		return;
 
@@ -7104,7 +7103,8 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
 	memcg = page->mem_cgroup;
 
-	/* Readahead page, never charged */
+	/* Readahead page is charged too, to see if other page uncharged */
+	VM_WARN_ON_ONCE_PAGE(!memcg, page);
 	if (!memcg)
 		return;
 
@@ -7168,7 +7168,8 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 
 	memcg = page->mem_cgroup;
 
-	/* Readahead page, never charged */
+	/* Readahead page is charged too, to see if other page uncharged */
+	VM_WARN_ON_ONCE_PAGE(!memcg, page);
 	if (!memcg)
 		return 0;
 
-- 
1.8.3.1




* [PATCH v18 02/32] mm/memcg: bail out early from swap accounting when memcg is disabled
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
  2020-08-24 12:54 ` [PATCH v18 01/32] mm/memcg: warning on !memcg after readahead page charged Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 03/32] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
                   ` (30 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

If memcg is disabled via cgroup_disable=memory, page->mem_cgroup will be NULL
since the charge is skipped, and that triggers a warning like the one below.
Return from these functions earlier in that case.

 anon flags:0x5005b48008000d(locked|uptodate|dirty|swapbacked)
 raw: 005005b48008000d dead000000000100 dead000000000122 ffff8897c7c76ad1
 raw: 0000000000000022 0000000000000000 0000000200000000 0000000000000000
 page dumped because: VM_WARN_ON_ONCE_PAGE(!memcg)
...
 RIP: 0010:vprintk_emit+0x1f7/0x260
 Code: 00 84 d2 74 72 0f b6 15 27 58 64 01 48 c7 c0 00 d4 72 82 84 d2 74 09 f3 90 0f b6 10 84 d2 75 f7 e8 de 0d 00 00 4c 89 e7 57 9d <0f> 1f 44 00 00 e9 62 ff ff ff 80 3d 88 c9 3a 01 00 0f 85 54 fe ff
 RSP: 0018:ffffc9000faab358 EFLAGS: 00000202
 RAX: ffffffff8272d400 RBX: 000000000000005e RCX: ffff88afd80d0040
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000202
 RBP: ffffc9000faab3a8 R08: ffffffff8272d440 R09: 0000000000022480
 R10: 00120c77be68bfac R11: 0000000000cd7568 R12: 0000000000000202
 R13: 0057ffffc0080005 R14: ffffffff820a0130 R15: ffffc9000faab3e8
 ? vprintk_emit+0x140/0x260
 vprintk_default+0x1a/0x20
 vprintk_func+0x4f/0xc4
 ? vprintk_func+0x4f/0xc4
 printk+0x53/0x6a
 ? xas_load+0xc/0x80
 __dump_page.cold.6+0xff/0x4ee
 ? xas_init_marks+0x23/0x50
 ? xas_store+0x30/0x40
 ? free_swap_slot+0x43/0xd0
 ? put_swap_page+0x119/0x320
 ? update_load_avg+0x82/0x580
 dump_page+0x9/0xb
 mem_cgroup_try_charge_swap+0x16e/0x1d0
 get_swap_page+0x130/0x210
 add_to_swap+0x41/0xc0
 shrink_page_list+0x99e/0xdf0
 shrink_inactive_list+0x199/0x360
 shrink_lruvec+0x40d/0x650
 ? _cond_resched+0x14/0x30
 ? _cond_resched+0x14/0x30
 shrink_node+0x226/0x6e0
 do_try_to_free_pages+0xd0/0x400
 try_to_free_pages+0xef/0x130
 __alloc_pages_slowpath.constprop.127+0x38d/0xbd0
 ? ___slab_alloc+0x31d/0x6f0
 __alloc_pages_nodemask+0x27f/0x2c0
 alloc_pages_vma+0x75/0x220
 shmem_alloc_page+0x46/0x90
 ? release_pages+0x1ae/0x410
 shmem_alloc_and_acct_page+0x77/0x1c0
 shmem_getpage_gfp+0x162/0x910
 shmem_fault+0x74/0x210
 ? filemap_map_pages+0x29c/0x410
 __do_fault+0x37/0x190
 handle_mm_fault+0x120a/0x1770
 exc_page_fault+0x251/0x450
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ffdc622e5828..5974b449d783 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7098,6 +7098,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
 
+	if (mem_cgroup_disabled())
+		return;
+
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
@@ -7163,6 +7166,9 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
 
+	if (mem_cgroup_disabled())
+		return 0;
+
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return 0;
 
-- 
1.8.3.1




* [PATCH v18 03/32] mm/thp: move lru_add_page_tail func to huge_memory.c
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
  2020-08-24 12:54 ` [PATCH v18 01/32] mm/memcg: warning on !memcg after readahead page charged Alex Shi
  2020-08-24 12:54 ` [PATCH v18 02/32] mm/memcg: bail out early from swap accounting when memcg is disabled Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 04/32] mm/thp: clean up lru_add_page_tail Alex Shi
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

The function is only used in huge_memory.c; defining it in another file,
wrapped in a CONFIG_TRANSPARENT_HUGEPAGE guard, just looks odd.

Let's move it into the THP code and make it static, as Hugh Dickins
suggested.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 --
 mm/huge_memory.c     | 30 ++++++++++++++++++++++++++++++
 mm/swap.c            | 33 ---------------------------------
 3 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 661046994db4..43e6b3458f58 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
 			  unsigned int nr_pages);
 extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
-extern void lru_add_page_tail(struct page *page, struct page *page_tail,
-			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ccff8472cd4..84fb64e8faa1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2313,6 +2313,36 @@ static void remap_page(struct page *page)
 	}
 }
 
+static void lru_add_page_tail(struct page *page, struct page *page_tail,
+				struct lruvec *lruvec, struct list_head *list)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+
+	if (!list)
+		SetPageLRU(page_tail);
+
+	if (likely(PageLRU(page)))
+		list_add_tail(&page_tail->lru, &page->lru);
+	else if (list) {
+		/* page reclaim is reclaiming a huge page */
+		get_page(page_tail);
+		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Head page has not yet been counted, as an hpage,
+		 * so we must account for each subpage individually.
+		 *
+		 * Put page_tail on the list at the correct position
+		 * so they all end up in order.
+		 */
+		add_page_to_lru_list_tail(page_tail, lruvec,
+					  page_lru(page_tail));
+	}
+}
+
 static void __split_huge_page_tail(struct page *head, int tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
diff --git a/mm/swap.c b/mm/swap.c
index d16d65d9b4e0..c674fb441fe9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -935,39 +935,6 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct page *page, struct page *page_tail,
-		       struct lruvec *lruvec, struct list_head *list)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
-
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
-	else if (list) {
-		/* page reclaim is reclaiming a huge page */
-		get_page(page_tail);
-		list_add_tail(&page_tail->lru, list);
-	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
-	}
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 				 void *arg)
 {
-- 
1.8.3.1




* [PATCH v18 04/32] mm/thp: clean up lru_add_page_tail
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (2 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 03/32] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 05/32] mm/thp: remove code path which never got into Alex Shi
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Since the first parameter is only ever the head page, it's better to make
that explicit by naming it 'head'.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 84fb64e8faa1..739497770a3d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2313,19 +2313,19 @@ static void remap_page(struct page *page)
 	}
 }
 
-static void lru_add_page_tail(struct page *page, struct page *page_tail,
+static void lru_add_page_tail(struct page *head, struct page *page_tail,
 				struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	VM_BUG_ON_PAGE(!PageHead(head), head);
+	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
 	if (!list)
 		SetPageLRU(page_tail);
 
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
+	if (likely(PageLRU(head)))
+		list_add_tail(&page_tail->lru, &head->lru);
 	else if (list) {
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
-- 
1.8.3.1




* [PATCH v18 05/32] mm/thp: remove code path which never got into
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (3 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 04/32] mm/thp: clean up lru_add_page_tail Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 06/32] mm/thp: narrow lru locking Alex Shi
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

split_huge_page() is never called on a page that isn't on an LRU list, so
this code path never got a chance to run, and should not run: it would add
tail pages to an LRU list whose head page isn't there.

Although the bug was never triggered, the path is better removed for code
correctness; add a warning for the unexpected case instead.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 739497770a3d..247f53def87b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2330,17 +2330,8 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
-	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
-	}
+	} else
+		VM_WARN_ON(!PageLRU(head));
 }
 
 static void __split_huge_page_tail(struct page *head, int tail,
-- 
1.8.3.1




* [PATCH v18 06/32] mm/thp: narrow lru locking
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (4 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 05/32] mm/thp: remove code path which never got into Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-10 13:49   ` Matthew Wilcox
  2020-08-24 12:54 ` [PATCH v18 07/32] mm/swap.c: stop deactivate_file_page if page not on lru Alex Shi
                   ` (26 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Andrea Arcangeli

With the current sequence, lru_lock and the page cache xa_lock have no reason
to be nested; holding them together isn't necessary. Let's narrow the lru
locking, but keep the local_irq_disable() to block interrupt re-entry and
protect the statistics updates.

As Hugh Dickins pointed out, split_huge_page_to_list() was already being
silly in using the _irqsave variant: it has just been taking sleeping locks,
so it would already be broken if entered with interrupts enabled. So we can
avoid passing the flags argument down to __split_huge_page().

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 247f53def87b..0132d363253e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2397,7 +2397,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+			      pgoff_t end)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
@@ -2406,8 +2406,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned long offset = 0;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2419,6 +2417,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock(&pgdat->lru_lock);
+
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
@@ -2438,6 +2441,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
+	spin_unlock(&pgdat->lru_lock);
+	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
 
@@ -2455,8 +2460,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		page_ref_add(head, 2);
 		xa_unlock(&head->mapping->i_pages);
 	}
-
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	local_irq_enable();
 
 	remap_page(head);
 
@@ -2595,12 +2599,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
-	unsigned long flags;
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2661,9 +2663,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
-
+	/* block interrupt reentry in xa_lock and spinlock */
+	local_irq_disable();
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
 
@@ -2693,7 +2694,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
 
@@ -2712,7 +2713,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		local_irq_enable();
 		remap_page(head);
 		ret = -EBUSY;
 	}
-- 
1.8.3.1




* [PATCH v18 07/32] mm/swap.c: stop deactivate_file_page if page not on lru
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (5 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 06/32] mm/thp: narrow lru locking Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 08/32] mm/vmscan: remove unnecessary lruvec adding Alex Shi
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

Running deactivate_file_page() is pointless if the page isn't on an LRU list,
so bail out early in that case.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/swap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index c674fb441fe9..ea9e1f538313 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -671,7 +671,7 @@ void deactivate_file_page(struct page *page)
 	 * In a workload with many unevictable page such as mprotect,
 	 * unevictable page deactivation for accelerating reclaim is pointless.
 	 */
-	if (PageUnevictable(page))
+	if (PageUnevictable(page) || !PageLRU(page))
 		return;
 
 	if (likely(get_page_unless_zero(page))) {
-- 
1.8.3.1




* [PATCH v18 08/32] mm/vmscan: remove unnecessary lruvec adding
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (6 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 07/32] mm/swap.c: stop deactivate_file_page if page not on lru Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting Alex Shi
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

We don't have to add a freeable page to the LRU and then remove it again.
This change saves a couple of actions and makes the page movement clearer.

The SetPageLRU needs to stay before put_page_testzero for list integrity,
otherwise:

  #0 move_pages_to_lru             #1 release_pages
  if !put_page_testzero
     			           if (put_page_testzero())
     			              !PageLRU //skip lru_lock
     SetPageLRU()
     list_add(&page->lru,)
                                        list_add(&page->lru,)

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmscan.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99e1796eb833..ffccb94defaf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1850,26 +1850,30 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
+		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			list_del(&page->lru);
 			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
 			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
+		/*
+		 * The SetPageLRU needs to be kept here for list intergrity.
+		 * Otherwise:
+		 *   #0 mave_pages_to_lru             #1 release_pages
+		 *   if !put_page_testzero
+		 *				      if (put_page_testzero())
+		 *				        !PageLRU //skip lru_lock
+		 *     SetPageLRU()
+		 *     list_add(&page->lru,)
+		 *                                        list_add(&page->lru,)
+		 */
 		SetPageLRU(page);
-		lru = page_lru(page);
 
-		nr_pages = thp_nr_pages(page);
-		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
-		list_move(&page->lru, &lruvec->lists[lru]);
-
-		if (put_page_testzero(page)) {
+		if (unlikely(put_page_testzero(page))) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&pgdat->lru_lock);
@@ -1877,11 +1881,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
-		} else {
-			nr_moved += nr_pages;
-			if (PageActive(page))
-				workingset_age_nonresident(lruvec, nr_pages);
+
+			continue;
 		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lru = page_lru(page);
+		nr_pages = thp_nr_pages(page);
+
+		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
+		list_add(&page->lru, &lruvec->lists[lru]);
+		nr_moved += nr_pages;
+		if (PageActive(page))
+			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
 	/*
-- 
1.8.3.1




* [PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (7 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 08/32] mm/vmscan: remove unnecessary lruvec adding Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 10/32] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

As the function's comments mention, missing a few isolated pages is
tolerable. So go one step further and drop the unlikely double check. That
won't produce more idle pages, but it does remove a point of lock contention.

This is also a preparation for the new page isolation scheme introduced later
in this series.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/page_idle.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 057c61df12db..5fdd753e151a 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,19 +32,11 @@
 static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page = pfn_to_online_page(pfn);
-	pg_data_t *pgdat;
 
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
 
-	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	if (unlikely(!PageLRU(page))) {
-		put_page(page);
-		page = NULL;
-	}
-	spin_unlock_irq(&pgdat->lru_lock);
 	return page;
 }
 
-- 
1.8.3.1




* [PATCH v18 10/32] mm/compaction: rename compact_deferred as compact_should_defer
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (8 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 11/32] mm/memcg: add debug checking in lock_page_memcg Alex Shi
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Steven Rostedt, Ingo Molnar, Vlastimil Babka, Mike Kravetz

compaction_deferred() only checks whether compaction should be deferred; the
actual deferring is done in defer_compaction(), not here. So rename it to
compaction_should_defer() to avoid confusion.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/compaction.h        | 4 ++--
 include/trace/events/compaction.h | 2 +-
 mm/compaction.c                   | 8 ++++----
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 25a521d299c1..096fd0eec4db 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -102,7 +102,7 @@ extern enum compact_result compaction_suitable(struct zone *zone, int order,
 		unsigned int alloc_flags, int highest_zoneidx);
 
 extern void defer_compaction(struct zone *zone, int order);
-extern bool compaction_deferred(struct zone *zone, int order);
+extern bool compaction_should_defer(struct zone *zone, int order);
 extern void compaction_defer_reset(struct zone *zone, int order,
 				bool alloc_success);
 extern bool compaction_restarting(struct zone *zone, int order);
@@ -201,7 +201,7 @@ static inline void defer_compaction(struct zone *zone, int order)
 {
 }
 
-static inline bool compaction_deferred(struct zone *zone, int order)
+static inline bool compaction_should_defer(struct zone *zone, int order)
 {
 	return true;
 }
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 54e5bf081171..33633c71df04 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -274,7 +274,7 @@
 		1UL << __entry->defer_shift)
 );
 
-DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_deferred,
+DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_should_defer,
 
 	TP_PROTO(struct zone *zone, int order),
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 176dcded298e..4e2c66869041 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -172,7 +172,7 @@ void defer_compaction(struct zone *zone, int order)
 }
 
 /* Returns true if compaction should be skipped this time */
-bool compaction_deferred(struct zone *zone, int order)
+bool compaction_should_defer(struct zone *zone, int order)
 {
 	unsigned long defer_limit = 1UL << zone->compact_defer_shift;
 
@@ -186,7 +186,7 @@ bool compaction_deferred(struct zone *zone, int order)
 	if (zone->compact_considered >= defer_limit)
 		return false;
 
-	trace_mm_compaction_deferred(zone, order);
+	trace_mm_compaction_should_defer(zone, order);
 
 	return true;
 }
@@ -2485,7 +2485,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		enum compact_result status;
 
 		if (prio > MIN_COMPACT_PRIORITY
-					&& compaction_deferred(zone, order)) {
+				&& compaction_should_defer(zone, order)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
 			continue;
 		}
@@ -2711,7 +2711,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		if (!populated_zone(zone))
 			continue;
 
-		if (compaction_deferred(zone, cc.order))
+		if (compaction_should_defer(zone, cc.order))
 			continue;
 
 		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
-- 
1.8.3.1




* [PATCH v18 11/32] mm/memcg: add debug checking in lock_page_memcg
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (9 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 10/32] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 12/32] mm/memcg: optimize mem_cgroup_page_lruvec Alex Shi
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

Add a debug check in lock_page_memcg(): with CONFIG_PROVE_LOCKING, lockdep
can then raise an alarm if anything is wrong with the locking here.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5974b449d783..505f54087e82 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2098,6 +2098,12 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 	if (unlikely(!memcg))
 		return NULL;
 
+#ifdef CONFIG_PROVE_LOCKING
+	local_irq_save(flags);
+	might_lock(&memcg->move_lock);
+	local_irq_restore(flags);
+#endif
+
 	if (atomic_read(&memcg->moving_account) <= 0)
 		return memcg;
 
-- 
1.8.3.1




* [PATCH v18 12/32] mm/memcg: optimize mem_cgroup_page_lruvec
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (10 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 11/32] mm/memcg: add debug checking in lock_page_memcg Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 13/32] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

From: Hugh Dickins <hughd@google.com>

Use READ_ONCE() on page->mem_cgroup, since we check it again later. Also,
the page should not be a tail page, so add a PageTail check.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 505f54087e82..65c1e873153e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1321,7 +1321,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 		goto out;
 	}
 
-	memcg = page->mem_cgroup;
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	memcg = READ_ONCE(page->mem_cgroup);
 	/* Readahead page is charged too, to see if other page uncharged */
 	VM_WARN_ON_ONCE_PAGE(!memcg, page);
 	if (!memcg)
-- 
1.8.3.1




* [PATCH v18 13/32] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (11 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 12/32] mm/memcg: optimize mem_cgroup_page_lruvec Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 14/32] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Fold the PGROTATED event counting into the pagevec_move_tail_fn() callback,
as the other callbacks used with pagevec_lru_move_fn() already do. This lets
us drop the pagevec_move_tail() wrapper.

Now all users of pagevec_lru_move_fn() look the same, and its 3rd parameter
is no longer needed.

This only simplifies the calling convention; there is no functional change.

[lkp@intel.com: found a build issue in the original patch, thanks]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 66 +++++++++++++++++++++++----------------------------------------
 1 file changed, 24 insertions(+), 42 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index ea9e1f538313..4eea95a4286f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -204,8 +204,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
-	void *arg)
+	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
 	struct pglist_data *pgdat = NULL;
@@ -224,7 +223,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		(*move_fn)(page, lruvec, arg);
+		(*move_fn)(page, lruvec);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -232,35 +231,23 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	int *pgmoved = arg;
-
 	if (PageLRU(page) && !PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
-		(*pgmoved) += thp_nr_pages(page);
+		__count_vm_events(PGROTATED, thp_nr_pages(page));
 	}
 }
 
 /*
- * pagevec_move_tail() must be called with IRQ disabled.
- * Otherwise this may cause nasty races.
- */
-static void pagevec_move_tail(struct pagevec *pvec)
-{
-	int pgmoved = 0;
-
-	pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
-	__count_vm_events(PGROTATED, pgmoved);
-}
-
-/*
  * Writeback is about to end against a page which has been marked for immediate
  * reclaim.  If it still appears to be reclaimable, move it to the tail of the
  * inactive list.
+ *
+ * pagevec_move_tail_fn() must be called with IRQ disabled.
+ * Otherwise this may cause nasty races.
  */
 void rotate_reclaimable_page(struct page *page)
 {
@@ -273,7 +260,7 @@ void rotate_reclaimable_page(struct page *page)
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_move_tail(pvec);
+			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 }
@@ -315,8 +302,7 @@ void lru_note_cost_page(struct page *page)
 		      page_is_file_lru(page), thp_nr_pages(page));
 }
 
-static void __activate_page(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -340,7 +326,7 @@ static void activate_page_drain(int cpu)
 	struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
 
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, __activate_page, NULL);
+		pagevec_lru_move_fn(pvec, __activate_page);
 }
 
 static bool need_activate_page_drain(int cpu)
@@ -358,7 +344,7 @@ void activate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, __activate_page, NULL);
+			pagevec_lru_move_fn(pvec, __activate_page);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -374,7 +360,7 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
+	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -527,8 +513,7 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
-			      void *arg)
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 {
 	int lru;
 	bool active;
@@ -575,8 +560,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -593,8 +577,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
@@ -638,21 +621,21 @@ void lru_add_drain_cpu(int cpu)
 
 		/* No harm done if a racing interrupt already did this */
 		local_lock_irqsave(&lru_rotate.lock, flags);
-		pagevec_move_tail(pvec);
+		pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
 	activate_page_drain(cpu);
 }
@@ -681,7 +664,7 @@ void deactivate_file_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -703,7 +686,7 @@ void deactivate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -725,7 +708,7 @@ void mark_page_lazyfree(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -935,8 +918,7 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 {
 	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
@@ -995,7 +977,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
 }
 
 /**
-- 
1.8.3.1




* [PATCH v18 14/32] mm/lru: move lru_lock holding in func lru_note_cost_page
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (12 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 13/32] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 15/32] mm/lru: move lock into lru_note_cost Alex Shi
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

This is a cleanup patch without functional change: the lru_lock taking moves
from the caller (workingset_refault) into lru_note_cost_page() itself.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c       | 2 ++
 mm/workingset.c | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 4eea95a4286f..906255db6006 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -298,8 +298,10 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 
 void lru_note_cost_page(struct page *page)
 {
+	spin_lock_irq(&page_pgdat(page)->lru_lock);
 	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
 		      page_is_file_lru(page), thp_nr_pages(page));
+	spin_unlock_irq(&page_pgdat(page)->lru_lock);
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
diff --git a/mm/workingset.c b/mm/workingset.c
index 92e66113a577..32e24cda1b4f 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -381,9 +381,7 @@ void workingset_refault(struct page *page, void *shadow)
 	if (workingset) {
 		SetPageWorkingset(page);
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
-		spin_lock_irq(&page_pgdat(page)->lru_lock);
 		lru_note_cost_page(page);
-		spin_unlock_irq(&page_pgdat(page)->lru_lock);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file);
 	}
 out:
-- 
1.8.3.1




* [PATCH v18 15/32] mm/lru: move lock into lru_note_cost
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (13 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 14/32] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-21 21:36   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU Alex Shi
                   ` (17 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

We have to move the lru_lock into lru_note_cost(), since it walks up the
memcg tree, in preparation for the future per-lruvec lru_lock replacement.
It is a bit ugly and may cost a bit more locking, but the benefit of taking
separate per-memcg locks should outweigh the loss.
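
Condensed, the resulting walk looks like the sketch below (illustration
only, not part of the diff; the per-node lock is swapped for the
per-lruvec lock in a later patch):

	do {
		struct pglist_data *pgdat = lruvec_pgdat(lruvec);

		spin_lock_irq(&pgdat->lru_lock);
		/*
		 * Record the cost event: bump lruvec->file_cost or
		 * lruvec->anon_cost, and decay both once their sum
		 * grows too large.
		 */
		spin_unlock_irq(&pgdat->lru_lock);
	} while ((lruvec = parent_lruvec(lruvec)));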

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c   | 5 +++--
 mm/vmscan.c | 4 +---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 906255db6006..f80ccd6f3cb4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -269,7 +269,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+		spin_lock_irq(&pgdat->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -293,15 +295,14 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
+		spin_unlock_irq(&pgdat->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 void lru_note_cost_page(struct page *page)
 {
-	spin_lock_irq(&page_pgdat(page)->lru_lock);
 	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
 		      page_is_file_lru(page), thp_nr_pages(page));
-	spin_unlock_irq(&page_pgdat(page)->lru_lock);
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffccb94defaf..7b7b36bd1448 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1971,19 +1971,17 @@ static int current_may_throttle(void)
 				&stat, false);
 
 	spin_lock_irq(&pgdat->lru_lock);
-
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	lru_note_cost(lruvec, file, stat.nr_pageout);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-
 	spin_unlock_irq(&pgdat->lru_lock);
 
+	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
 	free_unref_page_list(&page_list);
 
-- 
1.8.3.1




* [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (14 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 15/32] mm/lru: move lock into lru_note_cost Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-21 23:16   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 17/32] mm/compaction: do page isolation first in compaction Alex Shi
                   ` (16 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko

Currently lru_lock guards both the lru list and the page's lru bit, which
is fine. But if we want to take a lruvec-specific lock for the page, we
need to pin down the page's lruvec/memcg while locking: just taking the
lruvec lock first can be undermined by a concurrent memcg charge/migration
of the page. To fix this, we clear the lru bit outside of the lock and use
that as the pin that blocks page isolation during a memcg change.

So the standard steps of page isolation are now:
	1, get_page(); 	       #pin the page so it cannot be freed
	2, TestClearPageLRU(); #block other isolation, e.g. a memcg change
	3, spin_lock on lru_lock; #serialize lru list access
	4, delete page from lru list;
Step 2 can be optimized away or replaced in scenarios where the page is
unlikely to be accessed or moved between memcgs, as in the sketch below.
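
In code form, the sequence is roughly the following (an illustrative
sketch only, error handling simplified; at this point in the series the
lock is still the per-node one):

	get_page(page);			/* 1: pin, the page cannot be freed */
	if (TestClearPageLRU(page)) {	/* 2: claim the lru bit */
		struct lruvec *lruvec;

		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irq(&page_pgdat(page)->lru_lock);		/* 3 */
		del_page_from_lru_list(page, lruvec, page_lru(page));	/* 4 */
		spin_unlock_irq(&page_pgdat(page)->lru_lock);
	} else {
		put_page(page);		/* somebody else isolated it first */
	}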

This patch starts with the first part: TestClearPageLRU, which combines
the PageLRU check and ClearPageLRU into one atomic macro, TestClearPageLRU.
This function will be used as the page-isolation precondition to prevent
concurrent isolation elsewhere. As a consequence there may be !PageLRU
pages on an lru list, so the corresponding BUG() checks have to be removed.

There are 2 rules for the lru bit now:
1, the lru bit still indicates whether a page is on an lru list; only for
   a short, temporary moment (while being isolated) may a page sit on an
   lru list without the lru bit. But a page must be on an lru list
   whenever the lru bit is set.
2, the lru bit has to be cleared before the page is deleted from the lru
   list.

Hugh Dickins pointed out that when a page is on the free path and nobody
else can possibly take it, the non-atomic lru bit clear is better, as in
__page_cache_release() and release_pages().
And there is no need for a get_page() before clearing the lru bit in
isolate_lru_page(), since it '(1) Must be called with an elevated refcount
on the page'.
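
For the free path that gives (sketch matching the __page_cache_release()
hunk below):

	__ClearPageLRU(page);	/* non-atomic is safe: the page is being
				 * freed, nobody else can still grab it */
	spin_lock_irqsave(&pgdat->lru_lock, flags);
	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	del_page_from_lru_list(page, lruvec, page_off_lru(page));
	spin_unlock_irqrestore(&pgdat->lru_lock, flags);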

As Andrew Morton mentioned, this change dirties the cacheline even for a
page that isn't on the LRU. But the loss looks acceptable according to the
report from Rong Chen <rong.a.chen@intel.com>:
https://lkml.org/lkml/2020/3/4/173

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/page-flags.h |  1 +
 mm/mlock.c                 |  3 +--
 mm/swap.c                  |  5 ++---
 mm/vmscan.c                | 18 +++++++-----------
 4 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6be1aa559b1e..9554ed1387dc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size)
 PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+	TESTCLEARFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
 PAGEFLAG(Workingset, workingset, PF_HEAD)
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..3762d9dd5b31 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -107,13 +107,12 @@ void mlock_vma_page(struct page *page)
  */
 static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
 {
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		struct lruvec *lruvec;
 
 		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
-		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		return true;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index f80ccd6f3cb4..446ffe280809 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
 		struct lruvec *lruvec;
 		unsigned long flags;
 
+		__ClearPageLRU(page);
 		spin_lock_irqsave(&pgdat->lru_lock, flags);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 	}
@@ -880,9 +879,9 @@ void release_pages(struct page **pages, int nr)
 				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
+			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b7b36bd1448..1b3e0eeaad64 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1665,8 +1665,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
 
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-
 		nr_pages = compound_nr(page);
 		total_scan += nr_pages;
 
@@ -1763,21 +1761,19 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
+		int lru = page_lru(page);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		get_page(page);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		if (PageLRU(page)) {
-			int lru = page_lru(page);
-			get_page(page);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, lru);
-			ret = 0;
-		}
+		spin_lock_irq(&pgdat->lru_lock);
+		del_page_from_lru_list(page, lruvec, lru);
 		spin_unlock_irq(&pgdat->lru_lock);
+		ret = 0;
 	}
+
 	return ret;
 }
 
-- 
1.8.3.1




* [PATCH v18 17/32] mm/compaction: do page isolation first in compaction
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (15 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-21 23:49   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 18/32] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
                   ` (15 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Currently, compaction takes the lru_lock and then does page isolation.
That works fine with pgdat->lru_lock, since every page isolation competes
for that one lock. If we want to change to a per-memcg lru_lock, we have
to isolate the page before taking the lru_lock; the isolation then blocks
the page's memcg change, which relies on page isolation too. After that we
can safely use the per-memcg lru_lock later in the series.

The new page isolation uses the previously introduced TestClearPageLRU()
plus pgdat lru locking, which will be switched to the memcg lru lock
later; the new ordering is sketched below.
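
Condensed from the isolate_migratepages_block() hunk below (sketch only):

	if (unlikely(!get_page_unless_zero(page)))	/* pin first, the page
							 * may be being freed */
		goto isolate_fail;
	if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
		goto isolate_fail_put;
	if (!TestClearPageLRU(page))	/* claim the lru bit ... */
		goto isolate_fail_put;
	if (!locked)			/* ... and only then take the lock */
		locked = compact_lock_irqsave(&pgdat->lru_lock, &flags, cc);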

Hugh Dickins <hughd@google.com> fixed the following bugs in an early
version of this patch:

Fix lots of crashes under compaction load: isolate_migratepages_block()
must clean up appropriately when rejecting a page, setting PageLRU again
if it had been cleared; and a put_page() after get_page_unless_zero()
cannot safely be done while holding locked_lruvec - it may turn out to
be the final put_page(), which will take an lruvec lock when PageLRU.
And move __isolate_lru_page_prepare back after get_page_unless_zero to
make trylock_page() safe:
trylock_page() is not safe to use at this time: its setting PG_locked
can race with the page being freed or allocated ("Bad page"), and can
also erase flags being set by one of those "sole owners" of a freshly
allocated page who use non-atomic __SetPageFlag().

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 +-
 mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 46 ++++++++++++++++++++++++++--------------------
 3 files changed, 60 insertions(+), 30 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 43e6b3458f58..550fdfdc3506 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -357,7 +357,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
+extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index 4e2c66869041..253382d99969 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -887,6 +887,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
 			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
 				low_pfn = end_pfn;
+				page = NULL;
 				goto isolate_abort;
 			}
 			valid_page = page;
@@ -968,6 +969,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
 			goto isolate_fail;
 
+		/*
+		 * Be careful not to clear PageLRU until after we're
+		 * sure the page is not being freed elsewhere -- the
+		 * page release code relies on it.
+		 */
+		if (unlikely(!get_page_unless_zero(page)))
+			goto isolate_fail;
+
+		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
+			goto isolate_fail_put;
+
+		/* Try isolate the page */
+		if (!TestClearPageLRU(page))
+			goto isolate_fail_put;
+
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
 			locked = compact_lock_irqsave(&pgdat->lru_lock,
@@ -980,10 +996,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 					goto isolate_abort;
 			}
 
-			/* Recheck PageLRU and PageCompound under lock */
-			if (!PageLRU(page))
-				goto isolate_fail;
-
 			/*
 			 * Page become compound since the non-locked check,
 			 * and it's on LRU. It can only be a THP so the order
@@ -991,16 +1003,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
 				low_pfn += compound_nr(page) - 1;
-				goto isolate_fail;
+				SetPageLRU(page);
+				goto isolate_fail_put;
 			}
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-		/* Try isolate the page */
-		if (__isolate_lru_page(page, isolate_mode) != 0)
-			goto isolate_fail;
-
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
 			low_pfn += compound_nr(page) - 1;
@@ -1029,6 +1038,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		}
 
 		continue;
+
+isolate_fail_put:
+		/* Avoid potential deadlock in freeing page under lru_lock */
+		if (locked) {
+			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			locked = false;
+		}
+		put_page(page);
+
 isolate_fail:
 		if (!skip_on_failure)
 			continue;
@@ -1065,9 +1083,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	if (unlikely(low_pfn > end_pfn))
 		low_pfn = end_pfn;
 
+	page = NULL;
+
 isolate_abort:
 	if (locked)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (page) {
+		SetPageLRU(page);
+		put_page(page);
+	}
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1b3e0eeaad64..48b50695f883 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1538,20 +1538,20 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, isolate_mode_t mode)
+int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
 {
 	int ret = -EINVAL;
 
-	/* Only take pages on the LRU. */
-	if (!PageLRU(page))
-		return ret;
-
 	/* Compaction should not handle unevictable pages but CMA can do so */
 	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
 		return ret;
 
 	ret = -EBUSY;
 
+	/* Only take pages on the LRU. */
+	if (!PageLRU(page))
+		return ret;
+
 	/*
 	 * To minimise LRU disruption, the caller can indicate that it only
 	 * wants to isolate pages it will be able to operate on without
@@ -1592,20 +1592,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
 		return ret;
 
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * Be careful not to clear PageLRU until after we're
-		 * sure the page is not being freed elsewhere -- the
-		 * page release code relies on it.
-		 */
-		ClearPageLRU(page);
-		ret = 0;
-	}
-
-	return ret;
+	return 0;
 }
 
-
 /*
  * Update LRU sizes after isolating pages. The LRU size updates must
  * be complete before mem_cgroup_update_lru_size due to a sanity check.
@@ -1685,17 +1674,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		 * only when the page is being freed somewhere else.
 		 */
 		scan += nr_pages;
-		switch (__isolate_lru_page(page, mode)) {
+		switch (__isolate_lru_page_prepare(page, mode)) {
 		case 0:
+			/*
+			 * Be careful not to clear PageLRU until after we're
+			 * sure the page is not being freed elsewhere -- the
+			 * page release code relies on it.
+			 */
+			if (unlikely(!get_page_unless_zero(page)))
+				goto busy;
+
+			if (!TestClearPageLRU(page)) {
+				/*
+				 * This page may in other isolation path,
+				 * but we still hold lru_lock.
+				 */
+				put_page(page);
+				goto busy;
+			}
+
 			nr_taken += nr_pages;
 			nr_zone_taken[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
 			break;
-
+busy:
 		case -EBUSY:
 			/* else it is being freed elsewhere */
 			list_move(&page->lru, src);
-			continue;
+			break;
 
 		default:
 			BUG();
-- 
1.8.3.1




* [PATCH v18 18/32] mm/thp: add tail pages into lru anyway in split_huge_page()
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (16 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 17/32] mm/compaction: do page isolation first in compaction Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:54 ` [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn Alex Shi
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Mika Penttilä

split_huge_page() must start with PageLRU(head), and we are holding the
lru_lock here. If the head's lru bit has been cleared unexpectedly, warn
about it but keep adding the tail pages to the lru anyway.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0132d363253e..6380c925e904 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2321,17 +2321,20 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(head)))
-		list_add_tail(&page_tail->lru, &head->lru);
-	else if (list) {
+	if (list) {
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
-	} else
+	} else {
+		/*
+		 * Split start from PageLRU(head), and we are holding the
+		 * lru_lock. Do a warning if the head's lru bit was cleared
+		 * unexpected.
+		 */
 		VM_WARN_ON(!PageLRU(head));
+		SetPageLRU(page_tail);
+		list_add_tail(&page_tail->lru, &head->lru);
+	}
 }
 
 static void __split_huge_page_tail(struct page *head, int tail,
-- 
1.8.3.1




* [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (17 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 18/32] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-22  0:42   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
                   ` (13 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Hugh Dickins found a memcg-change bug in the original version:
if we want to change the pgdat->lru_lock to the memcg's lruvec lock, we
have to serialize against mem_cgroup_move_account() during
pagevec_lru_move_fn(). The possible bad scenario looks like:

	cpu 0					cpu 1
lruvec = mem_cgroup_page_lruvec()
					if (!isolate_lru_page())
						mem_cgroup_move_account

spin_lock_irqsave(&lruvec->lru_lock <== wrong lock.

So we need TestClearPageLRU to block isolate_lru_page(), which serializes
the memcg change; as a consequence the PageLRU check in the move_fn
callees is removed. A condensed sketch of the resulting loop follows.
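
In code form (sketch only, condensed from the pagevec_lru_move_fn() hunk
below; switching of the per-node lock is elided):

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/*
		 * If an isolator (e.g. isolate_lru_page() on the
		 * move_account side) already claimed the lru bit,
		 * skip the page: whoever clears the bit first wins.
		 */
		if (!TestClearPageLRU(page))
			continue;

		lruvec = mem_cgroup_page_lruvec(page, pgdat);
		(*move_fn)(page, lruvec);	/* memcg is stable here */
		SetPageLRU(page);
	}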

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 44 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 446ffe280809..2d9a86bf93a4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -221,8 +221,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(&pgdat->lru_lock, flags);
 		}
 
+		/* block memcg migration during page moving between lru */
+		if (!TestClearPageLRU(page))
+			continue;
+
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec);
+
+		SetPageLRU(page);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -232,7 +238,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
@@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page)
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+	if (!PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = thp_nr_pages(page);
 
@@ -362,7 +368,8 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
+	if (PageLRU(page))
+		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -521,9 +528,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 	bool active;
 	int nr_pages = thp_nr_pages(page);
 
-	if (!PageLRU(page))
-		return;
-
 	if (PageUnevictable(page))
 		return;
 
@@ -564,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = thp_nr_pages(page);
 
@@ -581,7 +585,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	if (PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		bool active = PageActive(page);
 		int nr_pages = thp_nr_pages(page);
@@ -979,7 +983,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
+	int i;
+	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec;
+	unsigned long flags = 0;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct pglist_data *pagepgdat = page_pgdat(page);
+
+		if (pagepgdat != pgdat) {
+			if (pgdat)
+				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			pgdat = pagepgdat;
+			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		__pagevec_lru_add_fn(page, lruvec);
+	}
+	if (pgdat)
+		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	release_pages(pvec->pages, pvec->nr);
+	pagevec_reinit(pvec);
 }
 
 /**
-- 
1.8.3.1




* [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (18 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-22  5:27   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function Alex Shi
                   ` (12 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Michal Hocko, Yang Shi

This patch moves the per-node lru_lock into the lruvec, thus introducing
one lru_lock per memcg per node. So on a large machine, each memcg no
longer has to suffer contention on the single per-node pgdat->lru_lock;
it can go fast with its own lru_lock.

After moving the memcg charge before lru insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and can
replace the per-node lru lock; typical usage of the new helpers is
sketched below.
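
A typical conversion with the new helpers then looks like this
(illustrative sketch, mirroring the __page_cache_release() hunk below):

	struct lruvec *lruvec;
	unsigned long flags;

	__ClearPageLRU(page);
	/* find and lock the memcg's lruvec for this page */
	lruvec = lock_page_lruvec_irqsave(page, &flags);
	del_page_from_lru_list(page, lruvec, page_off_lru(page));
	unlock_page_lruvec_irqrestore(lruvec, flags);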

In isolate_migratepages_block(), compact_unlock_should_abort() is
open-coded, and the lock_page_lruvec() logic is embedded to keep the flow
tight. Also add a debug function in the locking helpers which may give
some clues if something gets out of hand.

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this and the later patches, readtwice performance increases by
about 80% with concurrent containers.

On a large machine with memcg enabled but not used, looking up the page's
lruvec chases a few extra pointers, which may increase the lru_lock hold
time and cause a slight regression.

Hugh Dickins helped on patch polish, thanks!

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++++
 include/linux/mmzone.h     |   2 +
 mm/compaction.c            |  56 +++++++++++++++---------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  60 +++++++++++++++++++++++++-
 mm/mlock.c                 |  47 +++++++++++++-------
 mm/mmzone.c                |   1 +
 mm/swap.c                  | 105 +++++++++++++++++++++------------------------
 mm/vmscan.c                |  70 +++++++++++++++++-------------
 9 files changed, 279 insertions(+), 131 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d0b036123c6a..7b170e9028b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -494,6 +494,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1035,6 +1048,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1282,6 +1320,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1411,6 +1453,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8379432f4f2f..27a1513a43fc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -273,6 +273,8 @@ enum lruvec_flags {
 };
 
 struct lruvec {
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	struct list_head		lists[NR_LRU_LISTS];
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
diff --git a/mm/compaction.c b/mm/compaction.c
index 253382d99969..b724eacf6421 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -805,7 +805,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -865,11 +865,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked) {
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -941,9 +950,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+					unlock_page_lruvec_irqrestore(locked, flags);
+					locked = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -984,10 +992,19 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked) {
+			if (locked)
+				unlock_page_lruvec_irqrestore(locked, flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -1006,9 +1023,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1042,8 +1058,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
 		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+			unlock_page_lruvec_irqrestore(locked, flags);
+			locked = NULL;
 		}
 		put_page(page);
 
@@ -1058,8 +1074,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1087,7 +1103,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_abort:
 	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(locked, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6380c925e904..c9e08fdc08e9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2319,7 +2319,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2403,7 +2403,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 			      pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2420,10 +2419,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2444,7 +2441,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 65c1e873153e..5b95529e64a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1302,6 +1302,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1341,6 +1354,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
@@ -3222,7 +3280,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * lruvec->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 3762d9dd5b31..177d2588e863 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -105,12 +105,10 @@ void mlock_vma_page(struct page *page)
  * Isolate a page from LRU with optional get_page() pin.
  * Assumes lru_lock already held and page already pinned.
  */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+static bool __munlock_isolate_lru_page(struct page *page,
+				struct lruvec *lruvec, bool getpage)
 {
 	if (TestClearPageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
@@ -180,7 +178,7 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -188,11 +186,16 @@ unsigned int munlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	/*
-	 * Serialize with any parallel __split_huge_page_refcount() which
+	 * Serialize split tail pages in __split_huge_page_tail() which
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes thp_nr_pages().
+	 * TestClearPageLRU can't be used here to block page isolation, since
+	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
+	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
+	 * wrong lru list here. So relay on PageLocked to stop lruvec change
+	 * in mem_cgroup_move_account().
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 
 	if (!TestClearPageMlocked(page)) {
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -203,15 +206,15 @@ unsigned int munlock_vma_page(struct page *page)
 	nr_pages = thp_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&pgdat->lru_lock);
+	if (__munlock_isolate_lru_page(page, lruvec, true)) {
+		unlock_page_lruvec_irq(lruvec);
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&pgdat->lru_lock);
+	unlock_page_lruvec_irq(lruvec);
 
 out:
 	return nr_pages - 1;
@@ -291,23 +294,34 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
+		struct lruvec *new_lruvec;
+
+		/* block memcg change in mem_cgroup_move_account */
+		lock_page_memcg(page);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (new_lruvec != lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
+		}
 
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, false))
+			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+				unlock_page_memcg(page);
 				continue;
-			else
+			} else
 				__munlock_isolation_failed(page);
 		} else {
 			delta_munlocked++;
@@ -319,11 +333,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 * pin. We cannot do it under lru_lock however. If it's
 		 * the last pin, __page_cache_release() would deadlock.
 		 */
+		unlock_page_memcg(page);
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/swap.c b/mm/swap.c
index 2d9a86bf93a4..b67959b701c0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,15 +79,13 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
 		__ClearPageLRU(page);
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -206,32 +204,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -364,13 +359,13 @@ static inline void activate_page_drain(int cpu)
 
 void activate_page(struct page *page)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+		__activate_page(page, lruvec);
+	unlock_page_lruvec_irq(lruvec);
 }
 #endif
 
@@ -819,8 +814,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags;
 	unsigned int lock_batch;
 
@@ -830,21 +824,20 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		if (is_huge_zero_page(page))
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -863,29 +856,29 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
@@ -895,8 +888,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -984,26 +977,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48b50695f883..789444ae4c88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1768,15 +1768,13 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		int lru = page_lru(page);
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, lru);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1843,20 +1841,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
+	struct lruvec *orig_lruvec = lruvec;
 	enum lru_list lru;
 
 	while (!list_empty(list)) {
+		struct lruvec *new_lruvec = NULL;
+
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1871,6 +1871,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		 *     list_add(&page->lru,)
 		 *                                        list_add(&page->lru,)
 		 */
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (new_lruvec != lruvec) {
+			if (lruvec)
+				spin_unlock_irq(&lruvec->lru_lock);
+			lruvec = lock_page_lruvec_irq(page);
+		}
 		SetPageLRU(page);
 
 		if (unlikely(put_page_testzero(page))) {
@@ -1878,16 +1884,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		lru = page_lru(page);
 		nr_pages = thp_nr_pages(page);
 
@@ -1897,6 +1902,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		if (PageActive(page))
 			workingset_age_nonresident(lruvec, nr_pages);
 	}
+	if (orig_lruvec != lruvec) {
+		if (lruvec)
+			spin_unlock_irq(&lruvec->lru_lock);
+		spin_lock_irq(&orig_lruvec->lru_lock);
+	}
 
 	/*
 	 * To save our caller's stack, now use input list for pages to free.
@@ -1952,7 +1962,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1964,7 +1974,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1972,7 +1982,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1981,7 +1991,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2034,7 +2044,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2045,7 +2055,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2091,7 +2101,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2102,7 +2112,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2684,10 +2694,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4263,24 +4273,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
 		pgscanned++;
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -4296,10 +4304,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		}
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (19 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-22  5:40   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru Alex Shi
                   ` (11 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Thomas Gleixner, Andrey Ryabinin

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Use this new function to replace the same code repeated in several places;
no functional change.

When testing for relock we can avoid the need for RCU locking if we simply
compare the page pgdat and memcg pointers versus those that the lruvec is
holding. By doing this we can avoid the extra pointer walks and accesses of
the memory cgroup.

In addition we can avoid the checks entirely if lruvec is currently NULL.
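
As an illustration only (not part of the patch), a caller that walks a
pagevec typically ends up with the following pattern after the conversion:

	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* only switches locks when the page belongs to another lruvec */
		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);

		/* ... operate on @page under its lruvec->lru_lock ... */
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);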

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/mlock.c                 |  9 +-------
 mm/swap.c                  | 33 +++++++----------------------
 mm/vmscan.c                |  8 +------
 4 files changed, 61 insertions(+), 41 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7b170e9028b5..ee6ef2d8ad52 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -488,6 +488,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	const struct mem_cgroup *memcg;
+	struct mem_cgroup_per_node *mz;
+
+	if (mem_cgroup_disabled())
+		return lruvec == &pgdat->__lruvec;
+
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+	memcg = page->mem_cgroup ? : root_mem_cgroup;
+
+	return lruvec->pgdat == pgdat && mz->memcg == memcg;
+}
+
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -1023,6 +1039,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->__lruvec;
 }
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+		pg_data_t *pgdat = page_pgdat(page);
+
+		return lruvec == &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
@@ -1469,6 +1493,34 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 }
 
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+		struct lruvec *locked_lruvec)
+{
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
+
+		unlock_page_lruvec_irq(locked_lruvec);
+	}
+
+	return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+		struct lruvec *locked_lruvec, unsigned long *flags)
+{
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
+
+		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+	}
+
+	return lock_page_lruvec_irqsave(page, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/mlock.c b/mm/mlock.c
index 177d2588e863..0448409184e3 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -302,17 +302,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	/* Phase 1: page isolation */
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		/* block memcg change in mem_cgroup_move_account */
 		lock_page_memcg(page);
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (new_lruvec != lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
-
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
diff --git a/mm/swap.c b/mm/swap.c
index b67959b701c0..2ac78e8fab71 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,19 +209,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
-
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
@@ -865,17 +858,12 @@ void release_pages(struct page **pages, int nr)
 		}
 
 		if (PageLRU(page)) {
-			struct lruvec *new_lruvec;
-
-			new_lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
-			if (new_lruvec != lruvec) {
-				if (lruvec)
-					unlock_page_lruvec_irqrestore(lruvec,
-									flags);
+			struct lruvec *prev_lruvec = lruvec;
+
+			lruvec = relock_page_lruvec_irqsave(page, lruvec,
+									&flags);
+			if (prev_lruvec != lruvec)
 				lock_batch = 0;
-				lruvec = lock_page_lruvec_irqsave(page, &flags);
-			}
 
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
@@ -982,15 +970,8 @@ void __pagevec_lru_add(struct pagevec *pvec)
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
-
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
 
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
 	if (lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 789444ae4c88..2c94790d4cb1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4280,15 +4280,9 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		pgscanned++;
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (20 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-22  5:44   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 23/32] mm/lru: revise the comments of lru_lock Alex Shi
                   ` (10 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Andrey Ryabinin, Jann Horn

From: Hugh Dickins <hughd@google.com>

Use the relock function to replace the open-coded relocking, and avoid a
few unnecessary lock/unlock cycles when consecutive pages belong to the
same lruvec.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/vmscan.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2c94790d4cb1..04ef94190530 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1848,15 +1848,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	enum lru_list lru;
 
 	while (!list_empty(list)) {
-		struct lruvec *new_lruvec = NULL;
-
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&lruvec->lru_lock);
+			if (lruvec) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
+			}
 			putback_lru_page(page);
-			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1871,12 +1871,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		 *     list_add(&page->lru,)
 		 *                                        list_add(&page->lru,)
 		 */
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (new_lruvec != lruvec) {
-			if (lruvec)
-				spin_unlock_irq(&lruvec->lru_lock);
-			lruvec = lock_page_lruvec_irq(page);
-		}
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 		SetPageLRU(page);
 
 		if (unlikely(put_page_testzero(page))) {
@@ -1885,8 +1880,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
 				destroy_compound_page(page);
-				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 23/32] mm/lru: revise the comments of lru_lock
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (21 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-22  5:48   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock Alex Shi
                   ` (9 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Andrey Ryabinin, Jann Horn

From: Hugh Dickins <hughd@google.com>

Since pgdat->lru_lock has been changed to lruvec->lru_lock, it's time to
fix the now-incorrect comments in the code. Also fix some ancient
zone->lru_lock comment errors and the related documentation.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
 Documentation/admin-guide/cgroup-v1/memory.rst     | 21 +++++++++------------
 Documentation/trace/events-kmem.rst                |  2 +-
 Documentation/vm/unevictable-lru.rst               | 22 ++++++++--------------
 include/linux/mm_types.h                           |  2 +-
 include/linux/mmzone.h                             |  3 +--
 mm/filemap.c                                       |  4 ++--
 mm/memcontrol.c                                    |  2 +-
 mm/rmap.c                                          |  4 ++--
 mm/vmscan.c                                        | 12 ++++++++----
 10 files changed, 36 insertions(+), 51 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 ======
-        Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global pgdat->lru_lock).
-	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under pgdat->lru_lock.
-
-	A special function is mem_cgroup_isolate_pages(). This scans
-	memcg's private LRU and call __isolate_lru_page() to extract a page
-	from LRU.
-
-	(By __isolate_lru_page(), the page is removed from both of global and
-	private LRU.)
-
+	Each memcg has its own vector of LRUs (inactive anon, active anon,
+	inactive file, active file, unevictable) of pages from each node,
+	each LRU handled under a single lru_lock for that memcg and node.
 
 9. Typical Tests.
 =================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 12757e63b26c..24450696579f 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered.
 2.6 Locking
 -----------
 
-   lock_page_cgroup()/unlock_page_cgroup() should not be called under
-   the i_pages lock.
+Lock order is as follows:
 
-   Other lock order is following:
+  Page lock (PG_locked bit of page->flags)
+    mm->page_table_lock or split pte_lock
+      lock_page_memcg (memcg->move_lock)
+        mapping->i_pages lock
+          lruvec->lru_lock.
 
-   PG_locked.
-     mm->page_table_lock
-         pgdat->lru_lock
-	   lock_page_cgroup.
-
-  In many cases, just lock_page_cgroup() is called.
-
-  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  pgdat->lru_lock, it has no lock of its own.
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 
 4. Per-CPU Allocator Activity
 =============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux.  The problems have been observed at customer sites on large
 memory x86_64 systems.
 
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone.  When a large
+main memory will have over 32 million 4k pages in a single node.  When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable.  This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
 The Unevictable Page List
 -------------------------
 
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
 
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
 swap-backed pages.  This differentiation is only important while the pages are,
 in fact, evictable.
 
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
 
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock.  This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
 
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
 
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element).  The memory controller tracks the movement of pages to
 and from the unevictable list.
 
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
 active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked.  Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list.  If
+is now mlocked and divert the page to the node's unevictable list.  If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
 
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
      unevictable list in mlock_vma_page().
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..c3f1e76720af 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock.  Sometimes used as a generic list
+			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 27a1513a43fc..f0596e634863 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
- * So add a wild amount of padding here to ensure that they fall into separate
+ * Add a wild amount of padding here to ensure datas fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index 1aaea26556cc..6f8d58fb16db 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -102,8 +102,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5b95529e64a4..454b3f205d1b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3279,7 +3279,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 /*
- * Because tail pages are not marked as "used", set it. We're under
+ * Because tail pages are not marked as "used", set it. Don't need
  * lruvec->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/rmap.c b/mm/rmap.c
index 83cc459edc40..259c323e06ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -28,12 +28,12 @@
  *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
  *                     i_pages lock (widely used)
+ *                       lruvec->lru_lock (in lock_page_lruvec_irq)
  *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                   sb_lock (within inode_lock in fs/fs-writeback.c)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04ef94190530..601fbcb994fb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1614,14 +1614,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended.  Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan:	The number of eligible pages to look through on the list.
  * @lruvec:	The LRU vector to pull pages from.
@@ -1820,14 +1822,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 
 /*
  * This moves pages from @list to corresponding LRU list.
+ * The pages from @list is out of any lruvec, and in the end list reuses as
+ * pages_to_free list.
  *
  * We move them the other way if the page is referenced by one or more
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation.  But if
+ * appropriate to hold lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page.  It's impossible to balance
+ * should drop lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (22 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 23/32] mm/lru: revise the comments of lru_lock Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-09-22  5:53   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page Alex Shi
                   ` (8 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

Now that pgdat->lru_lock has been replaced by the lruvec lock, it is not
used anymore; remove it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 include/linux/mmzone.h | 1 -
 mm/page_alloc.c        | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f0596e634863..0ed520954843 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -758,7 +758,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fab5e97dc9ca..775120fcc869 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6733,7 +6733,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (23 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-26  5:52   ` Alex Shi
  2020-09-22  6:13   ` Hugh Dickins
  2020-08-24 12:54 ` [PATCH v18 26/32] mm/mlock: remove __munlock_isolate_lru_page Alex Shi
                   ` (7 subsequent siblings)
  32 siblings, 2 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Kirill A. Shutemov, Vlastimil Babka

In munlock_vma_page() the page must be PageLocked, just as it is in the
split_huge_page() family of functions. The page lock is therefore enough
to serialize the two paths.

So the TestClearPageMlocked()/thp_nr_pages() sequence no longer needs to
run under the lru lock here.

The other munlock function, __munlock_pagevec(), has no PageLocked
protection and must keep relying on the lru lock.
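
As an illustration of the serialization argument only (this is not code
from the patch, just a sketch of the two sides), both paths run with the
page lock held on the head page and therefore cannot overlap:

	/* THP split side: split_huge_page() requires the page lock and
	 * copies PageMlocked to the tail pages while holding it. */
	lock_page(head);
	split_huge_page(head);
	unlock_page(head);

	/* munlock side: munlock_vma_page() has BUG_ON(!PageLocked(page)),
	 * so its TestClearPageMlocked() cannot interleave with the split. */
	lock_page(page);
	munlock_vma_page(page);
	unlock_page(page);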

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 41 +++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0448409184e3..46a05e6ec5ba 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -69,9 +69,9 @@ void clear_page_mlock(struct page *page)
 	 *
 	 * See __pagevec_lru_add_fn for more explanation.
 	 */
-	if (!isolate_lru_page(page)) {
+	if (!isolate_lru_page(page))
 		putback_lru_page(page);
-	} else {
+	else {
 		/*
 		 * We lost the race. the page already moved to evictable list.
 		 */
@@ -178,7 +178,6 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -186,37 +185,22 @@ unsigned int munlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	/*
-	 * Serialize split tail pages in __split_huge_page_tail() which
-	 * might otherwise copy PageMlocked to part of the tail pages before
-	 * we clear it in the head page. It also stabilizes thp_nr_pages().
-	 * TestClearPageLRU can't be used here to block page isolation, since
-	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
-	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
-	 * wrong lru list here. So relay on PageLocked to stop lruvec change
-	 * in mem_cgroup_move_account().
+	 * Serialize split tail pages in __split_huge_page_tail() by
+	 * lock_page(); Do TestClearPageMlocked/PageLRU sequence like
+	 * clear_page_mlock().
 	 */
-	lruvec = lock_page_lruvec_irq(page);
-
-	if (!TestClearPageMlocked(page)) {
+	if (!TestClearPageMlocked(page))
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
-	}
+		return 0;
 
 	nr_pages = thp_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, lruvec, true)) {
-		unlock_page_lruvec_irq(lruvec);
+	if (!isolate_lru_page(page))
 		__munlock_isolated_page(page);
-		goto out;
-	}
-	__munlock_isolation_failed(page);
-
-unlock_out:
-	unlock_page_lruvec_irq(lruvec);
+	else
+		__munlock_isolation_failed(page);
 
-out:
 	return nr_pages - 1;
 }
 
@@ -305,6 +289,11 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 
 		/* block memcg change in mem_cgroup_move_account */
 		lock_page_memcg(page);
+		/*
+		 * Serialize split tail pages in __split_huge_page_tail() which
+		 * might otherwise copy PageMlocked to part of the tail pages
+		 * before we clear it in the head page.
+		 */
 		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (TestClearPageMlocked(page)) {
 			/*
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 26/32] mm/mlock: remove __munlock_isolate_lru_page
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (24 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page Alex Shi
@ 2020-08-24 12:54 ` Alex Shi
  2020-08-24 12:55 ` [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock Alex Shi
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Kirill A. Shutemov, Vlastimil Babka

The function has only one caller; remove it and open-code the work at the
call site to simplify the code.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 46a05e6ec5ba..40a8bb79c65e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -102,23 +102,6 @@ void mlock_vma_page(struct page *page)
 }
 
 /*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page,
-				struct lruvec *lruvec, bool getpage)
-{
-	if (TestClearPageLRU(page)) {
-		if (getpage)
-			get_page(page);
-		del_page_from_lru_list(page, lruvec, page_lru(page));
-		return true;
-	}
-
-	return false;
-}
-
-/*
  * Finish munlock after successful page isolation
  *
  * Page must be locked. This is a wrapper for try_to_munlock()
@@ -300,7 +283,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+			if (TestClearPageLRU(page)) {
+				enum lru_list lru = page_lru(page);
+
+				del_page_from_lru_list(page, lruvec, lru);
 				unlock_page_memcg(page);
 				continue;
 			} else
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (25 preceding siblings ...)
  2020-08-24 12:54 ` [PATCH v18 26/32] mm/mlock: remove __munlock_isolate_lru_page Alex Shi
@ 2020-08-24 12:55 ` Alex Shi
  2020-08-26  9:07   ` Alex Shi
  2020-08-24 12:55 ` [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block Alex Shi
                   ` (5 subsequent siblings)
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

The current relock logic switches lru_lock whenever it finds a new
lruvec. So if, say, two memcgs are reading files or allocating pages at a
similar rate, their pages end up interleaved in a pagevec and the two
lru_locks are taken and dropped alternately, page after page.

This patch records which lru_locks are needed and takes each of them only
once in that scenario, which reduces the lock contention.
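
In a simplified sketch (illustration only; the real implementation is in
the patch below and also takes care of releasing the pagevec), the idea
is to sort the pages by lruvec without any lock held, and then take each
distinct lru_lock exactly once:

	struct list_head lists[PAGEVEC_SIZE];
	struct lruvec *vecs[PAGEVEC_SIZE];
	unsigned long flags;
	int nr_vecs = 0, i, j;

	/* pass 1: no lock held, just sort the pages by their lruvec */
	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		struct lruvec *lruvec = mem_cgroup_page_lruvec(page,
							page_pgdat(page));

		for (j = 0; j < nr_vecs; j++)
			if (vecs[j] == lruvec)
				break;
		if (j == nr_vecs) {
			vecs[j] = lruvec;
			INIT_LIST_HEAD(&lists[j]);
			nr_vecs++;
		}
		list_add(&page->lru, &lists[j]);
	}

	/* pass 2: take each distinct lru_lock exactly once */
	for (i = 0; i < nr_vecs; i++) {
		spin_lock_irqsave(&vecs[i]->lru_lock, flags);
		while (!list_empty(&lists[i])) {
			struct page *page = lru_to_page(&lists[i]);

			list_del(&page->lru);
			__pagevec_lru_add_fn(page, vecs[i]);
		}
		spin_unlock_irqrestore(&vecs[i]->lru_lock, flags);
	}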

Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 43 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 2ac78e8fab71..fe53449fa1b8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -958,24 +958,53 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 	trace_mm_lru_insertion(page, lru);
 }
 
+struct add_lruvecs {
+	struct list_head lists[PAGEVEC_SIZE];
+	struct lruvec *vecs[PAGEVEC_SIZE];
+};
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
+	int i, j, total;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
+	struct page *page;
+	struct add_lruvecs lruvecs;
+
+	lruvecs.vecs[0] = NULL;
+	for (i = total = 0; i < pagevec_count(pvec); i++) {
+		page = pvec->pages[i];
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+		/* Try to find a same lruvec */
+		for (j = 0; j <= total; j++)
+			if (lruvec == lruvecs.vecs[j])
+				break;
+		/* A new lruvec */
+		if (j > total) {
+			INIT_LIST_HEAD(&lruvecs.lists[total]);
+			lruvecs.vecs[total] = lruvec;
+			j = total++;
+			lruvecs.vecs[total] = 0;
+		}
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+		list_add(&page->lru, &lruvecs.lists[j]);
+	}
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+	for (i = 0; i < total; i++) {
+		spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
+		while (!list_empty(&lruvecs.lists[i])) {
+			page = lru_to_page(&lruvecs.lists[i]);
+			list_del(&page->lru);
+			__pagevec_lru_add_fn(page, lruvecs.vecs[i]);
+		}
+		spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (26 preceding siblings ...)
  2020-08-24 12:55 ` [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock Alex Shi
@ 2020-08-24 12:55 ` Alex Shi
  2020-08-24 12:55 ` [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block Alex Shi
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Stephen Rothwell

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

We can drop the locked variable by making use of the
lruvec_holds_page_lru_lock() function. This avoids some RCU locking
ugliness for the case where the lruvec is still holding the LRU lock
associated with the page. Instead we just track the lruvec itself, and if
it is NULL we know the lock has been released.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/compaction.c | 46 +++++++++++++++++++++-------------------------
 1 file changed, 21 insertions(+), 25 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b724eacf6421..6bf5ccd8fcf6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -803,9 +803,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 {
 	pg_data_t *pgdat = cc->zone->zone_pgdat;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
-	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -866,9 +865,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * a fatal signal is pending.
 		 */
 		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
-			if (locked) {
-				unlock_page_lruvec_irqrestore(locked, flags);
-				locked = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 
 			if (fatal_signal_pending(current)) {
@@ -949,9 +948,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
-				if (locked) {
-					unlock_page_lruvec_irqrestore(locked, flags);
-					locked = NULL;
+				if (lruvec) {
+					unlock_page_lruvec_irqrestore(lruvec, flags);
+					lruvec = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -992,16 +991,14 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
-		rcu_read_lock();
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-
 		/* If we already hold the lock, we can skip some rechecking */
-		if (lruvec != locked) {
-			if (locked)
-				unlock_page_lruvec_irqrestore(locked, flags);
+		if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
 
+			rcu_read_lock();
+			lruvec = mem_cgroup_page_lruvec(page, pgdat);
 			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
-			locked = lruvec;
 			rcu_read_unlock();
 
 			lruvec_memcg_debug(lruvec, page);
@@ -1023,8 +1020,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		} else
-			rcu_read_unlock();
+		}
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1057,9 +1053,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
-		if (locked) {
-			unlock_page_lruvec_irqrestore(locked, flags);
-			locked = NULL;
+		if (lruvec) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 		put_page(page);
 
@@ -1073,9 +1069,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * page anyway.
 		 */
 		if (nr_isolated) {
-			if (locked) {
-				unlock_page_lruvec_irqrestore(locked, flags);
-				locked = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1102,8 +1098,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	page = NULL;
 
 isolate_abort:
-	if (locked)
-		unlock_page_lruvec_irqrestore(locked, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (27 preceding siblings ...)
  2020-08-24 12:55 ` [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block Alex Shi
@ 2020-08-24 12:55 ` Alex Shi
  2020-08-24 12:55 ` [PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip Alex Shi
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Stephen Rothwell

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Since we are holding a reference to the page much sooner in
isolate_migratepages_block(), we can move the PageCompound check out of
the LRU-locked section and instead place it just after
get_page_unless_zero(). By doing this, any check that fails now fails
against the compound page rather than against a single order-0 page, so
the whole compound range is skipped and the pageblock can be processed
faster.

In addition, by testing for PageCompound sooner we avoid clearing the LRU
flag and then having to reset it in the exception case. As a result this
should prevent possible races where another thread might be attempting to
pull the LRU pages from the list.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/compaction.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 6bf5ccd8fcf6..a0e48d079124 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -984,6 +984,24 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (unlikely(!get_page_unless_zero(page)))
 			goto isolate_fail;
 
+		/*
+		 * Page is compound. We know the order before we know if it is
+		 * on the LRU so we cannot assume it is THP. However since the
+		 * page will have the LRU validated shortly we can use the value
+		 * to skip over this page for now or validate the LRU is set and
+		 * then isolate the entire compound page if we are isolating to
+		 * generate a CMA page.
+		 */
+		if (PageCompound(page)) {
+			const unsigned int order = compound_order(page);
+
+			if (likely(order < MAX_ORDER))
+				low_pfn += (1UL << order) - 1;
+
+			if (!cc->alloc_contig)
+				goto isolate_fail_put;
+		}
+
 		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
 			goto isolate_fail_put;
 
@@ -1009,23 +1027,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				if (test_and_set_skip(cc, page, low_pfn))
 					goto isolate_abort;
 			}
-
-			/*
-			 * Page become compound since the non-locked check,
-			 * and it's on LRU. It can only be a THP so the order
-			 * is safe to read and it's 0 for tail pages.
-			 */
-			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
-				low_pfn += compound_nr(page) - 1;
-				SetPageLRU(page);
-				goto isolate_fail_put;
-			}
 		}
 
-		/* The whole page is taken off the LRU; skip the tail pages. */
-		if (PageCompound(page))
-			low_pfn += compound_nr(page) - 1;
-
 		/* Successfully isolated */
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		mod_node_page_state(page_pgdat(page),
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (28 preceding siblings ...)
  2020-08-24 12:55 ` [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block Alex Shi
@ 2020-08-24 12:55 ` Alex Shi
  2020-08-24 12:55 ` [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages Alex Shi
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck, Stephen Rothwell

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

The only user of test_and_set_skip() was isolate_migratepages_block(),
and it called the function after the LRU flag had already been tested and
cleared. As such it no longer needed to be behind the LRU lock, and it
was no longer really fulfilling its purpose there.

Since the skip flag can only be tested and set once we have obtained the
LRU bit for the first page in the pageblock, test_and_set_skip() has
become redundant: the LRU flag is now what limits the operation to a
single thread, so no test-and-set operation is needed.

With that being the case we can simply drop the helper and instead call
set_pageblock_skip() directly when the page we are working on is the
valid_page at the start of the pageblock. Any other thread that enters
this pageblock should then see the skip bit set on the first valid page.

Since we have dropped the late abort case, we can also drop the
abort-path code that restored the LRU flag and called put_page(), as the
abort case no longer holds a reference to a page.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/compaction.c | 53 +++++++++++++----------------------------------------
 1 file changed, 13 insertions(+), 40 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a0e48d079124..9443bc4d763d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -399,29 +399,6 @@ void reset_isolation_suitable(pg_data_t *pgdat)
 	}
 }
 
-/*
- * Sets the pageblock skip bit if it was clear. Note that this is a hint as
- * locks are not required for read/writers. Returns true if it was already set.
- */
-static bool test_and_set_skip(struct compact_control *cc, struct page *page,
-							unsigned long pfn)
-{
-	bool skip;
-
-	/* Do no update if skip hint is being ignored */
-	if (cc->ignore_skip_hint)
-		return false;
-
-	if (!IS_ALIGNED(pfn, pageblock_nr_pages))
-		return false;
-
-	skip = get_pageblock_skip(page);
-	if (!skip && !cc->no_set_skip_hint)
-		set_pageblock_skip(page);
-
-	return skip;
-}
-
 static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
 {
 	struct zone *zone = cc->zone;
@@ -480,12 +457,6 @@ static inline void update_pageblock_skip(struct compact_control *cc,
 static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
 {
 }
-
-static bool test_and_set_skip(struct compact_control *cc, struct page *page,
-							unsigned long pfn)
-{
-	return false;
-}
 #endif /* CONFIG_COMPACTION */
 
 /*
@@ -895,7 +866,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
 			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
 				low_pfn = end_pfn;
-				page = NULL;
 				goto isolate_abort;
 			}
 			valid_page = page;
@@ -1021,11 +991,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 			lruvec_memcg_debug(lruvec, page);
 
-			/* Try get exclusive access under lock */
-			if (!skip_updated) {
+			/*
+			 * Indicate that we want exclusive access to the
+			 * rest of the pageblock.
+			 *
+			 * The LRU flag prevents simultaneous access to the
+			 * first PFN, and the LRU lock helps to prevent
+			 * simultaneous update of multiple pageblocks shared
+			 * in the same bitmap.
+			 */
+			if (page == valid_page) {
+				if (!cc->ignore_skip_hint &&
+				    !cc->no_set_skip_hint)
+					set_pageblock_skip(page);
 				skip_updated = true;
-				if (test_and_set_skip(cc, page, low_pfn))
-					goto isolate_abort;
 			}
 		}
 
@@ -1098,15 +1077,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	if (unlikely(low_pfn > end_pfn))
 		low_pfn = end_pfn;
 
-	page = NULL;
-
 isolate_abort:
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
-	if (page) {
-		SetPageLRU(page);
-		put_page(page);
-	}
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (29 preceding siblings ...)
  2020-08-24 12:55 ` [PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip Alex Shi
@ 2020-08-24 12:55 ` Alex Shi
  2020-09-09  1:01   ` Matthew Wilcox
  2020-08-24 12:55 ` [PATCH v18 32/32] mm: Split release_pages work into 3 passes Alex Shi
  2020-08-24 18:42 ` [PATCH v18 00/32] per memcg lru_lock Andrew Morton
  32 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

In isolate_lru_pages() we have an exception path where
get_page_unless_zero() succeeds but TestClearPageLRU() fails, and we then
call put_page(). Normally this would be problematic, but due to the way
the calls are ordered, and the fact that we are holding the LRU lock, we
know that the other thread isolating the page must still be holding a
reference to it. We can therefore replace the put_page() with a
put_page_testzero() contained within a WARN_ON(). By doing this we should
see a warning if we ever leak a page because the reference count somehow
hit zero when it shouldn't, and we avoid the overhead and confusion of
using the full put_page() call.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/vmscan.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 601fbcb994fb..604240303ea2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1688,10 +1688,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 			if (!TestClearPageLRU(page)) {
 				/*
-				 * This page may in other isolation path,
-				 * but we still hold lru_lock.
+				 * This page is being isolated in another
+				 * thread, but we still hold lru_lock. The
+				 * other thread must be holding a reference
+				 * to the page so this should never hit a
+				 * reference count of 0.
 				 */
-				put_page(page);
+				WARN_ON(put_page_testzero(page));
 				goto busy;
 			}
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* [PATCH v18 32/32] mm: Split release_pages work into 3 passes
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (30 preceding siblings ...)
  2020-08-24 12:55 ` [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages Alex Shi
@ 2020-08-24 12:55 ` Alex Shi
  2020-08-24 18:42 ` [PATCH v18 00/32] per memcg lru_lock Andrew Morton
  32 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-24 12:55 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Alexander Duyck

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

The release_pages function has a number of paths that end up with the
LRU lock having to be released and reacquired. One such example is the
freeing of THP pages, which requires releasing the LRU lock so that it
can be potentially reacquired by __put_compound_page.

In order to avoid that we can split the work into 3 passes. The first
pass, without the LRU lock, goes through and separates the pages that
are not on the LRU, so they can be freed immediately, from those that
are. The second pass then removes those pages from the LRU in batches
as large as a pagevec can hold before releasing the LRU lock. Once the
pages have been removed from the LRU, we can proceed to free the
remaining pages without needing to worry about whether they are on the
LRU any further.

The general idea is to avoid bouncing the LRU lock between pages and to
hopefully aggregate the lock for up to a full pagevec worth of pages.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/swap.c | 109 ++++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 67 insertions(+), 42 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index fe53449fa1b8..b405f81b2c60 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -795,6 +795,54 @@ void lru_add_drain_all(void)
 }
 #endif
 
+static void __release_page(struct page *page, struct list_head *pages_to_free)
+{
+	if (PageCompound(page)) {
+		__put_compound_page(page);
+	} else {
+		/* Clear Active bit in case of parallel mark_page_accessed */
+		__ClearPageActive(page);
+		__ClearPageWaiters(page);
+
+		list_add(&page->lru, pages_to_free);
+	}
+}
+
+static void __release_lru_pages(struct pagevec *pvec,
+				struct list_head *pages_to_free)
+{
+	struct lruvec *lruvec = NULL;
+	unsigned long flags = 0;
+	int i;
+
+	/*
+	 * The pagevec at this point should contain a set of pages with
+	 * their reference count at 0 and the LRU flag set. We will now
+	 * need to pull the pages from their LRU lists.
+	 *
+	 * We walk the list backwards here since that way we are starting at
+	 * the pages that should be warmest in the cache.
+	 */
+	for (i = pagevec_count(pvec); i--;) {
+		struct page *page = pvec->pages[i];
+
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
+		VM_BUG_ON_PAGE(!PageLRU(page), page);
+		__ClearPageLRU(page);
+		del_page_from_lru_list(page, lruvec, page_off_lru(page));
+	}
+
+	unlock_page_lruvec_irqrestore(lruvec, flags);
+
+	/*
+	 * A batch of pages are no longer on the LRU list. Go through and
+	 * start the final process of returning the deferred pages to their
+	 * appropriate freelists.
+	 */
+	for (i = pagevec_count(pvec); i--;)
+		__release_page(pvec->pages[i], pages_to_free);
+}
+
 /**
  * release_pages - batched put_page()
  * @pages: array of pages to release
@@ -806,32 +854,24 @@ void lru_add_drain_all(void)
 void release_pages(struct page **pages, int nr)
 {
 	int i;
+	struct pagevec pvec;
 	LIST_HEAD(pages_to_free);
-	struct lruvec *lruvec = NULL;
-	unsigned long flags;
-	unsigned int lock_batch;
 
+	pagevec_init(&pvec);
+
+	/*
+	 * We need to first walk through the list cleaning up the low hanging
+	 * fruit and clearing those pages that either cannot be freed or that
+	 * are non-LRU. We will store the LRU pages in a pagevec so that we
+	 * can get to them in the next pass.
+	 */
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
-		/*
-		 * Make sure the IRQ-safe lock-holding time does not get
-		 * excessive with a continuous string of pages from the
-		 * same lruvec. The lock is held only if lruvec != NULL.
-		 */
-		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
-			unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = NULL;
-		}
-
 		if (is_huge_zero_page(page))
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-				lruvec = NULL;
-			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
 			 * put_devmap_managed_page() do not require special
@@ -848,36 +888,21 @@ void release_pages(struct page **pages, int nr)
 		if (!put_page_testzero(page))
 			continue;
 
-		if (PageCompound(page)) {
-			if (lruvec) {
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-				lruvec = NULL;
-			}
-			__put_compound_page(page);
+		if (!PageLRU(page)) {
+			__release_page(page, &pages_to_free);
 			continue;
 		}
 
-		if (PageLRU(page)) {
-			struct lruvec *prev_lruvec = lruvec;
-
-			lruvec = relock_page_lruvec_irqsave(page, lruvec,
-									&flags);
-			if (prev_lruvec != lruvec)
-				lock_batch = 0;
-
-			VM_BUG_ON_PAGE(!PageLRU(page), page);
-			__ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, page_off_lru(page));
+		/* record page so we can get it in the next pass */
+		if (!pagevec_add(&pvec, page)) {
+			__release_lru_pages(&pvec, &pages_to_free);
+			pagevec_reinit(&pvec);
 		}
-
-		/* Clear Active bit in case of parallel mark_page_accessed */
-		__ClearPageActive(page);
-		__ClearPageWaiters(page);
-
-		list_add(&page->lru, &pages_to_free);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+
+	/* flush any remaining LRU pages that need to be processed */
+	if (pagevec_count(&pvec))
+		__release_lru_pages(&pvec, &pages_to_free);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
                   ` (31 preceding siblings ...)
  2020-08-24 12:55 ` [PATCH v18 32/32] mm: Split release_pages work into 3 passes Alex Shi
@ 2020-08-24 18:42 ` Andrew Morton
  2020-08-24 20:24   ` Hugh Dickins
  2020-08-25  7:21   ` [PATCH v18 00/32] per memcg lru_lock Michal Hocko
  32 siblings, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2020-08-24 18:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301

On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:

> The new version which bases on v5.9-rc2. The first 6 patches was picked into
> linux-mm, and add patch 25-32 that do some further post optimization.

32 patches, version 18.  That's quite heroic.  I'm unsure whether I
should merge it up at this point - what do people think?

> 
> Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
> containers on a 2s * 26cores * HT box with a modefied case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> With this patchset, the readtwice performance increased about 80%
> in concurrent containers.

That's rather a slight amount of performance testing for a huge
performance patchset!  Is more detailed testing planned?



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-24 18:42 ` [PATCH v18 00/32] per memcg lru_lock Andrew Morton
@ 2020-08-24 20:24   ` Hugh Dickins
  2020-08-25  1:56     ` Daniel Jordan
  2020-08-27  7:01     ` Hugh Dickins
  2020-08-25  7:21   ` [PATCH v18 00/32] per memcg lru_lock Michal Hocko
  1 sibling, 2 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-08-24 20:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alex Shi, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Mon, 24 Aug 2020, Andrew Morton wrote:
> On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> 
> > The new version which bases on v5.9-rc2.

Well timed and well based, thank you Alex.  Particularly helpful to me,
to include those that already went into mmotm: it's a surer foundation
to test on top of the -rc2 base.

> > the first 6 patches was picked into
> > linux-mm, and add patch 25-32 that do some further post optimization.
> 
> 32 patches, version 18.  That's quite heroic.  I'm unsure whether I
> should merge it up at this point - what do people think?

I'd love for it to go into mmotm - but not today.

Version 17 tested out well.  I've only just started testing version 18,
but I'm afraid there's been a number of "improvements" in between,
which show up as warnings (lots of VM_WARN_ON_ONCE_PAGE(!memcg) -
I think one or more of those are already in mmotm and under discussion
on the list, but I haven't read through yet, and I may have caught
more cases to examine; a per-cpu warning from munlock_vma_page();
something else flitted by at reboot time before I could read it).
No crashes so far, but I haven't got very far with it yet.

I'll report back later in the week.

Andrew demurred on version 17 for lack of review.  Alexander Duyck has
been doing a lot on that front since then.  I have intended to do so,
but it's a mirage that moves away from me as I move towards it: I have
some time in the coming weeks to get back to that, but it would help
me if the series is held more static by being in mmotm - we may need
fixes, but improvements are liable to get in the way of finalizing.

I still find the reliance on TestClearPageLRU, rather than lru_lock,
hard to wrap my head around: but for so long as it's working correctly,
please take that as a problem with my head (and something we can
certainly change later if necessary, by re-adding the use of lru_lock
in certain places (or by fitting me with a new head)).

> 
> > 
> > Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
> > containers on a 2s * 26cores * HT box with a modefied case:
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> > With this patchset, the readtwice performance increased about 80%
> > in concurrent containers.
> 
> That's rather a slight amount of performance testing for a huge
> performance patchset!

Indeed.  And I see that clause about readtwice performance increased 80%
going back eight months to v6: a lot of fundamental bugs have been fixed
in it since then, so I do think it needs refreshing.  It could be faster
now: v16 or v17 fixed the last bug I knew of, which had been slowing
down reclaim considerably.

When I last timed my repetitive swapping loads (not loads anyone sensible
would be running with), across only two memcgs, Alex's patchset was
slightly faster than without: it really did make a difference.  But
I tend to think that for all patchsets, there exists at least one
test that shows it faster, and another that shows it slower.

> Is more detailed testing planned?

Not by me, performance testing is not something I trust myself with,
just get lost in the numbers: Alex, this is what we hoped for months
ago, please make a more convincing case, I hope Daniel and others
can make more suggestions.  But my own evidence suggests it's good.

Hugh


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-24 20:24   ` Hugh Dickins
@ 2020-08-25  1:56     ` Daniel Jordan
  2020-08-25  3:26       ` Alex Shi
  2020-08-25  8:52       ` Alex Shi
  2020-08-27  7:01     ` Hugh Dickins
  1 sibling, 2 replies; 102+ messages in thread
From: Daniel Jordan @ 2020-08-25  1:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Alex Shi, mgorman, tj, khlebnikov,
	daniel.m.jordan, willy, hannes, lkp, linux-mm, linux-kernel,
	cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301

On Mon, Aug 24, 2020 at 01:24:20PM -0700, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Andrew Morton wrote:
> > On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> Andrew demurred on version 17 for lack of review.  Alexander Duyck has
> been doing a lot on that front since then.  I have intended to do so,
> but it's a mirage that moves away from me as I move towards it: I have

Same, I haven't been able to keep up with the versions or the recent review
feedback.  I got through about half of v17 last week and hope to have more time
for the rest this week and beyond.

> > > Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
> > > containers on a 2s * 26cores * HT box with a modefied case:

Alex, do you have a pointer to the modified readtwice case?

Even better would be a description of the problem you're having in production
with lru_lock.  We might be able to create at least a simulation of it to show
what the expected improvement of your real workload is.

> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> > > With this patchset, the readtwice performance increased about 80%
> > > in concurrent containers.
> > 
> > That's rather a slight amount of performance testing for a huge
> > performance patchset!
> 
> Indeed.  And I see that clause about readtwice performance increased 80%
> going back eight months to v6: a lot of fundamental bugs have been fixed
> in it since then, so I do think it needs refreshing.  It could be faster
> now: v16 or v17 fixed the last bug I knew of, which had been slowing
> down reclaim considerably.
> 
> When I last timed my repetitive swapping loads (not loads anyone sensible
> would be running with), across only two memcgs, Alex's patchset was
> slightly faster than without: it really did make a difference.  But
> I tend to think that for all patchsets, there exists at least one
> test that shows it faster, and another that shows it slower.
> 
> > Is more detailed testing planned?
> 
> Not by me, performance testing is not something I trust myself with,
> just get lost in the numbers: Alex, this is what we hoped for months
> ago, please make a more convincing case, I hope Daniel and others
> can make more suggestions.  But my own evidence suggests it's good.

I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel from
mmtests, a memcg-ized version of the readtwice case I cooked up) and then today
discovered there's a chance I wasn't running the right kernels, so I'm redoing
them on v18.  Plan to look into what other, more "macro" tests would be
sensitive to these changes.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-25  1:56     ` Daniel Jordan
@ 2020-08-25  3:26       ` Alex Shi
  2020-08-25 11:39         ` Matthew Wilcox
  2020-08-26  1:19         ` Daniel Jordan
  2020-08-25  8:52       ` Alex Shi
  1 sibling, 2 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-25  3:26 UTC (permalink / raw)
  To: Daniel Jordan, Hugh Dickins
  Cc: Andrew Morton, mgorman, tj, khlebnikov, willy, hannes, lkp,
	linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/8/25 9:56 AM, Daniel Jordan wrote:
> On Mon, Aug 24, 2020 at 01:24:20PM -0700, Hugh Dickins wrote:
>> On Mon, 24 Aug 2020, Andrew Morton wrote:
>>> On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>> Andrew demurred on version 17 for lack of review.  Alexander Duyck has
>> been doing a lot on that front since then.  I have intended to do so,
>> but it's a mirage that moves away from me as I move towards it: I have
> 
> Same, I haven't been able to keep up with the versions or the recent review
> feedback.  I got through about half of v17 last week and hope to have more time
> for the rest this week and beyond.
> 
>>>> Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
>>>> containers on a 2s * 26cores * HT box with a modefied case:
> 
> Alex, do you have a pointer to the modified readtwice case?

Sorry, no. My developer machine crashed, so I lost my container and the modified
case. I am struggling to get my container back from a repository with an account problem.

But some testing scripts are here. Generally, the original readtwice case runs
one thread on each of the cpus. The new case runs one container on each cpu,
and just runs one readtwice thread in each of the containers.

Here are the readtwice case changes (just a reference):
diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice
index 85533b248634..48c6b5f44256 100755
--- a/case-lru-file-readtwice
+++ b/case-lru-file-readtwice
@@ -15,12 +15,9 @@

 . ./hw_vars

-for i in `seq 1 $nr_task`
-do
        create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES / nr_task))
        timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1 &
        timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1 &
-done

 wait
 sleep 1
@@ -31,7 +28,7 @@ do
                echo "dd output file empty: $file" >&2
        }
        cat $file
-       rm  $file
+       #rm  $file
 done

 rm `seq -f $SPARSE_FILE-%g 1 $nr_task`

And here is how to run the case: 
--------
#run all cases on a 24 cpu machine, lrulockv2 is the container with the modified case.
for ((i=0; i<24; i++))
do
        #btw, vm-scalability needs to create 23 loop devices
        docker run --privileged=true --rm lrulockv2 bash -c " sleep 20000" &
done
sleep 15  #wait for all containers to be ready.

#kick testing
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do docker exec --privileged=true -it $i bash -c "cd vm-scalability/; bash -x ./run case-lru-file-readtwice "& done

#show testing result for all
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do echo === $i ===; docker exec $i bash -c 'cat /tmp/vm-scalability-tmp/dd-output-* '  & done
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do echo === $i ===; docker exec $i bash -c 'cat /tmp/vm-scalability-tmp/dd-output-* '  & done | grep MB | awk 'BEGIN {a=0;} { a+=$8} END {print NR, a/(NR)}'


> 
> Even better would be a description of the problem you're having in production
> with lru_lock.  We might be able to create at least a simulation of it to show
> what the expected improvement of your real workload is.

we are using thousands memcgs in a machine, but as a simulation, I guess above case
could be helpful to show the problem.

Thanks a lot!
Alex

> 
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>>>> With this patchset, the readtwice performance increased about 80%
>>>> in concurrent containers.
>>>
>>> That's rather a slight amount of performance testing for a huge
>>> performance patchset!
>>
>> Indeed.  And I see that clause about readtwice performance increased 80%
>> going back eight months to v6: a lot of fundamental bugs have been fixed
>> in it since then, so I do think it needs refreshing.  It could be faster
>> now: v16 or v17 fixed the last bug I knew of, which had been slowing
>> down reclaim considerably.
>>
>> When I last timed my repetitive swapping loads (not loads anyone sensible
>> would be running with), across only two memcgs, Alex's patchset was
>> slightly faster than without: it really did make a difference.  But
>> I tend to think that for all patchsets, there exists at least one
>> test that shows it faster, and another that shows it slower.

In my testing, case-lru-file-mmap-read is a bit slower, 10+% on a 96 thread
machine, when memcg is enabled but unused. That may be due to longer pointer
chasing through the lruvec than through pgdat->lru_lock, since
cgroup_disable=memory fully removes the regression with the new lock path.

I tried reusing page->private to store the lruvec pointer, which could remove
some of this regression, since private is generally unused on an lru page.
But that patch is too buggy now.

BTW,
I guess memcg would cause more memory disturbance on a large machine if it's
enabled but unused, wouldn't it?


>>
>>> Is more detailed testing planned?
>>
>> Not by me, performance testing is not something I trust myself with,
>> just get lost in the numbers: Alex, this is what we hoped for months
>> ago, please make a more convincing case, I hope Daniel and others
>> can make more suggestions.  But my own evidence suggests it's good.
> 
> I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel from
> mmtests, a memcg-ized version of the readtwice case I cooked up) and then today
> discovered there's a chance I wasn't running the right kernels, so I'm redoing
> them on v18.  Plan to look into what other, more "macro" tests would be
> sensitive to these changes.
> 


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-24 18:42 ` [PATCH v18 00/32] per memcg lru_lock Andrew Morton
  2020-08-24 20:24   ` Hugh Dickins
@ 2020-08-25  7:21   ` Michal Hocko
  1 sibling, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2020-08-25  7:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alex Shi, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, vdavydov.dev, shy828301

On Mon 24-08-20 11:42:04, Andrew Morton wrote:
> On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> 
> > The new version which bases on v5.9-rc2. The first 6 patches was picked into
> > linux-mm, and add patch 25-32 that do some further post optimization.
> 
> 32 patches, version 18.  That's quite heroic.  I'm unsure whether I
> should merge it up at this point - what do people think?

This really needs a proper review. Unfortunately
: 24 files changed, 646 insertions(+), 443 deletions(-)
is quite an undertaking to review as well. Especially in tricky code
which is full of surprises.

I do agree that per memcg locking looks like a nice feature but I do not
see any pressing reason to merge it ASAP. The cover letter doesn't
really describe any pressing use case that cannot live without this
being merged.

I am fully aware of my debt to review but I simply cannot find enough
time to sit on it and think it through to have meaningful feedback at
this moment.

> > Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
> > containers on a 2s * 26cores * HT box with a modefied case:
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> > With this patchset, the readtwice performance increased about 80%
> > in concurrent containers.
> 
> That's rather a slight amount of performance testing for a huge
> performance patchset!  Is more detailed testing planned?

Agreed! This needs much better testing coverage.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-25  1:56     ` Daniel Jordan
  2020-08-25  3:26       ` Alex Shi
@ 2020-08-25  8:52       ` Alex Shi
  2020-08-25 13:00         ` Alex Shi
  1 sibling, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-25  8:52 UTC (permalink / raw)
  To: Daniel Jordan, Hugh Dickins
  Cc: Andrew Morton, mgorman, tj, khlebnikov, willy, hannes, lkp,
	linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/8/25 9:56 AM, Daniel Jordan wrote:
> On Mon, Aug 24, 2020 at 01:24:20PM -0700, Hugh Dickins wrote:
>> On Mon, 24 Aug 2020, Andrew Morton wrote:
>>> On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>> Andrew demurred on version 17 for lack of review.  Alexander Duyck has
>> been doing a lot on that front since then.  I have intended to do so,
>> but it's a mirage that moves away from me as I move towards it: I have
> 
> Same, I haven't been able to keep up with the versions or the recent review
> feedback.  I got through about half of v17 last week and hope to have more time
> for the rest this week and beyond.
> 
>>>> Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
>>>> containers on a 2s * 26cores * HT box with a modefied case:
> 
> Alex, do you have a pointer to the modified readtwice case?
> 

Hi Daniel,

My readtwice modification is like below:

diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice
index 85533b248634..57cb97d121ae 100755
--- a/case-lru-file-readtwice
+++ b/case-lru-file-readtwice
@@ -15,23 +15,30 @@

 . ./hw_vars

-for i in `seq 1 $nr_task`
-do
-       create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES / nr_task))
-       timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1 &
-       timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1 &
-done
+OUT_DIR=$(hostname)-${nr_task}c-$(((mem + (1<<29))>>30))g
+TEST_CASES=${@:-$(echo case-*)}
+
+echo $((1<<30)) > /proc/sys/vm/max_map_count
+echo $((1<<20)) > /proc/sys/kernel/threads-max
+echo 1 > /proc/sys/vm/overcommit_memory
+#echo 3 > /proc/sys/vm/drop_caches
+
+
+i=1
+
+if [ "$1" == "m" ];then
+       mount_tmpfs
+       create_sparse_root
+       create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES))
+       exit
+fi
+
+
+if [ "$1" == "r" ];then
+       (timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1)&
+       (timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1)&
+fi

 wait
 sleep 1

-for file in $TMPFS_MNT/dd-output-*
-do
-       [ -s "$file" ] || {
-               echo "dd output file empty: $file" >&2
-       }
-       cat $file
-       rm  $file
-done
-
-rm `seq -f $SPARSE_FILE-%g 1 $nr_task`
diff --git a/hw_vars b/hw_vars
index 8731cefb9f57..ceeaa9f17c0b 100755
--- a/hw_vars
+++ b/hw_vars
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/sh -ex

 if [ -n "$runtime" ]; then
        USEMEM="$CMD ./usemem --runtime $runtime"
@@ -43,7 +43,7 @@ create_loop_devices()
        modprobe loop 2>/dev/null
        [ -e "/dev/loop0" ] || modprobe loop 2>/dev/null

-       for i in $(seq 0 8)
+       for i in $(seq 0 104)
        do
                [ -e "/dev/loop$i" ] && continue
                mknod /dev/loop$i b 7 $i


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-25  3:26       ` Alex Shi
@ 2020-08-25 11:39         ` Matthew Wilcox
  2020-08-26  1:19         ` Daniel Jordan
  1 sibling, 0 replies; 102+ messages in thread
From: Matthew Wilcox @ 2020-08-25 11:39 UTC (permalink / raw)
  To: Alex Shi
  Cc: Daniel Jordan, Hugh Dickins, Andrew Morton, mgorman, tj,
	khlebnikov, hannes, lkp, linux-mm, linux-kernel, cgroups,
	shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301

On Tue, Aug 25, 2020 at 11:26:58AM +0800, Alex Shi wrote:
> I tried reusing page->prviate to store lruvec pointer, that could remove some 
> regression on this, since private is generally unused on a lru page. But the patch
> is too buggy now. 

page->private is for the use of the filesystem.  You can grep for
attach_page_private() to see how most filesystems use it.
Some still use set_page_private() for various reasons.
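
For readers following along, a minimal sketch of the usual pairing - the
foo_* names and the per-page struct below are made up for illustration;
only attach_page_private()/detach_page_private() are the real pagemap.h
helpers (they take/drop a page reference and set/clear PagePrivate):

#include <linux/pagemap.h>	/* attach_page_private(), detach_page_private() */
#include <linux/slab.h>

/* hypothetical per-page state some filesystem wants to track */
struct foo_page_state {
	unsigned int io_count;
};

static void foo_attach_state(struct page *page)
{
	struct foo_page_state *state = kzalloc(sizeof(*state), GFP_NOFS);

	if (state)
		attach_page_private(page, state); /* grabs a page ref, sets PagePrivate */
}

static void foo_detach_state(struct page *page)
{
	/* clears PagePrivate, drops the ref, returns the stashed pointer */
	kfree(detach_page_private(page));
}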


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-25  8:52       ` Alex Shi
@ 2020-08-25 13:00         ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-25 13:00 UTC (permalink / raw)
  To: Daniel Jordan, Hugh Dickins
  Cc: Andrew Morton, mgorman, tj, khlebnikov, willy, hannes, lkp,
	linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/8/25 4:52 PM, Alex Shi wrote:
> 
> On 2020/8/25 9:56 AM, Daniel Jordan wrote:
>> On Mon, Aug 24, 2020 at 01:24:20PM -0700, Hugh Dickins wrote:
>>> On Mon, 24 Aug 2020, Andrew Morton wrote:
>>>> On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>> Andrew demurred on version 17 for lack of review.  Alexander Duyck has
>>> been doing a lot on that front since then.  I have intended to do so,
>>> but it's a mirage that moves away from me as I move towards it: I have
>> Same, I haven't been able to keep up with the versions or the recent review
>> feedback.  I got through about half of v17 last week and hope to have more time
>> for the rest this week and beyond.
>>
>>>>> Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
>>>>> containers on a 2s * 26cores * HT box with a modefied case:
>> Alex, do you have a pointer to the modified readtwice case?
>>
> Hi Daniel,
> 
> my readtwice modification like below.
> 
> diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice

Hi Daniel,

I finally settled down my container, and found I had given a different version of my
scripts which can't work together. I am sorry!

I will try to bring them up together and give a new version.

Thanks a lot!
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-25  3:26       ` Alex Shi
  2020-08-25 11:39         ` Matthew Wilcox
@ 2020-08-26  1:19         ` Daniel Jordan
  2020-08-26  8:59           ` Alex Shi
  1 sibling, 1 reply; 102+ messages in thread
From: Daniel Jordan @ 2020-08-26  1:19 UTC (permalink / raw)
  To: Alex Shi
  Cc: Daniel Jordan, Hugh Dickins, Andrew Morton, mgorman, tj,
	khlebnikov, willy, hannes, lkp, linux-mm, linux-kernel, cgroups,
	shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301

On Tue, Aug 25, 2020 at 11:26:58AM +0800, Alex Shi wrote:
> On 2020/8/25 9:56 AM, Daniel Jordan wrote:
> > Alex, do you have a pointer to the modified readtwice case?
> 
> Sorry, no. my developer machine crashed, so I lost case my container and modified
> case. I am struggling to get my container back from a account problematic repository. 
> 
> But some testing scripts is here, generally, the original readtwice case will
> run each of threads on each of cpus. The new case will run one container on each cpus,
> and just run one readtwice thead in each of containers.

Ok, what you've sent so far gives me an idea of what you did.  My readtwice
changes were similar, except I used the cgroup interface directly instead of
docker and shared a filesystem between all the cgroups whereas it looks like
you had one per memcg.  30 second runs on 5.9-rc2 and v18 gave 11% more data
read with v18.  This was using 16 cgroups (32 dd tasks) on a 40 CPU, 2 socket
machine.
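
A rough sketch of that kind of direct-cgroup setup (the cgroup v1 mount
point, runtime and paths below are illustrative assumptions, not the exact
configuration used here):

#!/bin/sh
# one memcg per pair of dd readers, all on one shared filesystem;
# assumes the cgroup v1 memory controller at /sys/fs/cgroup/memory
# and sparse files already created under $MNT.
NR_CGROUPS=16
MNT=/mnt/shared

for i in $(seq 1 $NR_CGROUPS); do
	cg=/sys/fs/cgroup/memory/readtwice-$i
	mkdir -p $cg
	# two readers per memcg, streaming the same sparse file twice
	for j in 1 2; do
		( echo 0 > $cg/cgroup.procs	# move this subshell into the memcg
		  exec timeout -s INT 30 dd bs=4k if=$MNT/sparse-$i of=/dev/null ) &
	done
done
wait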

> > Even better would be a description of the problem you're having in production
> > with lru_lock.  We might be able to create at least a simulation of it to show
> > what the expected improvement of your real workload is.
> 
> we are using thousands memcgs in a machine, but as a simulation, I guess above case
> could be helpful to show the problem.

Using thousands of memcgs to do what?  Any particulars about the type of
workload?  Surely it's more complicated than page cache reads :)

> > I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel from
> > mmtests, a memcg-ized version of the readtwice case I cooked up) and then today
> > discovered there's a chance I wasn't running the right kernels, so I'm redoing
> > them on v18.

Neither kernel compile nor git checkout in the root cgroup changed much, just
0.31% slower on elapsed time for the compile, so no significant regressions
there.  Now for sysbench again.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
  2020-08-24 12:54 ` [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page Alex Shi
@ 2020-08-26  5:52   ` Alex Shi
  2020-09-22  6:13   ` Hugh Dickins
  1 sibling, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-26  5:52 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301
  Cc: Kirill A. Shutemov, Vlastimil Babka

LKP reported a preemption issue on this patch. Update and refresh
the commit log.

From f18e8c87a045bbb8040006b6816ded1f55fa6f9c Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 25 Jul 2020 22:31:03 +0800
Subject: [PATCH] mm/mlock: remove lru_lock on TestClearPageMlocked in
 munlock_vma_page

In the func munlock_vma_page, the comments maintained that lru_lock is
needed for serialization with split_huge_page. But the page must be
PageLocked here, as must the pages in the split_huge_page series of
funcs. Thus PageLocked is enough to serialize both funcs.

So we can drop the lru lock around TestClearPageMlocked/hpage_nr_pages,
which do not need it.

As for the other munlock func, __munlock_pagevec, it has no PageLocked
protection and should keep the lru protection.

LKP found a preemption issue on __mod_zone_page_state, which needs to be
changed to mod_zone_page_state. Thanks!

Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 43 ++++++++++++++++---------------------------
 1 file changed, 16 insertions(+), 27 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0448409184e3..cd88b93b0f0d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -69,9 +69,9 @@ void clear_page_mlock(struct page *page)
 	 *
 	 * See __pagevec_lru_add_fn for more explanation.
 	 */
-	if (!isolate_lru_page(page)) {
+	if (!isolate_lru_page(page))
 		putback_lru_page(page);
-	} else {
+	else {
 		/*
 		 * We lost the race. the page already moved to evictable list.
 		 */
@@ -178,7 +178,6 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -186,37 +185,22 @@ unsigned int munlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	/*
-	 * Serialize split tail pages in __split_huge_page_tail() which
-	 * might otherwise copy PageMlocked to part of the tail pages before
-	 * we clear it in the head page. It also stabilizes thp_nr_pages().
-	 * TestClearPageLRU can't be used here to block page isolation, since
-	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
-	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
-	 * wrong lru list here. So relay on PageLocked to stop lruvec change
-	 * in mem_cgroup_move_account().
+	 * Serialize split tail pages in __split_huge_page_tail() by
+	 * lock_page(); Do TestClearPageMlocked/PageLRU sequence like
+	 * clear_page_mlock().
 	 */
-	lruvec = lock_page_lruvec_irq(page);
-
-	if (!TestClearPageMlocked(page)) {
+	if (!TestClearPageMlocked(page))
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
-	}
+		return 0;
 
 	nr_pages = thp_nr_pages(page);
-	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+	mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, lruvec, true)) {
-		unlock_page_lruvec_irq(lruvec);
+	if (!isolate_lru_page(page))
 		__munlock_isolated_page(page);
-		goto out;
-	}
-	__munlock_isolation_failed(page);
-
-unlock_out:
-	unlock_page_lruvec_irq(lruvec);
+	else
+		__munlock_isolation_failed(page);
 
-out:
 	return nr_pages - 1;
 }
 
@@ -305,6 +289,11 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 
 		/* block memcg change in mem_cgroup_move_account */
 		lock_page_memcg(page);
+		/*
+		 * Serialize split tail pages in __split_huge_page_tail() which
+		 * might otherwise copy PageMlocked to part of the tail pages
+		 * before we clear it in the head page.
+		 */
 		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (TestClearPageMlocked(page)) {
 			/*
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-26  1:19         ` Daniel Jordan
@ 2020-08-26  8:59           ` Alex Shi
  2020-08-28  1:40             ` Daniel Jordan
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-26  8:59 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Hugh Dickins, Andrew Morton, mgorman, tj, khlebnikov, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

[-- Attachment #1: Type: text/plain, Size: 3373 bytes --]



On 2020/8/26 9:19 AM, Daniel Jordan wrote:
> On Tue, Aug 25, 2020 at 11:26:58AM +0800, Alex Shi wrote:
>> On 2020/8/25 9:56 AM, Daniel Jordan wrote:
>>> Alex, do you have a pointer to the modified readtwice case?
>>
>> Sorry, no. my developer machine crashed, so I lost case my container and modified
>> case. I am struggling to get my container back from a account problematic repository. 
>>
>> But some testing scripts is here, generally, the original readtwice case will
>> run each of threads on each of cpus. The new case will run one container on each cpus,
>> and just run one readtwice thead in each of containers.
> 
> Ok, what you've sent so far gives me an idea of what you did.  My readtwice
> changes were similar, except I used the cgroup interface directly instead of
> docker and shared a filesystem between all the cgroups whereas it looks like
> you had one per memcg.  30 second runs on 5.9-rc2 and v18 gave 11% more data
> read with v18.  This was using 16 cgroups (32 dd tasks) on a 40 CPU, 2 socket
> machine.

I cleaned up my testing and made it reproducible with a Dockerfile and a case
patch, which are attached.
Users can build a container from the file, and then do testing like the following:

#start some testing containers
for ((i=0; i< 80; i++)); do docker run --privileged=true --rm lrulock bash -c " sleep 20000" & done

#do testing env setup 
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do docker exec --privileged=true -it $i bash -c "cd vm-scalability/; bash -x ./case-lru-file-readtwice m"& done

#kick testing
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do docker exec --privileged=true -it $i bash -c "cd vm-scalability/; bash -x ./case-lru-file-readtwice r"& done

#show result
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do echo === $i ===; docker exec $i bash -c 'cat /tmp/vm-scalability-tmp/dd-output-* '  & done | grep MB | awk 'BEGIN {a=0;} { a+=$10 } END {print NR, a/(NR)}'

This time, on a 2P * 20 core * 2 HT machine,
the readtwice performance is 252% compared to the v5.9-rc2 kernel. A good surprise!

> 
>>> Even better would be a description of the problem you're having in production
>>> with lru_lock.  We might be able to create at least a simulation of it to show
>>> what the expected improvement of your real workload is.
>>
>> we are using thousands memcgs in a machine, but as a simulation, I guess above case
>> could be helpful to show the problem.
> 
> Using thousands of memcgs to do what?  Any particulars about the type of
> workload?  Surely it's more complicated than page cache reads :)

Yes, the workloads are quite different across different businesses: some use cpu a
lot, some use memory a lot, and some are mixed. The number of containers also varies
quite a bit, from tens to hundreds to thousands.

> 
>>> I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel from
>>> mmtests, a memcg-ized version of the readtwice case I cooked up) and then today
>>> discovered there's a chance I wasn't running the right kernels, so I'm redoing
>>> them on v18.
> 
> Neither kernel compile nor git checkout in the root cgroup changed much, just
> 0.31% slower on elapsed time for the compile, so no significant regressions
> there.  Now for sysbench again.
> 

Thanks a lot for the testing report!
Alex

[-- Attachment #2: Dockerfile --]
[-- Type: text/plain, Size: 509 bytes --]

FROM centos:8
MAINTAINER Alexs 
#WORKDIR /vm-scalability 
#RUN yum update -y && yum groupinstall "Development Tools" -y && yum clean all && \
#examples https://www.linuxtechi.com/build-docker-container-images-with-dockerfile/
RUN yum install git xfsprogs patch make gcc -y && yum clean all && \
git clone  https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/ && \
cd vm-scalability && make usemem

COPY readtwice.patch /vm-scalability/

RUN cd vm-scalability && patch -p1 < readtwice.patch

[-- Attachment #3: readtwice.patch --]
[-- Type: text/plain, Size: 2243 bytes --]

diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice
index 85533b248634..57cb97d121ae 100755
--- a/case-lru-file-readtwice
+++ b/case-lru-file-readtwice
@@ -15,23 +15,30 @@
 
 . ./hw_vars
 
-for i in `seq 1 $nr_task`
-do
-	create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES / nr_task))
-	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1 &
-	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1 &
-done
+OUT_DIR=$(hostname)-${nr_task}c-$(((mem + (1<<29))>>30))g
+TEST_CASES=${@:-$(echo case-*)}
+
+echo $((1<<30)) > /proc/sys/vm/max_map_count
+echo $((1<<20)) > /proc/sys/kernel/threads-max
+echo 1 > /proc/sys/vm/overcommit_memory
+#echo 3 > /proc/sys/vm/drop_caches
+
+
+i=1
+
+if [ "$1" == "m" ];then
+	mount_tmpfs
+	create_sparse_root
+	create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES))
+	exit
+fi
+
+
+if [ "$1" == "r" ];then
+	(timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1)&
+	(timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1)&
+fi
 
 wait
 sleep 1
 
-for file in $TMPFS_MNT/dd-output-*
-do
-	[ -s "$file" ] || {
-		echo "dd output file empty: $file" >&2
-	}
-	cat $file
-	rm  $file
-done
-
-rm `seq -f $SPARSE_FILE-%g 1 $nr_task`
diff --git a/hw_vars b/hw_vars
index 8731cefb9f57..ceeaa9f17c0b 100755
--- a/hw_vars
+++ b/hw_vars
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/sh -ex
 
 if [ -n "$runtime" ]; then
 	USEMEM="$CMD ./usemem --runtime $runtime"
@@ -43,7 +43,7 @@ create_loop_devices()
 	modprobe loop 2>/dev/null
 	[ -e "/dev/loop0" ] || modprobe loop 2>/dev/null
 
-	for i in $(seq 0 8)
+	for i in $(seq 0 104)
 	do
 		[ -e "/dev/loop$i" ] && continue
 		mknod /dev/loop$i b 7 $i
@@ -101,11 +101,11 @@ remove_sparse_root () {
 create_sparse_file () {
 	name=$1
 	size=$2
-	# echo "$name is of size $size"
+	echo "$name is of size $size"
 	$CMD truncate $name -s $size
 	# dd if=/dev/zero of=$name bs=1k count=1 seek=$((size >> 10)) 2>/dev/null
-	# ls $SPARSE_ROOT
-	# ls /tmp/vm-scalability/*
+	ls $SPARSE_ROOT
+	ls /tmp/vm-scalability/*
 }
 
 


^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock
  2020-08-24 12:55 ` [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock Alex Shi
@ 2020-08-26  9:07   ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-26  9:07 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

The patch needs an update since a bug was found.

From 547d95205e666c7c5a81c44b7b1f8e1b6c7b1749 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 1 Aug 2020 22:49:31 +0800
Subject: [PATCH] mm/swap.c: optimizing __pagevec_lru_add lru_lock

The current relock logic switches the lru_lock whenever it finds a new
lruvec, so if 2 memcgs are reading files or allocating pages at the same
rate, they could end up taking the lru_lock alternately, page by page.

This patch records the needed lru_locks and holds each of them only once
in the above scenario. That could reduce the lock contention.

Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 42 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 2ac78e8fab71..dba3f0aba2a0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -958,24 +958,52 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 	trace_mm_lru_insertion(page, lru);
 }
 
+struct add_lruvecs {
+	struct list_head lists[PAGEVEC_SIZE];
+	struct lruvec *vecs[PAGEVEC_SIZE];
+};
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
+	int i, j, total;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
+	struct page *page;
+	struct add_lruvecs lruvecs;
+
+	for (i = total = 0; i < pagevec_count(pvec); i++) {
+		page = pvec->pages[i];
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		lruvecs.vecs[i] = NULL;
+
+		/* Try to find a same lruvec */
+		for (j = 0; j < total; j++)
+			if (lruvec == lruvecs.vecs[j])
+				break;
+		/* A new lruvec */
+		if (j == total) {
+			INIT_LIST_HEAD(&lruvecs.lists[total]);
+			lruvecs.vecs[total] = lruvec;
+			total++;
+		}
 
-	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+		list_add(&page->lru, &lruvecs.lists[j]);
+	}
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+	for (i = 0; i < total; i++) {
+		spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
+		while (!list_empty(&lruvecs.lists[i])) {
+			page = lru_to_page(&lruvecs.lists[i]);
+			list_del(&page->lru);
+			__pagevec_lru_add_fn(page, lruvecs.vecs[i]);
+		}
+		spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-24 20:24   ` Hugh Dickins
  2020-08-25  1:56     ` Daniel Jordan
@ 2020-08-27  7:01     ` Hugh Dickins
  2020-08-27 12:20       ` Race between freeing and waking page Matthew Wilcox
  2020-09-08 23:41       ` [PATCH v18 00/32] per memcg lru_lock: reviews Hugh Dickins
  1 sibling, 2 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-08-27  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alex Shi, Hugh Dickins, mgorman, tj, khlebnikov, daniel.m.jordan,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, cai

On Mon, 24 Aug 2020, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Andrew Morton wrote:
> > On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
> > 
> > > The new version which bases on v5.9-rc2.
> 
> Well timed and well based, thank you Alex.  Particulary helpful to me,
> to include those that already went into mmotm: it's a surer foundation
> to test on top of the -rc2 base.
> 
> > > the first 6 patches was picked into
> > > linux-mm, and add patch 25-32 that do some further post optimization.
> > 
> > 32 patches, version 18.  That's quite heroic.  I'm unsure whether I
> > should merge it up at this point - what do people think?
> 
> I'd love for it to go into mmotm - but not today.
> 
> Version 17 tested out well.  I've only just started testing version 18,
> but I'm afraid there's been a number of "improvements" in between,
> which show up as warnings (lots of VM_WARN_ON_ONCE_PAGE(!memcg) -
> I think one or more of those are already in mmotm and under discussion
> on the list, but I haven't read through yet, and I may have caught
> more cases to examine; a per-cpu warning from munlock_vma_page();

Alex already posted the fix for that one.

> something else flitted by at reboot time before I could read it).

That one still eludes me, but I'm not giving it high priority.

> No crashes so far, but I haven't got very far with it yet.
> 
> I'll report back later in the week.

Just a quick report for now: I have some fixes, not to Alex's patchset
itself, but to things it revealed - a couple of which I knew of already,
but better now be fixed.  Once I've fleshed those out with comments and
sent them in, I'll get down to review.

Testing held up very well, no other problems seen in the patchset,
and the 1/27 discovered something useful.

I was going to say, no crashes observed at all, but one did crash
this afternoon.  But like before, I think it's something unrelated
to Alex's work, just revealed now that I hammer harder on compaction
(knowing that to be the hardest test for per-memcg lru_lock).

It was a crash from checking PageWaiters on a Tail in wake_up_page(),
called from end_page_writeback(), from ext4_finish_bio(): yet the
page a tail of a shmem huge page.  Linus's wake_up_page_bit() changes?
No, I don't think so.  It seems to me that once end_page_writeback()
has done its test_clear_page_writeback(), it has no further hold on
the struct page, which could be reused as part of a compound page
by the time of wake_up_page()'s PageWaiters check.  But I probably
need to muse on that for longer.

(I'm also kind-of-worried because Alex's patchset should make no
functional difference, yet appears to fix some undebugged ZONE_DMA=y
slow leak of memory that's been plaguing my testing for months.
I mention that in case those vague words are enough to prompt an
idea from someone, but cannot afford to spend much time on it.)

Hugh


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Race between freeing and waking page
  2020-08-27  7:01     ` Hugh Dickins
@ 2020-08-27 12:20       ` Matthew Wilcox
  2020-09-08 23:41       ` [PATCH v18 00/32] per memcg lru_lock: reviews Hugh Dickins
  1 sibling, 0 replies; 102+ messages in thread
From: Matthew Wilcox @ 2020-08-27 12:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, kirill, Nicholas Piggin

On Thu, Aug 27, 2020 at 12:01:00AM -0700, Hugh Dickins wrote:
> It was a crash from checking PageWaiters on a Tail in wake_up_page(),
> called from end_page_writeback(), from ext4_finish_bio(): yet the
> page a tail of a shmem huge page.  Linus's wake_up_page_bit() changes?
> No, I don't think so.  It seems to me that once end_page_writeback()
> has done its test_clear_page_writeback(), it has no further hold on
> the struct page, which could be reused as part of a compound page
> by the time of wake_up_page()'s PageWaiters check.  But I probably
> need to muse on that for longer.

I think you're right.  Example:

truncate_inode_pages_range()
pagevec_lookup_entries()
lock_page()

--- ctx switch ---

ext4_finish_bio()
end_page_writeback()
test_clear_page_writeback()

--- ctx switch ---

wait_on_page_writeback() <- noop
truncate_inode_page()
unlock_page()
pagevec_release()

... page can now be allocated

--- ctx switch ---

wake_up_page()
PageWaiters then has that check for PageTail.

This isn't unique to ext4; the iomap completion path behaves the exact
same way.  The thing is, this is a harmless race.  It seems unnecessary
for anybody here to incur the overhead of adding a page ref to be sure
the page isn't reallocated.  We don't want to wake up the waiters before
clearing the bit in question.

I'm tempted to suggest this:

 static void wake_up_page(struct page *page, int bit)
 {
-       if (!PageWaiters(page))
+       if (PageTail(page) || !PageWaiters(page))
                return;
        wake_up_page_bit(page, bit);

which only adds an extra read to the struct page that we were going to
access anyway.  Even that seems unnecessary though; PageWaiters is
going to be clear.  Maybe we can just change the PF policy from
PF_ONLY_HEAD to PF_ANY.  I don't think it's critical that we have this
check.
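
(For reference, assuming PageWaiters is still declared with the
PF_ONLY_HEAD policy in include/linux/page-flags.h, that policy switch
would be roughly:

-PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)
+PAGEFLAG(Waiters, waiters, PF_ANY) __CLEARPAGEFLAG(Waiters, waiters, PF_ANY)

so the tail-page assertion built into the PF_ONLY_HEAD policy - presumably
what tripped here - simply goes away.)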

Nick?


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-26  8:59           ` Alex Shi
@ 2020-08-28  1:40             ` Daniel Jordan
  2020-08-28  5:22               ` Alex Shi
  2020-09-09  2:44               ` Aaron Lu
  0 siblings, 2 replies; 102+ messages in thread
From: Daniel Jordan @ 2020-08-28  1:40 UTC (permalink / raw)
  To: Alex Shi
  Cc: Daniel Jordan, Hugh Dickins, Andrew Morton, mgorman, tj,
	khlebnikov, willy, hannes, lkp, linux-mm, linux-kernel, cgroups,
	shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301

On Wed, Aug 26, 2020 at 04:59:28PM +0800, Alex Shi wrote:
> I clean up my testing and make it reproducable by a Dockerfile and a case patch which
> attached. 

Ok, I'll give that a shot once I've taken care of sysbench.

> >>> Even better would be a description of the problem you're having in production
> >>> with lru_lock.  We might be able to create at least a simulation of it to show
> >>> what the expected improvement of your real workload is.
> >>
> >> we are using thousands memcgs in a machine, but as a simulation, I guess above case
> >> could be helpful to show the problem.
> > 
> > Using thousands of memcgs to do what?  Any particulars about the type of
> > workload?  Surely it's more complicated than page cache reads :)
> 
> Yes, the workload are quit different on different business, some use cpu a
> lot, some use memory a lot, and some are may mixed.

That's pretty vague, but I don't suppose I could do much better describing what
all runs on our systems  :-/

I went back to your v1 post to see what motivated you originally, and you had
some results from aim9 but nothing about where this reared its head in the
first place.  How did you discover the bottleneck?  I'm just curious about how
lru_lock hurts in practice.

> > Neither kernel compile nor git checkout in the root cgroup changed much, just
> > 0.31% slower on elapsed time for the compile, so no significant regressions
> > there.  Now for sysbench again.

Still working on getting repeatable sysbench runs, no luck so far.  The numbers
have stayed fairly consistent with your series but vary a lot on the base
kernel, not sure why yet.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-28  1:40             ` Daniel Jordan
@ 2020-08-28  5:22               ` Alex Shi
  2020-09-09  2:44               ` Aaron Lu
  1 sibling, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-28  5:22 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Hugh Dickins, Andrew Morton, mgorman, tj, khlebnikov, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301


On 2020/8/28 9:40 AM, Daniel Jordan wrote:
> I went back to your v1 post to see what motivated you originally, and you had
> some results from aim9 but nothing about where this reared its head in the
> first place.  How did you discover the bottleneck?  I'm just curious about how
> lru_lock hurts in practice.

We have seen very high 'sys' time on some business machines, and found much of the
time spent on the lru_lock and/or zone lock. It seems per memcg lru_lock could help
with this, but we still have no idea about the zone lock.

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-08-27  7:01     ` Hugh Dickins
  2020-08-27 12:20       ` Race between freeing and waking page Matthew Wilcox
@ 2020-09-08 23:41       ` Hugh Dickins
  2020-09-09  2:24         ` Wei Yang
                           ` (2 more replies)
  1 sibling, 3 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-09-08 23:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, mgorman, tj, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, vbabka, minchan,
	cai, hughd

Miscellaneous Acks and NAKs and other comments on the beginning and
the end of the series, but not much yet on the all-important middle.
I'm hoping to be spared sending ~20 email replies to ~20 patches.

[PATCH v18 01/32] mm/memcg: warning on !memcg after readahead page charged
Acked-by: Hugh Dickins <hughd@google.com>
if you make these changes:

Please add "Add VM_WARN_ON_ONCE_PAGE() macro." or something like that to
the commit message: that's a good addition that we shall find useful in
other places, so please advertise it.

Delete the four comment lines
/* Readahead page is charged too, to see if other page uncharged */
which make no sense on their own.

[PATCH v18 02/32] mm/memcg: bail out early from swap accounting when memcg is disabled
Acked-by: Hugh Dickins <hughd@google.com>

[PATCH v18 03/32] mm/thp: move lru_add_page_tail func to huge_memory.c
Acked-by: Hugh Dickins <hughd@google.com>

[PATCH v18 04/32] mm/thp: clean up lru_add_page_tail
Acked-by: Hugh Dickins <hughd@google.com>

Though I'd prefer "mm/thp: use head for head page in lru_add_page_tail"
to the unnecessarily vague "clean up".  But you're right to keep this
renaming separate from the code movement in the previous commit, and
perhaps right to keep it from the more interesting cleanup next.

[PATCH v18 05/32] mm/thp: remove code path which never got into
This is a good simplification, but I see no sign that you understand
why it's valid: it relies on lru_add_page_tail() being called while
head refcount is frozen to 0: we would not get this far if someone
else holds a reference to the THP - which they must hold if they have
isolated the page from its lru (and that's true before or after your
per-memcg changes - but even truer after those changes, since PageLRU
can then be flipped without lru_lock at any instant): please explain
something of this in the commit message.

You revisit this same code in 18/32, and I much prefer the way it looks
after that (if (list) {} else {}) - this 05/32 is a bit weird, it would
be easier to understand if it just did VM_WARN_ON(1).  Please pull the
18/32 mods back into this one, maybe adding a VM_WARN_ON(PageLRU) into
the "if (list)" block too.

[PATCH v18 18/32] mm/thp: add tail pages into lru anyway in split_huge_page()
Please merge into 05/32. But what do "Split_huge_page() must start with
PageLRU(head)" and "Split start from PageLRU(head)" mean? Perhaps you mean
that if list is NULL, then if the head was not on the LRU, then it cannot
have got through page_ref_freeze(), because isolator would hold page ref?
That is subtle, and deserves mention in the commit comment, but is not
what you have said at all.  s/unexpected/unexpectedly/.

[PATCH v18 06/32] mm/thp: narrow lru locking
Why? What part does this play in the series? "narrow lru locking" can
also be described as "widen page cache locking": you are changing the
lock ordering, and not giving any reason to do so. This may be an
excellent change, or it may be a terrible change: I find that usually
lock ordering is forced upon us, and it's rare to meet an instance like
this that could go either way, and I don't know myself how to judge it.

I do want this commit to go in, partly because it has been present in
all the testing we have done, and partly because I *can at last* see a
logical advantage to it - it also nests lru_lock inside memcg->move_lock,
allowing lock_page_memcg() to be used to stabilize page->mem_cgroup when
getting per-memcg lru_lock - though only in one place, starting in v17,
do you actually use that (and, warning: it's not used correctly there).

I'm not very bothered by how the local_irq_disable() looks to RT: THP
seems a very bad idea in an RT kernel.  Earlier I asked you to run this
past Kirill and Matthew and Johannes: you did so, thank you, and Kirill
has blessed it, and no one has nacked it, and I have not noticed any
disadvantage from this change in lock ordering (documented in 23/32),
so I'm now going to say

Acked-by: Hugh Dickins <hughd@google.com>

But I wish you could give some reason for it in the commit message!

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Is that correct? Or Wei Yang suggested some part of it perhaps?

[PATCH v18 07/32] mm/swap.c: stop deactivate_file_page if page not on lru
Perhaps; or perhaps by the time the pagevec is full, the page has been
drained to the lru, and it should be deactivated? I'm indifferent.
Is this important for per-memcg lru_lock?

[PATCH v18 08/32] mm/vmscan: remove unnecessary lruvec adding
You are optimizing for a case which you then mark unlikely(), and I
don't agree that it makes the flow clearer; but you've added a useful
comment on the race there, so please s/intergrity/integrity/ in commit
message and in code comment, then
Acked-by: Hugh Dickins <hughd@google.com>

[PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting
I strongly approve of removing the abuse of lru_lock here, but the
patch is wrong: you are mistaken in thinking the PageLRU check after
get_page_unless_zero() is an unnecessary duplication of the one before.
No, the one before is an optimization, and the one after is essential,
for telling whether this page (arrived at via pfn, like in compaction)
is the kind of page we understand (address_space or anon_vma or KSM
stable_node pointer in page->mapping), so can use rmap_walk() on.

Please replace this patch by mine from the tarball I posted a year ago,
which keeps both checks, and justifies it against why the lru_lock was
put there in the first place - thanks to Vladimir for pointing me to
that mail thread when I tried to submit this patch a few years ago.
Appended at the end of this mail.
       
[PATCH v18 10/32] mm/compaction: rename compact_deferred as compact_should_defer
I'm indifferent: I see your point about the name, but it hasn't caused
confusion in ten years, whereas changing name and tracepoint might cause
confusion.  And how does changing the name help per-memcg lru_lock?  It
just seems to be a random patch from your private tree.  If it's Acked
by Mel who coined the name, or someone who has done a lot of work there
(Vlastimil? Joonsoo?), fine, I have no problem with it; but I don't
see what it's doing in this series - better left out.

[PATCH v18 11/32] mm/memcg: add debug checking in lock_page_memcg
This is a very useful change for helping lockdep:
Acked-by: Hugh Dickins <hughd@google.com>

[PATCH v18 12/32] mm/memcg: optimize mem_cgroup_page_lruvec
Hah, I see this is in my name.  Well, I did once suggest folding this
into one of your patches, but it's not an optimization, and that was
before you added VM_WARN_ON_ONCE_PAGE() here.  It looks strange now,
a VM_BUG_ON_PAGE() next to a VM_WARN_ON_ONCE_PAGE(); and the latter
will catch that PageTail case anyway (once).  And although I feel
slightly safer with READ_ONCE(page->mem_cgroup), I'm finding it hard
to justify, doing so here but not in other places: particularly since
just above it says "This function relies on page->mem_cgroup being
stable".  Let's just drop this patch.

[PATCH v18 13/32] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
Yes, nice cleanup, I don't see why it should be different and force an
unused arg on the others.  But I have one reservation: you added comment
+ *
+ * pagevec_move_tail_fn() must be called with IRQ disabled.
+ * Otherwise this may cause nasty races.
above rotate_reclaimable_page(), having deleted pagevec_move_tail() which
had such a comment. It doesn't make sense, because pagevec_move_tail_fn()
is called with IRQ disabled anyway. That comment had better say
+ *
+ * rotate_reclaimable_page() must disable IRQs, to prevent nasty races.
I dimly remember hitting those nasty races many years ago, but forget
the details. Oh, one other thing, you like to use "func" as abbreviation
for "function", okay: but then at the end of the commit message you say
"no func change" - please change that to "No functional change".
Acked-by: Hugh Dickins <hughd@google.com>

[PATCH v18 14/32] mm/lru: move lru_lock holding in func lru_note_cost_page
"w/o functional changes" instead of "w/o function changes".  But please
just merge this into the next, 15/32: there is no point in separating them.

[PATCH v18 15/32] mm/lru: move lock into lru_note_cost
[PATCH v18 16/32] mm/lru: introduce TestClearPageLRU
[PATCH v18 17/32] mm/compaction: do page isolation first in compaction
[PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
[PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock
[PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function
[PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru
[PATCH v18 23/32] mm/lru: revise the comments of lru_lock
[PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock
[PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
[PATCH v18 26/32] mm/mlock: remove __munlock_isolate_lru_page

I have tested, but not yet studied these, and it's a good point to break
off and send my comments so far, because 15/32 is where the cleanups end
and per-memcg lru_lock kind-of begins - lru_note_cost() being potentially
more costly, because it needs to use a different lock at each level.
(When I tried rebasing my own series a couple of months ago, I stopped
here at lru_note_cost() too, wondering if there was a better way.)

Two things I do know about from testing, that need to be corrected:

check_move_unevictable_pages() needs protection from page->memcg
being changed while doing the relock_page_lruvec_irq(): could use
TestClearPageLRU there (!PageLRU pages are safely skipped), but
that doubles the number of atomic ops involved. I intended to use
lock_page_memcg() instead, but that's harder than you'd expect: so
probably TestClearPageLRU will be the best to use there for now.
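
Roughly what I have in mind there (a sketch only, using this series'
TestClearPageLRU, relock_page_lruvec_irq() and unlock_page_lruvec_irq();
the pgscanned/pgrescued accounting is elided):

	void check_move_unevictable_pages(struct pagevec *pvec)
	{
		struct lruvec *lruvec = NULL;
		int i;

		for (i = 0; i < pagevec_count(pvec); i++) {
			struct page *page = pvec->pages[i];

			if (PageTransTail(page))
				continue;

			/* Clearing PageLRU blocks memcg change of the page */
			if (!TestClearPageLRU(page))
				continue;

			lruvec = relock_page_lruvec_irq(page, lruvec);
			if (page_evictable(page) && PageUnevictable(page)) {
				del_page_from_lru_list(page, lruvec, LRU_UNEVICTABLE);
				ClearPageUnevictable(page);
				add_page_to_lru_list(page, lruvec, page_lru(page));
			}
			SetPageLRU(page);
		}
		if (lruvec)
			unlock_page_lruvec_irq(lruvec);
	}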

The use of lock_page_memcg() in __munlock_pagevec() in 20/32,
introduced in patchset v17, looks good but it isn't: I was lucky that
systemd at reboot did some munlocking that exposed the problem to lockdep.
The first time into the loop, lock_page_memcg() is done before lru_lock
(as 06/32 has allowed); but the second time around the loop, it is done
while still holding lru_lock.

lock_page_memcg() really needs to be absorbed into (a variant of)
relock_page_lruvec(), and I do have that (it's awkward because of
the different ways in which the IRQ flags are handled).  And out of
curiosity, I've also tried using that in mm/swap.c too, instead of the
TestClearPageLRU technique: lockdep is happy, but an update_lru_size()
warning showed that it cannot safely be mixed with the TestClearPageLRU
technique (that I'd left in isolate_lru_page()).  So I'll stash away
that relock_page_lruvec(), and consider what's best for mm/mlock.c:
now that I've posted these comments so far, that's my priority, then
to get the result under testing again, before resuming these comments.

Jumping over 15-26, and resuming comments on recent additions:

[PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock
Could we please drop this one for the moment? And come back to it later
when the basic series is safely in.  It's a good idea to try sorting
together those pages which come under the same lock (though my guess is
that they naturally gather themselves together quite well already); but
I'm not happy adding 360 bytes to the kernel stack here (and that in
addition to 192 bytes of horrid pseudo-vma in the shmem swapin case),
though that could be avoided by making it per-cpu. But I hope there's
a simpler way of doing it, as efficient, but also useful for the other
pagevec operations here: perhaps scanning the pagevec for same page->
mem_cgroup (and flags node bits), NULLing entries as they are done.
Another, easily fixed, minor defect in this patch: if I'm reading it
right, it reverses the order in which the pages are put on the lru?
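
To illustrate the scanning alternative suggested above (just a sketch, with
several assumptions: a two-argument per-page helper like
__pagevec_lru_add_fn(page, lruvec), this series'
lock_page_lruvec_irqsave()/unlock_page_lruvec_irqrestore(), and the usual
release of the pagevec's page references omitted - that step would need to
cope with the NULLed slots):

	static void __pagevec_lru_add_grouped(struct pagevec *pvec)
	{
		int i, j;

		for (i = 0; i < pagevec_count(pvec); i++) {
			struct page *page = pvec->pages[i];
			struct lruvec *lruvec;
			unsigned long flags;

			if (!page)
				continue;	/* already added under an earlier lock */

			lruvec = lock_page_lruvec_irqsave(page, &flags);
			for (j = i; j < pagevec_count(pvec); j++) {
				struct page *p = pvec->pages[j];

				/* handle every page that belongs to this lruvec */
				if (!p || mem_cgroup_page_lruvec(p, page_pgdat(p)) != lruvec)
					continue;
				__pagevec_lru_add_fn(p, lruvec);
				pvec->pages[j] = NULL;	/* mark as done */
			}
			unlock_page_lruvec_irqrestore(lruvec, flags);
		}
		pagevec_reinit(pvec);
	}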

[PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block
Most of this consists of replacing "locked" by "lruvec", which is good:
but please fold those changes back into 20/32 (or would it be 17/32?
I've not yet looked into the relationship between those two), so we
can then see more clearly what change this 28/32 (will need renaming!)
actually makes, to use lruvec_holds_page_lru_lock(). That may be a
good change, but it's mixed up with the "locked"->"lruvec" at present,
and I think you could have just used lruvec for locked all along
(but of course there's a place where you'll need new_lruvec too).

[PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block
NAK. I agree that isolate_migratepages_block() looks nicer this way, but
take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is
where set_page_refcounted() changes page->_refcount from 0 to 1, allowing
a racing get_page_unless_zero() to succeed; then later prep_compound_page()
is where PageHead and PageTails get set. So there's a small race window in
which this patch could deliver a compound page when it should not.
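
For reference, the ordering in mm/page_alloc.c that opens that window
(abridged, from memory of 5.9, so double-check the exact code):

	static void prep_new_page(struct page *page, unsigned int order,
				  gfp_t gfp_flags, unsigned int alloc_flags)
	{
		/*
		 * post_alloc_hook() calls set_page_refcounted(): _refcount
		 * goes 0 -> 1, so a racing get_page_unless_zero() can
		 * succeed from this point on...
		 */
		post_alloc_hook(page, order, gfp_flags);

		/* ...but PageHead/PageTail are only set later, here: */
		if (order && (gfp_flags & __GFP_COMP))
			prep_compound_page(page, order);

		/* (remaining init and debug handling omitted) */
	}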

[PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip
I haven't looked at this yet (but recall that per-memcg lru_lock can
change the point at which compaction should skip a contended lock: IIRC
the current kernel needs nothing extra, whereas some earlier kernels did
need extra; but when I look at 30/32, may find these remarks irrelevant).

[PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
The title of this patch is definitely wrong: there was an explicit page
decrement there before (put_page), now it's wrapping it up inside a
WARN_ON().  We usually prefer to avoid doing functional operations
inside WARN/BUGs, but I think I'll overlook that - anyone else worried?
The comment is certainly better than what was there before: yes, this
warning reflects the difficulty we have in thinking about the
TestClearPageLRU protocol: which I'm still not sold on, but
agree we should proceed with.  With a change in title, perhaps
"mm: add warning where TestClearPageLRU failed on freeable page"?
Acked-by: Hugh Dickins <hughd@google.com>

[PATCH v18 32/32] mm: Split release_pages work into 3 passes
I haven't looked at this yet (but seen no problem with it in testing).

And finally, here's my replacement (rediffed against 5.9-rc) for 
[PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting

From: Hugh Dickins <hughd@google.com>
Date: Mon, 13 Jun 2016 19:43:34 -0700
Subject: [PATCH] mm: page_idle_get_page() does not need lru_lock

It is necessary for page_idle_get_page() to recheck PageLRU() after
get_page_unless_zero(), but holding lru_lock around that serves no
useful purpose, and adds to lru_lock contention: delete it.

See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
discussion that led to lru_lock there; but __page_set_anon_rmap() now uses
WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs() using
rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but not
entirely prevented by page_count() check in ksm.c's write_protect_page():
that risk being shared with page_referenced() and not helped by lru_lock).

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
---
 mm/page_idle.c | 4 ----
 1 file changed, 4 deletions(-)

--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,19 +32,15 @@
 static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page = pfn_to_online_page(pfn);
-	pg_data_t *pgdat;
 
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
 
-	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
 	return page;
 }
 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
  2020-08-24 12:55 ` [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages Alex Shi
@ 2020-09-09  1:01   ` Matthew Wilcox
  2020-09-09 15:43     ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Matthew Wilcox @ 2020-09-09  1:01 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Alexander Duyck

On Mon, Aug 24, 2020 at 08:55:04PM +0800, Alex Shi wrote:
> +++ b/mm/vmscan.c
> @@ -1688,10 +1688,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  
>  			if (!TestClearPageLRU(page)) {
>  				/*
> -				 * This page may in other isolation path,
> -				 * but we still hold lru_lock.
> +				 * This page is being isolated in another
> +				 * thread, but we still hold lru_lock. The
> +				 * other thread must be holding a reference
> +				 * to the page so this should never hit a
> +				 * reference count of 0.
>  				 */
> -				put_page(page);
> +				WARN_ON(put_page_testzero(page));
>  				goto busy;

I read Hugh's review and that led me to take a look at this.  We don't
do it like this.  Use the same pattern as elsewhere in mm:

        page_ref_sub(page, nr);
        VM_BUG_ON_PAGE(page_count(page) <= 0, page);



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-08 23:41       ` [PATCH v18 00/32] per memcg lru_lock: reviews Hugh Dickins
@ 2020-09-09  2:24         ` Wei Yang
  2020-09-09 15:08         ` Alex Shi
  2020-09-09 16:11         ` Alexander Duyck
  2 siblings, 0 replies; 102+ messages in thread
From: Wei Yang @ 2020-09-09  2:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alex Shi, Andrew Morton, mgorman, tj, khlebnikov,
	daniel.m.jordan, willy, hannes, lkp, linux-mm, linux-kernel,
	cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301,
	vbabka, minchan, cai

On Tue, Sep 08, 2020 at 04:41:00PM -0700, Hugh Dickins wrote:
[...]
>[PATCH v18 06/32] mm/thp: narrow lru locking
>Why? What part does this play in the series? "narrow lru locking" can
>also be described as "widen page cache locking": you are changing the
>lock ordering, and not giving any reason to do so. This may be an
>excellent change, or it may be a terrible change: I find that usually
>lock ordering is forced upon us, and it's rare to meet an instance like
>this that could go either way, and I don't know myself how to judge it.
>
>I do want this commit to go in, partly because it has been present in
>all the testing we have done, and partly because I *can at last* see a
>logical advantage to it - it also nests lru_lock inside memcg->move_lock,
>allowing lock_page_memcg() to be used to stabilize page->mem_cgroup when
>getting per-memcg lru_lock - though only in one place, starting in v17,
>do you actually use that (and, warning: it's not used correctly there).
>
>I'm not very bothered by how the local_irq_disable() looks to RT: THP
>seems a very bad idea in an RT kernel.  Earlier I asked you to run this
>past Kirill and Matthew and Johannes: you did so, thank you, and Kirill
>has blessed it, and no one has nacked it, and I have not noticed any
>disadvantage from this change in lock ordering (documented in 23/32),
>so I'm now going to say
>
>Acked-by: Hugh Dickins <hughd@google.com>
>
>But I wish you could give some reason for it in the commit message!
>
>Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>Is that correct? Or Wei Yang suggested some part of it perhaps?
>

If my memory is correct, we had some offline discussion about this change.

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-08-28  1:40             ` Daniel Jordan
  2020-08-28  5:22               ` Alex Shi
@ 2020-09-09  2:44               ` Aaron Lu
  2020-09-09 11:40                 ` Michal Hocko
  1 sibling, 1 reply; 102+ messages in thread
From: Aaron Lu @ 2020-09-09  2:44 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Alex Shi, Hugh Dickins, Andrew Morton, mgorman, tj, khlebnikov,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Thu, Aug 27, 2020 at 09:40:22PM -0400, Daniel Jordan wrote:
> I went back to your v1 post to see what motivated you originally, and you had
> some results from aim9 but nothing about where this reared its head in the
> first place.  How did you discover the bottleneck?  I'm just curious about how
> lru_lock hurts in practice.

I think making lru_lock per-memcg helps in colocated environments: some
workloads are of high priority while others are of low priority.

For the low priority workloads, we may even want to use some swap to save
memory, and that can cause frequent alloc/reclaim, depending on their working
set etc.; those alloc/reclaim paths need to hold the global lru lock and zone
lock. Then, when the high priority workloads take page faults, their
performance can be adversely affected, and that is not acceptable since these
high priority workloads normally have strict SLA requirements.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock
  2020-09-09  2:44               ` Aaron Lu
@ 2020-09-09 11:40                 ` Michal Hocko
  0 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2020-09-09 11:40 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Daniel Jordan, Alex Shi, Hugh Dickins, Andrew Morton, mgorman,
	tj, khlebnikov, willy, hannes, lkp, linux-mm, linux-kernel,
	cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, vdavydov.dev, shy828301

On Wed 09-09-20 10:44:32, Aaron Lu wrote:
> On Thu, Aug 27, 2020 at 09:40:22PM -0400, Daniel Jordan wrote:
> > I went back to your v1 post to see what motivated you originally, and you had
> > some results from aim9 but nothing about where this reared its head in the
> > first place.  How did you discover the bottleneck?  I'm just curious about how
> > lru_lock hurts in practice.
> 
> I think making lru_lock per-memcg helps in colocated environments: some
> workloads are of high priority while others are of low priority.
> 
> For the low priority workloads, we may even want to use some swap to save
> memory, and that can cause frequent alloc/reclaim, depending on their working
> set etc.; those alloc/reclaim paths need to hold the global lru lock and zone
> lock. Then, when the high priority workloads take page faults, their
> performance can be adversely affected, and that is not acceptable since these
> high priority workloads normally have strict SLA requirements.

While this all sounds reasonable, we are lacking _any_ numbers to
actually make it a solid argument rather than hand waving.
Having something solid is absolutely necessary for a big change like
this.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-08 23:41       ` [PATCH v18 00/32] per memcg lru_lock: reviews Hugh Dickins
  2020-09-09  2:24         ` Wei Yang
@ 2020-09-09 15:08         ` Alex Shi
  2020-09-09 23:16           ` Hugh Dickins
  2020-09-12  8:38           ` Hugh Dickins
  2020-09-09 16:11         ` Alexander Duyck
  2 siblings, 2 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-09 15:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, mgorman, tj, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, vbabka, minchan,
	cai

Hi Hugh,

Thanks a lot for such a rich review and all the comments!

On 2020/9/9 7:41 AM, Hugh Dickins wrote:
> Miscellaneous Acks and NAKs and other comments on the beginning and
> the end of the series, but not much yet on the all-important middle.
> I'm hoping to be spared sending ~20 email replies to ~20 patches.
> 
> [PATCH v18 01/32] mm/memcg: warning on !memcg after readahead page charged
> Acked-by: Hugh Dickins <hughd@google.com>
> if you make these changes:
> 
> Please add "Add VM_WARN_ON_ONCE_PAGE() macro." or something like that to
> the commit message: that's a good addition that we shall find useful in
> other places, so please advertise it.

Accepted!

> 
> Delete the four comment lines
> /* Readahead page is charged too, to see if other page uncharged */
> which make no sense on their own.
> 

Accepted!
> [PATCH v18 02/32] mm/memcg: bail out early from swap accounting when memcg is disabled
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> [PATCH v18 03/32] mm/thp: move lru_add_page_tail func to huge_memory.c
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> [PATCH v18 04/32] mm/thp: clean up lru_add_page_tail
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> Though I'd prefer "mm/thp: use head for head page in lru_add_page_tail"
> to the unnecessarily vague "clean up".  But you're right to keep this
> renaming separate from the code movement in the previous commit, and
> perhaps right to keep it from the more interesting cleanup next.
> 
> [PATCH v18 05/32] mm/thp: remove code path which never got into
> This is a good simplification, but I see no sign that you understand
> why it's valid: it relies on lru_add_page_tail() being called while
> head refcount is frozen to 0: we would not get this far if someone
> else holds a reference to the THP - which they must hold if they have
> isolated the page from its lru (and that's true before or after your
> per-memcg changes - but even truer after those changes, since PageLRU
> can then be flipped without lru_lock at any instant): please explain
> something of this in the commit message.

Is the following commit log better?

    split_huge_page() is never called on a page which isn't on the lru list,
    so this code path never got a chance to run, and should not run: it would
    add tail pages to an lru list that the head page isn't on.

    As Hugh Dickins mentioned:
    This path should never be reached, since lru_add_page_tail() is called
    while the head refcount is frozen to 0: we would not get this far if
    someone else held a reference to the THP - which they must hold if they
    have isolated the page from its lru.

    Although the bug was never triggered, it is better removed for code
    correctness, and a warning added for the unexpected case.

> 
> You revisit this same code in 18/32, and I much prefer the way it looks
> after that (if (list) {} else {}) - this 05/32 is a bit weird, it would
> be easier to understand if it just did VM_WARN_ON(1).  Please pull the
> 18/32 mods back into this one, maybe adding a VM_WARN_ON(PageLRU) into
> the "if (list)" block too.

Accepted.
> 
> [PATCH v18 18/32] mm/thp: add tail pages into lru anyway in split_huge_page()
> Please merge into 05/32.
> But what do "Split_huge_page() must start with
> PageLRU(head)" and "Split start from PageLRU(head)" mean? Perhaps you mean
> that if list is NULL, then if the head was not on the LRU, then it cannot
> have got through page_ref_freeze(), because isolator would hold page ref?

No, what I mean is that split_huge_page() can only get here with PageLRU(head)
set. Would you like to suggest wording to replace the old sentence?


> That is subtle, and deserves mention in the commit comment, but is not
> what you have said at all.  s/unexpected/unexpectedly/.

Thanks!
> 
> [PATCH v18 06/32] mm/thp: narrow lru locking
> Why? What part does this play in the series? "narrow lru locking" can
> also be described as "widen page cache locking": 

Uh, the page cache locking isn't widened; it's still taken in the same place.

> you are changing the
> lock ordering, and not giving any reason to do so. This may be an
> excellent change, or it may be a terrible change: I find that usually
> lock ordering is forced upon us, and it's rare to meet an instance like
> this that could go either way, and I don't know myself how to judge it.
> 
> I do want this commit to go in, partly because it has been present in
> all the testing we have done, and partly because I *can at last* see a
> logical advantage to it - it also nests lru_lock inside memcg->move_lock,

I must have overlooked something about the lock nesting. Would you explain it
for me? Thanks!

> allowing lock_page_memcg() to be used to stabilize page->mem_cgroup when
> getting per-memcg lru_lock - though only in one place, starting in v17,
> do you actually use that (and, warning: it's not used correctly there).
> 
> I'm not very bothered by how the local_irq_disable() looks to RT: THP
> seems a very bad idea in an RT kernel.  Earlier I asked you to run this
> past Kirill and Matthew and Johannes: you did so, thank you, and Kirill
> has blessed it, and no one has nacked it, and I have not noticed any
> disadvantage from this change in lock ordering (documented in 23/32),
> so I'm now going to say
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> But I wish you could give some reason for it in the commit message!

That's a head-scratcher for me. Would you tell me what detailed info should
go there? Thanks!
 
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> Is that correct? Or Wei Yang suggested some part of it perhaps?

Yes, we talked a lot to confirm the locking change is safe.

> 
> [PATCH v18 07/32] mm/swap.c: stop deactivate_file_page if page not on lru
> Perhaps; or perhaps by the time the pagevec is full, the page has been
> drained to the lru, and it should be deactivated? I'm indifferent.
> Is this important for per-memcg lru_lock?

It's not much related to the theme of the series, so I'm fine with removing it.

> 
> [PATCH v18 08/32] mm/vmscan: remove unnecessary lruvec adding
> You are optimizing for a case which you then mark unlikely(), and I
> don't agree that it makes the flow clearer; but you've added a useful
> comment on the race there, so please s/intergrity/integrity/ in commit

Thanks for fixing that.
> message and in code comment, then
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> [PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting
> I strongly approve of removing the abuse of lru_lock here, but the
> patch is wrong: you are mistaken in thinking the PageLRU check after
> get_page_unless_zero() is an unnecessary duplication of the one before.
> No, the one before is an optimization, and the one after is essential,
> for telling whether this page (arrived at via pfn, like in compaction)
> is the kind of page we understand (address_space or anon_vma or KSM
> stable_node pointer in page->mapping), so can use rmap_walk() on.
> 
> Please replace this patch by mine from the tarball I posted a year ago,
> which keeps both checks, and justifies it against why the lru_lock was
> put there in the first place - thanks to Vladimir for pointing me to
> that mail thread when I tried to submit this patch a few years ago.
> Appended at the end of this mail.

You are right, thanks!
>        
> [PATCH v18 10/32] mm/compaction: rename compact_deferred as compact_should_defer
> I'm indifferent: I see your point about the name, but it hasn't caused
> confusion in ten years, whereas changing name and tracepoint might cause
> confusion.  And how does changing the name help per-memcg lru_lock?  It
> just seems to be a random patch from your private tree.  If it's Acked
> by Mel who coined the name, or someone who has done a lot of work there
> (Vlastimil? Joonsoo?), fine, I have no problem with it; but I don't
> see what it's doing in this series - better left out.

I will drop this patch.
> 
> [PATCH v18 11/32] mm/memcg: add debug checking in lock_page_memcg
> This is a very useful change for helping lockdep:
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> [PATCH v18 12/32] mm/memcg: optimize mem_cgroup_page_lruvec
> Hah, I see this is in my name.  Well, I did once suggest folding this
> into one of your patches, but it's not an optimization, and that was
> before you added VM_WARN_ON_ONCE_PAGE() here.  It looks strange now,
> a VM_BUG_ON_PAGE() next to a VM_WARN_ON_ONCE_PAGE(); and the latter
> will catch that PageTail case anyway (once).  And although I feel
> slightly safer with READ_ONCE(page->mem_cgroup), I'm finding it hard
> to justify, doing so here but not in other places: particularly since
> just above it says "This function relies on page->mem_cgroup being
> stable".  Let's just drop this patch.

Accepted. Thanks!
> 
> [PATCH v18 13/32] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
> Yes, nice cleanup, I don't see why it should be different and force an
> unused arg on the others.  But I have one reservation: you added comment
> + *
> + * pagevec_move_tail_fn() must be called with IRQ disabled.
> + * Otherwise this may cause nasty races.
> above rotate_reclaimable_page(), having deleted pagevec_move_tail() which
> had such a comment. It doesn't make sense, because pagevec_move_tail_fn()
> is called with IRQ disabled anyway. That comment had better say
> + *
> + * rotate_reclaimable_page() must disable IRQs, to prevent nasty races.
> I dimly remember hitting those nasty races many years ago, but forget
> the details. Oh, one other thing, you like to use "func" as abbreviation
> for "function", okay: but then at the end of the commit message you say
> "no func change" - please change that to "No functional change".
> Acked-by: Hugh Dickins <hughd@google.com>
> 
Accepted. Thanks!

> [PATCH v18 14/32] mm/lru: move lru_lock holding in func lru_note_cost_page
> "w/o functional changes" instead of "w/o function changes".  But please
> just merge this into the next, 15/32: there is no point in separating them.
> 
> [PATCH v18 15/32] mm/lru: move lock into lru_note_cost
> [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU
> [PATCH v18 17/32] mm/compaction: do page isolation first in compaction
> [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
> [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock
> [PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function
> [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru
> [PATCH v18 23/32] mm/lru: revise the comments of lru_lock
> [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock
> [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
> [PATCH v18 26/32] mm/mlock: remove __munlock_isolate_lru_page
> 
> I have tested, but not yet studied these, and it's a good point to break
> off and send my comments so far, because 15/32 is where the cleanups end
> and per-memcg lru_lock kind-of begins - lru_note_cost() being potentially
> more costly, because it needs to use a different lock at each level.
> (When I tried rebasing my own series a couple of months ago, I stopped
> here at lru_note_cost() too, wondering if there was a better way.)
> 
> Two things I do know about from testing, that need to be corrected:
> 
> check_move_unevictable_pages() needs protection from page->memcg
> being changed while doing the relock_page_lruvec_irq(): could use
> TestClearPageLRU there (!PageLRU pages are safely skipped), but
> that doubles the number of atomic ops involved. I intended to use
> lock_page_memcg() instead, but that's harder than you'd expect: so
> probably TestClearPageLRU will be the best to use there for now.

Accepted. Thanks!

> 
> The use of lock_page_memcg() in __munlock_pagevec() in 20/32,
> introduced in patchset v17, looks good but it isn't: I was lucky that
> systemd at reboot did some munlocking that exposed the problem to lockdep.
> The first time into the loop, lock_page_memcg() is done before lru_lock
> (as 06/32 has allowed); but the second time around the loop, it is done
> while still holding lru_lock.

I don't know the details of what lockdep showed. Just wondering: would it be
possible to make the move_lock/lru_lock ordering consistent, or to try the
other blocking approach mentioned in commit_charge()?

> 
> lock_page_memcg() really needs to be absorbed into (a variant of)
> relock_page_lruvec(), and I do have that (it's awkward because of
> the different ways in which the IRQ flags are handled).  And out of
> curiosity, I've also tried using that in mm/swap.c too, instead of the
> TestClearPageLRU technique: lockdep is happy, but an update_lru_size()
> warning showed that it cannot safely be mixed with the TestClearPageLRU
> technique (that I'd left in isolate_lru_page()).  So I'll stash away
> that relock_page_lruvec(), and consider what's best for mm/mlock.c:
> now that I've posted these comments so far, that's my priority, then
> to get the result under testing again, before resuming these comments.

I have no idea what your solution looks like yet, but I'm looking forward to your good news! :)

> 
> Jumping over 15-26, and resuming comments on recent additions:
> 
> [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock
> Could we please drop this one for the moment? And come back to it later
> when the basic series is safely in.  It's a good idea to try sorting
> together those pages which come under the same lock (though my guess is
> that they naturally gather themselves together quite well already); but
> I'm not happy adding 360 bytes to the kernel stack here (and that in
> addition to 192 bytes of horrid pseudo-vma in the shmem swapin case),
> though that could be avoided by making it per-cpu. But I hope there's
> a simpler way of doing it, as efficient, but also useful for the other
> pagevec operations here: perhaps scanning the pagevec for same page->
> mem_cgroup (and flags node bits), NULLing entries as they are done.
> Another, easily fixed, minor defect in this patch: if I'm reading it
> right, it reverses the order in which the pages are put on the lru?

This patch gives about a 10+% performance gain in my multiple-memcg readtwice
testing; the fairness of the locking costs a lot of performance.

I also tried a per-cpu solution, but it caused a lot of trouble with the
per-cpu machinery and showed no benefit apart from saving a bit of stack, so
if the stack size is still acceptable, maybe we could keep this approach and
improve it further: factor it into a function, fix the reversed order, etc.
> 
> [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block
> Most of this consists of replacing "locked" by "lruvec", which is good:
> but please fold those changes back into 20/32 (or would it be 17/32?
> I've not yet looked into the relationship between those two), so we
> can then see more clearly what change this 28/32 (will need renaming!)
> actually makes, to use lruvec_holds_page_lru_lock(). That may be a
> good change, but it's mixed up with the "locked"->"lruvec" at present,
> and I think you could have just used lruvec for locked all along
> (but of course there's a place where you'll need new_lruvec too).

Uh, let me rethink this. Anyway, the patch is logically different from
patch 20 since it needs a new function, lruvec_holds_page_lru_lock().

> 
> [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block
> NAK. I agree that isolate_migratepages_block() looks nicer this way, but
> take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is
> where set_page_refcounted() changes page->_refcount from 0 to 1, allowing
> a racing get_page_unless_zero() to succeed; then later prep_compound_page()
> is where PageHead and PageTails get set. So there's a small race window in
> which this patch could deliver a compound page when it should not.

I will drop this patch.
> 
> [PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip
> I haven't looked at this yet (but recall that per-memcg lru_lock can
> change the point at which compaction should skip a contended lock: IIRC
> the current kernel needs nothing extra, whereas some earlier kernels did
> need extra; but when I look at 30/32, may find these remarks irrelevant).

I will wait for your further comments. :)
> 
> [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
> The title of this patch is definitely wrong: there was an explicit page
> decrement there before (put_page), now it's wrapping it up inside a
> WARN_ON().  We usually prefer to avoid doing functional operations
> inside WARN/BUGs, but I think I'll overlook that - anyone else worried?
> The comment is certainly better than what was there before: yes, this
> warning reflects the difficulty we have in thinking about the
> TestClearPageLRU protocol: which I'm still not sold on, but
> agree we should proceed with.  With a change in title, perhaps
> "mm: add warning where TestClearPageLRU failed on freeable page"?
> Acked-by: Hugh Dickins <hughd@google.com>
> 

Accepted, thanks

> [PATCH v18 32/32] mm: Split release_pages work into 3 passes
> I haven't looked at this yet (but seen no problem with it in testing).
> 
> And finally, here's my replacement (rediffed against 5.9-rc) for 
> [PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting
> 
> From: Hugh Dickins <hughd@google.com>
> Date: Mon, 13 Jun 2016 19:43:34 -0700
> Subject: [PATCH] mm: page_idle_get_page() does not need lru_lock

Accepted, thanks!

> 
> It is necessary for page_idle_get_page() to recheck PageLRU() after
> get_page_unless_zero(), but holding lru_lock around that serves no
> useful purpose, and adds to lru_lock contention: delete it.
> 
> See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
> discussion that led to lru_lock there; but __page_set_anon_rmap() now uses
> WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs() using
> rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but not
> entirely prevented by page_count() check in ksm.c's write_protect_page():
> that risk being shared with page_referenced() and not helped by lru_lock).
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Alex Shi <alex.shi@linux.alibaba.com>
> ---
>  mm/page_idle.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -32,19 +32,15 @@
>  static struct page *page_idle_get_page(unsigned long pfn)
>  {
>  	struct page *page = pfn_to_online_page(pfn);
> -	pg_data_t *pgdat;
>  
>  	if (!page || !PageLRU(page) ||
>  	    !get_page_unless_zero(page))
>  		return NULL;
>  
> -	pgdat = page_pgdat(page);
> -	spin_lock_irq(&pgdat->lru_lock);
>  	if (unlikely(!PageLRU(page))) {
>  		put_page(page);
>  		page = NULL;
>  	}
> -	spin_unlock_irq(&pgdat->lru_lock);
>  	return page;
>  }
>  
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
  2020-09-09  1:01   ` Matthew Wilcox
@ 2020-09-09 15:43     ` Alexander Duyck
  2020-09-09 17:07       ` Matthew Wilcox
  2020-09-09 18:24       ` Hugh Dickins
  0 siblings, 2 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-09-09 15:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Johannes Weiner,
	kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen,
	Michal Hocko, Vladimir Davydov, shy828301, Alexander Duyck

On Tue, Sep 8, 2020 at 6:01 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Aug 24, 2020 at 08:55:04PM +0800, Alex Shi wrote:
> > +++ b/mm/vmscan.c
> > @@ -1688,10 +1688,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >
> >                       if (!TestClearPageLRU(page)) {
> >                               /*
> > -                              * This page may in other isolation path,
> > -                              * but we still hold lru_lock.
> > +                              * This page is being isolated in another
> > +                              * thread, but we still hold lru_lock. The
> > +                              * other thread must be holding a reference
> > +                              * to the page so this should never hit a
> > +                              * reference count of 0.
> >                                */
> > -                             put_page(page);
> > +                             WARN_ON(put_page_testzero(page));
> >                               goto busy;
>
> I read Hugh's review and that led me to take a look at this.  We don't
> do it like this.  Use the same pattern as elsewhere in mm:
>
>         page_ref_sub(page, nr);
>         VM_BUG_ON_PAGE(page_count(page) <= 0, page);
>
>

Actually for this case page_ref_dec(page) would make more sense
wouldn't it? Otherwise I agree that would be a better change if that
is the way it has been handled before. I just wasn't familiar with
those other spots.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-08 23:41       ` [PATCH v18 00/32] per memcg lru_lock: reviews Hugh Dickins
  2020-09-09  2:24         ` Wei Yang
  2020-09-09 15:08         ` Alex Shi
@ 2020-09-09 16:11         ` Alexander Duyck
  2020-09-10  0:32           ` Hugh Dickins
  2 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-09-09 16:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo,
	Konstantin Khlebnikov, Daniel Jordan, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov, shy828301,
	Vlastimil Babka, Minchan Kim, Qian Cai

On Tue, Sep 8, 2020 at 4:41 PM Hugh Dickins <hughd@google.com> wrote:
>

<snip>

> [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block
> Most of this consists of replacing "locked" by "lruvec", which is good:
> but please fold those changes back into 20/32 (or would it be 17/32?
> I've not yet looked into the relationship between those two), so we
> can then see more clearly what change this 28/32 (will need renaming!)
> actually makes, to use lruvec_holds_page_lru_lock(). That may be a
> good change, but it's mixed up with the "locked"->"lruvec" at present,
> and I think you could have just used lruvec for locked all along
> (but of course there's a place where you'll need new_lruvec too).

I am good with my patch being folded in. No need to keep it separate.

> [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block
> NAK. I agree that isolate_migratepages_block() looks nicer this way, but
> take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is
> where set_page_refcounted() changes page->_refcount from 0 to 1, allowing
> a racing get_page_unless_zero() to succeed; then later prep_compound_page()
> is where PageHead and PageTails get set. So there's a small race window in
> which this patch could deliver a compound page when it should not.

So the main motivation for the patch was to avoid the case where we
have to reset the LRU flag. One question I would have is: what if
we swapped this code block with the __isolate_lru_page_prepare section?
With that we would be taking a reference on the page, then verifying
the LRU flag is set, and then testing for the compound page flag bit.
Would doing that close the race window, since the LRU flag being set
should indicate that the allocation has already been completed, should
it not?
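
Roughly what I'm picturing (an untested sketch against this series, reusing
its isolate_fail/isolate_fail_put paths; details approximate):

		/* Pin the page first... */
		if (unlikely(!get_page_unless_zero(page)))
			goto isolate_fail;

		/*
		 * ...then claim the LRU bit: a page straight out of the
		 * allocator cannot have PageLRU set yet, so passing this
		 * check should mean prep_compound_page() has already run.
		 */
		if (!TestClearPageLRU(page))
			goto isolate_fail_put;

		/* Only now test for compound pages. */
		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order < MAX_ORDER))
				low_pfn += (1UL << order) - 1;
			/* put the LRU bit back before giving up the page */
			SetPageLRU(page);
			goto isolate_fail_put;
		}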

> [PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip
> I haven't looked at this yet (but recall that per-memcg lru_lock can
> change the point at which compaction should skip a contended lock: IIRC
> the current kernel needs nothing extra, whereas some earlier kernels did
> need extra; but when I look at 30/32, may find these remarks irrelevant).
>
> [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
> The title of this patch is definitely wrong: there was an explicit page
> decrement there before (put_page), now it's wrapping it up inside a
> WARN_ON().  We usually prefer to avoid doing functional operations
> inside WARN/BUGs, but I think I'll overlook that - anyone else worried?
> The comment is certainly better than what was there before: yes, this
> warning reflects the difficulty we have in thinking about the
> TestClearPageLRU protocol: which I'm still not sold on, but
> agree we should proceed with.  With a change in title, perhaps
> "mm: add warning where TestClearPageLRU failed on freeable page"?
> Acked-by: Hugh Dickins <hughd@google.com>

I can update that and resubmit it if needed. I know there were also
some suggestions from Matthew.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
  2020-09-09 15:43     ` Alexander Duyck
@ 2020-09-09 17:07       ` Matthew Wilcox
  2020-09-09 18:24       ` Hugh Dickins
  1 sibling, 0 replies; 102+ messages in thread
From: Matthew Wilcox @ 2020-09-09 17:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Johannes Weiner,
	kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen,
	Michal Hocko, Vladimir Davydov, shy828301, Alexander Duyck

On Wed, Sep 09, 2020 at 08:43:38AM -0700, Alexander Duyck wrote:
> On Tue, Sep 8, 2020 at 6:01 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Aug 24, 2020 at 08:55:04PM +0800, Alex Shi wrote:
> > > +++ b/mm/vmscan.c
> > > @@ -1688,10 +1688,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >
> > >                       if (!TestClearPageLRU(page)) {
> > >                               /*
> > > -                              * This page may in other isolation path,
> > > -                              * but we still hold lru_lock.
> > > +                              * This page is being isolated in another
> > > +                              * thread, but we still hold lru_lock. The
> > > +                              * other thread must be holding a reference
> > > +                              * to the page so this should never hit a
> > > +                              * reference count of 0.
> > >                                */
> > > -                             put_page(page);
> > > +                             WARN_ON(put_page_testzero(page));
> > >                               goto busy;
> >
> > I read Hugh's review and that led me to take a look at this.  We don't
> > do it like this.  Use the same pattern as elsewhere in mm:
> >
> >         page_ref_sub(page, nr);
> >         VM_BUG_ON_PAGE(page_count(page) <= 0, page);
> 
> Actually for this case page_ref_dec(page) would make more sense
> wouldn't it? Otherwise I agree that would be a better change if that
> is the way it has been handled before. I just wasn't familiar with
> those other spots.

Yes, page_ref_dec() should be fine.  It's hard to remember which of
VM_BUG_ON, WARN_ON, etc, compile down to nothing with various CONFIG
options, and which ones actually evaluate their arguments.  Safer not
to put things with side-effects inside macros.
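
Spelled out, the hunk would then read something like this (illustrative only;
the 'busy' label and surrounding context are as in Alex's patch):

		if (!TestClearPageLRU(page)) {
			/*
			 * The page is being isolated in another thread: that
			 * thread must hold a reference, so the count cannot
			 * reach zero here.
			 */
			page_ref_dec(page);
			VM_BUG_ON_PAGE(page_count(page) <= 0, page);
			goto busy;
		}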


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
  2020-09-09 15:43     ` Alexander Duyck
  2020-09-09 17:07       ` Matthew Wilcox
@ 2020-09-09 18:24       ` Hugh Dickins
  2020-09-09 20:15         ` Matthew Wilcox
  2020-09-09 21:17         ` Alexander Duyck
  1 sibling, 2 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-09-09 18:24 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Matthew Wilcox, Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov, shy828301,
	Alexander Duyck

On Wed, 9 Sep 2020, Alexander Duyck wrote:
> On Tue, Sep 8, 2020 at 6:01 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Mon, Aug 24, 2020 at 08:55:04PM +0800, Alex Shi wrote:
> > > +++ b/mm/vmscan.c
> > > @@ -1688,10 +1688,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >
> > >                       if (!TestClearPageLRU(page)) {
> > >                               /*
> > > -                              * This page may in other isolation path,
> > > -                              * but we still hold lru_lock.
> > > +                              * This page is being isolated in another
> > > +                              * thread, but we still hold lru_lock. The
> > > +                              * other thread must be holding a reference
> > > +                              * to the page so this should never hit a
> > > +                              * reference count of 0.
> > >                                */
> > > -                             put_page(page);
> > > +                             WARN_ON(put_page_testzero(page));
> > >                               goto busy;
> >
> > I read Hugh's review and that led me to take a look at this.  We don't
> > do it like this.  Use the same pattern as elsewhere in mm:
> >
> >         page_ref_sub(page, nr);
> >         VM_BUG_ON_PAGE(page_count(page) <= 0, page);
> >
> >
> 
> Actually for this case page_ref_dec(page) would make more sense
> wouldn't it? Otherwise I agree that would be a better change if that
> is the way it has been handled before. I just wasn't familiar with
> those other spots.

After overnight reflection, my own preference would be simply to
drop this patch.  I think we are making altogether too much of a
fuss here over what was simply correct as plain put_page()
(and further from correct if we change it to leak the page in an
unforeseen circumstance).

And if Alex's comment was not quite grammatically correct, never mind,
it said as much as was worth saying.  I got more worried by his
placement of the "busy:" label, but that does appear to work correctly.

There's probably a thousand places where put_page() is used, where
it would be troublesome if it were the final put_page(): this one
bothered you because you'd been looking at isolate_migratepages_block(),
and its necessary avoidance of lru_lock recursion on put_page();
but let's just leave this put_page() as is.

Hugh


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
  2020-09-09 18:24       ` Hugh Dickins
@ 2020-09-09 20:15         ` Matthew Wilcox
  2020-09-09 21:05           ` Hugh Dickins
  2020-09-09 21:17         ` Alexander Duyck
  1 sibling, 1 reply; 102+ messages in thread
From: Matthew Wilcox @ 2020-09-09 20:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alexander Duyck, Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo,
	Konstantin Khlebnikov, Daniel Jordan, Johannes Weiner,
	kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen,
	Michal Hocko, Vladimir Davydov, shy828301, Alexander Duyck

On Wed, Sep 09, 2020 at 11:24:14AM -0700, Hugh Dickins wrote:
> After overnight reflection, my own preference would be simply to
> drop this patch.  I think we are making altogether too much of a
> fuss here over what was simply correct as plain put_page()
> (and further from correct if we change it to leak the page in an
> unforeseen circumstance).
> 
> And if Alex's comment was not quite grammatically correct, never mind,
> it said as much as was worth saying.  I got more worried by his
> placement of the "busy:" label, but that does appear to work correctly.
> 
> There's probably a thousand places where put_page() is used, where
> it would be troublesome if it were the final put_page(): this one
> bothered you because you'd been looking at isolate_migratepages_block(),
> and its necessary avoidance of lru_lock recursion on put_page();
> but let's just just leave this put_page() as is.

My problem with put_page() is that it's no longer the simple
decrement-and-branch-to-slow-path-if-zero that it used to be.  It has the
awful devmap excrement in it so it really expands into a lot of code.
I really wish that "feature" could be backed out again.  It clearly
wasn't ready for merge.
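
(For context, the inline put_page() around v5.9 looked roughly like the
sketch below, paraphrased from include/linux/mm.h of that era and lightly
commented; treat it as illustrative rather than an exact copy.)

	static inline void put_page(struct page *page)
	{
		page = compound_head(page);

		/*
		 * The devmap special case objected to above: for ZONE_DEVICE
		 * managed pages the 2->1 refcount transition must be caught
		 * so the driver can be notified, which drags extra code into
		 * every inlined put_page() call site.
		 */
		if (page_is_devmap_managed(page)) {
			put_devmap_managed_page(page);
			return;
		}

		if (put_page_testzero(page))
			__put_page(page);
	}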



* Re: [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
  2020-09-09 20:15         ` Matthew Wilcox
@ 2020-09-09 21:05           ` Hugh Dickins
  0 siblings, 0 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-09-09 21:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Alexander Duyck, Alex Shi, Andrew Morton,
	Mel Gorman, Tejun Heo, Konstantin Khlebnikov, Daniel Jordan,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov, shy828301,
	Alexander Duyck

On Wed, 9 Sep 2020, Matthew Wilcox wrote:
> On Wed, Sep 09, 2020 at 11:24:14AM -0700, Hugh Dickins wrote:
> > After overnight reflection, my own preference would be simply to
> > drop this patch.  I think we are making altogether too much of a
> > fuss here over what was simply correct as plain put_page()
> > (and further from correct if we change it to leak the page in an
> > unforeseen circumstance).
> > 
> > And if Alex's comment was not quite grammatically correct, never mind,
> > it said as much as was worth saying.  I got more worried by his
> > placement of the "busy:" label, but that does appear to work correctly.
> > 
> > There's probably a thousand places where put_page() is used, where
> > it would be troublesome if it were the final put_page(): this one
> > bothered you because you'd been looking at isolate_migratepages_block(),
> > and its necessary avoidance of lru_lock recursion on put_page();
> > but let's just just leave this put_page() as is.
> 
> My problem with put_page() is that it's no longer the simple
> decrement-and-branch-to-slow-path-if-zero that it used to be.  It has the
> awful devmap excrement in it so it really expands into a lot of code.
> I really wish that "feature" could be backed out again.  It clearly
> wasn't ready for merge.

And I suppose I should thank you for opening my eyes to that.
I knew there was "dev" stuff inside __put_page(), but didn't
realize that the inline put_page() has now been defiled.
Yes, I agree, that is horrid and begs to be undone.

But this is not the mail thread for discussing that, and we should
not use strange alternatives to put_page(), here or elsewhere,
just to avoid that (surely? hopefully?) temporary excrescence.

Hugh



* Re: [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages
  2020-09-09 18:24       ` Hugh Dickins
  2020-09-09 20:15         ` Matthew Wilcox
@ 2020-09-09 21:17         ` Alexander Duyck
  1 sibling, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-09-09 21:17 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo,
	Konstantin Khlebnikov, Daniel Jordan, Johannes Weiner,
	kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen,
	Michal Hocko, Vladimir Davydov, shy828301, Alexander Duyck

On Wed, Sep 9, 2020 at 11:24 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Wed, 9 Sep 2020, Alexander Duyck wrote:
> > On Tue, Sep 8, 2020 at 6:01 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > On Mon, Aug 24, 2020 at 08:55:04PM +0800, Alex Shi wrote:
> > > > +++ b/mm/vmscan.c
> > > > @@ -1688,10 +1688,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > >
> > > >                       if (!TestClearPageLRU(page)) {
> > > >                               /*
> > > > -                              * This page may in other isolation path,
> > > > -                              * but we still hold lru_lock.
> > > > +                              * This page is being isolated in another
> > > > +                              * thread, but we still hold lru_lock. The
> > > > +                              * other thread must be holding a reference
> > > > +                              * to the page so this should never hit a
> > > > +                              * reference count of 0.
> > > >                                */
> > > > -                             put_page(page);
> > > > +                             WARN_ON(put_page_testzero(page));
> > > >                               goto busy;
> > >
> > > I read Hugh's review and that led me to take a look at this.  We don't
> > > do it like this.  Use the same pattern as elsewhere in mm:
> > >
> > >         page_ref_sub(page, nr);
> > >         VM_BUG_ON_PAGE(page_count(page) <= 0, page);
> > >
> > >
> >
> > Actually for this case page_ref_dec(page) would make more sense
> > wouldn't it? Otherwise I agree that would be a better change if that
> > is the way it has been handled before. I just wasn't familiar with
> > those other spots.
>
> After overnight reflection, my own preference would be simply to
> drop this patch.  I think we are making altogether too much of a
> fuss here over what was simply correct as plain put_page()
> (and further from correct if we change it to leak the page in an
> unforeseen circumstance).
>
> And if Alex's comment was not quite grammatically correct, never mind,
> it said as much as was worth saying.  I got more worried by his
> placement of the "busy:" label, but that does appear to work correctly.
>
> There's probably a thousand places where put_page() is used, where
> it would be troublesome if it were the final put_page(): this one
> bothered you because you'd been looking at isolate_migratepages_block(),
> and its necessary avoidance of lru_lock recursion on put_page();
> but let's just just leave this put_page() as is.

I'd be fine with that, but I would still like to see the comment
updated. At a minimum we should make it clear that we believe
put_page() is safe here because the refcount should never reach zero,
and that if it does then we are looking at a bug. Then if this starts
triggering soft lockups, we at least have documentation somewhere that
someone can reference for what we expected and why we triggered a lockup.

- Alex
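
A minimal sketch of the kind of comment update being asked for here,
keeping the plain put_page() (hypothetical wording, not a posted patch):

			if (!TestClearPageLRU(page)) {
				/*
				 * This page is being isolated in another
				 * thread, but we still hold lru_lock. The
				 * other thread must hold a reference, so
				 * this cannot be the final put_page() and
				 * cannot recurse into lru_lock; if the
				 * refcount does reach zero here, that is
				 * a bug.
				 */
				put_page(page);
				goto busy;
			}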



* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-09 15:08         ` Alex Shi
@ 2020-09-09 23:16           ` Hugh Dickins
  2020-09-11  2:50             ` Alex Shi
  2020-09-12  8:38           ` Hugh Dickins
  1 sibling, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-09 23:16 UTC (permalink / raw)
  To: Alex Shi
  Cc: Hugh Dickins, Andrew Morton, mgorman, tj, khlebnikov,
	daniel.m.jordan, willy, hannes, lkp, linux-mm, linux-kernel,
	cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301,
	vbabka, minchan, cai


On Wed, 9 Sep 2020, Alex Shi wrote:
> On 2020/9/9 7:41 AM, Hugh Dickins wrote:
> > 
> > [PATCH v18 05/32] mm/thp: remove code path which never got into
> > This is a good simplification, but I see no sign that you understand
> > why it's valid: it relies on lru_add_page_tail() being called while
> > head refcount is frozen to 0: we would not get this far if someone
> > else holds a reference to the THP - which they must hold if they have
> > isolated the page from its lru (and that's true before or after your
> > per-memcg changes - but even truer after those changes, since PageLRU
> > can then be flipped without lru_lock at any instant): please explain
> > something of this in the commit message.
> 
> Is the following commit log better?
> 
>     split_huge_page() will never call on a page which isn't on lru list, so
>     this code never got a chance to run, and should not be run, to add tail
>     pages on a lru list which head page isn't there.
> 
>     Hugh Dickins' mentioned:
>     The path should never be called since lru_add_page_tail() being called
>     while head refcount is frozen to 0: we would not get this far if someone
>     else holds a reference to the THP - which they must hold if they have
>     isolated the page from its lru.
> 
>     Although the bug was never triggered, it'better be removed for code
>     correctness, and add a warn for unexpected calling.

Not much better, no.  split_huge_page() can easily be called for a page
which is not on the lru list at the time, and I don't know what was the
bug which was never triggered.  Stick with whatever text you end up with
for the combination of 05/32 and 18/32, and I'll rewrite it after.

> > [PATCH v18 06/32] mm/thp: narrow lru locking
> > Why? What part does this play in the series? "narrow lru locking" can
> > also be described as "widen page cache locking": 
> 
> Uh, the page cache locking isn't widen, it's still on the old place.

I'm not sure if you're joking there. Perhaps just a misunderstanding.

Yes, patch 06/32 does not touch the xa_lock(&mapping->i_pages) and
xa_lock(&swap_cache->i_pages) lines (odd how we've arrived at two of
those, but please do not get into cleaning it up now); but it removes
the spin_lock_irqsave(&pgdata->lru_lock, flags) which used to come
before them, and inserts a spin_lock(&pgdat->lru_lock) after them.

You call that narrowing the lru locking, okay, but I see it as also
pushing the page cache locking outwards: before this patch, page cache
lock was taken inside lru_lock; after this patch, page cache lock is
taken outside lru_lock.  If you cannot see that, then I think you
should not have touched this code at all; but it's what we have
been testing, and I think we should go forward with it.
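
Schematically, the reordering being described is the following (lock names
only, calls and error paths simplified; the "before" column is v5.9's
split_huge_page_to_list(), the "after" column is what 06/32 does):

	/*
	 * before 06/32:  spin_lock_irqsave(&pgdat->lru_lock)
	 *                  -> xa_lock(&mapping->i_pages)
	 *
	 * after  06/32:  local_irq_disable()
	 *                  -> xa_lock(&mapping->i_pages)
	 *                       -> spin_lock(&pgdat->lru_lock)
	 */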

> > But I wish you could give some reason for it in the commit message!
> 
> It's a head scratch task. Would you like to tell me what's detailed info 
> should be there? Thanks!

So, you don't know why you did it either: then it will be hard to
justify.  I guess I'll have to write something for it later.  I'm
strongly tempted just to drop the patch, but expect it will become
useful later, for using lock_page_memcg() before getting lru_lock.

> > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> > Is that correct? Or Wei Yang suggested some part of it perhaps?
> 
> Yes, we talked a lot to confirm the locking change is safe.

Okay, but the patch was written by you, and sent by you to Andrew:
that is not a case for "Signed-off-by: Someone Else".

> > [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock
> > Could we please drop this one for the moment? And come back to it later
> > when the basic series is safely in.  It's a good idea to try sorting
> > together those pages which come under the same lock (though my guess is
> > that they naturally gather themselves together quite well already); but
> > I'm not happy adding 360 bytes to the kernel stack here (and that in
> > addition to 192 bytes of horrid pseudo-vma in the shmem swapin case),
> > though that could be avoided by making it per-cpu. But I hope there's
> > a simpler way of doing it, as efficient, but also useful for the other
> > pagevec operations here: perhaps scanning the pagevec for same page->
> > mem_cgroup (and flags node bits), NULLing entries as they are done.
> > Another, easily fixed, minor defect in this patch: if I'm reading it
> > right, it reverses the order in which the pages are put on the lru?
> 
> this patch could give about 10+% performance gain on my multiple memcg
> readtwice testing. fairness locking cost the performance much.

Good to know, should have been mentioned.  s/fairness/Repeated/

But what was the gain or loss on your multiple memcg readtwice
testing without this patch, compared against node-only lru_lock?
The 80% gain mentioned before, I presume.  So this further
optimization can wait until the rest is solid.

> 
> I also tried per cpu solution but that cause much trouble of per cpu func
> things, and looks no benefit except a bit struct size of stack, so if 
> stack size still fine. May we could use the solution and improve it better.
> like, functionlize, fix the reverse issue etc.

I don't know how important the stack depth consideration is nowadays:
I still care, maybe others don't, since VMAP_STACK became an option.

Yes, please fix the reversal (if I was right on that); and I expect
you could use a singly linked list instead of the double.

But I'll look for an alternative - later, once the urgent stuff
is completed - and leave the acks on this patch to others.

Hugh


* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-09 16:11         ` Alexander Duyck
@ 2020-09-10  0:32           ` Hugh Dickins
  2020-09-10 14:24             ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-10  0:32 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Hugh Dickins, Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo,
	Konstantin Khlebnikov, Daniel Jordan, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov, shy828301,
	Vlastimil Babka, Minchan Kim, Qian Cai

On Wed, 9 Sep 2020, Alexander Duyck wrote:
> On Tue, Sep 8, 2020 at 4:41 PM Hugh Dickins <hughd@google.com> wrote:
> > [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block
> > Most of this consists of replacing "locked" by "lruvec", which is good:
> > but please fold those changes back into 20/32 (or would it be 17/32?
> > I've not yet looked into the relationship between those two), so we
> > can then see more clearly what change this 28/32 (will need renaming!)
> > actually makes, to use lruvec_holds_page_lru_lock(). That may be a
> > good change, but it's mixed up with the "locked"->"lruvec" at present,
> > and I think you could have just used lruvec for locked all along
> > (but of course there's a place where you'll need new_lruvec too).
> 
> I am good with my patch being folded in. No need to keep it separate.

Thanks.  Though it was only the "locked"->"lruvec" changes I was
suggesting to fold back, to minimize the diff, so that we could
see your use of lruvec_holds_page_lru_lock() more clearly - you
had not introduced that function at the stage of the earlier patches.

But now that I stare at it again, using lruvec_holds_page_lru_lock()
there doesn't look like an advantage to me: when it decides no, the
same calculation is made all over again in mem_cgroup_page_lruvec(),
whereas the code before only had to calculate it once.

So, the code before looks better to me: I wonder, do you think that
rcu_read_lock() is more expensive than I think it?  There can be
debug instrumentation that makes it heavier, but by itself it is
very cheap (by design) - not worth branching around.

> 
> > [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block
> > NAK. I agree that isolate_migratepages_block() looks nicer this way, but
> > take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is
> > where set_page_refcounted() changes page->_refcount from 0 to 1, allowing
> > a racing get_page_unless_zero() to succeed; then later prep_compound_page()
> > is where PageHead and PageTails get set. So there's a small race window in
> > which this patch could deliver a compound page when it should not.
> 
> So the main motivation for the patch was to avoid the case where we
> are having to reset the LRU flag.

That would be satisfying.  Not necessary, but I agree satisfying.
Maybe depends also on your "skip" change, which I've not looked at yet?

> One question I would have is what if
> we swapped the code block with the __isolate_lru_page_prepare section?
> WIth that we would be taking a reference on the page, then verifying
> the LRU flag is set, and then testing for compound page flag bit.
> Would doing that close the race window since the LRU flag being set
> should indicate that the allocation has already been completed has it
> not?

Yes, I think that would be safe, and would look better.  But I am
very hesitant to give snap assurances here (I've twice missed out
a vital PageLRU check from this sequence myself): it is very easy
to deceive myself and only see it later.

If you can see a bug in what's there before these patches, certainly
we need to fix it.  But adding non-essential patches to the already
overlong series risks delaying it.

Hugh



* Re: [PATCH v18 06/32] mm/thp: narrow lru locking
  2020-08-24 12:54 ` [PATCH v18 06/32] mm/thp: narrow lru locking Alex Shi
@ 2020-09-10 13:49   ` Matthew Wilcox
  2020-09-11  3:37     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Matthew Wilcox @ 2020-09-10 13:49 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Andrea Arcangeli

On Mon, Aug 24, 2020 at 08:54:39PM +0800, Alex Shi wrote:
> lru_lock and page cache xa_lock have no reason with current sequence,
> put them together isn't necessary. let's narrow the lru locking, but
> left the local_irq_disable to block interrupt re-entry and statistic update.

What stats are you talking about here?

> +++ b/mm/huge_memory.c
> @@ -2397,7 +2397,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  }
>  
>  static void __split_huge_page(struct page *page, struct list_head *list,
> -		pgoff_t end, unsigned long flags)
> +			      pgoff_t end)

Please don't change this whitespace.  It's really annoying having to
adjust the whitespace when renaming a function.  Just two tabs indentation
to give a clear separation of arguments from code is fine.


How about this patch instead?  It occurred to me we already have
perfectly good infrastructure to track whether or not interrupts are
already disabled, and so we should use that instead of ensuring that
interrupts are disabled, or tracking that ourselves.

But I may have missed something else that's relying on having
interrupts disabled.  Please check carefully.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2ccff8472cd4..74cae6c032f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2376,17 +2376,16 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+		pgoff_t end)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
+	unsigned long flags;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2395,9 +2394,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 		offset = swp_offset(entry);
 		swap_cache = swap_address_space(entry);
-		xa_lock(&swap_cache->i_pages);
+		xa_lock_irq(&swap_cache->i_pages);
 	}
 
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock_irqsave(&pgdat->lru_lock, flags);
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
@@ -2417,6 +2420,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
+	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
 
@@ -2425,18 +2429,16 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		/* Additional pin to swap cache */
 		if (PageSwapCache(head)) {
 			page_ref_add(head, 2);
-			xa_unlock(&swap_cache->i_pages);
+			xa_unlock_irq(&swap_cache->i_pages);
 		} else {
 			page_ref_inc(head);
 		}
 	} else {
 		/* Additional pin to page cache */
 		page_ref_add(head, 2);
-		xa_unlock(&head->mapping->i_pages);
+		xa_unlock_irq(&head->mapping->i_pages);
 	}
 
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-
 	remap_page(head);
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -2574,7 +2576,6 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
@@ -2640,9 +2641,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
-
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
 
@@ -2650,13 +2648,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		 * Check if the head page is present in page cache.
 		 * We assume all tail are present too, if head is there.
 		 */
-		xa_lock(&mapping->i_pages);
+		xa_lock_irq(&mapping->i_pages);
 		if (xas_load(&xas) != head)
 			goto fail;
 	}
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
-	spin_lock(&ds_queue->split_queue_lock);
+	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	count = page_count(head);
 	mapcount = total_mapcount(head);
 	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
@@ -2664,7 +2662,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
 		}
-		spin_unlock(&ds_queue->split_queue_lock);
+		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 		if (mapping) {
 			if (PageSwapBacked(head))
 				__dec_node_page_state(head, NR_SHMEM_THPS);
@@ -2672,7 +2670,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
 
@@ -2688,10 +2686,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			dump_page(page, "total_mapcount(head) > 0");
 			BUG();
 		}
-		spin_unlock(&ds_queue->split_queue_lock);
+		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 fail:		if (mapping)
-			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+			xa_unlock_irq(&mapping->i_pages);
 		remap_page(head);
 		ret = -EBUSY;
 	}



* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-10  0:32           ` Hugh Dickins
@ 2020-09-10 14:24             ` Alexander Duyck
  2020-09-12  5:12               ` Hugh Dickins
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-09-10 14:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo,
	Konstantin Khlebnikov, Daniel Jordan, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov, shy828301,
	Vlastimil Babka, Minchan Kim, Qian Cai

On Wed, Sep 9, 2020 at 5:32 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Wed, 9 Sep 2020, Alexander Duyck wrote:
> > On Tue, Sep 8, 2020 at 4:41 PM Hugh Dickins <hughd@google.com> wrote:
> > > [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block
> > > Most of this consists of replacing "locked" by "lruvec", which is good:
> > > but please fold those changes back into 20/32 (or would it be 17/32?
> > > I've not yet looked into the relationship between those two), so we
> > > can then see more clearly what change this 28/32 (will need renaming!)
> > > actually makes, to use lruvec_holds_page_lru_lock(). That may be a
> > > good change, but it's mixed up with the "locked"->"lruvec" at present,
> > > and I think you could have just used lruvec for locked all along
> > > (but of course there's a place where you'll need new_lruvec too).
> >
> > I am good with my patch being folded in. No need to keep it separate.
>
> Thanks.  Though it was only the "locked"->"lruvec" changes I was
> suggesting to fold back, to minimize the diff, so that we could
> see your use of lruvec_holds_page_lru_lock() more clearly - you
> had not introduced that function at the stage of the earlier patches.
>
> But now that I stare at it again, using lruvec_holds_page_lru_lock()
> there doesn't look like an advantage to me: when it decides no, the
> same calculation is made all over again in mem_cgroup_page_lruvec(),
> whereas the code before only had to calculate it once.
>
> So, the code before looks better to me: I wonder, do you think that
> rcu_read_lock() is more expensive than I think it?  There can be
> debug instrumentation that makes it heavier, but by itself it is
> very cheap (by design) - not worth branching around.

Actually what I was more concerned with was the pointer chase that
required the RCU lock. With this function we are able to compare a
pair of pointers from the page and the lruvec and avoid the need for
the RCU lock. The way the old code was working we had to crawl through
the memcg to get to the lruvec before we could compare it to the one
we currently hold. The general idea is to use the data we have instead
of having to pull in some additional cache lines to perform the test.
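
A sketch of the kind of pointer comparison meant here, paraphrasing the
helper added in the series (the exact field names and the handling of an
uncharged page may differ, so take it as an approximation):

	static inline bool lruvec_holds_page_lru_lock(struct page *page,
						      struct lruvec *lruvec)
	{
		pg_data_t *pgdat = page_pgdat(page);

		if (mem_cgroup_disabled())
			return lruvec == &pgdat->__lruvec;

		/* Compare data already at hand: no rcu_read_lock(), no
		 * pointer chase through the memcg to reach its lruvec. */
		return lruvec_pgdat(lruvec) == pgdat &&
		       lruvec_memcg(lruvec) == READ_ONCE(page->mem_cgroup);
	}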

> >
> > > [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block
> > > NAK. I agree that isolate_migratepages_block() looks nicer this way, but
> > > take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is
> > > where set_page_refcounted() changes page->_refcount from 0 to 1, allowing
> > > a racing get_page_unless_zero() to succeed; then later prep_compound_page()
> > > is where PageHead and PageTails get set. So there's a small race window in
> > > which this patch could deliver a compound page when it should not.
> >
> > So the main motivation for the patch was to avoid the case where we
> > are having to reset the LRU flag.
>
> That would be satisfying.  Not necessary, but I agree satisfying.
> Maybe depends also on your "skip" change, which I've not looked at yet?

My concern is that we have scenarios where isolate_migratepages_block()
could prevent another thread from being able to isolate a page.
I'm mostly concerned with us potentially creating something like an
isolation leak if multiple threads are clearing and then resetting the
LRU flag. In my mind, if we clear the LRU flag we should be certain we
are going to remove the page; otherwise another thread would have done
it, had it been allowed access.

> > One question I would have is what if
> > we swapped the code block with the __isolate_lru_page_prepare section?
> > WIth that we would be taking a reference on the page, then verifying
> > the LRU flag is set, and then testing for compound page flag bit.
> > Would doing that close the race window since the LRU flag being set
> > should indicate that the allocation has already been completed has it
> > not?
>
> Yes, I think that would be safe, and would look better.  But I am
> very hesitant to give snap assurances here (I've twice missed out
> a vital PageLRU check from this sequence myself): it is very easy
> to deceive myself and only see it later.

I'm not looking for assurances, just sanity checks to make sure I am
not missing something obvious.

> If you can see a bug in what's there before these patches, certainly
> we need to fix it.  But adding non-essential patches to the already
> overlong series risks delaying it.

My concern ends up being that if we are clearing the bit and restoring
it while holding the LRU lock, we can effectively cause pages to become
pseudo-pinned on the LRU. In my mind I would want us to avoid clearing
the LRU flag until we know we are going to be pulling the page from
the list once we take the lruvec lock. I interpret clearing of the
flag to indicate that the page has already been pulled; it just hasn't
left the list yet. By resetting the bit we are violating that, which I
worry will lead to issues.
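
For reference, the shape of the pattern being debated is roughly the
following (schematic only, not the series' exact code; the lruvec lock
helpers are named approximately as the series introduces them, and
should_skip() is a hypothetical stand-in for the rechecks done under
the lock):

	if (!TestClearPageLRU(page))
		goto isolate_fail_put;	/* another thread owns the isolation */

	lruvec = lock_page_lruvec_irqsave(page, &flags);

	if (should_skip(page)) {
		/* the "reset" being questioned above */
		SetPageLRU(page);
		unlock_page_lruvec_irqrestore(lruvec, flags);
		goto isolate_fail_put;
	}
	/* otherwise the page really is pulled from the lru list */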



* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-09 23:16           ` Hugh Dickins
@ 2020-09-11  2:50             ` Alex Shi
  2020-09-12  2:13               ` Hugh Dickins
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-09-11  2:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, mgorman, tj, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, vbabka, minchan,
	cai



On 2020/9/10 7:16 AM, Hugh Dickins wrote:
> On Wed, 9 Sep 2020, Alex Shi wrote:
>> On 2020/9/9 7:41 AM, Hugh Dickins wrote:
>>>
>>> [PATCH v18 05/32] mm/thp: remove code path which never got into
>>> This is a good simplification, but I see no sign that you understand
>>> why it's valid: it relies on lru_add_page_tail() being called while
>>> head refcount is frozen to 0: we would not get this far if someone
>>> else holds a reference to the THP - which they must hold if they have
>>> isolated the page from its lru (and that's true before or after your
>>> per-memcg changes - but even truer after those changes, since PageLRU
>>> can then be flipped without lru_lock at any instant): please explain
>>> something of this in the commit message.
>>
>> Is the following commit log better?
>>
>>     split_huge_page() will never call on a page which isn't on lru list, so
>>     this code never got a chance to run, and should not be run, to add tail
>>     pages on a lru list which head page isn't there.
>>
>>     Hugh Dickins' mentioned:
>>     The path should never be called since lru_add_page_tail() being called
>>     while head refcount is frozen to 0: we would not get this far if someone
>>     else holds a reference to the THP - which they must hold if they have
>>     isolated the page from its lru.
>>
>>     Although the bug was never triggered, it'better be removed for code
>>     correctness, and add a warn for unexpected calling.
> 
> Not much better, no.  split_huge_page() can easily be called for a page
> which is not on the lru list at the time, 

Hi Hugh,

Thanks for the comments!

There was some discussion on this point a couple of weeks ago:
https://lkml.org/lkml/2020/7/9/760

Matthew Wilcox and Kirill had the following comments:
> I don't understand how we get to split_huge_page() with a page that's
> not on an LRU list.  Both anonymous and page cache pages should be on
> an LRU list.  What am I missing?

Right, and it's never got removed from LRU during the split. The tail
pages have to be added to LRU because they now separate from the tail
page.

-- 
 Kirill A. Shutemov

> and I don't know what was the
> bug which was never triggered.  

So the only path into the removed code would be a bug, like something here:
https://lkml.org/lkml/2020/7/10/118
or
https://lkml.org/lkml/2020/7/10/972

> Stick with whatever text you end up with
> for the combination of 05/32 and 18/32, and I'll rewrite it after.

I don't object to merging them into one; I just don't know how to describe
both patches clearly in one commit log. As for patch 18, TestClearPageLRU
adds the possibility of the lru bit being incorrectly cleared during split;
that's the reason for the code path rewrite and the WARN there.

Thanks
Alex
> 
>>> [PATCH v18 06/32] mm/thp: narrow lru locking
>>> Why? What part does this play in the series? "narrow lru locking" can
>>> also be described as "widen page cache locking": 
>>
>> Uh, the page cache locking isn't widen, it's still on the old place.
> 
> I'm not sure if you're joking there. Perhaps just a misunderstanding.
> 
> Yes, patch 06/32 does not touch the xa_lock(&mapping->i_pages) and
> xa_lock(&swap_cache->i_pages) lines (odd how we've arrived at two of
> those, but please do not get into cleaning it up now); but it removes
> the spin_lock_irqsave(&pgdata->lru_lock, flags) which used to come
> before them, and inserts a spin_lock(&pgdat->lru_lock) after them.
> 
> You call that narrowing the lru locking, okay, but I see it as also
> pushing the page cache locking outwards: before this patch, page cache
> lock was taken inside lru_lock; after this patch, page cache lock is
> taken outside lru_lock.  If you cannot see that, then I think you
> should not have touched this code at all; but it's what we have
> been testing, and I think we should go forward with it.
> 
>>> But I wish you could give some reason for it in the commit message!
>>
>> It's a head scratch task. Would you like to tell me what's detailed info 
>> should be there? Thanks!
> 
> So, you don't know why you did it either: then it will be hard to
> justify.  I guess I'll have to write something for it later.  I'm
> strongly tempted just to drop the patch, but expect it will become
> useful later, for using lock_page_memcg() before getting lru_lock.
> 

I thought the xa_lock and lru_lock relationship was described clearly
in the commit log, but I still have no idea where move_lock fits in the chain.
Please fill in whatever I overlooked.
Thanks!

>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>> Is that correct? Or Wei Yang suggested some part of it perhaps?
>>
>> Yes, we talked a lot to confirm the locking change is safe.
> 
> Okay, but the patch was written by you, and sent by you to Andrew:
> that is not a case for "Signed-off-by: Someone Else".
> 

OK, let's remove his Signed-off-by.

>>> [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock
>>> Could we please drop this one for the moment? And come back to it later
>>> when the basic series is safely in.  It's a good idea to try sorting
>>> together those pages which come under the same lock (though my guess is
>>> that they naturally gather themselves together quite well already); but
>>> I'm not happy adding 360 bytes to the kernel stack here (and that in
>>> addition to 192 bytes of horrid pseudo-vma in the shmem swapin case),
>>> though that could be avoided by making it per-cpu. But I hope there's
>>> a simpler way of doing it, as efficient, but also useful for the other
>>> pagevec operations here: perhaps scanning the pagevec for same page->
>>> mem_cgroup (and flags node bits), NULLing entries as they are done.
>>> Another, easily fixed, minor defect in this patch: if I'm reading it
>>> right, it reverses the order in which the pages are put on the lru?
>>
>> this patch could give about 10+% performance gain on my multiple memcg
>> readtwice testing. fairness locking cost the performance much.
> 
> Good to know, should have been mentioned.  s/fairness/Repeated/
> 
> But what was the gain or loss on your multiple memcg readtwice
> testing without this patch, compared against node-only lru_lock?
> The 80% gain mentioned before, I presume.  So this further
> optimization can wait until the rest is solid.

The gain is relative to patch 26.

> 
>>
>> I also tried per cpu solution but that cause much trouble of per cpu func
>> things, and looks no benefit except a bit struct size of stack, so if 
>> stack size still fine. May we could use the solution and improve it better.
>> like, functionlize, fix the reverse issue etc.
> 
> I don't know how important the stack depth consideration is nowadays:
> I still care, maybe others don't, since VMAP_STACK became an option.
> 
> Yes, please fix the reversal (if I was right on that); and I expect
> you could use a singly linked list instead of the double.

A singly linked list saves more, but do we have to walk it in reverse to
find the head or tail to get the correct sequence?

> 
> But I'll look for an alternative - later, once the urgent stuff
> is completed - and leave the acks on this patch to others.

OK, looking forward to your new solution!

Thanks
Alex



* Re: [PATCH v18 06/32] mm/thp: narrow lru locking
  2020-09-10 13:49   ` Matthew Wilcox
@ 2020-09-11  3:37     ` Alex Shi
  2020-09-13 15:27       ` Matthew Wilcox
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-09-11  3:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Andrea Arcangeli



On 2020/9/10 9:49 PM, Matthew Wilcox wrote:
> On Mon, Aug 24, 2020 at 08:54:39PM +0800, Alex Shi wrote:
>> lru_lock and page cache xa_lock have no reason with current sequence,
>> put them together isn't necessary. let's narrow the lru locking, but
>> left the local_irq_disable to block interrupt re-entry and statistic update.
> 
> What stats are you talking about here?

Hi Matthew,

Thanks for the comments!

Stats like __dec_node_page_state(head, NR_SHMEM_THPS), for example, will trigger a preemption warning...

> 
>> +++ b/mm/huge_memory.c
>> @@ -2397,7 +2397,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>  }
>>  
>>  static void __split_huge_page(struct page *page, struct list_head *list,
>> -		pgoff_t end, unsigned long flags)
>> +			      pgoff_t end)
> 
> Please don't change this whitespace.  It's really annoying having to
> adjust the whitespace when renaming a function.  Just two tabs indentation
> to give a clear separation of arguments from code is fine.
> 
> 
> How about this patch instead?  It occurred to me we already have
> perfectly good infrastructure to track whether or not interrupts are
> already disabled, and so we should use that instead of ensuring that
> interrupts are disabled, or tracking that ourselves.

So your proposal looks like:
1, xa_lock_irq(&mapping->i_pages); (optional)
2, spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
3, spin_lock_irqsave(&pgdat->lru_lock, flags);

Are the flags meaningful for the 2nd and 3rd locks?

IIRC, I had a similar proposal to yours, with the flags used in
xa_lock_irqsave(), but Hugh objected to it.

Thanks
Alex

> 
> But I may have missed something else that's relying on having
> interrupts disabled.  Please check carefully.
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2ccff8472cd4..74cae6c032f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2376,17 +2376,16 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  }
>  
>  static void __split_huge_page(struct page *page, struct list_head *list,
> -		pgoff_t end, unsigned long flags)
> +		pgoff_t end)
>  {
>  	struct page *head = compound_head(page);
>  	pg_data_t *pgdat = page_pgdat(head);
>  	struct lruvec *lruvec;
>  	struct address_space *swap_cache = NULL;
>  	unsigned long offset = 0;
> +	unsigned long flags;
>  	int i;
>  
> -	lruvec = mem_cgroup_page_lruvec(head, pgdat);
> -
>  	/* complete memcg works before add pages to LRU */
>  	mem_cgroup_split_huge_fixup(head);
>  
> @@ -2395,9 +2394,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  
>  		offset = swp_offset(entry);
>  		swap_cache = swap_address_space(entry);
> -		xa_lock(&swap_cache->i_pages);
> +		xa_lock_irq(&swap_cache->i_pages);
>  	}
>  
> +	/* prevent PageLRU to go away from under us, and freeze lru stats */
> +	spin_lock_irqsave(&pgdat->lru_lock, flags);
> +	lruvec = mem_cgroup_page_lruvec(head, pgdat);
> +
>  	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>  		__split_huge_page_tail(head, i, lruvec, list);
>  		/* Some pages can be beyond i_size: drop them from page cache */
> @@ -2417,6 +2420,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  	}
>  
>  	ClearPageCompound(head);
> +	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>  
>  	split_page_owner(head, HPAGE_PMD_ORDER);
>  
> @@ -2425,18 +2429,16 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  		/* Additional pin to swap cache */
>  		if (PageSwapCache(head)) {
>  			page_ref_add(head, 2);
> -			xa_unlock(&swap_cache->i_pages);
> +			xa_unlock_irq(&swap_cache->i_pages);
>  		} else {
>  			page_ref_inc(head);
>  		}
>  	} else {
>  		/* Additional pin to page cache */
>  		page_ref_add(head, 2);
> -		xa_unlock(&head->mapping->i_pages);
> +		xa_unlock_irq(&head->mapping->i_pages);
>  	}
>  
> -	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -
>  	remap_page(head);
>  
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> @@ -2574,7 +2576,6 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
>  int split_huge_page_to_list(struct page *page, struct list_head *list)
>  {
>  	struct page *head = compound_head(page);
> -	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
>  	struct deferred_split *ds_queue = get_deferred_split_queue(head);
>  	struct anon_vma *anon_vma = NULL;
>  	struct address_space *mapping = NULL;
> @@ -2640,9 +2641,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  	unmap_page(head);
>  	VM_BUG_ON_PAGE(compound_mapcount(head), head);
>  
> -	/* prevent PageLRU to go away from under us, and freeze lru stats */
> -	spin_lock_irqsave(&pgdata->lru_lock, flags);
> -
>  	if (mapping) {
>  		XA_STATE(xas, &mapping->i_pages, page_index(head));
>  
> @@ -2650,13 +2648,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  		 * Check if the head page is present in page cache.
>  		 * We assume all tail are present too, if head is there.
>  		 */
> -		xa_lock(&mapping->i_pages);
> +		xa_lock_irq(&mapping->i_pages);
>  		if (xas_load(&xas) != head)
>  			goto fail;
>  	}
>  
>  	/* Prevent deferred_split_scan() touching ->_refcount */
> -	spin_lock(&ds_queue->split_queue_lock);
> +	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	count = page_count(head);
>  	mapcount = total_mapcount(head);
>  	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
> @@ -2664,7 +2662,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  			ds_queue->split_queue_len--;
>  			list_del(page_deferred_list(head));
>  		}
> -		spin_unlock(&ds_queue->split_queue_lock);
> +		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  		if (mapping) {
>  			if (PageSwapBacked(head))
>  				__dec_node_page_state(head, NR_SHMEM_THPS);
> @@ -2672,7 +2670,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  				__dec_node_page_state(head, NR_FILE_THPS);
>  		}
>  
> -		__split_huge_page(page, list, end, flags);
> +		__split_huge_page(page, list, end);
>  		if (PageSwapCache(head)) {
>  			swp_entry_t entry = { .val = page_private(head) };
>  
> @@ -2688,10 +2686,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  			dump_page(page, "total_mapcount(head) > 0");
>  			BUG();
>  		}
> -		spin_unlock(&ds_queue->split_queue_lock);
> +		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  fail:		if (mapping)
> -			xa_unlock(&mapping->i_pages);
> -		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
> +			xa_unlock_irq(&mapping->i_pages);
>  		remap_page(head);
>  		ret = -EBUSY;
>  	}
> 



* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-11  2:50             ` Alex Shi
@ 2020-09-12  2:13               ` Hugh Dickins
  2020-09-13 14:21                 ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-12  2:13 UTC (permalink / raw)
  To: Alex Shi
  Cc: Hugh Dickins, Andrew Morton, mgorman, tj, khlebnikov,
	daniel.m.jordan, willy, hannes, lkp, linux-mm, linux-kernel,
	cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301,
	vbabka, minchan, cai


On Fri, 11 Sep 2020, Alex Shi wrote:
> On 2020/9/10 7:16 AM, Hugh Dickins wrote:
> > On Wed, 9 Sep 2020, Alex Shi wrote:
> >> On 2020/9/9 7:41 AM, Hugh Dickins wrote:
> >>>
> >>> [PATCH v18 05/32] mm/thp: remove code path which never got into
> >>> This is a good simplification, but I see no sign that you understand
> >>> why it's valid: it relies on lru_add_page_tail() being called while
> >>> head refcount is frozen to 0: we would not get this far if someone
> >>> else holds a reference to the THP - which they must hold if they have
> >>> isolated the page from its lru (and that's true before or after your
> >>> per-memcg changes - but even truer after those changes, since PageLRU
> >>> can then be flipped without lru_lock at any instant): please explain
> >>> something of this in the commit message.
> >>
> >> Is the following commit log better?
> >>
> >>     split_huge_page() will never call on a page which isn't on lru list, so
> >>     this code never got a chance to run, and should not be run, to add tail
> >>     pages on a lru list which head page isn't there.
> >>
> >>     Hugh Dickins' mentioned:
> >>     The path should never be called since lru_add_page_tail() being called
> >>     while head refcount is frozen to 0: we would not get this far if someone
> >>     else holds a reference to the THP - which they must hold if they have
> >>     isolated the page from its lru.
> >>
> >>     Although the bug was never triggered, it'better be removed for code
> >>     correctness, and add a warn for unexpected calling.
> > 
> > Not much better, no.  split_huge_page() can easily be called for a page
> > which is not on the lru list at the time, 
> 
> Hi Hugh,
> 
> Thanks for comments!
> 
> There are some discussion on this point a couple of weeks ago,
> https://lkml.org/lkml/2020/7/9/760
> 
> Matthew Wilcox and Kirill have the following comments,
> > I don't understand how we get to split_huge_page() with a page that's
> > not on an LRU list.  Both anonymous and page cache pages should be on
> > an LRU list.  What am I missing?
> 
> Right, and it's never got removed from LRU during the split. The tail
> pages have to be added to LRU because they now separate from the tail
> page.
> 
> -- 
>  Kirill A. Shutemov

Yes, those were among the mails that I read through before getting
down to review.  I was surprised by their not understanding, but
it was a bit late to reply to that thread.

Perhaps everybody had been focused on pages which have been and
naturally belong on an LRU list, rather than pages which are on
the LRU list at the instant that split_huge_page() is called.

There are a number of places where PageLRU gets cleared, and a
number of places where we del_page_from_lru_list(), I think you'll
agree: your patches touch all or most of them.  Let's think of a
common one, isolate_lru_pages() used by page reclaim, but the same
would apply to most of the others.

Then there are a number of places where split_huge_page() is called:
I am having difficulty finding any of those which cannot race with
page reclaim, but shall we choose anon THP's deferred_split_scan(),
or shmem THP's shmem_punch_compound()?

What prevents either of those from calling split_huge_page() at
a time when isolate_lru_pages() has removed the page from LRU?

But there's no problem in this race, because anyone isolating the
page from LRU must hold their own reference to the page (to prevent
it from being freed independently), and the can_split_huge_page() or
page_ref_freeze() in split_huge_page_to_list() will detect that and
fail the split with -EBUSY (or else succeed and prevent new references
from being acquired).  So this case never reaches lru_add_page_tail().
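
In code terms, that guard is the refcount check in split_huge_page_to_list();
condensed below from v5.9 mm/huge_memory.c, with the other failure handling
omitted:

	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
		/*
		 * No extra references: the refcount is now frozen to 0, so
		 * __split_huge_page() -> lru_add_page_tail() runs with no
		 * isolator able to be holding the page.
		 */
		__split_huge_page(page, list, end, flags);
	} else {
		/*
		 * Someone else (e.g. an lru isolator) holds a reference:
		 * the split fails and lru_add_page_tail() is never reached.
		 */
		ret = -EBUSY;
	}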

> 
> > and I don't know what was the
> > bug which was never triggered.  
> 
> So the only path to the removed part should be a bug, like  sth here,
> https://lkml.org/lkml/2020/7/10/118
> or
> https://lkml.org/lkml/2020/7/10/972

Oh, the use of split_huge_page() in __iommu_dma_alloc_pages() is just
nonsense, I thought it had already been removed - perhaps some debate
over __GFP_COMP held it up.  Not something you need worry about in
this patchset.

> 
> > Stick with whatever text you end up with
> > for the combination of 05/32 and 18/32, and I'll rewrite it after.
> 
> I am not object to merge them into one, I just don't know how to say
> clear about 2 patches in commit log. As patch 18, TestClearPageLRU
> add the incorrect posibility of remove lru bit during split, that's
> the reason of code path rewrite and a WARN there.

I did not know that was why you were putting 18/32 in at that
point, it does not mention TestClearPageLRU at all.  But the fact
remains that it's a nice cleanup, contains a reassuring WARN if we
got it wrong (and I've suggested a WARN on the other branch too),
it was valid before your changes, and it's valid after your changes.
Please merge it back into the uglier 05/32, and again I'll rewrite
whatever comment you come up with if necessary.

> > 
> >>> [PATCH v18 06/32] mm/thp: narrow lru locking
> >>> Why? What part does this play in the series? "narrow lru locking" can
> >>> also be described as "widen page cache locking": 
> >>
> >> Uh, the page cache locking isn't widen, it's still on the old place.
> > 
> > I'm not sure if you're joking there. Perhaps just a misunderstanding.
> > 
> > Yes, patch 06/32 does not touch the xa_lock(&mapping->i_pages) and
> > xa_lock(&swap_cache->i_pages) lines (odd how we've arrived at two of
> > those, but please do not get into cleaning it up now); but it removes
> > the spin_lock_irqsave(&pgdata->lru_lock, flags) which used to come
> > before them, and inserts a spin_lock(&pgdat->lru_lock) after them.
> > 
> > You call that narrowing the lru locking, okay, but I see it as also
> > pushing the page cache locking outwards: before this patch, page cache
> > lock was taken inside lru_lock; after this patch, page cache lock is
> > taken outside lru_lock.  If you cannot see that, then I think you
> > should not have touched this code at all; but it's what we have
> > been testing, and I think we should go forward with it.
> > 
> >>> But I wish you could give some reason for it in the commit message!
> >>
> >> It's a head scratch task. Would you like to tell me what's detailed info 
> >> should be there? Thanks!
> > 
> > So, you don't know why you did it either: then it will be hard to
> > justify.  I guess I'll have to write something for it later.  I'm
> > strongly tempted just to drop the patch, but expect it will become
> > useful later, for using lock_page_memcg() before getting lru_lock.
> > 
> 
> I thought the xa_lock and lru_lock relationship was described clear
> in the commit log,

You say "lru_lock and page cache xa_lock have no reason with current
sequence", but you give no reason for inverting their sequence:
"let's" is not a reason.

> and still no idea of the move_lock in the chain.

memcg->move_lock is what's at the heart of lock_page_memcg(), but
as much as possible that tries to avoid the overhead of actually
taking it, since moving memcg is a rare operation.  For lock ordering,
see the diagram in mm/rmap.c, which 23/32 updates to match this change.

Before this commit: lru_lock > move_lock > i_pages lock was the
expected lock ordering (but it looks as if the lru_lock > move_lock
requirement came from my per-memcg lru_lock patches).

After this commit:  move_lock > i_pages lock > lru_lock is the
required lock ordering, since there are strong reasons (in dirty
writeback) for move_lock > i_pages lock.
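
Written out in the style of that diagram (just summarizing the two
paragraphs above, nothing new):

	/*
	 * before 06/32:  lru_lock
	 *                  -> memcg->move_lock
	 *                       -> mapping->i_pages lock
	 *
	 * after  06/32:  memcg->move_lock
	 *                  -> mapping->i_pages lock
	 *                       -> lru_lock
	 */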

> Please refill them for what I overlooked.

Will do, but not before reviewing your remaining patches.

> Thanks!
> 
> >>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> >>> Is that correct? Or Wei Yang suggested some part of it perhaps?
> >>
> >> Yes, we talked a lot to confirm the locking change is safe.
> > 
> > Okay, but the patch was written by you, and sent by you to Andrew:
> > that is not a case for "Signed-off-by: Someone Else".
> > 
> 
> Ok. let's remove his signed-off.
> 
> >>> [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock
> >>> Could we please drop this one for the moment? And come back to it later
> >>> when the basic series is safely in.  It's a good idea to try sorting
> >>> together those pages which come under the same lock (though my guess is
> >>> that they naturally gather themselves together quite well already); but
> >>> I'm not happy adding 360 bytes to the kernel stack here (and that in
> >>> addition to 192 bytes of horrid pseudo-vma in the shmem swapin case),
> >>> though that could be avoided by making it per-cpu. But I hope there's
> >>> a simpler way of doing it, as efficient, but also useful for the other
> >>> pagevec operations here: perhaps scanning the pagevec for same page->
> >>> mem_cgroup (and flags node bits), NULLing entries as they are done.
> >>> Another, easily fixed, minor defect in this patch: if I'm reading it
> >>> right, it reverses the order in which the pages are put on the lru?
> >>
> >> this patch could give about 10+% performance gain on my multiple memcg
> >> readtwice testing. fairness locking cost the performance much.
> > 
> > Good to know, should have been mentioned.  s/fairness/Repeated/
> > 
> > But what was the gain or loss on your multiple memcg readtwice
> > testing without this patch, compared against node-only lru_lock?
> > The 80% gain mentioned before, I presume.  So this further
> > optimization can wait until the rest is solid.
> 
> the gain based on the patch 26.

If I understand your brief comment there, you're saying that
in a fixed interval of time, the baseline 5.9-rc did 100 runs,
the patches up to and including 26/32 did 180 runs, then with
27/32 on top, did 198 runs?

That's a good improvement by 27/32, but not essential for getting
the patchset in: I don't think 27/32 is the right way to do it,
so I'd still prefer to hold it back from the "initial offering".

> 
> > 
> >>
> >> I also tried per cpu solution but that cause much trouble of per cpu func
> >> things, and looks no benefit except a bit struct size of stack, so if 
> >> stack size still fine. May we could use the solution and improve it better.
> >> like, functionlize, fix the reverse issue etc.
> > 
> > I don't know how important the stack depth consideration is nowadays:
> > I still care, maybe others don't, since VMAP_STACK became an option.
> > 
> > Yes, please fix the reversal (if I was right on that); and I expect
> > you could use a singly linked list instead of the double.
> 
> single linked list is more saving, but do we have to reverse walking to seek
> the head or tail for correct sequence?

I imagine all you need is to start off with a
	for (i = pagevec_count(pvec) - 1; i >= 0; i--)
loop.
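
Fleshed out slightly, just to show the shape (the grouping and
list-building logic is left out; the names are only those already in
the quoted patch context):

	int i;

	for (i = pagevec_count(pvec) - 1; i >= 0; i--) {
		struct page *page = pvec->pages[i];

		/* collect page here; walking backwards compensates for
		 * the order reversal discussed above */
	}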

> 
> > 
> > But I'll look for an alternative - later, once the urgent stuff
> > is completed - and leave the acks on this patch to others.
> 
> Ok, looking forward for your new solution!
> 
> Thanks
> Alex


* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-10 14:24             ` Alexander Duyck
@ 2020-09-12  5:12               ` Hugh Dickins
  0 siblings, 0 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-09-12  5:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Hugh Dickins, Alex Shi, Andrew Morton, Mel Gorman, Tejun Heo,
	Konstantin Khlebnikov, Daniel Jordan, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov, shy828301,
	Vlastimil Babka, Minchan Kim, Qian Cai

On Thu, 10 Sep 2020, Alexander Duyck wrote:
> On Wed, Sep 9, 2020 at 5:32 PM Hugh Dickins <hughd@google.com> wrote:
> > On Wed, 9 Sep 2020, Alexander Duyck wrote:
> > > On Tue, Sep 8, 2020 at 4:41 PM Hugh Dickins <hughd@google.com> wrote:
> > > > [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block
> > > > Most of this consists of replacing "locked" by "lruvec", which is good:
> > > > but please fold those changes back into 20/32 (or would it be 17/32?
> > > > I've not yet looked into the relationship between those two), so we
> > > > can then see more clearly what change this 28/32 (will need renaming!)
> > > > actually makes, to use lruvec_holds_page_lru_lock(). That may be a
> > > > good change, but it's mixed up with the "locked"->"lruvec" at present,
> > > > and I think you could have just used lruvec for locked all along
> > > > (but of course there's a place where you'll need new_lruvec too).
> > >
> > > I am good with my patch being folded in. No need to keep it separate.
> >
> > Thanks.  Though it was only the "locked"->"lruvec" changes I was
> > suggesting to fold back, to minimize the diff, so that we could
> > see your use of lruvec_holds_page_lru_lock() more clearly - you
> > had not introduced that function at the stage of the earlier patches.
> >
> > But now that I stare at it again, using lruvec_holds_page_lru_lock()
> > there doesn't look like an advantage to me: when it decides no, the
> > same calculation is made all over again in mem_cgroup_page_lruvec(),
> > whereas the code before only had to calculate it once.
> >
> > So, the code before looks better to me: I wonder, do you think that
> > rcu_read_lock() is more expensive than I think it is?  There can be
> > debug instrumentation that makes it heavier, but by itself it is
> > very cheap (by design) - not worth branching around.
> 
> Actually what I was more concerned with was the pointer chase that
> required the RCU lock. With this function we are able to compare a
> pair of pointers from the page and the lruvec and avoid the need for
> the RCU lock. The way the old code was working we had to crawl through
> the memcg to get to the lruvec before we could compare it to the one
> we currently hold. The general idea is to use the data we have instead
> of having to pull in some additional cache lines to perform the test.

When you say "With this function...", I think you are referring to
lruvec_holds_page_lru_lock().  Yes, I appreciate what you're doing
there, making calculations from known-stable data, and taking it no
further than the required comparison; and I think (I don't yet claim
to have reviewed 21/32) what you do with it in relock_page_lruvec*()
is an improvement over what we had there before.

But here I'm talking about using it in isolate_migratepages_block()
in 28/32: in this case, the code before evaluated the new lruvec,
compared against the old, and immediately used the new lruvec if
different; whereas using lruvec_holds_page_lru_lock() makes an
almost (I agree not entirely, and I haven't counted cachelines)
equivalent evaluation, but its results have to be thrown away when
it's false, then the new lruvec actually calculated and used.

The same "results thrown away" criticism can be made of
relock_page_lruvec*(), but what was done there before your rewrite
in v18 was no better: they both resort to lock_page_lruvec*(page),
working it all out again from page.  And I'm not suggesting that
be changed, not at this point anyway; but 28/32 looks to me
like a regression from what was done there before 28/32.
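
To sketch what I mean (simplified from the two versions, using the
patchset's helpers as I read them, so treat the details as approximate):

	/* before 28/32: evaluate the new lruvec once, under rcu, and reuse it */
	rcu_read_lock();
	new_lruvec = mem_cgroup_page_lruvec(page, pgdat);
	if (new_lruvec != locked_lruvec) {
		if (locked_lruvec)
			unlock_page_lruvec_irqrestore(locked_lruvec, flags);
		locked_lruvec = new_lruvec;
		spin_lock_irqsave(&locked_lruvec->lru_lock, flags);
	}
	rcu_read_unlock();

	/* with 28/32: test first, then work the lruvec out all over again */
	if (!locked_lruvec ||
	    !lruvec_holds_page_lru_lock(page, locked_lruvec)) {
		if (locked_lruvec)
			unlock_page_lruvec_irqrestore(locked_lruvec, flags);
		locked_lruvec = lock_page_lruvec_irqsave(page, &flags);
	}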

> 
> > >
> > > > [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block
> > > > NAK. I agree that isolate_migratepages_block() looks nicer this way, but
> > > > take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is
> > > > where set_page_refcounted() changes page->_refcount from 0 to 1, allowing
> > > > a racing get_page_unless_zero() to succeed; then later prep_compound_page()
> > > > is where PageHead and PageTails get set. So there's a small race window in
> > > > which this patch could deliver a compound page when it should not.
> > >
> > > So the main motivation for the patch was to avoid the case where we
> > > are having to reset the LRU flag.
> >
> > That would be satisfying.  Not necessary, but I agree satisfying.
> > Maybe depends also on your "skip" change, which I've not looked at yet?
> 
> My concern is that we have scenarios where isolate_migratepages_block
> could possibly prevent another thread from being able to isolate a page.
> I'm mostly concerned with us potentially creating something like an
> isolation leak if multiple threads are doing something like clearing
> and then resetting the LRU flag. In my mind, if we clear the LRU flag
> we should be certain we are going to remove the page, as otherwise
> another thread would have done it had it been allowed access.

I agree it's nicer not to TestClearPageLRU unnecessarily; but if the
occasional unnecessary TestClearPageLRU were really a concern, then
there's a lot of more serious places to worry about - page reclaim
is the great isolator that comes first to my mind.

> 
> > > One question I would have is what if
> > > we swapped the code block with the __isolate_lru_page_prepare section?
> > > With that we would be taking a reference on the page, then verifying
> > > the LRU flag is set, and then testing for the compound page flag bit.
> > > Would doing that close the race window, since the LRU flag being set
> > > should indicate that the allocation has already been completed, has it
> > > not?
> >
> > Yes, I think that would be safe, and would look better.  But I am
> > very hesitant to give snap assurances here (I've twice missed out
> > a vital PageLRU check from this sequence myself): it is very easy
> > to deceive myself and only see it later.
> 
> I'm not looking for assurances, just sanity checks to make sure I am
> not missing something obvious.
> 
> > If you can see a bug in what's there before these patches, certainly
> > we need to fix it.  But adding non-essential patches to the already
> > overlong series risks delaying it.
> 
> My concern ends up being that if we are clearing the bit and restoring
> it while holding the LRU lock we can effectively cause pages to become
> pseudo-pinned on the LRU. In my mind I would want us to avoid clearing
> the LRU flag until we know we are going to be pulling the page from
> the list once we take the lruvec lock. I interpret clearing of the
> flag to indicate the page has already been pulled, it just hasn't left
> the list yet. With us resetting the bit we are violating that which I
> worry will lead to issues.

Your concern and my concern are different, but we are "on the same page".

I've said repeatedly (to Alex) that I am not at ease with this
TestClearPageLRU() technique: he has got it working, reliably, but
I find it hard to reason about.  Perhaps I'm just too used to what
was there before, but clearing PageLRU and removing from LRU while
holding lru_lock seems natural to me; whereas disconnecting them
leaves us on shaky ground, adding comments and warnings about the
peculiar races involved.  And it adds a pair of atomic operations
on each page in pagevec_lru_move_fn(), which were not needed before.
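
For reference, that pattern is roughly this (a simplified sketch of the
patchset's pagevec_lru_move_fn(), assuming its relock_page_lruvec_irqsave()
helper, not the exact code):

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* first atomic op: pins down the page's lruvec for us */
		if (!TestClearPageLRU(page))
			continue;

		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
		(*move_fn)(page, lruvec);

		/* second atomic op: make the page visible on its lru again */
		SetPageLRU(page);
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);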

I want us to go ahead with TestClearPageLRU, to get the series into
mmotm and under wider testing.  But if we accept the lock reordering
in 06/32, then it becomes possible to replace those TestClearPageLRUs
by lock_page_memcg()s: which in principle should be cheaper, but that
will have to be measured.

Hugh


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-09 15:08         ` Alex Shi
  2020-09-09 23:16           ` Hugh Dickins
@ 2020-09-12  8:38           ` Hugh Dickins
  2020-09-13 14:22             ` Alex Shi
  1 sibling, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-12  8:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: Hugh Dickins, Andrew Morton, mgorman, tj, khlebnikov,
	daniel.m.jordan, willy, hannes, lkp, linux-mm, linux-kernel,
	cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301,
	vbabka, minchan, cai

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5229 bytes --]

On Wed, 9 Sep 2020, Alex Shi wrote:
> On 2020/9/9 7:41 AM, Hugh Dickins wrote:
> > 
> > The use of lock_page_memcg() in __munlock_pagevec() in 20/32,
> > introduced in patchset v17, looks good but it isn't: I was lucky that
> > systemd at reboot did some munlocking that exposed the problem to lockdep.
> > The first time into the loop, lock_page_memcg() is done before lru_lock
> > (as 06/32 has allowed); but the second time around the loop, it is done
> > while still holding lru_lock.
> 
> I don't know the details of what lockdep shows. Just wondering, could it be
> possible to make the move_lock/lru_lock ordering solid?
> Or try another blocking approach like the one mentioned in commit_charge()?
> 
> > 
> > lock_page_memcg() really needs to be absorbed into (a variant of)
> > relock_page_lruvec(), and I do have that (it's awkward because of
> > the different ways in which the IRQ flags are handled).  And out of
> > curiosity, I've also tried using that in mm/swap.c too, instead of the
> > TestClearPageLRU technique: lockdep is happy, but an update_lru_size()
> > warning showed that it cannot safely be mixed with the TestClearPageLRU
> > technique (that I'd left in isolate_lru_page()).  So I'll stash away
> > that relock_page_lruvec(), and consider what's best for mm/mlock.c:
> > now that I've posted these comments so far, that's my priority, then
> > to get the result under testing again, before resuming these comments.
> 
> No idea of your solution yet, but looking forward to your good news! :)

Yes, it is good news, and simpler than anything suggested above.

The main difficulties will probably be to look good in the 80 columns
(I know that limit has been lifted recently, but some of us use xterms
side by side), and to explain it.

mm/mlock.c has not been kept up-to-date very well: and in particular,
you have taken too seriously that "Serialize with any parallel
__split_huge_page_refcount()..." comment that you updated to two
comments "Serialize split tail pages in __split_huge_page_tail()...".

Delete them! The original comment was by Vlastimil for v3.14 in 2014.
But Kirill redesigned THP refcounting for v4.5 in 2016: that's when
__split_huge_page_refcount() went away.  And with the new refcounting,
the THP splitting races that lru_lock protected munlock_vma_page()
and __munlock_pagevec() from: those races have become impossible.

Or maybe there never was such a race in __munlock_pagevec(): you
have added the comment there, assuming lru_lock was for that purpose,
but that was probably just the convenient place to take it,
to cover all the del_page_from_lru()s.

Observe how split_huge_page_to_list() uses unmap_page() to remove
all pmds and all ptes for the huge page being split, and remap_page()
only replaces the migration entries (used for anon but not for shmem
or file) after doing all of the __split_huge_page_tail()s, before
unlocking any of the pages.  Recall that munlock_vma_page() and
__munlock_pagevec() are being applied to pages found mapped
into userspace, by ptes or pmd: there are none of those while
__split_huge_page_tail() is being used, so no race to protect from.

(Could a newly detached tail be freshly faulted into userspace just
before __split_huge_page() has reached the head?  Not quite, the
fault has to wait to get the tail's page lock. But even if it
could, how would that be a problem for __munlock_pagevec()?)

There's lots more that could be said: for example, PageMlocked will
always be clear on the THP head during __split_huge_page_tail(),
because the last unmap of a PageMlocked page does clear_page_mlock().
But that's not required to prove the case, it's just another argument
against the "Serialize" comment you have in __munlock_pagevec().

So, no need for the problematic lock_page_memcg(page) there in
__munlock_pagevec(), nor to lock (or relock) lruvec just below it.
__munlock_pagevec() still needs lru_lock to del_page_from_lru_list(),
of course, but that must be done after your TestClearPageMlocked has
stabilized page->memcg.  Use relock_page_lruvec_irq() here?  I suppose
that will be easiest, but notice how __munlock_pagevec_fill() has
already made sure that all the pages in the pagevec are from the same
zone (and it cannot do the same for memcg without locking page memcg);
so some of relock's work will be redundant.
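
Something of this shape, purely as a sketch (the NR_MLOCK accounting is
elided, and it assumes __munlock_isolate_lru_page() is taught to take the
lruvec):

	struct lruvec *lruvec = NULL;

	for (i = 0; i < nr; i++) {
		struct page *page = pvec->pages[i];

		if (TestClearPageMlocked(page)) {
			/* Mlocked now cleared: page->memcg is stable */
			lruvec = relock_page_lruvec_irq(page, lruvec);
			if (__munlock_isolate_lru_page(page, lruvec, false))
				continue;
			__munlock_isolation_failed(page);
		}
		pagevec_add(&pvec_putback, page);
		pvec->pages[i] = NULL;
	}
	if (lruvec)
		unlock_page_lruvec_irq(lruvec);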

Otherwise, I'm much happier with your mm/mlock.c since looking at it
in more detail: a couple of nits though - drop the clear_page_mlock()
hunk from 25/32 - kernel style says do it the way you are undoing by
-	if (!isolate_lru_page(page)) {
+	if (!isolate_lru_page(page))
 		putback_lru_page(page);
-	} else {
+	else {
I don't always follow that over-braced style when making changes,
but you should not touch otherwise untouched code just to make it
go against the approved style.  And in munlock_vma_page(),
-	if (!TestClearPageMlocked(page)) {
+	if (!TestClearPageMlocked(page))
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
-	}
+		return 0;
please restore the braces: with that comment line in there,
the compiler does not need the braces, but the human eye does.

Hugh

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-12  2:13               ` Hugh Dickins
@ 2020-09-13 14:21                 ` Alex Shi
  2020-09-15  8:21                   ` Hugh Dickins
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-09-13 14:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, mgorman, tj, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, vbabka, minchan,
	cai



On 2020/9/12 10:13 AM, Hugh Dickins wrote:
> On Fri, 11 Sep 2020, Alex Shi wrote:
>> On 2020/9/10 7:16 AM, Hugh Dickins wrote:
>>> On Wed, 9 Sep 2020, Alex Shi wrote:
>>>> On 2020/9/9 7:41 AM, Hugh Dickins wrote:
>>>>>
>>>>> [PATCH v18 05/32] mm/thp: remove code path which never got into
>>>>> This is a good simplification, but I see no sign that you understand
>>>>> why it's valid: it relies on lru_add_page_tail() being called while
>>>>> head refcount is frozen to 0: we would not get this far if someone
>>>>> else holds a reference to the THP - which they must hold if they have
>>>>> isolated the page from its lru (and that's true before or after your
>>>>> per-memcg changes - but even truer after those changes, since PageLRU
>>>>> can then be flipped without lru_lock at any instant): please explain
>>>>> something of this in the commit message.
>>>>
>>>> Is the following commit log better?
>>>>
>>>>     split_huge_page() will never be called on a page which isn't on the
>>>>     lru list, so this code never got a chance to run, and should not be
>>>>     run, to add tail pages on an lru list whose head page isn't there.
>>>>
>>>>     Hugh Dickins mentioned:
>>>>     The path should never be called, since lru_add_page_tail() is called
>>>>     while head refcount is frozen to 0: we would not get this far if someone
>>>>     else holds a reference to the THP - which they must hold if they have
>>>>     isolated the page from its lru.
>>>>
>>>>     Although the bug was never triggered, it'd better be removed for code
>>>>     correctness, with a warning added for unexpected callers.
>>>
>>> Not much better, no.  split_huge_page() can easily be called for a page
>>> which is not on the lru list at the time, 
>>
>> Hi Hugh,
>>
>> Thanks for comments!
>>
>> There are some discussion on this point a couple of weeks ago,
>> https://lkml.org/lkml/2020/7/9/760
>>
>> Matthew Wilcox and Kirill have the following comments,
>>> I don't understand how we get to split_huge_page() with a page that's
>>> not on an LRU list.  Both anonymous and page cache pages should be on
>>> an LRU list.  What am I missing?
>>
>> Right, and it never got removed from the LRU during the split. The tail
>> pages have to be added to the LRU because they are now separate from the
>> head page.
>>
>> -- 
>>  Kirill A. Shutemov
> 
> Yes, those were among the mails that I read through before getting
> down to review.  I was surprised by their not understanding, but
> it was a bit late to reply to that thread.
> 
> Perhaps everybody had been focused on pages which have been and
> naturally belong on an LRU list, rather than pages which are on
> the LRU list at the instant that split_huge_page() is called.
> 
> There are a number of places where PageLRU gets cleared, and a
> number of places where we del_page_from_lru_list(), I think you'll
> agree: your patches touch all or most of them.  Let's think of a
> common one, isolate_lru_pages() used by page reclaim, but the same
> would apply to most of the others.
> 
> Then there a number of places where split_huge_page() is called:
> I am having difficulty finding any of those which cannot race with
> page reclaim, but shall we choose anon THP's deferred_split_scan(),
> or shmem THP's shmem_punch_compound()?
> 
> What prevents either of those from calling split_huge_page() at
> a time when isolate_lru_pages() has removed the page from LRU?
> 
> But there's no problem in this race, because anyone isolating the
> page from LRU must hold their own reference to the page (to prevent
> it from being freed independently), and the can_split_huge_page() or
> page_ref_freeze() in split_huge_page_to_list() will detect that and
> fail the split with -EBUSY (or else succeed and prevent new references
> from being acquired).  So this case never reaches lru_add_page_tail().
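> 
> For illustration, the guard I mean looks roughly like this (a much
> simplified sketch of split_huge_page_to_list(), not the exact code):
> 
> 	if (!can_split_huge_page(head, &extra_pins))
> 		return -EBUSY;
> 	...
> 	if (page_ref_freeze(head, 1 + extra_pins)) {
> 		/* no extra references: lru_add_page_tail() runs in here */
> 		__split_huge_page(page, list, end, flags);
> 	} else {
> 		/* an isolator's pin, like any other, fails the split */
> 		ret = -EBUSY;
> 	}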

Hi Hugh,

Thanks for comments!

We are on the same page here: we all know split_huge_page_to_list() could
block them from going further, and the code is functionally right.
If the comment 'Split start from PageLRU(head), and ...' doesn't make
things as clear as it should, I am glad to see you rewrite and improve
it.

> 
>>
>>> and I don't know what was the
>>> bug which was never triggered.  
>>
>> So the only path to the removed part would be a bug, like something here,
>> https://lkml.org/lkml/2020/7/10/118
>> or
>> https://lkml.org/lkml/2020/7/10/972
> 
> Oh, the use of split_huge_page() in __iommu_dma_alloc_pages() is just
> nonsense, I thought it had already been removed - perhaps some debate
> over __GFP_COMP held it up.  Not something you need worry about in
> this patchset.
> 
>>
>>> Stick with whatever text you end up with
>>> for the combination of 05/32 and 18/32, and I'll rewrite it after.
>>
>> I do not object to merging them into one, I just don't know how to be
>> clear about the 2 patches in the commit log. As for patch 18,
>> TestClearPageLRU adds the possibility of incorrectly removing the lru
>> bit during split; that's the reason for the code path rewrite and a
>> WARN there.
> 
> I did not know that was why you were putting 18/32 in at that
> point, it does not mention TestClearPageLRU at all.  But the fact
> remains that it's a nice cleanup, contains a reassuring WARN if we
> got it wrong (and I've suggested a WARN on the other branch too),
> it was valid before your changes, and it's valid after your changes.
> Please merge it back into the uglier 05/32, and again I'll rewrite
> whatever comment you come up with if necessary.

I merged them together on the following git branch, and leave the commit
log to you. :)

https://github.com/alexshi/linux.git lruv19
> 
>>>
>>>>> [PATCH v18 06/32] mm/thp: narrow lru locking
>>>>> Why? What part does this play in the series? "narrow lru locking" can
>>>>> also be described as "widen page cache locking": 
>>>>
>>>> Uh, the page cache locking isn't widened, it's still in the old place.
>>>
>>> I'm not sure if you're joking there. Perhaps just a misunderstanding.
>>>
>>> Yes, patch 06/32 does not touch the xa_lock(&mapping->i_pages) and
>>> xa_lock(&swap_cache->i_pages) lines (odd how we've arrived at two of
>>> those, but please do not get into cleaning it up now); but it removes
>>> the spin_lock_irqsave(&pgdata->lru_lock, flags) which used to come
>>> before them, and inserts a spin_lock(&pgdat->lru_lock) after them.
>>>
>>> You call that narrowing the lru locking, okay, but I see it as also
>>> pushing the page cache locking outwards: before this patch, page cache
>>> lock was taken inside lru_lock; after this patch, page cache lock is
>>> taken outside lru_lock.  If you cannot see that, then I think you
>>> should not have touched this code at all; but it's what we have
>>> been testing, and I think we should go forward with it.
>>>
>>>>> But I wish you could give some reason for it in the commit message!
>>>>
>>>> It's a head-scratching task. Would you like to tell me what detailed info
>>>> should be there? Thanks!
>>>
>>> So, you don't know why you did it either: then it will be hard to
>>> justify.  I guess I'll have to write something for it later.  I'm
>>> strongly tempted just to drop the patch, but expect it will become
>>> useful later, for using lock_page_memcg() before getting lru_lock.
>>>
>>
>> I thought the xa_lock and lru_lock relationship was described clearly
>> in the commit log,
> 
> You say "lru_lock and page cache xa_lock have no reason with current
> sequence", but you give no reason for inverting their sequence:
> "let's" is not a reason.
> 
>> and I still have no idea of the move_lock in the chain.
> 
> memcg->move_lock is what's at the heart of lock_page_memcg(), but
> as much as possible that tries to avoid the overhead of actually
> taking it, since moving memcg is a rare operation.  For lock ordering,
> see the diagram in mm/rmap.c, which 23/32 updates to match this change.

I see. thanks!

> 
> Before this commit: lru_lock > move_lock > i_pages lock was the
> expected lock ordering (but it looks as if the lru_lock > move_lock
> requirement came from my per-memcg lru_lock patches).
> 
> After this commit:  move_lock > i_pages lock > lru_lock is the
> required lock ordering, since there are strong reasons (in dirty
> writeback) for move_lock > i_pages lock.
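> 
> Pictured in the style of the mm/rmap.c lock-ordering comment (just a
> sketch here, not the actual diagram that 23/32 updates):
> 
> 	lock_page_memcg (memcg->move_lock)
> 	  mapping->i_pages lock
> 	    lruvec->lru_lock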
> 
>> Please refill them for what I overlooked.
> 
> Will do, but not before reviewing your remaining patches.

IIRC, all of the comments are accepted and pushed to
https://github.com/alexshi/linux.git lruv19
If you don't mind, could you change everything and send out a new version
for further review?

> 
>> Thanks!
>>
>>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>>>> Is that correct? Or Wei Yang suggested some part of it perhaps?
>>>>
>>>> Yes, we talked a lot to confirm the locking change is safe.
>>>
>>> Okay, but the patch was written by you, and sent by you to Andrew:
>>> that is not a case for "Signed-off-by: Someone Else".
>>>
>>
>> Ok. let's remove his signed-off.
>>
>>>>> [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock
>>>>> Could we please drop this one for the moment? And come back to it later
>>>>> when the basic series is safely in.  It's a good idea to try sorting
>>>>> together those pages which come under the same lock (though my guess is
>>>>> that they naturally gather themselves together quite well already); but
>>>>> I'm not happy adding 360 bytes to the kernel stack here (and that in
>>>>> addition to 192 bytes of horrid pseudo-vma in the shmem swapin case),
>>>>> though that could be avoided by making it per-cpu. But I hope there's
>>>>> a simpler way of doing it, as efficient, but also useful for the other
>>>>> pagevec operations here: perhaps scanning the pagevec for same page->
>>>>> mem_cgroup (and flags node bits), NULLing entries as they are done.
>>>>> Another, easily fixed, minor defect in this patch: if I'm reading it
>>>>> right, it reverses the order in which the pages are put on the lru?
>>>>
>>>> this patch could give about 10+% performance gain on my multiple memcg
>>>> readtwice testing. fairness locking cost the performance much.
>>>
>>> Good to know, should have been mentioned.  s/fairness/Repeated/
>>>
>>> But what was the gain or loss on your multiple memcg readtwice
>>> testing without this patch, compared against node-only lru_lock?
>>> The 80% gain mentioned before, I presume.  So this further
>>> optimization can wait until the rest is solid.
>>
>> the gain based on the patch 26.
> 
> If I understand your brief comment there, you're saying that
> in a fixed interval of time, the baseline 5.9-rc did 100 runs,
> the patches up to and including 26/32 did 180 runs, then with
> 27/32 on top, did 198 runs?

Uh, I updated the testing with some new results here:
https://lkml.org/lkml/2020/8/26/212

> 
> That's a good improvement by 27/32, but not essential for getting
> the patchset in: I don't think 27/32 is the right way to do it,
> so I'd still prefer to hold it back from the "initial offering".

I am ok to hold it back.
> 
>>
>>>
>>>>
>>>> I also tried a per-cpu solution, but that caused much trouble with the
>>>> per-cpu infrastructure and showed no benefit except a slightly smaller
>>>> struct on the stack; so if the stack size is still fine, maybe we could
>>>> keep this solution and improve it further: functionalize it, fix the
>>>> reverse issue, etc.
>>>
>>> I don't know how important the stack depth consideration is nowadays:
>>> I still care, maybe others don't, since VMAP_STACK became an option.
>>>
>>> Yes, please fix the reversal (if I was right on that); and I expect
>>> you could use a singly linked list instead of the double.
>>
>> a singly linked list saves more, but do we have to walk it in reverse to
>> seek the head or tail for the correct sequence?
> 
> I imagine all you need is to start off with a
> 	for (i = pagevec_count(pvec) - 1; i >= 0; i--)

a nice simple solution, thanks!

Thanks
alex

> loop.
> 
>>
>>>
>>> But I'll look for an alternative - later, once the urgent stuff
>>> is completed - and leave the acks on this patch to others.
>>
>> Ok, looking forward to your new solution!
>>
>> Thanks
>> Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-12  8:38           ` Hugh Dickins
@ 2020-09-13 14:22             ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-13 14:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, mgorman, tj, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, vbabka, minchan,
	cai



On 2020/9/12 4:38 PM, Hugh Dickins wrote:
> On Wed, 9 Sep 2020, Alex Shi wrote:
>> On 2020/9/9 7:41 AM, Hugh Dickins wrote:
>>>
>>> The use of lock_page_memcg() in __munlock_pagevec() in 20/32,
>>> introduced in patchset v17, looks good but it isn't: I was lucky that
>>> systemd at reboot did some munlocking that exposed the problem to lockdep.
>>> The first time into the loop, lock_page_memcg() is done before lru_lock
>>> (as 06/32 has allowed); but the second time around the loop, it is done
>>> while still holding lru_lock.
>>
>> I don't know the details of what lockdep shows. Just wondering, could it be
>> possible to make the move_lock/lru_lock ordering solid?
>> Or try another blocking approach like the one mentioned in commit_charge()?
>>
>>>
>>> lock_page_memcg() really needs to be absorbed into (a variant of)
>>> relock_page_lruvec(), and I do have that (it's awkward because of
>>> the different ways in which the IRQ flags are handled).  And out of
>>> curiosity, I've also tried using that in mm/swap.c too, instead of the
>>> TestClearPageLRU technique: lockdep is happy, but an update_lru_size()
>>> warning showed that it cannot safely be mixed with the TestClearPageLRU
>>> technique (that I'd left in isolate_lru_page()).  So I'll stash away
>>> that relock_page_lruvec(), and consider what's best for mm/mlock.c:
>>> now that I've posted these comments so far, that's my priority, then
>>> to get the result under testing again, before resuming these comments.
>>
>> No idea of your solution yet, but looking forward to your good news! :)
> 
> Yes, it is good news, and simpler than anything suggested above.

Awesome!
> 
> The main difficulties will probably be to look good in the 80 columns
> (I know that limit has been lifted recently, but some of us use xterms
> side by side), and to explain it.
> 
> mm/mlock.c has not been kept up-to-date very well: and in particular,
> you have taken too seriously that "Serialize with any parallel
> __split_huge_page_refcount()..." comment that you updated to two
> comments "Serialize split tail pages in __split_huge_page_tail()...".
> 
> Delete them! The original comment was by Vlastimil for v3.14 in 2014.
> But Kirill redesigned THP refcounting for v4.5 in 2016: that's when
> __split_huge_page_refcount() went away.  And with the new refcounting,
> the THP splitting races that lru_lock protected munlock_vma_page()
> and __munlock_pagevec() from: those races have become impossible.
> 
> Or maybe there never was such a race in __munlock_pagevec(): you
> have added the comment there, assuming lru_lock was for that purpose,
> but that was probably just the convenient place to take it,
> to cover all the del_page_from_lru()s.
> 
> Observe how split_huge_page_to_list() uses unmap_page() to remove
> all pmds and all ptes for the huge page being split, and remap_page()
> only replaces the migration entries (used for anon but not for shmem
> or file) after doing all of the __split_huge_page_tail()s, before
> unlocking any of the pages.  Recall that munlock_vma_page() and
> __munlock_pagevec() are being applied to pages found mapped
> into userspace, by ptes or pmd: there are none of those while
> __split_huge_page_tail() is being used, so no race to protect from.
> 
> (Could a newly detached tail be freshly faulted into userspace just
> before __split_huge_page() has reached the head?  Not quite, the
> fault has to wait to get the tail's page lock. But even if it
> could, how would that be a problem for __munlock_pagevec()?)
> 
> There's lots more that could be said: for example, PageMlocked will
> always be clear on the THP head during __split_huge_page_tail(),
> because the last unmap of a PageMlocked page does clear_page_mlock().
> But that's not required to prove the case, it's just another argument
> against the "Serialize" comment you have in __munlock_pagevec().
> 
> So, no need for the problematic lock_page_memcg(page) there in
> __munlock_pagevec(), nor to lock (or relock) lruvec just below it.
> __munlock_pagevec() still needs lru_lock to del_page_from_lru_list(),
> of course, but that must be done after your TestClearPageMlocked has
> stabilized page->memcg.  Use relock_page_lruvec_irq() here?  I suppose
> that will be easiest, but notice how __munlock_pagevec_fill() has
> already made sure that all the pages in the pagevec are from the same
> zone (and it cannot do the same for memcg without locking page memcg);
> so some of relock's work will be redundant.

It sounds reasonable to me.

> 
> Otherwise, I'm much happier with your mm/mlock.c since looking at it
> in more detail: a couple of nits though - drop the clear_page_mlock()
> hunk from 25/32 - kernel style says do it the way you are undoing by
> -	if (!isolate_lru_page(page)) {
> +	if (!isolate_lru_page(page))
>  		putback_lru_page(page);
> -	} else {
> +	else {
> I don't always follow that over-braced style when making changes,
> but you should not touch otherwise untouched code just to make it
> go against the approved style.  And in munlock_vma_page(),
> -	if (!TestClearPageMlocked(page)) {
> +	if (!TestClearPageMlocked(page))
>  		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
> -		nr_pages = 1;
> -		goto unlock_out;
> -	}
> +		return 0;
> please restore the braces: with that comment line in there,
> the compiler does not need the braces, but the human eye does.

Yes, it's better to keep the braces there.

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 06/32] mm/thp: narrow lru locking
  2020-09-11  3:37     ` Alex Shi
@ 2020-09-13 15:27       ` Matthew Wilcox
  2020-09-19  1:00         ` Hugh Dickins
  0 siblings, 1 reply; 102+ messages in thread
From: Matthew Wilcox @ 2020-09-13 15:27 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Andrea Arcangeli

On Fri, Sep 11, 2020 at 11:37:50AM +0800, Alex Shi wrote:
> 
> 
> On 2020/9/10 9:49 PM, Matthew Wilcox wrote:
> > On Mon, Aug 24, 2020 at 08:54:39PM +0800, Alex Shi wrote:
> >> lru_lock and page cache xa_lock have no reason with current sequence,
> >> put them together isn't necessary. let's narrow the lru locking, but
> >> left the local_irq_disable to block interrupt re-entry and statistic update.
> > 
> > What stats are you talking about here?
> 
> Hi Matthew,
> 
> Thanks for comments!
> 
> like __dec_node_page_state(head, NR_SHMEM_THPS); would trigger a preemption warning...

OK, but those stats are guarded by 'if (mapping)', so this patch doesn't
produce that warning because we'll have taken the xarray lock and disabled
interrupts.

> > How about this patch instead?  It occurred to me we already have
> > perfectly good infrastructure to track whether or not interrupts are
> > already disabled, and so we should use that instead of ensuring that
> > interrupts are disabled, or tracking that ourselves.
> 
> So your proposal looks like:
> 1, xa_lock_irq(&mapping->i_pages); (optional)
> 2, spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> 3, spin_lock_irqsave(&pgdat->lru_lock, flags);
> 
> Are the flags meaningful for the 2nd and 3rd locks?

Yes.  We want to avoid doing:

	if (mapping)
		spin_lock(&ds_queue->split_queue_lock);
	else
		spin_lock_irq(&ds_queue->split_queue_lock);
...
	if (mapping)
		spin_unlock(&ds_queue->split_queue_lock);
	else
		spin_unlock_irq(&ds_queue->split_queue_lock);

Just using _irqsave has the same effect and is easier to reason about.
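
i.e. simply (a sketch):

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
...
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

which does the right thing whether or not the xarray lock above it has
already disabled interrupts.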

> IIRC, I had a similar proposal to yours, with the flags used in
> xa_lock_irqsave(), but it was objected to by Hugh.

I imagine Hugh's objection was that we know it's safe to disable/enable
interrupts here because we're in a sleepable context.  But for the
other two locks, we'd rather not track whether we've already disabled
interrupts or not.

Maybe you could dig up the email from Hugh?  I can't find it.



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-13 14:21                 ` Alex Shi
@ 2020-09-15  8:21                   ` Hugh Dickins
  2020-09-15 16:58                     ` Daniel Jordan
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-15  8:21 UTC (permalink / raw)
  To: Alex Shi
  Cc: Hugh Dickins, Andrew Morton, mgorman, tj, khlebnikov,
	daniel.m.jordan, willy, hannes, lkp, linux-mm, linux-kernel,
	cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301,
	vbabka, minchan, cai

On Sun, 13 Sep 2020, Alex Shi wrote:
> 
> IIRC, all of the comments are accepted and pushed to
> https://github.com/alexshi/linux.git lruv19

I just had to relax for the weekend, so no progress from me.
I'll take a look at your tree tomorrow, er, later today.

> If you don't mind, could you change everything and send out a new version
> for further review?

Sorry, no.  Tiresome though it is for both of us, I'll continue
to send you comments, and leave all the posting to you.

> Uh, I updated the testing with some new results here:
> https://lkml.org/lkml/2020/8/26/212

Right, I missed that, that's better, thanks.  Any other test results?

Hugh


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-15  8:21                   ` Hugh Dickins
@ 2020-09-15 16:58                     ` Daniel Jordan
  2020-09-16 12:44                       ` Alex Shi
  2020-09-17  2:37                       ` Alex Shi
  0 siblings, 2 replies; 102+ messages in thread
From: Daniel Jordan @ 2020-09-15 16:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alex Shi, Andrew Morton, mgorman, tj, khlebnikov,
	daniel.m.jordan, willy, hannes, lkp, linux-mm, linux-kernel,
	cgroups, shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301,
	vbabka, minchan, cai

[-- Attachment #1: Type: text/plain, Size: 8263 bytes --]

On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
> On Sun, 13 Sep 2020, Alex Shi wrote:
> > Uh, I updated the testing with some new results here:
> > https://lkml.org/lkml/2020/8/26/212
> 
> Right, I missed that, that's better, thanks.  Any other test results?

Alex, you were doing some will-it-scale runs earlier.  Are you planning to do
more of those?  Otherwise I can add them in.

This is what I have so far.


sysbench oltp read-only
-----------------------

The goal was to run a real world benchmark, at least more so than something
like vm-scalability, with the memory controller enabled but unused to check for
regressions.

I chose sysbench because it was relatively straightforward to run, but I'm open
to ideas for other high level benchmarks that might be more sensitive to this
series.

CoeffVar shows the test was pretty noisy overall.  It's nice to see there's no
significant difference between the kernels for low thread counts (1-12), but
I'm not sure what to make of the 18 and 20 thread cases.  At 20 threads, the
CPUs of the node that the test was confined to were saturated and the variance
is especially high.  I'm tempted to write the 18 and 20 thread cases off as
noise.

- 2-socket * 10-core * 2-hyperthread broadwell server
- test bound to node 1 to lower variance
- 251G memory, divided evenly between the nodes (memory size of test shrunk to
  accommodate confining to one node)
- 12 iterations per thread count per kernel
- THP enabled

export OLTP_CACHESIZE=$(($MEMTOTAL_BYTES/4))
export OLTP_SHAREDBUFFERS=$((MEMTOTAL_BYTES/8))
export OLTP_PAGESIZES="default"
export SYSBENCH_DRIVER=postgres
export SYSBENCH_MAX_TRANSACTIONS=auto
export SYSBENCH_READONLY=yes
export SYSBENCH_MAX_THREADS=$((NUMCPUS / 2))
export SYSBENCH_ITERATIONS=12
export SYSBENCH_WORKLOAD_SIZE=$((MEMTOTAL_BYTES*3/8))
export SYSBENCH_CACHE_COLD=no
export DATABASE_INIT_ONCE=yes

export MMTESTS_NUMA_POLICY=fullbind_single_instance_node
numactl --cpunodebind=1 --membind=1 <mmtests_cmdline>

sysbench Transactions per second
                            5.9-rc2        5.9-rc2-lru-v18
Min       1       593.23 (   0.00%)      583.37 (  -1.66%)
Min       4      1897.34 (   0.00%)     1871.77 (  -1.35%)
Min       7      2471.14 (   0.00%)     2449.77 (  -0.86%)
Min       12     2680.00 (   0.00%)     2853.25 (   6.46%)
Min       18     2183.82 (   0.00%)     1191.43 ( -45.44%)
Min       20      924.96 (   0.00%)      526.66 ( -43.06%)
Hmean     1       912.08 (   0.00%)      904.24 (  -0.86%)
Hmean     4      2057.11 (   0.00%)     2044.69 (  -0.60%)
Hmean     7      2817.59 (   0.00%)     2812.80 (  -0.17%)
Hmean     12     3201.05 (   0.00%)     3171.09 (  -0.94%)
Hmean     18     2529.10 (   0.00%)     2009.99 * -20.53%*
Hmean     20     1742.29 (   0.00%)     1127.77 * -35.27%*
Stddev    1       219.21 (   0.00%)      220.92 (  -0.78%)
Stddev    4        94.94 (   0.00%)       84.34 (  11.17%)
Stddev    7       189.42 (   0.00%)      167.58 (  11.53%)
Stddev    12      372.13 (   0.00%)      199.40 (  46.42%)
Stddev    18      248.42 (   0.00%)      574.66 (-131.32%)
Stddev    20      757.69 (   0.00%)      666.87 (  11.99%)
CoeffVar  1        22.54 (   0.00%)       22.86 (  -1.42%)
CoeffVar  4         4.61 (   0.00%)        4.12 (  10.60%)
CoeffVar  7         6.69 (   0.00%)        5.94 (  11.30%)
CoeffVar  12       11.49 (   0.00%)        6.27 (  45.46%)
CoeffVar  18        9.74 (   0.00%)       26.22 (-169.23%)
CoeffVar  20       36.32 (   0.00%)       47.18 ( -29.89%)
Max       1      1117.45 (   0.00%)     1107.33 (  -0.91%)
Max       4      2184.92 (   0.00%)     2136.65 (  -2.21%)
Max       7      3086.81 (   0.00%)     3049.52 (  -1.21%)
Max       12     4020.07 (   0.00%)     3580.95 ( -10.92%)
Max       18     3032.30 (   0.00%)     2810.85 (  -7.30%)
Max       20     2891.27 (   0.00%)     2675.80 (  -7.45%)
BHmean-50 1      1098.77 (   0.00%)     1093.58 (  -0.47%)
BHmean-50 4      2139.76 (   0.00%)     2107.13 (  -1.52%)
BHmean-50 7      2972.18 (   0.00%)     2953.94 (  -0.61%)
BHmean-50 12     3494.73 (   0.00%)     3311.33 (  -5.25%)
BHmean-50 18     2729.70 (   0.00%)     2606.32 (  -4.52%)
BHmean-50 20     2668.72 (   0.00%)     1779.87 ( -33.31%)
BHmean-95 1       958.94 (   0.00%)      951.84 (  -0.74%)
BHmean-95 4      2072.98 (   0.00%)     2062.01 (  -0.53%)
BHmean-95 7      2853.96 (   0.00%)     2851.21 (  -0.10%)
BHmean-95 12     3258.65 (   0.00%)     3203.53 (  -1.69%)
BHmean-95 18     2565.99 (   0.00%)     2143.90 ( -16.45%)
BHmean-95 20     1894.47 (   0.00%)     1258.34 ( -33.58%)
BHmean-99 1       958.94 (   0.00%)      951.84 (  -0.74%)
BHmean-99 4      2072.98 (   0.00%)     2062.01 (  -0.53%)
BHmean-99 7      2853.96 (   0.00%)     2851.21 (  -0.10%)
BHmean-99 12     3258.65 (   0.00%)     3203.53 (  -1.69%)
BHmean-99 18     2565.99 (   0.00%)     2143.90 ( -16.45%)
BHmean-99 20     1894.47 (   0.00%)     1258.34 ( -33.58%)

sysbench Time
                            5.9-rc2            5.9-rc2-lru
Min       1         8.96 (   0.00%)        9.04 (  -0.89%)
Min       4         4.63 (   0.00%)        4.74 (  -2.38%)
Min       7         3.34 (   0.00%)        3.38 (  -1.20%)
Min       12        2.65 (   0.00%)        2.95 ( -11.32%)
Min       18        3.54 (   0.00%)        3.80 (  -7.34%)
Min       20        3.74 (   0.00%)        4.02 (  -7.49%)
Amean     1        11.00 (   0.00%)       11.11 (  -0.98%)
Amean     4         4.92 (   0.00%)        4.95 (  -0.59%)
Amean     7         3.65 (   0.00%)        3.65 (  -0.16%)
Amean     12        3.29 (   0.00%)        3.32 (  -0.89%)
Amean     18        4.20 (   0.00%)        5.22 * -24.39%*
Amean     20        6.02 (   0.00%)        9.14 * -51.98%*
Stddev    1         3.33 (   0.00%)        3.45 (  -3.40%)
Stddev    4         0.23 (   0.00%)        0.21 (   7.89%)
Stddev    7         0.25 (   0.00%)        0.22 (   9.87%)
Stddev    12        0.35 (   0.00%)        0.19 (  45.09%)
Stddev    18        0.38 (   0.00%)        1.75 (-354.74%)
Stddev    20        2.93 (   0.00%)        4.73 ( -61.72%)
CoeffVar  1        30.30 (   0.00%)       31.02 (  -2.40%)
CoeffVar  4         4.63 (   0.00%)        4.24 (   8.43%)
CoeffVar  7         6.77 (   0.00%)        6.10 (  10.02%)
CoeffVar  12       10.74 (   0.00%)        5.85 (  45.57%)
CoeffVar  18        9.15 (   0.00%)       33.45 (-265.58%)
CoeffVar  20       48.64 (   0.00%)       51.75 (  -6.41%)
Max       1        17.01 (   0.00%)       17.36 (  -2.06%)
Max       4         5.33 (   0.00%)        5.40 (  -1.31%)
Max       7         4.14 (   0.00%)        4.18 (  -0.97%)
Max       12        3.89 (   0.00%)        3.67 (   5.66%)
Max       18        4.82 (   0.00%)        8.64 ( -79.25%)
Max       20       11.09 (   0.00%)       19.26 ( -73.67%)
BAmean-50 1         9.12 (   0.00%)        9.16 (  -0.49%)
BAmean-50 4         4.73 (   0.00%)        4.80 (  -1.55%)
BAmean-50 7         3.46 (   0.00%)        3.48 (  -0.58%)
BAmean-50 12        3.02 (   0.00%)        3.18 (  -5.24%)
BAmean-50 18        3.90 (   0.00%)        4.08 (  -4.52%)
BAmean-50 20        4.02 (   0.00%)        5.90 ( -46.56%)
BAmean-95 1        10.45 (   0.00%)       10.54 (  -0.82%)
BAmean-95 4         4.88 (   0.00%)        4.91 (  -0.52%)
BAmean-95 7         3.60 (   0.00%)        3.60 (  -0.08%)
BAmean-95 12        3.23 (   0.00%)        3.28 (  -1.60%)
BAmean-95 18        4.14 (   0.00%)        4.91 ( -18.58%)
BAmean-95 20        5.56 (   0.00%)        8.22 ( -48.04%)
BAmean-99 1        10.45 (   0.00%)       10.54 (  -0.82%)
BAmean-99 4         4.88 (   0.00%)        4.91 (  -0.52%)
BAmean-99 7         3.60 (   0.00%)        3.60 (  -0.08%)
BAmean-99 12        3.23 (   0.00%)        3.28 (  -1.60%)
BAmean-99 18        4.14 (   0.00%)        4.91 ( -18.58%)
BAmean-99 20        5.56 (   0.00%)        8.22 ( -48.04%)


docker-ized readtwice microbenchmark
------------------------------------

This is Alex's modified readtwice case.  Needed a few fixes, and I made it into
a script.  Updated version attached.

Same machine, three runs per kernel, 40 containers per test.  This is average
MB/s over all containers.

    5.9-rc2          5.9-rc2-lru
-----------          -----------
220.5 (3.3)          356.9 (0.5)

That's a 62% improvement.

[-- Attachment #2: Dockerfile --]
[-- Type: text/plain, Size: 509 bytes --]

FROM centos:8
MAINTAINER Alexs 
#WORKDIR /vm-scalability 
#RUN yum update -y && yum groupinstall "Development Tools" -y && yum clean all && \
#examples https://www.linuxtechi.com/build-docker-container-images-with-dockerfile/
RUN yum install git xfsprogs patch make gcc -y && yum clean all && \
git clone  https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/ && \
cd vm-scalability && make usemem

COPY readtwice.patch /vm-scalability/

RUN cd vm-scalability && patch -p1 < readtwice.patch

[-- Attachment #3: run.sh --]
[-- Type: text/plain, Size: 1858 bytes --]

#!/usr/bin/env bash
#
# Originally by Alex Shi <alex.shi@linux.alibaba.com>
# Changes from Daniel Jordan <daniel.m.jordan@oracle.com>

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
TAG='lrulock'
runtime=300

nr_cont=$(nproc)

cd "$SCRIPT_DIR"

echo -e "starting $nr_cont containers\n"

pids=()

sudo docker build -t "$TAG" .

nr_running_cont=$(sudo docker ps | sed '1 d' | wc -l)
if (( nr_running_cont != 0 )); then
	echo "error: $nr_running_cont containers already running"
	exit 1
fi

# start some testing containers
for ((i=0; i < nr_cont; i++)); do
	sudo docker run --privileged=true --rm "$TAG" bash -c "sleep infinity" &
done

nr_running_cont=$(sudo docker ps | sed '1 d' | wc -l)
until (( nr_running_cont == nr_cont )); do
	sleep .5
	nr_running_cont=$(sudo docker ps | sed '1 d' | wc -l)
done

# do testing env setup 
for i in `sudo docker ps | sed '1 d' | awk '{print $1}'`; do
	sudo docker exec --privileged=true -t $i \
		bash -c "cd /vm-scalability/; bash ./case-lru-file-readtwice m" &
	pids+=($!)
done

wait "${pids[@]}"
pids=()

# kick testing
for i in `sudo docker ps | sed '1 d' | awk '{print $1}'`; do
	sudo docker exec --privileged=true -t -e runtime=$runtime $i \
		bash -c "cd /vm-scalability/; bash ./case-lru-file-readtwice r" &
	pids+=($!)
done

wait "${pids[@]}"
pids=()

# save results
ts=$(date +%y-%m-%d_%H:%M:%S)
f="$ts/summary.txt"

mkdir "$ts"
echo "$ts" >> "$f"
uname -r >> "$f"

for i in `sudo docker ps | sed '1 d' | awk '{print $1}'`; do
	sudo docker exec $i bash -c 'cat /tmp/vm-scalability-tmp/dd-output-*' &> "$ts/$i.out" &
	pids+=($!)
done

wait "${pids[@]}"
pids=()

grep 'copied' "$ts"/*.out | \
	awk 'BEGIN {a=0;} { a+=$10 } END {print NR, a/(NR)}' | \
	tee -a "$f"

for i in `sudo docker ps | sed '1 d' | awk '{print $1}'`; do
	sudo docker stop $i &>/dev/null &
done
wait

echo 'test finished'
echo

[-- Attachment #4: readtwice.patch --]
[-- Type: text/plain, Size: 1876 bytes --]

diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice
index 85533b248634..57cb97d121ae 100755
--- a/case-lru-file-readtwice
+++ b/case-lru-file-readtwice
@@ -15,23 +15,30 @@
 
 . ./hw_vars
 
-for i in `seq 1 $nr_task`
-do
-	create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES / nr_task))
-	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1 &
-	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1 &
-done
+OUT_DIR=$(hostname)-${nr_task}c-$(((mem + (1<<29))>>30))g
+TEST_CASES=${@:-$(echo case-*)}
+
+echo $((1<<30)) > /proc/sys/vm/max_map_count
+echo $((1<<20)) > /proc/sys/kernel/threads-max
+echo 1 > /proc/sys/vm/overcommit_memory
+#echo 3 > /proc/sys/vm/drop_caches
+
+
+i=1
+
+if [ "$1" == "m" ];then
+	mount_tmpfs
+	create_sparse_root
+	create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES))
+	exit
+fi
+
+
+if [ "$1" == "r" ];then
+	(timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1)&
+	(timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1)&
+fi
 
 wait
 sleep 1
 
-for file in $TMPFS_MNT/dd-output-*
-do
-	[ -s "$file" ] || {
-		echo "dd output file empty: $file" >&2
-	}
-	cat $file
-	rm  $file
-done
-
-rm `seq -f $SPARSE_FILE-%g 1 $nr_task`
diff --git a/hw_vars b/hw_vars
index 8731cefb9f57..ceeaa9f17c0b 100755
--- a/hw_vars
+++ b/hw_vars
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/sh -e
 
 if [ -n "$runtime" ]; then
 	USEMEM="$CMD ./usemem --runtime $runtime"
@@ -43,7 +43,7 @@ create_loop_devices()
 	modprobe loop 2>/dev/null
 	[ -e "/dev/loop0" ] || modprobe loop 2>/dev/null
 
-	for i in $(seq 0 8)
+	for i in $(seq 0 104)
 	do
 		[ -e "/dev/loop$i" ] && continue
 		mknod /dev/loop$i b 7 $i

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-15 16:58                     ` Daniel Jordan
@ 2020-09-16 12:44                       ` Alex Shi
  2020-09-17  2:37                       ` Alex Shi
  1 sibling, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-16 12:44 UTC (permalink / raw)
  To: Daniel Jordan, Hugh Dickins
  Cc: Andrew Morton, mgorman, tj, khlebnikov, willy, hannes, lkp,
	linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, vbabka, minchan, cai



On 2020/9/16 12:58 AM, Daniel Jordan wrote:
> On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
>> On Sun, 13 Sep 2020, Alex Shi wrote:
>>> Uh, I updated the testing with some new results here:
>>> https://lkml.org/lkml/2020/8/26/212
>>
>> Right, I missed that, that's better, thanks.  Any other test results?
> 
> Alex, you were doing some will-it-scale runs earlier.  Are you planning to do
> more of those?  Otherwise I can add them in.
> 

Hi Daniel,

I am happy to see your testing result. :)

> This is what I have so far.
> 
> 
> sysbench oltp read-only
> -----------------------
> 
> The goal was to run a real world benchmark, at least more so than something
> like vm-scalability, with the memory controller enabled but unused to check for
> regressions.
> 
> I chose sysbench because it was relatively straightforward to run, but I'm open
> to ideas for other high level benchmarks that might be more sensitive to this
> series.
> 
> CoeffVar shows the test was pretty noisy overall.  It's nice to see there's no
> significant difference between the kernels for low thread counts (1-12), but
> I'm not sure what to make of the 18 and 20 thread cases.  At 20 threads, the
> CPUs of the node that the test was confined to were saturated and the variance
> is especially high.  I'm tempted to write the 18 and 20 thread cases off as
> noise.
> 
> - 2-socket * 10-core * 2-hyperthread broadwell server
> - test bound to node 1 to lower variance
> - 251G memory, divided evenly between the nodes (memory size of test shrunk to
>   accommodate confining to one node)
> - 12 iterations per thread count per kernel
> - THP enabled

Thanks a lot for the results!
Alex

> 
> export OLTP_CACHESIZE=$(($MEMTOTAL_BYTES/4))
> export OLTP_SHAREDBUFFERS=$((MEMTOTAL_BYTES/8))
> export OLTP_PAGESIZES="default"
> export SYSBENCH_DRIVER=postgres
> export SYSBENCH_MAX_TRANSACTIONS=auto
> export SYSBENCH_READONLY=yes
> export SYSBENCH_MAX_THREADS=$((NUMCPUS / 2))
> export SYSBENCH_ITERATIONS=12
> export SYSBENCH_WORKLOAD_SIZE=$((MEMTOTAL_BYTES*3/8))
> export SYSBENCH_CACHE_COLD=no
> export DATABASE_INIT_ONCE=yes
> 
> export MMTESTS_NUMA_POLICY=fullbind_single_instance_node
> numactl --cpunodebind=1 --membind=1 <mmtests_cmdline>
> 
> sysbench Transactions per second
>                             5.9-rc2        5.9-rc2-lru-v18
> Min       1       593.23 (   0.00%)      583.37 (  -1.66%)
> Min       4      1897.34 (   0.00%)     1871.77 (  -1.35%)
> Min       7      2471.14 (   0.00%)     2449.77 (  -0.86%)
> Min       12     2680.00 (   0.00%)     2853.25 (   6.46%)
> Min       18     2183.82 (   0.00%)     1191.43 ( -45.44%)
> Min       20      924.96 (   0.00%)      526.66 ( -43.06%)
> Hmean     1       912.08 (   0.00%)      904.24 (  -0.86%)
> Hmean     4      2057.11 (   0.00%)     2044.69 (  -0.60%)
> Hmean     7      2817.59 (   0.00%)     2812.80 (  -0.17%)
> Hmean     12     3201.05 (   0.00%)     3171.09 (  -0.94%)
> Hmean     18     2529.10 (   0.00%)     2009.99 * -20.53%*
> Hmean     20     1742.29 (   0.00%)     1127.77 * -35.27%*
> Stddev    1       219.21 (   0.00%)      220.92 (  -0.78%)
> Stddev    4        94.94 (   0.00%)       84.34 (  11.17%)
> Stddev    7       189.42 (   0.00%)      167.58 (  11.53%)
> Stddev    12      372.13 (   0.00%)      199.40 (  46.42%)
> Stddev    18      248.42 (   0.00%)      574.66 (-131.32%)
> Stddev    20      757.69 (   0.00%)      666.87 (  11.99%)
> CoeffVar  1        22.54 (   0.00%)       22.86 (  -1.42%)
> CoeffVar  4         4.61 (   0.00%)        4.12 (  10.60%)
> CoeffVar  7         6.69 (   0.00%)        5.94 (  11.30%)
> CoeffVar  12       11.49 (   0.00%)        6.27 (  45.46%)
> CoeffVar  18        9.74 (   0.00%)       26.22 (-169.23%)
> CoeffVar  20       36.32 (   0.00%)       47.18 ( -29.89%)
> Max       1      1117.45 (   0.00%)     1107.33 (  -0.91%)
> Max       4      2184.92 (   0.00%)     2136.65 (  -2.21%)
> Max       7      3086.81 (   0.00%)     3049.52 (  -1.21%)
> Max       12     4020.07 (   0.00%)     3580.95 ( -10.92%)
> Max       18     3032.30 (   0.00%)     2810.85 (  -7.30%)
> Max       20     2891.27 (   0.00%)     2675.80 (  -7.45%)
> BHmean-50 1      1098.77 (   0.00%)     1093.58 (  -0.47%)
> BHmean-50 4      2139.76 (   0.00%)     2107.13 (  -1.52%)
> BHmean-50 7      2972.18 (   0.00%)     2953.94 (  -0.61%)
> BHmean-50 12     3494.73 (   0.00%)     3311.33 (  -5.25%)
> BHmean-50 18     2729.70 (   0.00%)     2606.32 (  -4.52%)
> BHmean-50 20     2668.72 (   0.00%)     1779.87 ( -33.31%)
> BHmean-95 1       958.94 (   0.00%)      951.84 (  -0.74%)
> BHmean-95 4      2072.98 (   0.00%)     2062.01 (  -0.53%)
> BHmean-95 7      2853.96 (   0.00%)     2851.21 (  -0.10%)
> BHmean-95 12     3258.65 (   0.00%)     3203.53 (  -1.69%)
> BHmean-95 18     2565.99 (   0.00%)     2143.90 ( -16.45%)
> BHmean-95 20     1894.47 (   0.00%)     1258.34 ( -33.58%)
> BHmean-99 1       958.94 (   0.00%)      951.84 (  -0.74%)
> BHmean-99 4      2072.98 (   0.00%)     2062.01 (  -0.53%)
> BHmean-99 7      2853.96 (   0.00%)     2851.21 (  -0.10%)
> BHmean-99 12     3258.65 (   0.00%)     3203.53 (  -1.69%)
> BHmean-99 18     2565.99 (   0.00%)     2143.90 ( -16.45%)
> BHmean-99 20     1894.47 (   0.00%)     1258.34 ( -33.58%)
> 
> sysbench Time
>                             5.9-rc2            5.9-rc2-lru
> Min       1         8.96 (   0.00%)        9.04 (  -0.89%)
> Min       4         4.63 (   0.00%)        4.74 (  -2.38%)
> Min       7         3.34 (   0.00%)        3.38 (  -1.20%)
> Min       12        2.65 (   0.00%)        2.95 ( -11.32%)
> Min       18        3.54 (   0.00%)        3.80 (  -7.34%)
> Min       20        3.74 (   0.00%)        4.02 (  -7.49%)
> Amean     1        11.00 (   0.00%)       11.11 (  -0.98%)
> Amean     4         4.92 (   0.00%)        4.95 (  -0.59%)
> Amean     7         3.65 (   0.00%)        3.65 (  -0.16%)
> Amean     12        3.29 (   0.00%)        3.32 (  -0.89%)
> Amean     18        4.20 (   0.00%)        5.22 * -24.39%*
> Amean     20        6.02 (   0.00%)        9.14 * -51.98%*
> Stddev    1         3.33 (   0.00%)        3.45 (  -3.40%)
> Stddev    4         0.23 (   0.00%)        0.21 (   7.89%)
> Stddev    7         0.25 (   0.00%)        0.22 (   9.87%)
> Stddev    12        0.35 (   0.00%)        0.19 (  45.09%)
> Stddev    18        0.38 (   0.00%)        1.75 (-354.74%)
> Stddev    20        2.93 (   0.00%)        4.73 ( -61.72%)
> CoeffVar  1        30.30 (   0.00%)       31.02 (  -2.40%)
> CoeffVar  4         4.63 (   0.00%)        4.24 (   8.43%)
> CoeffVar  7         6.77 (   0.00%)        6.10 (  10.02%)
> CoeffVar  12       10.74 (   0.00%)        5.85 (  45.57%)
> CoeffVar  18        9.15 (   0.00%)       33.45 (-265.58%)
> CoeffVar  20       48.64 (   0.00%)       51.75 (  -6.41%)
> Max       1        17.01 (   0.00%)       17.36 (  -2.06%)
> Max       4         5.33 (   0.00%)        5.40 (  -1.31%)
> Max       7         4.14 (   0.00%)        4.18 (  -0.97%)
> Max       12        3.89 (   0.00%)        3.67 (   5.66%)
> Max       18        4.82 (   0.00%)        8.64 ( -79.25%)
> Max       20       11.09 (   0.00%)       19.26 ( -73.67%)
> BAmean-50 1         9.12 (   0.00%)        9.16 (  -0.49%)
> BAmean-50 4         4.73 (   0.00%)        4.80 (  -1.55%)
> BAmean-50 7         3.46 (   0.00%)        3.48 (  -0.58%)
> BAmean-50 12        3.02 (   0.00%)        3.18 (  -5.24%)
> BAmean-50 18        3.90 (   0.00%)        4.08 (  -4.52%)
> BAmean-50 20        4.02 (   0.00%)        5.90 ( -46.56%)
> BAmean-95 1        10.45 (   0.00%)       10.54 (  -0.82%)
> BAmean-95 4         4.88 (   0.00%)        4.91 (  -0.52%)
> BAmean-95 7         3.60 (   0.00%)        3.60 (  -0.08%)
> BAmean-95 12        3.23 (   0.00%)        3.28 (  -1.60%)
> BAmean-95 18        4.14 (   0.00%)        4.91 ( -18.58%)
> BAmean-95 20        5.56 (   0.00%)        8.22 ( -48.04%)
> BAmean-99 1        10.45 (   0.00%)       10.54 (  -0.82%)
> BAmean-99 4         4.88 (   0.00%)        4.91 (  -0.52%)
> BAmean-99 7         3.60 (   0.00%)        3.60 (  -0.08%)
> BAmean-99 12        3.23 (   0.00%)        3.28 (  -1.60%)
> BAmean-99 18        4.14 (   0.00%)        4.91 ( -18.58%)
> BAmean-99 20        5.56 (   0.00%)        8.22 ( -48.04%)
> 
> 
> docker-ized readtwice microbenchmark
> ------------------------------------
> 
> This is Alex's modified readtwice case.  Needed a few fixes, and I made it into
> a script.  Updated version attached.
> 
> Same machine, three runs per kernel, 40 containers per test.  This is average
> MB/s over all containers.
> 
>     5.9-rc2          5.9-rc2-lru
> -----------          -----------
> 220.5 (3.3)          356.9 (0.5)
> 
> That's a 62% improvement.
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-15 16:58                     ` Daniel Jordan
  2020-09-16 12:44                       ` Alex Shi
@ 2020-09-17  2:37                       ` Alex Shi
  2020-09-17 14:35                         ` Daniel Jordan
  1 sibling, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-09-17  2:37 UTC (permalink / raw)
  To: Daniel Jordan, Hugh Dickins
  Cc: Andrew Morton, mgorman, tj, khlebnikov, willy, hannes, lkp,
	linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, vbabka, minchan, cai



On 2020/9/16 12:58 AM, Daniel Jordan wrote:
> On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
>> On Sun, 13 Sep 2020, Alex Shi wrote:
>>> Uh, I updated the testing with some new results here:
>>> https://lkml.org/lkml/2020/8/26/212
>> Right, I missed that, that's better, thanks.  Any other test results?
> Alex, you were doing some will-it-scale runs earlier.  Are you planning to do
> more of those?  Otherwise I can add them in.

Hi Daniel,

Is compaction perf scalable, like thpscale? I expect they could get some benefit.

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-17  2:37                       ` Alex Shi
@ 2020-09-17 14:35                         ` Daniel Jordan
  2020-09-17 15:39                           ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Daniel Jordan @ 2020-09-17 14:35 UTC (permalink / raw)
  To: Alex Shi
  Cc: Daniel Jordan, Hugh Dickins, Andrew Morton, mgorman, tj,
	khlebnikov, willy, hannes, lkp, linux-mm, linux-kernel, cgroups,
	shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen, mhocko, vdavydov.dev, shy828301,
	vbabka, minchan, cai

On Thu, Sep 17, 2020 at 10:37:45AM +0800, Alex Shi wrote:
> On 2020/9/16 12:58 AM, Daniel Jordan wrote:
> > On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
> >> On Sun, 13 Sep 2020, Alex Shi wrote:
> >>> Uh, I updated the testing with some new results here:
> >>> https://lkml.org/lkml/2020/8/26/212
> >> Right, I missed that, that's better, thanks.  Any other test results?
> > Alex, you were doing some will-it-scale runs earlier.  Are you planning to do
> > more of those?  Otherwise I can add them in.
> 
> Hi Daniel,
> 
> > Is compaction perf scalable, like thpscale? I expect they could get some benefit.

Yep, I plan to stress compaction.  Reclaim as well.

I should have said which Alex I meant.  I was asking Alex Duyck since he'd done
some will-it-scale runs.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-17 14:35                         ` Daniel Jordan
@ 2020-09-17 15:39                           ` Alexander Duyck
  2020-09-17 16:48                             ` Daniel Jordan
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-09-17 15:39 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Alex Shi, Hugh Dickins, Andrew Morton, Mel Gorman, Tejun Heo,
	Konstantin Khlebnikov, Matthew Wilcox, Johannes Weiner,
	kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov, Rong Chen,
	Michal Hocko, Vladimir Davydov, shy828301, Vlastimil Babka,
	Minchan Kim, Qian Cai

On Thu, Sep 17, 2020 at 7:26 AM Daniel Jordan
<daniel.m.jordan@oracle.com> wrote:
>
> On Thu, Sep 17, 2020 at 10:37:45AM +0800, Alex Shi wrote:
> > On 2020/9/16 12:58 AM, Daniel Jordan wrote:
> > > On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
> > >> On Sun, 13 Sep 2020, Alex Shi wrote:
> > >>> Uh, I updated the testing with some new results here:
> > >>> https://lkml.org/lkml/2020/8/26/212
> > >> Right, I missed that, that's better, thanks.  Any other test results?
> > > Alex, you were doing some will-it-scale runs earlier.  Are you planning to do
> > > more of those?  Otherwise I can add them in.
> >
> > Hi Daniel,
> >
> > > Is compaction perf scalable, like thpscale? I expect they could get some benefit.
>
> Yep, I plan to stress compaction.  Reclaim as well.
>
> I should have said which Alex I meant.  I was asking Alex Duyck since he'd done
> some will-it-scale runs.

I probably won't be able to do any will-it-scale runs any time soon.
If I recall I ran them for this latest v18 patch set and didn't see
any regressions like I did with the previous set. However the system I
was using is tied up for other purposes and it may be awhile before I
can free it up to look into this again.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 00/32] per memcg lru_lock: reviews
  2020-09-17 15:39                           ` Alexander Duyck
@ 2020-09-17 16:48                             ` Daniel Jordan
  0 siblings, 0 replies; 102+ messages in thread
From: Daniel Jordan @ 2020-09-17 16:48 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Daniel Jordan, Alex Shi, Hugh Dickins, Andrew Morton, Mel Gorman,
	Tejun Heo, Konstantin Khlebnikov, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov, shy828301,
	Vlastimil Babka, Minchan Kim, Qian Cai

On Thu, Sep 17, 2020 at 08:39:34AM -0700, Alexander Duyck wrote:
> On Thu, Sep 17, 2020 at 7:26 AM Daniel Jordan
> <daniel.m.jordan@oracle.com> wrote:
> >
> > On Thu, Sep 17, 2020 at 10:37:45AM +0800, Alex Shi wrote:
> > > On 2020/9/16 12:58 AM, Daniel Jordan wrote:
> > > > On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
> > > >> On Sun, 13 Sep 2020, Alex Shi wrote:
> > > >>> Uh, I updated the testing with some new results here:
> > > >>> https://lkml.org/lkml/2020/8/26/212
> > > >> Right, I missed that, that's better, thanks.  Any other test results?
> > > > Alex, you were doing some will-it-scale runs earlier.  Are you planning to do
> > > > more of those?  Otherwise I can add them in.
> > >
> > > Hi Daniel,
> > >
> > > Is compaction perf scalable, like thpscale? I expect they could get some benefit.
> >
> > Yep, I plan to stress compaction.  Reclaim as well.
> >
> > I should have said which Alex I meant.  I was asking Alex Duyck since he'd done
> > some will-it-scale runs.
> 
> I probably won't be able to do any will-it-scale runs any time soon.
> If I recall I ran them for this latest v18 patch set and didn't see
> any regressions like I did with the previous set. However the system I
> was using is tied up for other purposes and it may be awhile before I
> can free it up to look into this again.

Ok, sure.  I hadn't seen the regressions were taken care of, that's good to
hear.  Might still add them to my testing for v19 and beyond, we'll see.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 06/32] mm/thp: narrow lru locking
  2020-09-13 15:27       ` Matthew Wilcox
@ 2020-09-19  1:00         ` Hugh Dickins
  0 siblings, 0 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-09-19  1:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alex Shi, akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Andrea Arcangeli

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5857 bytes --]

On Sun, 13 Sep 2020, Matthew Wilcox wrote:
> On Fri, Sep 11, 2020 at 11:37:50AM +0800, Alex Shi wrote:
> > On 2020/9/10 9:49 PM, Matthew Wilcox wrote:
> > > On Mon, Aug 24, 2020 at 08:54:39PM +0800, Alex Shi wrote:
> > >> lru_lock and page cache xa_lock have no reason with current sequence,
> > >> put them together isn't necessary. let's narrow the lru locking, but
> > >> left the local_irq_disable to block interrupt re-entry and statistic update.
> > > 
> > > What stats are you talking about here?
> > 
> > Hi Matthew,
> > 
> > Thanks for comments!
> > 
> > like __dec_node_page_state(head, NR_SHMEM_THPS); will give a preemption warning...
> 
> OK, but those stats are guarded by 'if (mapping)', so this patch doesn't
> produce that warning because we'll have taken the xarray lock and disabled
> interrupts.
> 
> > > How about this patch instead?  It occurred to me we already have
> > > perfectly good infrastructure to track whether or not interrupts are
> > > already disabled, and so we should use that instead of ensuring that
> > > interrupts are disabled, or tracking that ourselves.
> > 
> > So your proposal looks like:
> > 1, xa_lock_irq(&mapping->i_pages); (optional)
> > 2, spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > 3, spin_lock_irqsave(&pgdat->lru_lock, flags);
> > 
> > Is saving the flags meaningful for the 2nd and 3rd locks?
> 
> Yes.  We want to avoid doing:
> 
> 	if (mapping)
> 		spin_lock(&ds_queue->split_queue_lock);
> 	else
> 		spin_lock_irq(&ds_queue->split_queue_lock);
> ...
> 	if (mapping)
> 		spin_unlock(&ds_queue->split_queue_lock);
> 	else
> 		spin_unlock_irq(&ds_queue->split_queue_lock);
> 
> Just using _irqsave has the same effect and is easier to reason about.
> 
> > IIRC, I had a similar proposal to yours, with the flags used in
> > xa_lock_irqsave(), but Hugh objected to it.
> 
> I imagine Hugh's objection was that we know it's safe to disable/enable
> interrupts here because we're in a sleepable context.  But for the
> other two locks, we'd rather not track whether we've already disabled
> interrupts or not.
> 
> Maybe you could dig up the email from Hugh?  I can't find it.

I did not find exactly the objection Alex seems to be remembering, but
I have certainly expressed frustration with the lack of a reason for
the THP split lock reordering, and in private mail in June while I was
testing and sending back fixes: "I'd prefer that you never got into this:
it looks like an unrelated and debatable cleanup, and I can see more
such cleanup to make there, that we'd better not get into right now."

I've several times toyed with just leaving this patch out of the series:
but each time ended up, for better or worse, deciding we'd better keep
it in - partly because we've never tested without it, and it cannot be
dropped without making some other change (to stabilize the memcg in
the !list case) - easily doable, but already done by this patch.

Alex asked me to improve his commit message to satisfy my objections,
here's what I sent him last night:

===
lru_lock and page cache xa_lock have no obvious reason to be taken
one way round or the other: until now, lru_lock has been taken before
page cache xa_lock, when splitting a THP; but nothing else takes them
together.  Reverse that ordering: let's narrow the lru locking - but
leave local_irq_disable to block interrupts throughout, like before.

Hugh Dickins' point: split_huge_page_to_list() was already silly, to be
using the _irqsave variant: it's just been taking sleeping locks, so
would already be broken if entered with interrupts disabled.  So we
can save passing flags argument down to __split_huge_page().

Why change the lock ordering here? That was hard to decide. One reason:
when this series reaches per-memcg lru locking, it relies on the THP's
memcg to be stable when taking the lru_lock: that is now done after the
THP's refcount has been frozen, which ensures page memcg cannot change.

Another reason: previously, lock_page_memcg()'s move_lock was presumed
to nest inside lru_lock; but now lru_lock must nest inside (page cache
lock inside) move_lock, so it becomes possible to use lock_page_memcg()
to stabilize page memcg before taking its lru_lock.  That is not the
mechanism used in this series, but it is an option we want to keep open.
===
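
As a sketch only of the option that last paragraph keeps open - it is not
what this series does, and it assumes the per-memcg lruvec->lru_lock that
only arrives later in the series - the new nesting would permit something
like:

	struct lruvec *lruvec;

	lock_page_memcg(page);		/* move_lock: page's memcg now stable */
	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
	spin_lock_irq(&lruvec->lru_lock);	/* innermost under the new ordering */
	/* lru list manipulation for this page */
	spin_unlock_irq(&lruvec->lru_lock);
	unlock_page_memcg(page);

that is, pin the memcg first, then look up and take its lru_lock.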

It's still the case that I want to avoid further cleanups and
bikeshedding here for now.  I took an open-minded look at Alex's
patch versus Matthew's patch, and do prefer Alex's: largely because
it's simple and explicit about where the irq disabling and enabling
is done (exactly where it was done before), and doesn't need irqsave
clutter in between.  If this were to be the only local_irq_disable()
in mm I'd NAK it, but that's not so - and as I said before, I don't
take the RT THP case very seriously anyway.

One slight worry in Matthew's version:

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	count = page_count(head);
	mapcount = total_mapcount(head);
	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
		if (!list_empty(page_deferred_list(head))) {
			ds_queue->split_queue_len--;
			list_del(page_deferred_list(head));
		}
		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
		if (mapping) {
			if (PageSwapBacked(head))
				__dec_node_page_state(head, NR_SHMEM_THPS);
			else
				__dec_node_page_state(head, NR_FILE_THPS);
		}
		__split_huge_page(page, list, end);

In the Anon case, interrupts are enabled when calling __split_huge_page()
there, but head's refcount is frozen: I'm uneasy about preemption when a
refcount is frozen.  But I'd worry much more if it were the mapping case:
no, that has interrupts safely disabled at that point (as does Anon in
the current kernel, and with Alex's patch).

Hugh

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 15/32] mm/lru: move lock into lru_note_cost
  2020-08-24 12:54 ` [PATCH v18 15/32] mm/lru: move lock into lru_note_cost Alex Shi
@ 2020-09-21 21:36   ` Hugh Dickins
  2020-09-21 22:03     ` Hugh Dickins
  2020-09-22  3:38     ` Alex Shi
  0 siblings, 2 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-09-21 21:36 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Mon, 24 Aug 2020, Alex Shi wrote:

> We have to move lru_lock into lru_note_cost, since it cycle up on memcg
> tree, for future per lruvec lru_lock replace. It's a bit ugly and may
> cost a bit more locking, but benefit from multiple memcg locking could
> cover the lost.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Acked-by: Hugh Dickins <hughd@google.com>

In your lruv19 github tree, you have merged 14/32 into this one: thanks.

> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/swap.c   | 5 +++--
>  mm/vmscan.c | 4 +---
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 906255db6006..f80ccd6f3cb4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -269,7 +269,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  {
>  	do {
>  		unsigned long lrusize;
> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  
> +		spin_lock_irq(&pgdat->lru_lock);
>  		/* Record cost event */
>  		if (file)
>  			lruvec->file_cost += nr_pages;
> @@ -293,15 +295,14 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  			lruvec->file_cost /= 2;
>  			lruvec->anon_cost /= 2;
>  		}
> +		spin_unlock_irq(&pgdat->lru_lock);
>  	} while ((lruvec = parent_lruvec(lruvec)));
>  }
>  
>  void lru_note_cost_page(struct page *page)
>  {
> -	spin_lock_irq(&page_pgdat(page)->lru_lock);
>  	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
>  		      page_is_file_lru(page), thp_nr_pages(page));
> -	spin_unlock_irq(&page_pgdat(page)->lru_lock);
>  }
>  
>  static void __activate_page(struct page *page, struct lruvec *lruvec)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ffccb94defaf..7b7b36bd1448 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1971,19 +1971,17 @@ static int current_may_throttle(void)
>  				&stat, false);
>  
>  	spin_lock_irq(&pgdat->lru_lock);
> -
>  	move_pages_to_lru(lruvec, &page_list);
>  
>  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> -	lru_note_cost(lruvec, file, stat.nr_pageout);
>  	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
>  	if (!cgroup_reclaim(sc))
>  		__count_vm_events(item, nr_reclaimed);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>  	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> -
>  	spin_unlock_irq(&pgdat->lru_lock);
>  
> +	lru_note_cost(lruvec, file, stat.nr_pageout);
>  	mem_cgroup_uncharge_list(&page_list);
>  	free_unref_page_list(&page_list);
>  
> -- 
> 1.8.3.1
> 
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 15/32] mm/lru: move lock into lru_note_cost
  2020-09-21 21:36   ` Hugh Dickins
@ 2020-09-21 22:03     ` Hugh Dickins
  2020-09-22  3:39       ` Alex Shi
  2020-09-22  3:38     ` Alex Shi
  1 sibling, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-21 22:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Mon, 21 Sep 2020, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Alex Shi wrote:
> 
> > We have to move lru_lock into lru_note_cost, since it cycle up on memcg
> > tree, for future per lruvec lru_lock replace. It's a bit ugly and may
> > cost a bit more locking, but benefit from multiple memcg locking could
> > cover the lost.
> > 
> > Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> 
> In your lruv19 github tree, you have merged 14/32 into this one: thanks.

Grr, I've only just started, and already missed some of my notes.

I wanted to point out that this patch does introduce an extra unlock+lock
in shrink_inactive_list(), even in a !CONFIG_MEMCG build.  I think you've
done the right thing for now, keeping it simple, and maybe nobody will
notice the extra overhead; but I expect us to replace lru_note_cost()
by lru_note_cost_unlock_irq() later on, expecting the caller to do the
initial lock_irq().

lru_note_cost_page() looks redundant to me, but you're right not to
delete it here, unless Johannes asks you to add that in: that's his
business, and it may be dependent on the XXX at its callsite.
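
Purely as a sketch of the shape I'm imagining (the name comes from the
paragraph above; the body is only guessed from the lru_note_cost() hunk
quoted below, nothing from the series):

	void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
				      unsigned int nr_pages)
	{
		/* entered with pgdat->lru_lock held and irqs disabled */
		bool locked = true;

		do {
			struct pglist_data *pgdat = lruvec_pgdat(lruvec);

			if (!locked)
				spin_lock_irq(&pgdat->lru_lock);
			if (file)
				lruvec->file_cost += nr_pages;
			else
				lruvec->anon_cost += nr_pages;
			/* decay of file_cost/anon_cost as in lru_note_cost() */
			spin_unlock_irq(&pgdat->lru_lock);
			locked = false;
		} while ((lruvec = parent_lruvec(lruvec)));
	}

so shrink_inactive_list() would hand over the lock it already holds instead
of dropping and retaking it around the cost accounting.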

> 
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  mm/swap.c   | 5 +++--
> >  mm/vmscan.c | 4 +---
> >  2 files changed, 4 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 906255db6006..f80ccd6f3cb4 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -269,7 +269,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
> >  {
> >  	do {
> >  		unsigned long lrusize;
> > +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> >  
> > +		spin_lock_irq(&pgdat->lru_lock);
> >  		/* Record cost event */
> >  		if (file)
> >  			lruvec->file_cost += nr_pages;
> > @@ -293,15 +295,14 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
> >  			lruvec->file_cost /= 2;
> >  			lruvec->anon_cost /= 2;
> >  		}
> > +		spin_unlock_irq(&pgdat->lru_lock);
> >  	} while ((lruvec = parent_lruvec(lruvec)));
> >  }
> >  
> >  void lru_note_cost_page(struct page *page)
> >  {
> > -	spin_lock_irq(&page_pgdat(page)->lru_lock);
> >  	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
> >  		      page_is_file_lru(page), thp_nr_pages(page));
> > -	spin_unlock_irq(&page_pgdat(page)->lru_lock);
> >  }
> >  
> >  static void __activate_page(struct page *page, struct lruvec *lruvec)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ffccb94defaf..7b7b36bd1448 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1971,19 +1971,17 @@ static int current_may_throttle(void)
> >  				&stat, false);
> >  
> >  	spin_lock_irq(&pgdat->lru_lock);
> > -
> >  	move_pages_to_lru(lruvec, &page_list);
> >  
> >  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> > -	lru_note_cost(lruvec, file, stat.nr_pageout);
> >  	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> >  	if (!cgroup_reclaim(sc))
> >  		__count_vm_events(item, nr_reclaimed);
> >  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> >  	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> > -
> >  	spin_unlock_irq(&pgdat->lru_lock);
> >  
> > +	lru_note_cost(lruvec, file, stat.nr_pageout);
> >  	mem_cgroup_uncharge_list(&page_list);
> >  	free_unref_page_list(&page_list);
> >  
> > -- 
> > 1.8.3.1
> > 
> > 
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU
  2020-08-24 12:54 ` [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU Alex Shi
@ 2020-09-21 23:16   ` Hugh Dickins
  2020-09-22  3:53     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-21 23:16 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Michal Hocko

On Mon, 24 Aug 2020, Alex Shi wrote:

> Currently lru_lock still guards both lru list and page's lru bit, that's
> ok. but if we want to use specific lruvec lock on the page, we need to
> pin down the page's lruvec/memcg during locking. Just taking lruvec
> lock first may be undermined by the page's memcg charge/migration. To
> fix this problem, we could clear the lru bit out of locking and use
> it as pin down action to block the page isolation in memcg changing.
> 
> So now a standard steps of page isolation is following:
> 	1, get_page(); 	       #pin the page avoid to be free
> 	2, TestClearPageLRU(); #block other isolation like memcg change
> 	3, spin_lock on lru_lock; #serialize lru list access
> 	4, delete page from lru list;
> The step 2 could be optimzed/replaced in scenarios which page is
> unlikely be accessed or be moved between memcgs.
> 
> This patch start with the first part: TestClearPageLRU, which combines
> PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This
> function will be used as page isolation precondition to prevent other
> isolations some where else. Then there are may !PageLRU page on lru
> list, need to remove BUG() checking accordingly.
> 
> There 2 rules for lru bit now:
> 1, the lru bit still indicate if a page on lru list, just in some
>    temporary moment(isolating), the page may have no lru bit when
>    it's on lru list.  but the page still must be on lru list when the
>    lru bit set.
> 2, have to remove lru bit before delete it from lru list.
> 
> Hugh Dickins pointed that when a page is in free path and no one is
> possible to take it, non atomic lru bit clearing is better, like in
> __page_cache_release and release_pages.
> And no need get_page() before lru bit clear in isolate_lru_page,
> since it '(1) Must be called with an elevated refcount on the page'.

Delete that paragraph: you're justifying changes made during the
course of earlier review, but not needed here.  If we start to
comment on everything that is not done...!

> 
> As Andrew Morton mentioned this change would dirty cacheline for page
> isn't on LRU. But the lost would be acceptable with Rong Chen
> <rong.a.chen@intel.com> report:
> https://lkml.org/lkml/2020/3/4/173

Please use a lore link instead, lkml.org is nice but unreliable:
https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/

> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Acked-by: Hugh Dickins <hughd@google.com>
when you make the changes suggested above and below.

I still have long-standing reservations about this TestClearPageLRU
technique (it's hard to reason about, and requires additional atomic ops
in some places); but it's working, so I'd like it to go in, then later
we can experiment with whether lock_page_memcg() does a better job, or
rechecking memcg when getting the lru_lock (my original technique).
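
To have it in one place while reading the diff below: the protocol being
reviewed, compressed into a sketch (assembled from the numbered steps in
the commit message and the isolate_lru_page() hunk, not code to apply):

	get_page(page);			/* 1: pin it, page cannot be freed under us */
	if (TestClearPageLRU(page)) {	/* 2: claim the lru state, blocks other isolators */
		lruvec = mem_cgroup_page_lruvec(page, pgdat);
		spin_lock_irq(&pgdat->lru_lock);	/* 3: serialize lru list access */
		del_page_from_lru_list(page, lruvec, page_lru(page));	/* 4 */
		spin_unlock_irq(&pgdat->lru_lock);
	} else {
		put_page(page);		/* somebody else is isolating it */
	}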

> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/page-flags.h |  1 +
>  mm/mlock.c                 |  3 +--
>  mm/swap.c                  |  5 ++---
>  mm/vmscan.c                | 18 +++++++-----------
>  4 files changed, 11 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6be1aa559b1e..9554ed1387dc 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size)
>  PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
>  	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
>  PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
> +	TESTCLEARFLAG(LRU, lru, PF_HEAD)
>  PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
>  	TESTCLEARFLAG(Active, active, PF_HEAD)
>  PAGEFLAG(Workingset, workingset, PF_HEAD)
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 93ca2bf30b4f..3762d9dd5b31 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -107,13 +107,12 @@ void mlock_vma_page(struct page *page)
>   */
>  static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>  {
> -	if (PageLRU(page)) {
> +	if (TestClearPageLRU(page)) {
>  		struct lruvec *lruvec;
>  
>  		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>  		if (getpage)
>  			get_page(page);
> -		ClearPageLRU(page);
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
>  		return true;
>  	}
> diff --git a/mm/swap.c b/mm/swap.c
> index f80ccd6f3cb4..446ffe280809 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
>  		struct lruvec *lruvec;
>  		unsigned long flags;
>  
> +		__ClearPageLRU(page);
>  		spin_lock_irqsave(&pgdat->lru_lock, flags);
>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -		VM_BUG_ON_PAGE(!PageLRU(page), page);
> -		__ClearPageLRU(page);
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>  	}
> @@ -880,9 +879,9 @@ void release_pages(struct page **pages, int nr)
>  				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>  			}
>  
> -			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>  			VM_BUG_ON_PAGE(!PageLRU(page), page);
>  			__ClearPageLRU(page);
> +			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>  			del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		}
>  

Please delete all those mods to mm/swap.c from this patch.  This patch
is about introducing TestClearPageLRU, but that is not involved here.
Several versions ago, yes it was, then I pointed out that these are
operations on refcount 0 pages, and we don't want to add unnecessary
atomic operations on them.  I expect you want to keep the rearrangements,
but do them where you need them later (I expect that's in 20/32).

And I notice that one VM_BUG_ON_PAGE was kept and the other deleted:
though one can certainly argue that they're redundant (as all BUGs
should be), I think most people will feel safer to keep them both.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b7b36bd1448..1b3e0eeaad64 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1665,8 +1665,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  		page = lru_to_page(src);
>  		prefetchw_prev_lru_page(page, src, flags);
>  
> -		VM_BUG_ON_PAGE(!PageLRU(page), page);
> -
>  		nr_pages = compound_nr(page);
>  		total_scan += nr_pages;
>  

It is not enough to remove just that one VM_BUG_ON_PAGE there.
This is a patch series, and we don't need it to be perfect at every
bisection point between patches, but we do need it to be reasonably
robust, so as not to waste unrelated bughunters' time.  It didn't
take me very long to crash on the "default: BUG()" further down
isolate_lru_pages(), because now PageLRU may get cleared at any
instant, whatever locks are held.

(But you're absolutely right to leave the compaction and pagevec
mods to subsequent patches: it's fairly safe to separate those out,
and much easier for reviewers that you did so.)

This patch is much more robust with __isolate_lru_page() mods below
on top.  I agree there's other ways to do it, but given that nobody
cares what the error return is from __isolate_lru_page(), except for
the isolate_lru_pages() switch statement BUG() which has become
invalid, I suggest just use -EBUSY throughout __isolate_lru_page().
Yes, we can and should change that switch statement to an
"if {} else {}" without any BUG(), but I don't want to mess
you around at this time, leave cleanup like that until later.
Please fold in this patch on top:

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1540,7 +1540,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
  */
 int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 {
-	int ret = -EINVAL;
+	int ret = -EBUSY;
 
 	/* Only take pages on the LRU. */
 	if (!PageLRU(page))
@@ -1550,8 +1550,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
 		return ret;
 
-	ret = -EBUSY;
-
 	/*
 	 * To minimise LRU disruption, the caller can indicate that it only
 	 * wants to isolate pages it will be able to operate on without
@@ -1598,8 +1596,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 		 * sure the page is not being freed elsewhere -- the
 		 * page release code relies on it.
 		 */
-		ClearPageLRU(page);
-		ret = 0;
+		if (TestClearPageLRU(page))
+			ret = 0;
+		else
+			put_page(page);
 	}
 
 	return ret;

> @@ -1763,21 +1761,19 @@ int isolate_lru_page(struct page *page)
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>  
> -	if (PageLRU(page)) {
> +	if (TestClearPageLRU(page)) {
>  		pg_data_t *pgdat = page_pgdat(page);
>  		struct lruvec *lruvec;
> +		int lru = page_lru(page);
>  
> -		spin_lock_irq(&pgdat->lru_lock);
> +		get_page(page);
>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -		if (PageLRU(page)) {
> -			int lru = page_lru(page);
> -			get_page(page);
> -			ClearPageLRU(page);
> -			del_page_from_lru_list(page, lruvec, lru);
> -			ret = 0;
> -		}
> +		spin_lock_irq(&pgdat->lru_lock);
> +		del_page_from_lru_list(page, lruvec, lru);
>  		spin_unlock_irq(&pgdat->lru_lock);
> +		ret = 0;
>  	}
> +
>  	return ret;
>  }

And a small mod to isolate_lru_page() to be folded in.  I had
never noticed this before, but here you are evaluating page_lru()
after clearing PageLRU, but before getting lru_lock: that seems unsafe.
I'm pretty sure it's unsafe at this stage of the series; I did once
persuade myself that it becomes safe by the end of the series,
but I've already forgotten the argument for that (I have already
said TestClearPageLRU is difficult to reason about).  Please don't
force us to have to think about this! Just get page_lru after lru_lock.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1764,12 +1764,11 @@ int isolate_lru_page(struct page *page)
 	if (TestClearPageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
-		int lru = page_lru(page);
 
 		get_page(page);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		spin_lock_irq(&pgdat->lru_lock);
-		del_page_from_lru_list(page, lruvec, lru);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
 		spin_unlock_irq(&pgdat->lru_lock);
 		ret = 0;
 	}

And lastly, please do check_move_unevictable_pages()'s TestClearPageLRU
mod here at the end of mm/vmscan.c in this patch: I noticed that your
lruv19 branch is doing it in a later patch, but it fits better here.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 17/32] mm/compaction: do page isolation first in compaction
  2020-08-24 12:54 ` [PATCH v18 17/32] mm/compaction: do page isolation first in compaction Alex Shi
@ 2020-09-21 23:49   ` Hugh Dickins
  2020-09-22  4:57     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-21 23:49 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Mon, 24 Aug 2020, Alex Shi wrote:

> Currently, compaction would get the lru_lock and then do page isolation
> which works fine with pgdat->lru_lock, since any page isoltion would
> compete for the lru_lock. If we want to change to memcg lru_lock, we
> have to isolate the page before getting lru_lock, thus isoltion would
> block page's memcg change which relay on page isoltion too. Then we
> could safely use per memcg lru_lock later.
> 
> The new page isolation use previous introduced TestClearPageLRU() +
> pgdat lru locking which will be changed to memcg lru lock later.
> 
> Hugh Dickins <hughd@google.com> fixed following bugs in this patch's
> early version:
> 
> Fix lots of crashes under compaction load: isolate_migratepages_block()
> must clean up appropriately when rejecting a page, setting PageLRU again
> if it had been cleared; and a put_page() after get_page_unless_zero()
> cannot safely be done while holding locked_lruvec - it may turn out to
> be the final put_page(), which will take an lruvec lock when PageLRU.
> And move __isolate_lru_page_prepare back after get_page_unless_zero to
> make trylock_page() safe:
> trylock_page() is not safe to use at this time: its setting PG_locked
> can race with the page being freed or allocated ("Bad page"), and can
> also erase flags being set by one of those "sole owners" of a freshly
> allocated page who use non-atomic __SetPageFlag().
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Okay, whatever. I was about to say
Acked-by: Hugh Dickins <hughd@google.com>
With my signed-off-by there, someone will ask if it should say
"From: Hugh ..." at the top: no, it should not, this is Alex's patch,
but I proposed some fixes to it, as you already acknowledged.

A couple of comments below on the mm/vmscan.c part of it.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/swap.h |  2 +-
>  mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
>  mm/vmscan.c          | 46 ++++++++++++++++++++++++++--------------------
>  3 files changed, 60 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 43e6b3458f58..550fdfdc3506 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -357,7 +357,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
> -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  						  unsigned long nr_pages,
>  						  gfp_t gfp_mask,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4e2c66869041..253382d99969 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -887,6 +887,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
>  			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
>  				low_pfn = end_pfn;
> +				page = NULL;
>  				goto isolate_abort;
>  			}
>  			valid_page = page;
> @@ -968,6 +969,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>  			goto isolate_fail;
>  
> +		/*
> +		 * Be careful not to clear PageLRU until after we're
> +		 * sure the page is not being freed elsewhere -- the
> +		 * page release code relies on it.
> +		 */
> +		if (unlikely(!get_page_unless_zero(page)))
> +			goto isolate_fail;
> +
> +		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> +			goto isolate_fail_put;
> +
> +		/* Try isolate the page */
> +		if (!TestClearPageLRU(page))
> +			goto isolate_fail_put;
> +
>  		/* If we already hold the lock, we can skip some rechecking */
>  		if (!locked) {
>  			locked = compact_lock_irqsave(&pgdat->lru_lock,
> @@ -980,10 +996,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  					goto isolate_abort;
>  			}
>  
> -			/* Recheck PageLRU and PageCompound under lock */
> -			if (!PageLRU(page))
> -				goto isolate_fail;
> -
>  			/*
>  			 * Page become compound since the non-locked check,
>  			 * and it's on LRU. It can only be a THP so the order
> @@ -991,16 +1003,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  			 */
>  			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>  				low_pfn += compound_nr(page) - 1;
> -				goto isolate_fail;
> +				SetPageLRU(page);
> +				goto isolate_fail_put;
>  			}
>  		}
>  
>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  
> -		/* Try isolate the page */
> -		if (__isolate_lru_page(page, isolate_mode) != 0)
> -			goto isolate_fail;
> -
>  		/* The whole page is taken off the LRU; skip the tail pages. */
>  		if (PageCompound(page))
>  			low_pfn += compound_nr(page) - 1;
> @@ -1029,6 +1038,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		}
>  
>  		continue;
> +
> +isolate_fail_put:
> +		/* Avoid potential deadlock in freeing page under lru_lock */
> +		if (locked) {
> +			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +			locked = false;
> +		}
> +		put_page(page);
> +
>  isolate_fail:
>  		if (!skip_on_failure)
>  			continue;
> @@ -1065,9 +1083,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  	if (unlikely(low_pfn > end_pfn))
>  		low_pfn = end_pfn;
>  
> +	page = NULL;
> +
>  isolate_abort:
>  	if (locked)
>  		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +	if (page) {
> +		SetPageLRU(page);
> +		put_page(page);
> +	}
>  
>  	/*
>  	 * Updated the cached scanner pfn once the pageblock has been scanned
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1b3e0eeaad64..48b50695f883 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1538,20 +1538,20 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   *
>   * returns 0 on success, -ve errno on failure.
>   */
> -int __isolate_lru_page(struct page *page, isolate_mode_t mode)
> +int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
>  {
>  	int ret = -EINVAL;
>  
> -	/* Only take pages on the LRU. */
> -	if (!PageLRU(page))
> -		return ret;
> -
>  	/* Compaction should not handle unevictable pages but CMA can do so */
>  	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>  		return ret;
>  
>  	ret = -EBUSY;
>  
> +	/* Only take pages on the LRU. */
> +	if (!PageLRU(page))
> +		return ret;
> +

So here you do deal with that BUG() issue.  But I'd prefer you to leave
it as I suggested in 16/32, just start with "int ret = -EBUSY;" and
don't rearrange the checks here at all.  I say that partly because
the !PageLRU check is very important (when called for compaction), and
the easier it is to find (at the very start), the less anxious I get!

>  	/*
>  	 * To minimise LRU disruption, the caller can indicate that it only
>  	 * wants to isolate pages it will be able to operate on without
> @@ -1592,20 +1592,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>  	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
>  		return ret;
>  
> -	if (likely(get_page_unless_zero(page))) {
> -		/*
> -		 * Be careful not to clear PageLRU until after we're
> -		 * sure the page is not being freed elsewhere -- the
> -		 * page release code relies on it.
> -		 */
> -		ClearPageLRU(page);
> -		ret = 0;
> -	}
> -
> -	return ret;
> +	return 0;
>  }
>  
> -
>  /*
>   * Update LRU sizes after isolating pages. The LRU size updates must
>   * be complete before mem_cgroup_update_lru_size due to a sanity check.
> @@ -1685,17 +1674,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  		 * only when the page is being freed somewhere else.
>  		 */
>  		scan += nr_pages;
> -		switch (__isolate_lru_page(page, mode)) {
> +		switch (__isolate_lru_page_prepare(page, mode)) {
>  		case 0:
> +			/*
> +			 * Be careful not to clear PageLRU until after we're
> +			 * sure the page is not being freed elsewhere -- the
> +			 * page release code relies on it.
> +			 */
> +			if (unlikely(!get_page_unless_zero(page)))
> +				goto busy;
> +
> +			if (!TestClearPageLRU(page)) {
> +				/*
> +				 * This page may in other isolation path,
> +				 * but we still hold lru_lock.
> +				 */
> +				put_page(page);
> +				goto busy;
> +			}
> +
>  			nr_taken += nr_pages;
>  			nr_zone_taken[page_zonenum(page)] += nr_pages;
>  			list_move(&page->lru, dst);
>  			break;
> -
> +busy:
>  		case -EBUSY:

It's a long time since I read a C manual. I had to try that out in a
little test program: and it does seem to do the right thing.  Maybe
I'm just very ignorant, and everybody else finds that natural: but I'd
feel more comfortable with the busy label on the line after the
"case -EBUSY:" - wouldn't you?

You could, of course, change that "case -EBUSY" to "default",
and delete the "default: BUG();" that follows: whatever you prefer.

>  			/* else it is being freed elsewhere */
>  			list_move(&page->lru, src);
> -			continue;
> +			break;

Aha. Yes, I like that change, I'm not going to throw a tantrum,
accusing you of sneaking in unrelated changes etc. You made me look
back at the history: it was "continue" from back in the days of
lumpy reclaim, when there was stuff after the switch statement
which needed to be skipped in the -EBUSY case.  "break" looks
more natural to me now.

>  
>  		default:
>  			BUG();
> -- 
> 1.8.3.1


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  2020-08-24 12:54 ` [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn Alex Shi
@ 2020-09-22  0:42   ` Hugh Dickins
  2020-09-22  5:00     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-22  0:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Mon, 24 Aug 2020, Alex Shi wrote:

> Hugh Dickins' found a memcg change bug on original version:
> If we want to change the pgdat->lru_lock to memcg's lruvec lock, we have
> to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The
> possible bad scenario would like:
> 
> 	cpu 0					cpu 1
> lruvec = mem_cgroup_page_lruvec()
> 					if (!isolate_lru_page())
> 						mem_cgroup_move_account
> 
> spin_lock_irqsave(&lruvec->lru_lock <== wrong lock.
> 
> So we need the ClearPageLRU to block isolate_lru_page(), that serializes

s/the ClearPageLRU/TestClearPageLRU/

> the memcg change. and then removing the PageLRU check in move_fn callee
> as the consequence.

Deserves another paragraph about __pagevec_lru_add():
"__pagevec_lru_add_fn() is different from the others, because the pages
it deals with are, by definition, not yet on the lru.  TestClearPageLRU
is not needed and would not work, so __pagevec_lru_add() goes its own way."

> 
> Reported-by: Hugh Dickins <hughd@google.com>

True.

> Signed-off-by: Hugh Dickins <hughd@google.com>

I did provide some lines, but I think it's just
Acked-by: Hugh Dickins <hughd@google.com>
to go below your Signed-off-by.

> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/swap.c | 44 +++++++++++++++++++++++++++++++++++---------
>  1 file changed, 35 insertions(+), 9 deletions(-)

In your lruv19 branch, this patch got renamed (s/moveing/moving/):
but I think it's better with the old name used here in v18, and without
those mm/vmscan.c mods to check_move_unevictable_pages() tacked on:
please move those back to 16/32, which already makes changes to vmscan.c.

> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 446ffe280809..2d9a86bf93a4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -221,8 +221,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>  			spin_lock_irqsave(&pgdat->lru_lock, flags);
>  		}
>  
> +		/* block memcg migration during page moving between lru */
> +		if (!TestClearPageLRU(page))
> +			continue;
> +
>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  		(*move_fn)(page, lruvec);
> +
> +		SetPageLRU(page);
>  	}
>  	if (pgdat)
>  		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> @@ -232,7 +238,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>  
>  static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
>  {
> -	if (PageLRU(page) && !PageUnevictable(page)) {
> +	if (!PageUnevictable(page)) {
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
>  		ClearPageActive(page);
>  		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
> @@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page)
>  
>  static void __activate_page(struct page *page, struct lruvec *lruvec)
>  {
> -	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> +	if (!PageActive(page) && !PageUnevictable(page)) {
>  		int lru = page_lru_base_type(page);
>  		int nr_pages = thp_nr_pages(page);
>  
> @@ -362,7 +368,8 @@ void activate_page(struct page *page)
>  
>  	page = compound_head(page);
>  	spin_lock_irq(&pgdat->lru_lock);
> -	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
> +	if (PageLRU(page))
> +		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
>  	spin_unlock_irq(&pgdat->lru_lock);
>  }
>  #endif

Every time I look at this, I wonder if that's right, or an unnecessary
optimization strayed in, or whatever.  For the benefit of others looking
at this patch, yes it is right: this is the !CONFIG_SMP alternative
version of activate_page(), and needs that PageLRU check to compensate
for the check that has now been removed from __activate_page() itself.

> @@ -521,9 +528,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>  	bool active;
>  	int nr_pages = thp_nr_pages(page);
>  
> -	if (!PageLRU(page))
> -		return;
> -
>  	if (PageUnevictable(page))
>  		return;
>  
> @@ -564,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>  
>  static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>  {
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +	if (PageActive(page) && !PageUnevictable(page)) {
>  		int lru = page_lru_base_type(page);
>  		int nr_pages = thp_nr_pages(page);
>  
> @@ -581,7 +585,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>  
>  static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
>  {
> -	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	if (PageAnon(page) && PageSwapBacked(page) &&
>  	    !PageSwapCache(page) && !PageUnevictable(page)) {
>  		bool active = PageActive(page);
>  		int nr_pages = thp_nr_pages(page);
> @@ -979,7 +983,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>   */
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
> -	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
> +	int i;
> +	struct pglist_data *pgdat = NULL;
> +	struct lruvec *lruvec;
> +	unsigned long flags = 0;
> +
> +	for (i = 0; i < pagevec_count(pvec); i++) {
> +		struct page *page = pvec->pages[i];
> +		struct pglist_data *pagepgdat = page_pgdat(page);
> +
> +		if (pagepgdat != pgdat) {
> +			if (pgdat)
> +				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +			pgdat = pagepgdat;
> +			spin_lock_irqsave(&pgdat->lru_lock, flags);
> +		}
> +
> +		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +		__pagevec_lru_add_fn(page, lruvec);
> +	}
> +	if (pgdat)
> +		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +	release_pages(pvec->pages, pvec->nr);
> +	pagevec_reinit(pvec);
>  }
>  
>  /**
> -- 
> 1.8.3.1


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 15/32] mm/lru: move lock into lru_note_cost
  2020-09-21 21:36   ` Hugh Dickins
  2020-09-21 22:03     ` Hugh Dickins
@ 2020-09-22  3:38     ` Alex Shi
  1 sibling, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-22  3:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/9/22 5:36 AM, Hugh Dickins wrote:
> 
>> We have to move lru_lock into lru_note_cost, since it cycle up on memcg
>> tree, for future per lruvec lru_lock replace. It's a bit ugly and may
>> cost a bit more locking, but benefit from multiple memcg locking could
>> cover the lost.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Acked-by: Hugh Dickins <hughd@google.com>

Thanks!



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 15/32] mm/lru: move lock into lru_note_cost
  2020-09-21 22:03     ` Hugh Dickins
@ 2020-09-22  3:39       ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-22  3:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/9/22 6:03 AM, Hugh Dickins wrote:
>> Acked-by: Hugh Dickins <hughd@google.com>
>>
>> In your lruv19 github tree, you have merged 14/32 into this one: thanks.
> Grr, I've only just started, and already missed some of my notes.
> 
> I wanted to point out that this patch does introduce an extra unlock+lock
> in shrink_inactive_list(), even in a !CONFIG_MEMCG build.  I think you've
> done the right thing for now, keeping it simple, and maybe nobody will
> notice the extra overhead; but I expect us to replace lru_note_cost()
> by lru_note_cost_unlock_irq() later on, expecting the caller to do the
> initial lock_irq().
> 
> lru_note_cost_page() looks redundant to me, but you're right not to
> delete it here, unless Johannes asks you to add that in: that's his
> business, and it may be dependent on the XXX at its callsite.
> 

Thanks for the comments! I got your point, so I will leave this patch alone.

Thanks!


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU
  2020-09-21 23:16   ` Hugh Dickins
@ 2020-09-22  3:53     ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-22  3:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko



On 2020/9/22 7:16 AM, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Alex Shi wrote:
> 
>> Currently lru_lock still guards both lru list and page's lru bit, that's
>> ok. but if we want to use specific lruvec lock on the page, we need to
>> pin down the page's lruvec/memcg during locking. Just taking lruvec
>> lock first may be undermined by the page's memcg charge/migration. To
>> fix this problem, we could clear the lru bit out of locking and use
>> it as pin down action to block the page isolation in memcg changing.
>>
>> So now a standard steps of page isolation is following:
>> 	1, get_page(); 	       #pin the page avoid to be free
>> 	2, TestClearPageLRU(); #block other isolation like memcg change
>> 	3, spin_lock on lru_lock; #serialize lru list access
>> 	4, delete page from lru list;
>> The step 2 could be optimzed/replaced in scenarios which page is
>> unlikely be accessed or be moved between memcgs.
>>
>> This patch start with the first part: TestClearPageLRU, which combines
>> PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This
>> function will be used as page isolation precondition to prevent other
>> isolations some where else. Then there are may !PageLRU page on lru
>> list, need to remove BUG() checking accordingly.
>>
>> There 2 rules for lru bit now:
>> 1, the lru bit still indicate if a page on lru list, just in some
>>    temporary moment(isolating), the page may have no lru bit when
>>    it's on lru list.  but the page still must be on lru list when the
>>    lru bit set.
>> 2, have to remove lru bit before delete it from lru list.
>>
>> Hugh Dickins pointed that when a page is in free path and no one is
>> possible to take it, non atomic lru bit clearing is better, like in
>> __page_cache_release and release_pages.
>> And no need get_page() before lru bit clear in isolate_lru_page,
>> since it '(1) Must be called with an elevated refcount on the page'.
> 
> Delete that paragraph: you're justifying changes made during the
> course of earlier review, but not needed here.  If we start to
> comment on everything that is not done...!
> 

Will delete it!

>>
>> As Andrew Morton mentioned this change would dirty cacheline for page
>> isn't on LRU. But the lost would be acceptable with Rong Chen
>> <rong.a.chen@intel.com> report:
>> https://lkml.org/lkml/2020/3/4/173
> 
> Please use a lore link instead, lkml.org is nice but unreliable:
> https://lore.kernel.org/lkml/20200304090301.GB5972@shao2-debian/

Yes, will replace the link.

> 
>>
>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> when you make the changes suggested above and below.

Thanks!

> 
> I still have long-standing reservations about this TestClearPageLRU
> technique (it's hard to reason about, and requires additional atomic ops
> in some places); but it's working, so I'd like it to go in, then later
> we can experiment with whether lock_page_memcg() does a better job, or
> rechecking memcg when getting the lru_lock (my original technique).
> 
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: linux-kernel@vger.kernel.org
>> Cc: cgroups@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> ---
>>  include/linux/page-flags.h |  1 +
>>  mm/mlock.c                 |  3 +--
>>  mm/swap.c                  |  5 ++---
>>  mm/vmscan.c                | 18 +++++++-----------
>>  4 files changed, 11 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index 6be1aa559b1e..9554ed1387dc 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size)
>>  PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
>>  	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
>>  PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
>> +	TESTCLEARFLAG(LRU, lru, PF_HEAD)
>>  PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
>>  	TESTCLEARFLAG(Active, active, PF_HEAD)
>>  PAGEFLAG(Workingset, workingset, PF_HEAD)
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index 93ca2bf30b4f..3762d9dd5b31 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -107,13 +107,12 @@ void mlock_vma_page(struct page *page)
>>   */
>>  static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>>  {
>> -	if (PageLRU(page)) {
>> +	if (TestClearPageLRU(page)) {
>>  		struct lruvec *lruvec;
>>  
>>  		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>>  		if (getpage)
>>  			get_page(page);
>> -		ClearPageLRU(page);
>>  		del_page_from_lru_list(page, lruvec, page_lru(page));
>>  		return true;
>>  	}
>> diff --git a/mm/swap.c b/mm/swap.c
>> index f80ccd6f3cb4..446ffe280809 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
>>  		struct lruvec *lruvec;
>>  		unsigned long flags;
>>  
>> +		__ClearPageLRU(page);
>>  		spin_lock_irqsave(&pgdat->lru_lock, flags);
>>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> -		VM_BUG_ON_PAGE(!PageLRU(page), page);
>> -		__ClearPageLRU(page);
>>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
>>  		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>>  	}
>> @@ -880,9 +879,9 @@ void release_pages(struct page **pages, int nr)
>>  				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>>  			}
>>  
>> -			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>>  			VM_BUG_ON_PAGE(!PageLRU(page), page);
>>  			__ClearPageLRU(page);
>> +			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>>  			del_page_from_lru_list(page, lruvec, page_off_lru(page));
>>  		}
>>  
> 
> Please delete all those mods to mm/swap.c from this patch.  This patch
> is about introducing TestClearPageLRU, but that is not involved here.
> Several versions ago, yes it was, then I pointed out that these are
> operations on refcount 0 pages, and we don't want to add unnecessary
> atomic operations on them.  I expect you want to keep the rearrangements,
> but do them where you need them later (I expect that's in 20/32).

When I look into the 20th patch, which replaces the lru_lock, it seems this
change doesn't belong there either. And I am trying to keep extra code changes
out of the 20th patch, since it is already big enough; more churn there makes
it harder to bisect if anything goes wrong.

So the same dilemma applies to this patch. For the sake of bisection, might it
be better to split this part out?

Thanks!

> 
> And I notice that one VM_BUG_ON_PAGE was kept and the other deleted:
> though one can certainly argue that they're redundant (as all BUGs
> should be), I think most people will feel safer to keep them both.

Right, will keep the BUG check here.

> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 7b7b36bd1448..1b3e0eeaad64 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1665,8 +1665,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>>  		page = lru_to_page(src);
>>  		prefetchw_prev_lru_page(page, src, flags);
>>  
>> -		VM_BUG_ON_PAGE(!PageLRU(page), page);
>> -
>>  		nr_pages = compound_nr(page);
>>  		total_scan += nr_pages;
>>  
> 
> It is not enough to remove just that one VM_BUG_ON_PAGE there.
> This is a patch series, and we don't need it to be perfect at every
> bisection point between patches, but we do need it to be reasonably
> robust, so as not to waste unrelated bughunters' time.  It didn't
> take me very long to crash on the "default: BUG()" further down
> isolate_lru_pages(), because now PageLRU may get cleared at any
> instant, whatever locks are held.
> 
> (But you're absolutely right to leave the compaction and pagevec
> mods to subsequent patches: it's fairly safe to separate those out,
> and much easier for reviewers that you did so.)
> 
> This patch is much more robust with __isolate_lru_page() mods below
> on top.  I agree there's other ways to do it, but given that nobody
> cares what the error return is from __isolate_lru_page(), except for
> the isolate_lru_pages() switch statement BUG() which has become
> invalid, I suggest just use -EBUSY throughout __isolate_lru_page().
> Yes, we can and should change that switch statement to an
> "if {} else {}" without any BUG(), but I don't want to mess
> you around at this time, leave cleanup like that until later.
> Please fold in this patch on top:
> 

Thanks a lot! Will merge it.
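
(And for the record, the "if {} else {}" cleanup you suggest leaving for later
would look roughly like this in isolate_lru_pages() -- a sketch only, separate
from the fold-in patch quoted below:)

	if (__isolate_lru_page(page, mode) == 0) {
		nr_taken += nr_pages;
		nr_zone_taken[page_zonenum(page)] += nr_pages;
		list_move(&page->lru, dst);
	} else {
		/* -EBUSY: the page is being freed or isolated elsewhere */
		list_move(&page->lru, src);
	}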

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1540,7 +1540,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   */
>  int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>  {
> -	int ret = -EINVAL;
> +	int ret = -EBUSY;
>  
>  	/* Only take pages on the LRU. */
>  	if (!PageLRU(page))
> @@ -1550,8 +1550,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>  	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>  		return ret;
>  
> -	ret = -EBUSY;
> -
>  	/*
>  	 * To minimise LRU disruption, the caller can indicate that it only
>  	 * wants to isolate pages it will be able to operate on without
> @@ -1598,8 +1596,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>  		 * sure the page is not being freed elsewhere -- the
>  		 * page release code relies on it.
>  		 */
> -		ClearPageLRU(page);
> -		ret = 0;
> +		if (TestClearPageLRU(page))
> +			ret = 0;
> +		else
> +			put_page(page);
>  	}

This code will be removed again in the next patch, but it is better to have it here for now.
Thanks!

>  
>  	return ret;
> 
>> @@ -1763,21 +1761,19 @@ int isolate_lru_page(struct page *page)
>>  	VM_BUG_ON_PAGE(!page_count(page), page);
>>  	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>>  
>> -	if (PageLRU(page)) {
>> +	if (TestClearPageLRU(page)) {
>>  		pg_data_t *pgdat = page_pgdat(page);
>>  		struct lruvec *lruvec;
>> +		int lru = page_lru(page);
>>  
>> -		spin_lock_irq(&pgdat->lru_lock);
>> +		get_page(page);
>>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> -		if (PageLRU(page)) {
>> -			int lru = page_lru(page);
>> -			get_page(page);
>> -			ClearPageLRU(page);
>> -			del_page_from_lru_list(page, lruvec, lru);
>> -			ret = 0;
>> -		}
>> +		spin_lock_irq(&pgdat->lru_lock);
>> +		del_page_from_lru_list(page, lruvec, lru);
>>  		spin_unlock_irq(&pgdat->lru_lock);
>> +		ret = 0;
>>  	}
>> +
>>  	return ret;
>>  }
> 
> And a small mod to isolate_lru_page() to be folded in.  I had
> never noticed this before, but here you are evaluating page_lru()
> after clearing PageLRU, but before getting lru_lock: that seems unsafe.
> I'm pretty sure it's unsafe at this stage of the series; I did once
> persuade myself that it becomes safe by the end of the series,
> but I've already forgotten the argument for that (I have already
> said TestClearPageLRU is difficult to reason about).  Please don't
> force us to have to think about this! Just get page_lru after lru_lock.
> 
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1764,12 +1764,11 @@ int isolate_lru_page(struct page *page)
>  	if (TestClearPageLRU(page)) {
>  		pg_data_t *pgdat = page_pgdat(page);
>  		struct lruvec *lruvec;
> -		int lru = page_lru(page);
>  
>  		get_page(page);
>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  		spin_lock_irq(&pgdat->lru_lock);
> -		del_page_from_lru_list(page, lruvec, lru);
> +		del_page_from_lru_list(page, lruvec, page_lru(page));
>  		spin_unlock_irq(&pgdat->lru_lock);
>  		ret = 0;
>  	}
> 

Taken, thanks!

> And lastly, please do check_move_unevictable_pages()'s TestClearPageLRU
> mod here at the end of mm/vmscan.c in this patch: I noticed that your
> lruv19 branch is doing it in a later patch, but it fits better here.
> 

Will move that part of the change here.
Thanks!
Alex



* Re: [PATCH v18 17/32] mm/compaction: do page isolation first in compaction
  2020-09-21 23:49   ` Hugh Dickins
@ 2020-09-22  4:57     ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-22  4:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/9/22 7:49 AM, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Alex Shi wrote:
> 
>> Currently, compaction takes the lru_lock and then does page isolation,
>> which works fine with pgdat->lru_lock, since any page isolation would
>> compete for the lru_lock. If we want to change to the memcg lru_lock, we
>> have to isolate the page before getting the lru_lock; thus isolation will
>> block the page's memcg change, which relies on page isolation too. Then we
>> can safely use the per memcg lru_lock later.
>>
>> The new page isolation uses the previously introduced TestClearPageLRU() +
>> pgdat lru locking, which will be changed to the memcg lru lock later.
>>
>> Hugh Dickins <hughd@google.com> fixed following bugs in this patch's
>> early version:
>>
>> Fix lots of crashes under compaction load: isolate_migratepages_block()
>> must clean up appropriately when rejecting a page, setting PageLRU again
>> if it had been cleared; and a put_page() after get_page_unless_zero()
>> cannot safely be done while holding locked_lruvec - it may turn out to
>> be the final put_page(), which will take an lruvec lock when PageLRU.
>> And move __isolate_lru_page_prepare back after get_page_unless_zero to
>> make trylock_page() safe:
>> trylock_page() is not safe to use at this time: its setting PG_locked
>> can race with the page being freed or allocated ("Bad page"), and can
>> also erase flags being set by one of those "sole owners" of a freshly
>> allocated page who use non-atomic __SetPageFlag().
>>
>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Hugh Dickins <hughd@google.com>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> 
> Okay, whatever. I was about to say
> Acked-by: Hugh Dickins <hughd@google.com>

Thanks!

> With my signed-off-by there, someone will ask if it should say
> "From: Hugh ..." at the top: no, it should not, this is Alex's patch,
> but I proposed some fixes to it, as you already acknowledged.

I guess you would prefer your Signed-off-by to be removed here, then?

> 
> A couple of comments below on the mm/vmscan.c part of it.
> 
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> ---
>>  include/linux/swap.h |  2 +-
>>  mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
>>  mm/vmscan.c          | 46 ++++++++++++++++++++++++++--------------------
>>  3 files changed, 60 insertions(+), 30 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 43e6b3458f58..550fdfdc3506 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -357,7 +357,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
>>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>>  					gfp_t gfp_mask, nodemask_t *mask);
>> -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
>> +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
>>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>>  						  unsigned long nr_pages,
>>  						  gfp_t gfp_mask,
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 4e2c66869041..253382d99969 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -887,6 +887,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>  		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
>>  			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
>>  				low_pfn = end_pfn;
>> +				page = NULL;
>>  				goto isolate_abort;
>>  			}
>>  			valid_page = page;
>> @@ -968,6 +969,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>  		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>>  			goto isolate_fail;
>>  
>> +		/*
>> +		 * Be careful not to clear PageLRU until after we're
>> +		 * sure the page is not being freed elsewhere -- the
>> +		 * page release code relies on it.
>> +		 */
>> +		if (unlikely(!get_page_unless_zero(page)))
>> +			goto isolate_fail;
>> +
>> +		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>> +			goto isolate_fail_put;
>> +
>> +		/* Try isolate the page */
>> +		if (!TestClearPageLRU(page))
>> +			goto isolate_fail_put;
>> +
>>  		/* If we already hold the lock, we can skip some rechecking */
>>  		if (!locked) {
>>  			locked = compact_lock_irqsave(&pgdat->lru_lock,
>> @@ -980,10 +996,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>  					goto isolate_abort;
>>  			}
>>  
>> -			/* Recheck PageLRU and PageCompound under lock */
>> -			if (!PageLRU(page))
>> -				goto isolate_fail;
>> -
>>  			/*
>>  			 * Page become compound since the non-locked check,
>>  			 * and it's on LRU. It can only be a THP so the order
>> @@ -991,16 +1003,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>  			 */
>>  			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>>  				low_pfn += compound_nr(page) - 1;
>> -				goto isolate_fail;
>> +				SetPageLRU(page);
>> +				goto isolate_fail_put;
>>  			}
>>  		}
>>  
>>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>>  
>> -		/* Try isolate the page */
>> -		if (__isolate_lru_page(page, isolate_mode) != 0)
>> -			goto isolate_fail;
>> -
>>  		/* The whole page is taken off the LRU; skip the tail pages. */
>>  		if (PageCompound(page))
>>  			low_pfn += compound_nr(page) - 1;
>> @@ -1029,6 +1038,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>  		}
>>  
>>  		continue;
>> +
>> +isolate_fail_put:
>> +		/* Avoid potential deadlock in freeing page under lru_lock */
>> +		if (locked) {
>> +			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>> +			locked = false;
>> +		}
>> +		put_page(page);
>> +
>>  isolate_fail:
>>  		if (!skip_on_failure)
>>  			continue;
>> @@ -1065,9 +1083,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>  	if (unlikely(low_pfn > end_pfn))
>>  		low_pfn = end_pfn;
>>  
>> +	page = NULL;
>> +
>>  isolate_abort:
>>  	if (locked)
>>  		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>> +	if (page) {
>> +		SetPageLRU(page);
>> +		put_page(page);
>> +	}
>>  
>>  	/*
>>  	 * Updated the cached scanner pfn once the pageblock has been scanned
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 1b3e0eeaad64..48b50695f883 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1538,20 +1538,20 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>>   *
>>   * returns 0 on success, -ve errno on failure.
>>   */
>> -int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>> +int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
>>  {
>>  	int ret = -EINVAL;
>>  
>> -	/* Only take pages on the LRU. */
>> -	if (!PageLRU(page))
>> -		return ret;
>> -
>>  	/* Compaction should not handle unevictable pages but CMA can do so */
>>  	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>>  		return ret;
>>  
>>  	ret = -EBUSY;
>>  
>> +	/* Only take pages on the LRU. */
>> +	if (!PageLRU(page))
>> +		return ret;
>> +
> 
> So here you do deal with that BUG() issue.  But I'd prefer you to leave
> it as I suggested in 16/32, just start with "int ret = -EBUSY;" and
> don't rearrange the checks here at all.  I say that partly because
> the !PageLRU check is very important (when called for compaction), and
> the easier it is to find (at the very start), the less anxious I get!

Yes, I have done it as you suggested.

> 
>>  	/*
>>  	 * To minimise LRU disruption, the caller can indicate that it only
>>  	 * wants to isolate pages it will be able to operate on without
>> @@ -1592,20 +1592,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>>  	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
>>  		return ret;
>>  
>> -	if (likely(get_page_unless_zero(page))) {
>> -		/*
>> -		 * Be careful not to clear PageLRU until after we're
>> -		 * sure the page is not being freed elsewhere -- the
>> -		 * page release code relies on it.
>> -		 */
>> -		ClearPageLRU(page);
>> -		ret = 0;
>> -	}
>> -
>> -	return ret;
>> +	return 0;
>>  }
>>  
>> -
>>  /*
>>   * Update LRU sizes after isolating pages. The LRU size updates must
>>   * be complete before mem_cgroup_update_lru_size due to a sanity check.
>> @@ -1685,17 +1674,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>>  		 * only when the page is being freed somewhere else.
>>  		 */
>>  		scan += nr_pages;
>> -		switch (__isolate_lru_page(page, mode)) {
>> +		switch (__isolate_lru_page_prepare(page, mode)) {
>>  		case 0:
>> +			/*
>> +			 * Be careful not to clear PageLRU until after we're
>> +			 * sure the page is not being freed elsewhere -- the
>> +			 * page release code relies on it.
>> +			 */
>> +			if (unlikely(!get_page_unless_zero(page)))
>> +				goto busy;
>> +
>> +			if (!TestClearPageLRU(page)) {
>> +				/*
>> +				 * This page may in other isolation path,
>> +				 * but we still hold lru_lock.
>> +				 */
>> +				put_page(page);
>> +				goto busy;
>> +			}
>> +
>>  			nr_taken += nr_pages;
>>  			nr_zone_taken[page_zonenum(page)] += nr_pages;
>>  			list_move(&page->lru, dst);
>>  			break;
>> -
>> +busy:
>>  		case -EBUSY:
> 
> It's a long time since I read a C manual. I had to try that out in a
> little test program: and it does seem to do the right thing.  Maybe
> I'm just very ignorant, and everybody else finds that natural: but I'd
> feel more comfortable with the busy label on the line after the
> "case -EBUSY:" - wouldn't you?

Will move it down. Thanks!
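
(For anyone else who, like Hugh, wants to try it out: a plain label placed
immediately before a case label just labels the same statement, so the goto
and the case share one target. A tiny standalone test program, nothing to do
with the kernel sources:)

	#include <errno.h>
	#include <stdio.h>

	static const char *classify(int ret, int lru_cleared)
	{
		const char *what = "bug";

		switch (ret) {
		case 0:
			if (!lru_cleared)
				goto busy;	/* raced with another isolation */
			what = "isolated";
			break;
	busy:
		case -EBUSY:
			what = "busy, put back on lru";
			break;
		default:
			break;
		}
		return what;
	}

	int main(void)
	{
		printf("%s\n", classify(0, 1));		/* isolated */
		printf("%s\n", classify(0, 0));		/* busy, put back on lru */
		printf("%s\n", classify(-EBUSY, 1));	/* busy, put back on lru */
		return 0;
	}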

> 
> You could, of course, change that "case -EBUSY" to "default",
> and delete the "default: BUG();" that follows: whatever you prefer.
> 

Yes, 'default' is enough after the last patch's change.

>>  			/* else it is being freed elsewhere */
>>  			list_move(&page->lru, src);
>> -			continue;
>> +			break;
> 
> Aha. Yes, I like that change, I'm not going to throw a tantrum,
> accusing you of sneaking in unrelated changes etc. You made me look
> back at the history: it was "continue" from back in the days of
> lumpy reclaim, when there was stuff after the switch statement
> which needed to be skipped in the -EBUSY case.  "break" looks
> more natural to me now.

Thanks!
With the above 'default' change, the 'break' can be dropped in the end. :)

Thanks!

> 
>>  
>>  		default:
>>  			BUG();
>> -- 
>> 1.8.3.1



* Re: [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  2020-09-22  0:42   ` Hugh Dickins
@ 2020-09-22  5:00     ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-22  5:00 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/9/22 8:42 AM, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Alex Shi wrote:
> 
>> Hugh Dickins found a memcg change bug in the original version:
>> If we want to change the pgdat->lru_lock to memcg's lruvec lock, we have
>> to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The
>> possible bad scenario would look like:
>>
>> 	cpu 0					cpu 1
>> lruvec = mem_cgroup_page_lruvec()
>> 					if (!isolate_lru_page())
>> 						mem_cgroup_move_account
>>
>> spin_lock_irqsave(&lruvec->lru_lock <== wrong lock.
>>
>> So we need the ClearPageLRU to block isolate_lru_page(), that serializes
> 
> s/the ClearPageLRU/TestClearPageLRU/

Thanks, will change it.

> 
>> the memcg change. and then removing the PageLRU check in move_fn callee
>> as the consequence.
> 
> Deserves another paragraph about __pagevec_lru_add():
> "__pagevec_lru_add_fn() is different from the others, because the pages
> it deals with are, by definition, not yet on the lru.  TestClearPageLRU
> is not needed and would not work, so __pagevec_lru_add() goes its own way."

Thanks for the comments! Will add it to the new commit log.
> 
>>
>> Reported-by: Hugh Dickins <hughd@google.com>
> 
> True.
> 
>> Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> I did provide some lines, but I think it's just
> Acked-by: Hugh Dickins <hughd@google.com>
> to go below your Signed-off-by.

Thanks!
> 
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  mm/swap.c | 44 +++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 35 insertions(+), 9 deletions(-)
> 
> In your lruv19 branch, this patch got renamed (s/moveing/moving/):
> but I think it's better with the old name used here in v18, and without
> those mm/vmscan.c mods to check_move_unevictable_pages() tacked on:
> please move those back to 16/32, which already makes changes to vmscan.c.
> 

Yes, will move that part there.
Thanks!
Alex

>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index 446ffe280809..2d9a86bf93a4 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -221,8 +221,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>>  			spin_lock_irqsave(&pgdat->lru_lock, flags);
>>  		}
>>  
>> +		/* block memcg migration during page moving between lru */
>> +		if (!TestClearPageLRU(page))
>> +			continue;
>> +
>>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>>  		(*move_fn)(page, lruvec);
>> +
>> +		SetPageLRU(page);
>>  	}
>>  	if (pgdat)
>>  		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>> @@ -232,7 +238,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>>  
>>  static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
>>  {
>> -	if (PageLRU(page) && !PageUnevictable(page)) {
>> +	if (!PageUnevictable(page)) {
>>  		del_page_from_lru_list(page, lruvec, page_lru(page));
>>  		ClearPageActive(page);
>>  		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
>> @@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page)
>>  
>>  static void __activate_page(struct page *page, struct lruvec *lruvec)
>>  {
>> -	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
>> +	if (!PageActive(page) && !PageUnevictable(page)) {
>>  		int lru = page_lru_base_type(page);
>>  		int nr_pages = thp_nr_pages(page);
>>  
>> @@ -362,7 +368,8 @@ void activate_page(struct page *page)
>>  
>>  	page = compound_head(page);
>>  	spin_lock_irq(&pgdat->lru_lock);
>> -	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
>> +	if (PageLRU(page))
>> +		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
>>  	spin_unlock_irq(&pgdat->lru_lock);
>>  }
>>  #endif
> 
> Every time I look at this, I wonder if that's right, or an unnecessary
> optimization strayed in, or whatever.  For the benefit of others looking
> at this patch, yes it is right: this is the !CONFIG_SMP alternative
> version of activate_page(), and needs that PageLRU check to compensate
> for the check that has now been removed from __activate_page() itself.
> 
>> @@ -521,9 +528,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>>  	bool active;
>>  	int nr_pages = thp_nr_pages(page);
>>  
>> -	if (!PageLRU(page))
>> -		return;
>> -
>>  	if (PageUnevictable(page))
>>  		return;
>>  
>> @@ -564,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
>>  
>>  static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>>  {
>> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
>> +	if (PageActive(page) && !PageUnevictable(page)) {
>>  		int lru = page_lru_base_type(page);
>>  		int nr_pages = thp_nr_pages(page);
>>  
>> @@ -581,7 +585,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
>>  
>>  static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
>>  {
>> -	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
>> +	if (PageAnon(page) && PageSwapBacked(page) &&
>>  	    !PageSwapCache(page) && !PageUnevictable(page)) {
>>  		bool active = PageActive(page);
>>  		int nr_pages = thp_nr_pages(page);
>> @@ -979,7 +983,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>>   */
>>  void __pagevec_lru_add(struct pagevec *pvec)
>>  {
>> -	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
>> +	int i;
>> +	struct pglist_data *pgdat = NULL;
>> +	struct lruvec *lruvec;
>> +	unsigned long flags = 0;
>> +
>> +	for (i = 0; i < pagevec_count(pvec); i++) {
>> +		struct page *page = pvec->pages[i];
>> +		struct pglist_data *pagepgdat = page_pgdat(page);
>> +
>> +		if (pagepgdat != pgdat) {
>> +			if (pgdat)
>> +				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>> +			pgdat = pagepgdat;
>> +			spin_lock_irqsave(&pgdat->lru_lock, flags);
>> +		}
>> +
>> +		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> +		__pagevec_lru_add_fn(page, lruvec);
>> +	}
>> +	if (pgdat)
>> +		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>> +	release_pages(pvec->pages, pvec->nr);
>> +	pagevec_reinit(pvec);
>>  }
>>  
>>  /**
>> -- 
>> 1.8.3.1



* Re: [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-08-24 12:54 ` [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2020-09-22  5:27   ` Hugh Dickins
  2020-09-22  8:58     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-22  5:27 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Michal Hocko,
	Yang Shi

On Mon, 24 Aug 2020, Alex Shi wrote:

> This patch moves the per node lru_lock into the lruvec, thus bringing a
> lru_lock for each memcg per node. So on a large machine, each memcg doesn't
> have to suffer from per node pgdat->lru_lock competition; they can go
> fast with their own lru_lock.
> 
> After moving the memcg charge to before lru inserting, page isolation can
> serialize the page's memcg, so the per memcg lruvec lock is stable and can
> replace the per node lru lock.
> 
> In func isolate_migratepages_block, compact_unlock_should_abort is
> opend, and lock_page_lruvec logical is embedded for tight process.

Hard to understand: perhaps:

In func isolate_migratepages_block, compact_unlock_should_abort and
lock_page_lruvec_irqsave are open coded to work with compact_control.

> Also add a debug func in the locking path which may give some clues if
> something gets out of hand.
> 
> According to Daniel Jordan's suggestion, I run 208 'dd' with on 104
> containers on a 2s * 26cores * HT box with a modefied case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

s/modefied/modified/
lruv19 has an lkml.org link there, please substitute
https://lore.kernel.org/lkml/01ed6e45-3853-dcba-61cb-b429a49a7572@linux.alibaba.com/

> 
> With this and later patches, the readtwice performance increases
> about 80% within concurrent containers.
> 
> On a large machine with memcg enabled but not used, looking up the page's
> lruvec goes through a few extra pointers, which may increase lru_lock
> holding time and cause a bit of regression.
> 
> Hugh Dickins helped on patch polish, thanks!
> 
> Reported-by: kernel test robot <lkp@intel.com>

Eh? It may have reported some locking bugs somewhere, but this
is the main patch of your per-memcg lru_lock: I don't think the
kernel test robot inspired your whole design, did it?  Delete that.


> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

I can't quite Ack this one yet, because there are several functions
(mainly __munlock_pagevec and check_move_unevictable_pages) which are
not right in this v18 version, and a bit tricky to correct: I already
suggested what to do in other mail, but this patch comes before
relock_page_lruvec, so must look different from the final result;
I need to look at a later version, perhaps already there in your
github tree, before I can Ack: but it's not far off.
Comments below.

> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: cgroups@vger.kernel.org
> ---
>  include/linux/memcontrol.h |  58 +++++++++++++++++++++++++
>  include/linux/mmzone.h     |   2 +
>  mm/compaction.c            |  56 +++++++++++++++---------
>  mm/huge_memory.c           |  11 ++---
>  mm/memcontrol.c            |  60 +++++++++++++++++++++++++-
>  mm/mlock.c                 |  47 +++++++++++++-------
>  mm/mmzone.c                |   1 +
>  mm/swap.c                  | 105 +++++++++++++++++++++------------------------
>  mm/vmscan.c                |  70 +++++++++++++++++-------------
>  9 files changed, 279 insertions(+), 131 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d0b036123c6a..7b170e9028b5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -494,6 +494,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>  
>  struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
>  
> +struct lruvec *lock_page_lruvec(struct page *page);
> +struct lruvec *lock_page_lruvec_irq(struct page *page);
> +struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +						unsigned long *flags);
> +
> +#ifdef CONFIG_DEBUG_VM
> +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
> +#else
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
> +#endif
> +
>  static inline
>  struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>  	return css ? container_of(css, struct mem_cgroup, css) : NULL;
> @@ -1035,6 +1048,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
>  {
>  }
>  
> +static inline struct lruvec *lock_page_lruvec(struct page *page)
> +{
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	spin_lock(&pgdat->__lruvec.lru_lock);
> +	return &pgdat->__lruvec;
> +}
> +
> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
> +{
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	spin_lock_irq(&pgdat->__lruvec.lru_lock);
> +	return &pgdat->__lruvec;
> +}
> +
> +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +		unsigned long *flagsp)
> +{
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
> +	return &pgdat->__lruvec;
> +}
> +
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>  		struct mem_cgroup *prev,
> @@ -1282,6 +1320,10 @@ static inline void count_memcg_page_event(struct page *page,
>  void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
>  #endif /* CONFIG_MEMCG */
>  
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> @@ -1411,6 +1453,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
>  	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
>  }
>  
> +static inline void unlock_page_lruvec(struct lruvec *lruvec)
> +{
> +	spin_unlock(&lruvec->lru_lock);
> +}
> +
> +static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
> +{
> +	spin_unlock_irq(&lruvec->lru_lock);
> +}
> +
> +static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
> +		unsigned long flags)
> +{
> +	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> +}
> +

I may have trouble deciding when to use the unlock_page_lruvec
wrapper and when to use the direct spin_unlock: but your choices
throughout looked sensible to me.
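
(For readers following along: the later hunks keep repeating one pattern --
look up the page's lruvec, and only drop/retake the lock if it differs from
the one already held. Written out once as a sketch built only from the helpers
added above; a dedicated relock_page_lruvec() helper only arrives in a later
patch of the series:)

	/* sketch only: make 'locked' point at (and hold) page's lruvec lock */
	static struct lruvec *relock_sketch_irq(struct page *page,
						struct lruvec *locked)
	{
		struct lruvec *new_lruvec;

		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		if (new_lruvec != locked) {
			if (locked)
				unlock_page_lruvec_irq(locked);
			locked = lock_page_lruvec_irq(page);
		}
		return locked;
	}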

>  #ifdef CONFIG_CGROUP_WRITEBACK
>  
>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8379432f4f2f..27a1513a43fc 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -273,6 +273,8 @@ enum lruvec_flags {
>  };
>  
>  struct lruvec {
> +	/* per lruvec lru_lock for memcg */
> +	spinlock_t			lru_lock;
>  	struct list_head		lists[NR_LRU_LISTS];
>  	/*
>  	 * These track the cost of reclaiming one LRU - file or anon -
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 253382d99969..b724eacf6421 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -805,7 +805,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  	unsigned long nr_scanned = 0, nr_isolated = 0;
>  	struct lruvec *lruvec;
>  	unsigned long flags = 0;
> -	bool locked = false;
> +	struct lruvec *locked = NULL;
>  	struct page *page = NULL, *valid_page = NULL;
>  	unsigned long start_pfn = low_pfn;
>  	bool skip_on_failure = false;
> @@ -865,11 +865,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		 * contention, to give chance to IRQs. Abort completely if
>  		 * a fatal signal is pending.
>  		 */
> -		if (!(low_pfn % SWAP_CLUSTER_MAX)
> -		    && compact_unlock_should_abort(&pgdat->lru_lock,
> -					    flags, &locked, cc)) {
> -			low_pfn = 0;
> -			goto fatal_pending;
> +		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
> +			if (locked) {
> +				unlock_page_lruvec_irqrestore(locked, flags);
> +				locked = NULL;
> +			}
> +
> +			if (fatal_signal_pending(current)) {
> +				cc->contended = true;
> +
> +				low_pfn = 0;
> +				goto fatal_pending;
> +			}
> +
> +			cond_resched();
>  		}
>  
>  		if (!pfn_valid_within(low_pfn))
> @@ -941,9 +950,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  			if (unlikely(__PageMovable(page)) &&
>  					!PageIsolated(page)) {
>  				if (locked) {
> -					spin_unlock_irqrestore(&pgdat->lru_lock,
> -									flags);
> -					locked = false;
> +					unlock_page_lruvec_irqrestore(locked, flags);
> +					locked = NULL;
>  				}
>  
>  				if (!isolate_movable_page(page, isolate_mode))
> @@ -984,10 +992,19 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		if (!TestClearPageLRU(page))
>  			goto isolate_fail_put;
>  
> +		rcu_read_lock();
> +		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
>  		/* If we already hold the lock, we can skip some rechecking */
> -		if (!locked) {
> -			locked = compact_lock_irqsave(&pgdat->lru_lock,
> -								&flags, cc);
> +		if (lruvec != locked) {
> +			if (locked)
> +				unlock_page_lruvec_irqrestore(locked, flags);
> +
> +			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> +			locked = lruvec;
> +			rcu_read_unlock();
> +
> +			lruvec_memcg_debug(lruvec, page);
>  
>  			/* Try get exclusive access under lock */
>  			if (!skip_updated) {
> @@ -1006,9 +1023,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  				SetPageLRU(page);
>  				goto isolate_fail_put;
>  			}
> -		}
> -
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +		} else
> +			rcu_read_unlock();
>  
>  		/* The whole page is taken off the LRU; skip the tail pages. */
>  		if (PageCompound(page))
> @@ -1042,8 +1058,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  isolate_fail_put:
>  		/* Avoid potential deadlock in freeing page under lru_lock */
>  		if (locked) {
> -			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -			locked = false;
> +			unlock_page_lruvec_irqrestore(locked, flags);
> +			locked = NULL;
>  		}
>  		put_page(page);
>  
> @@ -1058,8 +1074,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		 */
>  		if (nr_isolated) {
>  			if (locked) {
> -				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -				locked = false;
> +				unlock_page_lruvec_irqrestore(locked, flags);
> +				locked = NULL;
>  			}
>  			putback_movable_pages(&cc->migratepages);
>  			cc->nr_migratepages = 0;
> @@ -1087,7 +1103,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  
>  isolate_abort:
>  	if (locked)
> -		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +		unlock_page_lruvec_irqrestore(locked, flags);
>  	if (page) {
>  		SetPageLRU(page);
>  		put_page(page);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6380c925e904..c9e08fdc08e9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2319,7 +2319,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
>  	VM_BUG_ON_PAGE(!PageHead(head), head);
>  	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
>  	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
> -	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
> +	lockdep_assert_held(&lruvec->lru_lock);
>  
>  	if (list) {
>  		/* page reclaim is reclaiming a huge page */
> @@ -2403,7 +2403,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  			      pgoff_t end)
>  {
>  	struct page *head = compound_head(page);
> -	pg_data_t *pgdat = page_pgdat(head);
>  	struct lruvec *lruvec;
>  	struct address_space *swap_cache = NULL;
>  	unsigned long offset = 0;
> @@ -2420,10 +2419,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  		xa_lock(&swap_cache->i_pages);
>  	}
>  
> -	/* prevent PageLRU to go away from under us, and freeze lru stats */
> -	spin_lock(&pgdat->lru_lock);
> -
> -	lruvec = mem_cgroup_page_lruvec(head, pgdat);
> +	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
> +	lruvec = lock_page_lruvec(head);
>  
>  	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>  		__split_huge_page_tail(head, i, lruvec, list);
> @@ -2444,7 +2441,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  	}
>  
>  	ClearPageCompound(head);
> -	spin_unlock(&pgdat->lru_lock);
> +	unlock_page_lruvec(lruvec);
>  	/* Caller disabled irqs, so they are still disabled here */
>  
>  	split_page_owner(head, HPAGE_PMD_ORDER);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 65c1e873153e..5b95529e64a4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1302,6 +1302,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  	return ret;
>  }
>  
> +#ifdef CONFIG_DEBUG_VM
> +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	if (!page->mem_cgroup)
> +		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
> +	else
> +		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
> +}
> +#endif

That function is not very effective, but I don't see how to improve it
either: the trouble is, it gets called to confirm what has just been
decided a moment before, when it would be much more powerful if it were
called later, at the time of unlocking - but we generally don't know the
page by then. I'll be tempted just to delete it later on (historically,
bugs have tended to show up as list_debug or update_lru_size warnings);
but we should certainly leave it in for now.

> +
>  /**
>   * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
>   * @page: the page
> @@ -1341,6 +1354,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>  	return lruvec;
>  }
>  
> +struct lruvec *lock_page_lruvec(struct page *page)
> +{
> +	struct lruvec *lruvec;
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	rcu_read_lock();
> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +	spin_lock(&lruvec->lru_lock);
> +	rcu_read_unlock();
> +
> +	lruvec_memcg_debug(lruvec, page);
> +
> +	return lruvec;
> +}
> +
> +struct lruvec *lock_page_lruvec_irq(struct page *page)
> +{
> +	struct lruvec *lruvec;
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	rcu_read_lock();
> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +	spin_lock_irq(&lruvec->lru_lock);
> +	rcu_read_unlock();
> +
> +	lruvec_memcg_debug(lruvec, page);
> +
> +	return lruvec;
> +}
> +
> +struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
> +{
> +	struct lruvec *lruvec;
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	rcu_read_lock();
> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +	spin_lock_irqsave(&lruvec->lru_lock, *flags);
> +	rcu_read_unlock();
> +
> +	lruvec_memcg_debug(lruvec, page);
> +
> +	return lruvec;
> +}
> +
>  /**
>   * mem_cgroup_update_lru_size - account for adding or removing an lru page
>   * @lruvec: mem_cgroup per zone lru vector
> @@ -3222,7 +3280,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>  
>  /*
>   * Because tail pages are not marked as "used", set it. We're under
> - * pgdat->lru_lock and migration entries setup in all page mappings.
> + * lruvec->lru_lock and migration entries setup in all page mappings.
>   */
>  void mem_cgroup_split_huge_fixup(struct page *head)
>  {

Don't you come back to that comment in 23/32, correctly changing
"We're under" to "Don't need". Might as well get the comment right
in one place or the other, I don't mind which (get it right in this
one and 23/32 need not touch mm/memcontrol.c).  The reference to
"used" goes back several years, to when there was a special flag to
mark a page as charged: now it's just done by setting mem_cgroup,
so I think the comment should just say:

* Because page->mem_cgroup is not set on compound tails, set it now.

I tried to make sense of "and migration entries setup in all page
mappings" but couldn't: oh, it means that the page is unmapped from
userspace at this point; well, that's true, but irrelevant here.
No need to mention that or the lru_lock here at all.

> diff --git a/mm/mlock.c b/mm/mlock.c
> index 3762d9dd5b31..177d2588e863 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -105,12 +105,10 @@ void mlock_vma_page(struct page *page)
>   * Isolate a page from LRU with optional get_page() pin.
>   * Assumes lru_lock already held and page already pinned.
>   */
> -static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
> +static bool __munlock_isolate_lru_page(struct page *page,
> +				struct lruvec *lruvec, bool getpage)
>  {
>  	if (TestClearPageLRU(page)) {
> -		struct lruvec *lruvec;
> -
> -		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>  		if (getpage)
>  			get_page(page);
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
> @@ -180,7 +178,7 @@ static void __munlock_isolation_failed(struct page *page)
>  unsigned int munlock_vma_page(struct page *page)
>  {
>  	int nr_pages;
> -	pg_data_t *pgdat = page_pgdat(page);
> +	struct lruvec *lruvec;
>  
>  	/* For try_to_munlock() and to serialize with page migration */
>  	BUG_ON(!PageLocked(page));
> @@ -188,11 +186,16 @@ unsigned int munlock_vma_page(struct page *page)
>  	VM_BUG_ON_PAGE(PageTail(page), page);
>  
>  	/*
> -	 * Serialize with any parallel __split_huge_page_refcount() which
> +	 * Serialize split tail pages in __split_huge_page_tail() which
>  	 * might otherwise copy PageMlocked to part of the tail pages before
>  	 * we clear it in the head page. It also stabilizes thp_nr_pages().
> +	 * TestClearPageLRU can't be used here to block page isolation, since
> +	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
> +	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
> +	 * wrong lru list here. So relay on PageLocked to stop lruvec change
> +	 * in mem_cgroup_move_account().
>  	 */

I have elsewhere recommended just deleting all of that comment, typos
(interfere, rely) and misunderstandings and all. But you are right that
PageLocked keeps mem_cgroup_move_account() out there.

> -	spin_lock_irq(&pgdat->lru_lock);
> +	lruvec = lock_page_lruvec_irq(page);
>  
>  	if (!TestClearPageMlocked(page)) {
>  		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
> @@ -203,15 +206,15 @@ unsigned int munlock_vma_page(struct page *page)
>  	nr_pages = thp_nr_pages(page);
>  	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
>  
> -	if (__munlock_isolate_lru_page(page, true)) {
> -		spin_unlock_irq(&pgdat->lru_lock);
> +	if (__munlock_isolate_lru_page(page, lruvec, true)) {
> +		unlock_page_lruvec_irq(lruvec);
>  		__munlock_isolated_page(page);
>  		goto out;
>  	}
>  	__munlock_isolation_failed(page);
>  
>  unlock_out:
> -	spin_unlock_irq(&pgdat->lru_lock);
> +	unlock_page_lruvec_irq(lruvec);
>  
>  out:
>  	return nr_pages - 1;
> @@ -291,23 +294,34 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  	int nr = pagevec_count(pvec);
>  	int delta_munlocked = -nr;
>  	struct pagevec pvec_putback;
> +	struct lruvec *lruvec = NULL;
>  	int pgrescued = 0;
>  
>  	pagevec_init(&pvec_putback);
>  
>  	/* Phase 1: page isolation */
> -	spin_lock_irq(&zone->zone_pgdat->lru_lock);
>  	for (i = 0; i < nr; i++) {
>  		struct page *page = pvec->pages[i];
> +		struct lruvec *new_lruvec;
> +
> +		/* block memcg change in mem_cgroup_move_account */
> +		lock_page_memcg(page);

And elsewhere I've explained that lock_page_memcg() before
lock_page_lruvec() is good there the first time round the loop,
but the second time it is trying to lock_page_memcg() while
still holding lruvec lock: possibility of deadlock, not good.
I'll need to check your next version of this patch before Acking.
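
(To spell the worry out: from the second iteration onwards the loop does, in
effect, the following -- schematic only, with two hypothetical pages pg1 and
pg2 on different memcgs:)

	struct lruvec *lruvec;

	lock_page_memcg(pg1);
	lruvec = lock_page_lruvec_irq(pg1);	/* memcg lock, then lru_lock: fine */
	unlock_page_memcg(pg1);

	lock_page_memcg(pg2);	/* pg1's lru_lock still held here, so this is
				 * lru_lock, then memcg lock: inverted ordering */
	unlock_page_lruvec_irq(lruvec);
	lruvec = lock_page_lruvec_irq(pg2);
	unlock_page_memcg(pg2);
	unlock_page_lruvec_irq(lruvec);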

> +		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +		if (new_lruvec != lruvec) {
> +			if (lruvec)
> +				unlock_page_lruvec_irq(lruvec);
> +			lruvec = lock_page_lruvec_irq(page);
> +		}
>  
>  		if (TestClearPageMlocked(page)) {
>  			/*
>  			 * We already have pin from follow_page_mask()
>  			 * so we can spare the get_page() here.
>  			 */
> -			if (__munlock_isolate_lru_page(page, false))
> +			if (__munlock_isolate_lru_page(page, lruvec, false)) {
> +				unlock_page_memcg(page);
>  				continue;
> -			else
> +			} else
>  				__munlock_isolation_failed(page);
>  		} else {
>  			delta_munlocked++;
> @@ -319,11 +333,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  		 * pin. We cannot do it under lru_lock however. If it's
>  		 * the last pin, __page_cache_release() would deadlock.
>  		 */
> +		unlock_page_memcg(page);
>  		pagevec_add(&pvec_putback, pvec->pages[i]);
>  		pvec->pages[i] = NULL;
>  	}
> -	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> -	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
> +	if (lruvec) {
> +		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> +		unlock_page_lruvec_irq(lruvec);
> +	}
>  
>  	/* Now we can release pins of pages that we are not munlocking */
>  	pagevec_release(&pvec_putback);
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index 4686fdc23bb9..3750a90ed4a0 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
>  	enum lru_list lru;
>  
>  	memset(lruvec, 0, sizeof(struct lruvec));
> +	spin_lock_init(&lruvec->lru_lock);
>  
>  	for_each_lru(lru)
>  		INIT_LIST_HEAD(&lruvec->lists[lru]);
> diff --git a/mm/swap.c b/mm/swap.c
> index 2d9a86bf93a4..b67959b701c0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -79,15 +79,13 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
>  static void __page_cache_release(struct page *page)
>  {
>  	if (PageLRU(page)) {
> -		pg_data_t *pgdat = page_pgdat(page);
>  		struct lruvec *lruvec;
>  		unsigned long flags;
>  
>  		__ClearPageLRU(page);
> -		spin_lock_irqsave(&pgdat->lru_lock, flags);
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +		lruvec = lock_page_lruvec_irqsave(page, &flags);
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
> -		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +		unlock_page_lruvec_irqrestore(lruvec, flags);

This is where I asked you to drop a hunk from the TestClearPageLRU
patch; and a VM_BUG_ON_PAGE(!PageLRU) went missing. I agree it looks
very weird immediately after checking PageLRU, but IIRC years ago it
did actually catch some racy bugs, so I guess better to retain it.

I suppose it's then best to keep to the original ordering,
lock, BUG, __Clear, del, unlock - to widen the gap between the PageLRU
checks - though usually we would prefer to BUG outside of holding a lock.
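
(Spelled out, the ordering being asked for in __page_cache_release() is
roughly the following -- a sketch, not a replacement hunk:)

	if (PageLRU(page)) {
		struct lruvec *lruvec;
		unsigned long flags;

		lruvec = lock_page_lruvec_irqsave(page, &flags);
		VM_BUG_ON_PAGE(!PageLRU(page), page);	/* sanity check under lock */
		__ClearPageLRU(page);
		del_page_from_lru_list(page, lruvec, page_off_lru(page));
		unlock_page_lruvec_irqrestore(lruvec, flags);
	}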

>  	}
>  	__ClearPageWaiters(page);
>  }
> @@ -206,32 +204,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>  	void (*move_fn)(struct page *page, struct lruvec *lruvec))
>  {
>  	int i;
> -	struct pglist_data *pgdat = NULL;
> -	struct lruvec *lruvec;
> +	struct lruvec *lruvec = NULL;
>  	unsigned long flags = 0;
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
> -		struct pglist_data *pagepgdat = page_pgdat(page);
> -
> -		if (pagepgdat != pgdat) {
> -			if (pgdat)
> -				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -			pgdat = pagepgdat;
> -			spin_lock_irqsave(&pgdat->lru_lock, flags);
> -		}
> +		struct lruvec *new_lruvec;
>  
>  		/* block memcg migration during page moving between lru */
>  		if (!TestClearPageLRU(page))
>  			continue;
>  
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +		if (lruvec != new_lruvec) {
> +			if (lruvec)
> +				unlock_page_lruvec_irqrestore(lruvec, flags);
> +			lruvec = lock_page_lruvec_irqsave(page, &flags);
> +		}
> +
>  		(*move_fn)(page, lruvec);
>  
>  		SetPageLRU(page);
>  	}
> -	if (pgdat)
> -		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +	if (lruvec)
> +		unlock_page_lruvec_irqrestore(lruvec, flags);
>  	release_pages(pvec->pages, pvec->nr);
>  	pagevec_reinit(pvec);
>  }
> @@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  {
>  	do {
>  		unsigned long lrusize;
> -		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  
> -		spin_lock_irq(&pgdat->lru_lock);
> +		spin_lock_irq(&lruvec->lru_lock);
>  		/* Record cost event */
>  		if (file)
>  			lruvec->file_cost += nr_pages;
> @@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  			lruvec->file_cost /= 2;
>  			lruvec->anon_cost /= 2;
>  		}
> -		spin_unlock_irq(&pgdat->lru_lock);
> +		spin_unlock_irq(&lruvec->lru_lock);
>  	} while ((lruvec = parent_lruvec(lruvec)));
>  }
>  
> @@ -364,13 +359,13 @@ static inline void activate_page_drain(int cpu)
>  
>  void activate_page(struct page *page)
>  {
> -	pg_data_t *pgdat = page_pgdat(page);
> +	struct lruvec *lruvec;
>  
>  	page = compound_head(page);
> -	spin_lock_irq(&pgdat->lru_lock);
> +	lruvec = lock_page_lruvec_irq(page);
>  	if (PageLRU(page))
> -		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
> -	spin_unlock_irq(&pgdat->lru_lock);
> +		__activate_page(page, lruvec);
> +	unlock_page_lruvec_irq(lruvec);
>  }
>  #endif
>  
> @@ -819,8 +814,7 @@ void release_pages(struct page **pages, int nr)
>  {
>  	int i;
>  	LIST_HEAD(pages_to_free);
> -	struct pglist_data *locked_pgdat = NULL;
> -	struct lruvec *lruvec;
> +	struct lruvec *lruvec = NULL;
>  	unsigned long flags;
>  	unsigned int lock_batch;
>  
> @@ -830,21 +824,20 @@ void release_pages(struct page **pages, int nr)
>  		/*
>  		 * Make sure the IRQ-safe lock-holding time does not get
>  		 * excessive with a continuous string of pages from the
> -		 * same pgdat. The lock is held only if pgdat != NULL.
> +		 * same lruvec. The lock is held only if lruvec != NULL.
>  		 */
> -		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> -			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> -			locked_pgdat = NULL;
> +		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
> +			unlock_page_lruvec_irqrestore(lruvec, flags);
> +			lruvec = NULL;
>  		}
>  
>  		if (is_huge_zero_page(page))
>  			continue;
>  
>  		if (is_zone_device_page(page)) {
> -			if (locked_pgdat) {
> -				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> -						       flags);
> -				locked_pgdat = NULL;
> +			if (lruvec) {
> +				unlock_page_lruvec_irqrestore(lruvec, flags);
> +				lruvec = NULL;
>  			}
>  			/*
>  			 * ZONE_DEVICE pages that return 'false' from
> @@ -863,29 +856,29 @@ void release_pages(struct page **pages, int nr)
>  			continue;
>  
>  		if (PageCompound(page)) {
> -			if (locked_pgdat) {
> -				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> -				locked_pgdat = NULL;
> +			if (lruvec) {
> +				unlock_page_lruvec_irqrestore(lruvec, flags);
> +				lruvec = NULL;
>  			}
>  			__put_compound_page(page);
>  			continue;
>  		}
>  
>  		if (PageLRU(page)) {
> -			struct pglist_data *pgdat = page_pgdat(page);
> +			struct lruvec *new_lruvec;
>  
> -			if (pgdat != locked_pgdat) {
> -				if (locked_pgdat)
> -					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> +			new_lruvec = mem_cgroup_page_lruvec(page,
> +							page_pgdat(page));
> +			if (new_lruvec != lruvec) {
> +				if (lruvec)
> +					unlock_page_lruvec_irqrestore(lruvec,
>  									flags);
>  				lock_batch = 0;
> -				locked_pgdat = pgdat;
> -				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
> +				lruvec = lock_page_lruvec_irqsave(page, &flags);
>  			}
>  
>  			VM_BUG_ON_PAGE(!PageLRU(page), page);
>  			__ClearPageLRU(page);
> -			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>  			del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		}
>  
> @@ -895,8 +888,8 @@ void release_pages(struct page **pages, int nr)
>  
>  		list_add(&page->lru, &pages_to_free);
>  	}
> -	if (locked_pgdat)
> -		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +	if (lruvec)
> +		unlock_page_lruvec_irqrestore(lruvec, flags);
>  
>  	mem_cgroup_uncharge_list(&pages_to_free);
>  	free_unref_page_list(&pages_to_free);
> @@ -984,26 +977,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
>  	int i;
> -	struct pglist_data *pgdat = NULL;
> -	struct lruvec *lruvec;
> +	struct lruvec *lruvec = NULL;
>  	unsigned long flags = 0;
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
> -		struct pglist_data *pagepgdat = page_pgdat(page);
> +		struct lruvec *new_lruvec;
>  
> -		if (pagepgdat != pgdat) {
> -			if (pgdat)
> -				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -			pgdat = pagepgdat;
> -			spin_lock_irqsave(&pgdat->lru_lock, flags);
> +		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +		if (lruvec != new_lruvec) {
> +			if (lruvec)
> +				unlock_page_lruvec_irqrestore(lruvec, flags);
> +			lruvec = lock_page_lruvec_irqsave(page, &flags);
>  		}
>  
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  		__pagevec_lru_add_fn(page, lruvec);
>  	}
> -	if (pgdat)
> -		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +	if (lruvec)
> +		unlock_page_lruvec_irqrestore(lruvec, flags);
>  	release_pages(pvec->pages, pvec->nr);
>  	pagevec_reinit(pvec);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 48b50695f883..789444ae4c88 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1768,15 +1768,13 @@ int isolate_lru_page(struct page *page)
>  	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>  
>  	if (TestClearPageLRU(page)) {
> -		pg_data_t *pgdat = page_pgdat(page);
>  		struct lruvec *lruvec;
>  		int lru = page_lru(page);
>  
>  		get_page(page);
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -		spin_lock_irq(&pgdat->lru_lock);
> +		lruvec = lock_page_lruvec_irq(page);
>  		del_page_from_lru_list(page, lruvec, lru);
> -		spin_unlock_irq(&pgdat->lru_lock);
> +		unlock_page_lruvec_irq(lruvec);
>  		ret = 0;
>  	}
>  
> @@ -1843,20 +1841,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  						     struct list_head *list)
>  {
> -	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	int nr_pages, nr_moved = 0;
>  	LIST_HEAD(pages_to_free);
>  	struct page *page;
> +	struct lruvec *orig_lruvec = lruvec;
>  	enum lru_list lru;
>  
>  	while (!list_empty(list)) {
> +		struct lruvec *new_lruvec = NULL;
> +
>  		page = lru_to_page(list);
>  		VM_BUG_ON_PAGE(PageLRU(page), page);
>  		list_del(&page->lru);
>  		if (unlikely(!page_evictable(page))) {
> -			spin_unlock_irq(&pgdat->lru_lock);
> +			spin_unlock_irq(&lruvec->lru_lock);
>  			putback_lru_page(page);
> -			spin_lock_irq(&pgdat->lru_lock);
> +			spin_lock_irq(&lruvec->lru_lock);
>  			continue;
>  		}
>  
> @@ -1871,6 +1871,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  		 *     list_add(&page->lru,)
>  		 *                                        list_add(&page->lru,)
>  		 */
> +		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +		if (new_lruvec != lruvec) {
> +			if (lruvec)
> +				spin_unlock_irq(&lruvec->lru_lock);
> +			lruvec = lock_page_lruvec_irq(page);
> +		}
>  		SetPageLRU(page);
>  
>  		if (unlikely(put_page_testzero(page))) {
> @@ -1878,16 +1884,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  			__ClearPageActive(page);
>  
>  			if (unlikely(PageCompound(page))) {
> -				spin_unlock_irq(&pgdat->lru_lock);
> +				spin_unlock_irq(&lruvec->lru_lock);
>  				destroy_compound_page(page);
> -				spin_lock_irq(&pgdat->lru_lock);
> +				spin_lock_irq(&lruvec->lru_lock);
>  			} else
>  				list_add(&page->lru, &pages_to_free);
>  
>  			continue;
>  		}
>  
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  		lru = page_lru(page);
>  		nr_pages = thp_nr_pages(page);
>  
> @@ -1897,6 +1902,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  		if (PageActive(page))
>  			workingset_age_nonresident(lruvec, nr_pages);
>  	}
> +	if (orig_lruvec != lruvec) {
> +		if (lruvec)
> +			spin_unlock_irq(&lruvec->lru_lock);
> +		spin_lock_irq(&orig_lruvec->lru_lock);
> +	}
>  
>  	/*
>  	 * To save our caller's stack, now use input list for pages to free.

No, AlexD was right: most of these changes in move_pages_to_lru() -
saving orig_lruvec, and allowing for a change of lruvec from one page
to the next - are not necessary, and the patch is much nicer without them.

All you need here is the change from pgdat to lruvec,
plus a check that the lruvec really is not changing:
		VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page))
							!= lruvec, page);
after the VM_BUG_ON_PAGE(PageLRU) at the head of the loop in this patch,
which can be updated to the nicer
		VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec),
								page);
in the next patch, where that function becomes available.
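
To illustrate - a sketch only, not the exact code I'd expect in the
respin - with the orig_lruvec/new_lruvec handling dropped, the head
of the loop reduces to something like:

	while (!list_empty(list)) {
		page = lru_to_page(list);
		VM_BUG_ON_PAGE(PageLRU(page), page);
		/* every page on this list belongs to the caller's lruvec */
		VM_BUG_ON_PAGE(mem_cgroup_page_lruvec(page, page_pgdat(page))
							!= lruvec, page);
		list_del(&page->lru);
		if (unlikely(!page_evictable(page))) {
			spin_unlock_irq(&lruvec->lru_lock);
			putback_lru_page(page);
			spin_lock_irq(&lruvec->lru_lock);
			continue;
		}
		SetPageLRU(page);
		...
	}

with no relocking before SetPageLRU(), and no orig_lruvec to restore
at the end.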

It certainly used to be true that move_pages_to_lru() had to allow
for lruvec to change; and that was still true at the time of my v5.3
tarball; but down the years the various reasons for it have gone away,
most recently with Johannes's swapcache charging simplifications.

I did see your mail to AlexD, where you showed a NULL deref you had
hit in move_pages_to_lru() two months ago.  I spent quite a while
puzzling over that last night, but don't have an explanation, and
don't know exactly what source you had built from when you hit it.
I had hoped to explain it by that bug I've fixed in v5.9-rc6:
62fdb1632bcb ("ksm: reinstate memcg charge on copied pages")
but did not quite succeed in explaining it that way.

And you said that you haven't hit it again recently.  Whatever,
I don't see it as any reason for keeping the more complicated and
unnecessary code in move_pages_to_lru(): if we hit such a bug again,
then we investigate it.

> @@ -1952,7 +1962,7 @@ static int current_may_throttle(void)
>  
>  	lru_add_drain();
>  
> -	spin_lock_irq(&pgdat->lru_lock);
> +	spin_lock_irq(&lruvec->lru_lock);
>  
>  	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>  				     &nr_scanned, sc, lru);
> @@ -1964,7 +1974,7 @@ static int current_may_throttle(void)
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
>  	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
>  
> -	spin_unlock_irq(&pgdat->lru_lock);
> +	spin_unlock_irq(&lruvec->lru_lock);
>  
>  	if (nr_taken == 0)
>  		return 0;
> @@ -1972,7 +1982,7 @@ static int current_may_throttle(void)
>  	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
>  				&stat, false);
>  
> -	spin_lock_irq(&pgdat->lru_lock);
> +	spin_lock_irq(&lruvec->lru_lock);
>  	move_pages_to_lru(lruvec, &page_list);
>  
>  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> @@ -1981,7 +1991,7 @@ static int current_may_throttle(void)
>  		__count_vm_events(item, nr_reclaimed);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>  	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> -	spin_unlock_irq(&pgdat->lru_lock);
> +	spin_unlock_irq(&lruvec->lru_lock);
>  
>  	lru_note_cost(lruvec, file, stat.nr_pageout);
>  	mem_cgroup_uncharge_list(&page_list);
> @@ -2034,7 +2044,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  
>  	lru_add_drain();
>  
> -	spin_lock_irq(&pgdat->lru_lock);
> +	spin_lock_irq(&lruvec->lru_lock);
>  
>  	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>  				     &nr_scanned, sc, lru);
> @@ -2045,7 +2055,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  		__count_vm_events(PGREFILL, nr_scanned);
>  	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
>  
> -	spin_unlock_irq(&pgdat->lru_lock);
> +	spin_unlock_irq(&lruvec->lru_lock);
>  
>  	while (!list_empty(&l_hold)) {
>  		cond_resched();
> @@ -2091,7 +2101,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	/*
>  	 * Move pages back to the lru list.
>  	 */
> -	spin_lock_irq(&pgdat->lru_lock);
> +	spin_lock_irq(&lruvec->lru_lock);
>  
>  	nr_activate = move_pages_to_lru(lruvec, &l_active);
>  	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
> @@ -2102,7 +2112,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
>  
>  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> -	spin_unlock_irq(&pgdat->lru_lock);
> +	spin_unlock_irq(&lruvec->lru_lock);
>  
>  	mem_cgroup_uncharge_list(&l_active);
>  	free_unref_page_list(&l_active);
> @@ -2684,10 +2694,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	/*
>  	 * Determine the scan balance between anon and file LRUs.
>  	 */
> -	spin_lock_irq(&pgdat->lru_lock);
> +	spin_lock_irq(&target_lruvec->lru_lock);
>  	sc->anon_cost = target_lruvec->anon_cost;
>  	sc->file_cost = target_lruvec->file_cost;
> -	spin_unlock_irq(&pgdat->lru_lock);
> +	spin_unlock_irq(&target_lruvec->lru_lock);
>  
>  	/*
>  	 * Target desirable inactive:active list ratios for the anon
> @@ -4263,24 +4273,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>   */
>  void check_move_unevictable_pages(struct pagevec *pvec)

And as I've described elsewhere, some changes are needed here,
which I'll have to check in later versions before Acking this patch.

>  {
> -	struct lruvec *lruvec;
> -	struct pglist_data *pgdat = NULL;
> +	struct lruvec *lruvec = NULL;
>  	int pgscanned = 0;
>  	int pgrescued = 0;
>  	int i;
>  
>  	for (i = 0; i < pvec->nr; i++) {
>  		struct page *page = pvec->pages[i];
> -		struct pglist_data *pagepgdat = page_pgdat(page);
> +		struct lruvec *new_lruvec;
>  
>  		pgscanned++;
> -		if (pagepgdat != pgdat) {
> -			if (pgdat)
> -				spin_unlock_irq(&pgdat->lru_lock);
> -			pgdat = pagepgdat;
> -			spin_lock_irq(&pgdat->lru_lock);
> +		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +		if (lruvec != new_lruvec) {
> +			if (lruvec)
> +				unlock_page_lruvec_irq(lruvec);
> +			lruvec = lock_page_lruvec_irq(page);
>  		}
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  
>  		if (!PageLRU(page) || !PageUnevictable(page))
>  			continue;
> @@ -4296,10 +4304,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>  		}
>  	}
>  
> -	if (pgdat) {
> +	if (lruvec) {
>  		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
>  		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
> -		spin_unlock_irq(&pgdat->lru_lock);
> +		unlock_page_lruvec_irq(lruvec);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
> -- 
> 1.8.3.1


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function
  2020-08-24 12:54 ` [PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function Alex Shi
@ 2020-09-22  5:40   ` Hugh Dickins
  0 siblings, 0 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-09-22  5:40 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Alexander Duyck,
	Thomas Gleixner, Andrey Ryabinin

On Mon, 24 Aug 2020, Alex Shi wrote:

> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Use this new function to replace repeated same code, no func change.
> 
> When testing for relock we can avoid the need for RCU locking if we simply
> compare the page pgdat and memcg pointers versus those that the lruvec is
> holding. By doing this we can avoid the extra pointer walks and accesses of
> the memory cgroup.
> 
> In addition we can avoid the checks entirely if lruvec is currently NULL.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Again, I'll wait to see __munlock_pagevec() fixed before Acking this
one, but that's the only issue.  And I've suggested that you use
lruvec_holds_page_lru_lock() in mm/vmscan.c move_pages_to_lru(),
to replace the uglier and less efficient VM_BUG_ON_PAGE there.

Oh, there is one other issue: the 0day robot did report (2020-06-19)
that sparse doesn't understand relock_page_lruvec*(): I've never
got around to working out how to write what it needs - conditional
__release plus __acquire in some form, I imagine.  I've never got
into sparse annotations before, so I'll give it a try, but if anyone
beats me to it, that will be welcome: and there are higher priorities -
I do not think you should wait for the sparse warning to be fixed
before reposting.

> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/memcontrol.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
>  mm/mlock.c                 |  9 +-------
>  mm/swap.c                  | 33 +++++++----------------------
>  mm/vmscan.c                |  8 +------
>  4 files changed, 61 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7b170e9028b5..ee6ef2d8ad52 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -488,6 +488,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>  
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>  
> +static inline bool lruvec_holds_page_lru_lock(struct page *page,
> +					      struct lruvec *lruvec)
> +{
> +	pg_data_t *pgdat = page_pgdat(page);
> +	const struct mem_cgroup *memcg;
> +	struct mem_cgroup_per_node *mz;
> +
> +	if (mem_cgroup_disabled())
> +		return lruvec == &pgdat->__lruvec;
> +
> +	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> +	memcg = page->mem_cgroup ? : root_mem_cgroup;
> +
> +	return lruvec->pgdat == pgdat && mz->memcg == memcg;
> +}
> +
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  
>  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
> @@ -1023,6 +1039,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>  	return &pgdat->__lruvec;
>  }
>  
> +static inline bool lruvec_holds_page_lru_lock(struct page *page,
> +					      struct lruvec *lruvec)
> +{
> +		pg_data_t *pgdat = page_pgdat(page);
> +
> +		return lruvec == &pgdat->__lruvec;
> +}
> +
>  static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
>  {
>  	return NULL;
> @@ -1469,6 +1493,34 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
>  	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
>  }
>  
> +/* Don't lock again iff page's lruvec locked */
> +static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
> +		struct lruvec *locked_lruvec)
> +{
> +	if (locked_lruvec) {
> +		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
> +			return locked_lruvec;
> +
> +		unlock_page_lruvec_irq(locked_lruvec);
> +	}
> +
> +	return lock_page_lruvec_irq(page);
> +}
> +
> +/* Don't lock again iff page's lruvec locked */
> +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
> +		struct lruvec *locked_lruvec, unsigned long *flags)
> +{
> +	if (locked_lruvec) {
> +		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
> +			return locked_lruvec;
> +
> +		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> +	}
> +
> +	return lock_page_lruvec_irqsave(page, flags);
> +}
> +
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  
>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 177d2588e863..0448409184e3 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -302,17 +302,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  	/* Phase 1: page isolation */
>  	for (i = 0; i < nr; i++) {
>  		struct page *page = pvec->pages[i];
> -		struct lruvec *new_lruvec;
>  
>  		/* block memcg change in mem_cgroup_move_account */
>  		lock_page_memcg(page);
> -		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -		if (new_lruvec != lruvec) {
> -			if (lruvec)
> -				unlock_page_lruvec_irq(lruvec);
> -			lruvec = lock_page_lruvec_irq(page);
> -		}
> -
> +		lruvec = relock_page_lruvec_irq(page, lruvec);
>  		if (TestClearPageMlocked(page)) {
>  			/*
>  			 * We already have pin from follow_page_mask()
> diff --git a/mm/swap.c b/mm/swap.c
> index b67959b701c0..2ac78e8fab71 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -209,19 +209,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
> -		struct lruvec *new_lruvec;
>  
>  		/* block memcg migration during page moving between lru */
>  		if (!TestClearPageLRU(page))
>  			continue;
>  
> -		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -		if (lruvec != new_lruvec) {
> -			if (lruvec)
> -				unlock_page_lruvec_irqrestore(lruvec, flags);
> -			lruvec = lock_page_lruvec_irqsave(page, &flags);
> -		}
> -
> +		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>  		(*move_fn)(page, lruvec);
>  
>  		SetPageLRU(page);
> @@ -865,17 +858,12 @@ void release_pages(struct page **pages, int nr)
>  		}
>  
>  		if (PageLRU(page)) {
> -			struct lruvec *new_lruvec;
> -
> -			new_lruvec = mem_cgroup_page_lruvec(page,
> -							page_pgdat(page));
> -			if (new_lruvec != lruvec) {
> -				if (lruvec)
> -					unlock_page_lruvec_irqrestore(lruvec,
> -									flags);
> +			struct lruvec *prev_lruvec = lruvec;
> +
> +			lruvec = relock_page_lruvec_irqsave(page, lruvec,
> +									&flags);
> +			if (prev_lruvec != lruvec)
>  				lock_batch = 0;
> -				lruvec = lock_page_lruvec_irqsave(page, &flags);
> -			}
>  
>  			VM_BUG_ON_PAGE(!PageLRU(page), page);
>  			__ClearPageLRU(page);
> @@ -982,15 +970,8 @@ void __pagevec_lru_add(struct pagevec *pvec)
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
> -		struct lruvec *new_lruvec;
> -
> -		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -		if (lruvec != new_lruvec) {
> -			if (lruvec)
> -				unlock_page_lruvec_irqrestore(lruvec, flags);
> -			lruvec = lock_page_lruvec_irqsave(page, &flags);
> -		}
>  
> +		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>  		__pagevec_lru_add_fn(page, lruvec);
>  	}
>  	if (lruvec)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 789444ae4c88..2c94790d4cb1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4280,15 +4280,9 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>  
>  	for (i = 0; i < pvec->nr; i++) {
>  		struct page *page = pvec->pages[i];
> -		struct lruvec *new_lruvec;
>  
>  		pgscanned++;
> -		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -		if (lruvec != new_lruvec) {
> -			if (lruvec)
> -				unlock_page_lruvec_irq(lruvec);
> -			lruvec = lock_page_lruvec_irq(page);
> -		}
> +		lruvec = relock_page_lruvec_irq(page, lruvec);
>  
>  		if (!PageLRU(page) || !PageUnevictable(page))
>  			continue;
> -- 
> 1.8.3.1


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru
  2020-08-24 12:54 ` [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru Alex Shi
@ 2020-09-22  5:44   ` Hugh Dickins
  2020-09-23  1:55     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-22  5:44 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Andrey Ryabinin,
	Jann Horn

On Mon, 24 Aug 2020, Alex Shi wrote:

> From: Hugh Dickins <hughd@google.com>
> 
> Use the relock function to replace relocking action. And try to save few
> lock times.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

NAK. Who wrote this rubbish? Oh, did I? Maybe something you extracted
from my tarball. No, we don't need any of this now, as explained when
going through 20/32.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  mm/vmscan.c | 17 ++++++-----------
>  1 file changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2c94790d4cb1..04ef94190530 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1848,15 +1848,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  	enum lru_list lru;
>  
>  	while (!list_empty(list)) {
> -		struct lruvec *new_lruvec = NULL;
> -
>  		page = lru_to_page(list);
>  		VM_BUG_ON_PAGE(PageLRU(page), page);
>  		list_del(&page->lru);
>  		if (unlikely(!page_evictable(page))) {
> -			spin_unlock_irq(&lruvec->lru_lock);
> +			if (lruvec) {
> +				spin_unlock_irq(&lruvec->lru_lock);
> +				lruvec = NULL;
> +			}
>  			putback_lru_page(page);
> -			spin_lock_irq(&lruvec->lru_lock);
>  			continue;
>  		}
>  
> @@ -1871,12 +1871,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  		 *     list_add(&page->lru,)
>  		 *                                        list_add(&page->lru,)
>  		 */
> -		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -		if (new_lruvec != lruvec) {
> -			if (lruvec)
> -				spin_unlock_irq(&lruvec->lru_lock);
> -			lruvec = lock_page_lruvec_irq(page);
> -		}
> +		lruvec = relock_page_lruvec_irq(page, lruvec);
>  		SetPageLRU(page);
>  
>  		if (unlikely(put_page_testzero(page))) {
> @@ -1885,8 +1880,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  
>  			if (unlikely(PageCompound(page))) {
>  				spin_unlock_irq(&lruvec->lru_lock);
> +				lruvec = NULL;
>  				destroy_compound_page(page);
> -				spin_lock_irq(&lruvec->lru_lock);
>  			} else
>  				list_add(&page->lru, &pages_to_free);
>  
> -- 
> 1.8.3.1


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 23/32] mm/lru: revise the comments of lru_lock
  2020-08-24 12:54 ` [PATCH v18 23/32] mm/lru: revise the comments of lru_lock Alex Shi
@ 2020-09-22  5:48   ` Hugh Dickins
  0 siblings, 0 replies; 102+ messages in thread
From: Hugh Dickins @ 2020-09-22  5:48 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Andrey Ryabinin,
	Jann Horn

On Mon, 24 Aug 2020, Alex Shi wrote:

> From: Hugh Dickins <hughd@google.com>
> 
> Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to
> fix the incorrect comments in code. Also fixed some zone->lru_lock comment
> error from ancient time. etc.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

I'm not the right person to be Acking this one; but when I scanned
through, I did notice some wording had been added that I want to
change. I should just send you a new version, but not tonight.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
>  Documentation/admin-guide/cgroup-v1/memory.rst     | 21 +++++++++------------
>  Documentation/trace/events-kmem.rst                |  2 +-
>  Documentation/vm/unevictable-lru.rst               | 22 ++++++++--------------
>  include/linux/mm_types.h                           |  2 +-
>  include/linux/mmzone.h                             |  3 +--
>  mm/filemap.c                                       |  4 ++--
>  mm/memcontrol.c                                    |  2 +-
>  mm/rmap.c                                          |  4 ++--
>  mm/vmscan.c                                        | 12 ++++++++----
>  10 files changed, 36 insertions(+), 51 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
> index 3f7115e07b5d..0b9f91589d3d 100644
> --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
> @@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
>  
>  8. LRU
>  ======
> -        Each memcg has its own private LRU. Now, its handling is under global
> -	VM's control (means that it's handled under global pgdat->lru_lock).
> -	Almost all routines around memcg's LRU is called by global LRU's
> -	list management functions under pgdat->lru_lock.
> -
> -	A special function is mem_cgroup_isolate_pages(). This scans
> -	memcg's private LRU and call __isolate_lru_page() to extract a page
> -	from LRU.
> -
> -	(By __isolate_lru_page(), the page is removed from both of global and
> -	private LRU.)
> -
> +	Each memcg has its own vector of LRUs (inactive anon, active anon,
> +	inactive file, active file, unevictable) of pages from each node,
> +	each LRU handled under a single lru_lock for that memcg and node.
>  
>  9. Typical Tests.
>  =================
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index 12757e63b26c..24450696579f 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered.
>  2.6 Locking
>  -----------
>  
> -   lock_page_cgroup()/unlock_page_cgroup() should not be called under
> -   the i_pages lock.
> +Lock order is as follows:
>  
> -   Other lock order is following:
> +  Page lock (PG_locked bit of page->flags)
> +    mm->page_table_lock or split pte_lock
> +      lock_page_memcg (memcg->move_lock)
> +        mapping->i_pages lock
> +          lruvec->lru_lock.
>  
> -   PG_locked.
> -     mm->page_table_lock
> -         pgdat->lru_lock
> -	   lock_page_cgroup.
> -
> -  In many cases, just lock_page_cgroup() is called.
> -
> -  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
> -  pgdat->lru_lock, it has no lock of its own.
> +Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
> +lruvec->lru_lock; PG_lru bit of page->flags is cleared before
> +isolating a page from its LRU under lruvec->lru_lock.
>  
>  2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
>  -----------------------------------------------
> diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
> index 555484110e36..68fa75247488 100644
> --- a/Documentation/trace/events-kmem.rst
> +++ b/Documentation/trace/events-kmem.rst
> @@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
>  Broadly speaking, pages are taken off the LRU lock in bulk and
>  freed in batch with a page list. Significant amounts of activity here could
>  indicate that the system is under memory pressure and can also indicate
> -contention on the zone->lru_lock.
> +contention on the lruvec->lru_lock.
>  
>  4. Per-CPU Allocator Activity
>  =============================
> diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
> index 17d0861b0f1d..0e1490524f53 100644
> --- a/Documentation/vm/unevictable-lru.rst
> +++ b/Documentation/vm/unevictable-lru.rst
> @@ -33,7 +33,7 @@ reclaim in Linux.  The problems have been observed at customer sites on large
>  memory x86_64 systems.
>  
>  To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
> -main memory will have over 32 million 4k pages in a single zone.  When a large
> +main memory will have over 32 million 4k pages in a single node.  When a large
>  fraction of these pages are not evictable for any reason [see below], vmscan
>  will spend a lot of time scanning the LRU lists looking for the small fraction
>  of pages that are evictable.  This can result in a situation where all CPUs are
> @@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
>  The Unevictable Page List
>  -------------------------
>  
> -The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
> +The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
>  called the "unevictable" list and an associated page flag, PG_unevictable, to
>  indicate that the page is being managed on the unevictable list.
>  
> @@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
>  swap-backed pages.  This differentiation is only important while the pages are,
>  in fact, evictable.
>  
> -The unevictable list benefits from the "arrayification" of the per-zone LRU
> +The unevictable list benefits from the "arrayification" of the per-node LRU
>  lists and statistics originally proposed and posted by Christoph Lameter.
>  
> -The unevictable list does not use the LRU pagevec mechanism. Rather,
> -unevictable pages are placed directly on the page's zone's unevictable list
> -under the zone lru_lock.  This allows us to prevent the stranding of pages on
> -the unevictable list when one task has the page isolated from the LRU and other
> -tasks are changing the "evictability" state of the page.
> -
>  
>  Memory Control Group Interaction
>  --------------------------------
> @@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
>  memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
>  lru_list enum.
>  
> -The memory controller data structure automatically gets a per-zone unevictable
> -list as a result of the "arrayification" of the per-zone LRU lists (one per
> +The memory controller data structure automatically gets a per-node unevictable
> +list as a result of the "arrayification" of the per-node LRU lists (one per
>  lru_list enum element).  The memory controller tracks the movement of pages to
>  and from the unevictable list.
>  
> @@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
>  active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
>  pages in all of the shrink_{active|inactive|page}_list() functions and will
>  "cull" such pages that it encounters: that is, it diverts those pages to the
> -unevictable list for the zone being scanned.
> +unevictable list for the node being scanned.
>  
>  There may be situations where a page is mapped into a VM_LOCKED VMA, but the
>  page is not marked as PG_mlocked.  Such pages will make it all the way to
> @@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
>  page from the LRU, as it is likely on the appropriate active or inactive list
>  at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
>  back the page - by calling putback_lru_page() - which will notice that the page
> -is now mlocked and divert the page to the zone's unevictable list.  If
> +is now mlocked and divert the page to the node's unevictable list.  If
>  mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
>  it later if and when it attempts to reclaim the page.
>  
> @@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
>       unevictable list in mlock_vma_page().
>  
>  shrink_inactive_list() also diverts any unevictable pages that it finds on the
> -inactive lists to the appropriate zone's unevictable list.
> +inactive lists to the appropriate node's unevictable list.
>  
>  shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
>  after shrink_active_list() had moved them to the inactive list, or pages mapped
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 496c3ff97cce..c3f1e76720af 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -78,7 +78,7 @@ struct page {
>  		struct {	/* Page cache and anonymous pages */
>  			/**
>  			 * @lru: Pageout list, eg. active_list protected by
> -			 * pgdat->lru_lock.  Sometimes used as a generic list
> +			 * lruvec->lru_lock.  Sometimes used as a generic list
>  			 * by the page owner.
>  			 */
>  			struct list_head lru;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 27a1513a43fc..f0596e634863 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>  struct pglist_data;
>  
>  /*
> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
> - * So add a wild amount of padding here to ensure that they fall into separate
> + * Add a wild amount of padding here to ensure datas fall into separate
>   * cachelines.  There are very few zone structures in the machine, so space
>   * consumption is not a concern here.
>   */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1aaea26556cc..6f8d58fb16db 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -102,8 +102,8 @@
>   *    ->swap_lock		(try_to_unmap_one)
>   *    ->private_lock		(try_to_unmap_one)
>   *    ->i_pages lock		(try_to_unmap_one)
> - *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
> - *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
> + *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
> + *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
>   *    ->private_lock		(page_remove_rmap->set_page_dirty)
>   *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
>   *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5b95529e64a4..454b3f205d1b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3279,7 +3279,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  
>  /*
> - * Because tail pages are not marked as "used", set it. We're under
> + * Because tail pages are not marked as "used", set it. Don't need
>   * lruvec->lru_lock and migration entries setup in all page mappings.
>   */
>  void mem_cgroup_split_huge_fixup(struct page *head)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 83cc459edc40..259c323e06ea 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -28,12 +28,12 @@
>   *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
>   *           anon_vma->rwsem
>   *             mm->page_table_lock or pte_lock
> - *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
>   *               swap_lock (in swap_duplicate, swap_info_get)
>   *                 mmlist_lock (in mmput, drain_mmlist and others)
>   *                 mapping->private_lock (in __set_page_dirty_buffers)
> - *                   mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
> + *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
>   *                     i_pages lock (widely used)
> + *                       lruvec->lru_lock (in lock_page_lruvec_irq)
>   *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
>   *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
>   *                   sb_lock (within inode_lock in fs/fs-writeback.c)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 04ef94190530..601fbcb994fb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1614,14 +1614,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
>  }
>  
>  /**
> - * pgdat->lru_lock is heavily contended.  Some of the functions that
> + * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
> + *
> + * lruvec->lru_lock is heavily contended.  Some of the functions that
>   * shrink the lists perform better by taking out a batch of pages
>   * and working on them outside the LRU lock.
>   *
>   * For pagecache intensive workloads, this function is the hottest
>   * spot in the kernel (apart from copy_*_user functions).
>   *
> - * Appropriate locks must be held before calling this function.
> + * Lru_lock must be held before calling this function.
>   *
>   * @nr_to_scan:	The number of eligible pages to look through on the list.
>   * @lruvec:	The LRU vector to pull pages from.
> @@ -1820,14 +1822,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  
>  /*
>   * This moves pages from @list to corresponding LRU list.
> + * The pages from @list is out of any lruvec, and in the end list reuses as
> + * pages_to_free list.
>   *
>   * We move them the other way if the page is referenced by one or more
>   * processes, from rmap.
>   *
>   * If the pages are mostly unmapped, the processing is fast and it is
> - * appropriate to hold zone_lru_lock across the whole operation.  But if
> + * appropriate to hold lru_lock across the whole operation.  But if
>   * the pages are mapped, the processing is slow (page_referenced()) so we
> - * should drop zone_lru_lock around each page.  It's impossible to balance
> + * should drop lru_lock around each page.  It's impossible to balance
>   * this, so instead we remove the pages from the LRU while processing them.
>   * It is safe to rely on PG_active against the non-LRU pages in here because
>   * nobody will play with that bit on a non-LRU page.
> -- 
> 1.8.3.1
> 
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock
  2020-08-24 12:54 ` [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock Alex Shi
@ 2020-09-22  5:53   ` Hugh Dickins
  2020-09-23  1:55     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-22  5:53 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301

On Mon, 24 Aug 2020, Alex Shi wrote:

> Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

I don't take pleasure in spoiling your celebrations and ceremonies,
but I strongly agree with AlexD that this should simply be merged
into the big one, 20/32.  That can be ceremony enough.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> ---
>  include/linux/mmzone.h | 1 -
>  mm/page_alloc.c        | 1 -
>  2 files changed, 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f0596e634863..0ed520954843 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -758,7 +758,6 @@ struct deferred_split {
>  
>  	/* Write-intensive fields used by page reclaim */
>  	ZONE_PADDING(_pad1_)
> -	spinlock_t		lru_lock;
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  	/*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fab5e97dc9ca..775120fcc869 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6733,7 +6733,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  	init_waitqueue_head(&pgdat->pfmemalloc_wait);
>  
>  	pgdat_page_ext_init(pgdat);
> -	spin_lock_init(&pgdat->lru_lock);
>  	lruvec_init(&pgdat->__lruvec);
>  }
>  
> -- 
> 1.8.3.1


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
  2020-08-24 12:54 ` [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page Alex Shi
  2020-08-26  5:52   ` Alex Shi
@ 2020-09-22  6:13   ` Hugh Dickins
  2020-09-23  1:58     ` Alex Shi
  1 sibling, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-09-22  6:13 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, mhocko, vdavydov.dev, shy828301, Kirill A. Shutemov,
	Vlastimil Babka

On Mon, 24 Aug 2020, Alex Shi wrote:

> In the func munlock_vma_page, the page must be PageLocked as well as
> pages in split_huge_page series funcs. Thus the PageLocked is enough
> to serialize both funcs.
> 
> So we could relief the TestClearPageMlocked/hpage_nr_pages which are not
> necessary under lru lock.
> 
> As to another munlock func __munlock_pagevec, which no PageLocked
> protection and should remain lru protecting.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

I made some comments on the mlock+munlock situation last week:
I won't review this 24/32 and 25/32 now, but will take a look
at your github tree tomorrow instead.  Perhaps I'll find you have
already done the fixes, perhaps I'll find you have merged these back
into earlier patches.  And I won't be reviewing beyond this point:
this is enough for now, I think.

Hugh

> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/mlock.c | 41 +++++++++++++++--------------------------
>  1 file changed, 15 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0448409184e3..46a05e6ec5ba 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -69,9 +69,9 @@ void clear_page_mlock(struct page *page)
>  	 *
>  	 * See __pagevec_lru_add_fn for more explanation.
>  	 */
> -	if (!isolate_lru_page(page)) {
> +	if (!isolate_lru_page(page))
>  		putback_lru_page(page);
> -	} else {
> +	else {
>  		/*
>  		 * We lost the race. the page already moved to evictable list.
>  		 */
> @@ -178,7 +178,6 @@ static void __munlock_isolation_failed(struct page *page)
>  unsigned int munlock_vma_page(struct page *page)
>  {
>  	int nr_pages;
> -	struct lruvec *lruvec;
>  
>  	/* For try_to_munlock() and to serialize with page migration */
>  	BUG_ON(!PageLocked(page));
> @@ -186,37 +185,22 @@ unsigned int munlock_vma_page(struct page *page)
>  	VM_BUG_ON_PAGE(PageTail(page), page);
>  
>  	/*
> -	 * Serialize split tail pages in __split_huge_page_tail() which
> -	 * might otherwise copy PageMlocked to part of the tail pages before
> -	 * we clear it in the head page. It also stabilizes thp_nr_pages().
> -	 * TestClearPageLRU can't be used here to block page isolation, since
> -	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
> -	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
> -	 * wrong lru list here. So relay on PageLocked to stop lruvec change
> -	 * in mem_cgroup_move_account().
> +	 * Serialize split tail pages in __split_huge_page_tail() by
> +	 * lock_page(); Do TestClearPageMlocked/PageLRU sequence like
> +	 * clear_page_mlock().
>  	 */
> -	lruvec = lock_page_lruvec_irq(page);
> -
> -	if (!TestClearPageMlocked(page)) {
> +	if (!TestClearPageMlocked(page))
>  		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
> -		nr_pages = 1;
> -		goto unlock_out;
> -	}
> +		return 0;
>  
>  	nr_pages = thp_nr_pages(page);
>  	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
>  
> -	if (__munlock_isolate_lru_page(page, lruvec, true)) {
> -		unlock_page_lruvec_irq(lruvec);
> +	if (!isolate_lru_page(page))
>  		__munlock_isolated_page(page);
> -		goto out;
> -	}
> -	__munlock_isolation_failed(page);
> -
> -unlock_out:
> -	unlock_page_lruvec_irq(lruvec);
> +	else
> +		__munlock_isolation_failed(page);
>  
> -out:
>  	return nr_pages - 1;
>  }
>  
> @@ -305,6 +289,11 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  
>  		/* block memcg change in mem_cgroup_move_account */
>  		lock_page_memcg(page);
> +		/*
> +		 * Serialize split tail pages in __split_huge_page_tail() which
> +		 * might otherwise copy PageMlocked to part of the tail pages
> +		 * before we clear it in the head page.
> +		 */
>  		lruvec = relock_page_lruvec_irq(page, lruvec);
>  		if (TestClearPageMlocked(page)) {
>  			/*
> -- 
> 1.8.3.1
> 
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-09-22  5:27   ` Hugh Dickins
@ 2020-09-22  8:58     ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-22  8:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Michal Hocko, Yang Shi



On 2020/9/22 at 1:27 PM, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Alex Shi wrote:
> 
>> This patch moves per node lru_lock into lruvec, thus bring a lru_lock for
>> each of memcg per node. So on a large machine, each of memcg don't
>> have to suffer from per node pgdat->lru_lock competition. They could go
>> fast with their self lru_lock.
>>
>> After move memcg charge before lru inserting, page isolation could
>> serialize page's memcg, then per memcg lruvec lock is stable and could
>> replace per node lru lock.
>>
>> In func isolate_migratepages_block, compact_unlock_should_abort is
>> opend, and lock_page_lruvec logical is embedded for tight process.
> 
> Hard to understand: perhaps:
> 
> In func isolate_migratepages_block, compact_unlock_should_abort and
> lock_page_lruvec_irqsave are open coded to work with compact_control.

will update with your suggestion. Thanks!

> 
>> Also add a debug func in locking which may give some clues if there are
>> sth out of hands.
>>
>> According to Daniel Jordan's suggestion, I run 208 'dd' with on 104
>> containers on a 2s * 26cores * HT box with a modefied case:
>> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> 
> s/modeified/modified/
> lruv19 has an lkml.org link there, please substitut
> https://lore.kernel.org/lkml/01ed6e45-3853-dcba-61cb-b429a49a7572@linux.alibaba.com/
> 

Thanks!

>>
>> With this and later patches, the readtwice performance increases
>> about 80% within concurrent containers.
>>
>> On a large machine with memcg enabled but not used, the page's lruvec
>> seeking pass a few pointers, that may lead to lru_lock holding time
>> increase and a bit regression.
>>
>> Hugh Dickins helped on patch polish, thanks!
>>
>> Reported-by: kernel test robot <lkp@intel.com>
> 
> Eh? It may have reported some locking bugs somewhere, but this
> is the main patch of your per-memcg lru_lock: I don't think the
> kernel test robot inspired your whole design, did it?  Delete that.
> 
> 
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> 
> I can't quite Ack this one yet, because there are several functions
> (mainly __munlock_pagevec and check_move_unevictable_pages) which are
> not right in this v18 version, and a bit tricky to correct: I already
> suggested what to do in other mail, but this patch comes before
> relock_page_lruvec, so must look different from the final result;
> I need to look at a later version, perhaps already there in your
> github tree, before I can Ack: but it's not far off.
> Comments below.

All suggestions are taken! Many thanks for such a detailed review!
A new branch with all the comments addressed has been pushed to
    https://github.com/alexshi/linux.git lruv19.5

A quick summary of the branch:
Added a new patch for move_pages_to_lru:
	mm/vmscan: remove lruvec reget in move_pages_to_lru
Added another patch, split out from 'Introduce TestClearPageLRU':
	mm/swap.c: reorder __ClearPageLRU and lruvec

The mlock changes are moved earlier in the series:
	mm/mlock: remove lru_lock on TestClearPageMlocked
	mm/mlock: remove __munlock_isolate_lru_page

I am wondering whether it's better to send out v19 now, or to wait
for your confirmation that all the suggestions/comments are settled?

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru
  2020-09-22  5:44   ` Hugh Dickins
@ 2020-09-23  1:55     ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-23  1:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Andrey Ryabinin, Jann Horn



On 2020/9/22 at 1:44 PM, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Alex Shi wrote:
> 
>> From: Hugh Dickins <hughd@google.com>
>>
>> Use the relock function to replace relocking action. And try to save few
>> lock times.
>>
>> Signed-off-by: Hugh Dickins <hughd@google.com>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> NAK. Who wrote this rubbish? Oh, did I? Maybe something you extracted
> from my tarball. No, we don't need any of this now, as explained when
> going through 20/32.
> 

removed in lruv19.5

Thanks!


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock
  2020-09-22  5:53   ` Hugh Dickins
@ 2020-09-23  1:55     ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-23  1:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301



On 2020/9/22 at 1:53 PM, Hugh Dickins wrote:
>> Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> I don't take pleasure in spoiling your celebrations and ceremonies,
> but I strongly agree with AlexD that this should simply be merged
> into the big one, 20/32.  That can be ceremony enough.
> 

folded into that patch.
Thanks!


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
  2020-09-22  6:13   ` Hugh Dickins
@ 2020-09-23  1:58     ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-09-23  1:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, willy, hannes,
	lkp, linux-mm, linux-kernel, cgroups, shakeelb, iamjoonsoo.kim,
	richard.weiyang, kirill, alexander.duyck, rong.a.chen, mhocko,
	vdavydov.dev, shy828301, Kirill A. Shutemov, Vlastimil Babka



On 2020/9/22 at 2:13 PM, Hugh Dickins wrote:
> On Mon, 24 Aug 2020, Alex Shi wrote:
> 
>> In the func munlock_vma_page, the page must be PageLocked as well as
>> pages in split_huge_page series funcs. Thus the PageLocked is enough
>> to serialize both funcs.
>>
>> So we could relief the TestClearPageMlocked/hpage_nr_pages which are not
>> necessary under lru lock.
>>
>> As to another munlock func __munlock_pagevec, which no PageLocked
>> protection and should remain lru protecting.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> I made some comments on the mlock+munlock situation last week:
> I won't review this 24/32 and 25/32 now, but will take a look
> at your github tree tomorrow instead.  Perhaps I'll find you have
> already done the fixes, perhaps I'll find you have merged these back
> into earlier patches.  And I won't be reviewing beyond this point:
> this is enough for now, I think.
> 

Yes, these 2 patches were fixed as you suggested, on
https://github.com/alexshi/linux.git lruv19.5

83f8582dcd5a mm/mlock: remove lru_lock on TestClearPageMlocked
20836d10f0ed mm/mlock: remove __munlock_isolate_lru_page

Thanks!
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread, other threads:[~2020-09-23  2:01 UTC | newest]

Thread overview: 102+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-24 12:54 [PATCH v18 00/32] per memcg lru_lock Alex Shi
2020-08-24 12:54 ` [PATCH v18 01/32] mm/memcg: warning on !memcg after readahead page charged Alex Shi
2020-08-24 12:54 ` [PATCH v18 02/32] mm/memcg: bail out early from swap accounting when memcg is disabled Alex Shi
2020-08-24 12:54 ` [PATCH v18 03/32] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
2020-08-24 12:54 ` [PATCH v18 04/32] mm/thp: clean up lru_add_page_tail Alex Shi
2020-08-24 12:54 ` [PATCH v18 05/32] mm/thp: remove code path which never got into Alex Shi
2020-08-24 12:54 ` [PATCH v18 06/32] mm/thp: narrow lru locking Alex Shi
2020-09-10 13:49   ` Matthew Wilcox
2020-09-11  3:37     ` Alex Shi
2020-09-13 15:27       ` Matthew Wilcox
2020-09-19  1:00         ` Hugh Dickins
2020-08-24 12:54 ` [PATCH v18 07/32] mm/swap.c: stop deactivate_file_page if page not on lru Alex Shi
2020-08-24 12:54 ` [PATCH v18 08/32] mm/vmscan: remove unnecessary lruvec adding Alex Shi
2020-08-24 12:54 ` [PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting Alex Shi
2020-08-24 12:54 ` [PATCH v18 10/32] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
2020-08-24 12:54 ` [PATCH v18 11/32] mm/memcg: add debug checking in lock_page_memcg Alex Shi
2020-08-24 12:54 ` [PATCH v18 12/32] mm/memcg: optimize mem_cgroup_page_lruvec Alex Shi
2020-08-24 12:54 ` [PATCH v18 13/32] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
2020-08-24 12:54 ` [PATCH v18 14/32] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
2020-08-24 12:54 ` [PATCH v18 15/32] mm/lru: move lock into lru_note_cost Alex Shi
2020-09-21 21:36   ` Hugh Dickins
2020-09-21 22:03     ` Hugh Dickins
2020-09-22  3:39       ` Alex Shi
2020-09-22  3:38     ` Alex Shi
2020-08-24 12:54 ` [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU Alex Shi
2020-09-21 23:16   ` Hugh Dickins
2020-09-22  3:53     ` Alex Shi
2020-08-24 12:54 ` [PATCH v18 17/32] mm/compaction: do page isolation first in compaction Alex Shi
2020-09-21 23:49   ` Hugh Dickins
2020-09-22  4:57     ` Alex Shi
2020-08-24 12:54 ` [PATCH v18 18/32] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
2020-08-24 12:54 ` [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn Alex Shi
2020-09-22  0:42   ` Hugh Dickins
2020-09-22  5:00     ` Alex Shi
2020-08-24 12:54 ` [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
2020-09-22  5:27   ` Hugh Dickins
2020-09-22  8:58     ` Alex Shi
2020-08-24 12:54 ` [PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function Alex Shi
2020-09-22  5:40   ` Hugh Dickins
2020-08-24 12:54 ` [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru Alex Shi
2020-09-22  5:44   ` Hugh Dickins
2020-09-23  1:55     ` Alex Shi
2020-08-24 12:54 ` [PATCH v18 23/32] mm/lru: revise the comments of lru_lock Alex Shi
2020-09-22  5:48   ` Hugh Dickins
2020-08-24 12:54 ` [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock Alex Shi
2020-09-22  5:53   ` Hugh Dickins
2020-09-23  1:55     ` Alex Shi
2020-08-24 12:54 ` [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page Alex Shi
2020-08-26  5:52   ` Alex Shi
2020-09-22  6:13   ` Hugh Dickins
2020-09-23  1:58     ` Alex Shi
2020-08-24 12:54 ` [PATCH v18 26/32] mm/mlock: remove __munlock_isolate_lru_page Alex Shi
2020-08-24 12:55 ` [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock Alex Shi
2020-08-26  9:07   ` Alex Shi
2020-08-24 12:55 ` [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block Alex Shi
2020-08-24 12:55 ` [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block Alex Shi
2020-08-24 12:55 ` [PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip Alex Shi
2020-08-24 12:55 ` [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages Alex Shi
2020-09-09  1:01   ` Matthew Wilcox
2020-09-09 15:43     ` Alexander Duyck
2020-09-09 17:07       ` Matthew Wilcox
2020-09-09 18:24       ` Hugh Dickins
2020-09-09 20:15         ` Matthew Wilcox
2020-09-09 21:05           ` Hugh Dickins
2020-09-09 21:17         ` Alexander Duyck
2020-08-24 12:55 ` [PATCH v18 32/32] mm: Split release_pages work into 3 passes Alex Shi
2020-08-24 18:42 ` [PATCH v18 00/32] per memcg lru_lock Andrew Morton
2020-08-24 20:24   ` Hugh Dickins
2020-08-25  1:56     ` Daniel Jordan
2020-08-25  3:26       ` Alex Shi
2020-08-25 11:39         ` Matthew Wilcox
2020-08-26  1:19         ` Daniel Jordan
2020-08-26  8:59           ` Alex Shi
2020-08-28  1:40             ` Daniel Jordan
2020-08-28  5:22               ` Alex Shi
2020-09-09  2:44               ` Aaron Lu
2020-09-09 11:40                 ` Michal Hocko
2020-08-25  8:52       ` Alex Shi
2020-08-25 13:00         ` Alex Shi
2020-08-27  7:01     ` Hugh Dickins
2020-08-27 12:20       ` Race between freeing and waking page Matthew Wilcox
2020-09-08 23:41       ` [PATCH v18 00/32] per memcg lru_lock: reviews Hugh Dickins
2020-09-09  2:24         ` Wei Yang
2020-09-09 15:08         ` Alex Shi
2020-09-09 23:16           ` Hugh Dickins
2020-09-11  2:50             ` Alex Shi
2020-09-12  2:13               ` Hugh Dickins
2020-09-13 14:21                 ` Alex Shi
2020-09-15  8:21                   ` Hugh Dickins
2020-09-15 16:58                     ` Daniel Jordan
2020-09-16 12:44                       ` Alex Shi
2020-09-17  2:37                       ` Alex Shi
2020-09-17 14:35                         ` Daniel Jordan
2020-09-17 15:39                           ` Alexander Duyck
2020-09-17 16:48                             ` Daniel Jordan
2020-09-12  8:38           ` Hugh Dickins
2020-09-13 14:22             ` Alex Shi
2020-09-09 16:11         ` Alexander Duyck
2020-09-10  0:32           ` Hugh Dickins
2020-09-10 14:24             ` Alexander Duyck
2020-09-12  5:12               ` Hugh Dickins
2020-08-25  7:21   ` [PATCH v18 00/32] per memcg lru_lock Michal Hocko
