* [PATCH v16 00/22] per memcg lru_lock
@ 2020-07-11  0:58 Alex Shi
  2020-07-11  0:58 ` [PATCH v16 01/22] mm/vmscan: remove unnecessary lruvec adding Alex Shi
                   ` (25 more replies)
  0 siblings, 26 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

This new version is based on v5.8-rc4. It adds 2 more patches:
'mm/thp: remove code path which never got into'
'mm/thp: add tail pages into lru anyway in split_huge_page()'
and modifies 'mm/mlock: reorder isolation sequence during munlock'.

Currently there is one lru_lock per node, pgdat->lru_lock, guarding the
lru lists, but the lru lists were moved into memcg a long time ago. Still
using a per-node lru_lock is clearly unscalable: pages in every memcg have
to compete with each other for a single lru_lock. This patchset uses a per
lruvec/memcg lru_lock to replace the per-node lru_lock for guarding the lru
lists, making it scalable across memcgs and giving a performance gain.

Currently lru_lock still guards both the lru list and the page's lru bit;
that's ok. But if we want to use a specific lruvec lock on the page, we need
to pin down the page's lruvec/memcg during locking. Just taking the lruvec
lock first may be undermined by the page's memcg charge/migration. To fix
this problem, we can take the clearing of the page's lru bit out and use it
as the pin-down action to block memcg changes. That's the reason for the new
atomic func TestClearPageLRU. So now isolating a page needs both actions:
TestClearPageLRU and holding the lru_lock.
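
As a rough illustration of that ordering, here is a minimal userspace sketch
in C11. It is not kernel code: struct page_model, struct lruvec_model and all
the helpers are hypothetical stand-ins, the lru_lock is modelled with an
atomic_flag spinlock, and the sketch simply assumes that anything changing
page->lruvec would also have to win the lru bit first.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
#define PG_LRU_MODEL 0x1u

struct lruvec_model {
	atomic_flag lru_lock;	/* models the per-memcg, per-node lru_lock */
	long nr_pages;		/* models the lru size, guarded by lru_lock */
};

struct page_model {
	atomic_uint flags;		/* bit 0 models the page's lru bit */
	struct lruvec_model *lruvec;	/* models the page's current lruvec */
};

static void lruvec_init(struct lruvec_model *lruvec)
{
	atomic_flag_clear(&lruvec->lru_lock);
	lruvec->nr_pages = 0;
}

static void lruvec_lock(struct lruvec_model *lruvec)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&lruvec->lru_lock,
						 memory_order_acquire))
		;
}

static void lruvec_unlock(struct lruvec_model *lruvec)
{
	atomic_flag_clear_explicit(&lruvec->lru_lock, memory_order_release);
}

/* Put a fresh page on the modelled lru and then set its lru bit. */
static void page_add_to_lru(struct page_model *page,
			    struct lruvec_model *lruvec)
{
	atomic_init(&page->flags, 0);
	page->lruvec = lruvec;
	lruvec_lock(lruvec);
	lruvec->nr_pages++;
	lruvec_unlock(lruvec);
	atomic_fetch_or(&page->flags, PG_LRU_MODEL);
}

/* Models TestClearPageLRU: atomically clear the bit, return its old value. */
static bool test_clear_page_lru(struct page_model *page)
{
	return atomic_fetch_and(&page->flags, ~PG_LRU_MODEL) & PG_LRU_MODEL;
}

/* Isolation: win the lru bit first, only then take that lruvec's lock. */
static bool isolate_page(struct page_model *page)
{
	struct lruvec_model *lruvec;

	if (!test_clear_page_lru(page))
		return false;	/* not on an lru, or somebody else won the bit */

	/* Assumed stable here: in this model, movers need the bit as well. */
	lruvec = page->lruvec;
	lruvec_lock(lruvec);
	lruvec->nr_pages--;
	lruvec_unlock(lruvec);
	return true;
}
```

In this model a second isolate_page() on the same page fails at the bit test
and never touches the lock, which is the serialization the paragraph above
relies on.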

The typical usage of this is isolate_migratepages_block() in compaction.c:
we have to take the lru bit before the lru lock, which serializes page
isolation against memcg page charge/migration, since those change the
page's lruvec and hence the lru_lock that covers it.

The above solution was suggested by Johannes Weiner, and this patchset is
based on his new memcg charge path. (Hugh Dickins tested and contributed much
code, from the compaction fix to general code polish, thanks a lot!)

The patchset includes 3 parts:
1, some code cleanup and minimal optimization as preparation.
2, use TestClearPageLRU as the precondition of page isolation
3, replace per node lru_lock with per memcg per node lru_lock

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2s * 26cores * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
With this patchset, the readtwice performance increased about 80%
in concurrent containers.

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up this
idea 8 years ago, and to the others who gave comments as well: Daniel Jordan,
Mel Gorman, Shakeel Butt, Matthew Wilcox etc.

Thanks for testing support from Intel 0day and Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

Alex Shi (20):
  mm/vmscan: remove unnecessary lruvec adding
  mm/page_idle: no unlikely double check for idle page counting
  mm/compaction: correct the comments of compact_defer_shift
  mm/compaction: rename compact_deferred as compact_should_defer
  mm/thp: move lru_add_page_tail func to huge_memory.c
  mm/thp: clean up lru_add_page_tail
  mm/thp: remove code path which never got into
  mm/thp: narrow lru locking
  mm/memcg: add debug checking in lock_page_memcg
  mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
  mm/lru: move lru_lock holding in func lru_note_cost_page
  mm/lru: move lock into lru_note_cost
  mm/lru: introduce TestClearPageLRU
  mm/thp: add tail pages into lru anyway in split_huge_page()
  mm/compaction: do page isolation first in compaction
  mm/mlock: reorder isolation sequence during munlock
  mm/swap: serialize memcg changes during pagevec_lru_move_fn
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/lru: introduce the relock_page_lruvec function
  mm/pgdat: remove pgdat lru_lock

Hugh Dickins (2):
  mm/vmscan: use relock for move_pages_to_lru
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
 Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 +--
 include/linux/compaction.h                         |   4 +-
 include/linux/memcontrol.h                         |  98 +++++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmzone.h                             |   6 +-
 include/linux/page-flags.h                         |   1 +
 include/linux/swap.h                               |   4 +-
 include/trace/events/compaction.h                  |   2 +-
 mm/compaction.c                                    | 113 ++++++++----
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  47 +++--
 mm/memcontrol.c                                    |  71 +++++++-
 mm/memory.c                                        |   3 -
 mm/mlock.c                                         |  93 +++++-----
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   8 -
 mm/rmap.c                                          |   4 +-
 mm/swap.c                                          | 189 ++++++++-------------
 mm/swap_state.c                                    |   2 -
 mm/vmscan.c                                        | 174 ++++++++++---------
 mm/workingset.c                                    |   2 -
 25 files changed, 524 insertions(+), 365 deletions(-)

-- 
1.8.3.1



* [PATCH v16 01/22] mm/vmscan: remove unnecessary lruvec adding
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 02/22] mm/page_idle: no unlikely double check for idle page counting Alex Shi
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

We don't have to add a freeable page to the lru and then remove it again.
This change saves a couple of actions and makes the move clearer.

The SetPageLRU needs to be kept here for list integrity.
Otherwise:
 #0 move_pages_to_lru              #1 release_pages
                                   if (put_page_testzero())
 if !put_page_testzero
                                     !PageLRU //skip lru_lock
                                       list_add(&page->lru,)
   list_add(&page->lru,) //corrupt

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmscan.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749d239c62b2..ddb29d813d77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1856,26 +1856,29 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
+		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			list_del(&page->lru);
 			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
 			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
+		/*
+		 * The SetPageLRU needs to be kept here for list integrity.
+		 * Otherwise:
+		 *   #0 move_pages_to_lru             #1 release_pages
+		 *				      if (put_page_testzero())
+		 *   if !put_page_testzero
+		 *				        !PageLRU //skip lru_lock
+		 *                                        list_add(&page->lru,)
+		 *     list_add(&page->lru,) //corrupt
+		 */
 		SetPageLRU(page);
-		lru = page_lru(page);
 
-		nr_pages = hpage_nr_pages(page);
-		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
-		list_move(&page->lru, &lruvec->lists[lru]);
-
-		if (put_page_testzero(page)) {
+		if (unlikely(put_page_testzero(page))) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&pgdat->lru_lock);
@@ -1883,11 +1886,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
-		} else {
-			nr_moved += nr_pages;
-			if (PageActive(page))
-				workingset_age_nonresident(lruvec, nr_pages);
+
+			continue;
 		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lru = page_lru(page);
+		nr_pages = hpage_nr_pages(page);
+
+		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
+		list_add(&page->lru, &lruvec->lists[lru]);
+		nr_moved += nr_pages;
+		if (PageActive(page))
+			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
 	/*
-- 
1.8.3.1



* [PATCH v16 02/22] mm/page_idle: no unlikely double check for idle page counting
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
  2020-07-11  0:58 ` [PATCH v16 01/22] mm/vmscan: remove unnecessary lruvec adding Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 03/22] mm/compaction: correct the comments of compact_defer_shift Alex Shi
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

As the function comment mentions, missing a few isolated pages can be
tolerated. So why not go further and drop the unlikely double check? That
won't cause more idle pages, but it removes a source of lock contention.

This is also preparation for the later new page isolation feature.
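
To make the remaining lockless path concrete, here is a small userspace
sketch in C11 of "check the lru bit, then take a reference unless the count
is already zero". It is only a model: struct idle_page_model and the helper
names are hypothetical, and the compare-exchange loop is an assumed stand-in
for what get_page_unless_zero() provides in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two bits of struct page state used here. */
struct idle_page_model {
	atomic_bool on_lru;	/* models PageLRU(page) */
	atomic_int refcount;	/* models the page reference count */
};

static void idle_page_init(struct idle_page_model *page, bool on_lru,
			   int refcount)
{
	atomic_init(&page->on_lru, on_lru);
	atomic_init(&page->refcount, refcount);
}

/* Take a reference only if the count has not already dropped to zero. */
static bool get_ref_unless_zero(atomic_int *refcount)
{
	int old = atomic_load(refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(refcount, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Models page_idle_get_page() after this patch: no lock, no second
 * lru check. The page may leave the lru right after the test, which
 * the commit message argues is tolerable.
 */
static struct idle_page_model *idle_get_page_model(struct idle_page_model *page)
{
	if (!page || !atomic_load(&page->on_lru) ||
	    !get_ref_unless_zero(&page->refcount))
		return NULL;

	return page;
}
```

A page that is off the lru, or whose count is already zero, is rejected
without its reference count being touched.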

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/page_idle.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 057c61df12db..5fdd753e151a 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,19 +32,11 @@
 static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page = pfn_to_online_page(pfn);
-	pg_data_t *pgdat;
 
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
 
-	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	if (unlikely(!PageLRU(page))) {
-		put_page(page);
-		page = NULL;
-	}
-	spin_unlock_irq(&pgdat->lru_lock);
 	return page;
 }
 
-- 
1.8.3.1



* [PATCH v16 03/22] mm/compaction: correct the comments of compact_defer_shift
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
  2020-07-11  0:58 ` [PATCH v16 01/22] mm/vmscan: remove unnecessary lruvec adding Alex Shi
  2020-07-11  0:58 ` [PATCH v16 02/22] mm/page_idle: no unlikely double check for idle page counting Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 04/22] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

There is no compact_defer_limit; the field actually in use is
compact_defer_shift. Also add an explanation of compact_order_failed.
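
For readers who don't have the deferral logic in their head, the interplay of
the three fields can be sketched in userspace C as below. This is a simplified
model, not the kernel source: struct zone_model and the *_model functions are
hypothetical names, tracing is left out, and MAX_DEFER_SHIFT_MODEL = 6 is an
assumed stand-in for COMPACT_MAX_DEFER_SHIFT.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed stand-in for COMPACT_MAX_DEFER_SHIFT. */
#define MAX_DEFER_SHIFT_MODEL 6

/* Hypothetical model of the three struct zone fields discussed here. */
struct zone_model {
	unsigned int compact_considered;  /* attempts seen since last failure */
	unsigned int compact_defer_shift; /* log2 of the current defer limit */
	int compact_order_failed;	  /* minimum order that failed */
};

/* Called on a compaction failure: restart counting and widen the window. */
static void defer_compaction_model(struct zone_model *zone, int order)
{
	zone->compact_considered = 0;
	zone->compact_defer_shift++;

	if (order < zone->compact_order_failed)
		zone->compact_order_failed = order;

	if (zone->compact_defer_shift > MAX_DEFER_SHIFT_MODEL)
		zone->compact_defer_shift = MAX_DEFER_SHIFT_MODEL;
}

/* Returns true if compaction should be skipped this time. */
static bool compaction_should_defer_model(struct zone_model *zone, int order)
{
	unsigned long defer_limit = 1UL << zone->compact_defer_shift;

	/* Orders below the failed one are not deferred at all. */
	if (order < zone->compact_order_failed)
		return false;

	/* Clamp the counter so it cannot overflow. */
	if (++zone->compact_considered > defer_limit)
		zone->compact_considered = defer_limit;

	if (zone->compact_considered >= defer_limit)
		return false;

	return true;
}
```

Each failure bumps compact_defer_shift, so the window of skipped attempts
doubles until it is capped by the maximum shift.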

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmzone.h | 1 +
 mm/compaction.c        | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f6f884970511..14c668b7e793 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -512,6 +512,7 @@ struct zone {
 	 * On compaction failure, 1<<compact_defer_shift compactions
 	 * are skipped before trying again. The number attempted since
 	 * last failure is tracked with compact_considered.
+	 * compact_order_failed is the minimum compaction failed order.
 	 */
 	unsigned int		compact_considered;
 	unsigned int		compact_defer_shift;
diff --git a/mm/compaction.c b/mm/compaction.c
index 86375605faa9..cd1ef9e5e638 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page)
 
 /*
  * Compaction is deferred when compaction fails to result in a page
- * allocation success. 1 << compact_defer_limit compactions are skipped up
+ * allocation success. compact_defer_shift++, compactions are skipped up
  * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
  */
 void defer_compaction(struct zone *zone, int order)
-- 
1.8.3.1



* [PATCH v16 04/22] mm/compaction: rename compact_deferred as compact_should_defer
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (2 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 03/22] mm/compaction: correct the comments of compact_defer_shift Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 05/22] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Steven Rostedt, Ingo Molnar, Vlastimil Babka, Mike Kravetz

compaction_deferred() only checks whether compaction should be deferred;
the deferring itself is done in defer_compaction(), not here. So rename it
to avoid confusion.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/compaction.h        | 4 ++--
 include/trace/events/compaction.h | 2 +-
 mm/compaction.c                   | 8 ++++----
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6fa0eea3f530..be9ed7437a38 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -100,7 +100,7 @@ extern enum compact_result compaction_suitable(struct zone *zone, int order,
 		unsigned int alloc_flags, int highest_zoneidx);
 
 extern void defer_compaction(struct zone *zone, int order);
-extern bool compaction_deferred(struct zone *zone, int order);
+extern bool compaction_should_defer(struct zone *zone, int order);
 extern void compaction_defer_reset(struct zone *zone, int order,
 				bool alloc_success);
 extern bool compaction_restarting(struct zone *zone, int order);
@@ -199,7 +199,7 @@ static inline void defer_compaction(struct zone *zone, int order)
 {
 }
 
-static inline bool compaction_deferred(struct zone *zone, int order)
+static inline bool compaction_should_defer(struct zone *zone, int order)
 {
 	return true;
 }
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 54e5bf081171..33633c71df04 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -274,7 +274,7 @@
 		1UL << __entry->defer_shift)
 );
 
-DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_deferred,
+DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_should_defer,
 
 	TP_PROTO(struct zone *zone, int order),
 
diff --git a/mm/compaction.c b/mm/compaction.c
index cd1ef9e5e638..f14780fc296a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -154,7 +154,7 @@ void defer_compaction(struct zone *zone, int order)
 }
 
 /* Returns true if compaction should be skipped this time */
-bool compaction_deferred(struct zone *zone, int order)
+bool compaction_should_defer(struct zone *zone, int order)
 {
 	unsigned long defer_limit = 1UL << zone->compact_defer_shift;
 
@@ -168,7 +168,7 @@ bool compaction_deferred(struct zone *zone, int order)
 	if (zone->compact_considered >= defer_limit)
 		return false;
 
-	trace_mm_compaction_deferred(zone, order);
+	trace_mm_compaction_should_defer(zone, order);
 
 	return true;
 }
@@ -2377,7 +2377,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		enum compact_result status;
 
 		if (prio > MIN_COMPACT_PRIORITY
-					&& compaction_deferred(zone, order)) {
+				&& compaction_should_defer(zone, order)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
 			continue;
 		}
@@ -2561,7 +2561,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		if (!populated_zone(zone))
 			continue;
 
-		if (compaction_deferred(zone, cc.order))
+		if (compaction_should_defer(zone, cc.order))
 			continue;
 
 		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
-- 
1.8.3.1



* [PATCH v16 05/22] mm/thp: move lru_add_page_tail func to huge_memory.c
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (3 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 04/22] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-16  8:59   ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 06/22] mm/thp: clean up lru_add_page_tail Alex Shi
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

The function is only used in huge_memory.c; defining it in another file
under a CONFIG_TRANSPARENT_HUGEPAGE guard just looks weird.

Let's move it to the THP code, and make it static as Hugh Dickins suggested.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 --
 mm/huge_memory.c     | 30 ++++++++++++++++++++++++++++++
 mm/swap.c            | 33 ---------------------------------
 3 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5b3216ba39a9..2c29399b29a0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -339,8 +339,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
 			  unsigned int nr_pages);
 extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
-extern void lru_add_page_tail(struct page *page, struct page *page_tail,
-			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78c84bee7e29..9e050b13f597 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2340,6 +2340,36 @@ static void remap_page(struct page *page)
 	}
 }
 
+static void lru_add_page_tail(struct page *page, struct page *page_tail,
+				struct lruvec *lruvec, struct list_head *list)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+
+	if (!list)
+		SetPageLRU(page_tail);
+
+	if (likely(PageLRU(page)))
+		list_add_tail(&page_tail->lru, &page->lru);
+	else if (list) {
+		/* page reclaim is reclaiming a huge page */
+		get_page(page_tail);
+		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Head page has not yet been counted, as an hpage,
+		 * so we must account for each subpage individually.
+		 *
+		 * Put page_tail on the list at the correct position
+		 * so they all end up in order.
+		 */
+		add_page_to_lru_list_tail(page_tail, lruvec,
+					  page_lru(page_tail));
+	}
+}
+
 static void __split_huge_page_tail(struct page *head, int tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
diff --git a/mm/swap.c b/mm/swap.c
index a82efc33411f..7701d855873d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -933,39 +933,6 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct page *page, struct page *page_tail,
-		       struct lruvec *lruvec, struct list_head *list)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
-
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
-	else if (list) {
-		/* page reclaim is reclaiming a huge page */
-		get_page(page_tail);
-		list_add_tail(&page_tail->lru, list);
-	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
-	}
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 				 void *arg)
 {
-- 
1.8.3.1



* [PATCH v16 06/22] mm/thp: clean up lru_add_page_tail
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (4 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 05/22] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-20  8:43   ` Kirill A. Shutemov
  2020-07-11  0:58 ` [PATCH v16 07/22] mm/thp: remove code path which never got into Alex Shi
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

Since the first parameter is only ever the head page, it's better to make
that explicit in its name.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9e050b13f597..b18f21da4dac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2340,19 +2340,19 @@ static void remap_page(struct page *page)
 	}
 }
 
-static void lru_add_page_tail(struct page *page, struct page *page_tail,
+static void lru_add_page_tail(struct page *head, struct page *page_tail,
 				struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	VM_BUG_ON_PAGE(!PageHead(head), head);
+	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
 	if (!list)
 		SetPageLRU(page_tail);
 
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
+	if (likely(PageLRU(head)))
+		list_add_tail(&page_tail->lru, &head->lru);
 	else if (list) {
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
-- 
1.8.3.1



* [PATCH v16 07/22] mm/thp: remove code path which never got into
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (5 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 06/22] mm/thp: clean up lru_add_page_tail Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-20  8:43   ` Kirill A. Shutemov
  2020-07-11  0:58 ` [PATCH v16 08/22] mm/thp: narrow lru locking Alex Shi
                   ` (18 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

split_huge_page() is never called on a page which isn't on an lru list, so
this code never gets a chance to run, and it should not run: it would add
tail pages to an lru list which the head page isn't on.

Although the bug was never triggered, it is better to remove it for code
correctness.

BTW, it would look better to have a BUG() or some warning set in the wrong
path, but the path will be changed by the incoming new page isolation
func. So just leave that out here.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b18f21da4dac..1fb4147ff854 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2357,16 +2357,6 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
-	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
 	}
 }
 
-- 
1.8.3.1



* [PATCH v16 08/22] mm/thp: narrow lru locking
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (6 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 07/22] mm/thp: remove code path which never got into Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 09/22] mm/memcg: add debug checking in lock_page_memcg Alex Shi
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Andrea Arcangeli

lru_lock and the page cache xa_lock have no reason to be taken in the
current sequence; holding them together isn't necessary. Let's narrow the
lru locking, but keep local_irq_disable() to block interrupt re-entry and
to cover the statistics updates.

Hugh Dickins' point: split_huge_page_to_list() was already silly to be
using the _irqsave variant: it has just been taking sleeping locks, so it
would already be broken if entered with interrupts disabled.
So we can save passing the flags argument down to __split_huge_page().

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1fb4147ff854..d866b6e43434 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2423,7 +2423,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+			      pgoff_t end)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
@@ -2432,8 +2432,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned long offset = 0;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2445,6 +2443,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock(&pgdat->lru_lock);
+
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
@@ -2464,6 +2467,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
+	spin_unlock(&pgdat->lru_lock);
+	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
 
@@ -2481,8 +2486,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		page_ref_add(head, 2);
 		xa_unlock(&head->mapping->i_pages);
 	}
-
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	local_irq_enable();
 
 	remap_page(head);
 
@@ -2621,12 +2625,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
-	unsigned long flags;
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2687,9 +2689,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
-
+	/* block interrupt reentry in xa_lock and spinlock */
+	local_irq_disable();
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
 
@@ -2719,7 +2720,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
 
@@ -2738,7 +2739,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		local_irq_enable();
 		remap_page(head);
 		ret = -EBUSY;
 	}
-- 
1.8.3.1



* [PATCH v16 09/22] mm/memcg: add debug checking in lock_page_memcg
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (7 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 08/22] mm/thp: narrow lru locking Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 10/22] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Michal Hocko, Vladimir Davydov

Add a debug check to lock_page_memcg(): under CONFIG_PROVE_LOCKING, call
might_lock() on memcg->move_lock so that we get an alarm if anything is
wrong here.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19622328e4b5..fde47272b13c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1983,6 +1983,12 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 	if (unlikely(!memcg))
 		return NULL;
 
+#ifdef CONFIG_PROVE_LOCKING
+	local_irq_save(flags);
+	might_lock(&memcg->move_lock);
+	local_irq_restore(flags);
+#endif
+
 	if (atomic_read(&memcg->moving_account) <= 0)
 		return memcg;
 
-- 
1.8.3.1



* [PATCH v16 10/22] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (8 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 09/22] mm/memcg: add debug checking in lock_page_memcg Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 11/22] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

Fold the PGROTATED event accounting into the pagevec_move_tail_fn()
callback, as the other callbacks passed to pagevec_lru_move_fn() already
do. All users of pagevec_lru_move_fn() now look the same, and the third
parameter is no longer needed.

This simplifies the calls.
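
The shape of this change can be sketched in userspace C. This is only an
illustration under stated assumptions: struct item, rotated_events,
batch_move() and demo_rotate() are hypothetical stand-ins for struct page,
the PGROTATED counter and pagevec_lru_move_fn(), not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and the PGROTATED event counter. */
struct item {
	int on_list;
	int nr_pages;
};
static unsigned long rotated_events;

/* Callback type after the change: no opaque third argument. */
typedef void (*move_fn_t)(struct item *it);

/* The callback accounts its own event instead of reporting through arg. */
static void move_tail_fn(struct item *it)
{
	if (it->on_list)
		rotated_events += it->nr_pages;
}

/* Generic walker: the call looks the same for every callback. */
static void batch_move(struct item *items, size_t n, move_fn_t fn)
{
	for (size_t i = 0; i < n; i++)
		fn(&items[i]);
}

/* Run one fixed batch and return the number of events counted. */
static unsigned long demo_rotate(void)
{
	struct item batch[] = { { 1, 1 }, { 0, 1 }, { 1, 512 } };

	rotated_events = 0;
	batch_move(batch, 3, move_tail_fn);
	return rotated_events;
}
```

With the counter updated inside the callback, the wrapper that only
existed to pass a pointer to a local counter is no longer needed.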

[lkp@intel.com: found a build issue in the original patch, thanks]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 66 +++++++++++++++++++++++----------------------------------------
 1 file changed, 24 insertions(+), 42 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 7701d855873d..dc8b02cdddcb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -204,8 +204,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
-	void *arg)
+	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
 	struct pglist_data *pgdat = NULL;
@@ -224,7 +223,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		(*move_fn)(page, lruvec, arg);
+		(*move_fn)(page, lruvec);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -232,35 +231,23 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	int *pgmoved = arg;
-
 	if (PageLRU(page) && !PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
-		(*pgmoved) += hpage_nr_pages(page);
+		__count_vm_events(PGROTATED, hpage_nr_pages(page));
 	}
 }
 
 /*
- * pagevec_move_tail() must be called with IRQ disabled.
- * Otherwise this may cause nasty races.
- */
-static void pagevec_move_tail(struct pagevec *pvec)
-{
-	int pgmoved = 0;
-
-	pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
-	__count_vm_events(PGROTATED, pgmoved);
-}
-
-/*
  * Writeback is about to end against a page which has been marked for immediate
  * reclaim.  If it still appears to be reclaimable, move it to the tail of the
  * inactive list.
+ *
+ * pagevec_move_tail_fn() must be called with IRQ disabled.
+ * Otherwise this may cause nasty races.
  */
 void rotate_reclaimable_page(struct page *page)
 {
@@ -273,7 +260,7 @@ void rotate_reclaimable_page(struct page *page)
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_move_tail(pvec);
+			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 }
@@ -315,8 +302,7 @@ void lru_note_cost_page(struct page *page)
 		      page_is_file_lru(page), hpage_nr_pages(page));
 }
 
-static void __activate_page(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -340,7 +326,7 @@ static void activate_page_drain(int cpu)
 	struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
 
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, __activate_page, NULL);
+		pagevec_lru_move_fn(pvec, __activate_page);
 }
 
 static bool need_activate_page_drain(int cpu)
@@ -358,7 +344,7 @@ void activate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, __activate_page, NULL);
+			pagevec_lru_move_fn(pvec, __activate_page);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -374,7 +360,7 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
+	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -526,8 +512,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
-			      void *arg)
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 {
 	int lru;
 	bool active;
@@ -574,8 +559,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -592,8 +576,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
@@ -636,21 +619,21 @@ void lru_add_drain_cpu(int cpu)
 
 		/* No harm done if a racing interrupt already did this */
 		local_lock_irqsave(&lru_rotate.lock, flags);
-		pagevec_move_tail(pvec);
+		pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
 	activate_page_drain(cpu);
 }
@@ -679,7 +662,7 @@ void deactivate_file_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -701,7 +684,7 @@ void deactivate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -723,7 +706,7 @@ void mark_page_lazyfree(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -933,8 +916,7 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 {
 	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
@@ -993,7 +975,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
 }
 
 /**
-- 
1.8.3.1



* [PATCH v16 11/22] mm/lru: move lru_lock holding in func lru_note_cost_page
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (9 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 10/22] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 12/22] mm/lru: move lock into lru_note_cost Alex Shi
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

Take the lru_lock inside lru_note_cost_page() instead of in each caller.
This is a cleanup patch with no functional change.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memory.c     | 3 ---
 mm/swap.c       | 2 ++
 mm/swap_state.c | 2 --
 mm/workingset.c | 2 --
 4 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 87ec87cdc1ff..dafc5585517e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3150,10 +3150,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				 * XXX: Move to lru_cache_add() when it
 				 * supports new vs putback
 				 */
-				spin_lock_irq(&page_pgdat(page)->lru_lock);
 				lru_note_cost_page(page);
-				spin_unlock_irq(&page_pgdat(page)->lru_lock);
-
 				lru_cache_add(page);
 				swap_readpage(page, true);
 			}
diff --git a/mm/swap.c b/mm/swap.c
index dc8b02cdddcb..b88ca630db70 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -298,8 +298,10 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 
 void lru_note_cost_page(struct page *page)
 {
+	spin_lock_irq(&page_pgdat(page)->lru_lock);
 	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
 		      page_is_file_lru(page), hpage_nr_pages(page));
+	spin_unlock_irq(&page_pgdat(page)->lru_lock);
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 05889e8e3c97..080be52db6a8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -440,9 +440,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	}
 
 	/* XXX: Move to lru_cache_add() when it supports new vs putback */
-	spin_lock_irq(&page_pgdat(page)->lru_lock);
 	lru_note_cost_page(page);
-	spin_unlock_irq(&page_pgdat(page)->lru_lock);
 
 	/* Caller will initiate read into locked page */
 	SetPageWorkingset(page);
diff --git a/mm/workingset.c b/mm/workingset.c
index 50b7937bab32..337d5b9ad132 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -372,9 +372,7 @@ void workingset_refault(struct page *page, void *shadow)
 	if (workingset) {
 		SetPageWorkingset(page);
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
-		spin_lock_irq(&page_pgdat(page)->lru_lock);
 		lru_note_cost_page(page);
-		spin_unlock_irq(&page_pgdat(page)->lru_lock);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
 out:
-- 
1.8.3.1



* [PATCH v16 12/22] mm/lru: move lock into lru_note_cost
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (10 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 11/22] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU Alex Shi
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

This patch moves the lru_lock into lru_note_cost(). It is a bit ugly and
may cost more locking, but it is necessary for the later change from the
per-pgdat lru_lock to a per-memcg lru_lock.
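
A rough userspace sketch of why the lock has to sit inside the loop once
each level has its own lock. Everything here is an assumption made for
illustration: struct fake_lruvec, the atomic_flag spinlock and
demo_note_cost() are hypothetical, and in this patch the lock is still
pgdat->lru_lock rather than a per-lruvec one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a lruvec with its own lock and a parent link. */
struct fake_lruvec {
	atomic_flag lock;
	unsigned long file_cost;
	unsigned long anon_cost;
	struct fake_lruvec *parent;
};

/* Minimal spinlock built on atomic_flag, standing in for spin_lock_irq(). */
static void fake_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void fake_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Record the cost at every level, locking each level separately. */
static void note_cost(struct fake_lruvec *lv, int file, unsigned long nr_pages)
{
	do {
		fake_lock(&lv->lock);
		if (file)
			lv->file_cost += nr_pages;
		else
			lv->anon_cost += nr_pages;
		fake_unlock(&lv->lock);
	} while ((lv = lv->parent));
}

/* Charge a child and its root, then pack the three counters into one value. */
static unsigned long demo_note_cost(void)
{
	struct fake_lruvec root = { .lock = ATOMIC_FLAG_INIT, .parent = NULL };
	struct fake_lruvec child = { .lock = ATOMIC_FLAG_INIT, .parent = &root };

	note_cost(&child, 1, 8);	/* file cost: child and root */
	note_cost(&root, 0, 2);		/* anon cost: root only */
	return child.file_cost * 100 + root.file_cost * 10 + root.anon_cost;
}
```

A caller that already holds one level's lock could not safely walk up to
the parents, which is why the call in vmscan.c moves outside the locked
region.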

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c   | 5 +++--
 mm/vmscan.c | 4 +---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index b88ca630db70..f645965fde0e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -269,7 +269,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+		spin_lock_irq(&pgdat->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -293,15 +295,14 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
+		spin_unlock_irq(&pgdat->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 void lru_note_cost_page(struct page *page)
 {
-	spin_lock_irq(&page_pgdat(page)->lru_lock);
 	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
 		      page_is_file_lru(page), hpage_nr_pages(page));
-	spin_unlock_irq(&page_pgdat(page)->lru_lock);
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ddb29d813d77..c1c4259b4de5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1976,19 +1976,17 @@ static int current_may_throttle(void)
 				&stat, false);
 
 	spin_lock_irq(&pgdat->lru_lock);
-
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	lru_note_cost(lruvec, file, stat.nr_pageout);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-
 	spin_unlock_irq(&pgdat->lru_lock);
 
+	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
 	free_unref_page_list(&page_list);
 
-- 
1.8.3.1



* [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (11 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 12/22] mm/lru: move lock into lru_note_cost Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-16  9:06   ` Alex Shi
  2020-07-16 21:12     ` Alexander Duyck
  2020-07-11  0:58 ` [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
                   ` (12 subsequent siblings)
  25 siblings, 2 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Michal Hocko, Vladimir Davydov

Combine the PageLRU check and ClearPageLRU into one operation with the
newly introduced TestClearPageLRU(). This function will be used as a
precondition for page isolation, to exclude other isolations elsewhere.
As a result there may be non-PageLRU pages on an lru list, so the
corresponding BUG checks need to be removed.
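
The test-and-clear semantics can be sketched with C11 atomics. This is a
userspace illustration only: struct fake_page, FLAG_LRU and the helper
names are hypothetical, and the kernel's TESTCLEARFLAG() macro is built
on the kernel's own bitops rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_LRU (1UL << 0)	/* hypothetical bit position */

struct fake_page {
	atomic_ulong flags;
};

/* Atomically clear FLAG_LRU; true only for the caller that cleared it. */
static bool test_clear_lru(struct fake_page *p)
{
	unsigned long old = atomic_fetch_and(&p->flags, ~FLAG_LRU);

	return (old & FLAG_LRU) != 0;
}

/* Two isolation attempts on the same page: count how many succeed. */
static int demo_isolate_twice(void)
{
	struct fake_page p;
	int wins = 0;

	atomic_init(&p.flags, FLAG_LRU);
	wins += test_clear_lru(&p);
	wins += test_clear_lru(&p);
	return wins;
}
```

Because the check and the clear are a single atomic step, at most one
isolation path can win the page, which a separate PageLRU() test followed
by ClearPageLRU() cannot guarantee without a lock.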

Hugh Dickins pointed out that __page_cache_release() and release_pages()
have no need for an atomic bit clear, since the page has no other users
at that moment, and that isolate_lru_page() has no need to call
get_page() before clearing the lru bit, since it '(1) Must be called
with an elevated refcount on the page'.

As Andrew Morton mentioned, this change dirties the cacheline of a page
that isn't on the LRU. But the loss should be acceptable according to the
report from Rong Chen <rong.a.chen@intel.com>:
https://lkml.org/lkml/2020/3/4/173

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/page-flags.h |  1 +
 mm/mlock.c                 |  3 +--
 mm/swap.c                  |  6 ++----
 mm/vmscan.c                | 26 +++++++++++---------------
 4 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6be1aa559b1e..9554ed1387dc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size)
 PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+	TESTCLEARFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
 PAGEFLAG(Workingset, workingset, PF_HEAD)
diff --git a/mm/mlock.c b/mm/mlock.c
index f8736136fad7..228ba5a8e0a5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -108,13 +108,12 @@ void mlock_vma_page(struct page *page)
  */
 static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
 {
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		struct lruvec *lruvec;
 
 		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
-		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		return true;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index f645965fde0e..5092fe9c8c47 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
 		struct lruvec *lruvec;
 		unsigned long flags;
 
+		__ClearPageLRU(page);
 		spin_lock_irqsave(&pgdat->lru_lock, flags);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 	}
@@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
 				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
-			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
+			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c1c4259b4de5..18986fefd49b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1548,16 +1548,16 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 {
 	int ret = -EINVAL;
 
-	/* Only take pages on the LRU. */
-	if (!PageLRU(page))
-		return ret;
-
 	/* Compaction should not handle unevictable pages but CMA can do so */
 	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
 		return ret;
 
 	ret = -EBUSY;
 
+	/* Only take pages on the LRU. */
+	if (!PageLRU(page))
+		return ret;
+
 	/*
 	 * To minimise LRU disruption, the caller can indicate that it only
 	 * wants to isolate pages it will be able to operate on without
@@ -1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
 
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-
 		nr_pages = compound_nr(page);
 		total_scan += nr_pages;
 
@@ -1769,21 +1767,19 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
+		int lru = page_lru(page);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		get_page(page);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		if (PageLRU(page)) {
-			int lru = page_lru(page);
-			get_page(page);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, lru);
-			ret = 0;
-		}
+		spin_lock_irq(&pgdat->lru_lock);
+		del_page_from_lru_list(page, lruvec, lru);
 		spin_unlock_irq(&pgdat->lru_lock);
+		ret = 0;
 	}
+
 	return ret;
 }
 
-- 
1.8.3.1



* [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in split_huge_page()
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (12 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-17  9:30   ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 15/22] mm/compaction: do page isolation first in compaction Alex Shi
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Mika Penttilä

split_huge_page() must start with PageLRU(head), but the lru bit *may*
have been cleared by isolate_lru_page(). Either way the head is still on
the lru list, since we still hold the lru_lock.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d866b6e43434..4fe7b92c9330 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2348,15 +2348,18 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(head)))
-		list_add_tail(&page_tail->lru, &head->lru);
-	else if (list) {
+	if (list) {
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Split start from PageLRU(head), but lru bit maybe cleared
+		 * by isolate_lru_page, but head still in lru list, since we
+		 * held the lru_lock.
+		 */
+		SetPageLRU(page_tail);
+		list_add_tail(&page_tail->lru, &head->lru);
 	}
 }
 
-- 
1.8.3.1



* [PATCH v16 15/22] mm/compaction: do page isolation first in compaction
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (13 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-16 21:32     ` Alexander Duyck
  2020-07-11  0:58 ` [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock Alex Shi
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

Johannes Weiner has suggested:
"So here is a crazy idea that may be worth exploring:

Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
linked list.

Can we make PageLRU atomic and use it to stabilize the lru_lock
instead, and then use the lru_lock only serialize list operations?
..."

Yes, this patch does so in __isolate_lru_page(), which is the core page
isolation function in the compaction and shrinking paths.
With this patch, compaction only handles pages that have PageLRU set and
are not already isolated, and skips just-allocated pages that have no
LRU bit. The isolation also excludes the other isolations in memcg
move_account, page migration and THP split_huge_page.

As a side effect, PageLRU may be cleared during the shrink_inactive_list
path because of an isolation elsewhere. If so, we can skip that page.
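
The resulting order of operations (pin the page, then claim the LRU bit,
and only then take the lock) can be sketched in userspace C. The types
and helpers below are hypothetical stand-ins written with C11 atomics,
not the kernel's get_page_unless_zero() or TestClearPageLRU().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_LRU (1UL << 0)	/* hypothetical bit position */

struct fake_page {
	atomic_int refcount;
	atomic_ulong flags;
};

/* Take a reference unless the count has already dropped to zero. */
static bool get_unless_zero(struct fake_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Atomically clear FLAG_LRU; true only for the caller that cleared it. */
static bool test_clear_lru(struct fake_page *p)
{
	return (atomic_fetch_and(&p->flags, ~FLAG_LRU) & FLAG_LRU) != 0;
}

/* Returns true if the caller now owns the isolation of the page. */
static bool try_isolate(struct fake_page *p)
{
	if (!get_unless_zero(p))
		return false;		/* page is being freed elsewhere */
	if (!test_clear_lru(p)) {
		atomic_fetch_sub(&p->refcount, 1);	/* lost the race */
		return false;
	}
	/* The real code would now take the lru lock and unlink the page. */
	return true;
}

/* Bit 0: first attempt won; bit 1: second won; bit 2: freed page won. */
static int demo_isolation_order(void)
{
	struct fake_page live, dying;
	int r = 0;

	atomic_init(&live.refcount, 1);
	atomic_init(&live.flags, FLAG_LRU);
	atomic_init(&dying.refcount, 0);
	atomic_init(&dying.flags, FLAG_LRU);

	r |= try_isolate(&live) ? 1 : 0;
	r |= try_isolate(&live) ? 2 : 0;
	r |= try_isolate(&dying) ? 4 : 0;
	return r;
}
```

Only the first attempt on the live page succeeds; the second finds the
bit already cleared, and the page with a zero refcount is never touched.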

Hugh Dickins <hughd@google.com> fixed following bugs in this patch's
early version:

Fix lots of crashes under compaction load: isolate_migratepages_block()
must clean up appropriately when rejecting a page, setting PageLRU again
if it had been cleared; and a put_page() after get_page_unless_zero()
cannot safely be done while holding locked_lruvec - it may turn out to
be the final put_page(), which will take an lruvec lock when PageLRU.
And move __isolate_lru_page_prepare back after get_page_unless_zero to
make trylock_page() safe:
trylock_page() is not safe to use at this time: its setting PG_locked
can race with the page being freed or allocated ("Bad page"), and can
also erase flags being set by one of those "sole owners" of a freshly
allocated page who use non-atomic __SetPageFlag().

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 +-
 mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 38 ++++++++++++++++++++++----------------
 3 files changed, 56 insertions(+), 26 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2c29399b29a0..6d23d3beeff7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
+extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index f14780fc296a..2da2933fe56b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -869,6 +869,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
 			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
 				low_pfn = end_pfn;
+				page = NULL;
 				goto isolate_abort;
 			}
 			valid_page = page;
@@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
 			goto isolate_fail;
 
+		/*
+		 * Be careful not to clear PageLRU until after we're
+		 * sure the page is not being freed elsewhere -- the
+		 * page release code relies on it.
+		 */
+		if (unlikely(!get_page_unless_zero(page)))
+			goto isolate_fail;
+
+		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
+			goto isolate_fail_put;
+
+		/* Try isolate the page */
+		if (!TestClearPageLRU(page))
+			goto isolate_fail_put;
+
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
 			locked = compact_lock_irqsave(&pgdat->lru_lock,
@@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 					goto isolate_abort;
 			}
 
-			/* Recheck PageLRU and PageCompound under lock */
-			if (!PageLRU(page))
-				goto isolate_fail;
-
 			/*
 			 * Page become compound since the non-locked check,
 			 * and it's on LRU. It can only be a THP so the order
@@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
 				low_pfn += compound_nr(page) - 1;
-				goto isolate_fail;
+				SetPageLRU(page);
+				goto isolate_fail_put;
 			}
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-		/* Try isolate the page */
-		if (__isolate_lru_page(page, isolate_mode) != 0)
-			goto isolate_fail;
-
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
 			low_pfn += compound_nr(page) - 1;
@@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		}
 
 		continue;
+
+isolate_fail_put:
+		/* Avoid potential deadlock in freeing page under lru_lock */
+		if (locked) {
+			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			locked = false;
+		}
+		put_page(page);
+
 isolate_fail:
 		if (!skip_on_failure)
 			continue;
@@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	if (unlikely(low_pfn > end_pfn))
 		low_pfn = end_pfn;
 
+	page = NULL;
+
 isolate_abort:
 	if (locked)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (page) {
+		SetPageLRU(page);
+		put_page(page);
+	}
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 18986fefd49b..f77748adc340 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1544,7 +1544,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, isolate_mode_t mode)
+int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
 {
 	int ret = -EINVAL;
 
@@ -1598,20 +1598,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
 		return ret;
 
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * Be careful not to clear PageLRU until after we're
-		 * sure the page is not being freed elsewhere -- the
-		 * page release code relies on it.
-		 */
-		ClearPageLRU(page);
-		ret = 0;
-	}
-
-	return ret;
+	return 0;
 }
 
-
 /*
  * Update LRU sizes after isolating pages. The LRU size updates must
  * be complete before mem_cgroup_update_lru_size due to a sanity check.
@@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		 * only when the page is being freed somewhere else.
 		 */
 		scan += nr_pages;
-		switch (__isolate_lru_page(page, mode)) {
+		switch (__isolate_lru_page_prepare(page, mode)) {
 		case 0:
+			/*
+			 * Be careful not to clear PageLRU until after we're
+			 * sure the page is not being freed elsewhere -- the
+			 * page release code relies on it.
+			 */
+			if (unlikely(!get_page_unless_zero(page)))
+				goto busy;
+
+			if (!TestClearPageLRU(page)) {
+				/*
+				 * This page may in other isolation path,
+				 * but we still hold lru_lock.
+				 */
+				put_page(page);
+				goto busy;
+			}
+
 			nr_taken += nr_pages;
 			nr_zone_taken[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
 			break;
-
+busy:
 		case -EBUSY:
 			/* else it is being freed elsewhere */
 			list_move(&page->lru, src);
-			continue;
+			break;
 
 		default:
 			BUG();
-- 
1.8.3.1



* [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (14 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 15/22] mm/compaction: do page isolation first in compaction Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-17 20:30     ` Alexander Duyck
  2020-07-11  0:58 ` [PATCH v16 17/22] mm/swap: serialize memcg changes during pagevec_lru_move_fn Alex Shi
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

This patch reorders the isolation steps during munlock, moves the lru_lock
so that it guards each page individually, and unfolds the
__munlock_isolate_lru_page() function, in preparation for the lru_lock
change.

__split_huge_page_refcount() no longer exists, but we still have to guard
PageMlocked and PageLRU for tail pages in __split_huge_page_tail().
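
The reordered sequence can be sketched in userspace as below. This is only
a model under stated assumptions, not kernel code: struct model_page, the
model_* helpers and the atomic_flag spinlock are hypothetical stand-ins for
struct page, get_page()/put_page(), TestClearPageLRU() and lru_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: these names mimic, but are not, the kernel API. */
enum { PG_LRU = 1u << 0, PG_MLOCKED = 1u << 1 };

struct model_page {
	atomic_uint flags;
	atomic_int refcount;
};

/* Stand-in for the lru_lock: a minimal test-and-set spinlock. */
static atomic_flag model_lru_lock = ATOMIC_FLAG_INIT;

static void model_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&model_lru_lock,
						 memory_order_acquire))
		;
}

static void model_unlock(void)
{
	atomic_flag_clear_explicit(&model_lru_lock, memory_order_release);
}

void model_page_init(struct model_page *page, unsigned int flags)
{
	atomic_init(&page->flags, flags);
	atomic_init(&page->refcount, 1);	/* the caller's own pin */
}

/* Atomically clear @bit and report whether it was set before. */
static bool test_clear_flag(struct model_page *page, unsigned int bit)
{
	return atomic_fetch_and(&page->flags, ~bit) & bit;
}

/*
 * Returns true when the page was isolated, i.e. both bits were set.
 * The order follows the patched munlock_vma_page(): pin the page,
 * claim PG_LRU, and only then take the lock and test PG_MLOCKED.
 */
bool model_munlock_page(struct model_page *page)
{
	bool clearlru, isolated = false;

	assert(page != NULL);
	atomic_fetch_add(&page->refcount, 1);		/* get_page() */
	clearlru = test_clear_flag(page, PG_LRU);	/* before the lock */

	model_lock();
	if (!test_clear_flag(page, PG_MLOCKED)) {
		/* Not mlocked: give the LRU bit back if we took it. */
		if (clearlru)
			atomic_fetch_or(&page->flags, PG_LRU);
	} else if (clearlru) {
		isolated = true;	/* list removal would happen here */
	}
	model_unlock();

	/* Drop the pin outside the lock; an isolated page keeps it. */
	if (!isolated)
		atomic_fetch_sub(&page->refcount, 1);
	return isolated;
}
```

In this model a page that is mlocked but not on the LRU loses its mlocked
bit and is reported as not isolated, which corresponds to the
__munlock_isolation_failed() branch in the diff below.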

[lkp@intel.com: found a sleeping function bug ... at mm/rmap.c]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 93 ++++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 228ba5a8e0a5..0bdde88b4438 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -103,25 +103,6 @@ void mlock_vma_page(struct page *page)
 }
 
 /*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
-{
-	if (TestClearPageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (getpage)
-			get_page(page);
-		del_page_from_lru_list(page, lruvec, page_lru(page));
-		return true;
-	}
-
-	return false;
-}
-
-/*
  * Finish munlock after successful page isolation
  *
  * Page must be locked. This is a wrapper for try_to_munlock()
@@ -181,6 +162,7 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
+	bool clearlru = false;
 	pg_data_t *pgdat = page_pgdat(page);
 
 	/* For try_to_munlock() and to serialize with page migration */
@@ -189,32 +171,42 @@ unsigned int munlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	/*
-	 * Serialize with any parallel __split_huge_page_refcount() which
+	 * Serialize split tail pages in __split_huge_page_tail() which
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
+	get_page(page);
+	clearlru = TestClearPageLRU(page);
 	spin_lock_irq(&pgdat->lru_lock);
 
 	if (!TestClearPageMlocked(page)) {
-		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
+		if (clearlru)
+			SetPageLRU(page);
+		/*
+		 * Potentially, PTE-mapped THP: do not skip the rest PTEs
+		 * Reuse lock as memory barrier for release_pages racing.
+		 */
+		spin_unlock_irq(&pgdat->lru_lock);
+		put_page(page);
+		return 0;
 	}
 
 	nr_pages = hpage_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, true)) {
+	if (clearlru) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		del_page_from_lru_list(page, lruvec, page_lru(page));
 		spin_unlock_irq(&pgdat->lru_lock);
 		__munlock_isolated_page(page);
-		goto out;
+	} else {
+		spin_unlock_irq(&pgdat->lru_lock);
+		put_page(page);
+		__munlock_isolation_failed(page);
 	}
-	__munlock_isolation_failed(page);
-
-unlock_out:
-	spin_unlock_irq(&pgdat->lru_lock);
 
-out:
 	return nr_pages - 1;
 }
 
@@ -297,34 +289,51 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
+		struct lruvec *lruvec;
+		bool clearlru;
 
-		if (TestClearPageMlocked(page)) {
-			/*
-			 * We already have pin from follow_page_mask()
-			 * so we can spare the get_page() here.
-			 */
-			if (__munlock_isolate_lru_page(page, false))
-				continue;
-			else
-				__munlock_isolation_failed(page);
-		} else {
+		clearlru = TestClearPageLRU(page);
+		spin_lock_irq(&zone->zone_pgdat->lru_lock);
+
+		if (!TestClearPageMlocked(page)) {
 			delta_munlocked++;
+			if (clearlru)
+				SetPageLRU(page);
+			goto putback;
+		}
+
+		if (!clearlru) {
+			__munlock_isolation_failed(page);
+			goto putback;
 		}
 
 		/*
+		 * Isolate this page.
+		 * We already have pin from follow_page_mask()
+		 * so we can spare the get_page() here.
+		 */
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+		continue;
+
+		/*
 		 * We won't be munlocking this page in the next phase
 		 * but we still need to release the follow_page_mask()
 		 * pin. We cannot do it under lru_lock however. If it's
 		 * the last pin, __page_cache_release() would deadlock.
 		 */
+putback:
+		spin_unlock_irq(&zone->zone_pgdat->lru_lock);
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
+	/* tempary disable irq, will remove later */
+	local_irq_disable();
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	local_irq_enable();
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
-- 
1.8.3.1



* [PATCH v16 17/22] mm/swap: serialize memcg changes during pagevec_lru_move_fn
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (15 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  0:58 ` [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

Hugh Dickins found a memcg change bug in the original version:
If we want to change the pgdat->lru_lock to the memcg's lruvec lock, we have
to serialize mem_cgroup_move_account() during pagevec_lru_move_fn(). The
possible bad scenario would look like:

	cpu 0					cpu 1
lruvec = mem_cgroup_page_lruvec()
					if (!isolate_lru_page())
						mem_cgroup_move_account

spin_lock_irqsave(&lruvec->lru_lock <== wrong lock.

So we need to clear PageLRU (TestClearPageLRU) to block isolate_lru_page(),
which in turn serializes the memcg change here.
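
The idea can be sketched in userspace as below. This is a model under
stated assumptions, not the kernel implementation: struct model_page, its
memcg_id field and the model_* functions are hypothetical, and an
atomic_bool stands in for the PageLRU bit. Whoever wins the atomic
test-and-clear owns the page until it sets the bit again, so the other
path has to back off.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: not the kernel's struct page. */
struct model_page {
	atomic_bool lru;	/* stands in for the PageLRU bit */
	int memcg_id;		/* stands in for page->mem_cgroup */
};

void model_page_init(struct model_page *page, int memcg_id)
{
	atomic_init(&page->lru, true);
	page->memcg_id = memcg_id;
}

/* Like TestClearPageLRU(): clear the bit, return its old value. */
bool model_test_clear_lru(struct model_page *page)
{
	return atomic_exchange(&page->lru, false);
}

/*
 * The pagevec_lru_move_fn() side: the memcg can only be trusted
 * while we own the LRU bit.
 */
bool model_move_on_lru(struct model_page *page, int *seen_memcg)
{
	assert(page != NULL && seen_memcg != NULL);
	if (!model_test_clear_lru(page))
		return false;		/* isolated elsewhere: skip it */
	*seen_memcg = page->memcg_id;	/* stable while the bit is clear */
	atomic_store(&page->lru, true);	/* SetPageLRU() */
	return true;
}

/* The isolate_lru_page() + mem_cgroup_move_account() side. */
bool model_move_account(struct model_page *page, int new_memcg)
{
	assert(page != NULL);
	if (!model_test_clear_lru(page))
		return false;		/* the mover owns the page */
	page->memcg_id = new_memcg;
	atomic_store(&page->lru, true);	/* put the page back */
	return true;
}
```

Neither function can observe or change memcg_id while the other one holds
the cleared bit, which is the serialization the scenario above lacks.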

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index 5092fe9c8c47..8488b9b25730 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -221,8 +221,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(&pgdat->lru_lock, flags);
 		}
 
+		/* block memcg migration during page moving between lru */
+		if (!TestClearPageLRU(page))
+			continue;
+
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec);
+
+		SetPageLRU(page);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -976,7 +982,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
+	int i;
+	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec;
+	unsigned long flags = 0;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct pglist_data *pagepgdat = page_pgdat(page);
+
+		if (pagepgdat != pgdat) {
+			if (pgdat)
+				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			pgdat = pagepgdat;
+			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		__pagevec_lru_add_fn(page, lruvec);
+	}
+	if (pgdat)
+		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	release_pages(pvec->pages, pvec->nr);
+	pagevec_reinit(pvec);
 }
 
 /**
-- 
1.8.3.1



* [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (16 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 17/22] mm/swap: serialize memcg changes during pagevec_lru_move_fn Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-17 21:38     ` Alexander Duyck
  2020-07-11  0:58 ` [PATCH v16 19/22] mm/lru: introduce the relock_page_lruvec function Alex Shi
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Michal Hocko, Vladimir Davydov

This patch moves the per-node lru_lock into the lruvec, which gives each
memcg its own lru_lock per node. On a large machine, the memcgs then no
longer have to suffer from contention on the per-node pgdat->lru_lock; each
can proceed with its own lru_lock.

After moving the memcg charge before the lru insertion, page isolation can
serialize the page's memcg, so the per-memcg lruvec lock is stable and can
replace the per-node lru lock.
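
The locking pattern used by the batch walkers can be sketched in userspace
as below. This is a model under stated assumptions, not kernel code:
struct model_lruvec, struct model_page and the model_* helpers are
hypothetical, and an atomic_flag spinlock stands in for
lruvec->lru_lock. The lock is dropped and retaken only when the next page
belongs to a different lruvec.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace model only: each lruvec carries its own lock. */
struct model_lruvec {
	atomic_flag lock;	/* stands in for lruvec->lru_lock */
	int nr_locked;		/* how many times the lock was taken */
	int nr_pages;		/* pages handled under the lock */
};

struct model_page {
	struct model_lruvec *lruvec;
};

void model_lruvec_init(struct model_lruvec *lruvec)
{
	atomic_flag_clear(&lruvec->lock);
	lruvec->nr_locked = 0;
	lruvec->nr_pages = 0;
}

static void model_lock_lruvec(struct model_lruvec *lruvec)
{
	while (atomic_flag_test_and_set_explicit(&lruvec->lock,
						 memory_order_acquire))
		;
	lruvec->nr_locked++;
}

static void model_unlock_lruvec(struct model_lruvec *lruvec)
{
	atomic_flag_clear_explicit(&lruvec->lock, memory_order_release);
}

/* Walk @n pages, switching locks only when the lruvec changes. */
void model_walk_pages(struct model_page *pages, size_t n)
{
	struct model_lruvec *locked = NULL;
	size_t i;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++) {
		struct model_lruvec *lruvec = pages[i].lruvec;

		if (lruvec != locked) {
			if (locked)
				model_unlock_lruvec(locked);
			model_lock_lruvec(lruvec);
			locked = lruvec;
		}
		locked->nr_pages++;	/* per-page work under the lock */
	}
	if (locked)
		model_unlock_lruvec(locked);
}
```

Pages of different memcgs never wait on each other's lock in this model,
and a run of pages from one lruvec costs a single lock acquisition.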

Following Daniel Jordan's suggestion, I ran 208 'dd' instances in 104
containers on a 2s * 26cores * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this and later patches, the readtwice performance increases by about
80% with concurrent containers.

Also add a debug function to the locking path, which may give some clues
if something gets out of hand.

Hugh Dickins helped polish the patch, thanks!

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h |  98 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h     |   2 +
 mm/compaction.c            |  67 +++++++++++++++++++-----------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  63 +++++++++++++++++++++++++++-
 mm/mlock.c                 |  32 +++++++--------
 mm/mmzone.c                |   1 +
 mm/swap.c                  | 100 +++++++++++++++++++++------------------------
 mm/vmscan.c                |  70 +++++++++++++++++--------------
 9 files changed, 310 insertions(+), 134 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e77197a62809..6e670f991b42 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -411,6 +411,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -892,6 +905,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1126,6 +1164,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1255,6 +1297,62 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+		struct lruvec *locked_lruvec)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+	bool locked;
+
+	rcu_read_lock();
+	locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
+	rcu_read_unlock();
+
+	if (locked)
+		return locked_lruvec;
+
+	if (locked_lruvec)
+		unlock_page_lruvec_irq(locked_lruvec);
+
+	return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+		struct lruvec *locked_lruvec, unsigned long *flags)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+	bool locked;
+
+	rcu_read_lock();
+	locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
+	rcu_read_unlock();
+
+	if (locked)
+		return locked_lruvec;
+
+	if (locked_lruvec)
+		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+
+	return lock_page_lruvec_irqsave(page, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 14c668b7e793..36c1680efd90 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -261,6 +261,8 @@ struct lruvec {
 	atomic_long_t			nonresident_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
 #ifdef CONFIG_MEMCG
diff --git a/mm/compaction.c b/mm/compaction.c
index 2da2933fe56b..88bbd2e93895 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked_lruvec = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked_lruvec) {
+				unlock_page_lruvec_irqrestore(locked_lruvec,
+									flags);
+				locked_lruvec = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -922,10 +932,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
-				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+				if (locked_lruvec) {
+					unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+					locked_lruvec = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked_lruvec) {
+			if (locked_lruvec)
+				unlock_page_lruvec_irqrestore(locked_lruvec,
+									flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked_lruvec = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -988,9 +1007,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1023,9 +1041,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
-		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+		if (locked_lruvec) {
+			unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+			locked_lruvec = NULL;
 		}
 		put_page(page);
 
@@ -1039,9 +1057,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * page anyway.
 		 */
 		if (nr_isolated) {
-			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+			if (locked_lruvec) {
+				unlock_page_lruvec_irqrestore(locked_lruvec,
+									flags);
+				locked_lruvec = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1068,8 +1087,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	page = NULL;
 
 isolate_abort:
-	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (locked_lruvec)
+		unlock_page_lruvec_irqrestore(locked_lruvec, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4fe7b92c9330..1ff0c1ff6a52 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2346,7 +2346,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2429,7 +2429,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 			      pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2446,10 +2445,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2470,7 +2467,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fde47272b13c..d5e56be42f21 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1196,6 +1196,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1215,7 +1228,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 		goto out;
 	}
 
-	memcg = page->mem_cgroup;
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	memcg = READ_ONCE(page->mem_cgroup);
 	/*
 	 * Swapcache readahead pages are added to the LRU - and
 	 * possibly migrated - before they are charged.
@@ -1236,6 +1250,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
@@ -2999,7 +3058,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * lruvec->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 0bdde88b4438..cb23a0c2cfbf 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -163,7 +163,7 @@ unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
 	bool clearlru = false;
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -177,7 +177,7 @@ unsigned int munlock_vma_page(struct page *page)
 	 */
 	get_page(page);
 	clearlru = TestClearPageLRU(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 
 	if (!TestClearPageMlocked(page)) {
 		if (clearlru)
@@ -186,7 +186,7 @@ unsigned int munlock_vma_page(struct page *page)
 		 * Potentially, PTE-mapped THP: do not skip the rest PTEs
 		 * Reuse lock as memory barrier for release_pages racing.
 		 */
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		put_page(page);
 		return 0;
 	}
@@ -195,14 +195,11 @@ unsigned int munlock_vma_page(struct page *page)
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
 	if (clearlru) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		__munlock_isolated_page(page);
 	} else {
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		put_page(page);
 		__munlock_isolation_failed(page);
 	}
@@ -284,6 +281,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
@@ -291,11 +289,17 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	/* Phase 1: page isolation */
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *lruvec;
+		struct lruvec *new_lruvec;
 		bool clearlru;
 
 		clearlru = TestClearPageLRU(page);
-		spin_lock_irq(&zone->zone_pgdat->lru_lock);
+
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (new_lruvec != lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
+		}
 
 		if (!TestClearPageMlocked(page)) {
 			delta_munlocked++;
@@ -314,9 +318,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 * We already have pin from follow_page_mask()
 		 * so we can spare the get_page() here.
 		 */
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&zone->zone_pgdat->lru_lock);
 		continue;
 
 		/*
@@ -326,14 +328,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 * the last pin, __page_cache_release() would deadlock.
 		 */
 putback:
-		spin_unlock_irq(&zone->zone_pgdat->lru_lock);
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	/* tempary disable irq, will remove later */
-	local_irq_disable();
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	local_irq_enable();
+	if (lruvec)
+		unlock_page_lruvec_irq(lruvec);
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/swap.c b/mm/swap.c
index 8488b9b25730..129c532357a4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,15 +79,13 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
 		__ClearPageLRU(page);
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -206,32 +204,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -365,11 +360,12 @@ static inline void activate_page_drain(int cpu)
 void activate_page(struct page *page)
 {
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
+	__activate_page(page, lruvec);
+	unlock_page_lruvec_irq(lruvec);
 }
 #endif
 
@@ -819,8 +815,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long uninitialized_var(flags);
 	unsigned int uninitialized_var(lock_batch);
 
@@ -830,21 +825,20 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		if (is_huge_zero_page(page))
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -863,28 +857,28 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
 			__ClearPageLRU(page);
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
@@ -894,8 +888,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -983,26 +977,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f77748adc340..168c1659e430 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1774,15 +1774,13 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		int lru = page_lru(page);
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, lru);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
+	struct lruvec *orig_lruvec = lruvec;
 	enum lru_list lru;
 
 	while (!list_empty(list)) {
+		struct lruvec *new_lruvec = NULL;
+
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		 *                                        list_add(&page->lru,)
 		 *     list_add(&page->lru,) //corrupt
 		 */
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (new_lruvec != lruvec) {
+			if (lruvec)
+				spin_unlock_irq(&lruvec->lru_lock);
+			lruvec = lock_page_lruvec_irq(page);
+		}
 		SetPageLRU(page);
 
 		if (unlikely(put_page_testzero(page))) {
@@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		lru = page_lru(page);
 		nr_pages = hpage_nr_pages(page);
 
@@ -1902,6 +1907,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		if (PageActive(page))
 			workingset_age_nonresident(lruvec, nr_pages);
 	}
+	if (orig_lruvec != lruvec) {
+		if (lruvec)
+			spin_unlock_irq(&lruvec->lru_lock);
+		spin_lock_irq(&orig_lruvec->lru_lock);
+	}
 
 	/*
 	 * To save our caller's stack, now use input list for pages to free.
@@ -1957,7 +1967,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1969,7 +1979,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1977,7 +1987,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1986,7 +1996,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2039,7 +2049,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2049,7 +2059,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2095,7 +2105,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2106,7 +2116,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2696,10 +2706,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4275,24 +4285,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
 		pgscanned++;
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -4308,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		}
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1



* [PATCH v16 19/22] mm/lru: introduce the relock_page_lruvec function
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (17 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-17 22:03     ` Alexander Duyck
  2020-07-11  0:58 ` [PATCH v16 20/22] mm/vmscan: use relock for move_pages_to_lru Alex Shi
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Thomas Gleixner, Andrey Ryabinin

Use the new relock_page_lruvec functions to replace the repeated,
open-coded relock sequences. No functional change.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
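For readers following the hunks below, the relock pattern being factored out
can be modelled in userspace. This is only a sketch under stated assumptions:
the helper's definition is not among the hunks shown here, so its behaviour
is inferred from the open-coded sequence this patch removes. The struct
layouts, the `locked`/`lock_count` fields and `walk_pages()` are hypothetical
stand-ins, and the irq variants are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; not the real definitions. */
struct lruvec {
	int locked;      /* 1 while "held"; models lruvec->lru_lock */
	int lock_count;  /* how many times the lock was taken */
};

struct page {
	struct lruvec *lruvec;  /* models mem_cgroup_page_lruvec(page, pgdat) */
};

/* Take the lock of the lruvec this page belongs to. */
static struct lruvec *lock_page_lruvec(struct page *page)
{
	struct lruvec *lruvec = page->lruvec;

	assert(!lruvec->locked);
	lruvec->locked = 1;
	lruvec->lock_count++;
	return lruvec;
}

static void unlock_page_lruvec(struct lruvec *lruvec)
{
	assert(lruvec->locked);
	lruvec->locked = 0;
}

/* Keep the held lock if the page maps to it, otherwise swap locks. */
static struct lruvec *relock_page_lruvec(struct page *page,
					 struct lruvec *locked_lruvec)
{
	if (locked_lruvec == page->lruvec)
		return locked_lruvec;
	if (locked_lruvec)
		unlock_page_lruvec(locked_lruvec);
	return lock_page_lruvec(page);
}

/* Walk a batch the way the converted loops do; returns with no lock held. */
void walk_pages(struct page *pages, size_t nr)
{
	struct lruvec *lruvec = NULL;
	size_t i;

	for (i = 0; i < nr; i++)
		lruvec = relock_page_lruvec(&pages[i], lruvec);
	if (lruvec)
		unlock_page_lruvec(lruvec);
}
```

In this model a run of pages from the same lruvec costs one acquisition, and
the lock changes hands only when a page from a different lruvec shows up,
which is what each converted loop relied on before the conversion as well.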
 mm/mlock.c  |  9 +--------
 mm/swap.c   | 33 +++++++--------------------------
 mm/vmscan.c |  8 +-------
 3 files changed, 9 insertions(+), 41 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index cb23a0c2cfbf..4f40fc091cf9 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -289,17 +289,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	/* Phase 1: page isolation */
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 		bool clearlru;
 
 		clearlru = TestClearPageLRU(page);
-
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (new_lruvec != lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 
 		if (!TestClearPageMlocked(page)) {
 			delta_munlocked++;
diff --git a/mm/swap.c b/mm/swap.c
index 129c532357a4..9fb906fbaed5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,19 +209,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
-
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
@@ -866,17 +859,12 @@ void release_pages(struct page **pages, int nr)
 		}
 
 		if (PageLRU(page)) {
-			struct lruvec *new_lruvec;
-
-			new_lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
-			if (new_lruvec != lruvec) {
-				if (lruvec)
-					unlock_page_lruvec_irqrestore(lruvec,
-									flags);
+			struct lruvec *pre_lruvec = lruvec;
+
+			lruvec = relock_page_lruvec_irqsave(page, lruvec,
+									&flags);
+			if (pre_lruvec != lruvec)
 				lock_batch = 0;
-				lruvec = lock_page_lruvec_irqsave(page, &flags);
-			}
 
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -982,15 +970,8 @@ void __pagevec_lru_add(struct pagevec *pvec)
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
-
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
 
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
 	if (lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 168c1659e430..bdb53a678e7e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4292,15 +4292,9 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		pgscanned++;
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
-- 
1.8.3.1



* [PATCH v16 20/22] mm/vmscan: use relock for move_pages_to_lru
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (18 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 19/22] mm/lru: introduce the relock_page_lruvec function Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-17 21:44     ` Alexander Duyck
  2020-07-11  0:58 ` [PATCH v16 21/22] mm/pgdat: remove pgdat lru_lock Alex Shi
                   ` (5 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Andrey Ryabinin, Jann Horn

From: Hugh Dickins <hughd@google.com>

Use the relock function to replace the open-coded relocking in
move_pages_to_lru(), and try to save a few lock acquisitions.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
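The saving comes from not retaking the lock right after it had to be dropped:
the loop forgets the lock (lruvec = NULL) and lets the next relock take
whichever lock the next page needs. The userspace model below is a rough
sketch of that idea only. `move_pages()`, the `unevictable` field and the
`locked`/`lock_count` bookkeeping are hypothetical stand-ins, and it assumes
the caller's lock is restored after the loop, as the orig_lruvec check from
patch 18 does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, not the kernel definitions. */
struct lruvec {
	int locked;      /* models lruvec->lru_lock being held */
	int lock_count;  /* number of acquisitions, for comparison */
};

struct page {
	struct lruvec *lruvec;  /* models mem_cgroup_page_lruvec() */
	int unevictable;        /* models !page_evictable(page) */
};

static void lock_lruvec(struct lruvec *lruvec)
{
	assert(!lruvec->locked);
	lruvec->locked = 1;
	lruvec->lock_count++;
}

static void unlock_lruvec(struct lruvec *lruvec)
{
	assert(lruvec->locked);
	lruvec->locked = 0;
}

/* Keep the held lock if it already covers this page, else switch. */
static struct lruvec *relock_page_lruvec(struct page *page,
					 struct lruvec *locked)
{
	if (locked == page->lruvec)
		return locked;
	if (locked)
		unlock_lruvec(locked);
	lock_lruvec(page->lruvec);
	return page->lruvec;
}

/* Caller holds orig's lock on entry and expects it held on return. */
void move_pages(struct lruvec *orig, struct page *pages, size_t nr)
{
	struct lruvec *lruvec = orig;
	size_t i;

	for (i = 0; i < nr; i++) {
		struct page *page = &pages[i];

		if (page->unevictable) {
			/* Drop the lock and forget it; do not retake it here. */
			if (lruvec) {
				unlock_lruvec(lruvec);
				lruvec = NULL;
			}
			continue;  /* putback_lru_page() would run here */
		}

		lruvec = relock_page_lruvec(page, lruvec);
		/* ... the page would be added to lruvec's list here ... */
	}

	/* Hand the caller's lock back if the loop ended without it. */
	if (lruvec != orig) {
		if (lruvec)
			unlock_lruvec(lruvec);
		lock_lruvec(orig);
	}
}
```

With two unevictable pages followed by one ordinary page from the same
lruvec, this lazy form takes the lock once inside the loop, where retaking it
after each putback would take it twice.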
 mm/vmscan.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bdb53a678e7e..078a1640ec60 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1854,15 +1854,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	enum lru_list lru;
 
 	while (!list_empty(list)) {
-		struct lruvec *new_lruvec = NULL;
-
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&lruvec->lru_lock);
+			if (lruvec) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
+			}
 			putback_lru_page(page);
-			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1876,12 +1876,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		 *                                        list_add(&page->lru,)
 		 *     list_add(&page->lru,) //corrupt
 		 */
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (new_lruvec != lruvec) {
-			if (lruvec)
-				spin_unlock_irq(&lruvec->lru_lock);
-			lruvec = lock_page_lruvec_irq(page);
-		}
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 		SetPageLRU(page);
 
 		if (unlikely(put_page_testzero(page))) {
@@ -1890,8 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
 				destroy_compound_page(page);
-				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
-- 
1.8.3.1



* [PATCH v16 21/22] mm/pgdat: remove pgdat lru_lock
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (19 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 20/22] mm/vmscan: use relock for move_pages_to_lru Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-17 21:09     ` Alexander Duyck
  2020-07-11  0:58 ` [PATCH v16 22/22] mm/lru: revise the comments of lru_lock Alex Shi
                   ` (4 subsequent siblings)
  25 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

pgdat->lru_lock has been replaced by the lruvec lock and is no longer used,
so remove it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 include/linux/mmzone.h | 1 -
 mm/page_alloc.c        | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 36c1680efd90..8d7318ce5f62 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -735,7 +735,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87ce294..4d7df42b32d6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6721,7 +6721,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
-- 
1.8.3.1



* [PATCH v16 22/22] mm/lru: revise the comments of lru_lock
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (20 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 21/22] mm/pgdat: remove pgdat lru_lock Alex Shi
@ 2020-07-11  0:58 ` Alex Shi
  2020-07-11  1:02 ` [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  0:58 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Andrey Ryabinin, Jann Horn

From: Hugh Dickins <hughd@google.com>

Since pgdat->lru_lock has been replaced by lruvec->lru_lock, it is time to
fix the now-incorrect comments in the code. Also fix some stale
zone->lru_lock comments left over from long ago.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
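The revised memory.rst text says the PG_lru bit is cleared before a page is
isolated from its LRU under lruvec->lru_lock. A userspace model of that
ordering may help when reading the comment changes. It is a sketch only:
`isolate_page()`, `PG_LRU_BIT`, the field names and the -1 return value are
hypothetical, and C11 atomics stand in for the kernel's TestClearPageLRU().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LRU_BIT 0x1u  /* toy flag value, not the kernel's bit number */

/* Toy stand-ins, not the kernel definitions. */
struct lruvec {
	int locked;    /* models lruvec->lru_lock being held */
	int nr_pages;  /* pages currently on this lruvec's lists */
};

struct page {
	atomic_uint flags;
	struct lruvec *lruvec;
};

/* Atomically clear the LRU bit and report whether it was set. */
static bool test_clear_page_lru(struct page *page)
{
	unsigned int old = atomic_fetch_and(&page->flags, ~PG_LRU_BIT);

	return old & PG_LRU_BIT;
}

/* Returns 0 if this caller isolated the page, -1 if the bit was already clear. */
int isolate_page(struct page *page)
{
	struct lruvec *lruvec;

	/* Step 1: win the LRU bit; only one caller can see it set. */
	if (!test_clear_page_lru(page))
		return -1;

	/* Step 2: with the bit cleared, look up and lock the lruvec. */
	lruvec = page->lruvec;
	assert(!lruvec->locked);
	lruvec->locked = 1;
	lruvec->nr_pages--;  /* models del_page_from_lru_list() */
	lruvec->locked = 0;
	return 0;
}
```

A second attempt on the same page fails at step 1 without touching the lock,
which is why the bit works as a cheap pin ahead of the lruvec lookup.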
 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
 Documentation/admin-guide/cgroup-v1/memory.rst     | 21 +++++++++------------
 Documentation/trace/events-kmem.rst                |  2 +-
 Documentation/vm/unevictable-lru.rst               | 22 ++++++++--------------
 include/linux/mm_types.h                           |  2 +-
 include/linux/mmzone.h                             |  2 +-
 mm/filemap.c                                       |  4 ++--
 mm/memcontrol.c                                    |  2 +-
 mm/rmap.c                                          |  4 ++--
 mm/vmscan.c                                        | 12 ++++++++----
 10 files changed, 36 insertions(+), 50 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 ======
-        Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global pgdat->lru_lock).
-	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under pgdat->lru_lock.
-
-	A special function is mem_cgroup_isolate_pages(). This scans
-	memcg's private LRU and call __isolate_lru_page() to extract a page
-	from LRU.
-
-	(By __isolate_lru_page(), the page is removed from both of global and
-	private LRU.)
-
+	Each memcg has its own vector of LRUs (inactive anon, active anon,
+	inactive file, active file, unevictable) of pages from each node,
+	each LRU handled under a single lru_lock for that memcg and node.
 
 9. Typical Tests.
 =================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 12757e63b26c..24450696579f 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered.
 2.6 Locking
 -----------
 
-   lock_page_cgroup()/unlock_page_cgroup() should not be called under
-   the i_pages lock.
+Lock order is as follows:
 
-   Other lock order is following:
+  Page lock (PG_locked bit of page->flags)
+    mm->page_table_lock or split pte_lock
+      lock_page_memcg (memcg->move_lock)
+        mapping->i_pages lock
+          lruvec->lru_lock.
 
-   PG_locked.
-     mm->page_table_lock
-         pgdat->lru_lock
-	   lock_page_cgroup.
-
-  In many cases, just lock_page_cgroup() is called.
-
-  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  pgdat->lru_lock, it has no lock of its own.
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 
 4. Per-CPU Allocator Activity
 =============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux.  The problems have been observed at customer sites on large
 memory x86_64 systems.
 
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone.  When a large
+main memory will have over 32 million 4k pages in a single node.  When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable.  This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
 The Unevictable Page List
 -------------------------
 
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
 
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
 swap-backed pages.  This differentiation is only important while the pages are,
 in fact, evictable.
 
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
 
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock.  This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
 
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
 
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element).  The memory controller tracks the movement of pages to
 and from the unevictable list.
 
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
 active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked.  Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list.  If
+is now mlocked and divert the page to the node's unevictable list.  If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
 
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
      unevictable list in mlock_vma_page().
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 64ede5f150dc..44738cdb5a55 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock.  Sometimes used as a generic list
+			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8d7318ce5f62..dddeabd6ea8d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
diff --git a/mm/filemap.c b/mm/filemap.c
index f0ae9a6308cb..1b42aaae4d3e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -101,8 +101,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5e56be42f21..926d7d95dc1d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3057,7 +3057,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 /*
- * Because tail pages are not marked as "used", set it. We're under
+ * Because tail pages are not marked as "used", set it. Don't need
  * lruvec->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/rmap.c b/mm/rmap.c
index 5fe2dedce1fc..7f6e95680c47 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -28,12 +28,12 @@
  *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
  *                     i_pages lock (widely used)
+ *                       lruvec->lru_lock (in lock_page_lruvec_irq)
  *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                   sb_lock (within inode_lock in fs/fs-writeback.c)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 078a1640ec60..bb3ac52de058 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1620,14 +1620,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended.  Some of the functions that
+ * Isolate pages from the lruvec into @dst, scanning up to nr_to_scan pages.
+ *
+ * lruvec->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan:	The number of eligible pages to look through on the list.
  * @lruvec:	The LRU vector to pull pages from.
@@ -1826,14 +1828,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 
 /*
  * This moves pages from @list to corresponding LRU list.
+ * The pages on @list are off any lruvec; at the end, @list is reused as the
+ * pages_to_free list.
  *
  * We move them the other way if the page is referenced by one or more
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation.  But if
+ * appropriate to hold lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page.  It's impossible to balance
+ * should drop lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 00/22] per memcg lru_lock
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (21 preceding siblings ...)
  2020-07-11  0:58 ` [PATCH v16 22/22] mm/lru: revise the comments of lru_lock Alex Shi
@ 2020-07-11  1:02 ` Alex Shi
  2020-07-16  8:49 ` Alex Shi
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-11  1:02 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

Hi Hugh,

I believe I owe you a 'Tested-by' for the previous version.
Would you like to try the new version and give a Reviewed-by or Tested-by
if it's fine?

Thanks
Alex 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 00/22] per memcg lru_lock
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (22 preceding siblings ...)
  2020-07-11  1:02 ` [PATCH v16 00/22] per memcg lru_lock Alex Shi
@ 2020-07-16  8:49 ` Alex Shi
  2020-07-16 14:11   ` Alexander Duyck
  2020-07-20  7:30 ` Alex Shi
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-16  8:49 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

Hi All,

This version has been tested and passed Hugh Dickins's testing, as did v15 and v14.
Thanks, Hugh!

Would anyone like to share comments or concerns about the patches?


Thanks
Alex


On 2020/7/11 8:58 AM, Alex Shi wrote:
> This new version is based on v5.8-rc4. It adds 2 more patches:
> 'mm/thp: remove code path which never got into'
> 'mm/thp: add tail pages into lru anyway in split_huge_page()'
> and modifies 'mm/mlock: reorder isolation sequence during munlock'.
> 
> Currently there is one lru_lock per node, pgdat->lru_lock, guarding the
> lru lists, even though the lru lists were moved into memcg long ago. A
> per-node lru_lock is clearly unscalable: pages of every memcg have to
> compete with each other for the same lru_lock. This patchset replaces the
> per-node lru lock with a per-lruvec/memcg lru_lock to guard the lru lists,
> which makes it scalable across memcgs and yields a performance gain.
> 
> Currently lru_lock guards both the lru list and the page's lru bit, which
> is fine. But if we want to use a specific lruvec lock for a page, we need to
> pin down the page's lruvec/memcg while locking. Just taking the lruvec lock
> first may be undermined by the page's memcg charge/migration. To fix this,
> we pull the clearing of the page's lru bit out and use it as the pin-down
> action that blocks memcg changes. That is the reason for the new atomic
> function TestClearPageLRU. Isolating a page now needs both actions:
> TestClearPageLRU and holding the lru_lock.
> 
> The typical user of this is isolate_migratepages_block() in compaction.c:
> we have to take the lru bit before the lru lock, which serializes page
> isolation against memcg page charge/migration, since those change the
> page's lruvec and therefore the lru_lock that applies to it.
> 
> The above solution was suggested by Johannes Weiner and builds on his new
> memcg charge path. (Hugh Dickins tested and contributed much code, from
> the compaction fix to general code polish, thanks a lot!)
> 
> The patchset includes 3 parts:
> 1, some code cleanup and minor optimization as preparation.
> 2, use TestClearPageLRU as page isolation's precondition
> 3, replace per node lru_lock with per memcg per node lru_lock
> 
> Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
> containers on a 2s * 26cores * HT box with a modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> With this patchset, the readtwice performance increased by about 80%
> in concurrent containers.
> 
> Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up this
> idea 8 years ago, and to others who gave comments as well: Daniel Jordan,
> Mel Gorman, Shakeel Butt, Matthew Wilcox etc.
> 
> Thanks for testing support from Intel 0day and Rong Chen, Fengguang Wu,
> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
> 
> Alex Shi (20):
>   mm/vmscan: remove unnecessary lruvec adding
>   mm/page_idle: no unlikely double check for idle page counting
>   mm/compaction: correct the comments of compact_defer_shift
>   mm/compaction: rename compact_deferred as compact_should_defer
>   mm/thp: move lru_add_page_tail func to huge_memory.c
>   mm/thp: clean up lru_add_page_tail
>   mm/thp: remove code path which never got into
>   mm/thp: narrow lru locking
>   mm/memcg: add debug checking in lock_page_memcg
>   mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
>   mm/lru: move lru_lock holding in func lru_note_cost_page
>   mm/lru: move lock into lru_note_cost
>   mm/lru: introduce TestClearPageLRU
>   mm/thp: add tail pages into lru anyway in split_huge_page()
>   mm/compaction: do page isolation first in compaction
>   mm/mlock: reorder isolation sequence during munlock
>   mm/swap: serialize memcg changes during pagevec_lru_move_fn
>   mm/lru: replace pgdat lru_lock with lruvec lock
>   mm/lru: introduce the relock_page_lruvec function
>   mm/pgdat: remove pgdat lru_lock
> 
> Hugh Dickins (2):
>   mm/vmscan: use relock for move_pages_to_lru
>   mm/lru: revise the comments of lru_lock
> 
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
>  Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
>  Documentation/trace/events-kmem.rst                |   2 +-
>  Documentation/vm/unevictable-lru.rst               |  22 +--
>  include/linux/compaction.h                         |   4 +-
>  include/linux/memcontrol.h                         |  98 +++++++++++
>  include/linux/mm_types.h                           |   2 +-
>  include/linux/mmzone.h                             |   6 +-
>  include/linux/page-flags.h                         |   1 +
>  include/linux/swap.h                               |   4 +-
>  include/trace/events/compaction.h                  |   2 +-
>  mm/compaction.c                                    | 113 ++++++++----
>  mm/filemap.c                                       |   4 +-
>  mm/huge_memory.c                                   |  47 +++--
>  mm/memcontrol.c                                    |  71 +++++++-
>  mm/memory.c                                        |   3 -
>  mm/mlock.c                                         |  93 +++++-----
>  mm/mmzone.c                                        |   1 +
>  mm/page_alloc.c                                    |   1 -
>  mm/page_idle.c                                     |   8 -
>  mm/rmap.c                                          |   4 +-
>  mm/swap.c                                          | 189 ++++++++-------------
>  mm/swap_state.c                                    |   2 -
>  mm/vmscan.c                                        | 174 ++++++++++---------
>  mm/workingset.c                                    |   2 -
>  25 files changed, 524 insertions(+), 365 deletions(-)
> 
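
The locking idea in the cover letter above can be pictured with a small
userspace sketch. It is only an illustration under stated assumptions: all
names (lruvec_sketch, page_sketch, relock_sketch, walk_pages_sketch) are
hypothetical, a C11 atomic_flag spinlock stands in for the kernel spinlock,
and it is not the code of the 'relock_page_lruvec' patch listed above. It
shows one lock per lruvec, plus a helper that switches locks only when the
next page belongs to a different lruvec.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-memcg, per-node lruvec. */
struct lruvec_sketch {
	atomic_flag lru_lock;   /* toy spinlock, one per lruvec */
	int nr_lock_ops;        /* acquisition counter, demo only */
};

/* Hypothetical stand-in for a page that knows its lruvec. */
struct page_sketch {
	struct lruvec_sketch *lruvec;
};

static void lock_sketch(struct lruvec_sketch *lv)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&lv->lru_lock,
						 memory_order_acquire))
		;
	lv->nr_lock_ops++;      /* updated while holding the lock */
}

static void unlock_sketch(struct lruvec_sketch *lv)
{
	atomic_flag_clear_explicit(&lv->lru_lock, memory_order_release);
}

/*
 * Keep the currently held lock if the page uses the same lruvec,
 * otherwise drop it and take the page's lruvec lock instead.
 */
static struct lruvec_sketch *relock_sketch(struct page_sketch *page,
					   struct lruvec_sketch *locked)
{
	assert(page != NULL && page->lruvec != NULL);
	if (locked == page->lruvec)
		return locked;
	if (locked)
		unlock_sketch(locked);
	lock_sketch(page->lruvec);
	return page->lruvec;
}

/* Walk a batch of pages, holding whichever lruvec lock applies. */
static void walk_pages_sketch(struct page_sketch *pages, size_t n)
{
	struct lruvec_sketch *locked = NULL;

	for (size_t i = 0; i < n; i++) {
		locked = relock_sketch(&pages[i], locked);
		/* ... operate on pages[i] under locked->lru_lock ... */
	}
	if (locked)
		unlock_sketch(locked);
}
```

With a per-node lock, every page in the batch would contend on the same
lock regardless of memcg; in the sketch, pages of different lruvecs touch
different locks, and runs of pages from one lruvec reuse the held lock.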

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 05/22] mm/thp: move lru_add_page_tail func to huge_memory.c
  2020-07-11  0:58 ` [PATCH v16 05/22] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
@ 2020-07-16  8:59   ` Alex Shi
  2020-07-16 13:17     ` Kirill A. Shutemov
  0 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-16  8:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill

Hi Kirill & Matthew,

Are there any concerns about the THP-related patches?

Thanks
Alex


On 2020/7/11 8:58 AM, Alex Shi wrote:
> The function is only used in huge_memory.c; defining it in another file
> under a CONFIG_TRANSPARENT_HUGEPAGE guard just looks odd.
> 
> Let's move it to the THP code, and make it static as Hugh Dickins suggested.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/swap.h |  2 --
>  mm/huge_memory.c     | 30 ++++++++++++++++++++++++++++++
>  mm/swap.c            | 33 ---------------------------------
>  3 files changed, 30 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 5b3216ba39a9..2c29399b29a0 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -339,8 +339,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
>  			  unsigned int nr_pages);
>  extern void lru_note_cost_page(struct page *);
>  extern void lru_cache_add(struct page *);
> -extern void lru_add_page_tail(struct page *page, struct page *page_tail,
> -			 struct lruvec *lruvec, struct list_head *head);
>  extern void activate_page(struct page *);
>  extern void mark_page_accessed(struct page *);
>  extern void lru_add_drain(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 78c84bee7e29..9e050b13f597 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2340,6 +2340,36 @@ static void remap_page(struct page *page)
>  	}
>  }
>  
> +static void lru_add_page_tail(struct page *page, struct page *page_tail,
> +				struct lruvec *lruvec, struct list_head *list)
> +{
> +	VM_BUG_ON_PAGE(!PageHead(page), page);
> +	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
> +	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
> +	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
> +
> +	if (!list)
> +		SetPageLRU(page_tail);
> +
> +	if (likely(PageLRU(page)))
> +		list_add_tail(&page_tail->lru, &page->lru);
> +	else if (list) {
> +		/* page reclaim is reclaiming a huge page */
> +		get_page(page_tail);
> +		list_add_tail(&page_tail->lru, list);
> +	} else {
> +		/*
> +		 * Head page has not yet been counted, as an hpage,
> +		 * so we must account for each subpage individually.
> +		 *
> +		 * Put page_tail on the list at the correct position
> +		 * so they all end up in order.
> +		 */
> +		add_page_to_lru_list_tail(page_tail, lruvec,
> +					  page_lru(page_tail));
> +	}
> +}
> +
>  static void __split_huge_page_tail(struct page *head, int tail,
>  		struct lruvec *lruvec, struct list_head *list)
>  {
> diff --git a/mm/swap.c b/mm/swap.c
> index a82efc33411f..7701d855873d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -933,39 +933,6 @@ void __pagevec_release(struct pagevec *pvec)
>  }
>  EXPORT_SYMBOL(__pagevec_release);
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -/* used by __split_huge_page_refcount() */
> -void lru_add_page_tail(struct page *page, struct page *page_tail,
> -		       struct lruvec *lruvec, struct list_head *list)
> -{
> -	VM_BUG_ON_PAGE(!PageHead(page), page);
> -	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
> -	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
> -	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
> -
> -	if (!list)
> -		SetPageLRU(page_tail);
> -
> -	if (likely(PageLRU(page)))
> -		list_add_tail(&page_tail->lru, &page->lru);
> -	else if (list) {
> -		/* page reclaim is reclaiming a huge page */
> -		get_page(page_tail);
> -		list_add_tail(&page_tail->lru, list);
> -	} else {
> -		/*
> -		 * Head page has not yet been counted, as an hpage,
> -		 * so we must account for each subpage individually.
> -		 *
> -		 * Put page_tail on the list at the correct position
> -		 * so they all end up in order.
> -		 */
> -		add_page_to_lru_list_tail(page_tail, lruvec,
> -					  page_lru(page_tail));
> -	}
> -}
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> -
>  static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>  				 void *arg)
>  {
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU
  2020-07-11  0:58 ` [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU Alex Shi
@ 2020-07-16  9:06   ` Alex Shi
  2020-07-16 21:12     ` Alexander Duyck
  1 sibling, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-16  9:06 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Michal Hocko, Vladimir Davydov

Hi Johannes,

The patchset looks good in terms of both logic and testing. Do you have any
concerns about any of the patches?

Thanks
Alex

On 2020/7/11 8:58 AM, Alex Shi wrote:
> Combine the PageLRU check and ClearPageLRU into one function, the newly
> introduced TestClearPageLRU. This function will be used as the page
> isolation precondition, to prevent concurrent isolation elsewhere.
> As a result there may be non-PageLRU pages on an lru list, so the
> corresponding BUG checks need to be removed.
> 
> Hugh Dickins pointed out that __page_cache_release and release_pages
> have no need for an atomic bit clear, since nobody else is using the page
> at that moment, and that there is no need to get_page() before clearing
> the lru bit in isolate_lru_page, since it '(1) Must be called with an
> elevated refcount on the page'.
> 
> As Andrew Morton mentioned, this change dirties the cacheline of a page
> that isn't on the LRU. But the loss appears acceptable according to
> Rong Chen's <rong.a.chen@intel.com> report:
> https://lkml.org/lkml/2020/3/4/173
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/page-flags.h |  1 +
>  mm/mlock.c                 |  3 +--
>  mm/swap.c                  |  6 ++----
>  mm/vmscan.c                | 26 +++++++++++---------------
>  4 files changed, 15 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6be1aa559b1e..9554ed1387dc 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size)
>  PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
>  	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
>  PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
> +	TESTCLEARFLAG(LRU, lru, PF_HEAD)
>  PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
>  	TESTCLEARFLAG(Active, active, PF_HEAD)
>  PAGEFLAG(Workingset, workingset, PF_HEAD)
> diff --git a/mm/mlock.c b/mm/mlock.c
> index f8736136fad7..228ba5a8e0a5 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -108,13 +108,12 @@ void mlock_vma_page(struct page *page)
>   */
>  static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>  {
> -	if (PageLRU(page)) {
> +	if (TestClearPageLRU(page)) {
>  		struct lruvec *lruvec;
>  
>  		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>  		if (getpage)
>  			get_page(page);
> -		ClearPageLRU(page);
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
>  		return true;
>  	}
> diff --git a/mm/swap.c b/mm/swap.c
> index f645965fde0e..5092fe9c8c47 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
>  		struct lruvec *lruvec;
>  		unsigned long flags;
>  
> +		__ClearPageLRU(page);
>  		spin_lock_irqsave(&pgdat->lru_lock, flags);
>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -		VM_BUG_ON_PAGE(!PageLRU(page), page);
> -		__ClearPageLRU(page);
>  		del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>  	}
> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
>  				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>  			}
>  
> -			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
> -			VM_BUG_ON_PAGE(!PageLRU(page), page);
>  			__ClearPageLRU(page);
> +			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>  			del_page_from_lru_list(page, lruvec, page_off_lru(page));
>  		}
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c1c4259b4de5..18986fefd49b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1548,16 +1548,16 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>  {
>  	int ret = -EINVAL;
>  
> -	/* Only take pages on the LRU. */
> -	if (!PageLRU(page))
> -		return ret;
> -
>  	/* Compaction should not handle unevictable pages but CMA can do so */
>  	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>  		return ret;
>  
>  	ret = -EBUSY;
>  
> +	/* Only take pages on the LRU. */
> +	if (!PageLRU(page))
> +		return ret;
> +
>  	/*
>  	 * To minimise LRU disruption, the caller can indicate that it only
>  	 * wants to isolate pages it will be able to operate on without
> @@ -1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  		page = lru_to_page(src);
>  		prefetchw_prev_lru_page(page, src, flags);
>  
> -		VM_BUG_ON_PAGE(!PageLRU(page), page);
> -
>  		nr_pages = compound_nr(page);
>  		total_scan += nr_pages;
>  
> @@ -1769,21 +1767,19 @@ int isolate_lru_page(struct page *page)
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>  
> -	if (PageLRU(page)) {
> +	if (TestClearPageLRU(page)) {
>  		pg_data_t *pgdat = page_pgdat(page);
>  		struct lruvec *lruvec;
> +		int lru = page_lru(page);
>  
> -		spin_lock_irq(&pgdat->lru_lock);
> +		get_page(page);
>  		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -		if (PageLRU(page)) {
> -			int lru = page_lru(page);
> -			get_page(page);
> -			ClearPageLRU(page);
> -			del_page_from_lru_list(page, lruvec, lru);
> -			ret = 0;
> -		}
> +		spin_lock_irq(&pgdat->lru_lock);
> +		del_page_from_lru_list(page, lruvec, lru);
>  		spin_unlock_irq(&pgdat->lru_lock);
> +		ret = 0;
>  	}
> +
>  	return ret;
>  }
>  
> 
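
The "test and clear as isolation precondition" idea from the commit message
above can be sketched in userspace C11. This is an illustration only, under
stated assumptions: fake_page, PG_LRU_BIT, test_clear_lru and isolate_sketch
are hypothetical names, and atomic_fetch_and stands in for the kernel's
atomic bit operation. The point it shows is that only one caller can observe
the bit as set, so only that caller goes on to unlink the page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define PG_LRU_BIT (1UL << 0)   /* hypothetical bit position */

/* Hypothetical stand-in for struct page: just a flags word. */
struct fake_page {
	atomic_ulong flags;
};

/* Atomically clear the LRU bit; true only if it was set before. */
static bool test_clear_lru(struct fake_page *page)
{
	assert(page != NULL);
	unsigned long old = atomic_fetch_and(&page->flags, ~PG_LRU_BIT);

	return (old & PG_LRU_BIT) != 0;
}

/* Isolation succeeds only for the caller that wins the bit. */
static int isolate_sketch(struct fake_page *page)
{
	if (!test_clear_lru(page))
		return -1;  /* already isolated elsewhere, or never on an LRU */

	/* Real code would now take the lru lock and unlink the page. */
	return 0;
}
```

A second isolate_sketch() call on the same page fails, because the first
call already cleared the bit; no lock is needed to decide who wins.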

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 05/22] mm/thp: move lru_add_page_tail func to huge_memory.c
  2020-07-16  8:59   ` Alex Shi
@ 2020-07-16 13:17     ` Kirill A. Shutemov
  2020-07-17  5:13       ` Alex Shi
  0 siblings, 1 reply; 80+ messages in thread
From: Kirill A. Shutemov @ 2020-07-16 13:17 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang

On Thu, Jul 16, 2020 at 04:59:48PM +0800, Alex Shi wrote:
> Hi Kirill & Matthew,
> 
> Are there any concerns about the THP-related patches?

It is a mechanical move. I don't see a problem.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 00/22] per memcg lru_lock
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
@ 2020-07-16 14:11   ` Alexander Duyck
  2020-07-11  0:58 ` [PATCH v16 02/22] mm/page_idle: no unlikely double check for idle page counting Alex Shi
                     ` (24 subsequent siblings)
  25 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-16 14:11 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov

On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> This new version is based on v5.8-rc4. It adds 2 more patches:
> 'mm/thp: remove code path which never got into'
> 'mm/thp: add tail pages into lru anyway in split_huge_page()'
> and modifies 'mm/mlock: reorder isolation sequence during munlock'.
>
> Currently there is one lru_lock per node, pgdat->lru_lock, guarding the
> lru lists, even though the lru lists were moved into memcg long ago. A
> per-node lru_lock is clearly unscalable: pages of every memcg have to
> compete with each other for the same lru_lock. This patchset replaces the
> per-node lru lock with a per-lruvec/memcg lru_lock to guard the lru lists,
> which makes it scalable across memcgs and yields a performance gain.
>
> Currently lru_lock guards both the lru list and the page's lru bit, which
> is fine. But if we want to use a specific lruvec lock for a page, we need to
> pin down the page's lruvec/memcg while locking. Just taking the lruvec lock
> first may be undermined by the page's memcg charge/migration. To fix this,
> we pull the clearing of the page's lru bit out and use it as the pin-down
> action that blocks memcg changes. That is the reason for the new atomic
> function TestClearPageLRU. Isolating a page now needs both actions:
> TestClearPageLRU and holding the lru_lock.
>
> The typical user of this is isolate_migratepages_block() in compaction.c:
> we have to take the lru bit before the lru lock, which serializes page
> isolation against memcg page charge/migration, since those change the
> page's lruvec and therefore the lru_lock that applies to it.
>
> The above solution was suggested by Johannes Weiner and builds on his new
> memcg charge path. (Hugh Dickins tested and contributed much code, from
> the compaction fix to general code polish, thanks a lot!)
>
> The patchset includes 3 parts:
> 1, some code cleanup and minor optimization as preparation.
> 2, use TestClearPageLRU as page isolation's precondition
> 3, replace per node lru_lock with per memcg per node lru_lock
>
> Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
> containers on a 2s * 26cores * HT box with a modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> With this patchset, the readtwice performance increased by about 80%
> in concurrent containers.
>
> Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up this
> idea 8 years ago, and to others who gave comments as well: Daniel Jordan,
> Mel Gorman, Shakeel Butt, Matthew Wilcox etc.
>
> Thanks for testing support from Intel 0day and Rong Chen, Fengguang Wu,
> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

Hi Alex,

I think I am seeing a regression with this patch set when I run the
will-it-scale/page_fault3 test. Specifically the processes result is
dropping from 56371083 to 43127382 when I apply these patches.

I haven't had a chance to bisect and figure out what is causing it,
and wanted to let you know in case you are aware of anything specific
that may be causing this.

Thanks.

- Alex
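
For scale, the two page_fault3 figures quoted above correspond to a drop of
roughly 23.5%. A minimal helper for that arithmetic, assuming the two
numbers are directly comparable throughput counts (percent_drop is a
hypothetical name used only here):

```c
#include <assert.h>

/* Relative drop from 'before' to 'after', in percent. */
static double percent_drop(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Calling percent_drop(56371083.0, 43127382.0) gives the relative loss for
the "processes" result reported in this message.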

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU
  2020-07-11  0:58 ` [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU Alex Shi
@ 2020-07-16 21:12     ` Alexander Duyck
  2020-07-16 21:12     ` Alexander Duyck
  1 sibling, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-16 21:12 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov

On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Combine PageLRU check and ClearPageLRU into a function by new
> introduced func TestClearPageLRU. This function will be used as page
> isolation precondition to prevent other isolations some where else.
> Then there are may non PageLRU page on lru list, need to remove BUG
> checking accordingly.
>
> Hugh Dickins pointed that __page_cache_release and release_pages
> has no need to do atomic clear bit since no user on the page at that
> moment. and no need get_page() before lru bit clear in isolate_lru_page,
> since it '(1) Must be called with an elevated refcount on the page'.
>
> As Andrew Morton mentioned this change would dirty cacheline for page
> isn't on LRU. But the lost would be acceptable with Rong Chen
> <rong.a.chen@intel.com> report:
> https://lkml.org/lkml/2020/3/4/173
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/page-flags.h |  1 +
>  mm/mlock.c                 |  3 +--
>  mm/swap.c                  |  6 ++----
>  mm/vmscan.c                | 26 +++++++++++---------------
>  4 files changed, 15 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6be1aa559b1e..9554ed1387dc 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size)
>  PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
>         __CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
>  PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
> +       TESTCLEARFLAG(LRU, lru, PF_HEAD)
>  PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
>         TESTCLEARFLAG(Active, active, PF_HEAD)
>  PAGEFLAG(Workingset, workingset, PF_HEAD)
> diff --git a/mm/mlock.c b/mm/mlock.c
> index f8736136fad7..228ba5a8e0a5 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -108,13 +108,12 @@ void mlock_vma_page(struct page *page)
>   */
>  static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>  {
> -       if (PageLRU(page)) {
> +       if (TestClearPageLRU(page)) {
>                 struct lruvec *lruvec;
>
>                 lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 if (getpage)
>                         get_page(page);
> -               ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
>                 return true;
>         }
> diff --git a/mm/swap.c b/mm/swap.c
> index f645965fde0e..5092fe9c8c47 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
>                 struct lruvec *lruvec;
>                 unsigned long flags;
>
> +               __ClearPageLRU(page);
>                 spin_lock_irqsave(&pgdat->lru_lock, flags);
>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -               VM_BUG_ON_PAGE(!PageLRU(page), page);
> -               __ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>         }

So this piece doesn't make much sense to me. Why not use
TestClearPageLRU(page) here? Just a few lines above you are testing
for PageLRU(page). If the goal is an atomic test/clear followed by
removing the page from the LRU list, it seems like you should be using
it here as well. Otherwise you could run into a potential collision,
since you are testing here without clearing the bit.
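
To make the test/clear point concrete, here is a minimal userspace
sketch, not kernel code. It assumes a hypothetical struct page_model
and bit position, and uses C11 atomics in place of the kernel's bitops.
It only shows the property being asked about: when the test and the
clear are a single atomic step, exactly one caller can observe the bit
as set.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LRU_BIT (1UL << 0)	/* hypothetical bit position, for the sketch only */

/* Hypothetical stand-in for struct page: only the flags word is modelled. */
struct page_model {
	atomic_ulong flags;
};

/*
 * Atomically clear the LRU bit and report whether it was set before.
 * Because the read and the clear are one operation, two racing callers
 * cannot both get "true" for the same page.
 */
static bool test_clear_lru(struct page_model *page)
{
	unsigned long old = atomic_fetch_and(&page->flags, ~PG_LRU_BIT);

	return old & PG_LRU_BIT;
}
```

Calling test_clear_lru() twice on a page that starts with the bit set
returns true the first time and false the second, which is what lets
the return value act as the "I own the isolation" signal.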

> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
>                                 spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>                         }
>
> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
> -                       VM_BUG_ON_PAGE(!PageLRU(page), page);
>                         __ClearPageLRU(page);
> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                 }
>

Same here. You are just moving the flag clearing, but you didn't
combine it with the test. It seems like you are expecting this to
be treated as an atomic operation. It should be relatively low cost
to do, since you should already own the cacheline as a result of
calling put_page_testzero, so I am not sure why you are not combining
the two.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c1c4259b4de5..18986fefd49b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1548,16 +1548,16 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>  {
>         int ret = -EINVAL;
>
> -       /* Only take pages on the LRU. */
> -       if (!PageLRU(page))
> -               return ret;
> -
>         /* Compaction should not handle unevictable pages but CMA can do so */
>         if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>                 return ret;
>
>         ret = -EBUSY;
>
> +       /* Only take pages on the LRU. */
> +       if (!PageLRU(page))
> +               return ret;
> +
>         /*
>          * To minimise LRU disruption, the caller can indicate that it only
>          * wants to isolate pages it will be able to operate on without
> @@ -1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>                 page = lru_to_page(src);
>                 prefetchw_prev_lru_page(page, src, flags);
>
> -               VM_BUG_ON_PAGE(!PageLRU(page), page);
> -
>                 nr_pages = compound_nr(page);
>                 total_scan += nr_pages;
>

So effectively the changes here are making it so that a !PageLRU page
will cycle to the start of the LRU list. Now, if I understand correctly,
we are guaranteed that if the flag is not set it cannot be set while
we are holding the lru_lock; however, it can be cleared while we are
holding the lock, correct? That is why isolate_lru_pages has to
call TestClearPageLRU after the earlier check in __isolate_lru_page.

It might make it more readable to pull in the later patch that
modifies isolate_lru_pages that has it using TestClearPageLRU.

> @@ -1769,21 +1767,19 @@ int isolate_lru_page(struct page *page)
>         VM_BUG_ON_PAGE(!page_count(page), page);
>         WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>
> -       if (PageLRU(page)) {
> +       if (TestClearPageLRU(page)) {
>                 pg_data_t *pgdat = page_pgdat(page);
>                 struct lruvec *lruvec;
> +               int lru = page_lru(page);
>
> -               spin_lock_irq(&pgdat->lru_lock);
> +               get_page(page);
>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -               if (PageLRU(page)) {
> -                       int lru = page_lru(page);
> -                       get_page(page);
> -                       ClearPageLRU(page);
> -                       del_page_from_lru_list(page, lruvec, lru);
> -                       ret = 0;
> -               }
> +               spin_lock_irq(&pgdat->lru_lock);
> +               del_page_from_lru_list(page, lruvec, lru);
>                 spin_unlock_irq(&pgdat->lru_lock);
> +               ret = 0;
>         }
> +
>         return ret;
>  }
>
> --
> 1.8.3.1
>
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 15/22] mm/compaction: do page isolation first in compaction
  2020-07-11  0:58 ` [PATCH v16 15/22] mm/compaction: do page isolation first in compaction Alex Shi
@ 2020-07-16 21:32     ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-16 21:32 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov

On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Johannes Weiner has suggested:
> "So here is a crazy idea that may be worth exploring:
>
> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> linked list.
>
> Can we make PageLRU atomic and use it to stabilize the lru_lock
> instead, and then use the lru_lock only serialize list operations?
> ..."
>
> Yes, this patch is doing so on  __isolate_lru_page which is the core
> page isolation func in compaction and shrinking path.
> With this patch, the compaction will only deal the PageLRU set and now
> isolated pages to skip the just alloced page which no LRU bit. And the
> isolation could exclusive the other isolations in memcg move_account,
> page migrations and thp split_huge_page.
>
> As a side effect, PageLRU may be cleared during shrink_inactive_list
> path for isolation reason. If so, we can skip that page.
>
> Hugh Dickins <hughd@google.com> fixed following bugs in this patch's
> early version:
>
> Fix lots of crashes under compaction load: isolate_migratepages_block()
> must clean up appropriately when rejecting a page, setting PageLRU again
> if it had been cleared; and a put_page() after get_page_unless_zero()
> cannot safely be done while holding locked_lruvec - it may turn out to
> be the final put_page(), which will take an lruvec lock when PageLRU.
> And move __isolate_lru_page_prepare back after get_page_unless_zero to
> make trylock_page() safe:
> trylock_page() is not safe to use at this time: its setting PG_locked
> can race with the page being freed or allocated ("Bad page"), and can
> also erase flags being set by one of those "sole owners" of a freshly
> allocated page who use non-atomic __SetPageFlag().
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/swap.h |  2 +-
>  mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
>  mm/vmscan.c          | 38 ++++++++++++++++++++++----------------
>  3 files changed, 56 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2c29399b29a0..6d23d3beeff7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
> -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                   unsigned long nr_pages,
>                                                   gfp_t gfp_mask,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index f14780fc296a..2da2933fe56b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -869,6 +869,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
>                         if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
>                                 low_pfn = end_pfn;
> +                               page = NULL;
>                                 goto isolate_abort;
>                         }
>                         valid_page = page;
> @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>                         goto isolate_fail;
>
> +               /*
> +                * Be careful not to clear PageLRU until after we're
> +                * sure the page is not being freed elsewhere -- the
> +                * page release code relies on it.
> +                */
> +               if (unlikely(!get_page_unless_zero(page)))
> +                       goto isolate_fail;
> +
> +               if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> +                       goto isolate_fail_put;
> +
> +               /* Try isolate the page */
> +               if (!TestClearPageLRU(page))
> +                       goto isolate_fail_put;
> +
>                 /* If we already hold the lock, we can skip some rechecking */
>                 if (!locked) {
>                         locked = compact_lock_irqsave(&pgdat->lru_lock,

Why not do the __isolate_lru_page_prepare before getting the page?
That way you can avoid performing an extra atomic operation on non-LRU
pages.
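
A rough userspace sketch of the ordering question, not kernel code.
The names page_model, rmw_ops, prepare_ok and both isolate_* functions
are hypothetical, and prepare_ok() stands in only for the read-only
PageLRU test. Note that the quoted commit message gives a reason for
the current order (trylock_page() is not safe before the reference is
held), so this models just the cost side of the trade-off.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LRU_BIT (1UL << 0)	/* hypothetical bit position */

/* Hypothetical stand-in for struct page. */
struct page_model {
	atomic_int refcount;
	atomic_ulong flags;
};

/* Sketch-only, single-threaded instrumentation: counts atomic RMW ops. */
static unsigned long rmw_ops;

/* Take a reference unless the count is zero (page is being freed). */
static bool get_ref_unless_zero(struct page_model *page)
{
	int old = atomic_load(&page->refcount);

	while (old != 0) {
		rmw_ops++;
		/* On failure, 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_strong(&page->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void put_ref(struct page_model *page)
{
	rmw_ops++;
	atomic_fetch_sub(&page->refcount, 1);
}

/* Read-only check; does not write to the page's cacheline. */
static bool prepare_ok(struct page_model *page)
{
	return atomic_load(&page->flags) & PG_LRU_BIT;
}

/* Order used in the quoted hunk: reference first, then the checks. */
static bool isolate_ref_first(struct page_model *page)
{
	if (!get_ref_unless_zero(page))
		return false;
	if (!prepare_ok(page)) {
		put_ref(page);
		return false;
	}
	return true;
}

/* Suggested order: checks first, reference only if they pass. */
static bool isolate_check_first(struct page_model *page)
{
	if (!prepare_ok(page))
		return false;
	return get_ref_unless_zero(page);
}
```

For a page without the LRU bit, isolate_ref_first() performs two
atomic read-modify-write operations (get and put) before giving up,
while isolate_check_first() performs none.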

> @@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                         goto isolate_abort;
>                         }
>
> -                       /* Recheck PageLRU and PageCompound under lock */
> -                       if (!PageLRU(page))
> -                               goto isolate_fail;
> -
>                         /*
>                          * Page become compound since the non-locked check,
>                          * and it's on LRU. It can only be a THP so the order
> @@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>                                 low_pfn += compound_nr(page) - 1;
> -                               goto isolate_fail;
> +                               SetPageLRU(page);
> +                               goto isolate_fail_put;
>                         }
>                 }
>
>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
> -               /* Try isolate the page */
> -               if (__isolate_lru_page(page, isolate_mode) != 0)
> -                       goto isolate_fail;
> -
>                 /* The whole page is taken off the LRU; skip the tail pages. */
>                 if (PageCompound(page))
>                         low_pfn += compound_nr(page) - 1;
> @@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 }
>
>                 continue;
> +
> +isolate_fail_put:
> +               /* Avoid potential deadlock in freeing page under lru_lock */
> +               if (locked) {
> +                       spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +                       locked = false;
> +               }
> +               put_page(page);
> +
>  isolate_fail:
>                 if (!skip_on_failure)
>                         continue;
> @@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         if (unlikely(low_pfn > end_pfn))
>                 low_pfn = end_pfn;
>
> +       page = NULL;
> +
>  isolate_abort:
>         if (locked)
>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (page) {
> +               SetPageLRU(page);
> +               put_page(page);
> +       }
>
>         /*
>          * Updated the cached scanner pfn once the pageblock has been scanned
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 18986fefd49b..f77748adc340 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1544,7 +1544,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   *
>   * returns 0 on success, -ve errno on failure.
>   */
> -int __isolate_lru_page(struct page *page, isolate_mode_t mode)
> +int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
>  {
>         int ret = -EINVAL;
>
> @@ -1598,20 +1598,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>         if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
>                 return ret;
>
> -       if (likely(get_page_unless_zero(page))) {
> -               /*
> -                * Be careful not to clear PageLRU until after we're
> -                * sure the page is not being freed elsewhere -- the
> -                * page release code relies on it.
> -                */
> -               ClearPageLRU(page);
> -               ret = 0;
> -       }
> -
> -       return ret;
> +       return 0;
>  }
>
> -
>  /*
>   * Update LRU sizes after isolating pages. The LRU size updates must
>   * be complete before mem_cgroup_update_lru_size due to a sanity check.
> @@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>                  * only when the page is being freed somewhere else.
>                  */
>                 scan += nr_pages;
> -               switch (__isolate_lru_page(page, mode)) {
> +               switch (__isolate_lru_page_prepare(page, mode)) {
>                 case 0:
> +                       /*
> +                        * Be careful not to clear PageLRU until after we're
> +                        * sure the page is not being freed elsewhere -- the
> +                        * page release code relies on it.
> +                        */
> +                       if (unlikely(!get_page_unless_zero(page)))
> +                               goto busy;
> +
> +                       if (!TestClearPageLRU(page)) {
> +                               /*
> +                                * This page may in other isolation path,
> +                                * but we still hold lru_lock.
> +                                */
> +                               put_page(page);
> +                               goto busy;
> +                       }
> +

I wonder if it wouldn't make sense to combine these two atomic ops,
their tests, and the put_page into a single inline function? Then you
could do just one check: if it succeeds you run the block of code
below, otherwise you fall through into the -EBUSY case.
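
Something along these lines is what I have in mind, shown here as a
userspace sketch rather than kernel code. The struct page_model, the
*_model helpers and try_isolate_model() are hypothetical names, C11
atomics stand in for the kernel primitives, and the model ignores
lru_lock and never actually frees a page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LRU_BIT (1UL << 0)	/* hypothetical bit position */

/* Hypothetical stand-in for struct page. */
struct page_model {
	atomic_int refcount;
	atomic_ulong flags;
};

/* Take a reference unless the count is zero (page is being freed). */
static bool get_page_unless_zero_model(struct page_model *page)
{
	int old = atomic_load(&page->refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the model only counts and never frees. */
static void put_page_model(struct page_model *page)
{
	int old = atomic_fetch_sub(&page->refcount, 1);

	assert(old > 0);
}

/* Atomically clear the LRU bit; true only if it was set before. */
static bool test_clear_lru(struct page_model *page)
{
	unsigned long old = atomic_fetch_and(&page->flags, ~PG_LRU_BIT);

	return old & PG_LRU_BIT;
}

/*
 * Combined step: pin the page, then claim the LRU bit. On any failure
 * the page is left exactly as it was found, so the caller can treat
 * "false" the same way it treats -EBUSY.
 */
static bool try_isolate_model(struct page_model *page)
{
	if (!get_page_unless_zero_model(page))
		return false;

	if (!test_clear_lru(page)) {
		put_page_model(page);
		return false;
	}

	return true;
}
```

With a helper shaped like this, the case 0 block would reduce to one
test, with no goto into the middle of the switch.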

>                         nr_taken += nr_pages;
>                         nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
> -
> +busy:
>                 case -EBUSY:
>                         /* else it is being freed elsewhere */
>                         list_move(&page->lru, src);
> -                       continue;
> +                       break;
>
>                 default:
>                         BUG();
> --
> 1.8.3.1
>
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 15/22] mm/compaction: do page isolation first in compaction
@ 2020-07-16 21:32     ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-16 21:32 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov

On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Johannes Weiner has suggested:
> "So here is a crazy idea that may be worth exploring:
>
> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> linked list.
>
> Can we make PageLRU atomic and use it to stabilize the lru_lock
> instead, and then use the lru_lock only serialize list operations?
> ..."
>
> Yes, this patch is doing so on  __isolate_lru_page which is the core
> page isolation func in compaction and shrinking path.
> With this patch, the compaction will only deal the PageLRU set and now
> isolated pages to skip the just alloced page which no LRU bit. And the
> isolation could exclusive the other isolations in memcg move_account,
> page migrations and thp split_huge_page.
>
> As a side effect, PageLRU may be cleared during shrink_inactive_list
> path for isolation reason. If so, we can skip that page.
>
> Hugh Dickins <hughd@google.com> fixed following bugs in this patch's
> early version:
>
> Fix lots of crashes under compaction load: isolate_migratepages_block()
> must clean up appropriately when rejecting a page, setting PageLRU again
> if it had been cleared; and a put_page() after get_page_unless_zero()
> cannot safely be done while holding locked_lruvec - it may turn out to
> be the final put_page(), which will take an lruvec lock when PageLRU.
> And move __isolate_lru_page_prepare back after get_page_unless_zero to
> make trylock_page() safe:
> trylock_page() is not safe to use at this time: its setting PG_locked
> can race with the page being freed or allocated ("Bad page"), and can
> also erase flags being set by one of those "sole owners" of a freshly
> allocated page who use non-atomic __SetPageFlag().
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/swap.h |  2 +-
>  mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
>  mm/vmscan.c          | 38 ++++++++++++++++++++++----------------
>  3 files changed, 56 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2c29399b29a0..6d23d3beeff7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
> -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                   unsigned long nr_pages,
>                                                   gfp_t gfp_mask,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index f14780fc296a..2da2933fe56b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -869,6 +869,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
>                         if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
>                                 low_pfn = end_pfn;
> +                               page = NULL;
>                                 goto isolate_abort;
>                         }
>                         valid_page = page;
> @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>                         goto isolate_fail;
>
> +               /*
> +                * Be careful not to clear PageLRU until after we're
> +                * sure the page is not being freed elsewhere -- the
> +                * page release code relies on it.
> +                */
> +               if (unlikely(!get_page_unless_zero(page)))
> +                       goto isolate_fail;
> +
> +               if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> +                       goto isolate_fail_put;
> +
> +               /* Try isolate the page */
> +               if (!TestClearPageLRU(page))
> +                       goto isolate_fail_put;
> +
>                 /* If we already hold the lock, we can skip some rechecking */
>                 if (!locked) {
>                         locked = compact_lock_irqsave(&pgdat->lru_lock,

Why not do the __isolate_lru_page_prepare before getting the page?
That way you can avoid performing an extra atomic operation on non-LRU
pages.

> @@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                         goto isolate_abort;
>                         }
>
> -                       /* Recheck PageLRU and PageCompound under lock */
> -                       if (!PageLRU(page))
> -                               goto isolate_fail;
> -
>                         /*
>                          * Page become compound since the non-locked check,
>                          * and it's on LRU. It can only be a THP so the order
> @@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>                                 low_pfn += compound_nr(page) - 1;
> -                               goto isolate_fail;
> +                               SetPageLRU(page);
> +                               goto isolate_fail_put;
>                         }
>                 }
>
>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
> -               /* Try isolate the page */
> -               if (__isolate_lru_page(page, isolate_mode) != 0)
> -                       goto isolate_fail;
> -
>                 /* The whole page is taken off the LRU; skip the tail pages. */
>                 if (PageCompound(page))
>                         low_pfn += compound_nr(page) - 1;
> @@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 }
>
>                 continue;
> +
> +isolate_fail_put:
> +               /* Avoid potential deadlock in freeing page under lru_lock */
> +               if (locked) {
> +                       spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +                       locked = false;
> +               }
> +               put_page(page);
> +
>  isolate_fail:
>                 if (!skip_on_failure)
>                         continue;
> @@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         if (unlikely(low_pfn > end_pfn))
>                 low_pfn = end_pfn;
>
> +       page = NULL;
> +
>  isolate_abort:
>         if (locked)
>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (page) {
> +               SetPageLRU(page);
> +               put_page(page);
> +       }
>
>         /*
>          * Updated the cached scanner pfn once the pageblock has been scanned
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 18986fefd49b..f77748adc340 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1544,7 +1544,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   *
>   * returns 0 on success, -ve errno on failure.
>   */
> -int __isolate_lru_page(struct page *page, isolate_mode_t mode)
> +int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
>  {
>         int ret = -EINVAL;
>
> @@ -1598,20 +1598,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>         if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
>                 return ret;
>
> -       if (likely(get_page_unless_zero(page))) {
> -               /*
> -                * Be careful not to clear PageLRU until after we're
> -                * sure the page is not being freed elsewhere -- the
> -                * page release code relies on it.
> -                */
> -               ClearPageLRU(page);
> -               ret = 0;
> -       }
> -
> -       return ret;
> +       return 0;
>  }
>
> -
>  /*
>   * Update LRU sizes after isolating pages. The LRU size updates must
>   * be complete before mem_cgroup_update_lru_size due to a sanity check.
> @@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>                  * only when the page is being freed somewhere else.
>                  */
>                 scan += nr_pages;
> -               switch (__isolate_lru_page(page, mode)) {
> +               switch (__isolate_lru_page_prepare(page, mode)) {
>                 case 0:
> +                       /*
> +                        * Be careful not to clear PageLRU until after we're
> +                        * sure the page is not being freed elsewhere -- the
> +                        * page release code relies on it.
> +                        */
> +                       if (unlikely(!get_page_unless_zero(page)))
> +                               goto busy;
> +
> +                       if (!TestClearPageLRU(page)) {
> +                               /*
> +                                * This page may in other isolation path,
> +                                * but we still hold lru_lock.
> +                                */
> +                               put_page(page);
> +                               goto busy;
> +                       }
> +

I wonder if it wouldn't make sense to combine these two atomic ops
with tests and the put_page into a single inline function? Then it
could be possible to just do one check and if succeeds you do the
block of code below, otherwise you just fall-through into the -EBUSY
case.

>                         nr_taken += nr_pages;
>                         nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
> -
> +busy:
>                 case -EBUSY:
>                         /* else it is being freed elsewhere */
>                         list_move(&page->lru, src);
> -                       continue;
> +                       break;
>
>                 default:
>                         BUG();
> --
> 1.8.3.1
>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 15/22] mm/compaction: do page isolation first in compaction
  2020-07-16 21:32     ` Alexander Duyck
  (?)
@ 2020-07-17  5:09     ` Alex Shi
  2020-07-17 16:09         ` Alexander Duyck
  -1 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-17  5:09 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov


>> @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>                 if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>>                         goto isolate_fail;
>>
>> +               /*
>> +                * Be careful not to clear PageLRU until after we're
>> +                * sure the page is not being freed elsewhere -- the
>> +                * page release code relies on it.
>> +                */
>> +               if (unlikely(!get_page_unless_zero(page)))
>> +                       goto isolate_fail;
>> +
>> +               if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>> +                       goto isolate_fail_put;
>> +
>> +               /* Try isolate the page */
>> +               if (!TestClearPageLRU(page))
>> +                       goto isolate_fail_put;
>> +
>>                 /* If we already hold the lock, we can skip some rechecking */
>>                 if (!locked) {
>>                         locked = compact_lock_irqsave(&pgdat->lru_lock,
> 
> Why not do the __isolate_lru_page_prepare before getting the page?
> That way you can avoid performing an extra atomic operation on non-LRU
> pages.
>

This change comes from Hugh Dickins, as mentioned in the commit log:
>> trylock_page() is not safe to use at this time: its setting PG_locked
>> can race with the page being freed or allocated ("Bad page"), and can
>> also erase flags being set by one of those "sole owners" of a freshly
>> allocated page who use non-atomic __SetPageFlag().

Hi Hugh,

Could you share more details of the bug?

...

>> +                        * sure the page is not being freed elsewhere -- the
>> +                        * page release code relies on it.
>> +                        */
>> +                       if (unlikely(!get_page_unless_zero(page)))
>> +                               goto busy;
>> +
>> +                       if (!TestClearPageLRU(page)) {
>> +                               /*
>> +                                * This page may in other isolation path,
>> +                                * but we still hold lru_lock.
>> +                                */
>> +                               put_page(page);
>> +                               goto busy;
>> +                       }
>> +
> 
> I wonder if it wouldn't make sense to combine these two atomic ops
> with tests and the put_page into a single inline function? Then it
> could be possible to just do one check and if succeeds you do the
> block of code below, otherwise you just fall-through into the -EBUSY
> case.
> 

Uh, get_page changes page->_refcount while TestClearPageLRU changes page->flags,
so I don't see how to combine them. Could you make it clearer with code?

Thanks
Alex

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 05/22] mm/thp: move lru_add_page_tail func to huge_memory.c
  2020-07-16 13:17     ` Kirill A. Shutemov
@ 2020-07-17  5:13       ` Alex Shi
  2020-07-20  8:37         ` Kirill A. Shutemov
  0 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-17  5:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang



On 2020/7/16 9:17 PM, Kirill A. Shutemov wrote:
> On Thu, Jul 16, 2020 at 04:59:48PM +0800, Alex Shi wrote:
>> Hi Kirill & Matthew,
>>
>> Is there any concern from for the THP involved patches?
> 
> It is mechanical move. I don't see a problem.
> 

Many thanks, Kirill!

Would you mind giving a Reviewed-by?

And are patches 6, 7 and 14 OK as well?

Thanks a lot!
Alex

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 00/22] per memcg lru_lock
  2020-07-16 14:11   ` Alexander Duyck
  (?)
@ 2020-07-17  5:24   ` Alex Shi
  2020-07-19 15:23       ` Hugh Dickins
  -1 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-17  5:24 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov



On 2020/7/16 10:11 PM, Alexander Duyck wrote:
>> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
>> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
> Hi Alex,
> 
> I think I am seeing a regression with this patch set when I run the
> will-it-scale/page_fault3 test. Specifically the processes result is
> dropping from 56371083 to 43127382 when I apply these patches.
> 
> I haven't had a chance to bisect and figure out what is causing it,
> and wanted to let you know in case you are aware of anything specific
> that may be causing this.


Thanks a lot for the info!

Actually, patch 17 and patch 13 may change performance a little. With patch 17,
for example, Intel LKP found that vm-scalability.throughput improved by 68.0%,
stress-ng.remap.ops_per_sec regressed by 76.3%, and stress-ng.memfd.ops_per_sec
improved by 23.2%, etc.

This kind of performance interference is known and acceptable.
Thanks
Alex
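
For scale, the relative change implied by the two page_fault3 numbers quoted
above can be worked out directly. This is only a minimal sketch; the helper
name percent_change() is made up for illustration and is not part of
will-it-scale or the kernel.

```c
#include <assert.h>

/*
 * Relative change from "before" to "after", in percent.
 * A negative result means the metric dropped.
 */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Calling percent_change(56371083.0, 43127382.0) gives a drop of roughly 23.5%.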
 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU
  2020-07-16 21:12     ` Alexander Duyck
  (?)
@ 2020-07-17  7:45     ` Alex Shi
  2020-07-17 18:26         ` Alexander Duyck
  -1 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-17  7:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov



On 2020/7/17 5:12 AM, Alexander Duyck wrote:
> On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>> Combine PageLRU check and ClearPageLRU into a function by new
>> introduced func TestClearPageLRU. This function will be used as page
>> isolation precondition to prevent other isolations some where else.
>> Then there are may non PageLRU page on lru list, need to remove BUG
>> checking accordingly.
>>
>> Hugh Dickins pointed that __page_cache_release and release_pages
>> has no need to do atomic clear bit since no user on the page at that
>> moment. and no need get_page() before lru bit clear in isolate_lru_page,
>> since it '(1) Must be called with an elevated refcount on the page'.
>>
>> As Andrew Morton mentioned this change would dirty cacheline for page
>> isn't on LRU. But the lost would be acceptable with Rong Chen
>> <rong.a.chen@intel.com> report:
>> https://lkml.org/lkml/2020/3/4/173
>>

...

>> diff --git a/mm/swap.c b/mm/swap.c
>> index f645965fde0e..5092fe9c8c47 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
>>                 struct lruvec *lruvec;
>>                 unsigned long flags;
>>
>> +               __ClearPageLRU(page);
>>                 spin_lock_irqsave(&pgdat->lru_lock, flags);
>>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> -               VM_BUG_ON_PAGE(!PageLRU(page), page);
>> -               __ClearPageLRU(page);
>>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
>>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>>         }
> 
> So this piece doesn't make much sense to me. Why not use
> TestClearPageLRU(page) here? Just a few lines above you are testing
> for PageLRU(page) and it seems like if you are going to go for an
> atomic test/clear and then remove the page from the LRU list you
> should be using it here as well otherwise it seems like you could run
> into a potential collision since you are testing here without clearing
> the bit.
> 

Hi Alex,

Thanks a lot for the comments!

In the call path of __page_cache_release, the lru bit is unlikely to have been
cleared by anyone else, since the page is no longer used by anyone and is about
to be freed. A plain __ClearPageLRU is therefore safe, and it avoids an extra
atomic operation on the page flags.


>> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
>>                                 spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>>                         }
>>
>> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>> -                       VM_BUG_ON_PAGE(!PageLRU(page), page);
>>                         __ClearPageLRU(page);
>> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
>>                 }
>>
> 
> Same here. You are just moving the flag clearing, but you didn't
> combine it with the test. It seems like if you are expecting this to
> be treated as an atomic operation. It should be a relatively low cost
> to do since you already should own the cacheline as a result of
> calling put_page_testzero so I am not sure why you are not combining
> the two.

Before the ClearPageLRU there is a put_page_testzero(), which means no one is
using this page, and isolate_lru_page() cannot run on it because of this check
in that function:
	VM_BUG_ON_PAGE(!page_count(page), page);
So it is safe here.


> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c1c4259b4de5..18986fefd49b 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1548,16 +1548,16 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>>  {
>>         int ret = -EINVAL;
>>
>> -       /* Only take pages on the LRU. */
>> -       if (!PageLRU(page))
>> -               return ret;
>> -
>>         /* Compaction should not handle unevictable pages but CMA can do so */
>>         if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>>                 return ret;
>>
>>         ret = -EBUSY;
>>
>> +       /* Only take pages on the LRU. */
>> +       if (!PageLRU(page))
>> +               return ret;
>> +
>>         /*
>>          * To minimise LRU disruption, the caller can indicate that it only
>>          * wants to isolate pages it will be able to operate on without
>> @@ -1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>>                 page = lru_to_page(src);
>>                 prefetchw_prev_lru_page(page, src, flags);
>>
>> -               VM_BUG_ON_PAGE(!PageLRU(page), page);
>> -
>>                 nr_pages = compound_nr(page);
>>                 total_scan += nr_pages;
>>
> 
> So effectively the changes here are making it so that a !PageLRU page
> will cycle to the start of the LRU list. Now if I understand correctly
> we are guaranteed that if the flag is not set it cannot be set while
> we are holding the lru_lock, however it can be cleared while we are
> holding the lock, correct? Thus that is why isolate_lru_pages has to
> call TestClearPageLRU after the earlier check in __isolate_lru_page.

Right. 

> 
> It might make it more readable to pull in the later patch that
> modifies isolate_lru_pages that has it using TestClearPageLRU.

As for this change, it has to be done in this patch: any TestClearPageLRU may
leave a page on the lru list without its lru bit set, so the precondition check
has to be removed here.

Thanks
Alex
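
The property discussed above (the lru bit can be cleared, but only one caller
can win the clear) can be illustrated in userspace. This is a minimal sketch
using C11 atomics, not the kernel implementation: struct model_page,
MODEL_PG_LRU and model_test_clear_lru() are hypothetical names standing in for
struct page, PG_lru and TestClearPageLRU().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PG_lru bit; not the kernel's bit layout. */
#define MODEL_PG_LRU (1UL << 0)

struct model_page {
	atomic_ulong flags;
};

/*
 * Atomically clear the lru bit and report whether it was set before.
 * Of any number of concurrent callers, exactly one sees "true".
 */
static bool model_test_clear_lru(struct model_page *page)
{
	unsigned long old = atomic_fetch_and(&page->flags, ~MODEL_PG_LRU);

	return old & MODEL_PG_LRU;
}
```

A second call on the same page returns false, which corresponds to the case
where isolate_lru_pages() finds a page on the list whose lru bit has already
been taken by another isolation path.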

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in split_huge_page()
  2020-07-11  0:58 ` [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
@ 2020-07-17  9:30   ` Alex Shi
  2020-07-20  8:49     ` Kirill A. Shutemov
  0 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-17  9:30 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill
  Cc: Mika Penttilä


Added a VM_WARN_ON for tracking, and updated the comments in the code.

Thanks

---
From f1381a1547625a6521777bf9235823d8fbd00dac Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Fri, 10 Jul 2020 16:54:37 +0800
Subject: [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in
 split_huge_page()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

split_huge_page() must start with PageLRU(head), and we are holding the
lru_lock here. If the head's lru bit was cleared unexpectedly, warn so that it
can be tracked.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d866b6e43434..28538444197b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2348,15 +2348,19 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(head)))
-		list_add_tail(&page_tail->lru, &head->lru);
-	else if (list) {
+	if (list) {
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Split start from PageLRU(head), and we are holding the
+		 * lru_lock.
+		 * Do a warning if the head's lru bit was cleared unexpected.
+		 */
+		VM_WARN_ON(!PageLRU(head));
+		SetPageLRU(page_tail);
+		list_add_tail(&page_tail->lru, &head->lru);
 	}
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 15/22] mm/compaction: do page isolation first in compaction
  2020-07-17  5:09     ` Alex Shi
@ 2020-07-17 16:09         ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-17 16:09 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov

On Thu, Jul 16, 2020 at 10:10 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
> >> @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                 if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
> >>                         goto isolate_fail;
> >>
> >> +               /*
> >> +                * Be careful not to clear PageLRU until after we're
> >> +                * sure the page is not being freed elsewhere -- the
> >> +                * page release code relies on it.
> >> +                */
> >> +               if (unlikely(!get_page_unless_zero(page)))
> >> +                       goto isolate_fail;
> >> +
> >> +               if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >> +                       goto isolate_fail_put;
> >> +
> >> +               /* Try isolate the page */
> >> +               if (!TestClearPageLRU(page))
> >> +                       goto isolate_fail_put;
> >> +
> >>                 /* If we already hold the lock, we can skip some rechecking */
> >>                 if (!locked) {
> >>                         locked = compact_lock_irqsave(&pgdat->lru_lock,
> >
> > Why not do the __isolate_lru_page_prepare before getting the page?
> > That way you can avoid performing an extra atomic operation on non-LRU
> > pages.
> >
>
> This change come from Hugh Dickins as mentioned from commit log:
> >> trylock_page() is not safe to use at this time: its setting PG_locked
> >> can race with the page being freed or allocated ("Bad page"), and can
> >> also erase flags being set by one of those "sole owners" of a freshly
> >> allocated page who use non-atomic __SetPageFlag().
>
> Hi Hugh,
>
> would you like to show more details of the bug?
>
> ...
>
> >> +                        * sure the page is not being freed elsewhere -- the
> >> +                        * page release code relies on it.
> >> +                        */
> >> +                       if (unlikely(!get_page_unless_zero(page)))
> >> +                               goto busy;
> >> +
> >> +                       if (!TestClearPageLRU(page)) {
> >> +                               /*
> >> +                                * This page may in other isolation path,
> >> +                                * but we still hold lru_lock.
> >> +                                */
> >> +                               put_page(page);
> >> +                               goto busy;
> >> +                       }
> >> +
> >
> > I wonder if it wouldn't make sense to combine these two atomic ops
> > with tests and the put_page into a single inline function? Then it
> > could be possible to just do one check and if succeeds you do the
> > block of code below, otherwise you just fall-through into the -EBUSY
> > case.
> >
>
> Uh, since get_page changes page->_refcount, TestClearPageLRU changes page->flags,
> So I don't know how to combine them, could you make it more clear with code?

Actually it is pretty straightforward. Something like this:

static inline bool get_page_unless_zero_or_nonlru(struct page *page)
{
    if (get_page_unless_zero(page)) {
        if (TestClearPageLRU(page))
            return true;
        put_page(page);
    }
    return false;
}

You can then add comments as necessary. The general idea is you are
having to do this in two different spots anyway so why not combine the
logic? Although it does assume you can change the ordering of the
other test above.
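
The control flow of the suggested helper can be tried out in userspace. The
following is a minimal sketch with C11 atomics, under the assumption that a
plain counter and a flag word are enough to model the page. All model_* names
and MODEL_PG_LRU are hypothetical, and model_put() only drops the count, it
does not free anything as put_page() would.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PG_lru bit; not the kernel's bit layout. */
#define MODEL_PG_LRU (1UL << 0)

struct model_page {
	atomic_int refcount;
	atomic_ulong flags;
};

/* Take a reference unless the count has already dropped to zero. */
static bool model_get_unless_zero(struct model_page *page)
{
	int old = atomic_load(&page->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; this model never frees the page. */
static void model_put(struct model_page *page)
{
	atomic_fetch_sub(&page->refcount, 1);
}

/* Atomically clear the lru bit, returning whether it was set. */
static bool model_test_clear_lru(struct model_page *page)
{
	return atomic_fetch_and(&page->flags, ~MODEL_PG_LRU) & MODEL_PG_LRU;
}

/* Same shape as the helper proposed above: pin first, then claim the bit. */
static bool model_get_unless_zero_or_nonlru(struct model_page *page)
{
	if (model_get_unless_zero(page)) {
		if (model_test_clear_lru(page))
			return true;
		model_put(page);
	}
	return false;
}
```

On success the caller holds both an extra reference and the lru bit. On
failure neither the count nor the flags are changed, so both call sites can
fall through to their existing failure handling.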

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU
  2020-07-17  7:45     ` Alex Shi
@ 2020-07-17 18:26         ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-17 18:26 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov

On Fri, Jul 17, 2020 at 12:46 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> On 2020/7/17 5:12 AM, Alexander Duyck wrote:
> > On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >> Combine PageLRU check and ClearPageLRU into a function by new
> >> introduced func TestClearPageLRU. This function will be used as page
> >> isolation precondition to prevent other isolations some where else.
> >> Then there are may non PageLRU page on lru list, need to remove BUG
> >> checking accordingly.
> >>
> >> Hugh Dickins pointed that __page_cache_release and release_pages
> >> has no need to do atomic clear bit since no user on the page at that
> >> moment. and no need get_page() before lru bit clear in isolate_lru_page,
> >> since it '(1) Must be called with an elevated refcount on the page'.
> >>
> >> As Andrew Morton mentioned this change would dirty cacheline for page
> >> isn't on LRU. But the lost would be acceptable with Rong Chen
> >> <rong.a.chen@intel.com> report:
> >> https://lkml.org/lkml/2020/3/4/173
> >>
>
> ...
>
> >> diff --git a/mm/swap.c b/mm/swap.c
> >> index f645965fde0e..5092fe9c8c47 100644
> >> --- a/mm/swap.c
> >> +++ b/mm/swap.c
> >> @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
> >>                 struct lruvec *lruvec;
> >>                 unsigned long flags;
> >>
> >> +               __ClearPageLRU(page);
> >>                 spin_lock_irqsave(&pgdat->lru_lock, flags);
> >>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
> >> -               VM_BUG_ON_PAGE(!PageLRU(page), page);
> >> -               __ClearPageLRU(page);
> >>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> >>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> >>         }
> >
> > So this piece doesn't make much sense to me. Why not use
> > TestClearPageLRU(page) here? Just a few lines above you are testing
> > for PageLRU(page) and it seems like if you are going to go for an
> > atomic test/clear and then remove the page from the LRU list you
> > should be using it here as well otherwise it seems like you could run
> > into a potential collision since you are testing here without clearing
> > the bit.
> >
>
> Hi Alex,
>
> Thanks a lot for comments!
>
> In this func's call path, __page_cache_release, the page is unlikely to be
> ClearPageLRU'd, since the page isn't used by anyone and is going to be freed.
> So just __ClearPageLRU would be safe, and could save disturbing the flags of
> a non-lru page.

So if I understand you correctly, you are indicating that this page
should likely not have the LRU flag set, and that since we just
transitioned its reference count from 1 -> 0 there should be nobody
else accessing it. Is that correct?

It might be useful to document this somewhere. Essentially what we are
doing then is breaking this up into the following cases.

1. Setting the LRU bit requires holding the LRU lock
2. Clearing the LRU bit requires either:
        a. Atomic operations if the page count is 1 or more
        b. Non-atomic operations are enough if we just dropped the last
           reference count

Is my understanding of this correct? So we have essentially two
scenarios, one for the get_page_unless_zero case, and another with the
put_page_testzero.
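
To make the three cases concrete, here is a minimal userspace sketch of
the rules as I have stated them. It is only a model built on C11
atomics: struct lru_model_page and the lru_model_* helpers are made-up
names for illustration, not the kernel's struct page or its page-flag
helpers, and the LRU lock itself is assumed rather than modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; not the kernel layout. */
struct lru_model_page {
	atomic_ulong flags;   /* bit 0 models PG_lru */
	atomic_int refcount;  /* models the page reference count */
};

#define LRU_MODEL_PG_LRU 1UL

/* Case 1: set the bit. The caller is assumed to hold the LRU lock. */
static void lru_model_set(struct lru_model_page *p)
{
	atomic_fetch_or(&p->flags, LRU_MODEL_PG_LRU);
}

/*
 * Case 2a: the page may still have users, so test and clear in one
 * atomic step. Returns true only for the caller that cleared the bit.
 */
static bool lru_model_test_clear(struct lru_model_page *p)
{
	unsigned long old = atomic_fetch_and(&p->flags, ~LRU_MODEL_PG_LRU);

	return (old & LRU_MODEL_PG_LRU) != 0;
}

/* Models put_page_testzero(): true if we dropped the last reference. */
static bool lru_model_put_testzero(struct lru_model_page *p)
{
	return atomic_fetch_sub(&p->refcount, 1) == 1;
}

/*
 * Case 2b: the reference count is already zero, so nobody else can
 * reach the flags and a plain load/store pair is enough.
 */
static void lru_model_clear_nonatomic(struct lru_model_page *p)
{
	unsigned long f;

	assert(atomic_load(&p->refcount) == 0);
	f = atomic_load_explicit(&p->flags, memory_order_relaxed);
	atomic_store_explicit(&p->flags, f & ~LRU_MODEL_PG_LRU,
			      memory_order_relaxed);
}
```

The point of the split is that the atomic step in case 2a lets exactly
one of several racing callers win the bit, while case 2b can skip that
cost only because the reference count has already reached zero.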

> >> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
> >>                                 spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
> >>                         }
> >>
> >> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
> >> -                       VM_BUG_ON_PAGE(!PageLRU(page), page);
> >>                         __ClearPageLRU(page);
> >> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
> >>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> >>                 }
> >>
> >
> > Same here. You are just moving the flag clearing, but you didn't
> > combine it with the test. It seems like if you are expecting this to
> > be treated as an atomic operation. It should be a relatively low cost
> > to do since you already should own the cacheline as a result of
> > calling put_page_testzero so I am not sure why you are not combining
> > the two.
>
> Before the ClearPageLRU there is a put_page_testzero(), which means no one is
> using this page, and isolate_lru_page cannot run on this page because of the
> check inside that function:
>         VM_BUG_ON_PAGE(!page_count(page), page);
> So it would be safe here.

Okay, so this is another 2b case as defined above then.

> >
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index c1c4259b4de5..18986fefd49b 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1548,16 +1548,16 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
> >>  {
> >>         int ret = -EINVAL;
> >>
> >> -       /* Only take pages on the LRU. */
> >> -       if (!PageLRU(page))
> >> -               return ret;
> >> -
> >>         /* Compaction should not handle unevictable pages but CMA can do so */
> >>         if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
> >>                 return ret;
> >>
> >>         ret = -EBUSY;
> >>
> >> +       /* Only take pages on the LRU. */
> >> +       if (!PageLRU(page))
> >> +               return ret;
> >> +
> >>         /*
> >>          * To minimise LRU disruption, the caller can indicate that it only
> >>          * wants to isolate pages it will be able to operate on without
> >> @@ -1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >>                 page = lru_to_page(src);
> >>                 prefetchw_prev_lru_page(page, src, flags);
> >>
> >> -               VM_BUG_ON_PAGE(!PageLRU(page), page);
> >> -
> >>                 nr_pages = compound_nr(page);
> >>                 total_scan += nr_pages;
> >>
> >
> > So effectively the changes here are making it so that a !PageLRU page
> > will cycle to the start of the LRU list. Now if I understand correctly
> > we are guaranteed that if the flag is not set it cannot be set while
> > we are holding the lru_lock, however it can be cleared while we are
> > holding the lock, correct? Thus that is why isolate_lru_pages has to
> > call TestClearPageLRU after the earlier check in __isolate_lru_page.
>
> Right.
>
> >
> > It might make it more readable to pull in the later patch that
> > modifies isolate_lru_pages that has it using TestClearPageLRU.
> As for this change, it has to be done in this patch, since any TestClearPageLRU
> may leave a page on the lru list with its lru bit missing, so the precondition
> check has to be removed here.

So I think some of my cognitive dissonance is from the fact that you
really are doing two different things here. You aren't really
implementing the full TestClearPageLRU until patch 15. So this patch
is doing part of 2a and 2b, and then patch 15 is following up and
completing the 2a cases. I still think it might make more sense to
pull out the pieces related to 2b and move them into a patch before
this with documentation explaining that there should be no competition
for the LRU flag because the page has transitioned to a reference
count of zero. Then take the remaining bits and combine them with
patch 15 since the description for the two is pretty similar.
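
The isolate_lru_pages behaviour discussed above can be sketched in
userspace as follows. This is an illustration under assumptions, not
the kernel code: struct scan_page, scan_test_clear_lru() and
scan_isolate() are hypothetical names, the list is modelled as an
array, and the caller is assumed to hold the lock that keeps the bit
from being set during the scan.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page sitting on an LRU list. */
struct scan_page {
	atomic_bool lru;  /* models PG_lru */
	bool on_list;     /* still linked on the modelled list */
};

/* Models TestClearPageLRU(): clear atomically, report the old value. */
static bool scan_test_clear_lru(struct scan_page *p)
{
	return atomic_exchange(&p->lru, false);
}

/*
 * Models the scan loop. A page whose bit is already clear is left
 * linked and skipped instead of tripping a BUG check. Returns the
 * number of pages taken off the list.
 */
static size_t scan_isolate(struct scan_page *pages, size_t n)
{
	size_t taken = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].on_list)
			continue;
		if (!scan_test_clear_lru(&pages[i]))
			continue; /* someone else owns the bit right now */
		pages[i].on_list = false;
		taken++;
	}
	return taken;
}
```

With the precondition check gone, a page that lost the race for its
bit is simply passed over and can be picked up by a later scan once
whoever cleared the bit has set it again.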

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock
  2020-07-11  0:58 ` [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock Alex Shi
@ 2020-07-17 20:30     ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-17 20:30 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov

On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> This patch reorders the isolation steps during munlock, moves the lru lock
> to guard each page, and unfolds the __munlock_isolate_lru_page func, in
> preparation for the lru lock change.
>
> __split_huge_page_refcount no longer exists, but we still have to guard
> PageMlocked and PageLRU for tail pages in __split_huge_page_tail.
>
> [lkp@intel.com: found a sleeping function bug ... at mm/rmap.c]
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/mlock.c | 93 ++++++++++++++++++++++++++++++++++----------------------------
>  1 file changed, 51 insertions(+), 42 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 228ba5a8e0a5..0bdde88b4438 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -103,25 +103,6 @@ void mlock_vma_page(struct page *page)
>  }
>
>  /*
> - * Isolate a page from LRU with optional get_page() pin.
> - * Assumes lru_lock already held and page already pinned.
> - */
> -static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
> -{
> -       if (TestClearPageLRU(page)) {
> -               struct lruvec *lruvec;
> -
> -               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (getpage)
> -                       get_page(page);
> -               del_page_from_lru_list(page, lruvec, page_lru(page));
> -               return true;
> -       }
> -
> -       return false;
> -}
> -
> -/*
>   * Finish munlock after successful page isolation
>   *
>   * Page must be locked. This is a wrapper for try_to_munlock()
> @@ -181,6 +162,7 @@ static void __munlock_isolation_failed(struct page *page)
>  unsigned int munlock_vma_page(struct page *page)
>  {
>         int nr_pages;
> +       bool clearlru = false;
>         pg_data_t *pgdat = page_pgdat(page);
>
>         /* For try_to_munlock() and to serialize with page migration */
> @@ -189,32 +171,42 @@ unsigned int munlock_vma_page(struct page *page)
>         VM_BUG_ON_PAGE(PageTail(page), page);
>
>         /*
> -        * Serialize with any parallel __split_huge_page_refcount() which
> +        * Serialize split tail pages in __split_huge_page_tail() which
>          * might otherwise copy PageMlocked to part of the tail pages before
>          * we clear it in the head page. It also stabilizes hpage_nr_pages().
>          */
> +       get_page(page);

I don't think this get_page() call needs to be up here. It could be
left further down, just before we delete the page from the LRU list,
since the reference is really only needed before we call
__munlock_isolated_page(), or at least that is the way it looks to me.
By doing that you can avoid a bunch of cleanup in these exception
cases.

> +       clearlru = TestClearPageLRU(page);

I'm not sure I fully understand the reason for moving this here. Does
clearing this flag before you clear Mlocked give you some sort of
extra protection? I don't see how, since Mlocked doesn't necessarily
imply the page is on the LRU.

>         spin_lock_irq(&pgdat->lru_lock);
>
>         if (!TestClearPageMlocked(page)) {
> -               /* Potentially, PTE-mapped THP: do not skip the rest PTEs */
> -               nr_pages = 1;
> -               goto unlock_out;
> +               if (clearlru)
> +                       SetPageLRU(page);
> +               /*
> +                * Potentially, PTE-mapped THP: do not skip the rest PTEs
> +                * Reuse lock as memory barrier for release_pages racing.
> +                */
> +               spin_unlock_irq(&pgdat->lru_lock);
> +               put_page(page);
> +               return 0;
>         }
>
>         nr_pages = hpage_nr_pages(page);
>         __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
>
> -       if (__munlock_isolate_lru_page(page, true)) {
> +       if (clearlru) {
> +               struct lruvec *lruvec;
> +

You could just place the get_page() call here.

> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               del_page_from_lru_list(page, lruvec, page_lru(page));
>                 spin_unlock_irq(&pgdat->lru_lock);
>                 __munlock_isolated_page(page);
> -               goto out;
> +       } else {
> +               spin_unlock_irq(&pgdat->lru_lock);
> +               put_page(page);
> +               __munlock_isolation_failed(page);

If you move the get_page() as I suggested above there wouldn't be a
need for the put_page(). It then becomes possible to simplify the code
a bit by merging the unlock paths and doing an if/else with the
__munlock functions like so:
if (clearlru) {
    ...
    del_page_from_lru..
}

spin_unlock_irq()

if (clearlru)
    __munlock_isolated_page();
else
    __munlock_isolation_failed();
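
A slightly fuller, single-threaded model of the shape suggested above,
in case it helps. Everything here is hypothetical (struct mvp_model
and the mvp_* helpers are not kernel names), the lock is reduced to a
flag, and the sketch says nothing about whether taking the reference
this late is safe against a concurrent free.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names come from the kernel. */
struct mvp_model {
	int refcount;
	bool lru;            /* models PG_lru */
	bool mlocked;        /* models PG_mlocked */
	bool on_list;        /* linked on the modelled LRU list */
	bool lock_held;      /* models lru_lock */
	int isolated_calls;  /* times the "isolated" path ran */
	int failed_calls;    /* times the "isolation failed" path ran */
};

/* Single-threaded stand-in for a TestClear*() helper. */
static bool mvp_test_clear(bool *flag)
{
	bool old = *flag;

	*flag = false;
	return old;
}

static void mvp_munlock_vma_page(struct mvp_model *m)
{
	bool clearlru = mvp_test_clear(&m->lru);

	m->lock_held = true;            /* spin_lock_irq() */
	if (!mvp_test_clear(&m->mlocked)) {
		if (clearlru)
			m->lru = true;  /* hand the LRU bit back */
		m->lock_held = false;
		return;                 /* no reference taken, none to drop */
	}
	if (clearlru) {
		m->refcount++;          /* get_page() only when isolating */
		m->on_list = false;     /* del_page_from_lru_list() */
	}
	m->lock_held = false;           /* the single unlock path */

	if (clearlru)
		m->isolated_calls++;    /* __munlock_isolated_page() */
	else
		m->failed_calls++;      /* __munlock_isolation_failed() */
}
```

The extra reference is assumed to be dropped later by the isolated
path, so neither early exit needs a matching put_page().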

>         }
> -       __munlock_isolation_failed(page);
> -
> -unlock_out:
> -       spin_unlock_irq(&pgdat->lru_lock);
>
> -out:
>         return nr_pages - 1;
>  }
>
> @@ -297,34 +289,51 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>         pagevec_init(&pvec_putback);
>
>         /* Phase 1: page isolation */
> -       spin_lock_irq(&zone->zone_pgdat->lru_lock);
>         for (i = 0; i < nr; i++) {
>                 struct page *page = pvec->pages[i];
> +               struct lruvec *lruvec;
> +               bool clearlru;
>
> -               if (TestClearPageMlocked(page)) {
> -                       /*
> -                        * We already have pin from follow_page_mask()
> -                        * so we can spare the get_page() here.
> -                        */
> -                       if (__munlock_isolate_lru_page(page, false))
> -                               continue;
> -                       else
> -                               __munlock_isolation_failed(page);
> -               } else {
> +               clearlru = TestClearPageLRU(page);
> +               spin_lock_irq(&zone->zone_pgdat->lru_lock);

I still don't see what you are gaining by moving the bit test up to
this point. Seems like it would be better left below with the lock
just being used to prevent a possible race while you are pulling the
page out of the LRU list.

> +
> +               if (!TestClearPageMlocked(page)) {
>                         delta_munlocked++;
> +                       if (clearlru)
> +                               SetPageLRU(page);
> +                       goto putback;
> +               }
> +
> +               if (!clearlru) {
> +                       __munlock_isolation_failed(page);
> +                       goto putback;
>                 }

With the other function you were processing this outside of the lock,
here you are doing it inside. It would probably make more sense here
to follow similar logic and take care of the del_page_from_lru_list
if clearlru is set, unlock, and then if clearlru is set continue, else
track the isolation failure. That way you can avoid having to use as
many jump labels.
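
Roughly what I have in mind for the loop, as a single-threaded
userspace model. All names here (struct pv_page, struct pv_result,
pv_isolate_phase) are made up for the sketch, the lock is only marked
by comments, and the counters stand in for the real bookkeeping rather
than reproducing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page model state. */
struct pv_page {
	bool lru;      /* models PG_lru */
	bool mlocked;  /* models PG_mlocked */
	bool on_list;  /* linked on the modelled LRU list */
};

/* Hypothetical summary of what phase 1 did. */
struct pv_result {
	size_t isolated;     /* pages kept for the next phase */
	size_t failed;       /* mlocked pages that were not on the LRU */
	size_t not_mlocked;  /* pages whose Mlocked bit was already clear */
	size_t putback;      /* pages queued to have their pin released */
};

/* Single-threaded stand-in for a TestClear*() helper. */
static bool pv_test_clear(bool *flag)
{
	bool old = *flag;

	*flag = false;
	return old;
}

static struct pv_result pv_isolate_phase(struct pv_page *pages, size_t nr)
{
	struct pv_result r = { 0 };

	for (size_t i = 0; i < nr; i++) {
		struct pv_page *p = &pages[i];
		bool clearlru = pv_test_clear(&p->lru);
		bool was_mlocked;

		/* lock taken here */
		was_mlocked = pv_test_clear(&p->mlocked);
		if (!was_mlocked) {
			r.not_mlocked++;
			if (clearlru)
				p->lru = true;    /* hand the bit back */
		} else if (clearlru) {
			p->on_list = false;       /* del_page_from_lru_list() */
		}
		/* lock dropped here, on the one and only unlock path */

		if (was_mlocked && clearlru) {
			r.isolated++;
			continue;
		}
		if (was_mlocked)
			r.failed++;               /* isolation failed */
		r.putback++;
	}
	return r;
}
```

Only list manipulation and the flag restore happen between the lock
markers; the failure accounting and the putback run after the unlock,
which is what removes the need for the jump label.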

>                 /*
> +                * Isolate this page.
> +                * We already have pin from follow_page_mask()
> +                * so we can spare the get_page() here.
> +                */
> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               del_page_from_lru_list(page, lruvec, page_lru(page));
> +               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
> +               continue;
> +
> +               /*
>                  * We won't be munlocking this page in the next phase
>                  * but we still need to release the follow_page_mask()
>                  * pin. We cannot do it under lru_lock however. If it's
>                  * the last pin, __page_cache_release() would deadlock.
>                  */
> +putback:
> +               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>                 pagevec_add(&pvec_putback, pvec->pages[i]);
>                 pvec->pages[i] = NULL;
>         }
> +       /* tempary disable irq, will remove later */
> +       local_irq_disable();
>         __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> -       spin_unlock_irq(&zone->zone_pgdat->lru_lock);
> +       local_irq_enable();
>
>         /* Now we can release pins of pages that we are not munlocking */
>         pagevec_release(&pvec_putback);
> --
> 1.8.3.1
>
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 21/22] mm/pgdat: remove pgdat lru_lock
  2020-07-11  0:58 ` [PATCH v16 21/22] mm/pgdat: remove pgdat lru_lock Alex Shi
@ 2020-07-17 21:09     ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-17 21:09 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov

On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> ---
>  include/linux/mmzone.h | 1 -
>  mm/page_alloc.c        | 1 -
>  2 files changed, 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 36c1680efd90..8d7318ce5f62 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -735,7 +735,6 @@ struct deferred_split {
>
>         /* Write-intensive fields used by page reclaim */
>         ZONE_PADDING(_pad1_)
> -       spinlock_t              lru_lock;
>
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>         /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e028b87ce294..4d7df42b32d6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6721,7 +6721,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>         init_waitqueue_head(&pgdat->pfmemalloc_wait);
>
>         pgdat_page_ext_init(pgdat);
> -       spin_lock_init(&pgdat->lru_lock);
>         lruvec_init(&pgdat->__lruvec);
>  }
>

This patch would probably make more sense as part of patch 18 since
you removed all of the users of this field there.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-11  0:58 ` [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2020-07-17 21:38     ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-17 21:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov

On Fri, Jul 10, 2020 at 6:00 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> This patch moves the per-node lru_lock into lruvec, giving each memcg its
> own lru_lock on each node. On a large machine, memcgs no longer have to
> contend for the per-node pgdat->lru_lock; each can proceed under its own
> lru_lock.
>
> Once the memcg charge is moved before LRU insertion, page isolation can
> serialize changes to the page's memcg, so the per-memcg lruvec lock is
> stable and can replace the per-node lru lock.
>
> Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
> containers on a 2s * 26cores * HT box with a modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this and later patches, the readtwice performance increases about
> 80% within concurrent containers.
>
> Also add a debug function to the locking path, which may give some clues
> if something goes wrong.
>
> Hugh Dickins helped polish the patch, thanks!
>
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: cgroups@vger.kernel.org
> ---
>  include/linux/memcontrol.h |  98 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mmzone.h     |   2 +
>  mm/compaction.c            |  67 +++++++++++++++++++-----------
>  mm/huge_memory.c           |  11 ++---
>  mm/memcontrol.c            |  63 +++++++++++++++++++++++++++-
>  mm/mlock.c                 |  32 +++++++--------
>  mm/mmzone.c                |   1 +
>  mm/swap.c                  | 100 +++++++++++++++++++++------------------------
>  mm/vmscan.c                |  70 +++++++++++++++++--------------
>  9 files changed, 310 insertions(+), 134 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e77197a62809..6e670f991b42 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -411,6 +411,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>
>  struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
>
> +struct lruvec *lock_page_lruvec(struct page *page);
> +struct lruvec *lock_page_lruvec_irq(struct page *page);
> +struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +                                               unsigned long *flags);
> +
> +#ifdef CONFIG_DEBUG_VM
> +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
> +#else
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
> +#endif
> +
>  static inline
>  struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>         return css ? container_of(css, struct mem_cgroup, css) : NULL;
> @@ -892,6 +905,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
>  {
>  }
>
> +static inline struct lruvec *lock_page_lruvec(struct page *page)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       spin_lock(&pgdat->__lruvec.lru_lock);
> +       return &pgdat->__lruvec;
> +}
> +
> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       spin_lock_irq(&pgdat->__lruvec.lru_lock);
> +       return &pgdat->__lruvec;
> +}
> +
> +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +               unsigned long *flagsp)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
> +       return &pgdat->__lruvec;
> +}
> +
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>                 struct mem_cgroup *prev,
> @@ -1126,6 +1164,10 @@ static inline void count_memcg_page_event(struct page *page,
>  void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
>  #endif /* CONFIG_MEMCG */
>
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> @@ -1255,6 +1297,62 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
>         return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
>  }
>
> +static inline void unlock_page_lruvec(struct lruvec *lruvec)
> +{
> +       spin_unlock(&lruvec->lru_lock);
> +}
> +
> +static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
> +{
> +       spin_unlock_irq(&lruvec->lru_lock);
> +}
> +
> +static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
> +               unsigned long flags)
> +{
> +       spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> +}
> +
> +/* Don't lock again iff page's lruvec locked */
> +static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
> +               struct lruvec *locked_lruvec)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +       bool locked;
> +
> +       rcu_read_lock();
> +       locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> +       rcu_read_unlock();
> +
> +       if (locked)
> +               return locked_lruvec;
> +
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irq(locked_lruvec);
> +
> +       return lock_page_lruvec_irq(page);
> +}
> +
> +/* Don't lock again iff page's lruvec locked */
> +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
> +               struct lruvec *locked_lruvec, unsigned long *flags)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +       bool locked;
> +
> +       rcu_read_lock();
> +       locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> +       rcu_read_unlock();
> +
> +       if (locked)
> +               return locked_lruvec;
> +
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> +
> +       return lock_page_lruvec_irqsave(page, flags);
> +}
> +

These relock functions have no users in this patch. It might make
sense to push this code to patch 19 in your series, since that is
where they are first used. In addition, they don't seem very efficient:
you already had to call mem_cgroup_page_lruvec() once, so why do it
again when you could just store the value and lock the new lruvec if
needed?
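
A rough userspace sketch of the "store the value" idea follows. Every
name in it (page_lruvec, lruvec_lock, relock_page_lruvec_once, the
struct layouts) is a stand-in invented for illustration, not kernel
code; it only models the control flow of looking the lruvec up once
and locking that result directly.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's definitions. */
struct lruvec { int lock_depth; };          /* models lruvec->lru_lock */
struct page   { struct lruvec *lruvec; };   /* models the page's lruvec */

static int lookups;                         /* counts lookup calls */

/* Stand-in for mem_cgroup_page_lruvec(). */
static struct lruvec *page_lruvec(struct page *page)
{
	lookups++;
	return page->lruvec;
}

static void lruvec_lock(struct lruvec *lruvec)   { lruvec->lock_depth++; }
static void lruvec_unlock(struct lruvec *lruvec) { lruvec->lock_depth--; }

/*
 * Hypothetical relock helper: look the lruvec up once, then lock the
 * stored result instead of calling a lock helper that would repeat
 * the lookup.
 */
static struct lruvec *relock_page_lruvec_once(struct page *page,
					      struct lruvec *locked)
{
	struct lruvec *new_lruvec = page_lruvec(page);

	assert(new_lruvec != NULL);

	if (new_lruvec == locked)
		return locked;		/* already hold the right lock */

	if (locked)
		lruvec_unlock(locked);

	lruvec_lock(new_lruvec);	/* reuse the stored lookup */
	return new_lruvec;
}
```

In the kernel the lookup and the lock acquisition would presumably have
to stay inside one RCU read-side section, as the lock_page_lruvec*()
helpers in this patch do; the model above leaves that out.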

>  #ifdef CONFIG_CGROUP_WRITEBACK
>
>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 14c668b7e793..36c1680efd90 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -261,6 +261,8 @@ struct lruvec {
>         atomic_long_t                   nonresident_age;
>         /* Refaults at the time of last reclaim cycle */
>         unsigned long                   refaults;
> +       /* per lruvec lru_lock for memcg */
> +       spinlock_t                      lru_lock;
>         /* Various lruvec state flags (enum lruvec_flags) */
>         unsigned long                   flags;

Any reason for placing this here instead of at the end of the
structure? From what I can tell it looks like lruvec is already 128B
long so placing the lock on the end would put it into the next
cacheline which may provide some performance benefit since it is
likely to be bounced quite a bit.
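
To make the layout argument concrete, here is a small userspace
sketch. The 64-byte cacheline, the field sizes, and the struct names
are all assumptions chosen for illustration; the real struct lruvec
layout depends on the kernel config.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed cacheline size */

/* Hypothetical stand-in for spinlock_t. */
typedef struct { int raw; } demo_lock_t;

/* Lock in the middle: it shares a cacheline with neighbouring fields. */
struct lruvec_mid {
	char		earlier_fields[96];	/* made-up size */
	demo_lock_t	lru_lock;
	char		later_fields[28];	/* made-up size */
};

/* Lock after 128 bytes of other fields: it starts a new cacheline. */
struct lruvec_end {
	char		other_fields[128];
	demo_lock_t	lru_lock;
};

/* Index of the cacheline that a given struct offset falls into. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE;
}
```

With these made-up sizes, the lock in lruvec_mid sits in the second
cacheline together with other fields, while the lock in lruvec_end
begins the third one. That only holds if the structure itself starts
on a cacheline boundary.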

>  #ifdef CONFIG_MEMCG
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 2da2933fe56b..88bbd2e93895 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         unsigned long nr_scanned = 0, nr_isolated = 0;
>         struct lruvec *lruvec;
>         unsigned long flags = 0;
> -       bool locked = false;
> +       struct lruvec *locked_lruvec = NULL;
>         struct page *page = NULL, *valid_page = NULL;
>         unsigned long start_pfn = low_pfn;
>         bool skip_on_failure = false;
> @@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * contention, to give chance to IRQs. Abort completely if
>                  * a fatal signal is pending.
>                  */
> -               if (!(low_pfn % SWAP_CLUSTER_MAX)
> -                   && compact_unlock_should_abort(&pgdat->lru_lock,
> -                                           flags, &locked, cc)) {
> -                       low_pfn = 0;
> -                       goto fatal_pending;
> +               if (!(low_pfn % SWAP_CLUSTER_MAX)) {
> +                       if (locked_lruvec) {
> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +                                                                       flags);
> +                               locked_lruvec = NULL;
> +                       }
> +
> +                       if (fatal_signal_pending(current)) {
> +                               cc->contended = true;
> +
> +                               low_pfn = 0;
> +                               goto fatal_pending;
> +                       }
> +
> +                       cond_resched();
>                 }
>
>                 if (!pfn_valid_within(low_pfn))
> @@ -922,10 +932,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(__PageMovable(page)) &&
>                                         !PageIsolated(page)) {
> -                               if (locked) {
> -                                       spin_unlock_irqrestore(&pgdat->lru_lock,
> -                                                                       flags);
> -                                       locked = false;
> +                               if (locked_lruvec) {
> +                                       unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> +                                       locked_lruvec = NULL;
>                                 }
>
>                                 if (!isolate_movable_page(page, isolate_mode))
> @@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!TestClearPageLRU(page))
>                         goto isolate_fail_put;
>
> +               rcu_read_lock();
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
>                 /* If we already hold the lock, we can skip some rechecking */
> -               if (!locked) {
> -                       locked = compact_lock_irqsave(&pgdat->lru_lock,
> -                                                               &flags, cc);
> +               if (lruvec != locked_lruvec) {
> +                       if (locked_lruvec)
> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +                                                                       flags);
> +
> +                       compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> +                       locked_lruvec = lruvec;
> +                       rcu_read_unlock();
> +
> +                       lruvec_memcg_debug(lruvec, page);
>
>                         /* Try get exclusive access under lock */
>                         if (!skip_updated) {
> @@ -988,9 +1007,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                 SetPageLRU(page);
>                                 goto isolate_fail_put;
>                         }
> -               }
> -
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +               } else
> +                       rcu_read_unlock();
>
>                 /* The whole page is taken off the LRU; skip the tail pages. */
>                 if (PageCompound(page))
> @@ -1023,9 +1041,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>
>  isolate_fail_put:
>                 /* Avoid potential deadlock in freeing page under lru_lock */
> -               if (locked) {
> -                       spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                       locked = false;
> +               if (locked_lruvec) {
> +                       unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> +                       locked_lruvec = NULL;
>                 }
>                 put_page(page);
>
> @@ -1039,9 +1057,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * page anyway.
>                  */
>                 if (nr_isolated) {
> -                       if (locked) {
> -                               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                               locked = false;
> +                       if (locked_lruvec) {
> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +                                                                       flags);
> +                               locked_lruvec = NULL;
>                         }
>                         putback_movable_pages(&cc->migratepages);
>                         cc->nr_migratepages = 0;
> @@ -1068,8 +1087,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         page = NULL;
>
>  isolate_abort:
> -       if (locked)
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irqrestore(locked_lruvec, flags);
>         if (page) {
>                 SetPageLRU(page);
>                 put_page(page);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4fe7b92c9330..1ff0c1ff6a52 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2346,7 +2346,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
>         VM_BUG_ON_PAGE(!PageHead(head), head);
>         VM_BUG_ON_PAGE(PageCompound(page_tail), head);
>         VM_BUG_ON_PAGE(PageLRU(page_tail), head);
> -       lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
> +       lockdep_assert_held(&lruvec->lru_lock);
>
>         if (list) {
>                 /* page reclaim is reclaiming a huge page */
> @@ -2429,7 +2429,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>                               pgoff_t end)
>  {
>         struct page *head = compound_head(page);
> -       pg_data_t *pgdat = page_pgdat(head);
>         struct lruvec *lruvec;
>         struct address_space *swap_cache = NULL;
>         unsigned long offset = 0;
> @@ -2446,10 +2445,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>                 xa_lock(&swap_cache->i_pages);
>         }
>
> -       /* prevent PageLRU to go away from under us, and freeze lru stats */
> -       spin_lock(&pgdat->lru_lock);
> -
> -       lruvec = mem_cgroup_page_lruvec(head, pgdat);
> +       /* lock lru list/PageCompound, ref freezed by page_ref_freeze */
> +       lruvec = lock_page_lruvec(head);
>
>         for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>                 __split_huge_page_tail(head, i, lruvec, list);
> @@ -2470,7 +2467,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>         }
>
>         ClearPageCompound(head);
> -       spin_unlock(&pgdat->lru_lock);
> +       unlock_page_lruvec(lruvec);
>         /* Caller disabled irqs, so they are still disabled here */
>
>         split_page_owner(head, HPAGE_PMD_ORDER);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fde47272b13c..d5e56be42f21 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1196,6 +1196,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>         return ret;
>  }
>
> +#ifdef CONFIG_DEBUG_VM
> +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       if (!page->mem_cgroup)
> +               VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
> +       else
> +               VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
> +}
> +#endif
> +
>  /**
>   * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
>   * @page: the page
> @@ -1215,7 +1228,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>                 goto out;
>         }
>
> -       memcg = page->mem_cgroup;
> +       VM_BUG_ON_PAGE(PageTail(page), page);
> +       memcg = READ_ONCE(page->mem_cgroup);
>         /*
>          * Swapcache readahead pages are added to the LRU - and
>          * possibly migrated - before they are charged.
> @@ -1236,6 +1250,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>         return lruvec;
>  }
>
> +struct lruvec *lock_page_lruvec(struct page *page)
> +{
> +       struct lruvec *lruvec;
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       rcu_read_lock();
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +       spin_lock(&lruvec->lru_lock);
> +       rcu_read_unlock();
> +
> +       lruvec_memcg_debug(lruvec, page);
> +
> +       return lruvec;
> +}
> +
> +struct lruvec *lock_page_lruvec_irq(struct page *page)
> +{
> +       struct lruvec *lruvec;
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       rcu_read_lock();
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +       spin_lock_irq(&lruvec->lru_lock);
> +       rcu_read_unlock();
> +
> +       lruvec_memcg_debug(lruvec, page);
> +
> +       return lruvec;
> +}
> +
> +struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
> +{
> +       struct lruvec *lruvec;
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       rcu_read_lock();
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +       spin_lock_irqsave(&lruvec->lru_lock, *flags);
> +       rcu_read_unlock();
> +
> +       lruvec_memcg_debug(lruvec, page);
> +
> +       return lruvec;
> +}
> +
>  /**
>   * mem_cgroup_update_lru_size - account for adding or removing an lru page
>   * @lruvec: mem_cgroup per zone lru vector
> @@ -2999,7 +3058,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>
>  /*
>   * Because tail pages are not marked as "used", set it. We're under
> - * pgdat->lru_lock and migration entries setup in all page mappings.
> + * lruvec->lru_lock and migration entries setup in all page mappings.
>   */
>  void mem_cgroup_split_huge_fixup(struct page *head)
>  {
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0bdde88b4438..cb23a0c2cfbf 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -163,7 +163,7 @@ unsigned int munlock_vma_page(struct page *page)
>  {
>         int nr_pages;
>         bool clearlru = false;
> -       pg_data_t *pgdat = page_pgdat(page);
> +       struct lruvec *lruvec;
>
>         /* For try_to_munlock() and to serialize with page migration */
>         BUG_ON(!PageLocked(page));
> @@ -177,7 +177,7 @@ unsigned int munlock_vma_page(struct page *page)
>          */
>         get_page(page);
>         clearlru = TestClearPageLRU(page);
> -       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = lock_page_lruvec_irq(page);
>
>         if (!TestClearPageMlocked(page)) {
>                 if (clearlru)
> @@ -186,7 +186,7 @@ unsigned int munlock_vma_page(struct page *page)
>                  * Potentially, PTE-mapped THP: do not skip the rest PTEs
>                  * Reuse lock as memory barrier for release_pages racing.
>                  */
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>                 put_page(page);
>                 return 0;
>         }
> @@ -195,14 +195,11 @@ unsigned int munlock_vma_page(struct page *page)
>         __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
>
>         if (clearlru) {
> -               struct lruvec *lruvec;
> -
> -               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>                 __munlock_isolated_page(page);
>         } else {
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>                 put_page(page);
>                 __munlock_isolation_failed(page);
>         }
> @@ -284,6 +281,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>         int nr = pagevec_count(pvec);
>         int delta_munlocked = -nr;
>         struct pagevec pvec_putback;
> +       struct lruvec *lruvec = NULL;
>         int pgrescued = 0;
>
>         pagevec_init(&pvec_putback);
> @@ -291,11 +289,17 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>         /* Phase 1: page isolation */
>         for (i = 0; i < nr; i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *lruvec;
> +               struct lruvec *new_lruvec;
>                 bool clearlru;
>
>                 clearlru = TestClearPageLRU(page);
> -               spin_lock_irq(&zone->zone_pgdat->lru_lock);
> +
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (new_lruvec != lruvec) {
> +                       if (lruvec)
> +                               unlock_page_lruvec_irq(lruvec);
> +                       lruvec = lock_page_lruvec_irq(page);
> +               }

So instead of trying to optimize things here you should go for parity.
If you are taking the old lru_lock once per pass you should do that
here too. You can come back through and optimize with the relock
approach later.
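
As an illustration of what parity could look like, the userspace model
below takes and drops the lock once per loop iteration, the way the
pgdat lock was taken before this patch. The names
(lock_page_lruvec_model, isolate_pages_parity) and the struct layouts
are hypothetical stand-ins, not the kernel helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's definitions. */
struct lruvec { int lock_depth; int acquisitions; };
struct page   { struct lruvec *lruvec; };

/* Stand-in for lock_page_lruvec_irq(): look up and lock in one step. */
static struct lruvec *lock_page_lruvec_model(struct page *page)
{
	struct lruvec *lruvec = page->lruvec;

	assert(lruvec->lock_depth == 0);	/* never held across iterations */
	lruvec->lock_depth++;
	lruvec->acquisitions++;
	return lruvec;
}

static void unlock_page_lruvec_model(struct lruvec *lruvec)
{
	lruvec->lock_depth--;
}

/* Lock, do the per-page work, unlock; returns the number of pages seen. */
static int isolate_pages_parity(struct page **pages, int nr)
{
	int i, done = 0;

	for (i = 0; i < nr; i++) {
		struct lruvec *lruvec = lock_page_lruvec_model(pages[i]);

		/* per-page isolation work would go here */
		done++;

		unlock_page_lruvec_model(lruvec);
	}
	return done;
}
```

This costs one acquisition per page even when consecutive pages share
an lruvec, which is the overhead a later relock-style optimization
would remove.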

>                 if (!TestClearPageMlocked(page)) {
>                         delta_munlocked++;
> @@ -314,9 +318,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>                  * We already have pin from follow_page_mask()
>                  * so we can spare the get_page() here.
>                  */
> -               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
> -               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>                 continue;
>
>                 /*
> @@ -326,14 +328,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>                  * the last pin, __page_cache_release() would deadlock.
>                  */
>  putback:
> -               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>                 pagevec_add(&pvec_putback, pvec->pages[i]);
>                 pvec->pages[i] = NULL;
>         }
> -       /* tempary disable irq, will remove later */
> -       local_irq_disable();
>         __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> -       local_irq_enable();
> +       if (lruvec)
> +               unlock_page_lruvec_irq(lruvec);

So I am not a fan of this change. You went to all the trouble of
reducing the lock scope just to bring it back out here again. In
addition it implies there is a path where you might try to update the
page state without disabling interrupts.
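
The path in question can be modelled in userspace as below. This is
only a sketch: irqs_disabled_model, unsafe_updates and the two update
functions are invented names, and an empty batch is used as the case
where the lock is never taken.

```c
#include <assert.h>
#include <stdbool.h>

static bool irqs_disabled_model;	/* models the local IRQ state */
static int unsafe_updates;		/* updates made with IRQs enabled */

/* Stand-in for __mod_zone_page_state(), which expects IRQs disabled. */
static void mod_zone_page_state_model(long *counter, long delta)
{
	if (!irqs_disabled_model)
		unsafe_updates++;
	*counter += delta;
}

/* Lock (and IRQ disable) taken lazily in the loop; update relies on it. */
static void update_conditional(int nr_pages, long *counter)
{
	bool locked = false;
	int i;

	for (i = 0; i < nr_pages; i++) {
		if (!locked) {
			irqs_disabled_model = true;
			locked = true;
		}
	}
	mod_zone_page_state_model(counter, -nr_pages);
	if (locked)
		irqs_disabled_model = false;
}

/* IRQs disabled around the update regardless of what the loop did. */
static void update_unconditional(int nr_pages, long *counter)
{
	irqs_disabled_model = true;
	mod_zone_page_state_model(counter, -nr_pages);
	irqs_disabled_model = false;
}
```

In this model, a batch that never takes the lock reaches the counter
update with interrupts still enabled in the conditional variant, while
the unconditional variant never does.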

>         /* Now we can release pins of pages that we are not munlocking */
>         pagevec_release(&pvec_putback);
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index 4686fdc23bb9..3750a90ed4a0 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
>         enum lru_list lru;
>
>         memset(lruvec, 0, sizeof(struct lruvec));
> +       spin_lock_init(&lruvec->lru_lock);
>
>         for_each_lru(lru)
>                 INIT_LIST_HEAD(&lruvec->lists[lru]);
> diff --git a/mm/swap.c b/mm/swap.c
> index 8488b9b25730..129c532357a4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -79,15 +79,13 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
>  static void __page_cache_release(struct page *page)
>  {
>         if (PageLRU(page)) {
> -               pg_data_t *pgdat = page_pgdat(page);
>                 struct lruvec *lruvec;
>                 unsigned long flags;
>
>                 __ClearPageLRU(page);
> -               spin_lock_irqsave(&pgdat->lru_lock, flags);
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +               lruvec = lock_page_lruvec_irqsave(page, &flags);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +               unlock_page_lruvec_irqrestore(lruvec, flags);
>         }
>         __ClearPageWaiters(page);
>  }
> @@ -206,32 +204,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>         void (*move_fn)(struct page *page, struct lruvec *lruvec))
>  {
>         int i;
> -       struct pglist_data *pgdat = NULL;
> -       struct lruvec *lruvec;
> +       struct lruvec *lruvec = NULL;
>         unsigned long flags = 0;
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
>                 struct page *page = pvec->pages[i];
> -               struct pglist_data *pagepgdat = page_pgdat(page);
> +               struct lruvec *new_lruvec;
>
> -               if (pagepgdat != pgdat) {
> -                       if (pgdat)
> -                               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                       pgdat = pagepgdat;
> -                       spin_lock_irqsave(&pgdat->lru_lock, flags);
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (lruvec != new_lruvec) {
> +                       if (lruvec)
> +                               unlock_page_lruvec_irqrestore(lruvec, flags);
> +                       lruvec = lock_page_lruvec_irqsave(page, &flags);
>                 }
>
>                 /* block memcg migration during page moving between lru */
>                 if (!TestClearPageLRU(page))
>                         continue;
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 (*move_fn)(page, lruvec);
>
>                 SetPageLRU(page);
>         }
> -       if (pgdat)
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (lruvec)
> +               unlock_page_lruvec_irqrestore(lruvec, flags);
>         release_pages(pvec->pages, pvec->nr);
>         pagevec_reinit(pvec);
>  }
> @@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  {
>         do {
>                 unsigned long lrusize;
> -               struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
> -               spin_lock_irq(&pgdat->lru_lock);
> +               spin_lock_irq(&lruvec->lru_lock);
>                 /* Record cost event */
>                 if (file)
>                         lruvec->file_cost += nr_pages;
> @@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>                         lruvec->file_cost /= 2;
>                         lruvec->anon_cost /= 2;
>                 }
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               spin_unlock_irq(&lruvec->lru_lock);
>         } while ((lruvec = parent_lruvec(lruvec)));
>  }
>
> @@ -365,11 +360,12 @@ static inline void activate_page_drain(int cpu)
>  void activate_page(struct page *page)
>  {
>         pg_data_t *pgdat = page_pgdat(page);
> +       struct lruvec *lruvec;
>
>         page = compound_head(page);
> -       spin_lock_irq(&pgdat->lru_lock);
> -       __activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       lruvec = lock_page_lruvec_irq(page);
> +       __activate_page(page, lruvec);
> +       unlock_page_lruvec_irq(lruvec);
>  }
>  #endif
>
> @@ -819,8 +815,7 @@ void release_pages(struct page **pages, int nr)
>  {
>         int i;
>         LIST_HEAD(pages_to_free);
> -       struct pglist_data *locked_pgdat = NULL;
> -       struct lruvec *lruvec;
> +       struct lruvec *lruvec = NULL;
>         unsigned long uninitialized_var(flags);
>         unsigned int uninitialized_var(lock_batch);
>
> @@ -830,21 +825,20 @@ void release_pages(struct page **pages, int nr)
>                 /*
>                  * Make sure the IRQ-safe lock-holding time does not get
>                  * excessive with a continuous string of pages from the
> -                * same pgdat. The lock is held only if pgdat != NULL.
> +                * same lruvec. The lock is held only if lruvec != NULL.
>                  */
> -               if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> -                       spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> -                       locked_pgdat = NULL;
> +               if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
> +                       unlock_page_lruvec_irqrestore(lruvec, flags);
> +                       lruvec = NULL;
>                 }
>
>                 if (is_huge_zero_page(page))
>                         continue;
>
>                 if (is_zone_device_page(page)) {
> -                       if (locked_pgdat) {
> -                               spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> -                                                      flags);
> -                               locked_pgdat = NULL;
> +                       if (lruvec) {
> +                               unlock_page_lruvec_irqrestore(lruvec, flags);
> +                               lruvec = NULL;
>                         }
>                         /*
>                          * ZONE_DEVICE pages that return 'false' from
> @@ -863,28 +857,28 @@ void release_pages(struct page **pages, int nr)
>                         continue;
>
>                 if (PageCompound(page)) {
> -                       if (locked_pgdat) {
> -                               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> -                               locked_pgdat = NULL;
> +                       if (lruvec) {
> +                               unlock_page_lruvec_irqrestore(lruvec, flags);
> +                               lruvec = NULL;
>                         }
>                         __put_compound_page(page);
>                         continue;
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct pglist_data *pgdat = page_pgdat(page);
> +                       struct lruvec *new_lruvec;
>
> -                       if (pgdat != locked_pgdat) {
> -                               if (locked_pgdat)
> -                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> +                       new_lruvec = mem_cgroup_page_lruvec(page,
> +                                                       page_pgdat(page));
> +                       if (new_lruvec != lruvec) {
> +                               if (lruvec)
> +                                       unlock_page_lruvec_irqrestore(lruvec,
>                                                                         flags);
>                                 lock_batch = 0;
> -                               locked_pgdat = pgdat;
> -                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
> +                               lruvec = lock_page_lruvec_irqsave(page, &flags);
>                         }

This seems a bit ugly to me. I am not a fan of fetching the lruvec
twice (once here, and again inside lock_page_lruvec_irqsave()) when you
already have it in new_lruvec. I suppose it is fine though, since you
are just going to be replacing it later anyway.

>
>                         __ClearPageLRU(page);
> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                 }
>
> @@ -894,8 +888,8 @@ void release_pages(struct page **pages, int nr)
>
>                 list_add(&page->lru, &pages_to_free);
>         }
> -       if (locked_pgdat)
> -               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +       if (lruvec)
> +               unlock_page_lruvec_irqrestore(lruvec, flags);
>
>         mem_cgroup_uncharge_list(&pages_to_free);
>         free_unref_page_list(&pages_to_free);
> @@ -983,26 +977,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
>         int i;
> -       struct pglist_data *pgdat = NULL;
> -       struct lruvec *lruvec;
> +       struct lruvec *lruvec = NULL;
>         unsigned long flags = 0;
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
>                 struct page *page = pvec->pages[i];
> -               struct pglist_data *pagepgdat = page_pgdat(page);
> +               struct lruvec *new_lruvec;
>
> -               if (pagepgdat != pgdat) {
> -                       if (pgdat)
> -                               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                       pgdat = pagepgdat;
> -                       spin_lock_irqsave(&pgdat->lru_lock, flags);
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (lruvec != new_lruvec) {
> +                       if (lruvec)
> +                               unlock_page_lruvec_irqrestore(lruvec, flags);
> +                       lruvec = lock_page_lruvec_irqsave(page, &flags);
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 __pagevec_lru_add_fn(page, lruvec);
>         }
> -       if (pgdat)
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (lruvec)
> +               unlock_page_lruvec_irqrestore(lruvec, flags);
>         release_pages(pvec->pages, pvec->nr);
>         pagevec_reinit(pvec);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f77748adc340..168c1659e430 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1774,15 +1774,13 @@ int isolate_lru_page(struct page *page)
>         WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>
>         if (TestClearPageLRU(page)) {
> -               pg_data_t *pgdat = page_pgdat(page);
>                 struct lruvec *lruvec;
>                 int lru = page_lru(page);
>
>                 get_page(page);
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -               spin_lock_irq(&pgdat->lru_lock);
> +               lruvec = lock_page_lruvec_irq(page);
>                 del_page_from_lru_list(page, lruvec, lru);
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>                 ret = 0;
>         }
>
> @@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                                                      struct list_head *list)
>  {
> -       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         int nr_pages, nr_moved = 0;
>         LIST_HEAD(pages_to_free);
>         struct page *page;
> +       struct lruvec *orig_lruvec = lruvec;
>         enum lru_list lru;
>
>         while (!list_empty(list)) {
> +               struct lruvec *new_lruvec = NULL;
> +
>                 page = lru_to_page(list);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(&pgdat->lru_lock);
> +                       spin_unlock_irq(&lruvec->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(&pgdat->lru_lock);
> +                       spin_lock_irq(&lruvec->lru_lock);
>                         continue;
>                 }
>
> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                  *                                        list_add(&page->lru,)
>                  *     list_add(&page->lru,) //corrupt
>                  */
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (new_lruvec != lruvec) {
> +                       if (lruvec)
> +                               spin_unlock_irq(&lruvec->lru_lock);
> +                       lruvec = lock_page_lruvec_irq(page);
> +               }
>                 SetPageLRU(page);
>
>                 if (unlikely(put_page_testzero(page))) {
> @@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                         __ClearPageActive(page);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(&pgdat->lru_lock);
> +                               spin_unlock_irq(&lruvec->lru_lock);
>                                 destroy_compound_page(page);
> -                               spin_lock_irq(&pgdat->lru_lock);
> +                               spin_lock_irq(&lruvec->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>
>                         continue;
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 lru = page_lru(page);
>                 nr_pages = hpage_nr_pages(page);
>
> @@ -1902,6 +1907,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                 if (PageActive(page))
>                         workingset_age_nonresident(lruvec, nr_pages);
>         }
> +       if (orig_lruvec != lruvec) {
> +               if (lruvec)
> +                       spin_unlock_irq(&lruvec->lru_lock);
> +               spin_lock_irq(&orig_lruvec->lru_lock);
> +       }
>
>         /*
>          * To save our caller's stack, now use input list for pages to free.

Something like this seems much more readable than the block you had
above. It is what I would expect the relock code to look like.

> @@ -1957,7 +1967,7 @@ static int current_may_throttle(void)
>
>         lru_add_drain();
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>                                      &nr_scanned, sc, lru);
> @@ -1969,7 +1979,7 @@ static int current_may_throttle(void)
>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
>         __count_vm_events(PGSCAN_ANON + file, nr_scanned);
>
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         if (nr_taken == 0)
>                 return 0;
> @@ -1977,7 +1987,7 @@ static int current_may_throttle(void)
>         nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
>                                 &stat, false);
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>         move_pages_to_lru(lruvec, &page_list);
>
>         __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> @@ -1986,7 +1996,7 @@ static int current_may_throttle(void)
>                 __count_vm_events(item, nr_reclaimed);
>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>         __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         lru_note_cost(lruvec, file, stat.nr_pageout);
>         mem_cgroup_uncharge_list(&page_list);
> @@ -2039,7 +2049,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>         lru_add_drain();
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>                                      &nr_scanned, sc, lru);
> @@ -2049,7 +2059,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         __count_vm_events(PGREFILL, nr_scanned);
>         __count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
>
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         while (!list_empty(&l_hold)) {
>                 cond_resched();
> @@ -2095,7 +2105,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         /*
>          * Move pages back to the lru list.
>          */
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>
>         nr_activate = move_pages_to_lru(lruvec, &l_active);
>         nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
> @@ -2106,7 +2116,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
>
>         __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         mem_cgroup_uncharge_list(&l_active);
>         free_unref_page_list(&l_active);
> @@ -2696,10 +2706,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         /*
>          * Determine the scan balance between anon and file LRUs.
>          */
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&target_lruvec->lru_lock);
>         sc->anon_cost = target_lruvec->anon_cost;
>         sc->file_cost = target_lruvec->file_cost;
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&target_lruvec->lru_lock);
>
>         /*
>          * Target desirable inactive:active list ratios for the anon
> @@ -4275,24 +4285,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>   */
>  void check_move_unevictable_pages(struct pagevec *pvec)
>  {
> -       struct lruvec *lruvec;
> -       struct pglist_data *pgdat = NULL;
> +       struct lruvec *lruvec = NULL;
>         int pgscanned = 0;
>         int pgrescued = 0;
>         int i;
>
>         for (i = 0; i < pvec->nr; i++) {
>                 struct page *page = pvec->pages[i];
> -               struct pglist_data *pagepgdat = page_pgdat(page);
> +               struct lruvec *new_lruvec;
>
>                 pgscanned++;
> -               if (pagepgdat != pgdat) {
> -                       if (pgdat)
> -                               spin_unlock_irq(&pgdat->lru_lock);
> -                       pgdat = pagepgdat;
> -                       spin_lock_irq(&pgdat->lru_lock);
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (lruvec != new_lruvec) {
> +                       if (lruvec)
> +                               unlock_page_lruvec_irq(lruvec);
> +                       lruvec = lock_page_lruvec_irq(page);
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> @@ -4308,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>                 }
>         }
>
> -       if (pgdat) {
> +       if (lruvec) {
>                 __count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
>                 __count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>         }
>  }
>  EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
> --
> 1.8.3.1
>

* Re: [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock
@ 2020-07-17 21:38     ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-17 21:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov

On Fri, Jul 10, 2020 at 6:00 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> This patch moves the per-node lru_lock into lruvec, giving each memcg
> its own lru_lock per node. On a large machine, memcgs then no longer
> have to contend on the per-node pgdat->lru_lock; each can proceed
> with its own lru_lock.
>
> After moving the memcg charge before LRU insertion, page isolation can
> serialize changes to the page's memcg, so the per-memcg lruvec lock is
> stable and can replace the per-node lru lock.
>
> Following Daniel Jordan's suggestion, I ran 208 'dd' in 104
> containers on a 2s * 26cores * HT box with a modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this and later patches, the readtwice performance increases by
> about 80% with concurrent containers.
>
> Also add a debug function in the locking path, which may give some
> clues if something goes wrong.
>
> Hugh Dickins helped on patch polish, thanks!
>
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: cgroups@vger.kernel.org
> ---
>  include/linux/memcontrol.h |  98 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mmzone.h     |   2 +
>  mm/compaction.c            |  67 +++++++++++++++++++-----------
>  mm/huge_memory.c           |  11 ++---
>  mm/memcontrol.c            |  63 +++++++++++++++++++++++++++-
>  mm/mlock.c                 |  32 +++++++--------
>  mm/mmzone.c                |   1 +
>  mm/swap.c                  | 100 +++++++++++++++++++++------------------------
>  mm/vmscan.c                |  70 +++++++++++++++++--------------
>  9 files changed, 310 insertions(+), 134 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e77197a62809..6e670f991b42 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -411,6 +411,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>
>  struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
>
> +struct lruvec *lock_page_lruvec(struct page *page);
> +struct lruvec *lock_page_lruvec_irq(struct page *page);
> +struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +                                               unsigned long *flags);
> +
> +#ifdef CONFIG_DEBUG_VM
> +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
> +#else
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
> +#endif
> +
>  static inline
>  struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>         return css ? container_of(css, struct mem_cgroup, css) : NULL;
> @@ -892,6 +905,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
>  {
>  }
>
> +static inline struct lruvec *lock_page_lruvec(struct page *page)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       spin_lock(&pgdat->__lruvec.lru_lock);
> +       return &pgdat->__lruvec;
> +}
> +
> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       spin_lock_irq(&pgdat->__lruvec.lru_lock);
> +       return &pgdat->__lruvec;
> +}
> +
> +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +               unsigned long *flagsp)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
> +       return &pgdat->__lruvec;
> +}
> +
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>                 struct mem_cgroup *prev,
> @@ -1126,6 +1164,10 @@ static inline void count_memcg_page_event(struct page *page,
>  void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
>  #endif /* CONFIG_MEMCG */
>
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> @@ -1255,6 +1297,62 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
>         return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
>  }
>
> +static inline void unlock_page_lruvec(struct lruvec *lruvec)
> +{
> +       spin_unlock(&lruvec->lru_lock);
> +}
> +
> +static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
> +{
> +       spin_unlock_irq(&lruvec->lru_lock);
> +}
> +
> +static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
> +               unsigned long flags)
> +{
> +       spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> +}
> +
> +/* Don't lock again iff page's lruvec locked */
> +static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
> +               struct lruvec *locked_lruvec)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +       bool locked;
> +
> +       rcu_read_lock();
> +       locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> +       rcu_read_unlock();
> +
> +       if (locked)
> +               return locked_lruvec;
> +
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irq(locked_lruvec);
> +
> +       return lock_page_lruvec_irq(page);
> +}
> +
> +/* Don't lock again iff page's lruvec locked */
> +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
> +               struct lruvec *locked_lruvec, unsigned long *flags)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +       bool locked;
> +
> +       rcu_read_lock();
> +       locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> +       rcu_read_unlock();
> +
> +       if (locked)
> +               return locked_lruvec;
> +
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> +
> +       return lock_page_lruvec_irqsave(page, flags);
> +}
> +

These relock functions have no users in this patch. It might make
sense to push this code to patch 19 in your series, since that is
where they are first used. In addition, they don't seem very efficient:
you already had to call mem_cgroup_page_lruvec once, so why do it
again when you could just store the value and lock the new lruvec if
needed?
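
To make the efficiency point concrete, here is a rough userspace sketch,
not kernel code. The toy_lock type, page_lruvec(), relock_lruvec() and
walk_pages() are hypothetical stand-ins invented for illustration; the
toy lock is single-threaded and only asserts on misuse. The sketch also
assumes the page's lruvec cannot change between the lookup and the lock,
which the real code has to guarantee separately.

```c
#include <assert.h>
#include <stddef.h>

/* Toy single-threaded lock standing in for spinlock_t (hypothetical). */
struct toy_lock {
	int held;		/* 1 while the lock is held */
	int nr_acquired;	/* how many times it has been taken */
};

static void toy_lock_acquire(struct toy_lock *lock)
{
	assert(!lock->held);
	lock->held = 1;
	lock->nr_acquired++;
}

static void toy_lock_release(struct toy_lock *lock)
{
	assert(lock->held);
	lock->held = 0;
}

/* Simplified stand-ins for the kernel structures. */
struct lruvec { struct toy_lock lru_lock; };
struct page { struct lruvec *lruvec; };

static int nr_lookups;	/* counts calls to page_lruvec() */

/* Stand-in for mem_cgroup_page_lruvec(): one lookup per call. */
static struct lruvec *page_lruvec(struct page *page)
{
	nr_lookups++;
	return page->lruvec;
}

/*
 * Switch from @locked (may be NULL) to @new_lruvec, which the caller
 * has already looked up. No second lookup happens in here.
 */
static struct lruvec *relock_lruvec(struct lruvec *new_lruvec,
				    struct lruvec *locked)
{
	if (new_lruvec == locked)
		return locked;
	if (locked)
		toy_lock_release(&locked->lru_lock);
	toy_lock_acquire(&new_lruvec->lru_lock);
	return new_lruvec;
}

/* Walk a batch, keeping the lock across runs of the same lruvec. */
static void walk_pages(struct page *pages, size_t nr)
{
	struct lruvec *locked = NULL;
	size_t i;

	for (i = 0; i < nr; i++)
		locked = relock_lruvec(page_lruvec(&pages[i]), locked);
	if (locked)
		toy_lock_release(&locked->lru_lock);
}
```

With this shape there is exactly one lookup per page, and the lock is
only dropped and retaken when consecutive pages belong to different
lruvecs.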

>  #ifdef CONFIG_CGROUP_WRITEBACK
>
>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 14c668b7e793..36c1680efd90 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -261,6 +261,8 @@ struct lruvec {
>         atomic_long_t                   nonresident_age;
>         /* Refaults at the time of last reclaim cycle */
>         unsigned long                   refaults;
> +       /* per lruvec lru_lock for memcg */
> +       spinlock_t                      lru_lock;
>         /* Various lruvec state flags (enum lruvec_flags) */
>         unsigned long                   flags;

Is there any reason for placing this here instead of at the end of the
structure? From what I can tell, lruvec is already 128B long, so
placing the lock at the end would put it in the next cacheline, which
may provide some performance benefit since the lock is likely to
bounce between CPUs quite a bit.
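
A rough userspace sketch of the layout question, under stated
assumptions: a 64-byte cacheline, a structure that starts on a line
boundary, and made-up field sizes chosen only so the non-lock fields add
up to 128 bytes. toy_spinlock, lruvec_mid, lruvec_end and cacheline_of()
are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed line size; real hardware may differ */

/* Hypothetical 4-byte lock; the kernel's spinlock_t may differ. */
struct toy_spinlock {
	unsigned int val;
};

/* Lock placed mid-structure: it shares a line with its neighbours. */
struct lruvec_mid {
	unsigned char lists_and_stats[112];
	struct toy_spinlock lru_lock;		/* offset 112 */
	unsigned char flags[12];		/* offset 116 */
};

/* Lock placed last and aligned: it starts a cacheline of its own. */
struct lruvec_end {
	unsigned char lists_and_stats[112];
	unsigned char flags[16];		/* offset 112 */
	_Alignas(CACHELINE) struct toy_spinlock lru_lock; /* offset 128 */
};

/* Index of the cacheline holding a byte at @offset from the start. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE;
}
```

In lruvec_mid the lock and flags land in the same line (index 1); in
lruvec_end the lock moves to line index 2, away from the other fields,
at the cost of growing the structure to 192 bytes. In the kernel the
alignment would more likely be expressed with an annotation such as
____cacheline_aligned_in_smp, and whether it helps would need measuring.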

>  #ifdef CONFIG_MEMCG
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 2da2933fe56b..88bbd2e93895 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         unsigned long nr_scanned = 0, nr_isolated = 0;
>         struct lruvec *lruvec;
>         unsigned long flags = 0;
> -       bool locked = false;
> +       struct lruvec *locked_lruvec = NULL;
>         struct page *page = NULL, *valid_page = NULL;
>         unsigned long start_pfn = low_pfn;
>         bool skip_on_failure = false;
> @@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * contention, to give chance to IRQs. Abort completely if
>                  * a fatal signal is pending.
>                  */
> -               if (!(low_pfn % SWAP_CLUSTER_MAX)
> -                   && compact_unlock_should_abort(&pgdat->lru_lock,
> -                                           flags, &locked, cc)) {
> -                       low_pfn = 0;
> -                       goto fatal_pending;
> +               if (!(low_pfn % SWAP_CLUSTER_MAX)) {
> +                       if (locked_lruvec) {
> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +                                                                       flags);
> +                               locked_lruvec = NULL;
> +                       }
> +
> +                       if (fatal_signal_pending(current)) {
> +                               cc->contended = true;
> +
> +                               low_pfn = 0;
> +                               goto fatal_pending;
> +                       }
> +
> +                       cond_resched();
>                 }
>
>                 if (!pfn_valid_within(low_pfn))
> @@ -922,10 +932,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(__PageMovable(page)) &&
>                                         !PageIsolated(page)) {
> -                               if (locked) {
> -                                       spin_unlock_irqrestore(&pgdat->lru_lock,
> -                                                                       flags);
> -                                       locked = false;
> +                               if (locked_lruvec) {
> +                                       unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> +                                       locked_lruvec = NULL;
>                                 }
>
>                                 if (!isolate_movable_page(page, isolate_mode))
> @@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!TestClearPageLRU(page))
>                         goto isolate_fail_put;
>
> +               rcu_read_lock();
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
>                 /* If we already hold the lock, we can skip some rechecking */
> -               if (!locked) {
> -                       locked = compact_lock_irqsave(&pgdat->lru_lock,
> -                                                               &flags, cc);
> +               if (lruvec != locked_lruvec) {
> +                       if (locked_lruvec)
> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +                                                                       flags);
> +
> +                       compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> +                       locked_lruvec = lruvec;
> +                       rcu_read_unlock();
> +
> +                       lruvec_memcg_debug(lruvec, page);
>
>                         /* Try get exclusive access under lock */
>                         if (!skip_updated) {
> @@ -988,9 +1007,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                 SetPageLRU(page);
>                                 goto isolate_fail_put;
>                         }
> -               }
> -
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +               } else
> +                       rcu_read_unlock();
>
>                 /* The whole page is taken off the LRU; skip the tail pages. */
>                 if (PageCompound(page))
> @@ -1023,9 +1041,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>
>  isolate_fail_put:
>                 /* Avoid potential deadlock in freeing page under lru_lock */
> -               if (locked) {
> -                       spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                       locked = false;
> +               if (locked_lruvec) {
> +                       unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> +                       locked_lruvec = NULL;
>                 }
>                 put_page(page);
>
> @@ -1039,9 +1057,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * page anyway.
>                  */
>                 if (nr_isolated) {
> -                       if (locked) {
> -                               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                               locked = false;
> +                       if (locked_lruvec) {
> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +                                                                       flags);
> +                               locked_lruvec = NULL;
>                         }
>                         putback_movable_pages(&cc->migratepages);
>                         cc->nr_migratepages = 0;
> @@ -1068,8 +1087,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         page = NULL;
>
>  isolate_abort:
> -       if (locked)
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irqrestore(locked_lruvec, flags);
>         if (page) {
>                 SetPageLRU(page);
>                 put_page(page);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4fe7b92c9330..1ff0c1ff6a52 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2346,7 +2346,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
>         VM_BUG_ON_PAGE(!PageHead(head), head);
>         VM_BUG_ON_PAGE(PageCompound(page_tail), head);
>         VM_BUG_ON_PAGE(PageLRU(page_tail), head);
> -       lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
> +       lockdep_assert_held(&lruvec->lru_lock);
>
>         if (list) {
>                 /* page reclaim is reclaiming a huge page */
> @@ -2429,7 +2429,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>                               pgoff_t end)
>  {
>         struct page *head = compound_head(page);
> -       pg_data_t *pgdat = page_pgdat(head);
>         struct lruvec *lruvec;
>         struct address_space *swap_cache = NULL;
>         unsigned long offset = 0;
> @@ -2446,10 +2445,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>                 xa_lock(&swap_cache->i_pages);
>         }
>
> -       /* prevent PageLRU to go away from under us, and freeze lru stats */
> -       spin_lock(&pgdat->lru_lock);
> -
> -       lruvec = mem_cgroup_page_lruvec(head, pgdat);
> +       /* lock lru list/PageCompound, ref freezed by page_ref_freeze */
> +       lruvec = lock_page_lruvec(head);
>
>         for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>                 __split_huge_page_tail(head, i, lruvec, list);
> @@ -2470,7 +2467,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>         }
>
>         ClearPageCompound(head);
> -       spin_unlock(&pgdat->lru_lock);
> +       unlock_page_lruvec(lruvec);
>         /* Caller disabled irqs, so they are still disabled here */
>
>         split_page_owner(head, HPAGE_PMD_ORDER);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fde47272b13c..d5e56be42f21 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1196,6 +1196,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>         return ret;
>  }
>
> +#ifdef CONFIG_DEBUG_VM
> +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       if (!page->mem_cgroup)
> +               VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
> +       else
> +               VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
> +}
> +#endif
> +
>  /**
>   * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
>   * @page: the page
> @@ -1215,7 +1228,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>                 goto out;
>         }
>
> -       memcg = page->mem_cgroup;
> +       VM_BUG_ON_PAGE(PageTail(page), page);
> +       memcg = READ_ONCE(page->mem_cgroup);
>         /*
>          * Swapcache readahead pages are added to the LRU - and
>          * possibly migrated - before they are charged.
> @@ -1236,6 +1250,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>         return lruvec;
>  }
>
> +struct lruvec *lock_page_lruvec(struct page *page)
> +{
> +       struct lruvec *lruvec;
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       rcu_read_lock();
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +       spin_lock(&lruvec->lru_lock);
> +       rcu_read_unlock();
> +
> +       lruvec_memcg_debug(lruvec, page);
> +
> +       return lruvec;
> +}
> +
> +struct lruvec *lock_page_lruvec_irq(struct page *page)
> +{
> +       struct lruvec *lruvec;
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       rcu_read_lock();
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +       spin_lock_irq(&lruvec->lru_lock);
> +       rcu_read_unlock();
> +
> +       lruvec_memcg_debug(lruvec, page);
> +
> +       return lruvec;
> +}
> +
> +struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
> +{
> +       struct lruvec *lruvec;
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       rcu_read_lock();
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +       spin_lock_irqsave(&lruvec->lru_lock, *flags);
> +       rcu_read_unlock();
> +
> +       lruvec_memcg_debug(lruvec, page);
> +
> +       return lruvec;
> +}
> +
>  /**
>   * mem_cgroup_update_lru_size - account for adding or removing an lru page
>   * @lruvec: mem_cgroup per zone lru vector
> @@ -2999,7 +3058,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>
>  /*
>   * Because tail pages are not marked as "used", set it. We're under
> - * pgdat->lru_lock and migration entries setup in all page mappings.
> + * lruvec->lru_lock and migration entries setup in all page mappings.
>   */
>  void mem_cgroup_split_huge_fixup(struct page *head)
>  {
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0bdde88b4438..cb23a0c2cfbf 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -163,7 +163,7 @@ unsigned int munlock_vma_page(struct page *page)
>  {
>         int nr_pages;
>         bool clearlru = false;
> -       pg_data_t *pgdat = page_pgdat(page);
> +       struct lruvec *lruvec;
>
>         /* For try_to_munlock() and to serialize with page migration */
>         BUG_ON(!PageLocked(page));
> @@ -177,7 +177,7 @@ unsigned int munlock_vma_page(struct page *page)
>          */
>         get_page(page);
>         clearlru = TestClearPageLRU(page);
> -       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = lock_page_lruvec_irq(page);
>
>         if (!TestClearPageMlocked(page)) {
>                 if (clearlru)
> @@ -186,7 +186,7 @@ unsigned int munlock_vma_page(struct page *page)
>                  * Potentially, PTE-mapped THP: do not skip the rest PTEs
>                  * Reuse lock as memory barrier for release_pages racing.
>                  */
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>                 put_page(page);
>                 return 0;
>         }
> @@ -195,14 +195,11 @@ unsigned int munlock_vma_page(struct page *page)
>         __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
>
>         if (clearlru) {
> -               struct lruvec *lruvec;
> -
> -               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>                 __munlock_isolated_page(page);
>         } else {
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>                 put_page(page);
>                 __munlock_isolation_failed(page);
>         }
> @@ -284,6 +281,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>         int nr = pagevec_count(pvec);
>         int delta_munlocked = -nr;
>         struct pagevec pvec_putback;
> +       struct lruvec *lruvec = NULL;
>         int pgrescued = 0;
>
>         pagevec_init(&pvec_putback);
> @@ -291,11 +289,17 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>         /* Phase 1: page isolation */
>         for (i = 0; i < nr; i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *lruvec;
> +               struct lruvec *new_lruvec;
>                 bool clearlru;
>
>                 clearlru = TestClearPageLRU(page);
> -               spin_lock_irq(&zone->zone_pgdat->lru_lock);
> +
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (new_lruvec != lruvec) {
> +                       if (lruvec)
> +                               unlock_page_lruvec_irq(lruvec);
> +                       lruvec = lock_page_lruvec_irq(page);
> +               }

So instead of trying to optimize things here, you should go for parity.
If the old code took the lru_lock once per pass, you should do that
here too. You can come back and optimize with the relock approach
later.
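
For illustration only, the difference between parity and the relock
approach can be sketched in userspace C, reading "once per pass" as one
lock/unlock pair per loop iteration. Everything here (struct
lruvec_model, struct page_model, isolate_parity(), isolate_relock()) is
a hypothetical stand-in rather than kernel code, and the lock is
modelled as a flag plus a counter so the acquisitions can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only what the sketch needs. */
struct lruvec_model {
	int locked;        /* 1 while the modelled lru_lock is held */
	int acquisitions;  /* how many times the lock was taken */
};

struct page_model {
	struct lruvec_model *lruvec;
};

static void model_lock(struct lruvec_model *lv)
{
	assert(!lv->locked);
	lv->locked = 1;
	lv->acquisitions++;
}

static void model_unlock(struct lruvec_model *lv)
{
	assert(lv->locked);
	lv->locked = 0;
}

/* Parity: one lock/unlock pair per loop pass, like the per-node version. */
static void isolate_parity(struct page_model *pages, int nr)
{
	for (int i = 0; i < nr; i++) {
		struct lruvec_model *lv = pages[i].lruvec;

		model_lock(lv);
		/* ... per-page isolation work would go here ... */
		model_unlock(lv);
	}
}

/* Relock: keep the lock across passes while the lruvec does not change. */
static void isolate_relock(struct page_model *pages, int nr)
{
	struct lruvec_model *locked = NULL;

	for (int i = 0; i < nr; i++) {
		struct lruvec_model *lv = pages[i].lruvec;

		if (lv != locked) {
			if (locked)
				model_unlock(locked);
			model_lock(lv);
			locked = lv;
		}
		/* ... per-page isolation work would go here ... */
	}
	if (locked)
		model_unlock(locked);
}
```

With four pages whose lruvecs are A, A, B, A, the parity loop takes A's
lock three times and B's once, while the relock loop takes A's lock
twice and B's once.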

>                 if (!TestClearPageMlocked(page)) {
>                         delta_munlocked++;
> @@ -314,9 +318,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>                  * We already have pin from follow_page_mask()
>                  * so we can spare the get_page() here.
>                  */
> -               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
> -               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>                 continue;
>
>                 /*
> @@ -326,14 +328,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>                  * the last pin, __page_cache_release() would deadlock.
>                  */
>  putback:
> -               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>                 pagevec_add(&pvec_putback, pvec->pages[i]);
>                 pvec->pages[i] = NULL;
>         }
> -       /* tempary disable irq, will remove later */
> -       local_irq_disable();
>         __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> -       local_irq_enable();
> +       if (lruvec)
> +               unlock_page_lruvec_irq(lruvec);

So I am not a fan of this change. You went to all the trouble of
reducing the lock scope just to bring it back out here again. In
addition, the "if (lruvec)" check implies there is a path where you
might try to update the page state without disabling interrupts.
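
A rough userspace model of that concern, under the assumption that the
pagevec can be empty so that no lock is taken in the loop. Whether that
can happen in practice depends on the callers; the sketch only shows
the shape of the problem. All names below are hypothetical, with
model_mod_state() standing in for __mod_zone_page_state() and a plain
flag standing in for the CPU's interrupt state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct lruvec_model { int id; };

static bool irqs_off;       /* modelled interrupt state of this CPU */
static int unsafe_updates;  /* counter updates seen with interrupts on */

static void model_lock_irq(struct lruvec_model *lv)
{
	(void)lv;
	assert(!irqs_off);
	irqs_off = true;
}

static void model_unlock_irq(struct lruvec_model *lv)
{
	(void)lv;
	assert(irqs_off);
	irqs_off = false;
}

/* Stands in for __mod_zone_page_state(), which expects interrupts off. */
static void model_mod_state(void)
{
	if (!irqs_off)
		unsafe_updates++;
}

/* Shape of the patched tail: the update leans on the lock from the loop. */
static void tail_as_patched(struct lruvec_model *lv, int nr)
{
	struct lruvec_model *lruvec = NULL;

	for (int i = 0; i < nr; i++) {
		if (lruvec != lv) {
			model_lock_irq(lv);
			lruvec = lv;
		}
	}
	model_mod_state();
	if (lruvec)
		model_unlock_irq(lruvec);
}

/* Shape of the removed code: explicit disable/enable around the update. */
static void tail_explicit(struct lruvec_model *lv, int nr)
{
	struct lruvec_model *lruvec = NULL;

	for (int i = 0; i < nr; i++) {
		if (lruvec != lv) {
			model_lock_irq(lv);
			lruvec = lv;
		}
	}
	if (lruvec)
		model_unlock_irq(lruvec);
	irqs_off = true;   /* models local_irq_disable() */
	model_mod_state();
	irqs_off = false;  /* models local_irq_enable() */
}
```

In this model tail_as_patched() is fine whenever the loop locked at
least once, but with nr == 0 the update runs with interrupts enabled.
tail_explicit() does not depend on what the loop did.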

>         /* Now we can release pins of pages that we are not munlocking */
>         pagevec_release(&pvec_putback);
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index 4686fdc23bb9..3750a90ed4a0 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
>         enum lru_list lru;
>
>         memset(lruvec, 0, sizeof(struct lruvec));
> +       spin_lock_init(&lruvec->lru_lock);
>
>         for_each_lru(lru)
>                 INIT_LIST_HEAD(&lruvec->lists[lru]);
> diff --git a/mm/swap.c b/mm/swap.c
> index 8488b9b25730..129c532357a4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -79,15 +79,13 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
>  static void __page_cache_release(struct page *page)
>  {
>         if (PageLRU(page)) {
> -               pg_data_t *pgdat = page_pgdat(page);
>                 struct lruvec *lruvec;
>                 unsigned long flags;
>
>                 __ClearPageLRU(page);
> -               spin_lock_irqsave(&pgdat->lru_lock, flags);
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +               lruvec = lock_page_lruvec_irqsave(page, &flags);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +               unlock_page_lruvec_irqrestore(lruvec, flags);
>         }
>         __ClearPageWaiters(page);
>  }
> @@ -206,32 +204,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>         void (*move_fn)(struct page *page, struct lruvec *lruvec))
>  {
>         int i;
> -       struct pglist_data *pgdat = NULL;
> -       struct lruvec *lruvec;
> +       struct lruvec *lruvec = NULL;
>         unsigned long flags = 0;
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
>                 struct page *page = pvec->pages[i];
> -               struct pglist_data *pagepgdat = page_pgdat(page);
> +               struct lruvec *new_lruvec;
>
> -               if (pagepgdat != pgdat) {
> -                       if (pgdat)
> -                               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                       pgdat = pagepgdat;
> -                       spin_lock_irqsave(&pgdat->lru_lock, flags);
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (lruvec != new_lruvec) {
> +                       if (lruvec)
> +                               unlock_page_lruvec_irqrestore(lruvec, flags);
> +                       lruvec = lock_page_lruvec_irqsave(page, &flags);
>                 }
>
>                 /* block memcg migration during page moving between lru */
>                 if (!TestClearPageLRU(page))
>                         continue;
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 (*move_fn)(page, lruvec);
>
>                 SetPageLRU(page);
>         }
> -       if (pgdat)
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (lruvec)
> +               unlock_page_lruvec_irqrestore(lruvec, flags);
>         release_pages(pvec->pages, pvec->nr);
>         pagevec_reinit(pvec);
>  }
> @@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>  {
>         do {
>                 unsigned long lrusize;
> -               struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
> -               spin_lock_irq(&pgdat->lru_lock);
> +               spin_lock_irq(&lruvec->lru_lock);
>                 /* Record cost event */
>                 if (file)
>                         lruvec->file_cost += nr_pages;
> @@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>                         lruvec->file_cost /= 2;
>                         lruvec->anon_cost /= 2;
>                 }
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               spin_unlock_irq(&lruvec->lru_lock);
>         } while ((lruvec = parent_lruvec(lruvec)));
>  }
>
> @@ -365,11 +360,12 @@ static inline void activate_page_drain(int cpu)
>  void activate_page(struct page *page)
>  {
>         pg_data_t *pgdat = page_pgdat(page);
> +       struct lruvec *lruvec;
>
>         page = compound_head(page);
> -       spin_lock_irq(&pgdat->lru_lock);
> -       __activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       lruvec = lock_page_lruvec_irq(page);
> +       __activate_page(page, lruvec);
> +       unlock_page_lruvec_irq(lruvec);
>  }
>  #endif
>
> @@ -819,8 +815,7 @@ void release_pages(struct page **pages, int nr)
>  {
>         int i;
>         LIST_HEAD(pages_to_free);
> -       struct pglist_data *locked_pgdat = NULL;
> -       struct lruvec *lruvec;
> +       struct lruvec *lruvec = NULL;
>         unsigned long uninitialized_var(flags);
>         unsigned int uninitialized_var(lock_batch);
>
> @@ -830,21 +825,20 @@ void release_pages(struct page **pages, int nr)
>                 /*
>                  * Make sure the IRQ-safe lock-holding time does not get
>                  * excessive with a continuous string of pages from the
> -                * same pgdat. The lock is held only if pgdat != NULL.
> +                * same lruvec. The lock is held only if lruvec != NULL.
>                  */
> -               if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> -                       spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> -                       locked_pgdat = NULL;
> +               if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
> +                       unlock_page_lruvec_irqrestore(lruvec, flags);
> +                       lruvec = NULL;
>                 }
>
>                 if (is_huge_zero_page(page))
>                         continue;
>
>                 if (is_zone_device_page(page)) {
> -                       if (locked_pgdat) {
> -                               spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> -                                                      flags);
> -                               locked_pgdat = NULL;
> +                       if (lruvec) {
> +                               unlock_page_lruvec_irqrestore(lruvec, flags);
> +                               lruvec = NULL;
>                         }
>                         /*
>                          * ZONE_DEVICE pages that return 'false' from
> @@ -863,28 +857,28 @@ void release_pages(struct page **pages, int nr)
>                         continue;
>
>                 if (PageCompound(page)) {
> -                       if (locked_pgdat) {
> -                               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> -                               locked_pgdat = NULL;
> +                       if (lruvec) {
> +                               unlock_page_lruvec_irqrestore(lruvec, flags);
> +                               lruvec = NULL;
>                         }
>                         __put_compound_page(page);
>                         continue;
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct pglist_data *pgdat = page_pgdat(page);
> +                       struct lruvec *new_lruvec;
>
> -                       if (pgdat != locked_pgdat) {
> -                               if (locked_pgdat)
> -                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> +                       new_lruvec = mem_cgroup_page_lruvec(page,
> +                                                       page_pgdat(page));
> +                       if (new_lruvec != lruvec) {
> +                               if (lruvec)
> +                                       unlock_page_lruvec_irqrestore(lruvec,
>                                                                         flags);
>                                 lock_batch = 0;
> -                               locked_pgdat = pgdat;
> -                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
> +                               lruvec = lock_page_lruvec_irqsave(page, &flags);
>                         }

This seems kind of ugly to me. I am not a fan of having to fetch the
lruvec twice when you already have it in new_lruvec. I suppose it is
fine, though, since you are just going to replace it later anyway.
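
To show what "fetch the lruvec twice" amounts to, here is a small
userspace model that counts lookups. The names are hypothetical:
model_page_lruvec() stands in for mem_cgroup_page_lruvec(), and
switch_lock_once() is an imagined variant that locks the lruvec it has
already resolved. Whether the second lookup can be dropped in the real
code depends on the page's memcg being stable between the lookup and
the lock, which this sketch simply assumes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the fields the sketch needs. */
struct lruvec_model { int locked; };
struct page_model { struct lruvec_model *lruvec; };

static int lookups;  /* number of page -> lruvec resolutions performed */

/* Stands in for mem_cgroup_page_lruvec(). */
static struct lruvec_model *model_page_lruvec(struct page_model *page)
{
	lookups++;
	return page->lruvec;
}

static void model_unlock(struct lruvec_model *lv)
{
	assert(lv->locked);
	lv->locked = 0;
}

/* Stands in for lock_page_lruvec_irqsave(): resolves, then locks. */
static struct lruvec_model *model_lock_page_lruvec(struct page_model *page)
{
	struct lruvec_model *lv = model_page_lruvec(page);

	assert(!lv->locked);
	lv->locked = 1;
	return lv;
}

/* As in the quoted hunk: look up, compare, then look up again to lock. */
static struct lruvec_model *switch_lock_twice(struct page_model *page,
					      struct lruvec_model *locked)
{
	struct lruvec_model *new_lv = model_page_lruvec(page);

	if (new_lv != locked) {
		if (locked)
			model_unlock(locked);
		locked = model_lock_page_lruvec(page);
	}
	return locked;
}

/* Imagined variant: lock the lruvec that was already resolved. */
static struct lruvec_model *switch_lock_once(struct page_model *page,
					     struct lruvec_model *locked)
{
	struct lruvec_model *new_lv = model_page_lruvec(page);

	if (new_lv != locked) {
		if (locked)
			model_unlock(locked);
		assert(!new_lv->locked);
		new_lv->locked = 1;
		locked = new_lv;
	}
	return locked;
}
```

For two pages in different lruvecs, switch_lock_twice() performs four
lookups and switch_lock_once() performs two, with the same lock held at
the end in both cases.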

>
>                         __ClearPageLRU(page);
> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                 }
>
> @@ -894,8 +888,8 @@ void release_pages(struct page **pages, int nr)
>
>                 list_add(&page->lru, &pages_to_free);
>         }
> -       if (locked_pgdat)
> -               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +       if (lruvec)
> +               unlock_page_lruvec_irqrestore(lruvec, flags);
>
>         mem_cgroup_uncharge_list(&pages_to_free);
>         free_unref_page_list(&pages_to_free);
> @@ -983,26 +977,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>  void __pagevec_lru_add(struct pagevec *pvec)
>  {
>         int i;
> -       struct pglist_data *pgdat = NULL;
> -       struct lruvec *lruvec;
> +       struct lruvec *lruvec = NULL;
>         unsigned long flags = 0;
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
>                 struct page *page = pvec->pages[i];
> -               struct pglist_data *pagepgdat = page_pgdat(page);
> +               struct lruvec *new_lruvec;
>
> -               if (pagepgdat != pgdat) {
> -                       if (pgdat)
> -                               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                       pgdat = pagepgdat;
> -                       spin_lock_irqsave(&pgdat->lru_lock, flags);
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (lruvec != new_lruvec) {
> +                       if (lruvec)
> +                               unlock_page_lruvec_irqrestore(lruvec, flags);
> +                       lruvec = lock_page_lruvec_irqsave(page, &flags);
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 __pagevec_lru_add_fn(page, lruvec);
>         }
> -       if (pgdat)
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (lruvec)
> +               unlock_page_lruvec_irqrestore(lruvec, flags);
>         release_pages(pvec->pages, pvec->nr);
>         pagevec_reinit(pvec);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f77748adc340..168c1659e430 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1774,15 +1774,13 @@ int isolate_lru_page(struct page *page)
>         WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>
>         if (TestClearPageLRU(page)) {
> -               pg_data_t *pgdat = page_pgdat(page);
>                 struct lruvec *lruvec;
>                 int lru = page_lru(page);
>
>                 get_page(page);
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -               spin_lock_irq(&pgdat->lru_lock);
> +               lruvec = lock_page_lruvec_irq(page);
>                 del_page_from_lru_list(page, lruvec, lru);
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>                 ret = 0;
>         }
>
> @@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                                                      struct list_head *list)
>  {
> -       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         int nr_pages, nr_moved = 0;
>         LIST_HEAD(pages_to_free);
>         struct page *page;
> +       struct lruvec *orig_lruvec = lruvec;
>         enum lru_list lru;
>
>         while (!list_empty(list)) {
> +               struct lruvec *new_lruvec = NULL;
> +
>                 page = lru_to_page(list);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(&pgdat->lru_lock);
> +                       spin_unlock_irq(&lruvec->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(&pgdat->lru_lock);
> +                       spin_lock_irq(&lruvec->lru_lock);
>                         continue;
>                 }
>
> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                  *                                        list_add(&page->lru,)
>                  *     list_add(&page->lru,) //corrupt
>                  */
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (new_lruvec != lruvec) {
> +                       if (lruvec)
> +                               spin_unlock_irq(&lruvec->lru_lock);
> +                       lruvec = lock_page_lruvec_irq(page);
> +               }
>                 SetPageLRU(page);
>
>                 if (unlikely(put_page_testzero(page))) {
> @@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                         __ClearPageActive(page);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(&pgdat->lru_lock);
> +                               spin_unlock_irq(&lruvec->lru_lock);
>                                 destroy_compound_page(page);
> -                               spin_lock_irq(&pgdat->lru_lock);
> +                               spin_lock_irq(&lruvec->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>
>                         continue;
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 lru = page_lru(page);
>                 nr_pages = hpage_nr_pages(page);
>
> @@ -1902,6 +1907,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                 if (PageActive(page))
>                         workingset_age_nonresident(lruvec, nr_pages);
>         }
> +       if (orig_lruvec != lruvec) {
> +               if (lruvec)
> +                       spin_unlock_irq(&lruvec->lru_lock);
> +               spin_lock_irq(&orig_lruvec->lru_lock);
> +       }
>
>         /*
>          * To save our caller's stack, now use input list for pages to free.

Something like this seems much more readable than the block you had
above. It is what I would expect the relock code to look like.

> @@ -1957,7 +1967,7 @@ static int current_may_throttle(void)
>
>         lru_add_drain();
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>                                      &nr_scanned, sc, lru);
> @@ -1969,7 +1979,7 @@ static int current_may_throttle(void)
>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
>         __count_vm_events(PGSCAN_ANON + file, nr_scanned);
>
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         if (nr_taken == 0)
>                 return 0;
> @@ -1977,7 +1987,7 @@ static int current_may_throttle(void)
>         nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
>                                 &stat, false);
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>         move_pages_to_lru(lruvec, &page_list);
>
>         __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> @@ -1986,7 +1996,7 @@ static int current_may_throttle(void)
>                 __count_vm_events(item, nr_reclaimed);
>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>         __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         lru_note_cost(lruvec, file, stat.nr_pageout);
>         mem_cgroup_uncharge_list(&page_list);
> @@ -2039,7 +2049,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>         lru_add_drain();
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>                                      &nr_scanned, sc, lru);
> @@ -2049,7 +2059,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         __count_vm_events(PGREFILL, nr_scanned);
>         __count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
>
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         while (!list_empty(&l_hold)) {
>                 cond_resched();
> @@ -2095,7 +2105,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         /*
>          * Move pages back to the lru list.
>          */
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>
>         nr_activate = move_pages_to_lru(lruvec, &l_active);
>         nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
> @@ -2106,7 +2116,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
>
>         __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         mem_cgroup_uncharge_list(&l_active);
>         free_unref_page_list(&l_active);
> @@ -2696,10 +2706,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         /*
>          * Determine the scan balance between anon and file LRUs.
>          */
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&target_lruvec->lru_lock);
>         sc->anon_cost = target_lruvec->anon_cost;
>         sc->file_cost = target_lruvec->file_cost;
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&target_lruvec->lru_lock);
>
>         /*
>          * Target desirable inactive:active list ratios for the anon
> @@ -4275,24 +4285,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>   */
>  void check_move_unevictable_pages(struct pagevec *pvec)
>  {
> -       struct lruvec *lruvec;
> -       struct pglist_data *pgdat = NULL;
> +       struct lruvec *lruvec = NULL;
>         int pgscanned = 0;
>         int pgrescued = 0;
>         int i;
>
>         for (i = 0; i < pvec->nr; i++) {
>                 struct page *page = pvec->pages[i];
> -               struct pglist_data *pagepgdat = page_pgdat(page);
> +               struct lruvec *new_lruvec;
>
>                 pgscanned++;
> -               if (pagepgdat != pgdat) {
> -                       if (pgdat)
> -                               spin_unlock_irq(&pgdat->lru_lock);
> -                       pgdat = pagepgdat;
> -                       spin_lock_irq(&pgdat->lru_lock);
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (lruvec != new_lruvec) {
> +                       if (lruvec)
> +                               unlock_page_lruvec_irq(lruvec);
> +                       lruvec = lock_page_lruvec_irq(page);
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> @@ -4308,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>                 }
>         }
>
> -       if (pgdat) {
> +       if (lruvec) {
>                 __count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
>                 __count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>         }
>  }
>  EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 20/22] mm/vmscan: use relock for move_pages_to_lru
  2020-07-11  0:58 ` [PATCH v16 20/22] mm/vmscan: use relock for move_pages_to_lru Alex Shi
@ 2020-07-17 21:44     ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-17 21:44 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Andrey Ryabinin, Jann Horn

On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> From: Hugh Dickins <hughd@google.com>
>
> Use the relock function to replace relocking action. And try to save few
> lock times.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  mm/vmscan.c | 17 ++++++-----------
>  1 file changed, 6 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bdb53a678e7e..078a1640ec60 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1854,15 +1854,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>         enum lru_list lru;
>
>         while (!list_empty(list)) {
> -               struct lruvec *new_lruvec = NULL;
> -
>                 page = lru_to_page(list);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(&lruvec->lru_lock);
> +                       if (lruvec) {
> +                               spin_unlock_irq(&lruvec->lru_lock);
> +                               lruvec = NULL;
> +                       }
>                         putback_lru_page(page);
> -                       spin_lock_irq(&lruvec->lru_lock);
>                         continue;
>                 }
>
> @@ -1876,12 +1876,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                  *                                        list_add(&page->lru,)
>                  *     list_add(&page->lru,) //corrupt
>                  */
> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (new_lruvec != lruvec) {
> -                       if (lruvec)
> -                               spin_unlock_irq(&lruvec->lru_lock);
> -                       lruvec = lock_page_lruvec_irq(page);
> -               }
> +               lruvec = relock_page_lruvec_irq(page, lruvec);
>                 SetPageLRU(page);
>
>                 if (unlikely(put_page_testzero(page))) {
> @@ -1890,8 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>
>                         if (unlikely(PageCompound(page))) {
>                                 spin_unlock_irq(&lruvec->lru_lock);
> +                               lruvec = NULL;
>                                 destroy_compound_page(page);
> -                               spin_lock_irq(&lruvec->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>

It seems like this should just be rolled into patch 19. Otherwise, if
you want to treat it as a "further optimization" patch, you might pull
some of the optimizations you were pushing in patch 18 into this patch
as well, and call it out as adding relocks where there previously were
none.
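
As a sketch of the pattern in the quoted hunk (drop the lock, set
lruvec to NULL, and let the next relock call take it again), here is a
userspace model. The names are hypothetical: model_relock() is only
modelled on the relock_page_lruvec_irq() call site above, not on its
actual implementation, and move_pages_model() keeps the caller's
expectation that the original lruvec is locked on return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the lock is a flag plus an acquisition counter. */
struct lruvec_model { int locked; int acquisitions; };
struct page_model { struct lruvec_model *lruvec; bool evictable; };

static void model_lock(struct lruvec_model *lv)
{
	assert(!lv->locked);
	lv->locked = 1;
	lv->acquisitions++;
}

static void model_unlock(struct lruvec_model *lv)
{
	assert(lv->locked);
	lv->locked = 0;
}

/* Keep the lock if the page belongs to the locked lruvec, else switch. */
static struct lruvec_model *model_relock(struct page_model *page,
					 struct lruvec_model *locked)
{
	if (locked == page->lruvec)
		return locked;
	if (locked)
		model_unlock(locked);
	model_lock(page->lruvec);
	return page->lruvec;
}

/* Caller holds 'lruvec' locked on entry and expects it locked on return. */
static void move_pages_model(struct lruvec_model *lruvec,
			     struct page_model *pages, int nr)
{
	struct lruvec_model *orig = lruvec;

	for (int i = 0; i < nr; i++) {
		if (!pages[i].evictable) {
			/* Drop the lock for the putback; do not retake it yet. */
			if (lruvec) {
				model_unlock(lruvec);
				lruvec = NULL;
			}
			continue;
		}
		lruvec = model_relock(&pages[i], lruvec);
		/* ... add the page to its LRU list under the lock ... */
	}
	if (lruvec != orig) {
		if (lruvec)
			model_unlock(lruvec);
		model_lock(orig);
	}
}
```

For a list of two unevictable pages followed by one evictable page, all
in the same lruvec, the lock is taken twice in total (once by the
caller, once by the relock). An unlock/lock pair around every putback
would take it three times.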

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 19/22] mm/lru: introduce the relock_page_lruvec function
  2020-07-11  0:58 ` [PATCH v16 19/22] mm/lru: introduce the relock_page_lruvec function Alex Shi
@ 2020-07-17 22:03     ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-17 22:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Thomas Gleixner, Andrey Ryabinin

On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Use this new function to replace repeated same code, no func change.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  mm/mlock.c  |  9 +--------
>  mm/swap.c   | 33 +++++++--------------------------
>  mm/vmscan.c |  8 +-------
>  3 files changed, 9 insertions(+), 41 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index cb23a0c2cfbf..4f40fc091cf9 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -289,17 +289,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>         /* Phase 1: page isolation */
>         for (i = 0; i < nr; i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *new_lruvec;
>                 bool clearlru;
>
>                 clearlru = TestClearPageLRU(page);
> -
> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (new_lruvec != lruvec) {
> -                       if (lruvec)
> -                               unlock_page_lruvec_irq(lruvec);
> -                       lruvec = lock_page_lruvec_irq(page);
> -               }
> +               lruvec = relock_page_lruvec_irq(page, lruvec);
>
>                 if (!TestClearPageMlocked(page)) {
>                         delta_munlocked++;
> diff --git a/mm/swap.c b/mm/swap.c
> index 129c532357a4..9fb906fbaed5 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -209,19 +209,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *new_lruvec;
> -
> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (lruvec != new_lruvec) {
> -                       if (lruvec)
> -                               unlock_page_lruvec_irqrestore(lruvec, flags);
> -                       lruvec = lock_page_lruvec_irqsave(page, &flags);
> -               }
>
>                 /* block memcg migration during page moving between lru */
>                 if (!TestClearPageLRU(page))
>                         continue;
>
> +               lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>                 (*move_fn)(page, lruvec);
>
>                 SetPageLRU(page);

So looking at this I realize that patch 18 probably should have
ordered this the same way, with the TestClearPageLRU happening before
you fetched the new_lruvec. Otherwise I think you are potentially
exposed to the original issue you were fixing with the previous patch
that added the call to TestClearPageLRU.
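
The ordering concern above can be modelled in user space. What follows is
only a sketch under stated assumptions, not the kernel code: every demo_*
name is hypothetical, the LRU bit is a plain C11 atomic flag, and a "memcg
move" is modelled as an operation that must win that bit itself before it
may retarget the page. With the bit cleared first, the lruvec read
afterwards cannot change underneath the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PG_LRU 0x1u	/* hypothetical stand-in for the PG_lru bit */

struct demo_lruvec {
	int id;
};

struct demo_page {
	atomic_uint flags;		/* carries DEMO_PG_LRU */
	struct demo_lruvec *lruvec;	/* list the page currently belongs to */
};

void demo_page_init(struct demo_page *page, struct demo_lruvec *lruvec)
{
	atomic_init(&page->flags, DEMO_PG_LRU);
	page->lruvec = lruvec;
}

/* Model of TestClearPageLRU(): clear the bit, report whether it was set. */
bool demo_test_clear_lru(struct demo_page *page)
{
	unsigned int old = atomic_fetch_and(&page->flags, ~DEMO_PG_LRU);

	return (old & DEMO_PG_LRU) != 0;
}

/* Model of SetPageLRU(). */
void demo_set_lru(struct demo_page *page)
{
	atomic_fetch_or(&page->flags, DEMO_PG_LRU);
}

/*
 * Model of a memcg move (assumption): it has to take the LRU bit before
 * it may change the page's lruvec, so it fails while the bit is held.
 */
bool demo_try_move(struct demo_page *page, struct demo_lruvec *to)
{
	if (!demo_test_clear_lru(page))
		return false;
	page->lruvec = to;
	demo_set_lru(page);
	return true;
}

/*
 * Ordering asked for in the review: clear the bit first, look up the
 * lruvec second. Returns NULL when the page was not on an LRU list;
 * otherwise the caller calls demo_set_lru() once it is done.
 */
struct demo_lruvec *demo_pin_and_lookup(struct demo_page *page)
{
	if (!demo_test_clear_lru(page))
		return NULL;
	assert(page->lruvec != NULL);
	return page->lruvec;
}
```

A caller would pin the page with demo_pin_and_lookup(), do its list work,
and then call demo_set_lru(). Between those two calls demo_try_move()
returns false, which is the property the TestClearPageLRU ordering is
meant to provide.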

> @@ -866,17 +859,12 @@ void release_pages(struct page **pages, int nr)
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct lruvec *new_lruvec;
> -
> -                       new_lruvec = mem_cgroup_page_lruvec(page,
> -                                                       page_pgdat(page));
> -                       if (new_lruvec != lruvec) {
> -                               if (lruvec)
> -                                       unlock_page_lruvec_irqrestore(lruvec,
> -                                                                       flags);
> +                       struct lruvec *pre_lruvec = lruvec;
> +
> +                       lruvec = relock_page_lruvec_irqsave(page, lruvec,
> +                                                                       &flags);
> +                       if (pre_lruvec != lruvec)

So this doesn't really read right. I suppose "pre_lruvec" should
probably be "prev_lruvec" since I assume you mean "previous" not
"before".

>                                 lock_batch = 0;
> -                               lruvec = lock_page_lruvec_irqsave(page, &flags);
> -                       }
>
>                         __ClearPageLRU(page);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -982,15 +970,8 @@ void __pagevec_lru_add(struct pagevec *pvec)
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *new_lruvec;
> -
> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (lruvec != new_lruvec) {
> -                       if (lruvec)
> -                               unlock_page_lruvec_irqrestore(lruvec, flags);
> -                       lruvec = lock_page_lruvec_irqsave(page, &flags);
> -               }
>
> +               lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>                 __pagevec_lru_add_fn(page, lruvec);
>         }
>         if (lruvec)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 168c1659e430..bdb53a678e7e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4292,15 +4292,9 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>
>         for (i = 0; i < pvec->nr; i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *new_lruvec;
>
>                 pgscanned++;
> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (lruvec != new_lruvec) {
> -                       if (lruvec)
> -                               unlock_page_lruvec_irq(lruvec);
> -                       lruvec = lock_page_lruvec_irq(page);
> -               }
> +               lruvec = relock_page_lruvec_irq(page, lruvec);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> --
> 1.8.3.1
>
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 19/22] mm/lru: introduce the relock_page_lruvec function
  2020-07-17 22:03     ` Alexander Duyck
  (?)
@ 2020-07-18 14:01     ` Alex Shi
  -1 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-18 14:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Thomas Gleixner, Andrey Ryabinin



On 2020/7/18 6:03 AM, Alexander Duyck wrote:
>> index 129c532357a4..9fb906fbaed5 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -209,19 +209,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>>
>>         for (i = 0; i < pagevec_count(pvec); i++) {
>>                 struct page *page = pvec->pages[i];
>> -               struct lruvec *new_lruvec;
>> -
>> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>> -               if (lruvec != new_lruvec) {
>> -                       if (lruvec)
>> -                               unlock_page_lruvec_irqrestore(lruvec, flags);
>> -                       lruvec = lock_page_lruvec_irqsave(page, &flags);
>> -               }
>>
>>                 /* block memcg migration during page moving between lru */
>>                 if (!TestClearPageLRU(page))
>>                         continue;
>>
>> +               lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>>                 (*move_fn)(page, lruvec);
>>
>>                 SetPageLRU(page);
> So looking at this I realize that patch 18 probably should have
> ordered this the same way, with the TestClearPageLRU happening before
> you fetched the new_lruvec. Otherwise I think you are potentially
> exposed to the original issue you were fixing with the previous patch
> that added the call to TestClearPageLRU.

Good catch. It would be better to align the ordering in the next version.
Thanks!

> 
>> @@ -866,17 +859,12 @@ void release_pages(struct page **pages, int nr)
>>                 }
>>
>>                 if (PageLRU(page)) {
>> -                       struct lruvec *new_lruvec;
>> -
>> -                       new_lruvec = mem_cgroup_page_lruvec(page,
>> -                                                       page_pgdat(page));
>> -                       if (new_lruvec != lruvec) {
>> -                               if (lruvec)
>> -                                       unlock_page_lruvec_irqrestore(lruvec,
>> -                                                                       flags);
>> +                       struct lruvec *pre_lruvec = lruvec;
>> +
>> +                       lruvec = relock_page_lruvec_irqsave(page, lruvec,
>> +                                                                       &flags);
>> +                       if (pre_lruvec != lruvec)
> So this doesn't really read right. I suppose "pre_lruvec" should
> probably be "prev_lruvec" since I assume you mean "previous" not
> "before".

Yes, it means "previous". I will rename it.
Thanks
Alex
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-17 21:38     ` Alexander Duyck
  (?)
@ 2020-07-18 14:15     ` Alex Shi
  2020-07-19  9:12       ` Alex Shi
  -1 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-18 14:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov, Rong Chen



On 2020/7/18 5:38 AM, Alexander Duyck wrote:
>> +               return locked_lruvec;
>> +
>> +       if (locked_lruvec)
>> +               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
>> +
>> +       return lock_page_lruvec_irqsave(page, flags);
>> +}
>> +
> These relock functions have no users in this patch. It might make
> sense and push this code to patch 19 in your series since that is
> where they are first used. In addition they don't seem very efficient
> as you already had to call mem_cgroup_page_lruvec once, why do it
> again when you could just store the value and lock the new lruvec if
> needed?

Right, it's better to move them to the later patch.

As for calling the function again, that is mainly to keep the code neat.

Thanks!
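
One way to read the suggestion quoted above is a relock helper that looks
the lruvec up once and then locks exactly that value. The following is a
single-threaded sketch with hypothetical demo_* names, assuming a toy lock
that only counts acquisitions instead of blocking. It is not the helper
from the patch, only an illustration of the "keep the lock if the lruvec is
unchanged, otherwise swap locks" pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock (assumption): records state and counts acquisitions. */
struct demo_lruvec {
	int locked;
	int acquisitions;
};

struct demo_page {
	struct demo_lruvec *lruvec;
};

void demo_lock(struct demo_lruvec *lruvec)
{
	assert(!lruvec->locked);
	lruvec->locked = 1;
	lruvec->acquisitions++;
}

void demo_unlock(struct demo_lruvec *lruvec)
{
	assert(lruvec->locked);
	lruvec->locked = 0;
}

/*
 * Look the lruvec up once, keep the current lock when it already
 * matches, otherwise drop the old lock and take the one just found.
 */
struct demo_lruvec *demo_relock(struct demo_page *page,
				struct demo_lruvec *locked)
{
	struct demo_lruvec *wanted = page->lruvec;	/* single lookup */

	if (wanted == locked)
		return locked;
	if (locked)
		demo_unlock(locked);
	demo_lock(wanted);
	return wanted;
}

/* Walk a batch of pages the way the converted loops are structured. */
void demo_walk(struct demo_page *pages, size_t nr)
{
	struct demo_lruvec *lruvec = NULL;
	size_t i;

	for (i = 0; i < nr; i++)
		lruvec = demo_relock(&pages[i], lruvec);
	if (lruvec)
		demo_unlock(lruvec);
}
```

For a batch whose pages are grouped by lruvec, demo_walk() takes each lock
once, so the number of acquisitions follows the number of lruvec changes in
the batch rather than the number of pages.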

> 
>>  #ifdef CONFIG_CGROUP_WRITEBACK
>>
>>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 14c668b7e793..36c1680efd90 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -261,6 +261,8 @@ struct lruvec {
>>         atomic_long_t                   nonresident_age;
>>         /* Refaults at the time of last reclaim cycle */
>>         unsigned long                   refaults;
>> +       /* per lruvec lru_lock for memcg */
>> +       spinlock_t                      lru_lock;
>>         /* Various lruvec state flags (enum lruvec_flags) */
>>         unsigned long                   flags;
> Any reason for placing this here instead of at the end of the
> structure? From what I can tell it looks like lruvec is already 128B
> long so placing the lock on the end would put it into the next
> cacheline which may provide some performance benefit since it is
> likely to be bounced quite a bit.

Rong Chen (Cc'ed) once reported a performance regression when the lock was
at the end of the struct, and moving it here removed the regression.
I cannot reproduce it myself, but I trust his report.
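
The placement question comes down to which cache line the lock shares with
the other fields. A small sketch of how to check that with offsetof(),
assuming 64-byte cache lines and two hypothetical layouts (the real struct
lruvec layout depends on the kernel configuration and is not reproduced
here):

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CACHELINE 64	/* assumed line size; real hardware varies */

/* Hypothetical layout: lock placed after 128 bytes of other fields. */
struct demo_lock_at_end {
	unsigned char other[128];
	int lock;
};

/* Hypothetical layout: lock placed among the other fields. */
struct demo_lock_in_middle {
	unsigned char before[96];
	int lock;
	unsigned char after[32];
};

/* The offsets are fixed by the array sizes chosen above. */
static_assert(offsetof(struct demo_lock_at_end, lock) == 128,
	      "lock follows 128 bytes of other fields");
static_assert(offsetof(struct demo_lock_in_middle, lock) == 96,
	      "lock follows 96 bytes of other fields");

/* Index of the cache line holding a field at the given byte offset. */
size_t demo_cacheline_of(size_t offset)
{
	return offset / DEMO_CACHELINE;
}
```

In the first layout the lock sits alone at the start of a third cache line.
In the second it shares the second line with the neighbouring fields. Which
of the two performs better is a measurement question, as the regression
report mentioned above suggests.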

...

>>  putback:
>> -               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>>                 pagevec_add(&pvec_putback, pvec->pages[i]);
>>                 pvec->pages[i] = NULL;
>>         }
>> -       /* tempary disable irq, will remove later */
>> -       local_irq_disable();
>>         __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
>> -       local_irq_enable();
>> +       if (lruvec)
>> +               unlock_page_lruvec_irq(lruvec);
> So I am not a fan of this change. You went to all the trouble of
> reducing the lock scope just to bring it back out here again. In
> addition it implies there is a path where you might try to update the
> page state without disabling interrupts.

Right, but is there any way to avoid this other than an extra local_irq_disable()?

...

>>                 if (PageLRU(page)) {
>> -                       struct pglist_data *pgdat = page_pgdat(page);
>> +                       struct lruvec *new_lruvec;
>>
>> -                       if (pgdat != locked_pgdat) {
>> -                               if (locked_pgdat)
>> -                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
>> +                       new_lruvec = mem_cgroup_page_lruvec(page,
>> +                                                       page_pgdat(page));
>> +                       if (new_lruvec != lruvec) {
>> +                               if (lruvec)
>> +                                       unlock_page_lruvec_irqrestore(lruvec,
>>                                                                         flags);
>>                                 lock_batch = 0;
>> -                               locked_pgdat = pgdat;
>> -                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>> +                               lruvec = lock_page_lruvec_irqsave(page, &flags);
>>                         }
> This just kind of seems ugly to me. I am not a fan of having to fetch
> the lruvec twice when you already have it in new_lruvec. I suppose it
> is fine though since you are just going to be replacing it later
> anyway.
> 

Yes, it will be replaced later.

Thanks
Alex

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 20/22] mm/vmscan: use relock for move_pages_to_lru
  2020-07-17 21:44     ` Alexander Duyck
  (?)
@ 2020-07-18 14:15     ` Alex Shi
  -1 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-18 14:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Andrey Ryabinin, Jann Horn



On 2020/7/18 5:44 AM, Alexander Duyck wrote:
>>                         if (unlikely(PageCompound(page))) {
>>                                 spin_unlock_irq(&lruvec->lru_lock);
>> +                               lruvec = NULL;
>>                                 destroy_compound_page(page);
>> -                               spin_lock_irq(&lruvec->lru_lock);
>>                         } else
>>                                 list_add(&page->lru, &pages_to_free);
>>
> It seems like this should just be rolled into patch 19. Otherwise if
> you are wanting to consider it as a "further optimization" type patch
> you might pull some of the optimizations you were pushing in patch 18
> into this patch as well and just call it out as adding relocks where
> there previously were none.

This patch was picked from Hugh Dickins' version during my review. Keeping it
as an extra patch should be fine, since it does no harm to anyone. :)

Thanks
Alex

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 21/22] mm/pgdat: remove pgdat lru_lock
  2020-07-17 21:09     ` Alexander Duyck
  (?)
@ 2020-07-18 14:17     ` Alex Shi
  -1 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-18 14:17 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov



On 2020/7/18 5:09 AM, Alexander Duyck wrote:
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e028b87ce294..4d7df42b32d6 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6721,7 +6721,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>>         init_waitqueue_head(&pgdat->pfmemalloc_wait);
>>
>>         pgdat_page_ext_init(pgdat);
>> -       spin_lock_init(&pgdat->lru_lock);
>>         lruvec_init(&pgdat->__lruvec);
>>  }
>>
> This patch would probably make more sense as part of patch 18 since
> you removed all of the users of this field there.


Yes, I just wanted a bit of ceremony for removing this huge lock. :)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock
  2020-07-17 20:30     ` Alexander Duyck
  (?)
@ 2020-07-19  3:55     ` Alex Shi
  2020-07-20 18:51         ` Alexander Duyck
  -1 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-19  3:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov



On 2020/7/18 4:30 AM, Alexander Duyck wrote:
> On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>> This patch reorder the isolation steps during munlock, move the lru lock
>> to guard each pages, unfold __munlock_isolate_lru_page func, to do the
>> preparation for lru lock change.
>>
>> __split_huge_page_refcount doesn't exist, but we still have to guard
>> PageMlocked and PageLRU for tail page in __split_huge_page_tail.
>>
>> [lkp@intel.com: found a sleeping function bug ... at mm/rmap.c]
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  mm/mlock.c | 93 ++++++++++++++++++++++++++++++++++----------------------------
>>  1 file changed, 51 insertions(+), 42 deletions(-)
>>
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index 228ba5a8e0a5..0bdde88b4438 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -103,25 +103,6 @@ void mlock_vma_page(struct page *page)
>>  }
>>
>>  /*
>> - * Isolate a page from LRU with optional get_page() pin.
>> - * Assumes lru_lock already held and page already pinned.
>> - */
>> -static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
>> -{
>> -       if (TestClearPageLRU(page)) {
>> -               struct lruvec *lruvec;
>> -
>> -               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>> -               if (getpage)
>> -                       get_page(page);
>> -               del_page_from_lru_list(page, lruvec, page_lru(page));
>> -               return true;
>> -       }
>> -
>> -       return false;
>> -}
>> -
>> -/*
>>   * Finish munlock after successful page isolation
>>   *
>>   * Page must be locked. This is a wrapper for try_to_munlock()
>> @@ -181,6 +162,7 @@ static void __munlock_isolation_failed(struct page *page)
>>  unsigned int munlock_vma_page(struct page *page)
>>  {
>>         int nr_pages;
>> +       bool clearlru = false;
>>         pg_data_t *pgdat = page_pgdat(page);
>>
>>         /* For try_to_munlock() and to serialize with page migration */
>> @@ -189,32 +171,42 @@ unsigned int munlock_vma_page(struct page *page)
>>         VM_BUG_ON_PAGE(PageTail(page), page);
>>
>>         /*
>> -        * Serialize with any parallel __split_huge_page_refcount() which
>> +        * Serialize split tail pages in __split_huge_page_tail() which
>>          * might otherwise copy PageMlocked to part of the tail pages before
>>          * we clear it in the head page. It also stabilizes hpage_nr_pages().
>>          */
>> +       get_page(page);
> 
> I don't think this get_page() call needs to be up here. It could be
> left down before we delete the page from the LRU list as it is really
> needed to take a reference on the page before we call
> __munlock_isolated_page(), or at least that is the way it looks to me.
> By doing that you can avoid a bunch of cleanup in these exception
> cases.

Hmm, it seems unlikely that page->_refcount drops to zero and the page then
goes to release_pages(). If so, the get_page() could indeed move down.
Thanks

> 
>> +       clearlru = TestClearPageLRU(page);
> 
> I'm not sure I fully understand the reason for moving this here. By
> clearing this flag before you clear Mlocked does this give you some
> sort of extra protection? I don't see how since Mlocked doesn't
> necessarily imply the page is on LRU.
> 

The comment above gives the reason for the lru_lock usage:
>> +        * Serialize split tail pages in __split_huge_page_tail() which
>>          * might otherwise copy PageMlocked to part of the tail pages before
>>          * we clear it in the head page. It also stabilizes hpage_nr_pages().

Looking at __split_huge_page_tail(), there is a tiny gap between the tail
page getting PG_mlocked and it being added to the lru list.
TestClearPageLRU can block memcg changes of the page by stopping
isolate_lru_page().


>>         spin_lock_irq(&pgdat->lru_lock);
>>
>>         if (!TestClearPageMlocked(page)) {
>> -               /* Potentially, PTE-mapped THP: do not skip the rest PTEs */
>> -               nr_pages = 1;
>> -               goto unlock_out;
>> +               if (clearlru)
>> +                       SetPageLRU(page);
>> +               /*
>> +                * Potentially, PTE-mapped THP: do not skip the rest PTEs
>> +                * Reuse lock as memory barrier for release_pages racing.
>> +                */
>> +               spin_unlock_irq(&pgdat->lru_lock);
>> +               put_page(page);
>> +               return 0;
>>         }
>>
>>         nr_pages = hpage_nr_pages(page);
>>         __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
>>
>> -       if (__munlock_isolate_lru_page(page, true)) {
>> +       if (clearlru) {
>> +               struct lruvec *lruvec;
>> +
> 
> You could just place the get_page() call here.
> 
>> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>> +               del_page_from_lru_list(page, lruvec, page_lru(page));
>>                 spin_unlock_irq(&pgdat->lru_lock);
>>                 __munlock_isolated_page(page);
>> -               goto out;
>> +       } else {
>> +               spin_unlock_irq(&pgdat->lru_lock);
>> +               put_page(page);
>> +               __munlock_isolation_failed(page);
> 
> If you move the get_page() as I suggested above there wouldn't be a
> need for the put_page(). It then becomes possible to simplify the code
> a bit by merging the unlock paths and doing an if/else with the
> __munlock functions like so:
> if (clearlru) {
>     ...
>     del_page_from_lru..
> }
> 
> spin_unlock_irq()
> 
> if (clearlru)
>     __munlock_isolated_page();
> else
>     __munlock_isolation_failed();
> 
>>         }
>> -       __munlock_isolation_failed(page);
>> -
>> -unlock_out:
>> -       spin_unlock_irq(&pgdat->lru_lock);
>>
>> -out:
>>         return nr_pages - 1;
>>  }
>>
>> @@ -297,34 +289,51 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>>         pagevec_init(&pvec_putback);
>>
>>         /* Phase 1: page isolation */
>> -       spin_lock_irq(&zone->zone_pgdat->lru_lock);
>>         for (i = 0; i < nr; i++) {
>>                 struct page *page = pvec->pages[i];
>> +               struct lruvec *lruvec;
>> +               bool clearlru;
>>
>> -               if (TestClearPageMlocked(page)) {
>> -                       /*
>> -                        * We already have pin from follow_page_mask()
>> -                        * so we can spare the get_page() here.
>> -                        */
>> -                       if (__munlock_isolate_lru_page(page, false))
>> -                               continue;
>> -                       else
>> -                               __munlock_isolation_failed(page);
>> -               } else {
>> +               clearlru = TestClearPageLRU(page);
>> +               spin_lock_irq(&zone->zone_pgdat->lru_lock);
> 
> I still don't see what you are gaining by moving the bit test up to
> this point. Seems like it would be better left below with the lock
> just being used to prevent a possible race while you are pulling the
> page out of the LRU list.
> 

The same reason as above: the __split_huge_page_tail() issue mentioned in the
earlier comments.

>> +
>> +               if (!TestClearPageMlocked(page)) {
>>                         delta_munlocked++;
>> +                       if (clearlru)
>> +                               SetPageLRU(page);
>> +                       goto putback;
>> +               }
>> +
>> +               if (!clearlru) {
>> +                       __munlock_isolation_failed(page);
>> +                       goto putback;
>>                 }
> 
> With the other function you were processing this outside of the lock,
> here you are doing it inside. It would probably make more sense here
> to follow similar logic and take care of the del_page_from_lru_list
> if clearlru is set, unlock, and then if clearlru is set continue else
> track the isolation failure. That way you can avoid having to use as
> many jump labels.
> 
>>                 /*
>> +                * Isolate this page.
>> +                * We already have pin from follow_page_mask()
>> +                * so we can spare the get_page() here.
>> +                */
>> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>> +               del_page_from_lru_list(page, lruvec, page_lru(page));
>> +               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>> +               continue;
>> +
>> +               /*
>>                  * We won't be munlocking this page in the next phase
>>                  * but we still need to release the follow_page_mask()
>>                  * pin. We cannot do it under lru_lock however. If it's
>>                  * the last pin, __page_cache_release() would deadlock.
>>                  */
>> +putback:
>> +               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>>                 pagevec_add(&pvec_putback, pvec->pages[i]);
>>                 pvec->pages[i] = NULL;
>>         }
>> +       /* tempary disable irq, will remove later */
>> +       local_irq_disable();
>>         __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
>> -       spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>> +       local_irq_enable();
>>
>>         /* Now we can release pins of pages that we are not munlocking */
>>         pagevec_release(&pvec_putback);
>> --
>> 1.8.3.1
>>
>>

* Re: [PATCH v16 15/22] mm/compaction: do page isolation first in compaction
  2020-07-17 16:09         ` Alexander Duyck
  (?)
@ 2020-07-19  3:59         ` Alex Shi
  -1 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-19  3:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov



On 2020/7/18 12:09 AM, Alexander Duyck wrote:
>>> I wonder if it wouldn't make sense to combine these two atomic ops
>>> with tests and the put_page into a single inline function? Then it
>>> could be possible to just do one check and if succeeds you do the
>>> block of code below, otherwise you just fall-through into the -EBUSY
>>> case.
>>>
>> Uh, since get_page changes page->_refcount, TestClearPageLRU changes page->flags,
>> So I don't know how to combine them, could you make it more clear with code?
> Actually it is pretty straight forward. Something like this:
> static inline bool get_page_unless_zero_or_nonlru(struct page *page)
> {
>     if (get_page_unless_zero(page)) {
>         if (TestClearPageLRU(page))
>             return true;
>         put_page(page);
>     }
>     return false;
> }
> 
> You can then add comments as necessary. The general idea is you are
> having to do this in two different spots anyway so why not combine the
> logic? Although it does assume you can change the ordering of the
> other test above.


It doesn't look different from the original code, does it?

Thanks
Alex
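
For illustration, the combined helper suggested above can be modeled in user
space with C11 atomics. This is only a sketch: struct page_model, the model_*
names and the PG_LRU_BIT position are made up for the example and stand in for
struct page, get_page_unless_zero(), TestClearPageLRU() and put_page(). None of
it is the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LRU_BIT (1UL << 0)	/* hypothetical bit position for this model */

/* Hypothetical stand-in for struct page: a refcount and a flags word. */
struct page_model {
	atomic_int refcount;
	atomic_ulong flags;
};

/* Model of get_page_unless_zero(): take a reference only if count != 0. */
static bool model_get_page_unless_zero(struct page_model *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Model of TestClearPageLRU(): atomically clear the bit, report old state. */
static bool model_test_clear_lru(struct page_model *p)
{
	return atomic_fetch_and(&p->flags, ~PG_LRU_BIT) & PG_LRU_BIT;
}

/* Model of put_page(): drop one reference (freeing is not modeled). */
static void model_put_page(struct page_model *p)
{
	int old = atomic_fetch_sub(&p->refcount, 1);

	assert(old > 0);
	(void)old;
}

/* Succeeds only if both a reference and the lru bit were obtained. */
static bool model_get_page_unless_zero_or_nonlru(struct page_model *p)
{
	if (model_get_page_unless_zero(p)) {
		if (model_test_clear_lru(p))
			return true;
		model_put_page(p);
	}
	return false;
}
```

A caller branches once on the return value: on true it holds both a reference
and the cleared lru bit, on false it holds neither and can fall through to the
-EBUSY case.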

* Re: [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU
  2020-07-17 18:26         ` Alexander Duyck
  (?)
@ 2020-07-19  4:45         ` Alex Shi
  2020-07-19 11:24           ` Alex Shi
  -1 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-19  4:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov



On 2020/7/18 2:26 AM, Alexander Duyck wrote:
> On Fri, Jul 17, 2020 at 12:46 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2020/7/17 5:12 AM, Alexander Duyck wrote:
>>> On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>>
>>>> Combine PageLRU check and ClearPageLRU into a function by new
>>>> introduced func TestClearPageLRU. This function will be used as page
>>>> isolation precondition to prevent other isolations some where else.
>>>> Then there are may non PageLRU page on lru list, need to remove BUG
>>>> checking accordingly.
>>>>
>>>> Hugh Dickins pointed that __page_cache_release and release_pages
>>>> has no need to do atomic clear bit since no user on the page at that
>>>> moment. and no need get_page() before lru bit clear in isolate_lru_page,
>>>> since it '(1) Must be called with an elevated refcount on the page'.
>>>>
>>>> As Andrew Morton mentioned this change would dirty cacheline for page
>>>> isn't on LRU. But the lost would be acceptable with Rong Chen
>>>> <rong.a.chen@intel.com> report:
>>>> https://lkml.org/lkml/2020/3/4/173
>>>>
>>
>> ...
>>
>>>> diff --git a/mm/swap.c b/mm/swap.c
>>>> index f645965fde0e..5092fe9c8c47 100644
>>>> --- a/mm/swap.c
>>>> +++ b/mm/swap.c
>>>> @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
>>>>                 struct lruvec *lruvec;
>>>>                 unsigned long flags;
>>>>
>>>> +               __ClearPageLRU(page);
>>>>                 spin_lock_irqsave(&pgdat->lru_lock, flags);
>>>>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
>>>> -               VM_BUG_ON_PAGE(!PageLRU(page), page);
>>>> -               __ClearPageLRU(page);
>>>>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
>>>>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>>>>         }
>>>
>>> So this piece doesn't make much sense to me. Why not use
>>> TestClearPageLRU(page) here? Just a few lines above you are testing
>>> for PageLRU(page) and it seems like if you are going to go for an
>>> atomic test/clear and then remove the page from the LRU list you
>>> should be using it here as well otherwise it seems like you could run
>>> into a potential collision since you are testing here without clearing
>>> the bit.
>>>
>>
>> Hi Alex,
>>
>> Thanks a lot for comments!
>>
>> In this func's call path __page_cache_release, the page is unlikely be
>> ClearPageLRU, since this page isn't used by anyone, and going to be freed.
>> just __ClearPageLRU would be safe, and could save a non lru page flags disturb.
> 
> So if I understand what you are saying correctly you are indicating
> that this page should likely not have the LRU flag set and that we
> just transitioned it from 1 -> 0 so there should be nobody else
> accessing it correct?
> 
> It might be useful to document this somewhere. Essentially what we are
> doing then is breaking this up into the following cases.
> 
> 1. Setting the LRU bit requires holding the LRU lock
> 2. Clearing the LRU bit requires either:
>         a. Use of atomic operations if page count is 1 or more
>         b. Non-atomic operations can be used if we cleared last reference count
> 
> Is my understanding on this correct? So we have essentially two
> scenarios, one for the get_page_unless_zero case, and another with the
> put_page_testzero.

The summary isn't incorrect.
The key points for me are:
1. Generally, the lru bit indicates whether the page is on the lru list. Only in some
temporary moments (while isolating) may the page have no bit set while it is still on the
lru list. That implies the page must be on the lru list whenever the lru bit is set.
2. The lru bit has to be cleared before the page is deleted from the lru list.
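
The two points above can be pictured with a small user-space model. It is a
sketch under stated assumptions, not kernel code: struct page_model, its two
fields and the model_* helpers are invented for the example, the lru_lock
itself is not modeled, and an atomic exchange stands in for
TestClearPageLRU().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page: only the two facts that matter here. */
struct page_model {
	atomic_bool lru_bit;	/* stands in for PG_lru */
	bool on_list;		/* stands in for membership of an lru list */
};

/* Point 1: a set lru bit implies the page is on the list. */
static bool model_invariant_holds(struct page_model *p)
{
	return !atomic_load(&p->lru_bit) || p->on_list;
}

/* Atomically clear the bit; true means this caller won the isolation. */
static bool model_test_clear_lru(struct page_model *p)
{
	return atomic_exchange(&p->lru_bit, false);
}

/* Give the bit back when the page is to stay on the list after all. */
static void model_restore_lru(struct page_model *p)
{
	assert(p->on_list);
	atomic_store(&p->lru_bit, true);
}

/* Point 2: clear the bit first, delete from the list afterwards. */
static bool model_isolate(struct page_model *p)
{
	if (!model_test_clear_lru(p))
		return false;	/* not on lru, or someone else is isolating */

	/* Temporary moment: bit clear, page still on the list. */
	assert(model_invariant_holds(p));

	p->on_list = false;	/* list deletion, done under lru_lock in the kernel */
	return true;
}
```

The bit-clear-then-delete order is what keeps the invariant true at every
step. Deleting first would leave a moment where the bit is set on a page that
is no longer on any list.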

> 
>>>> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
>>>>                                 spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>>>>                         }
>>>>
>>>> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>>>> -                       VM_BUG_ON_PAGE(!PageLRU(page), page);
>>>>                         __ClearPageLRU(page);
>>>> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>>>>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
>>>>                 }
>>>>
>>>
>>> Same here. You are just moving the flag clearing, but you didn't
>>> combine it with the test. It seems like if you are expecting this to
>>> be treated as an atomic operation. It should be a relatively low cost
>>> to do since you already should own the cacheline as a result of
>>> calling put_page_testzero so I am not sure why you are not combining
>>> the two.
>>
>> before the ClearPageLRU, there is a put_page_testzero(), that means no one using
>> this page, and isolate_lru_page can not run on this page the in func checking.
>>         VM_BUG_ON_PAGE(!page_count(page), page);
>> So it would be safe here.
> 
> Okay, so this is another 2b case as defined above then.
> 
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index c1c4259b4de5..18986fefd49b 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1548,16 +1548,16 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>>>>  {
>>>>         int ret = -EINVAL;
>>>>
>>>> -       /* Only take pages on the LRU. */
>>>> -       if (!PageLRU(page))
>>>> -               return ret;
>>>> -
>>>>         /* Compaction should not handle unevictable pages but CMA can do so */
>>>>         if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>>>>                 return ret;
>>>>
>>>>         ret = -EBUSY;
>>>>
>>>> +       /* Only take pages on the LRU. */
>>>> +       if (!PageLRU(page))
>>>> +               return ret;
>>>> +
>>>>         /*
>>>>          * To minimise LRU disruption, the caller can indicate that it only
>>>>          * wants to isolate pages it will be able to operate on without
>>>> @@ -1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>>>>                 page = lru_to_page(src);
>>>>                 prefetchw_prev_lru_page(page, src, flags);
>>>>
>>>> -               VM_BUG_ON_PAGE(!PageLRU(page), page);
>>>> -
>>>>                 nr_pages = compound_nr(page);
>>>>                 total_scan += nr_pages;
>>>>
>>>
>>> So effectively the changes here are making it so that a !PageLRU page
>>> will cycle to the start of the LRU list. Now if I understand correctly
>>> we are guaranteed that if the flag is not set it cannot be set while
>>> we are holding the lru_lock, however it can be cleared while we are
>>> holding the lock, correct? Thus that is why isolate_lru_pages has to
>>> call TestClearPageLRU after the earlier check in __isolate_lru_page.
>>
>> Right.
>>
>>>
>>> It might make it more readable to pull in the later patch that
>>> modifies isolate_lru_pages that has it using TestClearPageLRU.
>> As to this change, It has to do in this patch, since any TestClearPageLRU may
>> cause lru bit miss in the lru list, so the precondication check has to
>> removed here.
> 
> So I think some of my cognitive dissonance is from the fact that you
> really are doing two different things here. You aren't really
> implementing the full TestClearPageLRU until patch 15. So this patch
> is doing part of 2a and 2b, and then patch 15 is following up and
> completing the 2a cases. I still think it might make more sense to
> pull out the pieces related to 2b and move them into a patch before
> this with documentation explaining that there should be no competition
> for the LRU flag because the page has transitioned to a reference
> count of zero. Then take the remaining bits and combine them with
> patch 15 since the description for the two is pretty similar.
> 

Good suggestion, I will consider this.

Thanks

* Re: [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-18 14:15     ` Alex Shi
@ 2020-07-19  9:12       ` Alex Shi
  2020-07-19 15:14           ` Alexander Duyck
  0 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-19  9:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov, Rong Chen



On 2020/7/18 10:15 PM, Alex Shi wrote:
>>>
>>>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index 14c668b7e793..36c1680efd90 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -261,6 +261,8 @@ struct lruvec {
>>>         atomic_long_t                   nonresident_age;
>>>         /* Refaults at the time of last reclaim cycle */
>>>         unsigned long                   refaults;
>>> +       /* per lruvec lru_lock for memcg */
>>> +       spinlock_t                      lru_lock;
>>>         /* Various lruvec state flags (enum lruvec_flags) */
>>>         unsigned long                   flags;
>> Any reason for placing this here instead of at the end of the
>> structure? From what I can tell it looks like lruvec is already 128B
>> long so placing the lock on the end would put it into the next
>> cacheline which may provide some performance benefit since it is
>> likely to be bounced quite a bit.
> Rong Chen(Cced) once reported a performance regression when the lock at
> the end of struct, and move here could remove it.
> Although I can't not reproduce. But I trust his report.
> 
Oops, Rong's report was about a different member, not the one in the current
struct.

Compared to moving it to the tail, how about moving it to the head of the struct,
close to the lru lists? Do you have any data on the placement change?

Thanks
Alex

 
> ...
> 
>>>  putback:
>>> -               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>>>                 pagevec_add(&pvec_putback, pvec->pages[i]);
>>>                 pvec->pages[i] = NULL;
>>>         }
>>> -       /* tempary disable irq, will remove later */
>>> -       local_irq_disable();
>>>         __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
>>> -       local_irq_enable();
>>> +       if (lruvec)
>>> +               unlock_page_lruvec_irq(lruvec);
>> So I am not a fan of this change. You went to all the trouble of
>> reducing the lock scope just to bring it back out here again. In
>> addition it implies there is a path where you might try to update the
>> page state without disabling interrupts.
> Right. but any idea to avoid this except a extra local_irq_disable?
> 

The following changes would resolve the problem. Is this ok?
@@ -324,7 +322,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
                pagevec_add(&pvec_putback, pvec->pages[i]);
                pvec->pages[i] = NULL;
        }
-       __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+       if (delta_munlocked)
+               __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
        if (lruvec)
                unlock_page_lruvec_irq(lruvec);

* Re: [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU
  2020-07-19  4:45         ` Alex Shi
@ 2020-07-19 11:24           ` Alex Shi
  0 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-19 11:24 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov



On 2020/7/19 12:45 PM, Alex Shi wrote:
>>>
>>>> It might make it more readable to pull in the later patch that
>>>> modifies isolate_lru_pages that has it using TestClearPageLRU.
>>> As to this change, It has to do in this patch, since any TestClearPageLRU may
>>> cause lru bit miss in the lru list, so the precondication check has to
>>> removed here.
>> So I think some of my cognitive dissonance is from the fact that you
>> really are doing two different things here. You aren't really
>> implementing the full TestClearPageLRU until patch 15. So this patch
>> is doing part of 2a and 2b, and then patch 15 is following up and
>> completing the 2a cases. I still think it might make more sense to
>> pull out the pieces related to 2b and move them into a patch before
>> this with documentation explaining that there should be no competition
>> for the LRU flag because the page has transitioned to a reference
>> count of zero. Then take the remaining bits and combine them with
>> patch 15 since the description for the two is pretty similar.
>>


As for the suggestion to split the patch: Hugh and I actually discussed this a
few weeks ago when he gave me these changes. At that time we both thought that
keeping these changes in this patch looked better.
If it is confusing, would a revised commit log make it better?

Thanks
Alex

    mm/lru: introduce TestClearPageLRU

    Currently lru_lock still guards both the lru list and the page's lru bit;
    that's ok. But if we want to use a specific lruvec lock on the page, we
    need to pin down the page's lruvec/memcg during locking. Just taking the
    lruvec lock first may be undermined by the page's memcg charge/migration.
    To fix this problem, we can take the clearing of the page's lru bit out
    and use it as the pin-down action that blocks memcg changes. That's the
    reason for the new atomic func TestClearPageLRU. So isolating a page now
    needs both actions: TestClearPageLRU and holding the lru_lock.

    This patch combines the PageLRU check and ClearPageLRU into a macro func,
    TestClearPageLRU. This function will be used as a page isolation
    precondition to prevent other isolations elsewhere.
    As a result there may be non-PageLRU pages on the lru list, so the BUG
    checks need to be removed accordingly.

    There are 2 rules for the lru bit:
    1. The lru bit still indicates whether a page is on the lru list. Only in
    some temporary moments (while isolating) may the page have no lru bit
    while it is on the lru list, but the page must still be on the lru list
    whenever the lru bit is set.
    2. The lru bit has to be cleared before the page is deleted from the lru
    list.

    Hugh Dickins pointed out that when a page is on the freeing path and no
    one can possibly take it, non-atomic lru bit clearing is better, as in
    __page_cache_release and release_pages.
    There is also no need for get_page() before the lru bit clear in
    isolate_lru_page, since it '(1) Must be called with an elevated refcount
    on the page'.

    As Andrew Morton mentioned, this change would dirty the cacheline for a
    page that isn't on the LRU. But the loss would be acceptable according to
    Rong Chen's <rong.a.chen@intel.com> report:
    https://lkml.org/lkml/2020/3/4/173

    Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: linux-kernel@vger.kernel.org
    Cc: cgroups@vger.kernel.org
    Cc: linux-mm@kvack.org



* Re: [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-19  9:12       ` Alex Shi
@ 2020-07-19 15:14           ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-19 15:14 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov, Rong Chen

On Sun, Jul 19, 2020 at 2:12 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> On 2020/7/18 10:15 PM, Alex Shi wrote:
> >>>
> >>>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
> >>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >>> index 14c668b7e793..36c1680efd90 100644
> >>> --- a/include/linux/mmzone.h
> >>> +++ b/include/linux/mmzone.h
> >>> @@ -261,6 +261,8 @@ struct lruvec {
> >>>         atomic_long_t                   nonresident_age;
> >>>         /* Refaults at the time of last reclaim cycle */
> >>>         unsigned long                   refaults;
> >>> +       /* per lruvec lru_lock for memcg */
> >>> +       spinlock_t                      lru_lock;
> >>>         /* Various lruvec state flags (enum lruvec_flags) */
> >>>         unsigned long                   flags;
> >> Any reason for placing this here instead of at the end of the
> >> structure? From what I can tell it looks like lruvec is already 128B
> >> long so placing the lock on the end would put it into the next
> >> cacheline which may provide some performance benefit since it is
> >> likely to be bounced quite a bit.
> > Rong Chen(Cced) once reported a performance regression when the lock at
> > the end of struct, and move here could remove it.
> > Although I can't not reproduce. But I trust his report.
> >
> Oops, Rong's report is on another member which is different with current
> struct.
>
> Compare to move to tail, how about to move it to head of struct, which is
> close to lru list? Did you have some data of the place change?

I don't have specific data, just anecdotal evidence from the past that
usually you want to keep locks away from read-mostly items since they
cause obvious cache thrash. My concern was more with the other fields
in the structure such as pgdat since it should be a static value and
having it evicted would likely be more expensive than just leaving the
cacheline as it is.
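
As a rough illustration of the layout concern, the sketch below keeps a
frequently written lock off the cacheline that holds read-mostly fields. It is
not the real struct lruvec: struct lruvec_model, its members, the int used as
a lock and the 64-byte MODEL_CACHELINE are all assumptions for the example,
and forcing the placement with alignas is just one way to get the separation.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define MODEL_CACHELINE 64	/* assumed cacheline size for this model */

/* Hypothetical, simplified layout; not the kernel's struct lruvec. */
struct lruvec_model {
	/* Read-mostly fields. */
	void *lists[5][2];	/* stand-in for the lru list heads */
	unsigned long refaults;
	unsigned long flags;
	void *pgdat;		/* static value, read on hot paths */

	/* Frequently written lock, pushed onto a cacheline of its own. */
	alignas(MODEL_CACHELINE) int lru_lock;
};

/* Index of the cacheline (relative to the struct start) holding an offset. */
static size_t model_line_of(size_t offset)
{
	return offset / MODEL_CACHELINE;
}

/* Compile-time check that the lock starts a new cacheline. */
static_assert(offsetof(struct lruvec_model, lru_lock) % MODEL_CACHELINE == 0,
	      "lru_lock should start a cacheline");
```

With this layout, model_line_of() returns different indices for the pgdat and
lru_lock offsets, so bouncing the lock does not evict the line holding pgdat.
The cost is padding: the struct size grows to a multiple of the assumed
cacheline size.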

> > ...
> >
> >>>  putback:
> >>> -               spin_unlock_irq(&zone->zone_pgdat->lru_lock);
> >>>                 pagevec_add(&pvec_putback, pvec->pages[i]);
> >>>                 pvec->pages[i] = NULL;
> >>>         }
> >>> -       /* tempary disable irq, will remove later */
> >>> -       local_irq_disable();
> >>>         __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> >>> -       local_irq_enable();
> >>> +       if (lruvec)
> >>> +               unlock_page_lruvec_irq(lruvec);
> >> So I am not a fan of this change. You went to all the trouble of
> >> reducing the lock scope just to bring it back out here again. In
> >> addition it implies there is a path where you might try to update the
> >> page state without disabling interrupts.
> > Right. but any idea to avoid this except a extra local_irq_disable?
> >
>
> The following changes would resolve the problem. Is this ok?
> @@ -324,7 +322,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>                 pagevec_add(&pvec_putback, pvec->pages[i]);
>                 pvec->pages[i] = NULL;
>         }
> -       __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> +       if (delta_munlocked)
> +               __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
>         if (lruvec)
>                 unlock_page_lruvec_irq(lruvec);

Why not just wrap the entire thing in a check for "lruvec"? Yes you
could theoretically be modding with a value of 0, but it avoids a
secondary unnecessary check and branching.
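
A minimal user-space sketch of that shape follows. The types and model_*
functions are stand-ins invented for the example (they are not
__mod_zone_page_state() or unlock_page_lruvec_irq()), and the sketch assumes
that holding the lruvec lock is what keeps interrupts disabled for the counter
update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects. */
struct lruvec_model {
	bool held_irq_off;	/* lock held, interrupts disabled */
};

struct zone_model {
	long nr_mlock;
};

/* Model of the counter update: only legal with interrupts disabled. */
static void model_mod_zone_page_state(struct zone_model *zone,
				      const struct lruvec_model *lruvec,
				      long delta)
{
	assert(lruvec && lruvec->held_irq_off);
	zone->nr_mlock += delta;
}

/* Model of the unlock: drops the lock and re-enables interrupts. */
static void model_unlock_page_lruvec_irq(struct lruvec_model *lruvec)
{
	lruvec->held_irq_off = false;
}

/* One check covers both the counter update and the unlock. */
static void model_finish_munlock(struct zone_model *zone,
				 struct lruvec_model *lruvec, long delta)
{
	if (lruvec) {
		model_mod_zone_page_state(zone, lruvec, delta);
		model_unlock_page_lruvec_irq(lruvec);
	}
}
```

The update may run with a delta of 0, but it can never run on a path where no
lock was taken and interrupts are still enabled.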

* Re: [PATCH v16 00/22] per memcg lru_lock
  2020-07-17  5:24   ` Alex Shi
@ 2020-07-19 15:23       ` Hugh Dickins
  0 siblings, 0 replies; 80+ messages in thread
From: Hugh Dickins @ 2020-07-19 15:23 UTC (permalink / raw)
  To: Alex Shi
  Cc: Alexander Duyck, Andrew Morton, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi,
	Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm,
	LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang,
	Kirill A. Shutemov

On Fri, 17 Jul 2020, Alex Shi wrote:
> On 2020/7/16 at 10:11 PM, Alexander Duyck wrote:
> >> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> >> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
> > Hi Alex,
> > 
> > I think I am seeing a regression with this patch set when I run the
> > will-it-scale/page_fault3 test. Specifically the processes result is
> > dropping from 56371083 to 43127382 when I apply these patches.
> > 
> > I haven't had a chance to bisect and figure out what is causing it,
> > and wanted to let you know in case you are aware of anything specific
> > that may be causing this.
> 
> 
> Thanks a lot for the info!
> 
> Actually, patch 17 and patch 13 may have changed performance a little.
> For patch 17, for example, Intel LKP found a 68.0% improvement in
> vm-scalability.throughput, a -76.3% regression in
> stress-ng.remap.ops_per_sec, and +23.2% in stress-ng.memfd.ops_per_sec, etc.
> 
> This kind of performance interference is known and acceptable.

That may be too blithe a response.

I can see that I've lots of other mails to reply to, from you and from
others - I got held up for a week in advancing from gcc 4.8 on my test
machines. But I'd better rush this to you before reading further, because
what I was hunting the last few days rather invalidates earlier testing.
And I'm glad that I held back from volunteering a Tested-by - though,
yes, v13 and later are stable where the older versions were unstable.

I noticed that 5.8-rc5, with lrulock v16 applied, took significantly
longer to run loads than without it applied, when there should have been
only slight differences in system time. Comparing /proc/vmstat, something
that stood out was "pgrotated 0" for the patched kernels, which led here:

If pagevec_lru_move_fn() is now to TestClearPageLRU (I have still not
decided whether that's good or not, but assume here that it is good),
then functions called through it must be changed not to expect PageLRU!

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/swap.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

--- 5.8-rc5-lru16/mm/swap.c	2020-07-15 21:03:42.781236769 -0700
+++ linux/mm/swap.c	2020-07-18 13:28:14.000000000 -0700
@@ -227,7 +227,7 @@ static void pagevec_lru_move_fn(struct p
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
@@ -300,7 +300,7 @@ void lru_note_cost_page(struct page *pag
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+	if (!PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = hpage_nr_pages(page);
 
@@ -357,7 +357,8 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	lruvec = lock_page_lruvec_irq(page);
-	__activate_page(page, lruvec);
+	if (PageLRU(page))
+		__activate_page(page, lruvec);
 	unlock_page_lruvec_irq(lruvec);
 }
 #endif
@@ -515,9 +516,6 @@ static void lru_deactivate_file_fn(struc
 	bool active;
 	int nr_pages = hpage_nr_pages(page);
 
-	if (!PageLRU(page))
-		return;
-
 	if (PageUnevictable(page))
 		return;
 
@@ -558,7 +556,7 @@ static void lru_deactivate_file_fn(struc
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = hpage_nr_pages(page);
 
@@ -575,7 +573,7 @@ static void lru_deactivate_fn(struct pag
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	if (PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		bool active = PageActive(page);
 		int nr_pages = hpage_nr_pages(page);
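
For readers following along, the "pgrotated 0" symptom can be reproduced
in a small user-space model. This is only a sketch: page_model, the flag
values and the helper names are made up for illustration, and C11 atomics
stand in for the kernel's page flag operations. The caller clears the LRU
bit before invoking the callback, so a callback that still tests that bit
never does any work:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits and page type; not the kernel definitions. */
enum { PG_LRU_MODEL = 1u << 0, PG_UNEVICTABLE_MODEL = 1u << 1 };

struct page_model {
    atomic_uint flags;
};

/* Models TestClearPageLRU(): atomically clear the bit, report old state. */
static bool test_clear_lru(struct page_model *page)
{
    unsigned int old = atomic_fetch_and(&page->flags,
                                        ~(unsigned int)PG_LRU_MODEL);
    return (old & PG_LRU_MODEL) != 0;
}

/* Models SetPageLRU(). */
static void set_lru(struct page_model *page)
{
    atomic_fetch_or(&page->flags, (unsigned int)PG_LRU_MODEL);
}

typedef bool (*move_fn_t)(struct page_model *page);

/* Old-style callback: still expects the LRU bit, which the caller cleared. */
static bool move_tail_checks_lru(struct page_model *page)
{
    unsigned int f = atomic_load(&page->flags);
    return (f & PG_LRU_MODEL) && !(f & PG_UNEVICTABLE_MODEL);
}

/* Fixed callback: relies on the caller having done the LRU test-and-clear. */
static bool move_tail_fixed(struct page_model *page)
{
    unsigned int f = atomic_load(&page->flags);
    return !(f & PG_UNEVICTABLE_MODEL);
}

/* Models the loop in pagevec_lru_move_fn(); returns pages "rotated". */
static int lru_move(struct page_model *pages, int n, move_fn_t fn)
{
    int rotated = 0;

    for (int i = 0; i < n; i++) {
        if (!test_clear_lru(&pages[i]))
            continue;           /* someone else isolated the page */
        if (fn(&pages[i]))
            rotated++;
        set_lru(&pages[i]);     /* put the bit back once done */
    }
    return rotated;
}
```

Given three pages on the LRU, one of them unevictable, lru_move() with
move_tail_checks_lru rotates 0 pages, while move_tail_fixed rotates 2.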

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 00/22] per memcg lru_lock
  2020-07-19 15:23       ` Hugh Dickins
  (?)
@ 2020-07-20  3:01       ` Alex Shi
  2020-07-20  4:47           ` Hugh Dickins
  -1 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-20  3:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Alexander Duyck, Andrew Morton, Mel Gorman, Tejun Heo,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov



On 2020/7/19 at 11:23 PM, Hugh Dickins wrote:
> I noticed that 5.8-rc5, with lrulock v16 applied, took significantly
> longer to run loads than without it applied, when there should have been
> only slight differences in system time. Comparing /proc/vmstat, something
> that stood out was "pgrotated 0" for the patched kernels, which led here:
> 
> If pagevec_lru_move_fn() is now to TestClearPageLRU (I have still not
> decided whether that's good or not, but assume here that it is good),
> then functions called through it must be changed not to expect PageLRU!
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Good catch!

Thanks a lot, Hugh!
Besides the 6 changes that should be applied, it looks like we should add
one more in swap.c to stop further actions on !PageLRU pages!

Many Thanks!
Alex

@@ -649,7 +647,7 @@ void deactivate_file_page(struct page *page)
         * In a workload with many unevictable page such as mprotect,
         * unevictable page deactivation for accelerating reclaim is pointless.
         */
-       if (PageUnevictable(page))
+       if (PageUnevictable(page) || !PageLRU(page))
                return;

        if (likely(get_page_unless_zero(page))) {

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 00/22] per memcg lru_lock
  2020-07-20  3:01       ` Alex Shi
@ 2020-07-20  4:47           ` Hugh Dickins
  0 siblings, 0 replies; 80+ messages in thread
From: Hugh Dickins @ 2020-07-20  4:47 UTC (permalink / raw)
  To: Alex Shi
  Cc: Hugh Dickins, Alexander Duyck, Andrew Morton, Mel Gorman,
	Tejun Heo, Konstantin Khlebnikov, Daniel Jordan, Yang Shi,
	Matthew Wilcox, Johannes Weiner, kbuild test robot, linux-mm,
	LKML, cgroups, Shakeel Butt, Joonsoo Kim, Wei Yang,
	Kirill A. Shutemov

On Mon, 20 Jul 2020, Alex Shi wrote:
> On 2020/7/19 at 11:23 PM, Hugh Dickins wrote:
> > I noticed that 5.8-rc5, with lrulock v16 applied, took significantly
> > longer to run loads than without it applied, when there should have been
> > only slight differences in system time. Comparing /proc/vmstat, something
> > that stood out was "pgrotated 0" for the patched kernels, which led here:
> > 
> > If pagevec_lru_move_fn() is now to TestClearPageLRU (I have still not
> > decided whether that's good or not, but assume here that it is good),
> > then functions called through it must be changed not to expect PageLRU!
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Good catch!
> 
> Thanks a lot, Hugh!
> Besides the 6 changes that should be applied, it looks like we should add
> one more in swap.c to stop further actions on !PageLRU pages!

Agreed, that's a minor optimization that wasn't done before,
that can be added (but it's not a fix like the rest of them).

> 
> Many Thanks!
> Alex
> 
> @@ -649,7 +647,7 @@ void deactivate_file_page(struct page *page)
>          * In a workload with many unevictable page such as mprotect,
>          * unevictable page deactivation for accelerating reclaim is pointless.
>          */
> -       if (PageUnevictable(page))
> +       if (PageUnevictable(page) || !PageLRU(page))
>                 return;
> 
>         if (likely(get_page_unless_zero(page))) {

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-19 15:14           ` Alexander Duyck
  (?)
@ 2020-07-20  5:47           ` Alex Shi
  -1 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-20  5:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Michal Hocko, Vladimir Davydov, Rong Chen



On 2020/7/19 at 11:14 PM, Alexander Duyck wrote:
>> Compared to moving it to the tail, how about moving it to the head of the
>> struct, close to the lru list? Do you have any data on the placement change?
> I don't have specific data, just anecdotal evidence from the past that
> usually you want to keep locks away from read-mostly items since they
> cause obvious cache thrash. My concern was more with the other fields
> in the structure such as pgdat since it should be a static value and
> having it evicted would likely be more expensive than just leaving the
> cacheline as it is.
> 

Thanks for the comments, Alex.

So it sounds like moving lru_lock to the head of struct lruvec could be better.
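
The placement trade-off discussed here can be illustrated with a small
user-space sketch. The struct below is hypothetical (it is not the real
struct lruvec, and the 64-byte cache line size is an assumption); it only
shows how a frequently written lock can sit next to the lists it guards
while read-mostly fields are pushed onto a separate cache line:

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_MODEL 64  /* assumed cache line size in bytes */

/* Hypothetical layout for illustration only; not the kernel's lruvec. */
struct lruvec_layout_model {
    /* Written on every lock/unlock: kept with the lists it protects. */
    _Alignas(CACHELINE_MODEL) int lru_lock;
    void *lists[5];

    /* Read-mostly: starts a new cache line so lock traffic does not
     * keep evicting it from other CPUs' caches. */
    _Alignas(CACHELINE_MODEL) void *pgdat;
};

/* True if two byte offsets fall into the same cache line. */
static int same_cacheline(size_t a, size_t b)
{
    return a / CACHELINE_MODEL == b / CACHELINE_MODEL;
}
```

Checking with offsetof() shows the lock sharing a line with the lists but
not with pgdat, which is the property the review comment is after. Whether
that actually helps is something only measurements on the real structure
can tell.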

>> -       __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
>> +       if (delta_munlocked)
>> +               __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
>>         if (lruvec)
>>                 unlock_page_lruvec_irq(lruvec);
> Why not just wrap the entire thing in a check for "lruvec"? Yes you
> could theoretically be modding with a value of 0, but it avoids a
> secondary unnecessary check and branching.

Right, and the delta_munlocked value could be checked inside the accounting
function.

Thanks!

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 00/22] per memcg lru_lock
  2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
                   ` (24 preceding siblings ...)
  2020-07-16 14:11   ` Alexander Duyck
@ 2020-07-20  7:30 ` Alex Shi
  25 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-20  7:30 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill


I am preparing and testing patch v17 according to the comments from Hugh
Dickins and Alexander Duyck.
Many thanks for the line-by-line review and patient suggestions!

Please send me any further comments or concerns about any of the patches!

Thanks a lot!
Alex

On 2020/7/11 at 8:58 AM, Alex Shi wrote:
> This new version is based on v5.8-rc4. It adds 2 more patches:
> 'mm/thp: remove code path which never got into'
> 'mm/thp: add tail pages into lru anyway in split_huge_page()'
> and modifies 'mm/mlock: reorder isolation sequence during munlock'
> 
> Currently there is one lru_lock per node, pgdat->lru_lock, which guards the
> lru lists, but the lru lists were moved into memcg a long time ago. Still
> using a per-node lru_lock is clearly unscalable: pages in each memcg have
> to compete with each other for a single lru_lock. This patchset uses a
> per-lruvec/memcg lru_lock to replace the per-node lru lock for guarding the
> lru lists, making it scalable for memcgs and giving a performance gain.
> 
> Currently lru_lock still guards both the lru list and the page's lru bit,
> which is ok. But if we want to use a specific lruvec lock for the page, we
> need to pin down the page's lruvec/memcg during locking. Just taking the
> lruvec lock first may be undermined by the page's memcg charge/migration.
> To fix this problem, we can pull out the clearing of the page's lru bit and
> use it as the pin-down action to block memcg changes. That is the reason
> for the new atomic func TestClearPageLRU.
> So now isolating a page needs both actions: TestClearPageLRU and holding
> the lru_lock.
> 
> The typical usage of this is isolate_migratepages_block() in compaction.c:
> we have to take the lru bit before the lru lock, which serializes page
> isolation against memcg page charge/migration, which would change the
> page's lruvec and hence its lru_lock.
> 
> The above solution was suggested by Johannes Weiner, and this patchset is
> based on his new memcg charge path. (Hugh Dickins tested and contributed
> much code, from the compaction fix to general code polish, thanks a lot!)
> 
> The patchset includes 3 parts:
> 1, some code cleanup and minimal optimization as preparation.
> 2, use TestClearPageLRU as the precondition for page isolation.
> 3, replace the per-node lru_lock with a per-memcg per-node lru_lock.
> 
> Following Daniel Jordan's suggestion, I have run 208 'dd' tasks in 104
> containers on a 2s * 26cores * HT box with a modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> With this patchset, the readtwice performance increased about 80%
> in concurrent containers.
> 
> Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up this
> idea 8 years ago, and to the others who gave comments as well: Daniel Jordan,
> Mel Gorman, Shakeel Butt, Matthew Wilcox etc.
> 
> Thanks for testing support from Intel 0day and Rong Chen, Fengguang Wu,
> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 05/22] mm/thp: move lru_add_page_tail func to huge_memory.c
  2020-07-17  5:13       ` Alex Shi
@ 2020-07-20  8:37         ` Kirill A. Shutemov
  0 siblings, 0 replies; 80+ messages in thread
From: Kirill A. Shutemov @ 2020-07-20  8:37 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang

On Fri, Jul 17, 2020 at 01:13:21PM +0800, Alex Shi wrote:
> 
> 
> On 2020/7/16 at 9:17 PM, Kirill A. Shutemov wrote:
> > On Thu, Jul 16, 2020 at 04:59:48PM +0800, Alex Shi wrote:
> >> Hi Kirill & Matthew,
> >>
> >> Are there any concerns about the THP-related patches?
> > 
> > It is a mechanical move. I don't see a problem.
> > 
> 
> Many thanks, Kirill!
> 
> Would you mind giving a Reviewed-by?

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 07/22] mm/thp: remove code path which never got into
  2020-07-11  0:58 ` [PATCH v16 07/22] mm/thp: remove code path which never got into Alex Shi
@ 2020-07-20  8:43   ` Kirill A. Shutemov
  0 siblings, 0 replies; 80+ messages in thread
From: Kirill A. Shutemov @ 2020-07-20  8:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang

On Sat, Jul 11, 2020 at 08:58:41AM +0800, Alex Shi wrote:
> split_huge_page() will never call on a page which isn't on lru list, so
> this code never got a chance to run, and should not be run, to add tail
> pages on a lru list which head page isn't there.
> 
> Although the bug was never triggered, it'better be removed for code
> correctness.
> 
> BTW, it looks better to have BUG() or soem warning set in the wrong

s/soem/some/

> path, but the path will be changed in incomming new page isolation
> func. So just save it here.

Yeah, WARN() would be great. Otherwise I'm okay with the patch

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 06/22] mm/thp: clean up lru_add_page_tail
  2020-07-11  0:58 ` [PATCH v16 06/22] mm/thp: clean up lru_add_page_tail Alex Shi
@ 2020-07-20  8:43   ` Kirill A. Shutemov
  0 siblings, 0 replies; 80+ messages in thread
From: Kirill A. Shutemov @ 2020-07-20  8:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang

On Sat, Jul 11, 2020 at 08:58:40AM +0800, Alex Shi wrote:
> Since the first parameter is only used by head page, it's better to make
> it explicit.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in split_huge_page()
  2020-07-17  9:30   ` Alex Shi
@ 2020-07-20  8:49     ` Kirill A. Shutemov
  2020-07-20  9:04       ` Alex Shi
  0 siblings, 1 reply; 80+ messages in thread
From: Kirill A. Shutemov @ 2020-07-20  8:49 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, Mika Penttilä

On Fri, Jul 17, 2020 at 05:30:27PM +0800, Alex Shi wrote:
> 
> Added a VM_WARN_ON for tracking, and updated the comments in the code.
> 
> Thanks
> 
> ---
> From f1381a1547625a6521777bf9235823d8fbd00dac Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linux.alibaba.com>
> Date: Fri, 10 Jul 2020 16:54:37 +0800
> Subject: [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in
>  split_huge_page()
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> Split_huge_page() must start with PageLRU(head), and we are holding the
> lru_lock here. If the head's lru bit was cleared unexpectedly, track it.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in split_huge_page()
  2020-07-20  8:49     ` Kirill A. Shutemov
@ 2020-07-20  9:04       ` Alex Shi
  0 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-20  9:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, Mika Penttilä



On 2020/7/20 at 4:49 PM, Kirill A. Shutemov wrote:
>>
>> Split_huge_page() must start with PageLRU(head), and we are holding the
>> lru_lock here. If the head's lru bit was cleared unexpectedly, track it.
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Hi Kirill,

Many thanks for the review!

Alex

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock
  2020-07-19  3:55     ` Alex Shi
@ 2020-07-20 18:51         ` Alexander Duyck
  0 siblings, 0 replies; 80+ messages in thread
From: Alexander Duyck @ 2020-07-20 18:51 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov

On Sat, Jul 18, 2020 at 8:56 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> On 2020/7/18 at 4:30 AM, Alexander Duyck wrote:
> > On Fri, Jul 10, 2020 at 5:59 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >> This patch reorders the isolation steps during munlock, moves the lru lock
> >> to guard each page, and unfolds the __munlock_isolate_lru_page func, in
> >> preparation for the lru lock change.
> >>
> >> __split_huge_page_refcount no longer exists, but we still have to guard
> >> PageMlocked and PageLRU for tail pages in __split_huge_page_tail.
> >>
> >> [lkp@intel.com: found a sleeping function bug ... at mm/rmap.c]
> >> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> >> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >> Cc: Matthew Wilcox <willy@infradead.org>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Cc: linux-mm@kvack.org
> >> Cc: linux-kernel@vger.kernel.org
> >> ---
> >>  mm/mlock.c | 93 ++++++++++++++++++++++++++++++++++----------------------------
> >>  1 file changed, 51 insertions(+), 42 deletions(-)
> >>
> >> diff --git a/mm/mlock.c b/mm/mlock.c
> >> index 228ba5a8e0a5..0bdde88b4438 100644
> >> --- a/mm/mlock.c
> >> +++ b/mm/mlock.c
> >> @@ -103,25 +103,6 @@ void mlock_vma_page(struct page *page)
> >>  }
> >>
> >>  /*
> >> - * Isolate a page from LRU with optional get_page() pin.
> >> - * Assumes lru_lock already held and page already pinned.
> >> - */
> >> -static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
> >> -{
> >> -       if (TestClearPageLRU(page)) {
> >> -               struct lruvec *lruvec;
> >> -
> >> -               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> >> -               if (getpage)
> >> -                       get_page(page);
> >> -               del_page_from_lru_list(page, lruvec, page_lru(page));
> >> -               return true;
> >> -       }
> >> -
> >> -       return false;
> >> -}
> >> -
> >> -/*
> >>   * Finish munlock after successful page isolation
> >>   *
> >>   * Page must be locked. This is a wrapper for try_to_munlock()
> >> @@ -181,6 +162,7 @@ static void __munlock_isolation_failed(struct page *page)
> >>  unsigned int munlock_vma_page(struct page *page)
> >>  {
> >>         int nr_pages;
> >> +       bool clearlru = false;
> >>         pg_data_t *pgdat = page_pgdat(page);
> >>
> >>         /* For try_to_munlock() and to serialize with page migration */
> >> @@ -189,32 +171,42 @@ unsigned int munlock_vma_page(struct page *page)
> >>         VM_BUG_ON_PAGE(PageTail(page), page);
> >>
> >>         /*
> >> -        * Serialize with any parallel __split_huge_page_refcount() which
> >> +        * Serialize split tail pages in __split_huge_page_tail() which
> >>          * might otherwise copy PageMlocked to part of the tail pages before
> >>          * we clear it in the head page. It also stabilizes hpage_nr_pages().
> >>          */
> >> +       get_page(page);
> >
> > I don't think this get_page() call needs to be up here. It could be
> > left down before we delete the page from the LRU list as it is really
> > needed to take a reference on the page before we call
> > __munlock_isolated_page(), or at least that is the way it looks to me.
> > By doing that you can avoid a bunch of cleanup in these exception
> > cases.
>
> Uh, it seems unlikely to hit !page->_refcount and then get to release_pages();
> if so, the get_page() could indeed move down.
> Thanks
>
> >
> >> +       clearlru = TestClearPageLRU(page);
> >
> > I'm not sure I fully understand the reason for moving this here. By
> > clearing this flag before you clear Mlocked does this give you some
> > sort of extra protection? I don't see how since Mlocked doesn't
> > necessarily imply the page is on LRU.
> >
>
> Above comments give a reason for the lru_lock usage,

I think things are getting confused here. The problem is that clearing
the page LRU flag is not the same as acquiring the LRU lock.

I was looking through patch 22 and it occurred to me that the
documentation in __pagevec_lru_add_fn was never updated. My worry is
that it might have been overlooked, either that or maybe you discussed
it previously and I missed the discussion. There it calls out that you
either have to hold onto the LRU lock, or you have to test PageLRU
after clearing the Mlocked flag otherwise you risk introducing a race.
It seems to me like you could potentially just collapse the lock down
further if you are using it more in line with the 2b case as defined
there rather than trying to still use it to protect the Mlocked flag
even though you have already pulled the LRU bit before taking the
lock. Either that or this is more like the pagevec_lru_move_fn in
which case you are already holding the LRU lock so you just need to
call the test and clear before trying to pull the page off of the LRU
list.

> >> +        * Serialize split tail pages in __split_huge_page_tail() which
> >>          * might otherwise copy PageMlocked to part of the tail pages before
> >>          * we clear it in the head page. It also stabilizes hpage_nr_pages().
>
> Looking into __split_huge_page_tail, there is a tiny gap between the tail page
> getting PG_mlocked and it being added to the lru list.
> The TestClearPageLRU could block memcg changes of the page by stopping
> isolate_lru_page.

I get that there is a gap between the two in __split_huge_page_tail.
My concern is more the fact that you are pulling the bit testing
outside of the locked region when I don't think it needs to be. The
lock is being taken unconditionally, so why pull the testing out when
you could just do it inside the lock anyway? My worry is that you
might be addressing __split_huge_page_tail but in the process you
might be introducing a new race with something like
__pagevec_lru_add_fn.

If I am not mistaken the Mlocked flag can still be cleared regardless
of whether the LRU bit is set or not. So you can still clear the LRU bit
before you pull the page out of the list, but it can be done after
clearing the Mlocked flag instead of before you have even taken the
LRU lock. In that way it would function more similar to how you
handled pagevec_lru_move_fn() as all this function is really doing is
moving the pages out of the unevictable list into one of the other LRU
lists anyway since the Mlocked flag was cleared.
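
A minimal userspace sketch of the ordering described above, assuming a
pthread mutex in place of the LRU lock and C11 atomics in place of the page
flag helpers. The names model_page, model_lru_lock, test_clear and
model_munlock are hypothetical stand-ins for illustration only; this is not
kernel code and it does not model memcg moves or THP splitting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel page flags; not the real values. */
enum { PG_MLOCKED = 1u << 0, PG_LRU = 1u << 1 };

struct model_page {
	atomic_uint flags;
	bool on_list;		/* models membership in an LRU list */
};

/* Models the LRU lock; the real lock is a spinlock taken with IRQs off. */
static pthread_mutex_t model_lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Atomically clear one bit and report whether it was set beforehand. */
static bool test_clear(struct model_page *p, unsigned int bit)
{
	return (atomic_fetch_and(&p->flags, ~bit) & bit) != 0;
}

/*
 * Returns 1 if the page was isolated, 0 if Mlocked was cleared but the
 * page was not on the LRU, and -1 if Mlocked was not set (nothing to do).
 */
static int model_munlock(struct model_page *p)
{
	int ret = -1;

	assert(p != NULL);
	pthread_mutex_lock(&model_lru_lock);
	if (test_clear(p, PG_MLOCKED)) {
		ret = 0;
		/* Only now that there is work to do: clear LRU, then unlink. */
		if (test_clear(p, PG_LRU)) {
			p->on_list = false;
			ret = 1;
		}
	}
	pthread_mutex_unlock(&model_lru_lock);
	return ret;
}
```

In this model a page whose Mlocked bit is already clear keeps its LRU bit
untouched, so there is nothing to restore on the early-exit path.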

> >>         spin_lock_irq(&pgdat->lru_lock);
> >>
> >>         if (!TestClearPageMlocked(page)) {
> >> -               /* Potentially, PTE-mapped THP: do not skip the rest PTEs */
> >> -               nr_pages = 1;
> >> -               goto unlock_out;
> >> +               if (clearlru)
> >> +                       SetPageLRU(page);
> >> +               /*
> >> +                * Potentially, PTE-mapped THP: do not skip the rest PTEs
> >> +                * Reuse lock as memory barrier for release_pages racing.
> >> +                */
> >> +               spin_unlock_irq(&pgdat->lru_lock);
> >> +               put_page(page);
> >> +               return 0;
> >>         }
> >>
> >>         nr_pages = hpage_nr_pages(page);
> >>         __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
> >>
> >> -       if (__munlock_isolate_lru_page(page, true)) {
> >> +       if (clearlru) {
> >> +               struct lruvec *lruvec;
> >> +
> >
> > You could just place the get_page() call here.
> >
> >> +               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> >> +               del_page_from_lru_list(page, lruvec, page_lru(page));
> >>                 spin_unlock_irq(&pgdat->lru_lock);
> >>                 __munlock_isolated_page(page);
> >> -               goto out;
> >> +       } else {
> >> +               spin_unlock_irq(&pgdat->lru_lock);
> >> +               put_page(page);
> >> +               __munlock_isolation_failed(page);
> >
> > If you move the get_page() as I suggested above there wouldn't be a
> > need for the put_page(). It then becomes possible to simplify the code
> > a bit by merging the unlock paths and doing an if/else with the
> > __munlock functions like so:
> > if (clearlru) {
> >     ...
> >     del_page_from_lru..
> > }
> >
> > spin_unlock_irq()
> >
> > if (clearlru)
> >     __munlock_isolated_page();
> > else
> > >     __munlock_isolation_failed();
> >
> >>         }
> >> -       __munlock_isolation_failed(page);
> >> -
> >> -unlock_out:
> >> -       spin_unlock_irq(&pgdat->lru_lock);
> >>
> >> -out:
> >>         return nr_pages - 1;
> >>  }
> >>
> >> @@ -297,34 +289,51 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
> >>         pagevec_init(&pvec_putback);
> >>
> >>         /* Phase 1: page isolation */
> >> -       spin_lock_irq(&zone->zone_pgdat->lru_lock);
> >>         for (i = 0; i < nr; i++) {
> >>                 struct page *page = pvec->pages[i];
> >> +               struct lruvec *lruvec;
> >> +               bool clearlru;
> >>
> >> -               if (TestClearPageMlocked(page)) {
> >> -                       /*
> >> -                        * We already have pin from follow_page_mask()
> >> -                        * so we can spare the get_page() here.
> >> -                        */
> >> -                       if (__munlock_isolate_lru_page(page, false))
> >> -                               continue;
> >> -                       else
> >> -                               __munlock_isolation_failed(page);
> >> -               } else {
> >> +               clearlru = TestClearPageLRU(page);
> >> +               spin_lock_irq(&zone->zone_pgdat->lru_lock);
> >
> > I still don't see what you are gaining by moving the bit test up to
> > this point. Seems like it would be better left below with the lock
> > just being used to prevent a possible race while you are pulling the
> > page out of the LRU list.
> >
>
> Same reason as the comments above about the __split_huge_page_tail()
> issue.

I have the same argument here as above. The LRU lock is being used to
protect the Mlocked flag, as such there isn't a need to move the
get_page and clearing of the LRU flag up.  The get_page() call isn't
needed until just before we delete the page from the LRU list, and the
clearing isn't really needed until after we have already cleared the
Mlocked flag to see if we even have any work that we have to do, but
we do need to clear it before we are allowed to delete the page from
the LRU list.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock
  2020-07-20 18:51         ` Alexander Duyck
  (?)
@ 2020-07-21  9:26         ` Alex Shi
  2020-07-21 13:51           ` Alex Shi
  -1 siblings, 1 reply; 80+ messages in thread
From: Alex Shi @ 2020-07-21  9:26 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov



On 2020/7/21 at 2:51 AM, Alexander Duyck wrote:
>> Look at __split_huge_page_tail: there is a tiny gap between the tail page
>> getting PG_mlocked and it being added to the lru list.
>> TestClearPageLRU can block memcg changes of the page by stopping
>> isolate_lru_page.
> I get that there is a gap between the two in __split_huge_page_tail.
> My concern is more the fact that you are pulling the bit testing
> outside of the locked region when I don't think it needs to be. The
> lock is being taken unconditionally, so why pull the testing out when
> you could just do it inside the lock anyway? My worry is that you
> might be addressing __split_huge_page_tail but in the process you
> might be introducing a new race with something like
> __pagevec_lru_add_fn.

Yes, clear_page_mlock may interfere with the page, and the page may then be
added to the wrong lru list.

> 
> If I am not mistaken the Mlocked flag can still be cleared regardless
> of whether the LRU bit is set or not. So you can still clear the LRU bit
> before you pull the page out of the list, but it can be done after
> clearing the Mlocked flag instead of before you have even taken the
> LRU lock. In that way it would function more similar to how you
> handled pagevec_lru_move_fn() as all this function is really doing is
> moving the pages out of the unevictable list into one of the other LRU
> lists anyway since the Mlocked flag was cleared.
> 

Without the lru bit guard, the page may be moved between memcgs. Luckily,
lock_page would stop mem_cgroup_move_account, at the cost of a BUSY state.
The whole new change would look like the following; I will test and resend.

Thanks!
Alex

@@ -182,7 +179,7 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
        int nr_pages;
-       pg_data_t *pgdat = page_pgdat(page);
+       struct lruvec *lruvec;

        /* For try_to_munlock() and to serialize with page migration */
        BUG_ON(!PageLocked(page));
@@ -190,11 +187,11 @@ unsigned int munlock_vma_page(struct page *page)
        VM_BUG_ON_PAGE(PageTail(page), page);

        /*
-        * Serialize with any parallel __split_huge_page_refcount() which
+        * Serialize split tail pages in __split_huge_page_tail() which
         * might otherwise copy PageMlocked to part of the tail pages before
         * we clear it in the head page. It also stabilizes hpage_nr_pages().
         */
-       spin_lock_irq(&pgdat->lru_lock);
+       lruvec = lock_page_lruvec_irq(page);

        if (!TestClearPageMlocked(page)) {
                /* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -205,15 +202,15 @@ unsigned int munlock_vma_page(struct page *page)
        nr_pages = hpage_nr_pages(page);
        __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);

-       if (__munlock_isolate_lru_page(page, true)) {
-               spin_unlock_irq(&pgdat->lru_lock);
+       if (__munlock_isolate_lru_page(page, lruvec, true)) {
+               unlock_page_lruvec_irq(lruvec);
                __munlock_isolated_page(page);
                goto out;
        }
        __munlock_isolation_failed(page);

 unlock_out:
-       spin_unlock_irq(&pgdat->lru_lock);
+       unlock_page_lruvec_irq(lruvec);

 out:
        return nr_pages - 1;
@@ -293,23 +290,27 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
        int nr = pagevec_count(pvec);
        int delta_munlocked = -nr;
        struct pagevec pvec_putback;
+       struct lruvec *lruvec = NULL;
        int pgrescued = 0;

        pagevec_init(&pvec_putback);

        /* Phase 1: page isolation */
-       spin_lock_irq(&zone->zone_pgdat->lru_lock);
        for (i = 0; i < nr; i++) {
                struct page *page = pvec->pages[i];

+               /* block memcg change in mem_cgroup_move_account */
+               lock_page(page);
+               lruvec = relock_page_lruvec_irq(page, lruvec);
                if (TestClearPageMlocked(page)) {
                        /*
                         * We already have pin from follow_page_mask()
                         * so we can spare the get_page() here.
                         */
-                       if (__munlock_isolate_lru_page(page, false))
+                       if (__munlock_isolate_lru_page(page, lruvec, false)) {
+                               unlock_page(page);
                                continue;
-                       else
+                       } else
                                __munlock_isolation_failed(page);
                } else {
                        delta_munlocked++;
@@ -321,11 +322,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
                 * pin. We cannot do it under lru_lock however. If it's
                 * the last pin, __page_cache_release() would deadlock.
                 */
+               unlock_page(page);
                pagevec_add(&pvec_putback, pvec->pages[i]);
                pvec->pages[i] = NULL;
        }
-       __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-       spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+       if (lruvec) {
+               __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+               unlock_page_lruvec_irq(lruvec);
+       }

        /* Now we can release pins of pages that we are not munlocking */
        pagevec_release(&pvec_putback);
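
The relock pattern used in the loop above can be modeled in userspace as
follows. This is only a sketch under stated assumptions: pthread mutexes
stand in for the per-lruvec spinlocks, and model_lruvec, model_page,
model_relock and model_walk are hypothetical names, not the kernel API. It
does not model IRQ disabling, lock_page(), or memcg pinning.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model: each "lruvec" owns its own lock. */
struct model_lruvec {
	pthread_mutex_t lock;
	int lock_count;		/* how many times the lock was acquired */
};

struct model_page {
	struct model_lruvec *lruvec;
};

/*
 * Return with the page's lruvec locked. If the caller already holds that
 * lock, keep it; otherwise drop the old lock (if any) and take the new one.
 */
static struct model_lruvec *model_relock(struct model_page *page,
					 struct model_lruvec *locked)
{
	struct model_lruvec *want = page->lruvec;

	assert(want != NULL);
	if (locked == want)
		return locked;
	if (locked)
		pthread_mutex_unlock(&locked->lock);
	pthread_mutex_lock(&want->lock);
	want->lock_count++;
	return want;
}

/* Walk a batch of pages, switching locks only when the lruvec changes. */
static void model_walk(struct model_page *pages, int nr)
{
	struct model_lruvec *locked = NULL;

	for (int i = 0; i < nr; i++)
		locked = model_relock(&pages[i], locked);
	if (locked)
		pthread_mutex_unlock(&locked->lock);
}
```

When consecutive pages share an lruvec the lock is taken once for the run,
which is the saving the relock helper is meant to provide over locking and
unlocking per page.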

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock
  2020-07-21  9:26         ` Alex Shi
@ 2020-07-21 13:51           ` Alex Shi
  0 siblings, 0 replies; 80+ messages in thread
From: Alex Shi @ 2020-07-21 13:51 UTC (permalink / raw)
  To: Alexander Duyck, Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	kbuild test robot, linux-mm, LKML, cgroups, Shakeel Butt,
	Joonsoo Kim, Wei Yang, Kirill A. Shutemov



On 2020/7/21 at 5:26 PM, Alex Shi wrote:
> 
> 
> On 2020/7/21 at 2:51 AM, Alexander Duyck wrote:
>>> Look at __split_huge_page_tail: there is a tiny gap between the tail page
>>> getting PG_mlocked and it being added to the lru list.
>>> TestClearPageLRU can block memcg changes of the page by stopping
>>> isolate_lru_page.
>> I get that there is a gap between the two in __split_huge_page_tail.
>> My concern is more the fact that you are pulling the bit testing
>> outside of the locked region when I don't think it needs to be. The
>> lock is being taken unconditionally, so why pull the testing out when
>> you could just do it inside the lock anyway? My worry is that you
>> might be addressing __split_huge_page_tail but in the process you
>> might be introducing a new race with something like
>> __pagevec_lru_add_fn.
> 
> Yes, clear_page_mlock may interfere with the page, and the page may then be
> added to the wrong lru list.
> 
>>
>> If I am not mistaken the Mlocked flag can still be cleared regardless
>> of whether the LRU bit is set or not. So you can still clear the LRU bit
>> before you pull the page out of the list, but it can be done after
>> clearing the Mlocked flag instead of before you have even taken the
>> LRU lock. In that way it would function more similar to how you
>> handled pagevec_lru_move_fn() as all this function is really doing is
>> moving the pages out of the unevictable list into one of the other LRU
>> lists anyway since the Mlocked flag was cleared.
>>
> 
> Without the lru bit guard, the page may be moved between memcgs. Luckily,
> lock_page would stop mem_cgroup_move_account, at the cost of a BUSY state.
> The whole new change would look like the following; I will test and resend.
> 

Hi Johannes,

It looks like lock_page_memcg() could be used to replace lock_page(), which
would turn the retry into a spinlock wait. Could you give some comments?

Thanks
Alex
> Thanks!
> Alex
> 
> @@ -182,7 +179,7 @@ static void __munlock_isolation_failed(struct page *page)
>  unsigned int munlock_vma_page(struct page *page)
>  {
>         int nr_pages;
> -       pg_data_t *pgdat = page_pgdat(page);
> +       struct lruvec *lruvec;
> 
>         /* For try_to_munlock() and to serialize with page migration */
>         BUG_ON(!PageLocked(page));
> @@ -190,11 +187,11 @@ unsigned int munlock_vma_page(struct page *page)
>         VM_BUG_ON_PAGE(PageTail(page), page);
> 
>         /*
> -        * Serialize with any parallel __split_huge_page_refcount() which
> +        * Serialize split tail pages in __split_huge_page_tail() which
>          * might otherwise copy PageMlocked to part of the tail pages before
>          * we clear it in the head page. It also stabilizes hpage_nr_pages().
>          */
> -       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = lock_page_lruvec_irq(page);
> 
>         if (!TestClearPageMlocked(page)) {
>                 /* Potentially, PTE-mapped THP: do not skip the rest PTEs */
> @@ -205,15 +202,15 @@ unsigned int munlock_vma_page(struct page *page)
>         nr_pages = hpage_nr_pages(page);
>         __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
> 
> -       if (__munlock_isolate_lru_page(page, true)) {
> -               spin_unlock_irq(&pgdat->lru_lock);
> +       if (__munlock_isolate_lru_page(page, lruvec, true)) {
> +               unlock_page_lruvec_irq(lruvec);
>                 __munlock_isolated_page(page);
>                 goto out;
>         }
>         __munlock_isolation_failed(page);
> 
>  unlock_out:
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       unlock_page_lruvec_irq(lruvec);
> 
>  out:
>         return nr_pages - 1;
> @@ -293,23 +290,27 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>         int nr = pagevec_count(pvec);
>         int delta_munlocked = -nr;
>         struct pagevec pvec_putback;
> +       struct lruvec *lruvec = NULL;
>         int pgrescued = 0;
> 
>         pagevec_init(&pvec_putback);
> 
>         /* Phase 1: page isolation */
> -       spin_lock_irq(&zone->zone_pgdat->lru_lock);
>         for (i = 0; i < nr; i++) {
>                 struct page *page = pvec->pages[i];
> 
> +               /* block memcg change in mem_cgroup_move_account */
> +               lock_page(page);
> +               lruvec = relock_page_lruvec_irq(page, lruvec);
>                 if (TestClearPageMlocked(page)) {
>                         /*
>                          * We already have pin from follow_page_mask()
>                          * so we can spare the get_page() here.
>                          */
> -                       if (__munlock_isolate_lru_page(page, false))
> +                       if (__munlock_isolate_lru_page(page, lruvec, false)) {
> +                               unlock_page(page);
>                                 continue;
> -                       else
> +                       } else
>                                 __munlock_isolation_failed(page);
>                 } else {
>                         delta_munlocked++;
> @@ -321,11 +322,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>                  * pin. We cannot do it under lru_lock however. If it's
>                  * the last pin, __page_cache_release() would deadlock.
>                  */
> +               unlock_page(page);
>                 pagevec_add(&pvec_putback, pvec->pages[i]);
>                 pvec->pages[i] = NULL;
>         }
> -       __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> -       spin_unlock_irq(&zone->zone_pgdat->lru_lock);
> +       if (lruvec) {
> +               __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> +               unlock_page_lruvec_irq(lruvec);
> +       }
> 
>         /* Now we can release pins of pages that we are not munlocking */
>         pagevec_release(&pvec_putback);
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2020-07-21 13:51 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-11  0:58 [PATCH v16 00/22] per memcg lru_lock Alex Shi
2020-07-11  0:58 ` [PATCH v16 01/22] mm/vmscan: remove unnecessary lruvec adding Alex Shi
2020-07-11  0:58 ` [PATCH v16 02/22] mm/page_idle: no unlikely double check for idle page counting Alex Shi
2020-07-11  0:58 ` [PATCH v16 03/22] mm/compaction: correct the comments of compact_defer_shift Alex Shi
2020-07-11  0:58 ` [PATCH v16 04/22] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
2020-07-11  0:58 ` [PATCH v16 05/22] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
2020-07-16  8:59   ` Alex Shi
2020-07-16 13:17     ` Kirill A. Shutemov
2020-07-17  5:13       ` Alex Shi
2020-07-20  8:37         ` Kirill A. Shutemov
2020-07-11  0:58 ` [PATCH v16 06/22] mm/thp: clean up lru_add_page_tail Alex Shi
2020-07-20  8:43   ` Kirill A. Shutemov
2020-07-11  0:58 ` [PATCH v16 07/22] mm/thp: remove code path which never got into Alex Shi
2020-07-20  8:43   ` Kirill A. Shutemov
2020-07-11  0:58 ` [PATCH v16 08/22] mm/thp: narrow lru locking Alex Shi
2020-07-11  0:58 ` [PATCH v16 09/22] mm/memcg: add debug checking in lock_page_memcg Alex Shi
2020-07-11  0:58 ` [PATCH v16 10/22] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
2020-07-11  0:58 ` [PATCH v16 11/22] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
2020-07-11  0:58 ` [PATCH v16 12/22] mm/lru: move lock into lru_note_cost Alex Shi
2020-07-11  0:58 ` [PATCH v16 13/22] mm/lru: introduce TestClearPageLRU Alex Shi
2020-07-16  9:06   ` Alex Shi
2020-07-16 21:12   ` Alexander Duyck
2020-07-16 21:12     ` Alexander Duyck
2020-07-17  7:45     ` Alex Shi
2020-07-17 18:26       ` Alexander Duyck
2020-07-17 18:26         ` Alexander Duyck
2020-07-19  4:45         ` Alex Shi
2020-07-19 11:24           ` Alex Shi
2020-07-11  0:58 ` [PATCH v16 14/22] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
2020-07-17  9:30   ` Alex Shi
2020-07-20  8:49     ` Kirill A. Shutemov
2020-07-20  9:04       ` Alex Shi
2020-07-11  0:58 ` [PATCH v16 15/22] mm/compaction: do page isolation first in compaction Alex Shi
2020-07-16 21:32   ` Alexander Duyck
2020-07-16 21:32     ` Alexander Duyck
2020-07-17  5:09     ` Alex Shi
2020-07-17 16:09       ` Alexander Duyck
2020-07-17 16:09         ` Alexander Duyck
2020-07-19  3:59         ` Alex Shi
2020-07-11  0:58 ` [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock Alex Shi
2020-07-17 20:30   ` Alexander Duyck
2020-07-17 20:30     ` Alexander Duyck
2020-07-19  3:55     ` Alex Shi
2020-07-20 18:51       ` Alexander Duyck
2020-07-20 18:51         ` Alexander Duyck
2020-07-21  9:26         ` Alex Shi
2020-07-21 13:51           ` Alex Shi
2020-07-11  0:58 ` [PATCH v16 17/22] mm/swap: serialize memcg changes during pagevec_lru_move_fn Alex Shi
2020-07-11  0:58 ` [PATCH v16 18/22] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
2020-07-17 21:38   ` Alexander Duyck
2020-07-17 21:38     ` Alexander Duyck
2020-07-18 14:15     ` Alex Shi
2020-07-19  9:12       ` Alex Shi
2020-07-19 15:14         ` Alexander Duyck
2020-07-19 15:14           ` Alexander Duyck
2020-07-20  5:47           ` Alex Shi
2020-07-11  0:58 ` [PATCH v16 19/22] mm/lru: introduce the relock_page_lruvec function Alex Shi
2020-07-17 22:03   ` Alexander Duyck
2020-07-17 22:03     ` Alexander Duyck
2020-07-18 14:01     ` Alex Shi
2020-07-11  0:58 ` [PATCH v16 20/22] mm/vmscan: use relock for move_pages_to_lru Alex Shi
2020-07-17 21:44   ` Alexander Duyck
2020-07-17 21:44     ` Alexander Duyck
2020-07-18 14:15     ` Alex Shi
2020-07-11  0:58 ` [PATCH v16 21/22] mm/pgdat: remove pgdat lru_lock Alex Shi
2020-07-17 21:09   ` Alexander Duyck
2020-07-17 21:09     ` Alexander Duyck
2020-07-18 14:17     ` Alex Shi
2020-07-11  0:58 ` [PATCH v16 22/22] mm/lru: revise the comments of lru_lock Alex Shi
2020-07-11  1:02 ` [PATCH v16 00/22] per memcg lru_lock Alex Shi
2020-07-16  8:49 ` Alex Shi
2020-07-16 14:11 ` Alexander Duyck
2020-07-16 14:11   ` Alexander Duyck
2020-07-17  5:24   ` Alex Shi
2020-07-19 15:23     ` Hugh Dickins
2020-07-19 15:23       ` Hugh Dickins
2020-07-20  3:01       ` Alex Shi
2020-07-20  4:47         ` Hugh Dickins
2020-07-20  4:47           ` Hugh Dickins
2020-07-20  7:30 ` Alex Shi
