linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/7] per lruvec lru_lock for memcg
@ 2019-11-16  3:14 Alex Shi
  2019-11-16  3:15 ` [PATCH v3 1/7] mm/lru: add per lruvec lock " Alex Shi
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-16  3:14 UTC (permalink / raw)
  To: alex.shi, cgroups, linux-kernel, linux-mm, akpm, mgorman, tj,
	hughd, khlebnikov, daniel.m.jordan, yang.shi, willy

Hi all,

This patchset moves lru_lock into lruvec, giving each lruvec its own
lock, and thus one lru_lock per memcg per node.

Following Daniel Jordan's suggestion, I ran 64 'dd' tasks in 32
containers on my 2-socket, 8-core (HT) box with the modified case:
  https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this change, the lru_lock sensitive test above improved by 17% in
the multiple-container scenario, with no performance loss without mem_cgroup.

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed the
same idea 7 years ago. Given my testing results, and the fact that
Google has used this approach internally, I believe this feature
clearly benefits multi-container users.

So I would like to introduce it here.
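The central trick of the series, take the lock of the lruvec the page currently belongs to, then recheck in case the page migrated to another memcg meanwhile, can be sketched as a small userspace analogue (hypothetical illustration only: pthread mutexes stand in for kernel spinlocks, and the `lruvec` field stands in for the page->mem_cgroup lookup):

```c
#include <pthread.h>

/* Userspace sketch of the lock_page_lruvec() retry pattern.
 * Not kernel code: pthread_mutex_t replaces spinlock_t. */

struct lruvec {
	pthread_mutex_t lru_lock;	/* per-memcg-per-node lock */
};

struct page {
	struct lruvec *lruvec;		/* stand-in for page->mem_cgroup lookup */
};

static struct lruvec *lock_page_lruvec(struct page *page)
{
	struct lruvec *lruvec;

again:
	lruvec = page->lruvec;
	pthread_mutex_lock(&lruvec->lru_lock);
	/* the page may have moved to another memcg before we got the lock */
	if (lruvec != page->lruvec) {
		pthread_mutex_unlock(&lruvec->lru_lock);
		goto again;
	}
	return lruvec;
}
```

Once the function returns, the caller holds the lock of the lruvec the page is on, so LRU list manipulation is safe against concurrent memcg migration.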

v3: rebase on linux-next, and fold the relock fix patch into the introducing patch

v2: bypass a performance regression bug and fix some function issues

v1: initial version; aim testing showed a 5% performance increase

Alex Shi (7):
  mm/lru: add per lruvec lock for memcg
  mm/lruvec: add irqsave flags into lruvec struct
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/lru: only change the lru_lock iff page's lruvec is different
  mm/pgdat: remove pgdat lru_lock
  mm/lru: likely enhancement
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +----
 Documentation/admin-guide/cgroup-v1/memory.rst     |  6 +-
 Documentation/trace/events-kmem.rst                |  2 +-
 Documentation/vm/unevictable-lru.rst               | 22 +++----
 include/linux/memcontrol.h                         | 67 +++++++++++++++++++
 include/linux/mm_types.h                           |  2 +-
 include/linux/mmzone.h                             |  7 +-
 mm/compaction.c                                    | 62 +++++++++++------
 mm/filemap.c                                       |  4 +-
 mm/huge_memory.c                                   | 16 ++---
 mm/memcontrol.c                                    | 64 ++++++++++++++----
 mm/mlock.c                                         | 27 ++++----
 mm/mmzone.c                                        |  1 +
 mm/page_alloc.c                                    |  1 -
 mm/page_idle.c                                     |  5 +-
 mm/rmap.c                                          |  2 +-
 mm/swap.c                                          | 77 +++++++++-------------
 mm/vmscan.c                                        | 74 +++++++++++----------
 18 files changed, 277 insertions(+), 177 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v3 1/7] mm/lru: add per lruvec lock for memcg
  2019-11-16  3:14 [PATCH v3 0/7] per lruvec lru_lock for memcg Alex Shi
@ 2019-11-16  3:15 ` Alex Shi
  2019-11-16  6:28   ` Shakeel Butt
  2019-11-16  3:15 ` [PATCH v3 2/7] mm/lruvec: add irqsave flags into lruvec struct Alex Shi
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2019-11-16  3:15 UTC (permalink / raw)
  To: alex.shi, cgroups, linux-kernel, linux-mm, akpm, mgorman, tj,
	hughd, khlebnikov, daniel.m.jordan, yang.shi, willy
  Cc: Vlastimil Babka, Dan Williams, Michal Hocko, Wei Yang,
	Johannes Weiner, Arun KS

Currently memcg still uses the per-node pgdat->lru_lock to guard its
lruvecs. That causes lru_lock contention on systems with a high
container density.

Using a per-lruvec lock instead could relieve much of that lru_lock
contention.

Later patches will replace pgdat->lru_lock with lruvec->lru_lock and
show the performance benefit in benchmarks.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmzone.h | 2 ++
 mm/mmzone.c            | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 07b784ffbb14..a13b8a602ee5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -267,6 +267,8 @@ struct lruvec {
 	unsigned long			refaults;
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
-- 
1.8.3.1



* [PATCH v3 2/7] mm/lruvec: add irqsave flags into lruvec struct
  2019-11-16  3:14 [PATCH v3 0/7] per lruvec lru_lock for memcg Alex Shi
  2019-11-16  3:15 ` [PATCH v3 1/7] mm/lru: add per lruvec lock " Alex Shi
@ 2019-11-16  3:15 ` Alex Shi
  2019-11-16  6:31   ` Shakeel Butt
  2019-11-22  6:46   ` Christoph Hellwig
  2019-11-16  3:15 ` [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-16  3:15 UTC (permalink / raw)
  To: alex.shi, cgroups, linux-kernel, linux-mm, akpm, mgorman, tj,
	hughd, khlebnikov, daniel.m.jordan, yang.shi, willy
  Cc: Vlastimil Babka, Dan Williams, Michal Hocko, Wei Yang,
	Johannes Weiner, Arun KS, Rong Chen

We need an irqflags variable to save state for irqsave operations;
declaring it in the lruvec makes the code clearer and cleaner.

Rong Chen <rong.a.chen@intel.com> reported that the 'irqflags' member
needs to sit toward the tail of the lruvec struct, otherwise it causes
an 18% regression in vm-scalability testing on his machine. So place
both irqflags and lru_lock near the struct tail, even though I have no
explanation for that performance loss.

Originally-from: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
CC: Rong Chen <rong.a.chen@intel.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmzone.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a13b8a602ee5..9b8b8daf4e03 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -269,6 +269,8 @@ struct lruvec {
 	unsigned long			flags;
 	/* per lruvec lru_lock for memcg */
 	spinlock_t			lru_lock;
+	/* flags for irqsave */
+	unsigned long			irqflags;
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
-- 
1.8.3.1



* [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-16  3:14 [PATCH v3 0/7] per lruvec lru_lock for memcg Alex Shi
  2019-11-16  3:15 ` [PATCH v3 1/7] mm/lru: add per lruvec lock " Alex Shi
  2019-11-16  3:15 ` [PATCH v3 2/7] mm/lruvec: add irqsave flags into lruvec struct Alex Shi
@ 2019-11-16  3:15 ` Alex Shi
  2019-11-16  4:38   ` Matthew Wilcox
                     ` (3 more replies)
  2019-11-16  3:15 ` [PATCH v3 4/7] mm/lru: only change the lru_lock iff page's lruvec is different Alex Shi
                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-16  3:15 UTC (permalink / raw)
  To: alex.shi, cgroups, linux-kernel, linux-mm, akpm, mgorman, tj,
	hughd, khlebnikov, daniel.m.jordan, yang.shi, willy
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Shakeel Butt, Chris Down, Thomas Gleixner, Vlastimil Babka,
	Qian Cai, Andrey Ryabinin, Kirill A. Shutemov,
	Jérôme Glisse, Andrea Arcangeli, David Rientjes,
	Aneesh Kumar K.V, swkhack, Potyra, Stefan, Mike Rapoport,
	Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

This patchset moves lru_lock into lruvec, giving each lruvec its own
lock, and thus one lru_lock per memcg per node.

This is the main patch: it replaces the per-node lru_lock with the
per-memcg lruvec lock, and folds the irqsave flags into the lruvec.

We introduce the function lock_page_lruvec(); with memory cgroups
unset it is the same as the vanilla pgdat lock. With memcg, the
function re-takes the lruvec's lock as needed to guard against
page->mem_cgroup changes from page migration between memcgs. (Thanks
to Hugh Dickins and Konstantin Khlebnikov for the reminder on this;
the core logic is the same as in their previous patches.)

Following Daniel Jordan's suggestion, I ran 64 'dd' tasks in 32
containers on my 2-socket, 8-core (HT) box with the modified case:
  https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this and the later patches, dd throughput is 144MB/s, versus
123MB/s on the vanilla kernel: a 17% performance increase.
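The batching pattern this patch applies in release_pages() and compaction, keep one lruvec lock held across consecutive pages and relock only when a page belongs to a different lruvec, can be sketched as a userspace analogue (hypothetical illustration, not kernel code; pthread mutexes replace spinlocks and `nr_released` is an invented counter standing in for the real LRU work):

```c
#include <pthread.h>

/* Sketch of the "relock only on lruvec change" batch walk. */

struct lruvec {
	pthread_mutex_t lru_lock;
	int nr_released;		/* stand-in for per-lruvec work */
};

struct page {
	struct lruvec *lruvec;		/* stand-in for page->mem_cgroup lookup */
};

static void release_batch(struct page **pages, int nr)
{
	struct lruvec *locked = NULL;

	for (int i = 0; i < nr; i++) {
		struct lruvec *lruvec = pages[i]->lruvec;

		if (lruvec != locked) {		/* relock only when lruvec changes */
			if (locked)
				pthread_mutex_unlock(&locked->lru_lock);
			pthread_mutex_lock(&lruvec->lru_lock);
			locked = lruvec;
		}
		locked->nr_released++;		/* work done under the held lock */
	}
	if (locked)
		pthread_mutex_unlock(&locked->lru_lock);
}
```

Runs of pages from the same memcg pay for one lock acquisition, while pages from different memcgs no longer contend on a single per-node lock.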

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: swkhack <swkhack@gmail.com>
Cc: "Potyra, Stefan" <Stefan.Potyra@elektrobit.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h | 23 ++++++++++++++
 mm/compaction.c            | 62 ++++++++++++++++++++++++------------
 mm/huge_memory.c           | 16 ++++------
 mm/memcontrol.c            | 64 +++++++++++++++++++++++++++++--------
 mm/mlock.c                 | 31 +++++++++---------
 mm/page_idle.c             |  5 +--
 mm/swap.c                  | 79 +++++++++++++++++++---------------------------
 mm/vmscan.c                | 58 ++++++++++++++++------------------
 8 files changed, 201 insertions(+), 137 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5b86287fa069..0b32eadd0eda 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -418,6 +418,9 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
+struct lruvec *lock_page_lruvec_irq(struct page *, struct pglist_data *);
+struct lruvec *lock_page_lruvec_irqsave(struct page *, struct pglist_data *);
+
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -901,6 +904,26 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->__lruvec;
 }
 
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
+						struct pglist_data *pgdat)
+{
+	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+	spin_lock_irq(&lruvec->lru_lock);
+
+	return lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						struct pglist_data *pgdat)
+{
+	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+	spin_lock_irqsave(&lruvec->lru_lock, lruvec->irqflags);
+
+	return lruvec;
+}
+
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
diff --git a/mm/compaction.c b/mm/compaction.c
index d20816b21b55..eb1ad30c1bef 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -786,8 +786,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	pg_data_t *pgdat = cc->zone->zone_pgdat;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
-	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked_lruvec = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -847,11 +846,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked_lruvec) {
+				spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+						locked_lruvec->irqflags);
+				locked_lruvec = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -920,10 +929,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
-				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+				if (locked_lruvec) {
+					spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+							locked_lruvec->irqflags);
+					locked_lruvec = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -949,10 +958,22 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
 			goto isolate_fail;
 
+reget_lruvec:
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked_lruvec) {
+			if (locked_lruvec) {
+				spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+						locked_lruvec->irqflags);
+				locked_lruvec = NULL;
+			}
+			if (compact_lock_irqsave(&lruvec->lru_lock,
+							&lruvec->irqflags, cc))
+				locked_lruvec = lruvec;
+
+			if (lruvec != mem_cgroup_page_lruvec(page, pgdat))
+				goto reget_lruvec;
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -976,7 +997,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			}
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, isolate_mode) != 0)
@@ -1017,9 +1037,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * page anyway.
 		 */
 		if (nr_isolated) {
-			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+			if (locked_lruvec) {
+				spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+						locked_lruvec->irqflags);
+				locked_lruvec = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1044,8 +1065,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		low_pfn = end_pfn;
 
 isolate_abort:
-	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (locked_lruvec)
+		spin_unlock_irqrestore(&locked_lruvec->lru_lock,
+						locked_lruvec->irqflags);
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 13cc93785006..7e8bd6c700d2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2495,17 +2495,13 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+			struct lruvec *lruvec, pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
-	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2554,7 +2550,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_unlock(&head->mapping->i_pages);
 	}
 
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
 
 	remap_page(head);
 
@@ -2697,9 +2693,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
+	struct lruvec *lruvec;
 	int count, mapcount, extra_pins, ret;
 	bool mlocked;
-	unsigned long flags;
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
@@ -2766,7 +2762,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		lru_add_drain();
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
+	lruvec = lock_page_lruvec_irqsave(head, pgdata);
 
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
@@ -2797,7 +2793,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		}
 
 		spin_unlock(&ds_queue->split_queue_lock);
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, lruvec, end);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
 
@@ -2816,7 +2812,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
 		remap_page(head);
 		ret = -EBUSY;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 62470325f9bc..cf274739e619 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1246,6 +1246,42 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+struct lruvec *lock_page_lruvec_irq(struct page *page,
+					struct pglist_data *pgdat)
+{
+	struct lruvec *lruvec;
+
+again:
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+
+	/* the lruvec may have changed in commit_charge() */
+	if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
+		spin_unlock_irq(&lruvec->lru_lock);
+		goto again;
+	}
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+					struct pglist_data *pgdat)
+{
+	struct lruvec *lruvec;
+
+again:
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, lruvec->irqflags);
+
+	/* the lruvec may have changed in commit_charge() */
+	if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
+		spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
+		goto again;
+	}
+
+	return lruvec;
+}
+
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
@@ -2571,41 +2607,43 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	css_put_many(&memcg->css, nr_pages);
 }
 
-static void lock_page_lru(struct page *page, int *isolated)
+static struct lruvec *lock_page_lru(struct page *page, int *isolated)
 {
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec = lock_page_lruvec_irq(page, pgdat);
 
-	spin_lock_irq(&pgdat->lru_lock);
 	if (PageLRU(page)) {
-		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		*isolated = 1;
 	} else
 		*isolated = 0;
+
+	return lruvec;
 }
 
-static void unlock_page_lru(struct page *page, int isolated)
+static void unlock_page_lru(struct page *page, int isolated,
+				struct lruvec *locked_lruvec)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
-	if (isolated) {
-		struct lruvec *lruvec;
+	spin_unlock_irq(&locked_lruvec->lru_lock);
+	lruvec = lock_page_lruvec_irq(page, page_pgdat(page));
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	if (isolated) {
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 			  bool lrucare)
 {
 	int isolated;
+	struct lruvec *lruvec;
 
 	VM_BUG_ON_PAGE(page->mem_cgroup, page);
 
@@ -2614,7 +2652,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 	 * may already be on some other mem_cgroup's LRU.  Take care of it.
 	 */
 	if (lrucare)
-		lock_page_lru(page, &isolated);
+		lruvec = lock_page_lru(page, &isolated);
 
 	/*
 	 * Nobody should be changing or seriously looking at
@@ -2633,7 +2671,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 	page->mem_cgroup = memcg;
 
 	if (lrucare)
-		unlock_page_lru(page, isolated);
+		unlock_page_lru(page, isolated, lruvec);
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -2929,7 +2967,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * pgdat->lruvec.lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index a72c1eeded77..b509b80b8513 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page)
  * Isolate a page from LRU with optional get_page() pin.
  * Assumes lru_lock already held and page already pinned.
  */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+static bool __munlock_isolate_lru_page(struct page *page,
+			struct lruvec *lruvec, bool getpage)
 {
 	if (PageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
 		ClearPageLRU(page);
@@ -183,6 +181,7 @@ unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -194,7 +193,7 @@ unsigned int munlock_vma_page(struct page *page)
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page, pgdat);
 
 	if (!TestClearPageMlocked(page)) {
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -205,15 +204,15 @@ unsigned int munlock_vma_page(struct page *page)
 	nr_pages = hpage_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&pgdat->lru_lock);
+	if (__munlock_isolate_lru_page(page, lruvec, true)) {
+		spin_unlock_irq(&lruvec->lru_lock);
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 out:
 	return nr_pages - 1;
@@ -291,28 +290,29 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
 	int i;
 	int nr = pagevec_count(pvec);
-	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
+		lruvec = lock_page_lruvec_irq(page, page_pgdat(page));
+
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, false))
+			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+				__mod_zone_page_state(zone, NR_MLOCK,  -1);
+				spin_unlock_irq(&lruvec->lru_lock);
 				continue;
-			else
+			} else
 				__munlock_isolation_failed(page);
-		} else {
-			delta_munlocked++;
 		}
 
 		/*
@@ -323,9 +323,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 */
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
+		spin_unlock_irq(&lruvec->lru_lock);
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..25f4b1cf3e0f 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,6 +32,7 @@ static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page;
 	pg_data_t *pgdat;
+	struct lruvec *lruvec;
 
 	if (!pfn_valid(pfn))
 		return NULL;
@@ -42,12 +43,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
 		return NULL;
 
 	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page, pgdat);
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 	return page;
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..60f04cb2b49e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,14 +62,12 @@ static void __page_cache_release(struct page *page)
 	if (PageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
-		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, pgdat);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -192,26 +190,17 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void *arg)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
-	unsigned long flags = 0;
+	struct lruvec *lruvec = NULL;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page));
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec, arg);
+		spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -325,11 +314,12 @@ static inline void activate_page_drain(int cpu)
 void activate_page(struct page *page)
 {
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
-	spin_unlock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page, pgdat);
+	__activate_page(page, lruvec, NULL);
+	spin_unlock_irq(&lruvec->lru_lock);
 }
 #endif
 
@@ -780,9 +770,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
-	unsigned long uninitialized_var(flags);
+	struct lruvec *lruvec = NULL;
 	unsigned int uninitialized_var(lock_batch);
 
 	for (i = 0; i < nr; i++) {
@@ -791,21 +779,22 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			spin_unlock_irqrestore(&lruvec->lru_lock,
+							lruvec->irqflags);
+			lruvec = NULL;
 		}
 
 		if (is_huge_zero_page(page))
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				spin_unlock_irqrestore(&lruvec->lru_lock,
+						       lruvec->irqflags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -822,27 +811,25 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				spin_unlock_irqrestore(&lruvec->lru_lock,
+								lruvec->irqflags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-									flags);
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
-			}
+				lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page));
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
+			}
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -854,8 +841,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -893,7 +880,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (!list)
 		SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d97985262dda..3cdf343e7a27 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1755,8 +1755,7 @@ int isolate_lru_page(struct page *page)
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&pgdat->lru_lock);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irq(page, pgdat);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
 			get_page(page);
@@ -1764,7 +1763,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	}
 	return ret;
 }
@@ -1829,7 +1828,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
@@ -1840,12 +1838,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		if (unlikely(!page_evictable(page))) {
 			list_del(&page->lru);
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		SetPageLRU(page);
 		lru = page_lru(page);
@@ -1860,9 +1857,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		} else {
@@ -1925,7 +1922,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1937,7 +1934,7 @@ static int current_may_throttle(void)
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1945,7 +1942,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
@@ -1958,7 +1955,7 @@ static int current_may_throttle(void)
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_unref_page_list(&page_list);
@@ -2011,7 +2008,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2022,7 +2019,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2068,7 +2065,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -2086,7 +2083,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2371,7 +2368,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2392,7 +2389,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -4285,24 +4282,25 @@ int page_evictable(struct page *page)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct pglist_data *pgdat = page_pgdat(page);
+		struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 
 		pgscanned++;
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				spin_unlock_irq(&lruvec->lru_lock);
+			lruvec = new_lruvec;
+			spin_lock_irq(&lruvec->lru_lock);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -4318,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		}
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v3 4/7] mm/lru: only change the lru_lock iff page's lruvec is different
  2019-11-16  3:14 [PATCH v3 0/8] per lruvec lru_lock for memcg Alex Shi
                   ` (2 preceding siblings ...)
  2019-11-16  3:15 ` [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2019-11-16  3:15 ` Alex Shi
  2019-11-16  3:15 ` [PATCH v3 5/7] mm/pgdat: remove pgdat lru_lock Alex Shi
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-16  3:15 UTC (permalink / raw)
  To: alex.shi, cgroups, linux-kernel, linux-mm, akpm, mgorman, tj,
	hughd, khlebnikov, daniel.m.jordan, yang.shi, willy
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Shakeel Butt, Chris Down, Thomas Gleixner, Vlastimil Babka,
	Andrey Ryabinin, swkhack, Potyra, Stefan, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

While walking a pagevec under lock, a new page's lruvec may be the same as
the previous page's. In that case we can skip the re-locking, and only
switch locks when the lruvec actually changes.

The function is named relock_page_lruvec, following Hugh Dickins' patch.

The first version of this patch used rcu_read_lock() to guard the lruvec
assignment and the comparison with locked_lruvec in relock_page_lruvec.
But Rong Chen <rong.a.chen@intel.com> reported a regression with the
PROVE_LOCKING config: the RCU read locking caused qspinlock waiters to
spin for too long.

Since we already hold a spinlock there, the RCU read locking isn't necessary.

[lkp@intel.com: Fix RCU-related regression reported by LKP robot]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: swkhack <swkhack@gmail.com>
Cc: "Potyra, Stefan" <Stefan.Potyra@elektrobit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
 mm/mlock.c                 | 16 +++++++++-------
 mm/swap.c                  | 14 ++++++--------
 mm/vmscan.c                | 24 ++++++++++++------------
 4 files changed, 71 insertions(+), 27 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0b32eadd0eda..eaec01fb627f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1290,6 +1290,50 @@ static inline void dec_lruvec_page_state(struct page *page,
 	mod_lruvec_page_state(page, idx, -1);
 }
 
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+					struct lruvec *locked_lruvec)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
+
+	if (!locked_lruvec)
+		goto lock;
+
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+	if (locked_lruvec == lruvec)
+		return lruvec;
+
+	spin_unlock_irq(&locked_lruvec->lru_lock);
+
+lock:
+	lruvec = lock_page_lruvec_irq(page, pgdat);
+	return lruvec;
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+					struct lruvec *locked_lruvec)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
+
+	if (!locked_lruvec)
+		goto lock;
+
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+	if (locked_lruvec == lruvec)
+		return lruvec;
+
+	spin_unlock_irqrestore(&locked_lruvec->lru_lock, locked_lruvec->irqflags);
+
+lock:
+	lruvec = lock_page_lruvec_irqsave(page, pgdat);
+	return lruvec;
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/mlock.c b/mm/mlock.c
index b509b80b8513..8b3a97b62c0a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -290,6 +290,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
 	int i;
 	int nr = pagevec_count(pvec);
+	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
 	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
@@ -300,20 +301,19 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
-		lruvec = lock_page_lruvec_irq(page, page_pgdat(page));
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, lruvec, false)) {
-				__mod_zone_page_state(zone, NR_MLOCK,  -1);
-				spin_unlock_irq(&lruvec->lru_lock);
+			if (__munlock_isolate_lru_page(page, lruvec, false))
 				continue;
-			} else
+			else
 				__munlock_isolation_failed(page);
-		}
+		} else
+			delta_munlocked++;
 
 		/*
 		 * We won't be munlocking this page in the next phase
@@ -323,8 +323,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 */
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
-		spin_unlock_irq(&lruvec->lru_lock);
 	}
+	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+	if (lruvec)
+		spin_unlock_irq(&lruvec->lru_lock);
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/swap.c b/mm/swap.c
index 60f04cb2b49e..a26e19738e96 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -195,11 +195,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
-		lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page));
+		lruvec = relock_page_lruvec_irqsave(page, lruvec);
 
 		(*move_fn)(page, lruvec, arg);
-		spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
 	}
+	if (lruvec)
+		spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
 
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
@@ -821,15 +822,12 @@ void release_pages(struct page **pages, int nr)
 		}
 
 		if (PageLRU(page)) {
-			struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+			struct lruvec *pre_lruvec = lruvec;
 
-			if (new_lruvec != lruvec) {
-				if (lruvec)
-					spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
+			lruvec = relock_page_lruvec_irqsave(page, lruvec);
+			if (pre_lruvec != lruvec)
 				lock_batch = 0;
-				lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page));
 
-			}
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3cdf343e7a27..ba57c55a6a41 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1825,22 +1825,25 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
  * Returns the number of pages moved to the given lruvec.
  */
 
-static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
+static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *llvec,
 						     struct list_head *list)
 {
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
 	enum lru_list lru;
+	struct lruvec *lruvec = llvec;
 
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
+		lruvec = relock_page_lruvec_irq(page, lruvec);
+
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		if (unlikely(!page_evictable(page))) {
 			list_del(&page->lru);
 			spin_unlock_irq(&lruvec->lru_lock);
+			lruvec = NULL;
 			putback_lru_page(page);
-			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1858,8 +1861,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		} else {
@@ -1867,6 +1870,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		}
 	}
 
+	if (lruvec != llvec) {
+		if (lruvec)
+			spin_unlock_irq(&lruvec->lru_lock);
+		spin_lock_irq(&llvec->lru_lock);
+	}
 	/*
 	 * To save our caller's stack, now use input list for pages to free.
 	 */
@@ -4289,18 +4297,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pgdat = page_pgdat(page);
-		struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, pgdat);
-
 
 		pgscanned++;
 
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				spin_unlock_irq(&lruvec->lru_lock);
-			lruvec = new_lruvec;
-			spin_lock_irq(&lruvec->lru_lock);
-		}
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v3 5/7] mm/pgdat: remove pgdat lru_lock
  2019-11-16  3:14 [PATCH v3 0/8] per lruvec lru_lock for memcg Alex Shi
                   ` (3 preceding siblings ...)
  2019-11-16  3:15 ` [PATCH v3 4/7] mm/lru: only change the lru_lock iff page's lruvec is different Alex Shi
@ 2019-11-16  3:15 ` Alex Shi
  2019-11-16  3:15 ` [PATCH v3 6/7] mm/lru: likely enhancement Alex Shi
  2019-11-16  3:15 ` [PATCH v3 7/7] mm/lru: revise the comments of lru_lock Alex Shi
  6 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-16  3:15 UTC (permalink / raw)
  To: alex.shi, cgroups, linux-kernel, linux-mm, akpm, mgorman, tj,
	hughd, khlebnikov, daniel.m.jordan, yang.shi, willy
  Cc: Vlastimil Babka, Dan Williams, Michal Hocko, Wei Yang, Arun KS,
	Oscar Salvador, Mike Rapoport, Alexander Duyck, Pavel Tatashin,
	Alexander Potapenko, Johannes Weiner

Now that pgdat->lru_lock has been replaced by the lruvec lock, it is no
longer used. Remove it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 include/linux/mmzone.h | 1 -
 mm/page_alloc.c        | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9b8b8daf4e03..2e9e67fd350c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -736,7 +736,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5ecbacb0e04..e8f2c8f755f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6803,7 +6803,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v3 6/7] mm/lru: likely enhancement
  2019-11-16  3:14 [PATCH v3 0/8] per lruvec lru_lock for memcg Alex Shi
                   ` (4 preceding siblings ...)
  2019-11-16  3:15 ` [PATCH v3 5/7] mm/pgdat: remove pgdat lru_lock Alex Shi
@ 2019-11-16  3:15 ` Alex Shi
  2019-11-16  3:15 ` [PATCH v3 7/7] mm/lru: revise the comments of lru_lock Alex Shi
  6 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-16  3:15 UTC (permalink / raw)
  To: alex.shi, cgroups, linux-kernel, linux-mm, akpm, mgorman, tj,
	hughd, khlebnikov, daniel.m.jordan, yang.shi, willy
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Shakeel Butt, Chris Down, Thomas Gleixner

Use likely()/unlikely() to guide branch prediction on the hot paths,
according to typical pagevec usage.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 8 ++++----
 mm/memcontrol.c            | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eaec01fb627f..5c49fe1ee9fe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1297,12 +1297,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
 	struct pglist_data *pgdat = page_pgdat(page);
 	struct lruvec *lruvec;
 
-	if (!locked_lruvec)
+	if (unlikely(!locked_lruvec))
 		goto lock;
 
 	lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-	if (locked_lruvec == lruvec)
+	if (likely(locked_lruvec == lruvec))
 		return lruvec;
 
 	spin_unlock_irq(&locked_lruvec->lru_lock);
@@ -1319,12 +1319,12 @@ static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
 	struct pglist_data *pgdat = page_pgdat(page);
 	struct lruvec *lruvec;
 
-	if (!locked_lruvec)
+	if (unlikely(!locked_lruvec))
 		goto lock;
 
 	lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-	if (locked_lruvec == lruvec)
+	if (likely(locked_lruvec == lruvec))
 		return lruvec;
 
 	spin_unlock_irqrestore(&locked_lruvec->lru_lock, locked_lruvec->irqflags);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf274739e619..75b8480bed69 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1256,7 +1256,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *page,
 	spin_lock_irq(&lruvec->lru_lock);
 
 	/* lruvec may changed in commit_charge() */
-	if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
+	if (unlikely(lruvec != mem_cgroup_page_lruvec(page, pgdat))) {
 		spin_unlock_irq(&lruvec->lru_lock);
 		goto again;
 	}
@@ -1274,7 +1274,7 @@ struct lruvec *lock_page_lruvec_irqsave(struct page *page,
 	spin_lock_irqsave(&lruvec->lru_lock, lruvec->irqflags);
 
 	/* lruvec may changed in commit_charge() */
-	if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
+	if (unlikely(lruvec != mem_cgroup_page_lruvec(page, pgdat))) {
 		spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
 		goto again;
 	}
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v3 7/7] mm/lru: revise the comments of lru_lock
  2019-11-16  3:14 [PATCH v3 0/8] per lruvec lru_lock for memcg Alex Shi
                   ` (5 preceding siblings ...)
  2019-11-16  3:15 ` [PATCH v3 6/7] mm/lru: likely enhancement Alex Shi
@ 2019-11-16  3:15 ` Alex Shi
  6 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-16  3:15 UTC (permalink / raw)
  To: alex.shi, cgroups, linux-kernel, linux-mm, akpm, mgorman, tj,
	hughd, khlebnikov, daniel.m.jordan, yang.shi, willy
  Cc: Jason Gunthorpe, Dan Williams, Vlastimil Babka, Ira Weiny,
	Jesper Dangaard Brouer, Andrey Ryabinin, Jann Horn,
	Logan Gunthorpe, Souptick Joarder, Ralph Campbell,
	Tobin C. Harding, Michal Hocko, Oscar Salvador, Wei Yang,
	Johannes Weiner, Arun KS, Darrick J. Wong, Amir Goldstein,
	Dave Chinner, Josef Bacik, Kirill A. Shutemov,
	Jérôme Glisse, Mike Kravetz, Kirill Tkhai, Yafang Shao

Since we changed pgdat->lru_lock to lruvec->lru_lock, it's time to fix the
now-incorrect comments in the code. Also fix some stale zone->lru_lock
comments left over from ancient times.

Originally-from: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
 Documentation/admin-guide/cgroup-v1/memory.rst     |  6 +++---
 Documentation/trace/events-kmem.rst                |  2 +-
 Documentation/vm/unevictable-lru.rst               | 22 ++++++++--------------
 include/linux/mm_types.h                           |  2 +-
 include/linux/mmzone.h                             |  2 +-
 mm/filemap.c                                       |  4 ++--
 mm/rmap.c                                          |  2 +-
 mm/vmscan.c                                        | 12 ++++++++----
 9 files changed, 28 insertions(+), 39 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 ======
-        Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global pgdat->lru_lock).
-	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under pgdat->lru_lock.
-
-	A special function is mem_cgroup_isolate_pages(). This scans
-	memcg's private LRU and call __isolate_lru_page() to extract a page
-	from LRU.
-
-	(By __isolate_lru_page(), the page is removed from both of global and
-	private LRU.)
-
+	Each memcg has its own vector of LRUs (inactive anon, active anon,
+	inactive file, active file, unevictable) of pages from each node,
+	each LRU handled under a single lru_lock for that memcg and node.
 
 9. Typical Tests.
 =================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 0ae4f564c2d6..60d97e8b7f3c 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -297,13 +297,13 @@ When oom event notifier is registered, event will be delivered.
 
    PG_locked.
      mm->page_table_lock
-         pgdat->lru_lock
+         lruvec->lru_lock
 	   lock_page_cgroup.
 
   In many cases, just lock_page_cgroup() is called.
 
-  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  pgdat->lru_lock, it has no lock of its own.
+  per-node-per-cgroup LRU (cgroup's private LRU) is just guarded by
+  lruvec->lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 
 4. Per-CPU Allocator Activity
 =============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux.  The problems have been observed at customer sites on large
 memory x86_64 systems.
 
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone.  When a large
+main memory will have over 32 million 4k pages in a single node.  When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable.  This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
 The Unevictable Page List
 -------------------------
 
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
 
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
 swap-backed pages.  This differentiation is only important while the pages are,
 in fact, evictable.
 
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
 
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock.  This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
 
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
 
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element).  The memory controller tracks the movement of pages to
 and from the unevictable list.
 
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
 active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked.  Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list.  If
+is now mlocked and divert the page to the node's unevictable list.  If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
 
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
      unevictable list in mlock_vma_page().
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 270aa8fd2800..ff08a6a8145c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock.  Sometimes used as a generic list
+			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2e9e67fd350c..622ff22fb69c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
diff --git a/mm/filemap.c b/mm/filemap.c
index 1f5731768222..65236af31db3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -101,8 +101,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 95f381bbaf7f..b32a15d34a9e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
  *         mapping->i_mmap_rwsem
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
+ *               lruvec->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba57c55a6a41..14e09c14977d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1612,14 +1612,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended.  Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan:	The number of eligible pages to look through on the list.
  * @lruvec:	The LRU vector to pull pages from.
@@ -1807,14 +1809,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 
 /*
  * This moves pages from @list to corresponding LRU list.
+ * The pages from @list is out of any lruvec, and in the end list reuses as
+ * pages_to_free list.
  *
  * We move them the other way if the page is referenced by one or more
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation.  But if
+ * appropriate to hold lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page.  It's impossible to balance
+ * should drop lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-16  3:15 ` [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2019-11-16  4:38   ` Matthew Wilcox
  2019-11-18 11:55     ` Alex Shi
  2019-11-16  7:03   ` Shakeel Butt
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2019-11-16  4:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Shakeel Butt,
	Chris Down, Thomas Gleixner, Vlastimil Babka, Qian Cai,
	Andrey Ryabinin, Kirill A. Shutemov, Jérôme Glisse,
	Andrea Arcangeli, David Rientjes, Aneesh Kumar K.V, swkhack,
	Potyra, Stefan, Mike Rapoport, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan,
	Nikolay Borisov, Ira Weiny, Kirill Tkhai, Yafang Shao

On Sat, Nov 16, 2019 at 11:15:02AM +0800, Alex Shi wrote:
> This is the main patch to replace the per-node lru_lock with a per-memcg
> lruvec lock. It also folds the irqsave flags into the lruvec.

I have to say, I don't love the part where we fold the irqsave flags
into the lruvec.  I know it saves us an argument, but it opens up the
possibility of mismatched expectations.  eg we currently have:

static void __split_huge_page(struct page *page, struct list_head *list,
			struct lruvec *lruvec, pgoff_t end)
{
...
	spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);

so if we introduce a new caller, we have to be certain that this caller
is also using lock_page_lruvec_irqsave() and not lock_page_lruvec_irq().
I can't think of a way to make the compiler enforce that, and if we don't,
then we can get some odd crashes with interrupts being unexpectedly
enabled or disabled, depending on how ->irqflags was used last.

So it makes the code more subtle.  And that's not a good thing.

> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
> +						struct pglist_data *pgdat)
> +{
> +	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
> +	spin_lock_irq(&lruvec->lru_lock);
> +
> +	return lruvec;
> +}

...

> +static struct lruvec *lock_page_lru(struct page *page, int *isolated)
>  {
>  	pg_data_t *pgdat = page_pgdat(page);
> +	struct lruvec *lruvec = lock_page_lruvec_irq(page, pgdat);
>  
> -	spin_lock_irq(&pgdat->lru_lock);
>  	if (PageLRU(page)) {
> -		struct lruvec *lruvec;
>  
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  		ClearPageLRU(page);
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
>  		*isolated = 1;
>  	} else
>  		*isolated = 0;
> +
> +	return lruvec;
>  }

But what if the page is !PageLRU?  What lruvec did we just lock?
According to the comments on mem_cgroup_page_lruvec(),

 * This function is only safe when following the LRU page isolation
 * and putback protocol: the LRU lock must be held, and the page must
 * either be PageLRU() or the caller must have isolated/allocated it.

and now it's being called in order to find out which LRU lock to take.
So this comment needs to be updated, if it's wrong, or this patch has
a race.



* Re: [PATCH v3 1/7] mm/lru: add per lruvec lock for memcg
  2019-11-16  3:15 ` [PATCH v3 1/7] mm/lru: add per lruvec lock " Alex Shi
@ 2019-11-16  6:28   ` Shakeel Butt
  2019-11-18  2:44     ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Shakeel Butt @ 2019-11-16  6:28 UTC (permalink / raw)
  To: Alex Shi
  Cc: Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi,
	Matthew Wilcox, Vlastimil Babka, Dan Williams, Michal Hocko,
	Wei Yang, Johannes Weiner, Arun KS

On Fri, Nov 15, 2019 at 7:15 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Currently memcg still uses the per-node pgdat->lru_lock to guard its
> lruvecs. That causes some lru_lock contention on systems with a high
> container density.
>
> If we can use a per-lruvec lock, that could relieve much of the lru_lock
> contention.
>
> The later patches will replace pgdat->lru_lock with lruvec->lru_lock
> and show the performance benefit in benchmarks.

Merge this patch with actual usage. No need to have a separate patch.

>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Wei Yang <richard.weiyang@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Arun KS <arunks@codeaurora.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  include/linux/mmzone.h | 2 ++
>  mm/mmzone.c            | 1 +
>  2 files changed, 3 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 07b784ffbb14..a13b8a602ee5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -267,6 +267,8 @@ struct lruvec {
>         unsigned long                   refaults;
>         /* Various lruvec state flags (enum lruvec_flags) */
>         unsigned long                   flags;
> +       /* per lruvec lru_lock for memcg */
> +       spinlock_t                      lru_lock;
>  #ifdef CONFIG_MEMCG
>         struct pglist_data *pgdat;
>  #endif
> diff --git a/mm/mmzone.c b/mm/mmzone.c
> index 4686fdc23bb9..3750a90ed4a0 100644
> --- a/mm/mmzone.c
> +++ b/mm/mmzone.c
> @@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
>         enum lru_list lru;
>
>         memset(lruvec, 0, sizeof(struct lruvec));
> +       spin_lock_init(&lruvec->lru_lock);
>
>         for_each_lru(lru)
>                 INIT_LIST_HEAD(&lruvec->lists[lru]);
> --
> 1.8.3.1
>


* Re: [PATCH v3 2/7] mm/lruvec: add irqsave flags into lruvec struct
  2019-11-16  3:15 ` [PATCH v3 2/7] mm/lruvec: add irqsave flags into lruvec struct Alex Shi
@ 2019-11-16  6:31   ` Shakeel Butt
  2019-11-18  2:52     ` Alex Shi
  2019-11-22  6:46   ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Shakeel Butt @ 2019-11-16  6:31 UTC (permalink / raw)
  To: Alex Shi
  Cc: Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi,
	Matthew Wilcox, Vlastimil Babka, Dan Williams, Michal Hocko,
	Wei Yang, Johannes Weiner, Arun KS, Rong Chen

On Fri, Nov 15, 2019 at 7:15 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> We need an irqflags variable to save state when doing an irqsave action;
> declaring it here makes the code clearer.
>
> Rong Chen <rong.a.chen@intel.com> reported that the 'irqflags' variable
> needs to move to the tail of the lruvec struct, otherwise it causes an 18%
> regression in vm-scalability testing on his machine. So place both the
> flags and lru_lock near the struct tail, even though I have no explanation
> for this performance loss.

Regressions compared to what? Also no need to have a separate patch.

>
> Originally-from: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Wei Yang <richard.weiyang@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Arun KS <arunks@codeaurora.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> CC: Rong Chen <rong.a.chen@intel.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  include/linux/mmzone.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a13b8a602ee5..9b8b8daf4e03 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -269,6 +269,8 @@ struct lruvec {
>         unsigned long                   flags;
>         /* per lruvec lru_lock for memcg */
>         spinlock_t                      lru_lock;
> +       /* flags for irqsave */
> +       unsigned long                   irqflags;
>  #ifdef CONFIG_MEMCG
>         struct pglist_data *pgdat;
>  #endif
> --
> 1.8.3.1
>


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-16  3:15 ` [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
  2019-11-16  4:38   ` Matthew Wilcox
@ 2019-11-16  7:03   ` Shakeel Butt
  2019-11-18 12:23     ` Alex Shi
  2019-11-18 16:11   ` Johannes Weiner
  2019-11-19  2:10   ` Daniel Jordan
  3 siblings, 1 reply; 30+ messages in thread
From: Shakeel Butt @ 2019-11-16  7:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi,
	Matthew Wilcox, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Chris Down, Thomas Gleixner, Vlastimil Babka,
	Qian Cai, Andrey Ryabinin, Kirill A. Shutemov,
	Jérôme Glisse, Andrea Arcangeli, David Rientjes,
	Aneesh Kumar K.V, swkhack, Potyra, Stefan, Mike Rapoport,
	Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

On Fri, Nov 15, 2019 at 7:15 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> This patchset moves lru_lock into the lruvec, giving each lruvec its own
> lru_lock and thus one lru_lock per memcg per node.
>
> This is the main patch to replace the per-node lru_lock with a per-memcg
> lruvec lock. It also folds the irqsave flags into the lruvec.
>
> We introduce the function lock_page_lruvec, which is the same as the
> vanilla pgdat lock when no memory cgroup is set. With memcg, the function
> will keep retaking the lruvec's lock to guard against page->mem_cgroup
> changes during page migration between memcgs. (Thanks to Hugh Dickins and
> Konstantin Khlebnikov for the reminder on this; the core logic is the same
> as in their previous patches.)
>
> Following Daniel Jordan's suggestion, I ran 64 'dd' tasks in 32 containers
> on my 2-socket * 8-core * HT box with the modified case:
>   https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this and later patches, the dd performance is 144MB/s versus 123MB/s
> on the vanilla kernel, a 17% performance increase.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Chris Down <chris@chrisdown.name>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> Cc: swkhack <swkhack@gmail.com>
> Cc: "Potyra, Stefan" <Stefan.Potyra@elektrobit.com>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Colin Ian King <colin.king@canonical.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peng Fan <peng.fan@nxp.com>
> Cc: Nikolay Borisov <nborisov@suse.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: cgroups@vger.kernel.org
> ---
>  include/linux/memcontrol.h | 23 ++++++++++++++
>  mm/compaction.c            | 62 ++++++++++++++++++++++++------------
>  mm/huge_memory.c           | 16 ++++------
>  mm/memcontrol.c            | 64 +++++++++++++++++++++++++++++--------
>  mm/mlock.c                 | 31 +++++++++---------
>  mm/page_idle.c             |  5 +--
>  mm/swap.c                  | 79 +++++++++++++++++++---------------------------
>  mm/vmscan.c                | 58 ++++++++++++++++------------------
>  8 files changed, 201 insertions(+), 137 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5b86287fa069..0b32eadd0eda 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -418,6 +418,9 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
> +struct lruvec *lock_page_lruvec_irq(struct page *, struct pglist_data *);
> +struct lruvec *lock_page_lruvec_irqsave(struct page *, struct pglist_data *);
> +
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>
>  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
> @@ -901,6 +904,26 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>         return &pgdat->__lruvec;
>  }
>
> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
> +                                               struct pglist_data *pgdat)
> +{
> +       struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
> +       spin_lock_irq(&lruvec->lru_lock);
> +
> +       return lruvec;
> +}
> +
> +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +                                               struct pglist_data *pgdat)
> +{
> +       struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
> +       spin_lock_irqsave(&lruvec->lru_lock, lruvec->irqflags);
> +
> +       return lruvec;
> +}
> +
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
>                 struct mem_cgroup *memcg)
>  {
> diff --git a/mm/compaction.c b/mm/compaction.c
> index d20816b21b55..eb1ad30c1bef 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -786,8 +786,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         pg_data_t *pgdat = cc->zone->zone_pgdat;
>         unsigned long nr_scanned = 0, nr_isolated = 0;
>         struct lruvec *lruvec;
> -       unsigned long flags = 0;
> -       bool locked = false;
> +       struct lruvec *locked_lruvec = NULL;
>         struct page *page = NULL, *valid_page = NULL;
>         unsigned long start_pfn = low_pfn;
>         bool skip_on_failure = false;
> @@ -847,11 +846,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * contention, to give chance to IRQs. Abort completely if
>                  * a fatal signal is pending.
>                  */
> -               if (!(low_pfn % SWAP_CLUSTER_MAX)
> -                   && compact_unlock_should_abort(&pgdat->lru_lock,
> -                                           flags, &locked, cc)) {
> -                       low_pfn = 0;
> -                       goto fatal_pending;
> +               if (!(low_pfn % SWAP_CLUSTER_MAX)) {
> +                       if (locked_lruvec) {
> +                               spin_unlock_irqrestore(&locked_lruvec->lru_lock,
> +                                               locked_lruvec->irqflags);
> +                               locked_lruvec = NULL;
> +                       }
> +
> +                       if (fatal_signal_pending(current)) {
> +                               cc->contended = true;
> +
> +                               low_pfn = 0;
> +                               goto fatal_pending;
> +                       }
> +
> +                       cond_resched();
>                 }
>
>                 if (!pfn_valid_within(low_pfn))
> @@ -920,10 +929,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(__PageMovable(page)) &&
>                                         !PageIsolated(page)) {
> -                               if (locked) {
> -                                       spin_unlock_irqrestore(&pgdat->lru_lock,
> -                                                                       flags);
> -                                       locked = false;
> +                               if (locked_lruvec) {
> +                                       spin_unlock_irqrestore(&locked_lruvec->lru_lock,
> +                                                       locked_lruvec->irqflags);
> +                                       locked_lruvec = NULL;
>                                 }
>
>                                 if (!isolate_movable_page(page, isolate_mode))
> @@ -949,10 +958,22 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>                         goto isolate_fail;
>
> +reget_lruvec:
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
>                 /* If we already hold the lock, we can skip some rechecking */
> -               if (!locked) {
> -                       locked = compact_lock_irqsave(&pgdat->lru_lock,
> -                                                               &flags, cc);
> +               if (lruvec != locked_lruvec) {
> +                       if (locked_lruvec) {
> +                               spin_unlock_irqrestore(&locked_lruvec->lru_lock,
> +                                               locked_lruvec->irqflags);
> +                               locked_lruvec = NULL;
> +                       }

What guarantees the lifetime of lruvec? You should read the comment on
mem_cgroup_page_lruvec(). Have you seen the patches Hugh had shared?
Please look at trylock_page_lruvec().

BTW have you tested Hugh's patches?

> +                       if (compact_lock_irqsave(&lruvec->lru_lock,
> +                                                       &lruvec->irqflags, cc))
> +                               locked_lruvec = lruvec;
> +
> +                       if (lruvec != mem_cgroup_page_lruvec(page, pgdat))
> +                               goto reget_lruvec;
>
>                         /* Try get exclusive access under lock */
>                         if (!skip_updated) {
> @@ -976,7 +997,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                         }
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 /* Try isolate the page */
>                 if (__isolate_lru_page(page, isolate_mode) != 0)
> @@ -1017,9 +1037,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * page anyway.
>                  */
>                 if (nr_isolated) {
> -                       if (locked) {
> -                               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                               locked = false;
> +                       if (locked_lruvec) {
> +                               spin_unlock_irqrestore(&locked_lruvec->lru_lock,
> +                                               locked_lruvec->irqflags);
> +                               locked_lruvec = NULL;
>                         }
>                         putback_movable_pages(&cc->migratepages);
>                         cc->nr_migratepages = 0;
> @@ -1044,8 +1065,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 low_pfn = end_pfn;
>
>  isolate_abort:
> -       if (locked)
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (locked_lruvec)
> +               spin_unlock_irqrestore(&locked_lruvec->lru_lock,
> +                                               locked_lruvec->irqflags);
>
>         /*
>          * Updated the cached scanner pfn once the pageblock has been scanned
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 13cc93785006..7e8bd6c700d2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2495,17 +2495,13 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  }
>
>  static void __split_huge_page(struct page *page, struct list_head *list,
> -               pgoff_t end, unsigned long flags)
> +                       struct lruvec *lruvec, pgoff_t end)
>  {
>         struct page *head = compound_head(page);
> -       pg_data_t *pgdat = page_pgdat(head);
> -       struct lruvec *lruvec;
>         struct address_space *swap_cache = NULL;
>         unsigned long offset = 0;
>         int i;
>
> -       lruvec = mem_cgroup_page_lruvec(head, pgdat);
> -
>         /* complete memcg works before add pages to LRU */
>         mem_cgroup_split_huge_fixup(head);
>
> @@ -2554,7 +2550,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>                 xa_unlock(&head->mapping->i_pages);
>         }
>
> -       spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
>
>         remap_page(head);
>
> @@ -2697,9 +2693,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>         struct deferred_split *ds_queue = get_deferred_split_queue(page);
>         struct anon_vma *anon_vma = NULL;
>         struct address_space *mapping = NULL;
> +       struct lruvec *lruvec;
>         int count, mapcount, extra_pins, ret;
>         bool mlocked;
> -       unsigned long flags;
>         pgoff_t end;
>
>         VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
> @@ -2766,7 +2762,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>                 lru_add_drain();
>
>         /* prevent PageLRU to go away from under us, and freeze lru stats */
> -       spin_lock_irqsave(&pgdata->lru_lock, flags);
> +       lruvec = lock_page_lruvec_irqsave(head, pgdata);
>
>         if (mapping) {
>                 XA_STATE(xas, &mapping->i_pages, page_index(head));
> @@ -2797,7 +2793,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>                 }
>
>                 spin_unlock(&ds_queue->split_queue_lock);
> -               __split_huge_page(page, list, end, flags);
> +               __split_huge_page(page, list, lruvec, end);
>                 if (PageSwapCache(head)) {
>                         swp_entry_t entry = { .val = page_private(head) };
>
> @@ -2816,7 +2812,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>                 spin_unlock(&ds_queue->split_queue_lock);
>  fail:          if (mapping)
>                         xa_unlock(&mapping->i_pages);
> -               spin_unlock_irqrestore(&pgdata->lru_lock, flags);
> +               spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
>                 remap_page(head);
>                 ret = -EBUSY;
>         }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 62470325f9bc..cf274739e619 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1246,6 +1246,42 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>         return lruvec;
>  }
>
> +struct lruvec *lock_page_lruvec_irq(struct page *page,
> +                                       struct pglist_data *pgdat)
> +{
> +       struct lruvec *lruvec;
> +
> +again:
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +       spin_lock_irq(&lruvec->lru_lock);
> +
> +       /* lruvec may changed in commit_charge() */
> +       if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
> +               spin_unlock_irq(&lruvec->lru_lock);
> +               goto again;
> +       }
> +
> +       return lruvec;
> +}
> +
> +struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +                                       struct pglist_data *pgdat)
> +{
> +       struct lruvec *lruvec;
> +
> +again:
> +       lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +       spin_lock_irqsave(&lruvec->lru_lock, lruvec->irqflags);
> +
> +       /* lruvec may changed in commit_charge() */
> +       if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
> +               spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
> +               goto again;
> +       }
> +
> +       return lruvec;
> +}
> +
>  /**
>   * mem_cgroup_update_lru_size - account for adding or removing an lru page
>   * @lruvec: mem_cgroup per zone lru vector
> @@ -2571,41 +2607,43 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>         css_put_many(&memcg->css, nr_pages);
>  }
>
> -static void lock_page_lru(struct page *page, int *isolated)
> +static struct lruvec *lock_page_lru(struct page *page, int *isolated)
>  {
>         pg_data_t *pgdat = page_pgdat(page);
> +       struct lruvec *lruvec = lock_page_lruvec_irq(page, pgdat);
>
> -       spin_lock_irq(&pgdat->lru_lock);
>         if (PageLRU(page)) {
> -               struct lruvec *lruvec;
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_lru(page));
>                 *isolated = 1;
>         } else
>                 *isolated = 0;
> +
> +       return lruvec;
>  }
>
> -static void unlock_page_lru(struct page *page, int isolated)
> +static void unlock_page_lru(struct page *page, int isolated,
> +                               struct lruvec *locked_lruvec)
>  {
> -       pg_data_t *pgdat = page_pgdat(page);
> +       struct lruvec *lruvec;
>
> -       if (isolated) {
> -               struct lruvec *lruvec;
> +       spin_unlock_irq(&locked_lruvec->lru_lock);
> +       lruvec = lock_page_lruvec_irq(page, page_pgdat(page));
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +       if (isolated) {
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 SetPageLRU(page);
>                 add_page_to_lru_list(page, lruvec, page_lru(page));
>         }
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>  }
>
>  static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>                           bool lrucare)
>  {
>         int isolated;
> +       struct lruvec *lruvec;
>
>         VM_BUG_ON_PAGE(page->mem_cgroup, page);
>
> @@ -2614,7 +2652,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>          * may already be on some other mem_cgroup's LRU.  Take care of it.
>          */
>         if (lrucare)
> -               lock_page_lru(page, &isolated);
> +               lruvec = lock_page_lru(page, &isolated);
>
>         /*
>          * Nobody should be changing or seriously looking at
> @@ -2633,7 +2671,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
>         page->mem_cgroup = memcg;
>
>         if (lrucare)
> -               unlock_page_lru(page, isolated);
> +               unlock_page_lru(page, isolated, lruvec);
>  }
>
>  #ifdef CONFIG_MEMCG_KMEM
> @@ -2929,7 +2967,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
>
>  /*
>   * Because tail pages are not marked as "used", set it. We're under
> - * pgdat->lru_lock and migration entries setup in all page mappings.
> + * pgdat->lruvec.lru_lock and migration entries setup in all page mappings.
>   */
>  void mem_cgroup_split_huge_fixup(struct page *head)
>  {
> diff --git a/mm/mlock.c b/mm/mlock.c
> index a72c1eeded77..b509b80b8513 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page)
>   * Isolate a page from LRU with optional get_page() pin.
>   * Assumes lru_lock already held and page already pinned.
>   */
> -static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
> +static bool __munlock_isolate_lru_page(struct page *page,
> +                       struct lruvec *lruvec, bool getpage)
>  {
>         if (PageLRU(page)) {
> -               struct lruvec *lruvec;
> -
> -               lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>                 if (getpage)
>                         get_page(page);
>                 ClearPageLRU(page);
> @@ -183,6 +181,7 @@ unsigned int munlock_vma_page(struct page *page)
>  {
>         int nr_pages;
>         pg_data_t *pgdat = page_pgdat(page);
> +       struct lruvec *lruvec;
>
>         /* For try_to_munlock() and to serialize with page migration */
>         BUG_ON(!PageLocked(page));
> @@ -194,7 +193,7 @@ unsigned int munlock_vma_page(struct page *page)
>          * might otherwise copy PageMlocked to part of the tail pages before
>          * we clear it in the head page. It also stabilizes hpage_nr_pages().
>          */
> -       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = lock_page_lruvec_irq(page, pgdat);
>
>         if (!TestClearPageMlocked(page)) {
>                 /* Potentially, PTE-mapped THP: do not skip the rest PTEs */
> @@ -205,15 +204,15 @@ unsigned int munlock_vma_page(struct page *page)
>         nr_pages = hpage_nr_pages(page);
>         __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
>
> -       if (__munlock_isolate_lru_page(page, true)) {
> -               spin_unlock_irq(&pgdat->lru_lock);
> +       if (__munlock_isolate_lru_page(page, lruvec, true)) {
> +               spin_unlock_irq(&lruvec->lru_lock);
>                 __munlock_isolated_page(page);
>                 goto out;
>         }
>         __munlock_isolation_failed(page);
>
>  unlock_out:
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>  out:
>         return nr_pages - 1;
> @@ -291,28 +290,29 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  {
>         int i;
>         int nr = pagevec_count(pvec);
> -       int delta_munlocked = -nr;
>         struct pagevec pvec_putback;
> +       struct lruvec *lruvec = NULL;
>         int pgrescued = 0;
>
>         pagevec_init(&pvec_putback);
>
>         /* Phase 1: page isolation */
> -       spin_lock_irq(&zone->zone_pgdat->lru_lock);
>         for (i = 0; i < nr; i++) {
>                 struct page *page = pvec->pages[i];
>
> +               lruvec = lock_page_lruvec_irq(page, page_pgdat(page));
> +
>                 if (TestClearPageMlocked(page)) {
>                         /*
>                          * We already have pin from follow_page_mask()
>                          * so we can spare the get_page() here.
>                          */
> -                       if (__munlock_isolate_lru_page(page, false))
> +                       if (__munlock_isolate_lru_page(page, lruvec, false)) {
> +                               __mod_zone_page_state(zone, NR_MLOCK,  -1);
> +                               spin_unlock_irq(&lruvec->lru_lock);
>                                 continue;
> -                       else
> +                       } else
>                                 __munlock_isolation_failed(page);
> -               } else {
> -                       delta_munlocked++;
>                 }
>
>                 /*
> @@ -323,9 +323,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>                  */
>                 pagevec_add(&pvec_putback, pvec->pages[i]);
>                 pvec->pages[i] = NULL;
> +               spin_unlock_irq(&lruvec->lru_lock);
>         }
> -       __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
> -       spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>
>         /* Now we can release pins of pages that we are not munlocking */
>         pagevec_release(&pvec_putback);
> diff --git a/mm/page_idle.c b/mm/page_idle.c
> index 295512465065..25f4b1cf3e0f 100644
> --- a/mm/page_idle.c
> +++ b/mm/page_idle.c
> @@ -32,6 +32,7 @@ static struct page *page_idle_get_page(unsigned long pfn)
>  {
>         struct page *page;
>         pg_data_t *pgdat;
> +       struct lruvec *lruvec;
>
>         if (!pfn_valid(pfn))
>                 return NULL;
> @@ -42,12 +43,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
>                 return NULL;
>
>         pgdat = page_pgdat(page);
> -       spin_lock_irq(&pgdat->lru_lock);
> +       lruvec = lock_page_lruvec_irq(page, pgdat);
>         if (unlikely(!PageLRU(page))) {
>                 put_page(page);
>                 page = NULL;
>         }
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>         return page;
>  }
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 5341ae93861f..60f04cb2b49e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -62,14 +62,12 @@ static void __page_cache_release(struct page *page)
>         if (PageLRU(page)) {
>                 pg_data_t *pgdat = page_pgdat(page);
>                 struct lruvec *lruvec;
> -               unsigned long flags;
>
> -               spin_lock_irqsave(&pgdat->lru_lock, flags);
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +               lruvec = lock_page_lruvec_irqsave(page, pgdat);
>                 VM_BUG_ON_PAGE(!PageLRU(page), page);
>                 __ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +               spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
>         }
>         __ClearPageWaiters(page);
>  }
> @@ -192,26 +190,17 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>         void *arg)
>  {
>         int i;
> -       struct pglist_data *pgdat = NULL;
> -       struct lruvec *lruvec;
> -       unsigned long flags = 0;
> +       struct lruvec *lruvec = NULL;
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
>                 struct page *page = pvec->pages[i];
> -               struct pglist_data *pagepgdat = page_pgdat(page);
>
> -               if (pagepgdat != pgdat) {
> -                       if (pgdat)
> -                               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                       pgdat = pagepgdat;
> -                       spin_lock_irqsave(&pgdat->lru_lock, flags);
> -               }
> +               lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page));
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 (*move_fn)(page, lruvec, arg);
> +               spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
>         }
> -       if (pgdat)
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +
>         release_pages(pvec->pages, pvec->nr);
>         pagevec_reinit(pvec);
>  }
> @@ -325,11 +314,12 @@ static inline void activate_page_drain(int cpu)
>  void activate_page(struct page *page)
>  {
>         pg_data_t *pgdat = page_pgdat(page);
> +       struct lruvec *lruvec;
>
>         page = compound_head(page);
> -       spin_lock_irq(&pgdat->lru_lock);
> -       __activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       lruvec = lock_page_lruvec_irq(page, pgdat);
> +       __activate_page(page, lruvec, NULL);
> +       spin_unlock_irq(&lruvec->lru_lock);
>  }
>  #endif
>
> @@ -780,9 +770,7 @@ void release_pages(struct page **pages, int nr)
>  {
>         int i;
>         LIST_HEAD(pages_to_free);
> -       struct pglist_data *locked_pgdat = NULL;
> -       struct lruvec *lruvec;
> -       unsigned long uninitialized_var(flags);
> +       struct lruvec *lruvec = NULL;
>         unsigned int uninitialized_var(lock_batch);
>
>         for (i = 0; i < nr; i++) {
> @@ -791,21 +779,22 @@ void release_pages(struct page **pages, int nr)
>                 /*
>                  * Make sure the IRQ-safe lock-holding time does not get
>                  * excessive with a continuous string of pages from the
> -                * same pgdat. The lock is held only if pgdat != NULL.
> +                * same lruvec. The lock is held only if lruvec != NULL.
>                  */
> -               if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
> -                       spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> -                       locked_pgdat = NULL;
> +               if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
> +                       spin_unlock_irqrestore(&lruvec->lru_lock,
> +                                                       lruvec->irqflags);
> +                       lruvec = NULL;
>                 }
>
>                 if (is_huge_zero_page(page))
>                         continue;
>
>                 if (is_zone_device_page(page)) {
> -                       if (locked_pgdat) {
> -                               spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> -                                                      flags);
> -                               locked_pgdat = NULL;
> +                       if (lruvec) {
> +                               spin_unlock_irqrestore(&lruvec->lru_lock,
> +                                                      lruvec->irqflags);
> +                               lruvec = NULL;
>                         }
>                         /*
>                          * ZONE_DEVICE pages that return 'false' from
> @@ -822,27 +811,25 @@ void release_pages(struct page **pages, int nr)
>                         continue;
>
>                 if (PageCompound(page)) {
> -                       if (locked_pgdat) {
> -                               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> -                               locked_pgdat = NULL;
> +                       if (lruvec) {
> +                               spin_unlock_irqrestore(&lruvec->lru_lock,
> +                                                               lruvec->irqflags);
> +                               lruvec = NULL;
>                         }
>                         __put_compound_page(page);
>                         continue;
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct pglist_data *pgdat = page_pgdat(page);
> +                       struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>
> -                       if (pgdat != locked_pgdat) {
> -                               if (locked_pgdat)
> -                                       spin_unlock_irqrestore(&locked_pgdat->lru_lock,
> -                                                                       flags);
> +                       if (new_lruvec != lruvec) {
> +                               if (lruvec)
> +                                       spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
>                                 lock_batch = 0;
> -                               locked_pgdat = pgdat;
> -                               spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
> -                       }
> +                               lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page));
>
> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
> +                       }
>                         VM_BUG_ON_PAGE(!PageLRU(page), page);
>                         __ClearPageLRU(page);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -854,8 +841,8 @@ void release_pages(struct page **pages, int nr)
>
>                 list_add(&page->lru, &pages_to_free);
>         }
> -       if (locked_pgdat)
> -               spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
> +       if (lruvec)
> +               spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
>
>         mem_cgroup_uncharge_list(&pages_to_free);
>         free_unref_page_list(&pages_to_free);
> @@ -893,7 +880,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>         VM_BUG_ON_PAGE(!PageHead(page), page);
>         VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>         VM_BUG_ON_PAGE(PageLRU(page_tail), page);
> -       lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
> +       lockdep_assert_held(&lruvec->lru_lock);
>
>         if (!list)
>                 SetPageLRU(page_tail);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d97985262dda..3cdf343e7a27 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1755,8 +1755,7 @@ int isolate_lru_page(struct page *page)
>                 pg_data_t *pgdat = page_pgdat(page);
>                 struct lruvec *lruvec;
>
> -               spin_lock_irq(&pgdat->lru_lock);
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +               lruvec = lock_page_lruvec_irq(page, pgdat);
>                 if (PageLRU(page)) {
>                         int lru = page_lru(page);
>                         get_page(page);
> @@ -1764,7 +1763,7 @@ int isolate_lru_page(struct page *page)
>                         del_page_from_lru_list(page, lruvec, lru);
>                         ret = 0;
>                 }
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               spin_unlock_irq(&lruvec->lru_lock);
>         }
>         return ret;
>  }
> @@ -1829,7 +1828,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                                                      struct list_head *list)
>  {
> -       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         int nr_pages, nr_moved = 0;
>         LIST_HEAD(pages_to_free);
>         struct page *page;
> @@ -1840,12 +1838,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 if (unlikely(!page_evictable(page))) {
>                         list_del(&page->lru);
> -                       spin_unlock_irq(&pgdat->lru_lock);
> +                       spin_unlock_irq(&lruvec->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(&pgdat->lru_lock);
> +                       spin_lock_irq(&lruvec->lru_lock);
>                         continue;
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 SetPageLRU(page);
>                 lru = page_lru(page);
> @@ -1860,9 +1857,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                         del_page_from_lru_list(page, lruvec, lru);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(&pgdat->lru_lock);
> +                               spin_unlock_irq(&lruvec->lru_lock);
>                                 (*get_compound_page_dtor(page))(page);
> -                               spin_lock_irq(&pgdat->lru_lock);
> +                               spin_lock_irq(&lruvec->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>                 } else {
> @@ -1925,7 +1922,7 @@ static int current_may_throttle(void)
>
>         lru_add_drain();
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
>                                      &nr_scanned, sc, lru);
> @@ -1937,7 +1934,7 @@ static int current_may_throttle(void)
>         if (!cgroup_reclaim(sc))
>                 __count_vm_events(item, nr_scanned);
>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         if (nr_taken == 0)
>                 return 0;
> @@ -1945,7 +1942,7 @@ static int current_may_throttle(void)
>         nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
>                                 &stat, false);
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>
>         item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
>         if (!cgroup_reclaim(sc))
> @@ -1958,7 +1955,7 @@ static int current_may_throttle(void)
>
>         __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         mem_cgroup_uncharge_list(&page_list);
>         free_unref_page_list(&page_list);
> @@ -2011,7 +2008,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>
>         lru_add_drain();
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>
>         nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
>                                      &nr_scanned, sc, lru);
> @@ -2022,7 +2019,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         __count_vm_events(PGREFILL, nr_scanned);
>         __count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
>
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         while (!list_empty(&l_hold)) {
>                 cond_resched();
> @@ -2068,7 +2065,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         /*
>          * Move pages back to the lru list.
>          */
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>         /*
>          * Count referenced pages from currently used mappings as rotated,
>          * even though only some of them are actually re-activated.  This
> @@ -2086,7 +2083,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>         __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
>
>         __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         mem_cgroup_uncharge_list(&l_active);
>         free_unref_page_list(&l_active);
> @@ -2371,7 +2368,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>         file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
>                 lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
>
> -       spin_lock_irq(&pgdat->lru_lock);
> +       spin_lock_irq(&lruvec->lru_lock);
>         if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
>                 reclaim_stat->recent_scanned[0] /= 2;
>                 reclaim_stat->recent_rotated[0] /= 2;
> @@ -2392,7 +2389,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>
>         fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
>         fp /= reclaim_stat->recent_rotated[1] + 1;
> -       spin_unlock_irq(&pgdat->lru_lock);
> +       spin_unlock_irq(&lruvec->lru_lock);
>
>         fraction[0] = ap;
>         fraction[1] = fp;
> @@ -4285,24 +4282,25 @@ int page_evictable(struct page *page)
>   */
>  void check_move_unevictable_pages(struct pagevec *pvec)
>  {
> -       struct lruvec *lruvec;
> -       struct pglist_data *pgdat = NULL;
> +       struct lruvec *lruvec = NULL;
>         int pgscanned = 0;
>         int pgrescued = 0;
>         int i;
>
>         for (i = 0; i < pvec->nr; i++) {
>                 struct page *page = pvec->pages[i];
> -               struct pglist_data *pagepgdat = page_pgdat(page);
> +               struct pglist_data *pgdat = page_pgdat(page);
> +               struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
>
>                 pgscanned++;
> -               if (pagepgdat != pgdat) {
> -                       if (pgdat)
> -                               spin_unlock_irq(&pgdat->lru_lock);
> -                       pgdat = pagepgdat;
> -                       spin_lock_irq(&pgdat->lru_lock);
> +
> +               if (lruvec != new_lruvec) {
> +                       if (lruvec)
> +                               spin_unlock_irq(&lruvec->lru_lock);
> +                       lruvec = new_lruvec;
> +                       spin_lock_irq(&lruvec->lru_lock);
>                 }
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> @@ -4318,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>                 }
>         }
>
> -       if (pgdat) {
> +       if (lruvec) {
>                 __count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
>                 __count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               spin_unlock_irq(&lruvec->lru_lock);
>         }
>  }
>  EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/7] mm/lru: add per lruvec lock for memcg
  2019-11-16  6:28   ` Shakeel Butt
@ 2019-11-18  2:44     ` Alex Shi
  2019-11-18 12:08       ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2019-11-18  2:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi,
	Matthew Wilcox, Vlastimil Babka, Dan Williams, Michal Hocko,
	Wei Yang, Johannes Weiner, Arun KS



On 2019/11/16 2:28 PM, Shakeel Butt wrote:
> On Fri, Nov 15, 2019 at 7:15 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>> Currently memcg still uses the per-node pgdat->lru_lock to guard its lruvecs.
>> That causes lru_lock contention on systems with a high container density.
>>
>> Using a per-lruvec lock instead could relieve much of that lru_lock
>> contention.
>>
>> The later patches will replace the pgdat->lru_lock with lruvec->lru_lock
>> and show the performance benefit by benchmarks.
> 
> Merge this patch with actual usage. No need to have a separate patch.

Thanks for the comment, Shakeel!

I considered that, but the 3rd patch, which does the actual replacement, is already huge and hard to split, so I would rather pull some preparation out of it, to make the patches a bit more readable. Do you agree?

Thanks
Alex


* Re: [PATCH v3 2/7] mm/lruvec: add irqsave flags into lruvec struct
  2019-11-16  6:31   ` Shakeel Butt
@ 2019-11-18  2:52     ` Alex Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-18  2:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi,
	Matthew Wilcox, Vlastimil Babka, Dan Williams, Michal Hocko,
	Wei Yang, Johannes Weiner, Arun KS, Rong Chen



On 2019/11/16 2:31 PM, Shakeel Butt wrote:
> On Fri, Nov 15, 2019 at 7:15 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >> We need an irqflags variable to save state for irqsave actions; declaring
> >> it here makes the code clearer and cleaner.
> >>
> >> Rong Chen <rong.a.chen@intel.com> reported that the 'irqflags' variable
> >> needs to move to the tail of the lruvec struct, otherwise it causes an 18%
> >> regression in vm-scalability testing on his machine. So place both the flags
> >> and the lru_lock near the struct tail, even though I have no explanation
> >> for the performance loss.
> Regressions compared to what? Also no need to have a separate patch.
> 

Thanks for the question, Shakeel.
I will change the commit log to mention the original place: the head of the lruvec struct.
As to folding this in, the 3rd patch is already too long. May I leave this change here, to ease the reading burden a little?

Thanks
Alex


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-16  4:38   ` Matthew Wilcox
@ 2019-11-18 11:55     ` Alex Shi
  2019-11-18 12:14       ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2019-11-18 11:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Shakeel Butt,
	Chris Down, Thomas Gleixner, Vlastimil Babka, Qian Cai,
	Andrey Ryabinin, Kirill A. Shutemov, Jérôme Glisse,
	Andrea Arcangeli, David Rientjes, Aneesh Kumar K.V, swkhack,
	Potyra, Stefan, Mike Rapoport, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan,
	Nikolay Borisov, Ira Weiny, Kirill Tkhai, Yafang Shao



On 2019/11/16 12:38 PM, Matthew Wilcox wrote:
> On Sat, Nov 16, 2019 at 11:15:02AM +0800, Alex Shi wrote:
> >> This is the main patch, replacing the per-node lru_lock with a per-memcg
> >> lruvec lock. It also folds the irqsave flags into the lruvec.
> 
> I have to say, I don't love the part where we fold the irqsave flags
> into the lruvec.  I know it saves us an argument, but it opens up the
> possibility of mismatched expectations.  eg we currently have:
> 
> static void __split_huge_page(struct page *page, struct list_head *list,
> 			struct lruvec *lruvec, pgoff_t end)
> {
> ...
> 	spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
> 
> so if we introduce a new caller, we have to be certain that this caller
> is also using lock_page_lruvec_irqsave() and not lock_page_lruvec_irq().
> I can't think of a way to make the compiler enforce that, and if we don't,
> then we can get some odd crashes with interrupts being unexpectedly
> enabled or disabled, depending on how ->irqflags was used last.
> 
> So it makes the code more subtle.  And that's not a good thing.

Hi Matthew,

Thanks for comments!

Here the irqflags is bound to the lruvec; merging them lets us treat the lock and its flags as a whole, and removes an unnecessary extra argument from the call chains.
The only possible downside is that it takes a little more space in pg_data_t.lruvec, but the existing padding there removes that concern.

As for your concern about a 'new' caller: since __split_huge_page() is a static helper here, nothing else is disturbed.

Do you agree on that?

> 
>> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
>> +						struct pglist_data *pgdat)
>> +{
>> +	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> +
>> +	spin_lock_irq(&lruvec->lru_lock);
>> +
>> +	return lruvec;
>> +}
> 
> ...
> 
>> +static struct lruvec *lock_page_lru(struct page *page, int *isolated)
>>  {
>>  	pg_data_t *pgdat = page_pgdat(page);
>> +	struct lruvec *lruvec = lock_page_lruvec_irq(page, pgdat);
>>  
>> -	spin_lock_irq(&pgdat->lru_lock);
>>  	if (PageLRU(page)) {
>> -		struct lruvec *lruvec;
>>  
>> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>>  		ClearPageLRU(page);
>>  		del_page_from_lru_list(page, lruvec, page_lru(page));
>>  		*isolated = 1;
>>  	} else
>>  		*isolated = 0;
>> +
>> +	return lruvec;
>>  }
> 
> But what if the page is !PageLRU?  What lruvec did we just lock?

As with the original pgdat->lru_lock, we need the lock to close the race against PageLRU changing. And it is the lruvec that the page should belong to.


> According to the comments on mem_cgroup_page_lruvec(),
> 
>  * This function is only safe when following the LRU page isolation
>  * and putback protocol: the LRU lock must be held, and the page must
>  * either be PageLRU() or the caller must have isolated/allocated it.
> 
> and now it's being called in order to find out which LRU lock to take.
> So this comment needs to be updated, if it's wrong, or this patch has
> a race.


Yes, the function's comment is misleading under the new patch. How about the following change:

- * This function is only safe when following the LRU page isolation
- * and putback protocol: the LRU lock must be held, and the page must
- * either be PageLRU() or the caller must have isolated/allocated it.
+ * The caller needs to guarantee that the page's mem_cgroup is undisturbed
+ * while it is in use. That can be done via lock_page_memcg() or
+ * lock_page_lruvec().

Thanks
Alex


* Re: [PATCH v3 1/7] mm/lru: add per lruvec lock for memcg
  2019-11-18  2:44     ` Alex Shi
@ 2019-11-18 12:08       ` Matthew Wilcox
  2019-11-18 12:37         ` Alex Shi
  2019-11-19 10:05         ` Alex Shi
  0 siblings, 2 replies; 30+ messages in thread
From: Matthew Wilcox @ 2019-11-18 12:08 UTC (permalink / raw)
  To: Alex Shi
  Cc: Shakeel Butt, Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman,
	Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
	Yang Shi, Vlastimil Babka, Dan Williams, Michal Hocko, Wei Yang,
	Johannes Weiner, Arun KS

On Mon, Nov 18, 2019 at 10:44:57AM +0800, Alex Shi wrote:
> 
> 
> On 2019/11/16 2:28 PM, Shakeel Butt wrote:
> > On Fri, Nov 15, 2019 at 7:15 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >> Currently memcg still uses the per-node pgdat->lru_lock to guard its lruvecs.
> >> That causes lru_lock contention on systems with a high container density.
> >>
> >> Using a per-lruvec lock instead could relieve much of that lru_lock
> >> contention.
> >>
> >> The later patches will replace the pgdat->lru_lock with lruvec->lru_lock
> >> and show the performance benefit by benchmarks.
> > 
> > Merge this patch with actual usage. No need to have a separate patch.
> 
> Thanks for the comment, Shakeel!
>
> I considered that, but the 3rd patch, which does the actual replacement, is already huge and hard to split, so I would rather pull some preparation out of it, to make the patches a bit more readable. Do you agree?

This method of splitting the patches doesn't help with the reviewability of
the patch series.


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-18 11:55     ` Alex Shi
@ 2019-11-18 12:14       ` Matthew Wilcox
  2019-11-18 12:31         ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2019-11-18 12:14 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Shakeel Butt,
	Chris Down, Thomas Gleixner, Vlastimil Babka, Qian Cai,
	Andrey Ryabinin, Kirill A. Shutemov, Jérôme Glisse,
	Andrea Arcangeli, David Rientjes, Aneesh Kumar K.V, swkhack,
	Potyra, Stefan, Mike Rapoport, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan,
	Nikolay Borisov, Ira Weiny, Kirill Tkhai, Yafang Shao

On Mon, Nov 18, 2019 at 07:55:43PM +0800, Alex Shi wrote:
> On 2019/11/16 12:38 PM, Matthew Wilcox wrote:
> > On Sat, Nov 16, 2019 at 11:15:02AM +0800, Alex Shi wrote:
> >> This is the main patch, replacing the per-node lru_lock with a per-memcg
> >> lruvec lock. It also folds the irqsave flags into the lruvec.
> > 
> > I have to say, I don't love the part where we fold the irqsave flags
> > into the lruvec.  I know it saves us an argument, but it opens up the
> > possibility of mismatched expectations.  eg we currently have:
> > 
> > static void __split_huge_page(struct page *page, struct list_head *list,
> > 			struct lruvec *lruvec, pgoff_t end)
> > {
> > ...
> > 	spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
> > 
> > so if we introduce a new caller, we have to be certain that this caller
> > is also using lock_page_lruvec_irqsave() and not lock_page_lruvec_irq().
> > I can't think of a way to make the compiler enforce that, and if we don't,
> > then we can get some odd crashes with interrupts being unexpectedly
> > enabled or disabled, depending on how ->irqflags was used last.
> > 
> > So it makes the code more subtle.  And that's not a good thing.
> 
> Hi Matthew,
> 
> Thanks for comments!
> 
> Here, the irqflags is bound to and belongs with the lruvec; merging them together lets us treat them as a whole, and thus removes an unnecessary extra parameter.

It's not bound to the lruvec, though.  Call chain A uses it and call chain
B doesn't.  If it was always used by every call chain, I'd see your point,
but we have call chains which don't use it, and so it adds complexity.

> As for your concern about a 'new' caller: since __split_huge_page is a static helper here, nothing else is disturbed.

Even though it's static, there may be other callers within the same file.
Or somebody may decide to make it non-static in the future.  I think it's
actually clearer to keep the irqflags as a separate parameter.

> >> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
> >> +						struct pglist_data *pgdat)
> >> +{
> >> +	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
> >> +
> >> +	spin_lock_irq(&lruvec->lru_lock);
> >> +
> >> +	return lruvec;
> >> +}
> > 
> > ...
> > 
> >> +static struct lruvec *lock_page_lru(struct page *page, int *isolated)
> >>  {
> >>  	pg_data_t *pgdat = page_pgdat(page);
> >> +	struct lruvec *lruvec = lock_page_lruvec_irq(page, pgdat);
> >>  
> >> -	spin_lock_irq(&pgdat->lru_lock);
> >>  	if (PageLRU(page)) {
> >> -		struct lruvec *lruvec;
> >>  
> >> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> >>  		ClearPageLRU(page);
> >>  		del_page_from_lru_list(page, lruvec, page_lru(page));
> >>  		*isolated = 1;
> >>  	} else
> >>  		*isolated = 0;
> >> +
> >> +	return lruvec;
> >>  }
> > 
> > But what if the page is !PageLRU?  What lruvec did we just lock?
> 
> Like the original pgdat->lru_lock, we need the lock to guard against PageLRU races. And it is the lruvec which the page should belong to.
> 
> 
> > According to the comments on mem_cgroup_page_lruvec(),
> > 
> >  * This function is only safe when following the LRU page isolation
> >  * and putback protocol: the LRU lock must be held, and the page must
> >  * either be PageLRU() or the caller must have isolated/allocated it.
> > 
> > and now it's being called in order to find out which LRU lock to take.
> > So this comment needs to be updated, if it's wrong, or this patch has
> > a race.
> 
> 
> Yes, the function comment is a bit misleading with the new patch. How about the following change:
> 
> - * This function is only safe when following the LRU page isolation
> - * and putback protocol: the LRU lock must be held, and the page must
> - * either be PageLRU() or the caller must have isolated/allocated it.
> + * The caller needs to guarantee the page's mem_cgroup is undisturbed while
> + * it is in use. That could be done by lock_page_memcg or lock_page_lruvec.

I don't understand how lock_page_lruvec makes this guarantee.  I'll look
at the code again and see if I can understand that.


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-16  7:03   ` Shakeel Butt
@ 2019-11-18 12:23     ` Alex Shi
  2019-11-18 12:31       ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2019-11-18 12:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi,
	Matthew Wilcox, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Chris Down, Thomas Gleixner, Vlastimil Babka,
	Qian Cai, Andrey Ryabinin, Kirill A. Shutemov,
	Jérôme Glisse, Andrea Arcangeli, David Rientjes,
	Aneesh Kumar K.V, swkhack, Potyra, Stefan, Mike Rapoport,
	Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao



在 2019/11/16 下午3:03, Shakeel Butt 写道:
>> +reget_lruvec:
>> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> +
>>                 /* If we already hold the lock, we can skip some rechecking */
>> -               if (!locked) {
>> -                       locked = compact_lock_irqsave(&pgdat->lru_lock,
>> -                                                               &flags, cc);
>> +               if (lruvec != locked_lruvec) {
>> +                       if (locked_lruvec) {
>> +                               spin_unlock_irqrestore(&locked_lruvec->lru_lock,
>> +                                               locked_lruvec->irqflags);
>> +                               locked_lruvec = NULL;
>> +                       }
> What guarantees the lifetime of lruvec? You should read the comment on
> mem_cgroup_page_lruvec(). Have you seen the patches Hugh had shared?
> Please look at the  trylock_page_lruvec().
> 

Thanks for the comments, Shakeel.

The lruvec lifetime is the same as the memcg's, which is allocated in mem_cgroup_alloc()->alloc_mem_cgroup_per_node_info().
I have read Hugh's patchset, though not every line. But what is your point here?

> BTW have you tested Hugh's patches?
> 

Yes, I have tried case-lru-file-readtwice on my machine w/o containers; it shows a bit more regression.

Thanks
Alex


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-18 12:23     ` Alex Shi
@ 2019-11-18 12:31       ` Matthew Wilcox
  2019-11-19 10:08         ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2019-11-18 12:31 UTC (permalink / raw)
  To: Alex Shi
  Cc: Shakeel Butt, Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman,
	Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
	Yang Shi, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Chris Down, Thomas Gleixner, Vlastimil Babka,
	Qian Cai, Andrey Ryabinin, Kirill A. Shutemov,
	Jérôme Glisse, Andrea Arcangeli, David Rientjes,
	Aneesh Kumar K.V, swkhack, Potyra, Stefan, Mike Rapoport,
	Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

On Mon, Nov 18, 2019 at 08:23:12PM +0800, Alex Shi wrote:
> 在 2019/11/16 下午3:03, Shakeel Butt 写道:
> >> +reget_lruvec:
> >> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> >> +
> >>                 /* If we already hold the lock, we can skip some rechecking */
> >> -               if (!locked) {
> >> -                       locked = compact_lock_irqsave(&pgdat->lru_lock,
> >> -                                                               &flags, cc);
> >> +               if (lruvec != locked_lruvec) {
> >> +                       if (locked_lruvec) {
> >> +                               spin_unlock_irqrestore(&locked_lruvec->lru_lock,
> >> +                                               locked_lruvec->irqflags);
> >> +                               locked_lruvec = NULL;
> >> +                       }
> > What guarantees the lifetime of lruvec? You should read the comment on
> > mem_cgroup_page_lruvec(). Have you seen the patches Hugh had shared?
> > Please look at the  trylock_page_lruvec().
> > 
> 
> Thanks for comments, Shakeel.
> 
> The lruvec lifetime is the same as the memcg's, which is allocated in mem_cgroup_alloc()->alloc_mem_cgroup_per_node_info().
> I have read Hugh's patchset, though not every line. But what is your point here?

I believe Shakeel's point is that here:

struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
{
...
        memcg = page->mem_cgroup;

there is nothing pinning the memcg, and it could be freed before
dereferencing memcg->nodeinfo in mem_cgroup_page_nodeinfo().


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-18 12:14       ` Matthew Wilcox
@ 2019-11-18 12:31         ` Alex Shi
  2019-11-18 12:34           ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Shi @ 2019-11-18 12:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Shakeel Butt,
	Chris Down, Thomas Gleixner, Vlastimil Babka, Qian Cai,
	Andrey Ryabinin, Kirill A. Shutemov, Jérôme Glisse,
	Andrea Arcangeli, David Rientjes, Aneesh Kumar K.V, swkhack,
	Potyra, Stefan, Mike Rapoport, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan,
	Nikolay Borisov, Ira Weiny, Kirill Tkhai, Yafang Shao



在 2019/11/18 下午8:14, Matthew Wilcox 写道:
>> Hi Matthew,
>>
>> Thanks for comments!
>>
>> Here, the irqflags is bound to and belongs with the lruvec; merging them together lets us treat them as a whole, and thus removes an unnecessary extra parameter.
> It's not bound to the lruvec, though.  Call chain A uses it and call chain
> B doesn't.  If it was always used by every call chain, I'd see your point,
> but we have call chains which don't use it, and so it adds complexity.

Where is the call chain B, please?

> 
>> As for your concern about a 'new' caller: since __split_huge_page is a static helper here, nothing else is disturbed.
> Even though it's static, there may be other callers within the same file.
> Or somebody may decide to make it non-static in the future.  I think it's
> actually clearer to keep the irqflags as a separate parameter.
> 

But no one else is using this function now, and no one gets disturbed, right? It makes no sense to consider a mere 'possibility' an issue.



* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-18 12:31         ` Alex Shi
@ 2019-11-18 12:34           ` Matthew Wilcox
  2019-11-19 10:14             ` Alex Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2019-11-18 12:34 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Shakeel Butt,
	Chris Down, Thomas Gleixner, Vlastimil Babka, Qian Cai,
	Andrey Ryabinin, Kirill A. Shutemov, Jérôme Glisse,
	Andrea Arcangeli, David Rientjes, Aneesh Kumar K.V, swkhack,
	Potyra, Stefan, Mike Rapoport, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan,
	Nikolay Borisov, Ira Weiny, Kirill Tkhai, Yafang Shao

On Mon, Nov 18, 2019 at 08:31:50PM +0800, Alex Shi wrote:
> 
> 
> 在 2019/11/18 下午8:14, Matthew Wilcox 写道:
> >> Hi Matthew,
> >>
> >> Thanks for comments!
> >>
> >> Here, the irqflags is bound to and belongs with the lruvec; merging them together lets us treat them as a whole, and thus removes an unnecessary extra parameter.
> > It's not bound to the lruvec, though.  Call chain A uses it and call chain
> > B doesn't.  If it was always used by every call chain, I'd see your point,
> > but we have call chains which don't use it, and so it adds complexity.
> 
> Where is the call chain B, please?

Every call chain that uses lock_page_lruvec_irq().

> >> As for your concern about a 'new' caller: since __split_huge_page is a static helper here, nothing else is disturbed.
> > Even though it's static, there may be other callers within the same file.
> > Or somebody may decide to make it non-static in the future.  I think it's
> > actually clearer to keep the irqflags as a separate parameter.
> > 
> 
> But no one else is using this function now, and no one gets disturbed, right? It makes no sense to consider a mere 'possibility' an issue.

It is not nonsense to consider the complexity of the mm!  Your patch makes
it harder to understand unnecessarily.  Please be considerate of the other
programmers who must build on what you have created.



* Re: [PATCH v3 1/7] mm/lru: add per lruvec lock for memcg
  2019-11-18 12:08       ` Matthew Wilcox
@ 2019-11-18 12:37         ` Alex Shi
  2019-11-19 10:05         ` Alex Shi
  1 sibling, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-18 12:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Shakeel Butt, Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman,
	Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
	Yang Shi, Vlastimil Babka, Dan Williams, Michal Hocko, Wei Yang,
	Johannes Weiner, Arun KS



在 2019/11/18 下午8:08, Matthew Wilcox 写道:
>>> Merge this patch with actual usage. No need to have a separate patch.
>> Thanks for the comment, Shakeel!
>>
>> Yes, but considering how huge and un-splittable the 3rd patch (the actual
>> replacement) is, I'd rather pull something out of it to try to make the patches a bit more readable. Do you think so?
> This method of splitting the patches doesn't help with the reviewability of
> the patch series.

Hi Matthew,

Thanks for the comments!
I will fold them into the 3rd patch, as you all insist on this. :)

Thanks
Alex


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-16  3:15 ` [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
  2019-11-16  4:38   ` Matthew Wilcox
  2019-11-16  7:03   ` Shakeel Butt
@ 2019-11-18 16:11   ` Johannes Weiner
  2019-11-19 10:04     ` Alex Shi
  2019-11-19  2:10   ` Daniel Jordan
  3 siblings, 1 reply; 30+ messages in thread
From: Johannes Weiner @ 2019-11-18 16:11 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, Michal Hocko,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

On Sat, Nov 16, 2019 at 11:15:02AM +0800, Alex Shi wrote:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 62470325f9bc..cf274739e619 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1246,6 +1246,42 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>  	return lruvec;
>  }
>  
> +struct lruvec *lock_page_lruvec_irq(struct page *page,
> +					struct pglist_data *pgdat)
> +{
> +	struct lruvec *lruvec;
> +
> +again:
> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +	spin_lock_irq(&lruvec->lru_lock);

This isn't safe. Nothing prevents the page from being moved to a
different memcg in between these two operations, and the lruvec having
been freed by the time you try to acquire the spinlock.

You need to use the rcu lock to dereference page->mem_cgroup and its
lruvec when coming from the page like this.

You also need to use page_memcg_rcu() to insert the appropriate
lockless access barriers, which mem_cgroup_page_lruvec() does not do
since it's designed for use with pgdat->lru_lock.


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-16  3:15 ` [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
                     ` (2 preceding siblings ...)
  2019-11-18 16:11   ` Johannes Weiner
@ 2019-11-19  2:10   ` Daniel Jordan
  2019-11-19 10:10     ` Alex Shi
  3 siblings, 1 reply; 30+ messages in thread
From: Daniel Jordan @ 2019-11-19  2:10 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Shakeel Butt,
	Chris Down, Thomas Gleixner, Vlastimil Babka, Qian Cai,
	Andrey Ryabinin, Kirill A. Shutemov, Jérôme Glisse,
	Andrea Arcangeli, David Rientjes, Aneesh Kumar K.V, swkhack,
	Potyra, Stefan, Mike Rapoport, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan,
	Nikolay Borisov, Ira Weiny, Kirill Tkhai, Yafang Shao

On Sat, Nov 16, 2019 at 11:15:02AM +0800, Alex Shi wrote:
> @@ -192,26 +190,17 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>  	void *arg)
>  {
>  	int i;
> -	struct pglist_data *pgdat = NULL;
> -	struct lruvec *lruvec;
> -	unsigned long flags = 0;
> +	struct lruvec *lruvec = NULL;
>  
>  	for (i = 0; i < pagevec_count(pvec); i++) {
>  		struct page *page = pvec->pages[i];
> -		struct pglist_data *pagepgdat = page_pgdat(page);
>  
> -		if (pagepgdat != pgdat) {
> -			if (pgdat)
> -				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -			pgdat = pagepgdat;
> -			spin_lock_irqsave(&pgdat->lru_lock, flags);
> -		}
> +		lruvec = lock_page_lruvec_irqsave(page, page_pgdat(page));
>  
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  		(*move_fn)(page, lruvec, arg);
> +		spin_unlock_irqrestore(&lruvec->lru_lock, lruvec->irqflags);
>  	}
> -	if (pgdat)
> -		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +
>  	release_pages(pvec->pages, pvec->nr);
>  	pagevec_reinit(pvec);
>  }

Why can't you keep the locking pattern where we only drop and reacquire if the
lruvec changes?  It'd save a lot of locks and unlocks if most pages were from
the same memcg and node, or the memory controller were unused.


Thanks for running the -readtwice benchmark, by the way.


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-18 16:11   ` Johannes Weiner
@ 2019-11-19 10:04     ` Alex Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-19 10:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, Michal Hocko,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao



在 2019/11/19 上午12:11, Johannes Weiner 写道:
> On Sat, Nov 16, 2019 at 11:15:02AM +0800, Alex Shi wrote:
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 62470325f9bc..cf274739e619 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1246,6 +1246,42 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>>  	return lruvec;
>>  }
>>  
>> +struct lruvec *lock_page_lruvec_irq(struct page *page,
>> +					struct pglist_data *pgdat)
>> +{
>> +	struct lruvec *lruvec;
>> +
>> +again:
>> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> +	spin_lock_irq(&lruvec->lru_lock);
> 
> This isn't safe. Nothing prevents the page from being moved to a
> different memcg in between these two operations, and the lruvec having
> been freed by the time you try to acquire the spinlock.
> 
> You need to use the rcu lock to dereference page->mem_cgroup and its
> lruvec when coming from the page like this.

Hi Johannes,

Yes, you are right. Thank a lot to point out this!
Could we use the rcu lock to guard the lruvec until lruvec->lru_lock is taken?

+struct lruvec *lock_page_lruvec_irq(struct page *page,
+                                       struct pglist_data *pgdat)
+{
+       struct lruvec *lruvec;
+
+again:
+       rcu_read_lock();
+       lruvec = mem_cgroup_page_lruvec(page, pgdat);
+       spin_lock_irq(&lruvec->lru_lock);
+       rcu_read_unlock();
+
+       /* lruvec may changed in commit_charge() */
+       if (lruvec != mem_cgroup_page_lruvec(page, pgdat)) {
+               spin_unlock_irq(&lruvec->lru_lock);
+               goto again;
+       }
+
+       return lruvec;
+}
> 
> You also need to use page_memcg_rcu() to insert the appropriate
> lockless access barriers, which mem_cgroup_page_lruvec() does not do
> since it's designed for use with pgdat->lru_lock.
> 

Since page_memcg_rcu cannot know whether it is under a spin_lock, I guess the following enhancement is fine:
@@ -1225,7 +1224,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
                goto out;
        }

-       memcg = page->mem_cgroup;
+       memcg = READ_ONCE(page->mem_cgroup);
        /*


Many thanks!
Alex


* Re: [PATCH v3 1/7] mm/lru: add per lruvec lock for memcg
  2019-11-18 12:08       ` Matthew Wilcox
  2019-11-18 12:37         ` Alex Shi
@ 2019-11-19 10:05         ` Alex Shi
  1 sibling, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-19 10:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Shakeel Butt, Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman,
	Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
	Yang Shi, Vlastimil Babka, Dan Williams, Michal Hocko, Wei Yang,
	Johannes Weiner, Arun KS



在 2019/11/18 下午8:08, Matthew Wilcox 写道:
>> Thanks for comment, Shakeel!
>>
>> Yes, but considering how huge and un-splittable the 3rd patch (the actual
>> replacement) is, I'd rather pull something out of it to try to make the patches a bit more readable. Do you think so?
> This method of splitting the patches doesn't help with the reviewability of
> the patch series.

Thanks for the comments! I will reorganize my patchset for better understanding.

Alex


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-18 12:31       ` Matthew Wilcox
@ 2019-11-19 10:08         ` Alex Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-19 10:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Shakeel Butt, Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman,
	Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
	Yang Shi, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Chris Down, Thomas Gleixner, Vlastimil Babka,
	Qian Cai, Andrey Ryabinin, Kirill A. Shutemov,
	Jérôme Glisse, Andrea Arcangeli, David Rientjes,
	Aneesh Kumar K.V, swkhack, Potyra, Stefan, Mike Rapoport,
	Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao



在 2019/11/18 下午8:31, Matthew Wilcox 写道:
>> Thanks for comments, Shakeel.
>>
>> The lruvec lifetime is the same as the memcg's, which is allocated in mem_cgroup_alloc()->alloc_mem_cgroup_per_node_info().
>> I have read Hugh's patchset, though not every line. But what is your point here?
> I believe Shakeel's point is that here:
> 
> struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
> {
> ...
>         memcg = page->mem_cgroup;
> 
> there is nothing pinning the memcg, and it could be freed before
> dereferencing memcg->nodeinfo in mem_cgroup_page_nodeinfo().

That's right! I will send the fix patches for review. 
Thanks a lot!


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-19  2:10   ` Daniel Jordan
@ 2019-11-19 10:10     ` Alex Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-19 10:10 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, yang.shi, willy, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao



在 2019/11/19 上午10:10, Daniel Jordan 写道:
>> -	if (pgdat)
>> -		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>> +
>>  	release_pages(pvec->pages, pvec->nr);
>>  	pagevec_reinit(pvec);
>>  }
> Why can't you keep the locking pattern where we only drop and reacquire if the
> lruvec changes?  It'd save a lot of locks and unlocks if most pages were from
> the same memcg and node, or the memory controller were unused.
> 

Good catch! This issue will be fixed in the next patch, which introduces the relock_ function.

Thanks!


* Re: [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock
  2019-11-18 12:34           ` Matthew Wilcox
@ 2019-11-19 10:14             ` Alex Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Shi @ 2019-11-19 10:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Shakeel Butt,
	Chris Down, Thomas Gleixner, Vlastimil Babka, Qian Cai,
	Andrey Ryabinin, Kirill A. Shutemov, Jérôme Glisse,
	Andrea Arcangeli, David Rientjes, Aneesh Kumar K.V, swkhack,
	Potyra, Stefan, Mike Rapoport, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan,
	Nikolay Borisov, Ira Weiny, Kirill Tkhai, Yafang Shao



> 
>>>> As for your concern about a 'new' caller: since __split_huge_page is a static helper here, nothing else is disturbed.
>>> Even though it's static, there may be other callers within the same file.
>>> Or somebody may decide to make it non-static in the future.  I think it's
>>> actually clearer to keep the irqflags as a separate parameter.
>>>
>>
>> But no one else is using this function now, and no one gets disturbed, right? It makes no sense to consider a mere 'possibility' an issue.
> 
> It is not nonsense to consider the complexity of the mm!  Your patch makes
> it harder to understand unnecessarily.  Please be considerate of the other
> programmers who must build on what you have created.
> 

I believe the 'flags' parameter used in __split_huge_page is just for irqsave/restore of the lru_lock, so moving it into the lruvec should reduce the parameter passing between caller and callee.
But on the other hand, the flags optimization isn't closely related to the lru_lock change, and is better split out.
So thanks for your comments; I will remove this part from this patchset.

Thanks!
Alex


* Re: [PATCH v3 2/7] mm/lruvec: add irqsave flags into lruvec struct
  2019-11-16  3:15 ` [PATCH v3 2/7] mm/lruvec: add irqsave flags into lruvec struct Alex Shi
  2019-11-16  6:31   ` Shakeel Butt
@ 2019-11-22  6:46   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2019-11-22  6:46 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, Vlastimil Babka,
	Dan Williams, Michal Hocko, Wei Yang, Johannes Weiner, Arun KS,
	Rong Chen

On Sat, Nov 16, 2019 at 11:15:01AM +0800, Alex Shi wrote:
> We need an irqflags variable to save state when doing an irqsave action;
> declaring it here would make the code clearer/cleaner.

This patch is wrong on multiple levels.  Adding a field without the
users is completely pointless.  And as a general pattern we never
add the irqsave flags to structures.  They are an intimate part of the
calling conventions, and storing them in a structure will lead to
subtle bugs.


end of thread, other threads:[~2019-11-22  6:46 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-16  3:14 [PATCH v3 0/8] per lruvec lru_lock for memcg Alex Shi
2019-11-16  3:15 ` [PATCH v3 1/7] mm/lru: add per lruvec lock " Alex Shi
2019-11-16  6:28   ` Shakeel Butt
2019-11-18  2:44     ` Alex Shi
2019-11-18 12:08       ` Matthew Wilcox
2019-11-18 12:37         ` Alex Shi
2019-11-19 10:05         ` Alex Shi
2019-11-16  3:15 ` [PATCH v3 2/7] mm/lruvec: add irqsave flags into lruvec struct Alex Shi
2019-11-16  6:31   ` Shakeel Butt
2019-11-18  2:52     ` Alex Shi
2019-11-22  6:46   ` Christoph Hellwig
2019-11-16  3:15 ` [PATCH v3 3/7] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
2019-11-16  4:38   ` Matthew Wilcox
2019-11-18 11:55     ` Alex Shi
2019-11-18 12:14       ` Matthew Wilcox
2019-11-18 12:31         ` Alex Shi
2019-11-18 12:34           ` Matthew Wilcox
2019-11-19 10:14             ` Alex Shi
2019-11-16  7:03   ` Shakeel Butt
2019-11-18 12:23     ` Alex Shi
2019-11-18 12:31       ` Matthew Wilcox
2019-11-19 10:08         ` Alex Shi
2019-11-18 16:11   ` Johannes Weiner
2019-11-19 10:04     ` Alex Shi
2019-11-19  2:10   ` Daniel Jordan
2019-11-19 10:10     ` Alex Shi
2019-11-16  3:15 ` [PATCH v3 4/7] mm/lru: only change the lru_lock iff page's lruvec is different Alex Shi
2019-11-16  3:15 ` [PATCH v3 5/7] mm/pgdat: remove pgdat lru_lock Alex Shi
2019-11-16  3:15 ` [PATCH v3 6/7] mm/lru: likely enhancement Alex Shi
2019-11-16  3:15 ` [PATCH v3 7/7] mm/lru: revise the comments of lru_lock Alex Shi
