* [PATCH v8 00/10] per lruvec lru_lock for memcg
From: Alex Shi @ 2020-01-16  3:04 UTC
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes

Hi all,

This patchset moves lru_lock into lruvec, giving each lruvec its own
lru_lock and thus each memcg a lru_lock per node. So on a large
multi-node machine, memcgs no longer all contend on the single per-node
pgdat->lru_lock; each of them can go fast under its own lru_lock.

We introduce the function lock_page_lruvec, which locks the page's
memcg and then the memcg's lruvec->lru_lock (thanks to Johannes Weiner,
Hugh Dickins and Konstantin Khlebnikov for the suggestions/reminders),
to replace the old pgdat->lru_lock.
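
For reference, the irq variants added in patch 03 look like this (the
series also adds _irqsave/_irqrestore variants and a !CONFIG_MEMCG
fallback that maps to pgdat->__lruvec):

  struct lruvec *lock_page_lruvec_irq(struct page *page)
  {
	struct lruvec *lruvec;
	struct mem_cgroup *memcg;

	/* pin page's memcg, then take that memcg's per-node lru_lock */
	memcg = lock_page_memcg(page);
	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
	spin_lock_irq(&lruvec->lru_lock);

	return lruvec;
  }

  void unlock_page_lruvec_irq(struct lruvec *lruvec)
  {
	spin_unlock_irq(&lruvec->lru_lock);
	__unlock_page_memcg(lruvec_memcg(lruvec));
  }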

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
  https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this patchset, readtwice performance increased by about 80% with
containers, and there is no performance drop without containers.

Another way to guard move_account is by lru_lock instead of move_lock.
Consider the memcg move task path:
   mem_cgroup_move_task:
     mem_cgroup_move_charge:
	lru_add_drain_all();
	atomic_inc(&mc.from->moving_account); //ask for memcg's move_lock
	synchronize_rcu();
	walk_page_range: do charge_walk_ops(mem_cgroup_move_charge_pte_range):
	   isolate_lru_page();
	   mem_cgroup_move_account(page,)
		spin_lock(&from->move_lock) 
		page->mem_cgroup = to;
		spin_unlock(&from->move_lock) 
	   putback_lru_page(page)

Guarding 'page->mem_cgroup = to' with to_vec->lru_lock has an effect
similar to move_lock's, so performance-wise the two solutions are
equivalent.
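
For illustration only (nothing in this series actually changes the move
path this way), guarding that assignment with the lruvec lock instead
of move_lock would look like:

	   mem_cgroup_move_account(page,)
		spin_lock(&to_vec->lru_lock)	// instead of from->move_lock
		page->mem_cgroup = to;
		spin_unlock(&to_vec->lru_lock)
	   putback_lru_page(page)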

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up
the same idea 8 years ago.

Thanks for all the comments from Hugh Dickins, Konstantin Khlebnikov,
Daniel Jordan, Johannes Weiner, Mel Gorman, Shakeel Butt, Rong Chen,
Fengguang Wu, Yun Wang etc., and for the testing support from Intel 0day!

v8,
  a, redo lock_page_lru cleanup as Konstantin Khlebnikov suggested.
  b, fix a bug in lruvec_memcg_debug, reported by Hugh Dickins

v7,
  a, rebase on v5.5-rc3, 
  b, move the lock_page_lru() cleanup before the lock replacement.

v6, 
  a, rebase on v5.5-rc2, and redo performance testing.
  b, pick up Johannes' review comments and a lock_page_lru cleanup.

v5,
  a, lock the page's memcg according to Johannes Weiner's suggestion
  b, use macros for the non-memcg case, according to Matthew's suggestion.

v4: 
  a, fix the page->mem_cgroup dereferencing issue, thanks Johannes Weiner
  b, remove the irqsave flags changes, thanks Matthew Wilcox
  c, merge/split patches for better understanding and bisection purposes

v3: rebase on linux-next, and fold the relock fix patch into introducing patch

v2: bypass a performance regression bug and fix some function issues

v1: initial version; aim testing showed a 5% performance increase on a 16-thread box.


Alex Shi (9):
  mm/vmscan: remove unnecessary lruvec adding
  mm/memcg: fold lock_page_lru into commit_charge
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/lru: introduce the relock_page_lruvec function
  mm/mlock: optimize munlock_pagevec by relocking
  mm/swap: only change the lru_lock iff page's lruvec is different
  mm/pgdat: remove pgdat lru_lock
  mm/lru: add debug checking for page memcg moving
  mm/memcg: add debug checking in lock_page_memcg

Hugh Dickins (1):
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +--
 Documentation/admin-guide/cgroup-v1/memory.rst     |   6 +-
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 ++--
 include/linux/memcontrol.h                         |  68 ++++++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmzone.h                             |   5 +-
 mm/compaction.c                                    |  57 ++++++----
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  18 ++--
 mm/memcontrol.c                                    | 115 ++++++++++++++-------
 mm/mlock.c                                         |  28 ++---
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   7 +-
 mm/rmap.c                                          |   2 +-
 mm/swap.c                                          |  75 ++++++--------
 mm/vmscan.c                                        | 115 ++++++++++++---------
 18 files changed, 326 insertions(+), 217 deletions(-)

-- 
1.8.3.1



* [PATCH v8 01/10] mm/vmscan: remove unnecessary lruvec adding
From: Alex Shi @ 2020-01-16  3:05 UTC
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes
  Cc: yun.wang

We don't have to add a freeable page to the LRU only to remove it
again. This change saves a couple of actions and makes the page moving
clearer.

The SetPageLRU needs to be kept here for list integrity.
Otherwise:
 #0 move_pages_to_lru              #1 release_pages
                                   if (put_page_testzero())
 if !put_page_testzero
                                     !PageLRU //skip lru_lock
                                       list_add(&page->lru,);
   list_add(&page->lru,) //corrupt

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: yun.wang@linux.alibaba.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmscan.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 572fb17c6273..a270d32bdb94 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1852,26 +1852,29 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
+		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			list_del(&page->lru);
 			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
 			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
+		/*
+		 * The SetPageLRU needs to be kept here for list integrity.
+		 * Otherwise:
+		 *   #0 move_pages_to_lru              #1 release_pages
+		 *   				       if (put_page_testzero())
+		 *   if !put_page_testzero
+		 *				         !PageLRU //skip lru_lock
+		 *                                         list_add(&page->lru,);
+		 *     list_add(&page->lru,) //corrupt
+		 */
 		SetPageLRU(page);
-		lru = page_lru(page);
-
-		nr_pages = hpage_nr_pages(page);
-		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
-		list_move(&page->lru, &lruvec->lists[lru]);
 
-		if (put_page_testzero(page)) {
+		if (unlikely(put_page_testzero(page))) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&pgdat->lru_lock);
@@ -1879,9 +1882,16 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
-		} else {
-			nr_moved += nr_pages;
+			continue;
 		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lru = page_lru(page);
+		nr_pages = hpage_nr_pages(page);
+
+		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
+		list_add(&page->lru, &lruvec->lists[lru]);
+		nr_moved += nr_pages;
 	}
 
 	/*
-- 
1.8.3.1



* [PATCH v8 02/10] mm/memcg: fold lock_page_lru into commit_charge
From: Alex Shi @ 2020-01-16  3:05 UTC
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes
  Cc: Michal Hocko, Vladimir Davydov

As Konstantin Khlebnikov mentioned:

	Also I don't like these functions:
	- called lock/unlock but actually also isolates
	- used just once
	- pgdat evaluated twice

Clean up and fold these functions into commit_charge. This also
reduces the lock hold time when lrucare && !PageLRU.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 57 ++++++++++++++++++++-------------------------------------
 1 file changed, 20 insertions(+), 37 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74cfd4d..d92538a9185c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2570,41 +2570,11 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	css_put_many(&memcg->css, nr_pages);
 }
 
-static void lock_page_lru(struct page *page, int *isolated)
-{
-	pg_data_t *pgdat = page_pgdat(page);
-
-	spin_lock_irq(&pgdat->lru_lock);
-	if (PageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		ClearPageLRU(page);
-		del_page_from_lru_list(page, lruvec, page_lru(page));
-		*isolated = 1;
-	} else
-		*isolated = 0;
-}
-
-static void unlock_page_lru(struct page *page, int isolated)
-{
-	pg_data_t *pgdat = page_pgdat(page);
-
-	if (isolated) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		VM_BUG_ON_PAGE(PageLRU(page), page);
-		SetPageLRU(page);
-		add_page_to_lru_list(page, lruvec, page_lru(page));
-	}
-	spin_unlock_irq(&pgdat->lru_lock);
-}
-
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 			  bool lrucare)
 {
-	int isolated;
+	struct lruvec *lruvec = NULL;
+	pg_data_t *pgdat;
 
 	VM_BUG_ON_PAGE(page->mem_cgroup, page);
 
@@ -2612,9 +2582,17 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
 	 * may already be on some other mem_cgroup's LRU.  Take care of it.
 	 */
-	if (lrucare)
-		lock_page_lru(page, &isolated);
-
+	if (lrucare) {
+		pgdat = page_pgdat(page);
+		spin_lock_irq(&pgdat->lru_lock);
+
+		if (PageLRU(page)) {
+			lruvec = mem_cgroup_page_lruvec(page, pgdat);
+			ClearPageLRU(page);
+			del_page_from_lru_list(page, lruvec, page_lru(page));
+		} else
+			spin_unlock_irq(&pgdat->lru_lock);
+	}
 	/*
 	 * Nobody should be changing or seriously looking at
 	 * page->mem_cgroup at this point:
@@ -2631,8 +2609,13 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 	 */
 	page->mem_cgroup = memcg;
 
-	if (lrucare)
-		unlock_page_lru(page, isolated);
+	if (lrucare && lruvec) {
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		VM_BUG_ON_PAGE(PageLRU(page), page);
+		SetPageLRU(page);
+		add_page_to_lru_list(page, lruvec, page_lru(page));
+		spin_unlock_irq(&pgdat->lru_lock);
+	}
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-- 
1.8.3.1



* [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
From: Alex Shi @ 2020-01-16  3:05 UTC
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes
  Cc: Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

This patchset moves lru_lock into lruvec, giving each lruvec its own
lru_lock and thus each memcg a lru_lock per node. So on a large
multi-node machine, memcgs no longer all contend on the single per-node
pgdat->lru_lock; each of them can go fast under its own lru_lock.

This is the main patch: it replaces the per-node lru_lock with the
per-memcg lruvec lock, and adjusts the lock_page_lru code that patch 02
folded into commit_charge.

We introduce the function lock_page_lruvec, which locks the page's
memcg and then the memcg's lruvec->lru_lock (thanks to Johannes Weiner,
Hugh Dickins and Konstantin Khlebnikov for the suggestions/reminders).

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
  https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this and the later patches, readtwice performance increases by
about 80% with containers.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: swkhack <swkhack@gmail.com>
Cc: "Potyra, Stefan" <Stefan.Potyra@elektrobit.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h | 27 ++++++++++++++++
 include/linux/mmzone.h     |  2 ++
 mm/compaction.c            | 55 ++++++++++++++++++++-----------
 mm/huge_memory.c           | 18 ++++-------
 mm/memcontrol.c            | 61 +++++++++++++++++++++++++++-------
 mm/mlock.c                 | 32 +++++++++---------
 mm/mmzone.c                |  1 +
 mm/page_idle.c             |  7 ++--
 mm/swap.c                  | 75 +++++++++++++++++-------------------------
 mm/vmscan.c                | 81 +++++++++++++++++++++++++---------------------
 10 files changed, 215 insertions(+), 144 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7a0a1a5c8d5..8389b9b927ef 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -417,6 +417,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 }
 
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
+struct lruvec *lock_page_lruvec_irq(struct page *);
+struct lruvec *lock_page_lruvec_irqsave(struct page *, unsigned long*);
+void unlock_page_lruvec_irq(struct lruvec *);
+void unlock_page_lruvec_irqrestore(struct lruvec *, unsigned long);
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
@@ -900,6 +904,29 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 {
 	return &pgdat->__lruvec;
 }
+#define lock_page_lruvec_irq(page)			\
+({							\
+	struct pglist_data *pgdat = page_pgdat(page);	\
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);	\
+	&pgdat->__lruvec;				\
+})
+
+#define lock_page_lruvec_irqsave(page, flagsp)			\
+({								\
+	struct pglist_data *pgdat = page_pgdat(page);		\
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);	\
+	&pgdat->__lruvec;					\
+})
+
+#define unlock_page_lruvec_irq(lruvec)			\
+({							\
+	spin_unlock_irq(&lruvec->lru_lock);		\
+})
+
+#define unlock_page_lruvec_irqrestore(lruvec, flags)		\
+({								\
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);	\
+})
 
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 89d8ff06c9ce..c5455675acf2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -311,6 +311,8 @@ struct lruvec {
 	unsigned long			refaults;
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 672d3c78c6ab..8c0a2da217d8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -786,7 +786,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked_lruvec = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -846,11 +846,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked_lruvec) {
+				unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+				locked_lruvec = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -919,10 +928,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
-				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+				if (locked_lruvec) {
+					unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+					locked_lruvec = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -948,10 +956,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
 			goto isolate_fail;
 
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked_lruvec) {
+			struct mem_cgroup *memcg = lock_page_memcg(page);
+
+			if (locked_lruvec) {
+				unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+				locked_lruvec = NULL;
+			}
+			/* reget lruvec with a locked memcg */
+			lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked_lruvec = lruvec;
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -975,7 +993,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			}
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, isolate_mode) != 0)
@@ -1016,9 +1033,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * page anyway.
 		 */
 		if (nr_isolated) {
-			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+			if (locked_lruvec) {
+				unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+				locked_lruvec = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1043,8 +1060,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		low_pfn = end_pfn;
 
 isolate_abort:
-	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (locked_lruvec)
+		unlock_page_lruvec_irqrestore(locked_lruvec, flags);
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41a0fbddc96b..160c845290cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2495,17 +2495,13 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+			struct lruvec *lruvec, pgoff_t end, unsigned long flags)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
-	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2554,7 +2550,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_unlock(&head->mapping->i_pages);
 	}
 
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	remap_page(head);
 
@@ -2693,13 +2689,13 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
+	struct lruvec *lruvec;
 	int count, mapcount, extra_pins, ret;
 	bool mlocked;
-	unsigned long flags;
+	unsigned long uninitialized_var(flags);
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
@@ -2766,7 +2762,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		lru_add_drain();
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
+	lruvec = lock_page_lruvec_irqsave(head, &flags);
 
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
@@ -2797,7 +2793,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		}
 
 		spin_unlock(&ds_queue->split_queue_lock);
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, lruvec, end, flags);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
 
@@ -2816,7 +2812,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 		remap_page(head);
 		ret = -EBUSY;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d92538a9185c..00fef8ddbd08 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1217,7 +1217,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 		goto out;
 	}
 
-	memcg = page->mem_cgroup;
+	memcg = READ_ONCE(page->mem_cgroup);
 	/*
 	 * Swapcache readahead pages are added to the LRU - and
 	 * possibly migrated - before they are charged.
@@ -1238,6 +1238,42 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct mem_cgroup *memcg;
+
+	memcg = lock_page_memcg(page);
+	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+	spin_lock_irq(&lruvec->lru_lock);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct mem_cgroup *memcg;
+
+	memcg = lock_page_memcg(page);
+	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+
+	return lruvec;
+}
+
+void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+	__unlock_page_memcg(lruvec_memcg(lruvec));
+}
+
+void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+	__unlock_page_memcg(lruvec_memcg(lruvec));
+}
+
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
@@ -2574,7 +2610,6 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 			  bool lrucare)
 {
 	struct lruvec *lruvec = NULL;
-	pg_data_t *pgdat;
 
 	VM_BUG_ON_PAGE(page->mem_cgroup, page);
 
@@ -2583,16 +2618,16 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 	 * may already be on some other mem_cgroup's LRU.  Take care of it.
 	 */
 	if (lrucare) {
-		pgdat = page_pgdat(page);
-		spin_lock_irq(&pgdat->lru_lock);
-
-		if (PageLRU(page)) {
-			lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irq(page);
+		if (likely(PageLRU(page))) {
 			ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_lru(page));
-		} else
-			spin_unlock_irq(&pgdat->lru_lock);
+		} else {
+			unlock_page_lruvec_irq(lruvec);
+			lruvec = NULL;
+		}
 	}
+
 	/*
 	 * Nobody should be changing or seriously looking at
 	 * page->mem_cgroup at this point:
@@ -2610,11 +2645,13 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 	page->mem_cgroup = memcg;
 
 	if (lrucare && lruvec) {
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		unlock_page_lruvec_irq(lruvec);
+		lruvec = lock_page_lruvec_irq(page);
+
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	}
 }
 
@@ -2911,7 +2948,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * lruvec->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index a72c1eeded77..10d15f58b061 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page)
  * Isolate a page from LRU with optional get_page() pin.
  * Assumes lru_lock already held and page already pinned.
  */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+static bool __munlock_isolate_lru_page(struct page *page,
+			struct lruvec *lruvec, bool getpage)
 {
 	if (PageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
 		ClearPageLRU(page);
@@ -182,7 +180,7 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -194,7 +192,7 @@ unsigned int munlock_vma_page(struct page *page)
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 
 	if (!TestClearPageMlocked(page)) {
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -205,15 +203,15 @@ unsigned int munlock_vma_page(struct page *page)
 	nr_pages = hpage_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&pgdat->lru_lock);
+	if (__munlock_isolate_lru_page(page, lruvec, true)) {
+		unlock_page_lruvec_irq(lruvec);
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&pgdat->lru_lock);
+	unlock_page_lruvec_irq(lruvec);
 
 out:
 	return nr_pages - 1;
@@ -291,28 +289,29 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
 	int i;
 	int nr = pagevec_count(pvec);
-	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
+		lruvec = lock_page_lruvec_irq(page);
+
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, false))
+			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+				__mod_zone_page_state(zone, NR_MLOCK,  -1);
+				unlock_page_lruvec_irq(lruvec);
 				continue;
-			else
+			} else
 				__munlock_isolation_failed(page);
-		} else {
-			delta_munlocked++;
 		}
 
 		/*
@@ -323,9 +322,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 */
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
+		unlock_page_lruvec_irq(lruvec);
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..d2d868ca2bf7 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -31,7 +31,7 @@
 static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page;
-	pg_data_t *pgdat;
+	struct lruvec *lruvec;
 
 	if (!pfn_valid(pfn))
 		return NULL;
@@ -41,13 +41,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
 	    !get_page_unless_zero(page))
 		return NULL;
 
-	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
+	unlock_page_lruvec_irq(lruvec);
 	return page;
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..97e108be4f92 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -60,16 +60,14 @@
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
-		unsigned long flags;
+		unsigned long flags = 0;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -192,26 +190,18 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void *arg)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec, arg);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -324,12 +314,12 @@ static inline void activate_page_drain(int cpu)
 
 void activate_page(struct page *page)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
-	spin_unlock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
+	__activate_page(page, lruvec, NULL);
+	unlock_page_lruvec_irq(lruvec);
 }
 #endif
 
@@ -780,8 +770,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long uninitialized_var(flags);
 	unsigned int uninitialized_var(lock_batch);
 
@@ -791,21 +780,20 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		if (is_huge_zero_page(page))
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -822,27 +810,24 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-									flags);
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec, flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -854,8 +839,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -893,7 +878,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (!list)
 		SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a270d32bdb94..7e1cb41da1fb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1766,11 +1766,9 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&pgdat->lru_lock);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irq(page);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
 			get_page(page);
@@ -1778,7 +1776,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	}
 	return ret;
 }
@@ -1843,20 +1841,23 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
 	enum lru_list lru;
 
 	while (!list_empty(list)) {
+		struct mem_cgroup *memcg;
+		struct lruvec *plv;
+		bool relocked = false;
+
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1877,21 +1878,34 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		memcg = lock_page_memcg(page);
+		plv = mem_cgroup_lruvec(memcg, page_pgdat(page));
+		/* page's lruvec changed in memcg moving */
+		if (plv != lruvec) {
+			spin_unlock_irq(&lruvec->lru_lock);
+			spin_lock_irq(&plv->lru_lock);
+			relocked = true;
+		}
+
 		lru = page_lru(page);
 		nr_pages = hpage_nr_pages(page);
-
-		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
-		list_add(&page->lru, &lruvec->lists[lru]);
+		update_lru_size(plv, lru, page_zonenum(page), nr_pages);
+		list_add(&page->lru, &plv->lists[lru]);
 		nr_moved += nr_pages;
+
+		if (relocked) {
+			spin_unlock_irq(&plv->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
+		}
+		__unlock_page_memcg(memcg);
 	}
 
 	/*
@@ -1949,7 +1963,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1961,15 +1975,14 @@ static int current_may_throttle(void)
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
-	spin_unlock_irq(&pgdat->lru_lock);
-
+	spin_unlock_irq(&lruvec->lru_lock);
 	if (nr_taken == 0)
 		return 0;
 
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
@@ -1982,7 +1995,7 @@ static int current_may_throttle(void)
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_unref_page_list(&page_list);
@@ -2035,7 +2048,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2046,7 +2059,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2092,7 +2105,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -2110,7 +2123,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2259,7 +2272,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2];
 	u64 denominator = 0;	/* gcc */
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
 	unsigned long anon, file;
@@ -2337,7 +2349,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2358,7 +2370,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -4336,24 +4348,21 @@ int page_evictable(struct page *page)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 
 		pgscanned++;
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -4369,10 +4378,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		}
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1



* [PATCH v8 04/10] mm/lru: introduce the relock_page_lruvec function
From: Alex Shi @ 2020-01-16  3:05 UTC
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes
  Cc: Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Andrey Ryabinin, swkhack,
	Potyra, Stefan, Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan,
	Nikolay Borisov, Ira Weiny, Kirill Tkhai, Yafang Shao

While taking lruvec locks over a series of pages, the next page's
lruvec is often the same as the previous one's. In that case we can
save an unlock/lock pair, and only switch the lock iff the lruvec is
new.

The function is named relock_page_lruvec following Hugh Dickins' patch.
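
A typical caller (see the later mlock and swap patches) then keeps a
single running lruvec pointer across its loop, roughly:

	struct lruvec *lruvec = NULL;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* only unlocks/relocks when page's lruvec differs */
		lruvec = relock_page_lruvec_irq(page, lruvec);

		/* ... operate on page under lruvec->lru_lock ... */
	}
	if (lruvec)
		unlock_page_lruvec_irq(lruvec);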

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: swkhack <swkhack@gmail.com>
Cc: "Potyra, Stefan" <Stefan.Potyra@elektrobit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 36 ++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |  8 ++------
 2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8389b9b927ef..09e861df48e8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1299,6 +1299,42 @@ static inline void dec_lruvec_page_state(struct page *page,
 	mod_lruvec_page_state(page, idx, -1);
 }
 
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+		struct lruvec *locked_lruvec)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
+
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+	if (likely(locked_lruvec == lruvec))
+		return lruvec;
+
+	if (unlikely(locked_lruvec))
+		unlock_page_lruvec_irq(locked_lruvec);
+
+	return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+		struct lruvec *locked_lruvec, unsigned long *flags)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
+
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+	if (likely(locked_lruvec == lruvec))
+		return lruvec;
+
+	if (unlikely(locked_lruvec))
+		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+
+	return lock_page_lruvec_irqsave(page, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7e1cb41da1fb..ee20a64a7ccc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4355,14 +4355,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 
 		pgscanned++;
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
+
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
-- 
1.8.3.1



* [PATCH v8 05/10] mm/mlock: optimize munlock_pagevec by relocking
From: Alex Shi @ 2020-01-16  3:05 UTC
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes

While walking a pagevec and taking lruvec locks, the next page's lruvec
is often the same as the previous one's. In that case we can save an
unlock/lock pair, and only switch the lock iff the lruvec is new.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/mlock.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 10d15f58b061..050f999eadb1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -289,6 +289,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
 	int i;
 	int nr = pagevec_count(pvec);
+	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
 	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
@@ -299,20 +300,19 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
-		lruvec = lock_page_lruvec_irq(page);
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, lruvec, false)) {
-				__mod_zone_page_state(zone, NR_MLOCK,  -1);
-				unlock_page_lruvec_irq(lruvec);
+			if (__munlock_isolate_lru_page(page, lruvec, false))
 				continue;
-			} else
+			else
 				__munlock_isolation_failed(page);
-		}
+		} else
+			delta_munlocked++;
 
 		/*
 		 * We won't be munlocking this page in the next phase
@@ -322,8 +322,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 */
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
-		unlock_page_lruvec_irq(lruvec);
 	}
+	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+	if (lruvec)
+		unlock_page_lruvec_irq(lruvec);
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
-- 
1.8.3.1



* [PATCH v8 06/10] mm/swap: only change the lru_lock iff page's lruvec is different
From: Alex Shi @ 2020-01-16  3:05 UTC
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes
  Cc: Thomas Gleixner, Mauro Carvalho Chehab, Yafang Shao

Since we introduced relock_page_lruvec, we can use it in more places
to reduce spin_lock/unlock cycles.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/swap.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 97e108be4f92..84a845968e1d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -196,11 +196,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
-		lruvec = lock_page_lruvec_irqsave(page, &flags);
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 
 		(*move_fn)(page, lruvec, arg);
-		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
@@ -819,14 +820,11 @@ void release_pages(struct page **pages, int nr)
 		}
 
 		if (PageLRU(page)) {
-			struct lruvec *new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+			struct lruvec *pre_lruvec = lruvec;
 
-			if (new_lruvec != lruvec) {
-				if (lruvec)
-					unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
+			if (pre_lruvec != lruvec)
 				lock_batch = 0;
-				lruvec = lock_page_lruvec_irqsave(page, &flags);
-			}
 
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
-- 
1.8.3.1



* [PATCH v8 07/10] mm/pgdat: remove pgdat lru_lock
From: Alex Shi @ 2020-01-16  3:05 UTC
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes
  Cc: Vlastimil Babka, Dan Williams, Michal Hocko, Wei Yang, Arun KS,
	Oscar Salvador, Mike Rapoport, Alexander Duyck, Pavel Tatashin,
	Alexander Potapenko

Now that pgdat->lru_lock has been replaced by the lruvec lock, it is no
longer used; remove it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 include/linux/mmzone.h | 1 -
 mm/page_alloc.c        | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c5455675acf2..7db0cec19aa0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -769,7 +769,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4785a8a2040e..352f2a3d67b3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6712,7 +6712,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 08/10] mm/lru: revise the comments of lru_lock
  2020-01-16  3:04 [PATCH v8 00/10] per lruvec lru_lock for memcg Alex Shi
                   ` (6 preceding siblings ...)
  2020-01-16  3:05 ` [PATCH v8 07/10] mm/pgdat: remove pgdat lru_lock Alex Shi
@ 2020-01-16  3:05 ` Alex Shi
  2020-01-16  3:05 ` [PATCH v8 09/10] mm/lru: add debug checking for page memcg moving Alex Shi
  2020-01-16  3:05 ` [PATCH v8 10/10] mm/memcg: add debug checking in lock_page_memcg Alex Shi
  9 siblings, 0 replies; 29+ messages in thread
From: Alex Shi @ 2020-01-16  3:05 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes
  Cc: Jason Gunthorpe, Dan Williams, Vlastimil Babka, Ira Weiny,
	Jesper Dangaard Brouer, Andrey Ryabinin, Jann Horn,
	Logan Gunthorpe, Souptick Joarder, Ralph Campbell,
	Tobin C. Harding, Michal Hocko, Oscar Salvador, Wei Yang,
	Arun KS, Darrick J. Wong, Amir Goldstein, Dave Chinner,
	Josef Bacik, Kirill A. Shutemov, Jérôme Glisse,
	Mike Kravetz, Kirill Tkhai, Yafang Shao

From: Hugh Dickins <hughd@google.com>

Since we changed pgdat->lru_lock to lruvec->lru_lock, it's time to fix the
now-incorrect comments in the code. Also fix some stale zone->lru_lock
comments left over from ancient times.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
 Documentation/admin-guide/cgroup-v1/memory.rst     |  6 +++---
 Documentation/trace/events-kmem.rst                |  2 +-
 Documentation/vm/unevictable-lru.rst               | 22 ++++++++--------------
 include/linux/mm_types.h                           |  2 +-
 include/linux/mmzone.h                             |  2 +-
 mm/filemap.c                                       |  4 ++--
 mm/rmap.c                                          |  2 +-
 mm/vmscan.c                                        | 12 ++++++++----
 9 files changed, 28 insertions(+), 39 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 ======
-        Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global pgdat->lru_lock).
-	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under pgdat->lru_lock.
-
-	A special function is mem_cgroup_isolate_pages(). This scans
-	memcg's private LRU and call __isolate_lru_page() to extract a page
-	from LRU.
-
-	(By __isolate_lru_page(), the page is removed from both of global and
-	private LRU.)
-
+	Each memcg has its own vector of LRUs (inactive anon, active anon,
+	inactive file, active file, unevictable) of pages from each node,
+	each LRU handled under a single lru_lock for that memcg and node.
 
 9. Typical Tests.
 =================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 0ae4f564c2d6..60d97e8b7f3c 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -297,13 +297,13 @@ When oom event notifier is registered, event will be delivered.
 
    PG_locked.
      mm->page_table_lock
-         pgdat->lru_lock
+         lruvec->lru_lock
 	   lock_page_cgroup.
 
   In many cases, just lock_page_cgroup() is called.
 
-  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  pgdat->lru_lock, it has no lock of its own.
+  per-node-per-cgroup LRU (cgroup's private LRU) is just guarded by
+  lruvec->lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 
 4. Per-CPU Allocator Activity
 =============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux.  The problems have been observed at customer sites on large
 memory x86_64 systems.
 
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone.  When a large
+main memory will have over 32 million 4k pages in a single node.  When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable.  This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
 The Unevictable Page List
 -------------------------
 
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
 
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
 swap-backed pages.  This differentiation is only important while the pages are,
 in fact, evictable.
 
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
 
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock.  This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
 
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
 
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element).  The memory controller tracks the movement of pages to
 and from the unevictable list.
 
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
 active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked.  Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list.  If
+is now mlocked and divert the page to the node's unevictable list.  If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
 
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
      unevictable list in mlock_vma_page().
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 270aa8fd2800..ff08a6a8145c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock.  Sometimes used as a generic list
+			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7db0cec19aa0..d73be191e9f8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..6dcdf06660fb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -101,8 +101,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/rmap.c b/mm/rmap.c
index b3e381919835..39052794cb46 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
  *         mapping->i_mmap_rwsem
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
+ *               lruvec->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee20a64a7ccc..2a3fca20d456 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1626,14 +1626,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended.  Some of the functions that
+ * Isolate pages from the lruvec to fill the @dst list, scanning up to nr_to_scan pages.
+ *
+ * lruvec->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * The lruvec->lru_lock must be held before calling this function.
  *
  * @nr_to_scan:	The number of eligible pages to look through on the list.
  * @lruvec:	The LRU vector to pull pages from.
@@ -1820,14 +1822,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 
 /*
  * This moves pages from @list to corresponding LRU list.
+ * The pages on @list are not attached to any lruvec, and at the end @list is
+ * reused as the pages_to_free list.
  *
  * We move them the other way if the page is referenced by one or more
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation.  But if
+ * appropriate to hold lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page.  It's impossible to balance
+ * should drop lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 09/10] mm/lru: add debug checking for page memcg moving
  2020-01-16  3:04 [PATCH v8 00/10] per lruvec lru_lock for memcg Alex Shi
                   ` (7 preceding siblings ...)
  2020-01-16  3:05 ` [PATCH v8 08/10] mm/lru: revise the comments of lru_lock Alex Shi
@ 2020-01-16  3:05 ` Alex Shi
  2020-01-16  3:05 ` [PATCH v8 10/10] mm/memcg: add debug checking in lock_page_memcg Alex Shi
  9 siblings, 0 replies; 29+ messages in thread
From: Alex Shi @ 2020-01-16  3:05 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes
  Cc: Michal Hocko, Vladimir Davydov

This debug patch could give some clues if something has been overlooked.

Hugh Dickins reported a bug in this patch, thanks!

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/memcontrol.h |  5 +++++
 mm/compaction.c            |  2 ++
 mm/memcontrol.c            | 15 +++++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 09e861df48e8..ece88bb11d0f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -421,6 +421,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 struct lruvec *lock_page_lruvec_irqsave(struct page *, unsigned long*);
 void unlock_page_lruvec_irq(struct lruvec *);
 void unlock_page_lruvec_irqrestore(struct lruvec *, unsigned long);
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
@@ -1183,6 +1184,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/compaction.c b/mm/compaction.c
index 8c0a2da217d8..151242817bf4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -971,6 +971,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
 			locked_lruvec = lruvec;
 
+			lruvec_memcg_debug(lruvec, page);
+
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
 				skip_updated = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00fef8ddbd08..a567fd868739 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1238,6 +1238,19 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+#ifdef  CONFIG_DEBUG_VM
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+#endif
+}
+
 struct lruvec *lock_page_lruvec_irq(struct page *page)
 {
 	struct lruvec *lruvec;
@@ -1247,6 +1260,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	spin_lock_irq(&lruvec->lru_lock);
 
+	lruvec_memcg_debug(lruvec, page);
 	return lruvec;
 }
 
@@ -1259,6 +1273,7 @@ struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	spin_lock_irqsave(&lruvec->lru_lock, *flags);
 
+	lruvec_memcg_debug(lruvec, page);
 	return lruvec;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 10/10] mm/memcg: add debug checking in lock_page_memcg
  2020-01-16  3:04 [PATCH v8 00/10] per lruvec lru_lock for memcg Alex Shi
                   ` (8 preceding siblings ...)
  2020-01-16  3:05 ` [PATCH v8 09/10] mm/lru: add debug checking for page memcg moving Alex Shi
@ 2020-01-16  3:05 ` Alex Shi
  9 siblings, 0 replies; 29+ messages in thread
From: Alex Shi @ 2020-01-16  3:05 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb, hannes
  Cc: Michal Hocko, Vladimir Davydov

This extra irq disable/enable and BUG_ON checking costs 5% of readtwice
performance on a 2-socket * 26-core * HT box, so put it under
CONFIG_PROVE_LOCKING.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a567fd868739..4ad1b4d2eb1e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2029,6 +2029,12 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 	if (unlikely(!memcg))
 		return NULL;
 
+#ifdef CONFIG_PROVE_LOCKING
+	local_irq_save(flags);
+	might_lock(&memcg->move_lock);
+	local_irq_restore(flags);
+#endif
+
 	if (atomic_read(&memcg->moving_account) <= 0)
 		return memcg;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-01-16  3:05 ` [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2020-01-16 21:52   ` Johannes Weiner
  2020-01-19 11:32     ` Alex Shi
                       ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Johannes Weiner @ 2020-01-16 21:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

On Thu, Jan 16, 2020 at 11:05:02AM +0800, Alex Shi wrote:
> @@ -948,10 +956,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>  			goto isolate_fail;
>  
> +		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
>  		/* If we already hold the lock, we can skip some rechecking */
> -		if (!locked) {
> -			locked = compact_lock_irqsave(&pgdat->lru_lock,
> -								&flags, cc);
> +		if (lruvec != locked_lruvec) {
> +			struct mem_cgroup *memcg = lock_page_memcg(page);
> +
> +			if (locked_lruvec) {
> +				unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> +				locked_lruvec = NULL;
> +			}
> +			/* reget lruvec with a locked memcg */
> +			lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
> +			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> +			locked_lruvec = lruvec;
>  
>  			/* Try get exclusive access under lock */
>  			if (!skip_updated) {

In a previous review, I pointed out the following race condition
between page charging and compaction:

compaction:				generic_file_buffered_read:

					page_cache_alloc()

!PageBuddy()

lock_page_lruvec(page)
  lruvec = mem_cgroup_page_lruvec()
  spin_lock(&lruvec->lru_lock)
  if lruvec != mem_cgroup_page_lruvec()
    goto again

					add_to_page_cache_lru()
					  mem_cgroup_commit_charge()
					    page->mem_cgroup = foo
					  lru_cache_add()
					    __pagevec_lru_add()
					      SetPageLRU()

if PageLRU(page):
  __isolate_lru_page()

As far as I can see, you have not addressed this. You have added
lock_page_memcg(), but that prevents charged pages from moving between
cgroups, it does not prevent newly allocated pages from being charged.

It doesn't matter how many times you check the lruvec before and after
locking - if you're looking at a free page, it might get allocated,
charged and put on a new lruvec after you're done checking, and then
you isolate a page from an unlocked lruvec.

You simply cannot serialize on page->mem_cgroup->lruvec when
page->mem_cgroup isn't stable. You need to serialize on the page
itself, one way or another, to make this work.


So here is a crazy idea that may be worth exploring:

Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
linked list.

Can we make PageLRU atomic and use it to stabilize the lru_lock
instead, and then use the lru_lock only serialize list operations?

I.e. in compaction, you'd do

	if (!TestClearPageLRU(page))
		goto isolate_fail;
	/*
	 * We isolated the page's LRU state and thereby locked out all
	 * other isolators, including cgroup page moving, page reclaim,
	 * page freeing etc. That means page->mem_cgroup is now stable
	 * and we can safely look up the correct lruvec and take the
	 * page off its physical LRU list.
	 */
	lruvec = mem_cgroup_page_lruvec(page);
	spin_lock_irq(&lruvec->lru_lock);
	del_page_from_lru_list(page, lruvec, page_lru(page));

Putback would mostly remain the same (although you could take the
PageLRU setting out of the list update locked section, as long as it's
set after the page is physically linked):

	/* LRU isolation pins page->mem_cgroup */
	lruvec = mem_cgroup_page_lruvec(page)
	spin_lock_irq(&lruvec->lru_lock);
	add_page_to_lru_list(...);
	spin_unlock_irq(&lruvec->lru_lock);

	SetPageLRU(page);

And you'd have to carefully review and rework other sites that rely on
PageLRU: reclaim, __page_cache_release(), __activate_page() etc.

Especially things like activate_page(), which used to only check
PageLRU to shuffle the page on the LRU list would now have to briefly
clear PageLRU and then set it again afterwards.
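
A rough sketch of what such a rework could look like, purely for illustration
(the function below is a hypothetical stand-in, not code from this series; it
just follows the TestClearPageLRU scheme sketched above):

	/*
	 * Sketch only: activate a page while briefly owning its LRU
	 * isolation state.
	 */
	static void activate_page_sketch(struct page *page)
	{
		struct lruvec *lruvec;

		if (!TestClearPageLRU(page))
			return;	/* somebody else has the page isolated */

		/* PageLRU is cleared, so page->mem_cgroup is stable here */
		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irq(&lruvec->lru_lock);
		if (!PageActive(page) && !PageUnevictable(page)) {
			del_page_from_lru_list(page, lruvec, page_lru(page));
			SetPageActive(page);
			add_page_to_lru_list(page, lruvec, page_lru(page));
		}
		spin_unlock_irq(&lruvec->lru_lock);

		/* hand the isolation state back to other isolators */
		SetPageLRU(page);
	}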

However, aside from a bit more churn in those cases, and the
unfortunate additional atomic operations, I currently can't think of a
fundamental reason why this wouldn't work.

Hugh, what do you think?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-01-16 21:52   ` Johannes Weiner
@ 2020-01-19 11:32     ` Alex Shi
  2020-01-20 12:58     ` Alex Shi
  2020-04-13 10:48     ` Alex Shi
  2 siblings, 0 replies; 29+ messages in thread
From: Alex Shi @ 2020-01-19 11:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao


> In a previous review, I pointed out the following race condition
> between page charging and compaction:
> 
> compaction:				generic_file_buffered_read:
> 
> 					page_cache_alloc()
> 
> !PageBuddy()
> 
> lock_page_lruvec(page)
>   lruvec = mem_cgroup_page_lruvec()
>   spin_lock(&lruvec->lru_lock)
>   if lruvec != mem_cgroup_page_lruvec()
>     goto again
> 
> 					add_to_page_cache_lru()
> 					  mem_cgroup_commit_charge()
> 					    page->mem_cgroup = foo
> 					  lru_cache_add()
> 					    __pagevec_lru_add()
> 					      SetPageLRU()
> 
> if PageLRU(page):
>   __isolate_lru_page()
> 
> As far as I can see, you have not addressed this. You have added
> lock_page_memcg(), but that prevents charged pages from moving between
> cgroups, it does not prevent newly allocated pages from being charged.
> 

Yes, it's my fault for overlooking this problem.

...

> 
> So here is a crazy idea that may be worth exploring:
> 
> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> linked list.
> 
> Can we make PageLRU atomic and use it to stabilize the lru_lock
> instead, and then use the lru_lock only serialize list operations?
> 

Sounds like a good idea. I will try this.

Thanks
Alex

> I.e. in compaction, you'd do
> 
> 	if (!TestClearPageLRU(page))
> 		goto isolate_fail;
> 	/*
> 	 * We isolated the page's LRU state and thereby locked out all
> 	 * other isolators, including cgroup page moving, page reclaim,
> 	 * page freeing etc. That means page->mem_cgroup is now stable
> 	 * and we can safely look up the correct lruvec and take the
> 	 * page off its physical LRU list.
> 	 */
> 	lruvec = mem_cgroup_page_lruvec(page);
> 	spin_lock_irq(&lruvec->lru_lock);
> 	del_page_from_lru_list(page, lruvec, page_lru(page));
> 
> Putback would mostly remain the same (although you could take the
> PageLRU setting out of the list update locked section, as long as it's
> set after the page is physically linked):
> 
> 	/* LRU isolation pins page->mem_cgroup */
> 	lruvec = mem_cgroup_page_lruvec(page)
> 	spin_lock_irq(&lruvec->lru_lock);
> 	add_page_to_lru_list(...);
> 	spin_unlock_irq(&lruvec->lru_lock);
> 
> 	SetPageLRU(page);
> 
> And you'd have to carefully review and rework other sites that rely on
> PageLRU: reclaim, __page_cache_release(), __activate_page() etc.
> 
> Especially things like activate_page(), which used to only check
> PageLRU to shuffle the page on the LRU list would now have to briefly
> clear PageLRU and then set it again afterwards.
> 
> However, aside from a bit more churn in those cases, and the
> unfortunate additional atomic operations, I currently can't think of a
> fundamental reason why this wouldn't work.
> 
> Hugh, what do you think?
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-01-16 21:52   ` Johannes Weiner
  2020-01-19 11:32     ` Alex Shi
@ 2020-01-20 12:58     ` Alex Shi
  2020-01-21 16:00       ` Johannes Weiner
  2020-04-13 10:48     ` Alex Shi
  2 siblings, 1 reply; 29+ messages in thread
From: Alex Shi @ 2020-01-20 12:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao



On 2020/1/17 at 5:52 AM, Johannes Weiner wrote:

> You simply cannot serialize on page->mem_cgroup->lruvec when
> page->mem_cgroup isn't stable. You need to serialize on the page
> itself, one way or another, to make this work.
> 
> 
> So here is a crazy idea that may be worth exploring:
> 
> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> linked list.
> 
> Can we make PageLRU atomic and use it to stabilize the lru_lock
> instead, and then use the lru_lock only serialize list operations?
> 

Hi Johannes,

I am trying to work out the atomic PageLRU solution, but I am blocked by the
following situation. When PageLRU and the lru list were protected together
under lru_lock, PageLRU could serve as the indicator that a page is on an lru
list, but now it seems it can no longer be that indicator.
Could you give more clues on how PageLRU is used for stabilization?
  

__page_cache_release/release_pages/compaction            __pagevec_lru_add
if (TestClearPageLRU(page))                              if (!PageLRU())
                                                                lruvec_lock();
                                                                list_add();
        			                                lruvec_unlock();
        			                                SetPageLRU() //position 1
        lock_page_lruvec_irqsave(page, &flags);
        del_page_from_lru_list(page, lruvec, ..);
        unlock_page_lruvec_irqrestore(lruvec, flags);
                                                                SetPageLRU() //position 2
Thanks a lot!
Alex

> I.e. in compaction, you'd do
> 
> 	if (!TestClearPageLRU(page))
> 		goto isolate_fail;
> 	/*
> 	 * We isolated the page's LRU state and thereby locked out all
> 	 * other isolators, including cgroup page moving, page reclaim,
> 	 * page freeing etc. That means page->mem_cgroup is now stable
> 	 * and we can safely look up the correct lruvec and take the
> 	 * page off its physical LRU list.
> 	 */
> 	lruvec = mem_cgroup_page_lruvec(page);
> 	spin_lock_irq(&lruvec->lru_lock);
> 	del_page_from_lru_list(page, lruvec, page_lru(page));
> 
> Putback would mostly remain the same (although you could take the
> PageLRU setting out of the list update locked section, as long as it's
> set after the page is physically linked):
> 
> 	/* LRU isolation pins page->mem_cgroup */
> 	lruvec = mem_cgroup_page_lruvec(page)
> 	spin_lock_irq(&lruvec->lru_lock);
> 	add_page_to_lru_list(...);
> 	spin_unlock_irq(&lruvec->lru_lock);
> 
> 	SetPageLRU(page);
> 
> And you'd have to carefully review and rework other sites that rely on
> PageLRU: reclaim, __page_cache_release(), __activate_page() etc.
> 
> Especially things like activate_page(), which used to only check
> PageLRU to shuffle the page on the LRU list would now have to briefly
> clear PageLRU and then set it again afterwards.
> 
> However, aside from a bit more churn in those cases, and the
> unfortunate additional atomic operations, I currently can't think of a
> fundamental reason why this wouldn't work.
> 
> Hugh, what do you think?
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-01-20 12:58     ` Alex Shi
@ 2020-01-21 16:00       ` Johannes Weiner
  2020-01-22 12:01         ` Alex Shi
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2020-01-21 16:00 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

On Mon, Jan 20, 2020 at 08:58:09PM +0800, Alex Shi wrote:
> 
> 
> On 2020/1/17 at 5:52 AM, Johannes Weiner wrote:
> 
> > You simply cannot serialize on page->mem_cgroup->lruvec when
> > page->mem_cgroup isn't stable. You need to serialize on the page
> > itself, one way or another, to make this work.
> > 
> > 
> > So here is a crazy idea that may be worth exploring:
> > 
> > Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> > linked list.
> > 
> > Can we make PageLRU atomic and use it to stabilize the lru_lock
> > instead, and then use the lru_lock only serialize list operations?
> > 
> 
> Hi Johannes,
> 
> I am trying to work out the atomic PageLRU solution, but I am blocked by the
> following situation. When PageLRU and the lru list were protected together
> under lru_lock, PageLRU could serve as the indicator that a page is on an lru
> list, but now it seems it can no longer be that indicator.
> Could you give more clues on how PageLRU is used for stabilization?

There are two types of PageLRU checks: optimistic and deterministic.

The check in activate_page() for example is optimistic and the result
unstable, but that's okay, because if we miss a page here and there
it's not the end of the world.

But the check in __activate_page() is deterministic, because we need
to be sure before del_page_from_lru_list(). Currently it's made
deterministic by testing under the lock: whoever acquires the lock
first gets to touch the LRU state. The same can be done with an atomic
TestClearPageLRU: whoever clears the flag first gets to touch the LRU
state (the lock is then only acquired to not corrupt the linked list,
in case somebody adds or removes a different page at the same time).

I.e. in my proposal, if you want to get a stable read of PageLRU, you
have to clear it atomically. But AFAICS, everybody who currently does
need a stable read either already clears it or can easily be converted
to clear it and then set it again (like __activate_page and friends).
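
As a small illustration of the two patterns (sketch only; the helper name in
the first snippet is hypothetical, and the rest follows the scheme being
discussed rather than any actual patch):

	/* optimistic: a racy PageLRU() peek, missing a page occasionally is fine */
	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page))
		queue_page_for_activation(page);	/* hypothetical batching helper */

	/* deterministic: claim the LRU state first, then touch the list */
	if (TestClearPageLRU(page)) {
		struct lruvec *lruvec;

		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irq(&lruvec->lru_lock);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		spin_unlock_irq(&lruvec->lru_lock);
	}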

> __page_cache_release/release_pages/compaction            __pagevec_lru_add
> if (TestClearPageLRU(page))                              if (!PageLRU())
>                                                                 lruvec_lock();
>                                                                 list_add();
>         			                                lruvec_unlock();
>         			                                SetPageLRU() //position 1
>         lock_page_lruvec_irqsave(page, &flags);
>         del_page_from_lru_list(page, lruvec, ..);
>         unlock_page_lruvec_irqrestore(lruvec, flags);
>                                                                 SetPageLRU() //position 2

Hm, that's not how __pagevec_lru_add() looks. In fact,
__pagevec_lru_add_fn() has a BUG_ON(PageLRU).

That's because only one thread can own the isolation state at a time.

If PageLRU is set, only one thread can claim it. Right now, whoever
takes the lock first and clears it wins. When we replace it with
TestClearPageLRU, it's the same thing: only one thread can win.

And you cannot set PageLRU, unless you own it. Either you isolated the
page using TestClearPageLRU, or you allocated a new page.

So you can have multiple threads trying to isolate a page from the LRU
list, hence the atomic testclear. But no two threads should ever be
racing to add a page to the LRU list, because only one thread can own
the isolation state.

With the atomic PageLRU flag, the sequence would be this:

__pagevec_lru_add:

  BUG_ON(PageLRU()) // Caller *must* own the isolation state

  lruvec_lock()     // The lruvec is stable, because changing
                    // page->mem_cgroup requires owning the
                    // isolation state (PageLRU) and we own it

  list_add()        // Linked list protected by lru_lock

  lruvec_unlock()

  SetPageLRU()      // The page has been added to the linked
                    // list, give up our isolation state. Once
                    // this flag becomes visible, other threads
                    // can isolate the page from the LRU list

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-01-21 16:00       ` Johannes Weiner
@ 2020-01-22 12:01         ` Alex Shi
  2020-01-22 18:31           ` Johannes Weiner
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Shi @ 2020-01-22 12:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao



On 2020/1/22 at 12:00 AM, Johannes Weiner wrote:
> On Mon, Jan 20, 2020 at 08:58:09PM +0800, Alex Shi wrote:
>>
>>
>> On 2020/1/17 at 5:52 AM, Johannes Weiner wrote:
>>
>>> You simply cannot serialize on page->mem_cgroup->lruvec when
>>> page->mem_cgroup isn't stable. You need to serialize on the page
>>> itself, one way or another, to make this work.
>>>
>>>
>>> So here is a crazy idea that may be worth exploring:
>>>
>>> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
>>> linked list.
>>>
>>> Can we make PageLRU atomic and use it to stabilize the lru_lock
>>> instead, and then use the lru_lock only serialize list operations?
>>>
>>
>> Hi Johannes,
>>
>> I am trying to work out the atomic PageLRU solution, but I am blocked by the
>> following situation. When PageLRU and the lru list were protected together
>> under lru_lock, PageLRU could serve as the indicator that a page is on an lru
>> list, but now it seems it can no longer be that indicator.
>> Could you give more clues on how PageLRU is used for stabilization?
> 
> There are two types of PageLRU checks: optimistic and deterministic.
> 
> The check in activate_page() for example is optimistic and the result
> unstable, but that's okay, because if we miss a page here and there
> it's not the end of the world.
> 
> But the check in __activate_page() is deterministic, because we need
> to be sure before del_page_from_lru_list(). Currently it's made
> deterministic by testing under the lock: whoever acquires the lock
> first gets to touch the LRU state. The same can be done with an atomic
> TestClearPageLRU: whoever clears the flag first gets to touch the LRU
> state (the lock is then only acquired to not corrupt the linked list,
> in case somebody adds or removes a different page at the same time).

Hi Johannes,

Thanks a lot for the detailed explanations! I am going to take a two-week
holiday starting tomorrow for the Chinese New Year season with my family, so I
am very sorry that I cannot keep working on this for a while.

> 
> I.e. in my proposal, if you want to get a stable read of PageLRU, you
> have to clear it atomically. But AFAICS, everybody who currently does
> need a stable read either already clears it or can easily be converted
> to clear it and then set it again (like __activate_page and friends).
> 
>> __page_cache_release/release_pages/compaction            __pagevec_lru_add
>> if (TestClearPageLRU(page))                              if (!PageLRU())
>>                                                                 lruvec_lock();
>>                                                                 list_add();
>>         			                                lruvec_unlock();
>>         			                                SetPageLRU() //position 1
>>         lock_page_lruvec_irqsave(page, &flags);
>>         del_page_from_lru_list(page, lruvec, ..);
>>         unlock_page_lruvec_irqrestore(lruvec, flags);
>>                                                                 SetPageLRU() //position 2
> 
> Hm, that's not how __pagevec_lru_add() looks. In fact,
> __pagevec_lru_add_fn() has a BUG_ON(PageLRU).
> 
> That's because only one thread can own the isolation state at a time.
> 
> If PageLRU is set, only one thread can claim it. Right now, whoever
> takes the lock first and clears it wins. When we replace it with
> TestClearPageLRU, it's the same thing: only one thread can win.
> 
> And you cannot set PageLRU, unless you own it. Either you isolated the
> page using TestClearPageLRU, or you allocated a new page.

Yes, I understand that isolation is made exclusive by PageLRU, but forgive my
confusion: I could not figure out how adding a new page to an lruvec could be blocked.

Anyway, I will try my best to catch up after the holiday.

Many thanks for the nice coaching!
Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-01-22 12:01         ` Alex Shi
@ 2020-01-22 18:31           ` Johannes Weiner
  0 siblings, 0 replies; 29+ messages in thread
From: Johannes Weiner @ 2020-01-22 18:31 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

On Wed, Jan 22, 2020 at 08:01:29PM +0800, Alex Shi wrote:
> Yes, I understand that isolation is made exclusive by PageLRU, but forgive my
> confusion: I could not figure out how adding a new page to an lruvec could be blocked.

I don't see why we would need this. Can you elaborate where you think
this is a problem?

If compaction races with charging for example, compaction doesn't need
to prevent a new page from being added to an lruvec. PageLRU is only
set after page->mem_cgroup is updated, so there are two race outcomes:

1) TestClearPageLRU() fails. That means the page isn't (fully) created
yet and cannot be migrated. We goto isolate_fail before even trying to
lock the lruvec.

2) TestClearPageLRU() succeeds. That means the page was fully created
and page->mem_cgroup has been set up. Anybody who now wants to change
page->mem_cgroup needs PageLRU, but we have it, so lruvec is stable.

I.e. cgroup charging does this:

	page->mem_cgroup = new_group

	lock(pgdat->lru_lock)
	SetPageLRU()
	add_page_to_lru_list()
	unlock(pgdat->lru_lock)

and compaction currently does this:

	lock(pgdat->lru_lock)
	if (!PageLRU())
		goto isolate_fail
	// __isolate_lru_page:
	if (!get_page_unless_zero())
		goto isolate_fail
	ClearPageLRU()
	del_page_from_lru_list()
	unlock(pgdat->lru_lock)

We can replace charging with this:

	page->mem_cgroup = new_group

	lock(lruvec->lru_lock)
	add_page_to_lru_list()
	unlock(lruvec->lru_lock)

	SetPageLRU()

and the compaction sequence with something like this:

	if (!get_page_unless_zero())
		goto isolate_fail

	if (!TestClearPageLRU())
		goto isolate_fail_put

	// We got PageLRU, so charging is complete and nobody
	// can modify page->mem_cgroup until we set it again.

	lruvec = mem_cgroup_page_lruvec(page, pgdat)
	lock(lruvec->lru_lock)
	del_page_from_lru_list()
	unlock(lruvec->lru_lock)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-01-16 21:52   ` Johannes Weiner
  2020-01-19 11:32     ` Alex Shi
  2020-01-20 12:58     ` Alex Shi
@ 2020-04-13 10:48     ` Alex Shi
  2020-04-13 18:07       ` Johannes Weiner
  2 siblings, 1 reply; 29+ messages in thread
From: Alex Shi @ 2020-04-13 10:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

> In a previous review, I pointed out the following race condition
> between page charging and compaction:
> 
> compaction:				generic_file_buffered_read:
> 
> 					page_cache_alloc()
> 
> !PageBuddy()
> 
> lock_page_lruvec(page)
>   lruvec = mem_cgroup_page_lruvec()
>   spin_lock(&lruvec->lru_lock)
>   if lruvec != mem_cgroup_page_lruvec()
>     goto again
> 
> 					add_to_page_cache_lru()
> 					  mem_cgroup_commit_charge()
> 					    page->mem_cgroup = foo
> 					  lru_cache_add()
> 					    __pagevec_lru_add()
> 					      SetPageLRU()
> 
> if PageLRU(page):
>   __isolate_lru_page()
> 
> As far as I can see, you have not addressed this. You have added
> lock_page_memcg(), but that prevents charged pages from moving between
> cgroups, it does not prevent newly allocated pages from being charged.
> 
> It doesn't matter how many times you check the lruvec before and after
> locking - if you're looking at a free page, it might get allocated,
> charged and put on a new lruvec after you're done checking, and then
> you isolate a page from an unlocked lruvec.
> 
> You simply cannot serialize on page->mem_cgroup->lruvec when
> page->mem_cgroup isn't stable. You need to serialize on the page
> itself, one way or another, to make this work.
> 
> 
> So here is a crazy idea that may be worth exploring:
> 
> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> linked list.
> 
> Can we make PageLRU atomic and use it to stabilize the lru_lock
> instead, and then use the lru_lock only serialize list operations?
> 
> I.e. in compaction, you'd do
> 
> 	if (!TestClearPageLRU(page))
> 		goto isolate_fail;
> 	/*
> 	 * We isolated the page's LRU state and thereby locked out all
> 	 * other isolators, including cgroup page moving, page reclaim,
> 	 * page freeing etc. That means page->mem_cgroup is now stable
> 	 * and we can safely look up the correct lruvec and take the
> 	 * page off its physical LRU list.
> 	 */
> 	lruvec = mem_cgroup_page_lruvec(page);
> 	spin_lock_irq(&lruvec->lru_lock);
> 	del_page_from_lru_list(page, lruvec, page_lru(page));
> 
> Putback would mostly remain the same (although you could take the
> PageLRU setting out of the list update locked section, as long as it's
> set after the page is physically linked):
> 
> 	/* LRU isolation pins page->mem_cgroup */
> 	lruvec = mem_cgroup_page_lruvec(page)
> 	spin_lock_irq(&lruvec->lru_lock);
> 	add_page_to_lru_list(...);
> 	spin_unlock_irq(&lruvec->lru_lock);
> 
> 	SetPageLRU(page);
> 
> And you'd have to carefully review and rework other sites that rely on
> PageLRU: reclaim, __page_cache_release(), __activate_page() etc.
> 
> Especially things like activate_page(), which used to only check
> PageLRU to shuffle the page on the LRU list would now have to briefly
> clear PageLRU and then set it again afterwards.
> 
> However, aside from a bit more churn in those cases, and the
> unfortunate additional atomic operations, I currently can't think of a
> fundamental reason why this wouldn't work.
> 
> Hugh, what do you think?
> 

Hi Johannes

As to the idea of TestClearPageLRU, we expect the following scenario:
    compaction                       commit_charge
                                     if (TestClearPageLRU)
        !TestClearPageLRU                 lock_page_lruvec
            goto isolate_fail;            del_from_lru_list
                                          unlock_page_lruvec

But there is a difficult situation to handle:

   compaction                        commit_charge
        TestClearPageLRU
                                    !TestClearPageLRU

                                    page possible state:
                                    a, reclaiming, b, moving between lru list, c, migrating, like in compaction
                                    d, mlocking,   e, split_huge_page,

If the page's lru bit was cleared in commit_charge with lrucare, we still
have no idea whether the page was isolated for one of the reasons a~e above
or was never on the LRU at all, and handling each of those cases separately is costly.

So for the issue you mentioned above, maybe the better idea is to use
lrucare when doing mem_cgroup_commit_charge(), since charging does not
happen that often. What do you think of this solution?

Thanks
Alex

> compaction:				generic_file_buffered_read:
> 
> 					page_cache_alloc()
> 
> !PageBuddy()
> 
> lock_page_lruvec(page)
>   lruvec = mem_cgroup_page_lruvec()
>   spin_lock(&lruvec->lru_lock)
>   if lruvec != mem_cgroup_page_lruvec()
>     goto again
> 
> 					add_to_page_cache_lru()
> 					  mem_cgroup_commit_charge()

					 //do charge with lrucare
					 spin_lock(&lruvec->lru_lock)
> 					    page->mem_cgroup = foo
> 					  lru_cache_add()
> 					    __pagevec_lru_add()
> 					      SetPageLRU()
> 
> if PageLRU(page):
>   __isolate_lru_page()

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-13 10:48     ` Alex Shi
@ 2020-04-13 18:07       ` Johannes Weiner
  2020-04-14  4:52         ` Alex Shi
  2020-04-14  8:19         ` Alex Shi
  0 siblings, 2 replies; 29+ messages in thread
From: Johannes Weiner @ 2020-04-13 18:07 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

On Mon, Apr 13, 2020 at 06:48:22PM +0800, Alex Shi wrote:
> > In a previous review, I pointed out the following race condition
> > between page charging and compaction:
> > 
> > compaction:				generic_file_buffered_read:
> > 
> > 					page_cache_alloc()
> > 
> > !PageBuddy()
> > 
> > lock_page_lruvec(page)
> >   lruvec = mem_cgroup_page_lruvec()
> >   spin_lock(&lruvec->lru_lock)
> >   if lruvec != mem_cgroup_page_lruvec()
> >     goto again
> > 
> > 					add_to_page_cache_lru()
> > 					  mem_cgroup_commit_charge()
> > 					    page->mem_cgroup = foo
> > 					  lru_cache_add()
> > 					    __pagevec_lru_add()
> > 					      SetPageLRU()
> > 
> > if PageLRU(page):
> >   __isolate_lru_page()
> > 
> > As far as I can see, you have not addressed this. You have added
> > lock_page_memcg(), but that prevents charged pages from moving between
> > cgroups, it does not prevent newly allocated pages from being charged.
> > 
> > It doesn't matter how many times you check the lruvec before and after
> > locking - if you're looking at a free page, it might get allocated,
> > charged and put on a new lruvec after you're done checking, and then
> > you isolate a page from an unlocked lruvec.
> > 
> > You simply cannot serialize on page->mem_cgroup->lruvec when
> > page->mem_cgroup isn't stable. You need to serialize on the page
> > itself, one way or another, to make this work.
> > 
> > 
> > So here is a crazy idea that may be worth exploring:
> > 
> > Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> > linked list.
> > 
> > Can we make PageLRU atomic and use it to stabilize the lru_lock
> > instead, and then use the lru_lock only serialize list operations?
> > 
> > I.e. in compaction, you'd do
> > 
> > 	if (!TestClearPageLRU(page))
> > 		goto isolate_fail;
> > 	/*
> > 	 * We isolated the page's LRU state and thereby locked out all
> > 	 * other isolators, including cgroup page moving, page reclaim,
> > 	 * page freeing etc. That means page->mem_cgroup is now stable
> > 	 * and we can safely look up the correct lruvec and take the
> > 	 * page off its physical LRU list.
> > 	 */
> > 	lruvec = mem_cgroup_page_lruvec(page);
> > 	spin_lock_irq(&lruvec->lru_lock);
> > 	del_page_from_lru_list(page, lruvec, page_lru(page));
> > 
> > Putback would mostly remain the same (although you could take the
> > PageLRU setting out of the list update locked section, as long as it's
> > set after the page is physically linked):
> > 
> > 	/* LRU isolation pins page->mem_cgroup */
> > 	lruvec = mem_cgroup_page_lruvec(page)
> > 	spin_lock_irq(&lruvec->lru_lock);
> > 	add_page_to_lru_list(...);
> > 	spin_unlock_irq(&lruvec->lru_lock);
> > 
> > 	SetPageLRU(page);
> > 
> > And you'd have to carefully review and rework other sites that rely on
> > PageLRU: reclaim, __page_cache_release(), __activate_page() etc.
> > 
> > Especially things like activate_page(), which used to only check
> > PageLRU to shuffle the page on the LRU list would now have to briefly
> > clear PageLRU and then set it again afterwards.
> > 
> > However, aside from a bit more churn in those cases, and the
> > unfortunate additional atomic operations, I currently can't think of a
> > fundamental reason why this wouldn't work.
> > 
> > Hugh, what do you think?
> > 
> 
> Hi Johannes
> 
> As to the idea of TestClearPageLRU, we expect the following scenario:
>     compaction                       commit_charge
>                                      if (TestClearPageLRU)
>         !TestClearPageLRU                 lock_page_lruvec
>             goto isolate_fail;            del_from_lru_list
>                                           unlock_page_lruvec
> 
> But there is a difficult situation to handle:
> 
>    compaction                        commit_charge
>         TestClearPageLRU
>                                     !TestClearPageLRU
> 
>                                     page possible state:
>                                     a, reclaiming, b, moving between lru list, c, migrating, like in compaction
>                                     d, mlocking,   e, split_huge_page,
> 
> If the page's lru bit was cleared in commit_charge with lrucare, we still
> have no idea whether the page was isolated for one of the reasons a~e above
> or was never on the LRU at all, and handling each of those cases separately is costly.
> 
> So for the issue you mentioned above, maybe the better idea is to use
> lrucare when doing mem_cgroup_commit_charge(), since charging does not
> happen that often. What do you think of this solution?

Hm, yes, the lrucare scenario is a real problem. If it can isolate the
page, fine, but if not, it changes page->mem_cgroup on a page that
somebody else has isolated, having indeed no idea who they are and how
they are going to access page->mem_cgroup.

Right now it's safe because of secondary protection on top of
isolation: split_huge_page keeps the lru_lock held throughout;
reclaim, cgroup migration, page migration, compaction etc. hold the
page lock which locks out swapcache charging.

But it complicates the serialization model immensely and makes it
subtle and error prone.

I'm not sure how unconditionally taking the lru_lock when charging
would help. Can you lay out what you have in mind in prototype code,
like I'm using below, for isolation, putback, charging, compaction?

That said, charging actually is a hotpath. I'm reluctant to
unconditionally take the LRU lock there. But if you can make things a
lot simpler this way, it could be worth exploring.

In the PageLRU locking scheme, I can see a way to make putback safe
wrt lrucare charging, but I'm not sure about isolation:

putback:
lruvec = page->mem_cgroup->lruvecs[pgdat]
spin_lock(lruvec->lru_lock)
if lruvec != page->mem_cgroup->lruvecs[pgdat]:
  /*
   * commit_charge(lrucare=true) can charge an uncharged swapcache
   * page while we had it isolated. This changes page->mem_cgroup,
   * but it can only happen once. Look up the new cgroup.
   */
  spin_unlock(lruvec->lru_lock)
  lruvec = page->mem_cgroup->lruvecs[pgdat]
  spin_lock(lruvec->lru_lock)
add_page_to_lru_list(page, lruvec, ...)
SetPageLRU(page);
spin_unlock(lruvec->lru_lock)

commit_charge:
if (lrucare)
  spin_lock(root_memcg->lru_lock)
  /*
   * If we can isolate the page, we'll move it to the new
   * cgroup's LRU list. If somebody else has the page
   * isolated, we need their putback to move it to the
   * new cgroup. If they see the old cgroup - the root -
   * they will spin until we're done and recheck.
   */
  if ((lru = TestClearPageLRU(page)))
    del_page_from_lru_list()
page->mem_cgroup = new;
if (lrucare)
  spin_unlock(root_memcg->lru_lock)
  if (lru)
    spin_lock(new->lru_lock)
    add_page_to_lru_list()
    spin_unlock(new->lru_lock);
    SetPageLRU(page)

putback would need to 1) recheck once after acquiring the lock and 2)
SetPageLRU while holding the lru_lock after all. But it works because
we know the old cgroup: if the putback sees the old cgroup, we know
it's the root cgroup, and we have that locked until we're done with
the update. And if putback manages to lock the old cgroup before us,
we will spin until the isolator is done, and then either be able to
isolate it ourselves or, if racing with yet another isolator, hold the
lock and delay putback until we're done.
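
For illustration only, that putback recheck could be factored into a small
helper along these lines (a sketch, not part of any posted patch; the helper
name is made up, lruvec->lru_lock is the per-lruvec lock from this series, and
mem_cgroup_page_lruvec() is used in its current two-argument form):

	/*
	 * Sketch: look up the lruvec from page->mem_cgroup, lock it, and
	 * retry if a concurrent lrucare charge moved the page to another
	 * memcg in the meantime.
	 */
	static struct lruvec *lock_page_lruvec_putback(struct page *page)
	{
		struct pglist_data *pgdat = page_pgdat(page);
		struct lruvec *lruvec;
	again:
		lruvec = mem_cgroup_page_lruvec(page, pgdat);
		spin_lock_irq(&lruvec->lru_lock);
		if (unlikely(lruvec != mem_cgroup_page_lruvec(page, pgdat))) {
			/* the page was recharged; drop the stale lock and retry */
			spin_unlock_irq(&lruvec->lru_lock);
			goto again;
		}
		return lruvec;
	}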

But isolation actually needs to lock out charging, or it would operate
on the wrong list:

isolation:                                     commit_charge:
if (TestClearPageLRU(page))
                                               page->mem_cgroup = new
  // page is still physically on
  // the root_mem_cgroup's LRU. We're
  // updating the wrong list:
  memcg = page->mem_cgroup
  spin_lock(memcg->lru_lock)
  del_page_from_lru_list(page, memcg)
  spin_unlock(memcg->lru_lock)

lrucare really is a mess. Even before this patch series, it makes
things tricky and subtle and error prone.

The only reason we're doing it is for when there is swapping without
swap tracking, in which case swap readahead needs to put pages on the
LRU but cannot charge them until we have a faulting vma later.

But it's not clear how practical such a configuration is. Both memory
and swap are shared resources, and isolation isn't really effective
when you restrict access to memory but then let workloads swap freely.

Plus, the overhead of tracking is tiny - 512k per G of swap (0.04%).

Maybe we should just delete MEMCG_SWAP and unconditionally track swap
entry ownership when the memory controller is enabled. I don't see a
good reason not to, and it would simplify the entire swapin path, the
LRU locking, and the page->mem_cgroup stabilization rules.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-13 18:07       ` Johannes Weiner
@ 2020-04-14  4:52         ` Alex Shi
  2020-04-14 16:31           ` Johannes Weiner
  2020-04-14  8:19         ` Alex Shi
  1 sibling, 1 reply; 29+ messages in thread
From: Alex Shi @ 2020-04-14  4:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao



On 2020/4/14 2:07 AM, Johannes Weiner wrote:
> On Mon, Apr 13, 2020 at 06:48:22PM +0800, Alex Shi wrote:
>>> In a previous review, I pointed out the following race condition
>>> between page charging and compaction:
>>>
>>> compaction:				generic_file_buffered_read:
>>>
>>> 					page_cache_alloc()
>>>
>>> !PageBuddy()
>>>
>>> lock_page_lruvec(page)
>>>   lruvec = mem_cgroup_page_lruvec()
>>>   spin_lock(&lruvec->lru_lock)
>>>   if lruvec != mem_cgroup_page_lruvec()
>>>     goto again
>>>
>>> 					add_to_page_cache_lru()
>>> 					  mem_cgroup_commit_charge()
>>> 					    page->mem_cgroup = foo
>>> 					  lru_cache_add()
>>> 					    __pagevec_lru_add()
>>> 					      SetPageLRU()
>>>
>>> if PageLRU(page):
>>>   __isolate_lru_page()
>>>
>>> As far as I can see, you have not addressed this. You have added
>>> lock_page_memcg(), but that prevents charged pages from moving between
>>> cgroups, it does not prevent newly allocated pages from being charged.
>>>
>>> It doesn't matter how many times you check the lruvec before and after
>>> locking - if you're looking at a free page, it might get allocated,
>>> charged and put on a new lruvec after you're done checking, and then
>>> you isolate a page from an unlocked lruvec.
>>>
>>> You simply cannot serialize on page->mem_cgroup->lruvec when
>>> page->mem_cgroup isn't stable. You need to serialize on the page
>>> itself, one way or another, to make this work.
>>>
>>>
>>> So here is a crazy idea that may be worth exploring:
>>>
>>> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
>>> linked list.
>>>
>>> Can we make PageLRU atomic and use it to stabilize the lru_lock
>>> instead, and then use the lru_lock only serialize list operations?
>>>
>>> I.e. in compaction, you'd do
>>>
>>> 	if (!TestClearPageLRU(page))
>>> 		goto isolate_fail;
>>> 	/*
>>> 	 * We isolated the page's LRU state and thereby locked out all
>>> 	 * other isolators, including cgroup page moving, page reclaim,
>>> 	 * page freeing etc. That means page->mem_cgroup is now stable
>>> 	 * and we can safely look up the correct lruvec and take the
>>> 	 * page off its physical LRU list.
>>> 	 */
>>> 	lruvec = mem_cgroup_page_lruvec(page);
>>> 	spin_lock_irq(&lruvec->lru_lock);
>>> 	del_page_from_lru_list(page, lruvec, page_lru(page));
>>>
>>> Putback would mostly remain the same (although you could take the
>>> PageLRU setting out of the list update locked section, as long as it's
>>> set after the page is physically linked):
>>>
>>> 	/* LRU isolation pins page->mem_cgroup */
>>> 	lruvec = mem_cgroup_page_lruvec(page)
>>> 	spin_lock_irq(&lruvec->lru_lock);
>>> 	add_page_to_lru_list(...);
>>> 	spin_unlock_irq(&lruvec->lru_lock);
>>>
>>> 	SetPageLRU(page);
>>>
>>> And you'd have to carefully review and rework other sites that rely on
>>> PageLRU: reclaim, __page_cache_release(), __activate_page() etc.
>>>
>>> Especially things like activate_page(), which used to only check
>>> PageLRU to shuffle the page on the LRU list would now have to briefly
>>> clear PageLRU and then set it again afterwards.
>>>
>>> However, aside from a bit more churn in those cases, and the
>>> unfortunate additional atomic operations, I currently can't think of a
>>> fundamental reason why this wouldn't work.
>>>
>>> Hugh, what do you think?
>>>
>>
>> Hi Johannes
>>
>> As to the idea of TestClearPageLRU, we expect the following scenario
>>     compaction                       commit_charge
>>                                      if (TestClearPageLRU)
>>         !TestClearPageLRU                 lock_page_lruvec
>>             goto isolate_fail;            del_from_lru_list
>>                                           unlock_page_lruvec
>>
>> But there is a difficult situation to handle:
>>
>>    compaction                        commit_charge
>>         TestClearPageLRU
>>                                     !TestClearPageLRU
>>
>>                                     page possible state:
>>                                     a, reclaiming, b, moving between lru list, c, migrating, like in compaction
>>                                     d, mlocking,   e, split_huge_page,
>>
>> If the page lru bit was cleared in commit_charge with lrucare,
>> we still have no idea if the page was isolated by the reason from a~e
>> or the page is never on LRU, to deal with different reasons is high cost.
>>
>> So as to the above issue you mentioned, Maybe the better idea is to
>> set lrucare when do mem_cgroup_commit_charge(), since the charge action
>> is not often. What's your idea of this solution?
> 
> Hm, yes, the lrucare scenario is a real problem. If it can isolate the
> page, fine, but if not, it changes page->mem_cgroup on a page that
> somebody else has isolated, having indeed no idea who they are and how
> they are going to access page->mem_cgroup.
> 
> Right now it's safe because of secondary protection on top of
> isolation: split_huge_page keeps the lru_lock held throughout;
> reclaim, cgroup migration, page migration, compaction etc. hold the
> page lock which locks out swapcache charging.
> 
> But it complicates the serialization model immensely and makes it
> subtle and error prone.
> 
> I'm not sure how unconditionally taking the lru_lock when charging
> would help. Can you lay out what you have in mind in prototype code,
> like I'm using below, for isolation, putback, charging, compaction?

The situation would go back to the relock scheme: in practice the lru_lock
taken in the charge path would compete with the root_memcg's lru_lock, so
there would be no need to distinguish putback, compaction etc.

But I don't know how much impact this would have on the allocation path...

compaction:				generic_file_buffered_read:
 					page_cache_alloc()

 !PageBuddy()

 lock_page_lruvec(page)
   lruvec = mem_cgroup_page_lruvec()
   spin_lock(&lruvec->lru_lock)
   if lruvec != mem_cgroup_page_lruvec()
     goto again

 					add_to_page_cache_lru()
 					  mem_cgroup_commit_charge()
					    spin_lock_irq(page->memcg->lruvec->lru_lock)
 					    page->mem_cgroup = foo
					    spin_unlock_irq(page->memcg->lruvec->lru_lock)
 					  lru_cache_add()
 					    __pagevec_lru_add()
 					      SetPageLRU()

 if PageLRU(page):
   __isolate_lru_page()
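
A minimal sketch of what that unconditional locking in commit_charge() could
look like (illustration only, not a tested patch: it omits the existing lrucare
list handling, uses the lruvec->lru_lock introduced by this series, and relies
on mem_cgroup_page_lruvec() resolving to the root lruvec before the first
charge):

	static void commit_charge(struct page *page, struct mem_cgroup *memcg)
	{
		/* lock whichever lruvec the page currently belongs to */
		struct lruvec *lruvec = mem_cgroup_page_lruvec(page,
							       page_pgdat(page));

		spin_lock_irq(&lruvec->lru_lock);
		page->mem_cgroup = memcg;
		spin_unlock_irq(&lruvec->lru_lock);
	}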

> 
> That said, charging actually is a hotpath. I'm reluctant to
> unconditionally take the LRU lock there. But if you can make things a
> lot simpler this way, it could be worth exploring.
> 
> In the PageLRU locking scheme, I can see a way to make putback safe
> wrt lrucare charging, but I'm not sure about isolation:
> 
> putback:
> lruvec = page->mem_cgroup->lruvecs[pgdat]
> spin_lock(lruvec->lru_lock)
> if lruvec != page->mem_cgroup->lruvecs[pgdat]:
>   /*
>    * commit_charge(lrucare=true) can charge an uncharged swapcache
>    * page while we had it isolated. This changes page->mem_cgroup,
>    * but it can only happen once. Look up the new cgroup.
>    */
>   spin_unlock(lruvec->lru_lock)
>   lruvec = page->mem_cgroup->lruvecs[pgdat]
>   spin_lock(lruvec->lru_lock)
> add_page_to_lru_list(page, lruvec, ...)
> SetPageLRU(page);
> spin_unlock(lruvec->lru_lock)
> 
> commit_charge:
> if (lrucare)
>   spin_lock(root_memcg->lru_lock)
>   /*
>    * If we can isolate the page, we'll move it to the new
>    * cgroup's LRU list. If somebody else has the page
>    * isolated, we need their putback to move it to the
>    * new cgroup. If they see the old cgroup - the root -
>    * they will spin until we're done and recheck.
>    */
>   if ((lru = TestClearPageLRU(page)))
>     del_page_from_lru_list()
> page->mem_cgroup = new;
> if (lrucare)
>   spin_unlock(root_memcg->lru_lock)
>   if (lru)
>     spin_lock(new->lru_lock)
>     add_page_to_lru_list()
>     spin_unlock(new->lru_lock);
>     SetPageLRU(page)
> 
> putback would need to 1) recheck once after acquiring the lock and 2)
> SetPageLRU while holding the lru_lock after all. But it works because
> we know the old cgroup: if the putback sees the old cgroup, we know
> it's the root cgroup, and we have that locked until we're done with
> the update. And if putback manages to lock the old cgroup before us,
> we will spin until the isolator is done, and then either be able to
> isolate it ourselves or, if racing with yet another isolator, hold the
> lock and delay putback until we're done.
> 
> But isolation actually needs to lock out charging, or it would operate
> on the wrong list:
> 
> isolation:                                     commit_charge:
> if (TestClearPageLRU(page))
>                                                page->mem_cgroup = new
>   // page is still physically on
>   // the root_mem_cgroup's LRU. We're
>   // updating the wrong list:
>   memcg = page->mem_cgroup
>   spin_lock(memcg->lru_lock)
>   del_page_from_lru_list(page, memcg)
>   spin_unlock(memcg->lru_lock)
> 

Yes, this is the problem I am encountering now: it leaves mem_cgroup_lru_size incorrect.

 
> lrucare really is a mess. Even before this patch series, it makes
> things tricky and subtle and error prone.
> 
> The only reason we're doing it is for when there is swapping without
> swap tracking, in which case swap readahead needs to put pages on the
> LRU but cannot charge them until we have a faulting vma later.
> 
> But it's not clear how practical such a configuration is. Both memory
> and swap are shared resources, and isolation isn't really effective
> when you restrict access to memory but then let workloads swap freely.

Yes, we haven't found a good use for MEMCG_SWAP either. And if swapping happens
often, memory from different memcgs gets swapped out to the same disk and mixed
together, which makes readahead useless.

> 
> Plus, the overhead of tracking is tiny - 512k per G of swap (0.04%).
> 
> Maybe we should just delete MEMCG_SWAP and unconditionally track swap
> entry ownership when the memory controller is enabled. I don't see a
> good reason not to, and it would simplify the entire swapin path, the
> LRU locking, and the page->mem_cgroup stabilization rules.
> 

Sorry for not following you: did you mean to just remove the MEMCG_SWAP configuration
option and keep the feature enabled by default in memcg?
That would indeed let us remove lrucare, but the PageLRU locking scheme still fails
since we cannot isolate the page during commit_charge, is that right?

Thanks a lot!
Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-13 18:07       ` Johannes Weiner
  2020-04-14  4:52         ` Alex Shi
@ 2020-04-14  8:19         ` Alex Shi
  2020-04-14 16:36           ` Johannes Weiner
  1 sibling, 1 reply; 29+ messages in thread
From: Alex Shi @ 2020-04-14  8:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao, Wei Yang



On 2020/4/14 2:07 AM, Johannes Weiner wrote:
> But isolation actually needs to lock out charging, or it would operate
> on the wrong list:
> 
> isolation:                                     commit_charge:
> if (TestClearPageLRU(page))
>                                                page->mem_cgroup = new
>   // page is still physically on
>   // the root_mem_cgroup's LRU. We're
>   // updating the wrong list:
>   memcg = page->mem_cgroup
>   spin_lock(memcg->lru_lock)
>   del_page_from_lru_list(page, memcg)
>   spin_unlock(memcg->lru_lock)
> 
> lrucare really is a mess. Even before this patch series, it makes
> things tricky and subtle and error prone.
> 
> The only reason we're doing it is for when there is swapping without
> swap tracking, in which case swap readahead needs to put pages on the
> LRU but cannot charge them until we have a faulting vma later.
> 
> But it's not clear how practical such a configuration is. Both memory
> and swap are shared resources, and isolation isn't really effective
> when you restrict access to memory but then let workloads swap freely.
> 
> Plus, the overhead of tracking is tiny - 512k per G of swap (0.04%).
> 
> Maybe we should just delete MEMCG_SWAP and unconditionally track swap
> entry ownership when the memory controller is enabled. I don't see a
> good reason not to, and it would simplify the entire swapin path, the
> LRU locking, and the page->mem_cgroup stabilization rules.

Hi Johannes,

I think what you mean here is to keep the swap_cgroup id even after the page
was swapped out; then, when we read the page back from the swap disk, we don't
need to charge it. So all other memcg charging happens while the page is not on
any LRU list, and no isolation is required in the awkward scenario above.

That sounds like a good idea. So split_huge_page and mem_cgroup_migrate should
be safe, while task cgroup migration may need to take an extra from_vec->lru_lock.
Is that right?

I'm glad to give it a try...

BTW, as for different memcgs' swapped pages getting mixed together on the swap
disk: maybe we could try Tim Chen's swap_slot approach for memcg. What do you
think?

Thanks
Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-14  4:52         ` Alex Shi
@ 2020-04-14 16:31           ` Johannes Weiner
  2020-04-15 13:42             ` Alex Shi
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2020-04-14 16:31 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao

On Tue, Apr 14, 2020 at 12:52:30PM +0800, Alex Shi wrote:
> On 2020/4/14 2:07 AM, Johannes Weiner wrote:
> > Plus, the overhead of tracking is tiny - 512k per G of swap (0.04%).
> > 
> > Maybe we should just delete MEMCG_SWAP and unconditionally track swap
> > entry ownership when the memory controller is enabled. I don't see a
> > good reason not to, and it would simplify the entire swapin path, the
> > LRU locking, and the page->mem_cgroup stabilization rules.
> > 
> 
> Sorry for not follow you up, did you mean just remove the MEMCG_SWAP configuration
> and keep the feature in default memcg? 

Yes.

> That does can remove lrucare, but PageLRU lock scheme still fails since
> we can not isolate the page during commit_charge, is that right?

No, without lrucare the scheme works. Charges usually do:

page->mem_cgroup = new;
SetPageLRU(page);

And so if you can TestClearPageLRU(), page->mem_cgroup is stable.

lrucare charging is the exception: it changes page->mem_cgroup AFTER
PageLRU has already been set, and even when it CANNOT acquire the
PageLRU lock itself. It violates the rules.

If we make MEMCG_SWAP mandatory, we always have cgroup records for
swapped out pages. That means we can charge all swapin pages
(incl. readahead pages) directly in __read_swap_cache_async(), before
setting PageLRU on the new pages.

Then we can delete lrucare.

And then TestClearPageLRU() guarantees page->mem_cgroup is stable.
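
To make that rule concrete, isolation could then look roughly like this
(sketch only, assuming PageLRU is made atomic as proposed and using the
per-lruvec lock from this series; the helper name is made up):

	/*
	 * Once every charge sets page->mem_cgroup before SetPageLRU(),
	 * clearing PageLRU pins page->mem_cgroup, so the lruvec lookup
	 * below cannot race with charging.
	 */
	static struct lruvec *isolate_page_from_lru(struct page *page)
	{
		struct lruvec *lruvec;

		if (!TestClearPageLRU(page))
			return NULL;	/* somebody else owns the LRU state */

		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irq(&lruvec->lru_lock);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		spin_unlock_irq(&lruvec->lru_lock);

		return lruvec;
	}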

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-14  8:19         ` Alex Shi
@ 2020-04-14 16:36           ` Johannes Weiner
  0 siblings, 0 replies; 29+ messages in thread
From: Johannes Weiner @ 2020-04-14 16:36 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao, Wei Yang

On Tue, Apr 14, 2020 at 04:19:01PM +0800, Alex Shi wrote:
> 
> 
> On 2020/4/14 2:07 AM, Johannes Weiner wrote:
> > But isolation actually needs to lock out charging, or it would operate
> > on the wrong list:
> > 
> > isolation:                                     commit_charge:
> > if (TestClearPageLRU(page))
> >                                                page->mem_cgroup = new
> >   // page is still physically on
> >   // the root_mem_cgroup's LRU. We're
> >   // updating the wrong list:
> >   memcg = page->mem_cgroup
> >   spin_lock(memcg->lru_lock)
> >   del_page_from_lru_list(page, memcg)
> >   spin_unlock(memcg->lru_lock)
> > 
> > lrucare really is a mess. Even before this patch series, it makes
> > things tricky and subtle and error prone.
> > 
> > The only reason we're doing it is for when there is swapping without
> > swap tracking, in which case swap readahead needs to put pages on the
> > LRU but cannot charge them until we have a faulting vma later.
> > 
> > But it's not clear how practical such a configuration is. Both memory
> > and swap are shared resources, and isolation isn't really effective
> > when you restrict access to memory but then let workloads swap freely.
> > 
> > Plus, the overhead of tracking is tiny - 512k per G of swap (0.04%).
> > 
> > Maybe we should just delete MEMCG_SWAP and unconditionally track swap
> > entry ownership when the memory controller is enabled. I don't see a
> > good reason not to, and it would simplify the entire swapin path, the
> > LRU locking, and the page->mem_cgroup stabilization rules.
> 
> Hi Johannes,
> 
> I think what you mean here is to keep swap_cgroup id even it was swaped,
> then we read back the page from swap disk, we don't need to charge it.
> So all other memcg charge are just happens on non lru list, thus we have
> no isolation required in above awkward scenario.

We don't need to change how swap recording works, we just need to
always do it when CONFIG_MEMCG && CONFIG_SWAP.

We can uncharge the page once it's swapped out. The only difference is
that with a swap record, we know who owned the page and can charge
readahead pages right away, before setting PageLRU; whereas without a
record, we read pages onto the LRU, and then wait until we hit a page
fault with an mm to charge. That's why we have this lrucare mess.
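
Purely as an illustration of that ordering (not the actual series;
mem_cgroup_charge_swapin() is a hypothetical helper standing in for "look up
the owner recorded at swapout and charge the page to it", and swap cache
insertion plus error handling are elided):

	static struct page *swapin_alloc_charged(swp_entry_t entry, gfp_t gfp,
						 struct vm_area_struct *vma,
						 unsigned long addr)
	{
		struct page *page = alloc_page_vma(gfp, vma, addr);

		if (!page)
			return NULL;

		/* hypothetical helper: charge to the memcg recorded at swapout */
		if (mem_cgroup_charge_swapin(page, entry, gfp)) {
			put_page(page);
			return NULL;
		}

		/* only now does the page become visible to the LRU/reclaim */
		lru_cache_add_anon(page);
		return page;
	}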

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-14 16:31           ` Johannes Weiner
@ 2020-04-15 13:42             ` Alex Shi
  2020-04-16  8:01               ` Alex Shi
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Shi @ 2020-04-15 13:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: cgroups, linux-kernel, linux-mm, akpm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao



On 2020/4/15 12:31 AM, Johannes Weiner wrote:
> On Tue, Apr 14, 2020 at 12:52:30PM +0800, Alex Shi wrote:
>> On 2020/4/14 2:07 AM, Johannes Weiner wrote:
>>> Plus, the overhead of tracking is tiny - 512k per G of swap (0.04%).
>>>
>>> Maybe we should just delete MEMCG_SWAP and unconditionally track swap
>>> entry ownership when the memory controller is enabled. I don't see a
>>> good reason not to, and it would simplify the entire swapin path, the
>>> LRU locking, and the page->mem_cgroup stabilization rules.
>>>
>>
>> Sorry for not follow you up, did you mean just remove the MEMCG_SWAP configuration
>> and keep the feature in default memcg? 
> 
> Yes.
> 
>> That does can remove lrucare, but PageLRU lock scheme still fails since
>> we can not isolate the page during commit_charge, is that right?
> 
> No, without lrucare the scheme works. Charges usually do:
> 
> page->mem_cgroup = new;
> SetPageLRU(page);
> 
> And so if you can TestClearPageLRU(), page->mem_cgroup is stable.
> 
> lrucare charging is the exception: it changes page->mem_cgroup AFTER
> PageLRU has already been set, and even when it CANNOT acquire the
> PageLRU lock itself. It violates the rules.
> 
> If we make MEMCG_SWAP mandatory, we always have cgroup records for
> swapped out pages. That means we can charge all swapin pages
> (incl. readahead pages) directly in __read_swap_cache_async(), before
> setting PageLRU on the new pages.
> 
> Then we can delete lrucare.
> 
> And then TestClearPageLRU() guarantees page->mem_cgroup is stable.
> 

Hi Johannes,

Thanks a lot for pointing this out!

Charging in __read_swap_cache_async would require passing arguments down through
three layers of functions, which would be a bit ugly. Compared to that, could we
move the lru_cache add to after commit_charge, like we do for ksm copied pages?

That gives the page a bit of extra time off the LRU list, but the page is only
used after add_anon_rmap is set up. Could it cause trouble?

I tried to track down the reason for the lru_cache_add here, but found no
explanation going back to the first git kernel commit.

Thanks
Alex 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-15 13:42             ` Alex Shi
@ 2020-04-16  8:01               ` Alex Shi
  2020-04-16 15:28                 ` Johannes Weiner
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Shi @ 2020-04-16  8:01 UTC (permalink / raw)
  To: Johannes Weiner, akpm
  Cc: cgroups, linux-kernel, linux-mm, mgorman, tj, hughd, khlebnikov,
	daniel.m.jordan, yang.shi, willy, shakeelb, Michal Hocko,
	Vladimir Davydov, Roman Gushchin, Chris Down, Thomas Gleixner,
	Vlastimil Babka, Qian Cai, Andrey Ryabinin, Kirill A. Shutemov,
	Jérôme Glisse, Andrea Arcangeli, David Rientjes,
	Aneesh Kumar K.V, swkhack, Potyra, Stefan, Mike Rapoport,
	Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao, Wei Yang



On 2020/4/15 9:42 PM, Alex Shi wrote:
> Hi Johannes,
> 
> Thanks a lot for point out!
> 
> Charging in __read_swap_cache_async would ask for 3 layers function arguments
> pass, that would be a bit ugly. Compare to this, could we move out the
> lru_cache add after commit_charge, like ksm copied pages?
> 
> That give a bit extra non lru list time, but the page just only be used only
> after add_anon_rmap setting. Could it cause troubles?

Hi Johannes & Andrew,

Doing lru_cache_add_anon during swapin_readahead only leaves a very short window
in which these few pages could be reclaimed.

If we delay adding these few pages to the LRU until after the vm_fault target page
has been charged to the memcg (mem_cgroup_commit_charge) and activated, we could
skip the mem_cgroup_try_charge/commit_charge/cancel_charge sequence in
__read_swap_cache_async(). The cost is that, per CPU, at most a readahead window
(SWAP_RA_ORDER_CEILING) worth of pages would miss page reclaim for a short time.
On the other hand, it saves the target vm_fault page from being reclaimed before
we activate it during that time.

Weighing the loss against the gain, and given the example of shmem/ksm copied
pages, I believe it's fine to delay the LRU list adding until the target swapin
page is activated.

Any comments are appreciated! 

Thanks
Alex


> 
> I tried to track down the reason of lru_cache_add here, but no explanation
> till first git kernel commit.
> 
> Thanks
> Alex 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-16  8:01               ` Alex Shi
@ 2020-04-16 15:28                 ` Johannes Weiner
  2020-04-16 17:47                   ` Shakeel Butt
  2020-04-17 14:39                   ` Alex Shi
  0 siblings, 2 replies; 29+ messages in thread
From: Johannes Weiner @ 2020-04-16 15:28 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, cgroups, linux-kernel, linux-mm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao, Wei Yang

Hi Alex,

On Thu, Apr 16, 2020 at 04:01:20PM +0800, Alex Shi wrote:
> 
> 
> On 2020/4/15 9:42 PM, Alex Shi wrote:
> > Hi Johannes,
> > 
> > Thanks a lot for point out!
> > 
> > Charging in __read_swap_cache_async would ask for 3 layers function arguments
> > pass, that would be a bit ugly. Compare to this, could we move out the
> > lru_cache add after commit_charge, like ksm copied pages?
> > 
> > That give a bit extra non lru list time, but the page just only be used only
> > after add_anon_rmap setting. Could it cause troubles?
> 
> Hi Johannes & Andrew,
> 
> Doing lru_cache_add_anon during swapin_readahead can give a very short timing 
> for possible page reclaiming for these few pages.
> 
> If we delay these few pages lru adding till after the vm_fault target page 
> get memcg charging(mem_cgroup_commit_charge) and activate, we could skip the 
> mem_cgroup_try_charge/commit_charge/cancel_charge process in __read_swap_cache_async().
> But the cost is maximum SWAP_RA_ORDER_CEILING number pages on each cpu miss
> page reclaiming in a short time. On the other hand, save the target vm_fault
> page from reclaiming before activate it during that time.

The readahead pages surrounding the faulting page might never get
accessed and pile up to large amounts. Users can also trigger
non-faulting readahead with MADV_WILLNEED.

So unfortunately, I don't see a way to keep these pages off the
LRU. They do need to be reclaimable, or they become a DoS vector.

I'm currently preparing a small patch series to make swap ownership
tracking an integral part of memcg and change the swapin charging
sequence, then you don't have to worry about it. This will also
unblock Joonsoo's "workingset protection/detection on the anonymous
LRU list" patch series, since he is blocked on the same problem - he
needs the correct LRU available at swapin time to process refaults
correctly. Both of your patch series are already pretty large, they
shouldn't need to also deal with that.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-16 15:28                 ` Johannes Weiner
@ 2020-04-16 17:47                   ` Shakeel Butt
  2020-04-17 13:18                     ` Alex Shi
  2020-04-17 14:39                   ` Alex Shi
  1 sibling, 1 reply; 29+ messages in thread
From: Shakeel Butt @ 2020-04-16 17:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Alex Shi, Andrew Morton, Cgroups, LKML, Linux MM, Mel Gorman,
	Tejun Heo, Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan,
	Yang Shi, Matthew Wilcox, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Chris Down, Thomas Gleixner, Vlastimil Babka,
	Qian Cai, Andrey Ryabinin, Kirill A. Shutemov,
	Jérôme Glisse, Andrea Arcangeli, David Rientjes,
	Aneesh Kumar K.V, swkhack, Potyra, Stefan, Mike Rapoport,
	Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao, Wei Yang

Hi Johannes & Alex,

On Thu, Apr 16, 2020 at 8:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hi Alex,
>
> On Thu, Apr 16, 2020 at 04:01:20PM +0800, Alex Shi wrote:
> >
> >
> > On 2020/4/15 9:42 PM, Alex Shi wrote:
> > > Hi Johannes,
> > >
> > > Thanks a lot for point out!
> > >
> > > Charging in __read_swap_cache_async would ask for 3 layers function arguments
> > > pass, that would be a bit ugly. Compare to this, could we move out the
> > > lru_cache add after commit_charge, like ksm copied pages?
> > >
> > > That give a bit extra non lru list time, but the page just only be used only
> > > after add_anon_rmap setting. Could it cause troubles?
> >
> > Hi Johannes & Andrew,
> >
> > Doing lru_cache_add_anon during swapin_readahead can give a very short timing
> > for possible page reclaiming for these few pages.
> >
> > If we delay these few pages lru adding till after the vm_fault target page
> > get memcg charging(mem_cgroup_commit_charge) and activate, we could skip the
> > mem_cgroup_try_charge/commit_charge/cancel_charge process in __read_swap_cache_async().
> > But the cost is maximum SWAP_RA_ORDER_CEILING number pages on each cpu miss
> > page reclaiming in a short time. On the other hand, save the target vm_fault
> > page from reclaiming before activate it during that time.
>
> The readahead pages surrounding the faulting page might never get
> accessed and pile up to large amounts. Users can also trigger
> non-faulting readahead with MADV_WILLNEED.
>
> So unfortunately, I don't see a way to keep these pages off the
> LRU. They do need to be reclaimable, or they become a DoS vector.
>
> I'm currently preparing a small patch series to make swap ownership
> tracking an integral part of memcg and change the swapin charging
> sequence, then you don't have to worry about it. This will also
> unblock Joonsoo's "workingset protection/detection on the anonymous
> LRU list" patch series, since he is blocked on the same problem - he
> needs the correct LRU available at swapin time to process refaults
> correctly. Both of your patch series are already pretty large, they
> shouldn't need to also deal with that.

I think this would be a very good cleanup and will make the code much
more readable. I totally agree to keep this separate from the other
work. Please do CC me the series once it's ready.

Now regarding the per-memcg LRU locks, Alex, did you get the chance to
try the workload Hugh has provided? I was planning on posting Hugh's
patch series, but Hugh advised me to wait for your and Johannes's
responses, since you both have already invested a lot of time in your
series and I do want to see what Johannes's TestClearPageLRU() idea
will look like, so I will hold off for now.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-16 17:47                   ` Shakeel Butt
@ 2020-04-17 13:18                     ` Alex Shi
  0 siblings, 0 replies; 29+ messages in thread
From: Alex Shi @ 2020-04-17 13:18 UTC (permalink / raw)
  To: Shakeel Butt, Johannes Weiner
  Cc: Andrew Morton, Cgroups, LKML, Linux MM, Mel Gorman, Tejun Heo,
	Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, Yang Shi,
	Matthew Wilcox, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Chris Down, Thomas Gleixner, Vlastimil Babka, Qian Cai,
	Andrey Ryabinin, Kirill A. Shutemov, Jérôme Glisse,
	Andrea Arcangeli, David Rientjes, Aneesh Kumar K.V, swkhack,
	Potyra, Stefan, Mike Rapoport, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan,
	Nikolay Borisov, Ira Weiny, Kirill Tkhai, Yafang Shao, Wei Yang



On 2020/4/17 1:47 AM, Shakeel Butt wrote:
> I think this would be a very good cleanup and will make the code much
> more readable. I totally agree to keep this separate from the other
> work. Please do CC me the series once it's ready.
> 
> Now regarding the per-memcg LRU locks, Alex, did you get the chance to
> try the workload Hugh has provided? I was planning of posting Hugh's
> patch series but Hugh advised me to wait for your & Johannes's
> response since you both have already invested a lot of time in your
> series and I do want to see how Johannes's TestClearPageLRU() idea
> will look like, so, I will hold off for now.
> 

Hugh Dickins's test case is great. It revealed the race on memcg charging within hours.

Thanks
Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-04-16 15:28                 ` Johannes Weiner
  2020-04-16 17:47                   ` Shakeel Butt
@ 2020-04-17 14:39                   ` Alex Shi
  1 sibling, 0 replies; 29+ messages in thread
From: Alex Shi @ 2020-04-17 14:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, cgroups, linux-kernel, linux-mm, mgorman, tj, hughd,
	khlebnikov, daniel.m.jordan, yang.shi, willy, shakeelb,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, Aneesh Kumar K.V, swkhack, Potyra, Stefan,
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao, Wei Yang



On 2020/4/16 11:28 PM, Johannes Weiner wrote:
> Hi Alex,
> 
> On Thu, Apr 16, 2020 at 04:01:20PM +0800, Alex Shi wrote:
>>
>>
>> On 2020/4/15 9:42 PM, Alex Shi wrote:
>>> Hi Johannes,
>>>
>>> Thanks a lot for point out!
>>>
>>> Charging in __read_swap_cache_async would ask for 3 layers function arguments
>>> pass, that would be a bit ugly. Compare to this, could we move out the
>>> lru_cache add after commit_charge, like ksm copied pages?
>>>
>>> That give a bit extra non lru list time, but the page just only be used only
>>> after add_anon_rmap setting. Could it cause troubles?
>>
>> Hi Johannes & Andrew,
>>
>> Doing lru_cache_add_anon during swapin_readahead can give a very short timing 
>> for possible page reclaiming for these few pages.
>>
>> If we delay these few pages lru adding till after the vm_fault target page 
>> get memcg charging(mem_cgroup_commit_charge) and activate, we could skip the 
>> mem_cgroup_try_charge/commit_charge/cancel_charge process in __read_swap_cache_async().
>> But the cost is maximum SWAP_RA_ORDER_CEILING number pages on each cpu miss
>> page reclaiming in a short time. On the other hand, save the target vm_fault
>> page from reclaiming before activate it during that time.
> 
> The readahead pages surrounding the faulting page might never get
> accessed and pile up to large amounts. Users can also trigger
> non-faulting readahead with MADV_WILLNEED.
> 
> So unfortunately, I don't see a way to keep these pages off the
> LRU. They do need to be reclaimable, or they become a DoS vector.
> 
> I'm currently preparing a small patch series to make swap ownership
> tracking an integral part of memcg and change the swapin charging
> sequence, then you don't have to worry about it. This will also
> unblock Joonsoo's "workingset protection/detection on the anonymous
> LRU list" patch series, since he is blocked on the same problem - he
> needs the correct LRU available at swapin time to process refaults
> correctly. Both of your patch series are already pretty large, they
> shouldn't need to also deal with that.
> 

That sounds great!
BTW, the swapin target page is added to the inactive_anon list and only activated
after being charged; that leaves a small window in which the page could be
reclaimed. Might it be better to activate it earlier?

Also, I have 2 cleanup patches which may or may not be useful for you. I will
send them to you. :)

Thanks
Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-04-17 14:41 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-16  3:04 [PATCH v8 00/10] per lruvec lru_lock for memcg Alex Shi
2020-01-16  3:05 ` [PATCH v8 01/10] mm/vmscan: remove unnecessary lruvec adding Alex Shi
2020-01-16  3:05 ` [PATCH v8 02/10] mm/memcg: fold lock_page_lru into commit_charge Alex Shi
2020-01-16  3:05 ` [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
2020-01-16 21:52   ` Johannes Weiner
2020-01-19 11:32     ` Alex Shi
2020-01-20 12:58     ` Alex Shi
2020-01-21 16:00       ` Johannes Weiner
2020-01-22 12:01         ` Alex Shi
2020-01-22 18:31           ` Johannes Weiner
2020-04-13 10:48     ` Alex Shi
2020-04-13 18:07       ` Johannes Weiner
2020-04-14  4:52         ` Alex Shi
2020-04-14 16:31           ` Johannes Weiner
2020-04-15 13:42             ` Alex Shi
2020-04-16  8:01               ` Alex Shi
2020-04-16 15:28                 ` Johannes Weiner
2020-04-16 17:47                   ` Shakeel Butt
2020-04-17 13:18                     ` Alex Shi
2020-04-17 14:39                   ` Alex Shi
2020-04-14  8:19         ` Alex Shi
2020-04-14 16:36           ` Johannes Weiner
2020-01-16  3:05 ` [PATCH v8 04/10] mm/lru: introduce the relock_page_lruvec function Alex Shi
2020-01-16  3:05 ` [PATCH v8 05/10] mm/mlock: optimize munlock_pagevec by relocking Alex Shi
2020-01-16  3:05 ` [PATCH v8 06/10] mm/swap: only change the lru_lock iff page's lruvec is different Alex Shi
2020-01-16  3:05 ` [PATCH v8 07/10] mm/pgdat: remove pgdat lru_lock Alex Shi
2020-01-16  3:05 ` [PATCH v8 08/10] mm/lru: revise the comments of lru_lock Alex Shi
2020-01-16  3:05 ` [PATCH v8 09/10] mm/lru: add debug checking for page memcg moving Alex Shi
2020-01-16  3:05 ` [PATCH v8 10/10] mm/memcg: add debug checking in lock_page_memcg Alex Shi
