* [PATCH 00/14] per memcg lru_lock
@ 2019-08-20  9:48 Alex Shi
  2019-08-20  9:48   ` Alex Shi
                   ` (15 more replies)
  0 siblings, 16 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi

This patchset moves lru_lock into the lruvec, giving each lruvec its own
lru_lock and thus each memcg its own lru_lock.

With this patch series, a per-memcg lru_lock eases the lru_lock
contention a lot.

In data centers, containers are widely used to deploy different kinds
of services, so multiple memcgs share the per-node pgdat->lru_lock,
which causes heavy lock contention during lru operations.
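
As a rough illustration of why splitting the lock helps, below is a
minimal userspace sketch of the same pattern: one lock per group instead
of a single global lock, so threads working on different groups stop
contending with each other. It is only an analogy for moving from the
per-node pgdat->lru_lock to a per-memcg lruvec->lru_lock; the names,
group count and workload are made up for the demo and are not kernel
code.

/* build: gcc -O2 -pthread lock-split-demo.c */
#include <pthread.h>
#include <stdio.h>

#define NGROUPS		24	/* think: one "memcg" per container */
#define NTHREADS	24
#define LOOPS		1000000L

struct group {
	pthread_mutex_t lock;	/* per-group lock, like lruvec->lru_lock */
	long counter;		/* stands in for the group's LRU lists */
};

static struct group groups[NGROUPS];

static void *worker(void *arg)
{
	long id = (long)arg;
	struct group *g = &groups[id % NGROUPS];
	long i;

	/* each thread only takes its own group's lock */
	for (i = 0; i < LOOPS; i++) {
		pthread_mutex_lock(&g->lock);
		g->counter++;
		pthread_mutex_unlock(&g->lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	long total = 0;
	long i;

	for (i = 0; i < NGROUPS; i++)
		pthread_mutex_init(&groups[i].lock, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	for (i = 0; i < NGROUPS; i++)
		total += groups[i].counter;
	printf("total = %ld\n", total);
	return 0;
}

With a single shared mutex in place of groups[id % NGROUPS].lock, all
threads serialize on one lock. In the kernel the grouping key is the
page's memcg (per node), and the per-group lock is the lruvec->lru_lock
introduced in patch 01.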

On my 2-socket, 6-cores-per-socket E5-2630 platform, 24 containers run
aim9 simultaneously with this mmtests config:
        # AIM9
        export AIM9_TESTTIME=180
        export AIM9_TESTLIST=page_test,brk_test

perf lock report shows heavy contention on lru_lock in a 20-second snapshot:
                        Name   acquired  contended   avg wait (ns) total wait (ns)   max wait (ns)   min wait (ns)
        &(ptlock_ptr(pag...         22          0               0       0               0               0
        ...
        &(&pgdat->lru_lo...          9          7           12728       89096           26656            1597

With this patch series, lruvec->lru_lock shows no contention:
        &(&lruvec->lru_l...          8          0               0       0               0               0
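
(These snapshots are from perf's lock profiling, i.e. perf lock record
followed by perf lock report; the exact options used for the 20-second
system-wide capture are not shown here.)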

Aim9 page_test/brk_test performance increased by 5%~50%.
Detailed results are in aim9-pft.compare.log if needed; all containers'
numbers improved and are pretty steady.

$for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i page_test ============"; cat aim9-pft.compare.log | grep "^$i.*page_test" | awk 'BEGIN {a=b=0;}  { a+=$3; b+=$6 } END { print "5.3-rc4          " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
========= Max page_test ============
5.3-rc4          34729.6
5.3-rc4+lru_lock 36128.3
========= Min page_test ============
5.3-rc4          33644.2
5.3-rc4+lru_lock 35349.7
========= Hmean page_test ============
5.3-rc4          34355.4
5.3-rc4+lru_lock 35810.9
========= Stddev page_test ============
5.3-rc4          319.757
5.3-rc4+lru_lock 223.324
========= CoeffVar page_test ============
5.3-rc4          0.93125
5.3-rc4+lru_lock 0.623333
========= BHmean-50 page_test ============
5.3-rc4          34579.2
5.3-rc4+lru_lock 35977.1
========= BHmean-95 page_test ============
5.3-rc4          34421.7
5.3-rc4+lru_lock 35853.6
========= BHmean-99 page_test ============
5.3-rc4          34421.7
5.3-rc4+lru_lock 35853.6

$for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i brk_test ============"; cat aim9-pft.compare.log | grep "^$i.*brk_test" | awk 'BEGIN {a=b=0;}  { a+=$3; b+=$6 } END { print "5.3-rc4          " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
========= Max brk_test ============
5.3-rc4          96647.7
5.3-rc4+lru_lock 98960.3
========= Min brk_test ============
5.3-rc4          91800.8
5.3-rc4+lru_lock 96817.6
========= Hmean brk_test ============
5.3-rc4          95470
5.3-rc4+lru_lock 97769.6
========= Stddev brk_test ============
5.3-rc4          1253.52
5.3-rc4+lru_lock 596.593
========= CoeffVar brk_test ============
5.3-rc4          1.31375
5.3-rc4+lru_lock 0.609583
========= BHmean-50 brk_test ============
5.3-rc4          96141.4
5.3-rc4+lru_lock 98194
========= BHmean-95 brk_test ============
5.3-rc4          95818.5
5.3-rc4+lru_lock 97857.2
========= BHmean-99 brk_test ============
5.3-rc4          95818.5
5.3-rc4+lru_lock 97857.2

Alex Shi (14):
  mm/lru: move pgdat lru_lock into lruvec
  lru/memcg: move the lruvec->pgdat sync out lru_lock
  lru/memcg: using per lruvec lock in un/lock_page_lru
  lru/compaction: use per lruvec lock in isolate_migratepages_block
  lru/huge_page: use per lruvec lock in __split_huge_page
  lru/mlock: using per lruvec lock in munlock
  lru/swap: using per lruvec lock in page_cache_release
  lru/swap: use lruvec lock in activate_page
  lru/swap: use per lruvec lock in pagevec_lru_move_fn
  lru/swap: use per lruvec lock in release_pages
  lru/vmscan: using per lruvec lock in lists shrinking.
  lru/vmscan: use per lruvec lock in check_move_unevictable_pages
  lru/vmscan: using per lruvec lru_lock in get_scan_count
  mm/lru: fix the comments of lru_lock

 include/linux/memcontrol.h | 24 ++++++++++----
 include/linux/mm_types.h   |  2 +-
 include/linux/mmzone.h     |  6 ++--
 mm/compaction.c            | 48 +++++++++++++++++-----------
 mm/filemap.c               |  4 +--
 mm/huge_memory.c           |  9 ++++--
 mm/memcontrol.c            | 24 ++++++--------
 mm/mlock.c                 | 35 ++++++++++----------
 mm/mmzone.c                |  1 +
 mm/page_alloc.c            |  1 -
 mm/page_idle.c             |  4 +--
 mm/rmap.c                  |  2 +-
 mm/swap.c                  | 79 +++++++++++++++++++++++++---------------------
 mm/vmscan.c                | 63 ++++++++++++++++++------------------
 14 files changed, 166 insertions(+), 136 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 01/14] mm/lru: move pgdat lru_lock into lruvec
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
@ 2019-08-20  9:48   ` Alex Shi
  2019-08-20  9:48 ` [PATCH 02/14] lru/memcg: move the lruvec->pgdat sync out lru_lock Alex Shi
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Vlastimil Babka, Dan Williams, Oscar Salvador, Wei Yang,
	Pavel Tatashin, Arun KS, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Hugh Dickins, Jérôme Glisse,
	Andrea Arcangeli, Aneesh Kumar K.V, David Rientjes,
	Souptick Joarder, swkhack, Potyra, Stefan, Mike Rapoport,
	Alexander Duyck, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Matthew Wilcox, Peng Fan,
	Ira Weiny, Kirill Tkhai, Daniel Jordan, Yafang Shao, Yang Shi

This patch moves lru_lock into the lruvec, giving each lruvec its own
lru_lock and thus each memcg its own lru_lock.

With this patch series, a per-memcg lru_lock eases the lru_lock
contention a lot.

In data centers, containers are widely used to deploy different kinds
of services, so multiple memcgs share the per-node pgdat->lru_lock,
which causes heavy lock contention during lru operations.
On my 2-socket, 6-cores-per-socket E5-2630 platform, 24 containers run
aim9 simultaneously with this mmtests config:
	# AIM9
	export AIM9_TESTTIME=180
	export AIM9_TESTLIST=page_test,brk_test

perf lock report shows heavy contention on lru_lock in a 20-second snapshot:
        	        Name   acquired  contended   avg wait (ns) total wait (ns)   max wait (ns)   min wait (ns)
	&(ptlock_ptr(pag...         22          0               0	0               0               0
	...
	&(&pgdat->lru_lo...          9          7           12728	89096           26656            1597

With this patch series, lruvec->lru_lock shows no contention:
	&(&lruvec->lru_l...          8          0               0	0               0               0

and aim9 page_test/brk_test performance increased by 5%~50%.

For now this patch still uses the per-pgdat lru_lock (now embedded in
the node's lruvec), so there are no functional changes yet.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: swkhack <swkhack@gmail.com>
Cc: "Potyra, Stefan" <Stefan.Potyra@elektrobit.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmzone.h |  4 +++-
 mm/compaction.c        | 10 +++++-----
 mm/huge_memory.c       |  6 +++---
 mm/memcontrol.c        |  6 +++---
 mm/mlock.c             | 10 +++++-----
 mm/mmzone.c            |  1 +
 mm/page_alloc.c        |  1 -
 mm/page_idle.c         |  4 ++--
 mm/swap.c              | 28 ++++++++++++++--------------
 mm/vmscan.c            | 38 +++++++++++++++++++-------------------
 10 files changed, 55 insertions(+), 53 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d77d717c620c..8d0076d084be 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -295,6 +295,9 @@ struct zone_reclaim_stat {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
+	/* move lru_lock to per lruvec for memcg */
+	spinlock_t			lru_lock;
+
 	struct zone_reclaim_stat	reclaim_stat;
 	/* Evictions & activations on the inactive file list */
 	atomic_long_t			inactive_age;
@@ -744,7 +747,6 @@ struct zonelist {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..9a737f343183 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -846,7 +846,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * a fatal signal is pending.
 		 */
 		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
+		    && compact_unlock_should_abort(&pgdat->lruvec.lru_lock,
 					    flags, &locked, cc)) {
 			low_pfn = 0;
 			goto fatal_pending;
@@ -919,7 +919,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
+					spin_unlock_irqrestore(&pgdat->lruvec.lru_lock,
 									flags);
 					locked = false;
 				}
@@ -949,7 +949,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
+			locked = compact_lock_irqsave(&pgdat->lruvec.lru_lock,
 								&flags, cc);
 
 			/* Try get exclusive access under lock */
@@ -1016,7 +1016,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 */
 		if (nr_isolated) {
 			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+				spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
 				locked = false;
 			}
 			putback_movable_pages(&cc->migratepages);
@@ -1043,7 +1043,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_abort:
 	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 738065f765ab..3a483deee807 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2529,7 +2529,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_unlock(&head->mapping->i_pages);
 	}
 
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
 
 	remap_page(head);
 
@@ -2740,7 +2740,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		lru_add_drain();
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
+	spin_lock_irqsave(&pgdata->lruvec.lru_lock, flags);
 
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
@@ -2785,7 +2785,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&pgdata->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		spin_unlock_irqrestore(&pgdata->lruvec.lru_lock, flags);
 		remap_page(head);
 		ret = -EBUSY;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f5c0c517c49..2792b8ed405f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2557,7 +2557,7 @@ static void lock_page_lru(struct page *page, int *isolated)
 {
 	pg_data_t *pgdat = page_pgdat(page);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&pgdat->lruvec.lru_lock);
 	if (PageLRU(page)) {
 		struct lruvec *lruvec;
 
@@ -2581,7 +2581,7 @@ static void unlock_page_lru(struct page *page, int isolated)
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&pgdat->lruvec.lru_lock);
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
@@ -2901,7 +2901,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * pgdat->lruvec.lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index a90099da4fb4..1279684bada0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -194,7 +194,7 @@ unsigned int munlock_vma_page(struct page *page)
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&pgdat->lruvec.lru_lock);
 
 	if (!TestClearPageMlocked(page)) {
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -206,14 +206,14 @@ unsigned int munlock_vma_page(struct page *page)
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
 	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&pgdat->lruvec.lru_lock);
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&pgdat->lruvec.lru_lock);
 
 out:
 	return nr_pages - 1;
@@ -298,7 +298,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
+	spin_lock_irq(&zone->zone_pgdat->lruvec.lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
@@ -325,7 +325,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		pvec->pages[i] = NULL;
 	}
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	spin_unlock_irq(&zone->zone_pgdat->lruvec.lru_lock);
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..1b07dcaabbd7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6677,7 +6677,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(node_lruvec(pgdat));
 }
 
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..420bc0ac8c1e 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -42,12 +42,12 @@ static struct page *page_idle_get_page(unsigned long pfn)
 		return NULL;
 
 	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&pgdat->lruvec.lru_lock);
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&pgdat->lruvec.lru_lock);
 	return page;
 }
 
diff --git a/mm/swap.c b/mm/swap.c
index ae300397dfda..63f4782af57a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -63,12 +63,12 @@ static void __page_cache_release(struct page *page)
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
+		spin_lock_irqsave(&pgdat->lruvec.lru_lock, flags);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
 	}
 	__ClearPageWaiters(page);
 	mem_cgroup_uncharge(page);
@@ -201,16 +201,16 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 		if (pagepgdat != pgdat) {
 			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+				spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
 			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+			spin_lock_irqsave(&pgdat->lruvec.lru_lock, flags);
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec, arg);
 	}
 	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -326,9 +326,9 @@ void activate_page(struct page *page)
 	pg_data_t *pgdat = page_pgdat(page);
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&pgdat->lruvec.lru_lock);
 	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&pgdat->lruvec.lru_lock);
 }
 #endif
 
@@ -733,7 +733,7 @@ void release_pages(struct page **pages, int nr)
 		 * same pgdat. The lock is held only if pgdat != NULL.
 		 */
 		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+			spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
 			locked_pgdat = NULL;
 		}
 
@@ -742,7 +742,7 @@ void release_pages(struct page **pages, int nr)
 
 		if (is_zone_device_page(page)) {
 			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+				spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock,
 						       flags);
 				locked_pgdat = NULL;
 			}
@@ -762,7 +762,7 @@ void release_pages(struct page **pages, int nr)
 
 		if (PageCompound(page)) {
 			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+				spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
 				locked_pgdat = NULL;
 			}
 			__put_compound_page(page);
@@ -774,11 +774,11 @@ void release_pages(struct page **pages, int nr)
 
 			if (pgdat != locked_pgdat) {
 				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+					spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock,
 									flags);
 				lock_batch = 0;
 				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				spin_lock_irqsave(&locked_pgdat->lruvec.lru_lock, flags);
 			}
 
 			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
@@ -794,7 +794,7 @@ void release_pages(struct page **pages, int nr)
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+		spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -832,7 +832,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (!list)
 		SetPageLRU(page_tail);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c77d1e3761a7..c7a228525df0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1805,7 +1805,7 @@ int isolate_lru_page(struct page *page)
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&pgdat->lru_lock);
+		spin_lock_irq(&pgdat->lruvec.lru_lock);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
@@ -1814,7 +1814,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&pgdat->lruvec.lru_lock);
 	}
 	return ret;
 }
@@ -1890,9 +1890,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		if (unlikely(!page_evictable(page))) {
 			list_del(&page->lru);
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&pgdat->lruvec.lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&pgdat->lruvec.lru_lock);
 			continue;
 		}
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
@@ -1910,10 +1910,10 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&pgdat->lruvec.lru_lock);
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&pgdat->lruvec.lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		} else {
@@ -1976,7 +1976,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&pgdat->lruvec.lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1988,7 +1988,7 @@ static int current_may_throttle(void)
 	if (global_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&pgdat->lruvec.lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1996,7 +1996,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&pgdat->lruvec.lru_lock);
 
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (global_reclaim(sc))
@@ -2009,7 +2009,7 @@ static int current_may_throttle(void)
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&pgdat->lruvec.lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_unref_page_list(&page_list);
@@ -2062,7 +2062,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&pgdat->lruvec.lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2073,7 +2073,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&pgdat->lruvec.lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2119,7 +2119,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&pgdat->lruvec.lru_lock);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -2137,7 +2137,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&pgdat->lruvec.lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2373,7 +2373,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&pgdat->lruvec.lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2394,7 +2394,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&pgdat->lruvec.lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
@@ -4263,9 +4263,9 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		pgscanned++;
 		if (pagepgdat != pgdat) {
 			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&pgdat->lruvec.lru_lock);
 			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&pgdat->lruvec.lru_lock);
 		}
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
@@ -4286,7 +4286,7 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 	if (pgdat) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&pgdat->lruvec.lru_lock);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 02/14] lru/memcg: move the lruvec->pgdat sync out lru_lock
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
  2019-08-20  9:48   ` Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 03/14] lru/memcg: using per lruvec lock in un/lock_page_lru Alex Shi
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Shakeel Butt, Chris Down, Kirill Tkhai,
	Thomas Gleixner

We are going to move the lruvec lookup out from under lru_lock; the only
unsafe part is the lruvec->pgdat syncing done during memory node hot
plugging.

Split out the lruvec->pgdat assignment now; later it will be put under
the lruvec lru_lock's protection.

No functional changes in this patch yet.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 24 +++++++++++++++++-------
 mm/memcontrol.c            |  8 +-------
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2cd4359cb38c..95b3d9885ab6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -359,6 +359,17 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 	return memcg->nodeinfo[nid];
 }
 
+static void sync_lruvec_pgdat(struct lruvec *lruvec, struct pglist_data *pgdat)
+{
+	/*
+	 * Since a node can be onlined after the mem_cgroup was created,
+	 * we have to be prepared to initialize lruvec->pgdat here;
+	 * and if offlined then reonlined, we need to reinitialize it.
+	 */
+	if (!mem_cgroup_disabled() && unlikely(lruvec->pgdat != pgdat))
+		lruvec->pgdat = pgdat;
+}
+
 /**
  * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
  * @node: node of the wanted lruvec
@@ -382,13 +393,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 	mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
 	lruvec = &mz->lruvec;
 out:
-	/*
-	 * Since a node can be onlined after the mem_cgroup was created,
-	 * we have to be prepared to initialize lruvec->pgdat here;
-	 * and if offlined then reonlined, we need to reinitialize it.
-	 */
-	if (unlikely(lruvec->pgdat != pgdat))
-		lruvec->pgdat = pgdat;
+	sync_lruvec_pgdat(lruvec, pgdat);
 	return lruvec;
 }
 
@@ -857,6 +862,11 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 {
 }
 
+static inline void sync_lruvec_pgdat(struct lruvec *lruvec,
+						struct pglist_data *pgdat)
+{
+}
+
 static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 				struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2792b8ed405f..e8a1b0d95ba8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1257,13 +1257,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	mz = mem_cgroup_page_nodeinfo(memcg, page);
 	lruvec = &mz->lruvec;
 out:
-	/*
-	 * Since a node can be onlined after the mem_cgroup was created,
-	 * we have to be prepared to initialize lruvec->zone here;
-	 * and if offlined then reonlined, we need to reinitialize it.
-	 */
-	if (unlikely(lruvec->pgdat != pgdat))
-		lruvec->pgdat = pgdat;
+	sync_lruvec_pgdat(lruvec, pgdat);
 	return lruvec;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 03/14] lru/memcg: using per lruvec lock in un/lock_page_lru
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
  2019-08-20  9:48   ` Alex Shi
  2019-08-20  9:48 ` [PATCH 02/14] lru/memcg: move the lruvec->pgdat sync out lru_lock Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-26  8:30   ` Konstantin Khlebnikov
  2019-08-20  9:48 ` [PATCH 04/14] lru/compaction: use per lruvec lock in isolate_migratepages_block Alex Shi
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Johannes Weiner, Michal Hocko, Vladimir Davydov

Right now we repeatedly assign lruvec->pgdat in the memcg code. The
assignment in the lruvec lookup functions will be removed once every
call site is protected.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8a1b0d95ba8..19fd911e8098 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2550,12 +2550,12 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 static void lock_page_lru(struct page *page, int *isolated)
 {
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-	spin_lock_irq(&pgdat->lruvec.lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
+	sync_lruvec_pgdat(lruvec, pgdat);
 	if (PageLRU(page)) {
-		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		*isolated = 1;
@@ -2566,16 +2566,14 @@ static void lock_page_lru(struct page *page, int *isolated)
 static void unlock_page_lru(struct page *page, int isolated)
 {
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 	if (isolated) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		SetPageLRU(page);
 		add_page_to_lru_list(page, lruvec, page_lru(page));
 	}
-	spin_unlock_irq(&pgdat->lruvec.lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 04/14] lru/compaction: use per lruvec lock in isolate_migratepages_block
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (2 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 03/14] lru/memcg: using per lruvec lock in un/lock_page_lru Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 05/14] lru/huge_page: use per lruvec lock in __split_huge_page Alex Shi
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Vlastimil Babka, Qian Cai, Andrey Ryabinin

Use lruvec locking to replace the pgdat lru_lock, and unfold
compact_unlock_should_abort() to fit the replacement.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/compaction.c | 48 ++++++++++++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 18 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9a737f343183..8877f38410d8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -785,7 +785,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked_lruvec = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -845,11 +845,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lruvec.lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked_lruvec) {
+				spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+				locked_lruvec = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -918,10 +927,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
-				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lruvec.lru_lock,
+				if (locked_lruvec) {
+					spin_unlock_irqrestore(&locked_lruvec->lru_lock,
 									flags);
-					locked = false;
+					locked_lruvec = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -947,10 +956,14 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
 			goto isolate_fail;
 
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lruvec.lru_lock,
-								&flags, cc);
+		if (lruvec != locked_lruvec) {
+			if (compact_lock_irqsave(&lruvec->lru_lock, &flags, cc))
+				locked_lruvec = lruvec;
+
+			sync_lruvec_pgdat(lruvec, pgdat);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -974,7 +987,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			}
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, isolate_mode) != 0)
@@ -1015,9 +1027,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * page anyway.
 		 */
 		if (nr_isolated) {
-			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
-				locked = false;
+			if (locked_lruvec) {
+				spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+				locked_lruvec = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1042,8 +1054,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		low_pfn = end_pfn;
 
 isolate_abort:
-	if (locked)
-		spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
+	if (locked_lruvec)
+		spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 05/14] lru/huge_page: use per lruvec lock in __split_huge_page
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (3 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 04/14] lru/compaction: use per lruvec lock in isolate_migratepages_block Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 06/14] lru/mlock: using per lruvec lock in munlock Alex Shi
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Michal Hocko, Kirill A. Shutemov, Hugh Dickins,
	Vlastimil Babka, Jérôme Glisse, Andrea Arcangeli,
	Aneesh Kumar K.V, David Rientjes, Souptick Joarder

Use the per-lruvec lock to replace the pgdat lru_lock.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3a483deee807..9a96c0944b4d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2529,7 +2529,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_unlock(&head->mapping->i_pages);
 	}
 
-	spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 
 	remap_page(head);
 
@@ -2671,6 +2671,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
+	struct lruvec *lruvec;
 	int count, mapcount, extra_pins, ret;
 	bool mlocked;
 	unsigned long flags;
@@ -2739,8 +2740,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	if (mlocked)
 		lru_add_drain();
 
+	lruvec = mem_cgroup_page_lruvec(head, pgdata);
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lruvec.lru_lock, flags);
+	spin_lock_irqsave(&lruvec->lru_lock, flags);
+	sync_lruvec_pgdat(lruvec, pgdata);
 
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
@@ -2785,7 +2788,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&pgdata->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lruvec.lru_lock, flags);
+		spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 		remap_page(head);
 		ret = -EBUSY;
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 06/14] lru/mlock: using per lruvec lock in munlock
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (4 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 05/14] lru/huge_page: use per lruvec lock in __split_huge_page Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 07/14] lru/swap: using per lruvec lock in page_cache_release Alex Shi
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Vlastimil Babka, Michal Hocko, Andrey Ryabinin,
	swkhack, Potyra, Stefan

This patch uses a separate lruvec lock for each of the pages in
__munlock_pagevec().

It also passes the lruvec to __munlock_isolate_lru_page() as a
parameter to avoid repeating the lruvec lookup.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: swkhack <swkhack@gmail.com>
Cc: "Potyra, Stefan" <Stefan.Potyra@elektrobit.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 1279684bada0..9915968d490a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page)
  * Isolate a page from LRU with optional get_page() pin.
  * Assumes lru_lock already held and page already pinned.
  */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+static bool __munlock_isolate_lru_page(struct page *page,
+			struct lruvec *lruvec, bool getpage)
 {
 	if (PageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
 		ClearPageLRU(page);
@@ -183,6 +181,9 @@ unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
+
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -194,7 +195,8 @@ unsigned int munlock_vma_page(struct page *page)
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
-	spin_lock_irq(&pgdat->lruvec.lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
+	sync_lruvec_pgdat(lruvec, pgdat);
 
 	if (!TestClearPageMlocked(page)) {
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -205,15 +207,15 @@ unsigned int munlock_vma_page(struct page *page)
 	nr_pages = hpage_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&pgdat->lruvec.lru_lock);
+	if (__munlock_isolate_lru_page(page, lruvec, true)) {
+		spin_unlock_irq(&lruvec->lru_lock);
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&pgdat->lruvec.lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 out:
 	return nr_pages - 1;
@@ -291,28 +293,30 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
 	int i;
 	int nr = pagevec_count(pvec);
-	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lruvec.lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
+		pg_data_t *pgdat = page_pgdat(page);
+		struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
+		spin_lock_irq(&lruvec->lru_lock);
+		sync_lruvec_pgdat(lruvec, pgdat);
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, false))
+			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+				__mod_zone_page_state(zone, NR_MLOCK,  -1);
+				spin_unlock_irq(&lruvec->lru_lock);
 				continue;
-			else
+			} else
 				__munlock_isolation_failed(page);
-		} else {
-			delta_munlocked++;
 		}
 
 		/*
@@ -323,9 +327,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 */
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
+		spin_unlock_irq(&lruvec->lru_lock);
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lruvec.lru_lock);
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 07/14] lru/swap: using per lruvec lock in page_cache_release
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (5 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 06/14] lru/mlock: using per lruvec lock in munlock Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 08/14] lru/swap: uer lruvec lock in activate_page Alex Shi
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Michal Hocko, Vlastimil Babka, Jason Gunthorpe,
	Matthew Wilcox, Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov,
	Ira Weiny, Andrey Ryabinin

Also take care of the lruvec->pgdat syncing.
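
For readers entering the series at this patch: sync_lruvec_pgdat() is
introduced earlier in the series (patch 02, not shown here). A minimal sketch
of what such a helper could look like, assuming its only job is to record the
owning node in the lruvec once the lru_lock is held; the real definition may
differ:

/*
 * Assumed sketch of the helper from patch 02, not the actual definition:
 * remember which node this lruvec belongs to so lruvec_pgdat() keeps
 * returning the right pgdat for memcg lruvecs.
 */
static inline void sync_lruvec_pgdat(struct lruvec *lruvec,
				     struct pglist_data *pgdat)
{
	if (!mem_cgroup_disabled())
		lruvec->pgdat = pgdat;
}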

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 63f4782af57a..2a8fe6df08fc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -63,12 +63,13 @@ static void __page_cache_release(struct page *page)
 		struct lruvec *lruvec;
 		unsigned long flags;
 
-		spin_lock_irqsave(&pgdat->lruvec.lru_lock, flags);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		spin_lock_irqsave(&lruvec->lru_lock, flags);
+		sync_lruvec_pgdat(lruvec, pgdat);
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
+		spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 	}
 	__ClearPageWaiters(page);
 	mem_cgroup_uncharge(page);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 08/14] lru/swap: uer lruvec lock in activate_page
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (6 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 07/14] lru/swap: using per lruvec lock in page_cache_release Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 09/14] lru/swap: uer per lruvec lock in pagevec_lru_move_fn Alex Shi
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Michal Hocko, Vlastimil Babka, Jason Gunthorpe,
	Matthew Wilcox, Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov,
	Ira Weiny, Andrey Ryabinin

Use the per-lruvec lock in activate_page() to replace the pgdat lru_lock.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 2a8fe6df08fc..d2dad08fcfd0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -325,11 +325,14 @@ static inline void activate_page_drain(int cpu)
 void activate_page(struct page *page)
 {
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lruvec.lru_lock);
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	sync_lruvec_pgdat(lruvec, pgdat);
 	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
-	spin_unlock_irq(&pgdat->lruvec.lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 }
 #endif
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 09/14] lru/swap: uer per lruvec lock in pagevec_lru_move_fn
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (7 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 08/14] lru/swap: uer lruvec lock in activate_page Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 10/14] lru/swap: use per lruvec lock in release_pages Alex Shi
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Michal Hocko, Vlastimil Babka, Jason Gunthorpe,
	Matthew Wilcox, Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov,
	Ira Weiny, Andrey Ryabinin

Use the per-lruvec lock in pagevec_lru_move_fn() to replace the pgdat lru_lock.
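
As an illustration of the new locking scope, here is a hypothetical move
callback and drain helper (example_move_tail_fn() and example_drain() are
made up for this sketch, they are not part of the patch): after this change
the callback runs with the page's own lruvec->lru_lock held rather than the
node-wide lock.

/*
 * Hypothetical example callback: rotate each page to the tail of its
 * current LRU list.  pagevec_lru_move_fn() now acquires the page's
 * lruvec->lru_lock before invoking it.
 */
static void example_move_tail_fn(struct page *page,
				 struct lruvec *lruvec, void *arg)
{
	if (PageLRU(page) && !PageUnevictable(page))
		list_move_tail(&page->lru, &lruvec->lists[page_lru(page)]);
}

static void example_drain(struct pagevec *pvec)
{
	if (pagevec_count(pvec))
		pagevec_lru_move_fn(pvec, example_move_tail_fn, NULL);
}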

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index d2dad08fcfd0..24a2b3456e10 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -192,26 +192,27 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void *arg)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *locked_lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lruvec.lru_lock, flags);
+		struct pglist_data *pgdat = page_pgdat(page);
+		struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+		if (locked_lruvec != lruvec) {
+			if (locked_lruvec)
+				spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+			locked_lruvec = lruvec;
+			spin_lock_irqsave(&lruvec->lru_lock, flags);
+			sync_lruvec_pgdat(lruvec, pgdat);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec, arg);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lruvec.lru_lock, flags);
+	if (locked_lruvec)
+		spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 10/14] lru/swap: use per lruvec lock in release_pages
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (8 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 09/14] lru/swap: uer per lruvec lock in pagevec_lru_move_fn Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 11/14] lru/vmscan: using per lruvec lock in lists shrinking Alex Shi
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Michal Hocko, Vlastimil Babka, Jason Gunthorpe,
	Matthew Wilcox, Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov,
	Ira Weiny, Andrey Ryabinin

Replace pgdat lru_lock with lruvec lru_lock.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 24a2b3456e10..798bffe7875d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -724,8 +724,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *locked_lruvec = NULL;
 	unsigned long uninitialized_var(flags);
 	unsigned int uninitialized_var(lock_batch);
 
@@ -737,19 +736,19 @@ void release_pages(struct page **pages, int nr)
 		 * excessive with a continuous string of pages from the
 		 * same pgdat. The lock is held only if pgdat != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
-			locked_pgdat = NULL;
+		if (locked_lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+			locked_lruvec = NULL;
 		}
 
 		if (is_huge_zero_page(page))
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock,
+			if (locked_lruvec) {
+				spin_unlock_irqrestore(&locked_lruvec->lru_lock,
 						       flags);
-				locked_pgdat = NULL;
+				locked_lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -766,27 +765,30 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
-				locked_pgdat = NULL;
+			if (locked_lruvec) {
+				spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
+				locked_lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
+			struct lruvec *lruvec;
 			struct pglist_data *pgdat = page_pgdat(page);
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock,
+			lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+			if (lruvec != locked_lruvec) {
+				if (locked_lruvec)
+					spin_unlock_irqrestore(&locked_lruvec->lru_lock,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lruvec.lru_lock, flags);
+				locked_lruvec = lruvec;
+				spin_lock_irqsave(&locked_lruvec->lru_lock, flags);
+				sync_lruvec_pgdat(lruvec, pgdat);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -798,8 +800,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lruvec.lru_lock, flags);
+	if (locked_lruvec)
+		spin_unlock_irqrestore(&locked_lruvec->lru_lock, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 11/14] lru/vmscan: using per lruvec lock in lists shrinking.
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (9 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 10/14] lru/swap: use per lruvec lock in release_pages Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 12/14] lru/vmscan: use pre lruvec lock in check_move_unevictable_pages Alex Shi
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Michal Hocko, Kirill Tkhai, Johannes Weiner,
	Daniel Jordan, Yafang Shao, Yang Shi

The involved functions include isolate_lru_page(), move_pages_to_lru()
and shrink_inactive/active_list().

The now-unnecessary pgdat usage is removed accordingly.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmscan.c | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c7a228525df0..defc2c4778eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1805,8 +1805,9 @@ int isolate_lru_page(struct page *page)
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 
-		spin_lock_irq(&pgdat->lruvec.lru_lock);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		spin_lock_irq(&lruvec->lru_lock);
+		sync_lruvec_pgdat(lruvec, pgdat);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
 			get_page(page);
@@ -1814,7 +1815,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
-		spin_unlock_irq(&pgdat->lruvec.lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	}
 	return ret;
 }
@@ -1879,7 +1880,6 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
@@ -1890,12 +1890,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		if (unlikely(!page_evictable(page))) {
 			list_del(&page->lru);
-			spin_unlock_irq(&pgdat->lruvec.lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lruvec.lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		SetPageLRU(page);
 		lru = page_lru(page);
@@ -1910,10 +1909,10 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lruvec.lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				mem_cgroup_uncharge(page);
 				(*get_compound_page_dtor(page))(page);
-				spin_lock_irq(&pgdat->lruvec.lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 		} else {
@@ -1976,7 +1975,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lruvec.lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1988,7 +1987,7 @@ static int current_may_throttle(void)
 	if (global_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
-	spin_unlock_irq(&pgdat->lruvec.lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1996,7 +1995,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lruvec.lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (global_reclaim(sc))
@@ -2009,7 +2008,7 @@ static int current_may_throttle(void)
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 
-	spin_unlock_irq(&pgdat->lruvec.lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
 	free_unref_page_list(&page_list);
@@ -2062,7 +2061,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lruvec.lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2073,7 +2072,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lruvec.lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2119,7 +2118,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lruvec.lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	/*
 	 * Count referenced pages from currently used mappings as rotated,
 	 * even though only some of them are actually re-activated.  This
@@ -2137,7 +2136,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lruvec.lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 12/14] lru/vmscan: use pre lruvec lock in check_move_unevictable_pages
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (10 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 11/14] lru/vmscan: using per lruvec lock in lists shrinking Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48 ` [PATCH 13/14] lru/vmscan: using per lruvec lru_lock in get_scan_count Alex Shi
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Michal Hocko, Kirill Tkhai, Johannes Weiner,
	Daniel Jordan, Yafang Shao, Yang Shi

Use the per-lruvec lock in check_move_unevictable_pages() to replace the
per-pgdat lru_lock.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmscan.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index defc2c4778eb..123447b9beda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4249,24 +4249,24 @@ int page_evictable(struct page *page)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *locked_lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct pglist_data *pgdat = page_pgdat(page);
+		struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		pgscanned++;
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lruvec.lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lruvec.lru_lock);
+		if (lruvec != locked_lruvec) {
+			if (locked_lruvec)
+				spin_unlock_irq(&locked_lruvec->lru_lock);
+			locked_lruvec = lruvec;
+			spin_lock_irq(&lruvec->lru_lock);
+			sync_lruvec_pgdat(lruvec, pgdat);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -4282,10 +4282,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		}
 	}
 
-	if (pgdat) {
+	if (locked_lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lruvec.lru_lock);
+		spin_unlock_irq(&locked_lruvec->lru_lock);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 13/14] lru/vmscan: using per lruvec lru_lock in get_scan_count
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (11 preceding siblings ...)
  2019-08-20  9:48 ` [PATCH 12/14] lru/vmscan: use pre lruvec lock in check_move_unevictable_pages Alex Shi
@ 2019-08-20  9:48 ` Alex Shi
  2019-08-20  9:48   ` Alex Shi
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Michal Hocko, Kirill Tkhai, Johannes Weiner,
	Daniel Jordan, Yafang Shao, Yang Shi

The lruvec is passed in as a parameter, so no lruvec->pgdat syncing is needed.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 123447b9beda..ea5c2f3f2567 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2372,7 +2372,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
 		lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
 
-	spin_lock_irq(&pgdat->lruvec.lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
 		reclaim_stat->recent_scanned[0] /= 2;
 		reclaim_stat->recent_rotated[0] /= 2;
@@ -2393,7 +2393,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;
-	spin_unlock_irq(&pgdat->lruvec.lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	fraction[0] = ap;
 	fraction[1] = fp;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 14/14] mm/lru: fix the comments of lru_lock
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
@ 2019-08-20  9:48   ` Alex Shi
  2019-08-20  9:48 ` [PATCH 02/14] lru/memcg: move the lruvec->pgdat sync out lru_lock Alex Shi
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Ira Weiny, Jesper Dangaard Brouer, Andrey Ryabinin, Jann Horn,
	Logan Gunthorpe, Souptick Joarder, Ralph Campbell,
	Tobin C. Harding, Michal Hocko, Oscar Salvador, Wei Yang,
	Johannes Weiner, Pavel Tatashin, Arun KS, Matthew Wilcox,
	Darrick J. Wong, Amir Goldstein, Dave Chinner, Josef Bacik,
	Kirill A. Shutemov, Jérôme Glisse, Mike Kravetz,
	Hugh Dickins, Kirill Tkhai, Daniel Jordan, Yafang Shao, Yang Shi

Since we changed pgdat->lru_lock to lruvec->lru_lock, the now-incorrect
comments in the code have to be fixed. Also fix some stale zone->lru_lock
comments left over from ancient times.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mm_types.h | 2 +-
 include/linux/mmzone.h   | 4 ++--
 mm/filemap.c             | 4 ++--
 mm/rmap.c                | 2 +-
 mm/vmscan.c              | 6 +++---
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..f9f990d8f08f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -79,7 +79,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock.  Sometimes used as a generic list
+			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8d0076d084be..d2f782263e42 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
@@ -295,7 +295,7 @@ struct zone_reclaim_stat {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
-	/* move lru_lock to per lruvec for memcg */
+	/* perf lruvec lru_lock for memcg */
 	spinlock_t			lru_lock;
 
 	struct zone_reclaim_stat	reclaim_stat;
diff --git a/mm/filemap.c b/mm/filemap.c
index d0cf700bf201..0a604c8284f2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -100,8 +100,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..6bee4aebced6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
  *         mapping->i_mmap_rwsem
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
+ *               lruvec->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ea5c2f3f2567..1328eb182a3e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1662,7 +1662,7 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended.  Some of the functions that
+ * lruvec->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
@@ -1864,9 +1864,9 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation.  But if
+ * appropriate to hold lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page.  It's impossible to balance
+ * should drop lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 14/14] mm/lru: fix the comments of lru_lock
@ 2019-08-20  9:48   ` Alex Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20  9:48 UTC (permalink / raw)
  To: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Alex Shi, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Ira Weiny, Jesper Dangaard Brouer, Andrey Ryabinin, Jann Horn,
	Logan Gunthorpe, Souptick Joarder, Ralph Campbell,
	Tobin C. Harding, Michal Hocko, Oscar Salvador, Wei Yang,
	Johannes Weiner, Pavel Tatashin, Arun KS, Matthew Wilcox,
	Darrick J. Wong

Since we changed pgdat->lru_lock to lruvec->lru_lock, the now-incorrect
comments in the code have to be fixed. Also fix some stale zone->lru_lock
comments left over from ancient times.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mm_types.h | 2 +-
 include/linux/mmzone.h   | 4 ++--
 mm/filemap.c             | 4 ++--
 mm/rmap.c                | 2 +-
 mm/vmscan.c              | 6 +++---
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..f9f990d8f08f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -79,7 +79,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock.  Sometimes used as a generic list
+			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8d0076d084be..d2f782263e42 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
@@ -295,7 +295,7 @@ struct zone_reclaim_stat {
 
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
-	/* move lru_lock to per lruvec for memcg */
+	/* perf lruvec lru_lock for memcg */
 	spinlock_t			lru_lock;
 
 	struct zone_reclaim_stat	reclaim_stat;
diff --git a/mm/filemap.c b/mm/filemap.c
index d0cf700bf201..0a604c8284f2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -100,8 +100,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 003377e24232..6bee4aebced6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -27,7 +27,7 @@
  *         mapping->i_mmap_rwsem
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
+ *               lruvec->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ea5c2f3f2567..1328eb182a3e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1662,7 +1662,7 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended.  Some of the functions that
+ * lruvec->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
@@ -1864,9 +1864,9 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation.  But if
+ * appropriate to hold lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page.  It's impossible to balance
+ * should drop lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (13 preceding siblings ...)
  2019-08-20  9:48   ` Alex Shi
@ 2019-08-20 10:45 ` Michal Hocko
  2019-08-20 16:48     ` Shakeel Butt
  2019-08-21 18:00 ` Daniel Jordan
  15 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2019-08-20 10:45 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman,
	Tejun Heo, Hugh Dickins

On Tue 20-08-19 17:48:23, Alex Shi wrote:
> This patchset move lru_lock into lruvec, give a lru_lock for each of
> lruvec, thus bring a lru_lock for each of memcg.
> 
> Per memcg lru_lock would ease the lru_lock contention a lot in
> this patch series.
> 
> In some data center, containers are used widely to deploy different kind
> of services, then multiple memcgs share per node pgdat->lru_lock which
> cause heavy lock contentions when doing lru operation.

Having some real-world workload numbers would be more than useful
for a non-trivial change like this. I believe Googlers have tried
something like this in the past but then didn't really have a good
example of workloads that benefit. I might misremember though. Cc Hugh.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/14] mm/lru: move pgdat lru_lock into lruvec
  2019-08-20  9:48   ` Alex Shi
@ 2019-08-20 13:40     ` Matthew Wilcox
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2019-08-20 13:40 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman,
	Tejun Heo, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Vlastimil Babka, Dan Williams, Oscar Salvador, Wei Yang,
	Pavel Tatashin, Arun KS, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Hugh Dickins, Jérôme Glisse,
	Andrea Arcangeli, Aneesh Kumar K.V, David Rientjes,
	Souptick Joarder, swkhack, Potyra, Stefan, Mike Rapoport,
	Alexander Duyck, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan, Ira Weiny,
	Kirill Tkhai, Daniel Jordan, Yafang Shao, Yang Shi

On Tue, Aug 20, 2019 at 05:48:24PM +0800, Alex Shi wrote:
> +++ b/include/linux/mmzone.h
> @@ -295,6 +295,9 @@ struct zone_reclaim_stat {
>  
>  struct lruvec {
>  	struct list_head		lists[NR_LRU_LISTS];
> +	/* move lru_lock to per lruvec for memcg */
> +	spinlock_t			lru_lock;

This comment makes no sense outside the context of this patch.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/14] mm/lru: move pgdat lru_lock into lruvec
@ 2019-08-20 13:40     ` Matthew Wilcox
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2019-08-20 13:40 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman,
	Tejun Heo, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Vlastimil Babka, Dan Williams, Oscar Salvador, Wei Yang,
	Pavel Tatashin, Arun KS, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Hugh Dickins, Jérôme Glisse

On Tue, Aug 20, 2019 at 05:48:24PM +0800, Alex Shi wrote:
> +++ b/include/linux/mmzone.h
> @@ -295,6 +295,9 @@ struct zone_reclaim_stat {
>  
>  struct lruvec {
>  	struct list_head		lists[NR_LRU_LISTS];
> +	/* move lru_lock to per lruvec for memcg */
> +	spinlock_t			lru_lock;

This comment makes no sense outside the context of this patch.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 14/14] mm/lru: fix the comments of lru_lock
  2019-08-20  9:48   ` Alex Shi
@ 2019-08-20 14:00     ` Matthew Wilcox
  -1 siblings, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2019-08-20 14:00 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman,
	Tejun Heo, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Ira Weiny, Jesper Dangaard Brouer, Andrey Ryabinin, Jann Horn,
	Logan Gunthorpe, Souptick Joarder, Ralph Campbell,
	Tobin C. Harding, Michal Hocko, Oscar Salvador, Wei Yang,
	Johannes Weiner, Pavel Tatashin, Arun KS, Darrick J. Wong,
	Amir Goldstein, Dave Chinner, Josef Bacik, Kirill A. Shutemov,
	Jérôme Glisse, Mike Kravetz, Hugh Dickins,
	Kirill Tkhai, Daniel Jordan, Yafang Shao, Yang Shi

On Tue, Aug 20, 2019 at 05:48:37PM +0800, Alex Shi wrote:
> @@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>  struct pglist_data;
>  
>  /*
> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>   * So add a wild amount of padding here to ensure that they fall into separate
>   * cachelines.  There are very few zone structures in the machine, so space
>   * consumption is not a concern here.

But after this patch series, the lru lock is no longer stored in the zone.
So this comment makes no sense.

> @@ -295,7 +295,7 @@ struct zone_reclaim_stat {
>  
>  struct lruvec {
>  	struct list_head		lists[NR_LRU_LISTS];
> -	/* move lru_lock to per lruvec for memcg */
> +	/* perf lruvec lru_lock for memcg */

What does the word 'perf' mean here?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 14/14] mm/lru: fix the comments of lru_lock
@ 2019-08-20 14:00     ` Matthew Wilcox
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2019-08-20 14:00 UTC (permalink / raw)
  To: Alex Shi
  Cc: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman,
	Tejun Heo, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Ira Weiny, Jesper Dangaard Brouer, Andrey Ryabinin, Jann Horn,
	Logan Gunthorpe, Souptick Joarder, Ralph Campbell,
	Tobin C. Harding, Michal Hocko, Oscar Salvador, Wei Yang,
	Johannes Weiner, Pavel

On Tue, Aug 20, 2019 at 05:48:37PM +0800, Alex Shi wrote:
> @@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>  struct pglist_data;
>  
>  /*
> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>   * So add a wild amount of padding here to ensure that they fall into separate
>   * cachelines.  There are very few zone structures in the machine, so space
>   * consumption is not a concern here.

But after this patch series, the lru lock is no longer stored in the zone.
So this comment makes no sense.

> @@ -295,7 +295,7 @@ struct zone_reclaim_stat {
>  
>  struct lruvec {
>  	struct list_head		lists[NR_LRU_LISTS];
> -	/* move lru_lock to per lruvec for memcg */
> +	/* perf lruvec lru_lock for memcg */

What does the word 'perf' mean here?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/14] mm/lru: move pgdat lru_lock into lruvec
  2019-08-20 13:40     ` Matthew Wilcox
@ 2019-08-20 14:11       ` Alex Shi
  -1 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20 14:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman,
	Tejun Heo, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Vlastimil Babka, Dan Williams, Oscar Salvador, Wei Yang,
	Pavel Tatashin, Arun KS, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Hugh Dickins, Jérôme Glisse,
	Andrea Arcangeli, Aneesh Kumar K.V, David Rientjes,
	Souptick Joarder, swkhack, Potyra, Stefan, Mike Rapoport,
	Alexander Duyck, Stephen Rothwell, Colin Ian King,
	Jason Gunthorpe, Mauro Carvalho Chehab, Peng Fan, Ira Weiny,
	Kirill Tkhai, Daniel Jordan, Yafang Shao, Yang Shi



On 2019/8/20 at 9:40 PM, Matthew Wilcox wrote:
> On Tue, Aug 20, 2019 at 05:48:24PM +0800, Alex Shi wrote:
>> +++ b/include/linux/mmzone.h
>> @@ -295,6 +295,9 @@ struct zone_reclaim_stat {
>>  
>>  struct lruvec {
>>  	struct list_head		lists[NR_LRU_LISTS];
>> +	/* move lru_lock to per lruvec for memcg */
>> +	spinlock_t			lru_lock;
> 
> This comment makes no sense outside the context of this patch.
> 

Right, thanks for pointing this out. Will remove it in v2.

Thanks
Alex

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 01/14] mm/lru: move pgdat lru_lock into lruvec
@ 2019-08-20 14:11       ` Alex Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20 14:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman,
	Tejun Heo, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Vlastimil Babka, Dan Williams, Oscar Salvador, Wei Yang,
	Pavel Tatashin, Arun KS, Qian Cai, Andrey Ryabinin,
	Kirill A. Shutemov, Hugh Dickins, Jérôme Glisse



On 2019/8/20 at 9:40 PM, Matthew Wilcox wrote:
> On Tue, Aug 20, 2019 at 05:48:24PM +0800, Alex Shi wrote:
>> +++ b/include/linux/mmzone.h
>> @@ -295,6 +295,9 @@ struct zone_reclaim_stat {
>>  
>>  struct lruvec {
>>  	struct list_head		lists[NR_LRU_LISTS];
>> +	/* move lru_lock to per lruvec for memcg */
>> +	spinlock_t			lru_lock;
> 
> This comment makes no sense outside the context of this patch.
> 

Right, thanks for pointing this out. Will remove it in v2.

Thanks
Alex

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 14/14] mm/lru: fix the comments of lru_lock
  2019-08-20 14:00     ` Matthew Wilcox
@ 2019-08-20 14:21       ` Alex Shi
  -1 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-20 14:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cgroups, linux-kernel, linux-mm, Andrew Morton, Mel Gorman,
	Tejun Heo, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Ira Weiny, Jesper Dangaard Brouer, Andrey Ryabinin, Jann Horn,
	Logan Gunthorpe, Souptick Joarder, Ralph Campbell,
	Tobin C. Harding, Michal Hocko, Oscar Salvador, Wei Yang,
	Johannes Weiner, Pavel Tatashin, Arun KS, Darrick J. Wong,
	Amir Goldstein, Dave Chinner, Josef Bacik, Kirill A. Shutemov,
	Jérôme Glisse, Mike Kravetz, Hugh Dickins,
	Kirill Tkhai, Daniel Jordan, Yafang Shao, Yang Shi



On 2019/8/20 at 10:00 PM, Matthew Wilcox wrote:
> On Tue, Aug 20, 2019 at 05:48:37PM +0800, Alex Shi wrote:
>> @@ -159,7 +159,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>>  struct pglist_data;
>>  
>>  /*
>> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
>> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>>   * So add a wild amount of padding here to ensure that they fall into separate
>>   * cachelines.  There are very few zone structures in the machine, so space
>>   * consumption is not a concern here.
> 
> But after this patch series, the lru lock is no longer stored in the zone.
> So this comment makes no sense.

Yes, this needs to be reconsidered here. Thanks for pointing it out.

> 
>> @@ -295,7 +295,7 @@ struct zone_reclaim_stat {
>>  
>>  struct lruvec {
>>  	struct list_head		lists[NR_LRU_LISTS];
>> -	/* move lru_lock to per lruvec for memcg */
>> +	/* perf lruvec lru_lock for memcg */
> 
> What does the word 'perf' mean here?

Sorry for the typo; it should be s/perf/per/ here.
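
For reference, the corrected lines in v2 would presumably read (assuming only
the typo fix, nothing else changed):

	/* per lruvec lru_lock for memcg */
	spinlock_t			lru_lock;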

Thanks
Alex

 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-20 10:45 ` [PATCH 00/14] per memcg lru_lock Michal Hocko
@ 2019-08-20 16:48     ` Shakeel Butt
  0 siblings, 0 replies; 43+ messages in thread
From: Shakeel Butt @ 2019-08-20 16:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alex Shi, Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman,
	Tejun Heo, Hugh Dickins

On Tue, Aug 20, 2019 at 3:45 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 20-08-19 17:48:23, Alex Shi wrote:
> > This patchset move lru_lock into lruvec, give a lru_lock for each of
> > lruvec, thus bring a lru_lock for each of memcg.
> >
> > Per memcg lru_lock would ease the lru_lock contention a lot in
> > this patch series.
> >
> > In some data center, containers are used widely to deploy different kind
> > of services, then multiple memcgs share per node pgdat->lru_lock which
> > cause heavy lock contentions when doing lru operation.
>
> Having some real world workloads numbers would be more than useful
> for a non trivial change like this. I believe googlers have tried
> something like this in the past but then didn't have really a good
> example of workloads that benefit. I might misremember though. Cc Hugh.
>

We, at Google, have been using per-memcg lru locks for more than 7
years. Per-memcg lru locks are really beneficial for providing
performance isolation if there are multiple distinct jobs/memcgs
running on large machines. We are planning to upstream our internal
implementation. I will let Hugh comment on that.

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-20 16:48     ` Shakeel Butt
@ 2019-08-20 18:24       ` Hugh Dickins
  -1 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2019-08-20 18:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alex Shi, Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman,
	Tejun Heo, Hugh Dickins, Shakeel Butt, Yu Zhao, Daniel Jordan

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2042 bytes --]

On Tue, 20 Aug 2019, Shakeel Butt wrote:
> On Tue, Aug 20, 2019 at 3:45 AM Michal Hocko <mhocko@kernel.org> wrote:
> > On Tue 20-08-19 17:48:23, Alex Shi wrote:
> > > This patchset move lru_lock into lruvec, give a lru_lock for each of
> > > lruvec, thus bring a lru_lock for each of memcg.
> > >
> > > Per memcg lru_lock would ease the lru_lock contention a lot in
> > > this patch series.
> > >
> > > In some data center, containers are used widely to deploy different kind
> > > of services, then multiple memcgs share per node pgdat->lru_lock which
> > > cause heavy lock contentions when doing lru operation.
> >
> > Having some real world workloads numbers would be more than useful
> > for a non trivial change like this. I believe googlers have tried
> > something like this in the past but then didn't have really a good
> > example of workloads that benefit. I might misremember though. Cc Hugh.
> >
> 
> We, at Google, have been using per-memcg lru locks for more than 7
> years. Per-memcg lru locks are really beneficial for providing
> performance isolation if there are multiple distinct jobs/memcgs
> running on large machines. We are planning to upstream our internal
> implementation. I will let Hugh comment on that.

Thanks for the Cc Michal.  As Shakeel says, Google prodkernel has been
using our per-memcg lru locks for 7 years or so.  Yes, we did not come
up with supporting performance data at the time of posting, nor since:
I see Alex has done much better on that (though I haven't even glanced
to see if +s are persuasive).

https://lkml.org/lkml/2012/2/20/434
was how ours was back then; some parts of that went in, then attached
lrulock417.tar is how it was the last time I rebased, to v4.17.

I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
and/or mmotm.  Then compare with what Alex has, to see if there's any
good reason to prefer one to the other: if no good reason to prefer ours,
I doubt we shall bother to repost, but just use it as basis for helping
to review or improve Alex's.

Hugh

[-- Attachment #2: per-memcg lru_lock on v4.17 --]
[-- Type: APPLICATION/x-tar, Size: 112640 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-20 18:24       ` Hugh Dickins
  (?)
@ 2019-08-21  1:21       ` Alex Shi
  -1 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-21  1:21 UTC (permalink / raw)
  To: Hugh Dickins, Michal Hocko
  Cc: Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman, Tejun Heo,
	Shakeel Butt, Yu Zhao, Daniel Jordan



> 
> Thanks for the Cc Michal.  As Shakeel says, Google prodkernel has been
> using our per-memcg lru locks for 7 years or so.  Yes, we did not come
> up with supporting performance data at the time of posting, nor since:
> I see Alex has done much better on that (though I haven't even glanced
> to see if +s are persuasive).
> 
> https://lkml.org/lkml/2012/2/20/434
> was how ours was back then; some parts of that went in, then attached
> lrulock417.tar is how it was the last time I rebased, to v4.17.
> 
> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> and/or mmotm.  Then compare with what Alex has, to see if there's any
> good reason to prefer one to the other: if no good reason to prefer ours,
> I doubt we shall bother to repost, but just use it as basis for helping
> to review or improve Alex's.
> 

Thanks to you all! Very glad to see we are working toward the same point. :)
Beyond the per-memcg lru_lock, there is still much room for tuning in lru and
page replacement. Anyway, I hope to see your updates and more review comments soon.

Thanks
Alex

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-20 18:24       ` Hugh Dickins
  (?)
  (?)
@ 2019-08-21  2:00       ` Alex Shi
  2019-08-24  1:59           ` Hugh Dickins
  -1 siblings, 1 reply; 43+ messages in thread
From: Alex Shi @ 2019-08-21  2:00 UTC (permalink / raw)
  To: Hugh Dickins, Michal Hocko
  Cc: Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman, Tejun Heo,
	Shakeel Butt, Yu Zhao, Daniel Jordan



On 2019/8/21 at 2:24 AM, Hugh Dickins wrote:
> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> and/or mmotm.  Then compare with what Alex has, to see if there's any
> good reason to prefer one to the other: if no good reason to prefer ours,
> I doubt we shall bother to repost, but just use it as basis for helping
> to review or improve Alex's.

For your review: my patchset is pretty straightforward and simple. It just uses
a per-lruvec lru_lock to replace the pgdat lru_lock where necessary, nothing more.
We could talk more after I am back at work. :)
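
Schematically, the net effect at each locking site is just the following (a
sketch based on the hunks quoted elsewhere in this thread, not an exact
excerpt from any one patch):

	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);

-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);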

Thanks a lot!
Alex

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
                   ` (14 preceding siblings ...)
  2019-08-20 10:45 ` [PATCH 00/14] per memcg lru_lock Michal Hocko
@ 2019-08-21 18:00 ` Daniel Jordan
  2019-08-22 11:56   ` Alex Shi
  15 siblings, 1 reply; 43+ messages in thread
From: Daniel Jordan @ 2019-08-21 18:00 UTC (permalink / raw)
  To: Alex Shi, cgroups, linux-kernel, linux-mm, Andrew Morton,
	Mel Gorman, Tejun Heo, Michal Hocko

Hi Alex,

On 8/20/19 5:48 AM, Alex Shi wrote:
> In some data center, containers are used widely to deploy different kind
> of services, then multiple memcgs share per node pgdat->lru_lock which
> cause heavy lock contentions when doing lru operation.
> 
> On my 2 socket * 6 cores E5-2630 platform, 24 containers run aim9
> simultaneously with mmtests' config:
>          # AIM9
>          export AIM9_TESTTIME=180
>          export AIM9_TESTLIST=page_test,brk_test
> 
> perf lock report show much contentions on lru_lock in 20 second snapshot:
>                          Name   acquired  contended   avg wait (ns) total wait (ns)   max wait (ns)   min wait (ns)
>          &(ptlock_ptr(pag...         22          0               0       0               0               0
>          ...
>          &(&pgdat->lru_lo...          9          7           12728       89096           26656            1597

This is system-wide right, not per container?  Even per container, 89 usec isn't much contention over 20 seconds.  You may want to give this a try:

   https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].

It'd be better though, as Michal suggests, to use the real workload that's causing problems.  Where are you seeing contention?

> With this patch series, lruvec->lru_lock show no contentions
>          &(&lruvec->lru_l...          8          0               0       0               0               0
> 
> and aim9 page_test/brk_test performance increased 5%~50%.

Where does the 50% number come in?  The numbers below seem to only show ~4% boost.

> BTW, Detailed results in aim9-pft.compare.log if needed,
> All containers data are increased and pretty steady.
> 
> $for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i page_test ============"; cat aim9-pft.compare.log | grep "^$i.*page_test" | awk 'BEGIN {a=b=0;}  { a+=$3; b+=$6 } END { print "5.3-rc4          " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
> ========= Max page_test ============
> 5.3-rc4          34729.6
> 5.3-rc4+lru_lock 36128.3
> ========= Min page_test ============
> 5.3-rc4          33644.2
> 5.3-rc4+lru_lock 35349.7
> ========= Hmean page_test ============
> 5.3-rc4          34355.4
> 5.3-rc4+lru_lock 35810.9
> ========= Stddev page_test ============
> 5.3-rc4          319.757
> 5.3-rc4+lru_lock 223.324
> ========= CoeffVar page_test ============
> 5.3-rc4          0.93125
> 5.3-rc4+lru_lock 0.623333
> ========= BHmean-50 page_test ============
> 5.3-rc4          34579.2
> 5.3-rc4+lru_lock 35977.1
> ========= BHmean-95 page_test ============
> 5.3-rc4          34421.7
> 5.3-rc4+lru_lock 35853.6
> ========= BHmean-99 page_test ============
> 5.3-rc4          34421.7
> 5.3-rc4+lru_lock 35853.6
> 
> $for i in Max Min Hmean Stddev CoeffVar BHmean-50 BHmean-95 BHmean-99; do echo "========= $i brk_test ============"; cat aim9-pft.compare.log | grep "^$i.*brk_test" | awk 'BEGIN {a=b=0;}  { a+=$3; b+=$6 } END { print "5.3-rc4          " a/24; print "5.3-rc4+lru_lock " b/24}' ; done
> ========= Max brk_test ============
> 5.3-rc4          96647.7
> 5.3-rc4+lru_lock 98960.3
> ========= Min brk_test ============
> 5.3-rc4          91800.8
> 5.3-rc4+lru_lock 96817.6
> ========= Hmean brk_test ============
> 5.3-rc4          95470
> 5.3-rc4+lru_lock 97769.6
> ========= Stddev brk_test ============
> 5.3-rc4          1253.52
> 5.3-rc4+lru_lock 596.593
> ========= CoeffVar brk_test ============
> 5.3-rc4          1.31375
> 5.3-rc4+lru_lock 0.609583
> ========= BHmean-50 brk_test ============
> 5.3-rc4          96141.4
> 5.3-rc4+lru_lock 98194
> ========= BHmean-95 brk_test ============
> 5.3-rc4          95818.5
> 5.3-rc4+lru_lock 97857.2
> ========= BHmean-99 brk_test ============
> 5.3-rc4          95818.5
> 5.3-rc4+lru_lock 97857.2

[1] https://lore.kernel.org/linux-mm/CABdVr8R2y9B+2zzSAT_Ve=BQCa+F+E9_kVH+C28DGpkeQitiog@mail.gmail.com/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-21 18:00 ` Daniel Jordan
@ 2019-08-22 11:56   ` Alex Shi
  2019-08-22 15:20     ` Daniel Jordan
  0 siblings, 1 reply; 43+ messages in thread
From: Alex Shi @ 2019-08-22 11:56 UTC (permalink / raw)
  To: Daniel Jordan, cgroups, linux-kernel, linux-mm, Andrew Morton,
	Mel Gorman, Tejun Heo, Michal Hocko



On 2019/8/22 at 2:00 AM, Daniel Jordan wrote:
>>
> 
> This is system-wide right, not per container?  Even per container, 89 usec isn't much contention over 20 seconds.  You may want to give this a try:

Yes, the perf lock output shows host-wide (system-wide) info.
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice> 
> It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].

Thanks for sharing. This patchset cannot help the [1] case, since for now it only relieves per-container lock contention. Yes, the readtwice case could be more sensitive to these lru_lock changes in containers; I may try it in containers with some tuning. But anyway, aim9 is also pretty good at showing the problem and the solution. :)
> 
> It'd be better though, as Michal suggests, to use the real workload that's causing problems.  Where are you seeing contention?

We repeatedly create and delete many different containers according to server load/usage, so a normal workload causes lots of page allocations and removals. aim9 reflects part of those scenarios. I don't know the DB scenario yet.

> 
>> With this patch series, lruvec->lru_lock show no contentions
>>          &(&lruvec->lru_l...          8          0               0       0               0               0
>>
>> and aim9 page_test/brk_test performance increased 5%~50%.
> 
> Where does the 50% number come in?  The numbers below seem to only show ~4% boost.

The Stddev/CoeffVar cases show about a 50% improvement. One container's mmtests results are as follows:

Stddev    page_test      245.15 (   0.00%)      189.29 (  22.79%)
Stddev    brk_test      1258.60 (   0.00%)      629.16 (  50.01%)
CoeffVar  page_test        0.71 (   0.00%)        0.53 (  26.05%)
CoeffVar  brk_test         1.32 (   0.00%)        0.64 (  51.14%)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-22 11:56   ` Alex Shi
@ 2019-08-22 15:20     ` Daniel Jordan
  2019-08-26  8:39       ` Konstantin Khlebnikov
  2019-08-26 14:25       ` Alex Shi
  0 siblings, 2 replies; 43+ messages in thread
From: Daniel Jordan @ 2019-08-22 15:20 UTC (permalink / raw)
  To: Alex Shi, cgroups, linux-kernel, linux-mm, Andrew Morton,
	Mel Gorman, Tejun Heo, Michal Hocko

On 8/22/19 7:56 AM, Alex Shi wrote:
> On 2019/8/22 at 2:00 AM, Daniel Jordan wrote:
>>    https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice>
>> It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
> 
> Thanks for the sharing, this patchset can not help the [1] case, since it's just relief the per container lock contention now.

I should've been clearer.  [1] is meant as an example of someone suffering from lru_lock during reclaim.  Wouldn't your series help per-memcg reclaim?

> Yes, readtwice case could be more sensitive for this lru_lock changes in containers. I may try to use it in container with some tuning. But anyway, aim9 is also pretty good to show the problem and solutions. :)
>>
>> It'd be better though, as Michal suggests, to use the real workload that's causing problems.  Where are you seeing contention?
> 
> We repeatly create or delete a lot of different containers according to servers load/usage, so normal workload could cause lots of pages alloc/remove. 

I think numbers from that scenario would help your case.

> aim9 could reflect part of scenarios. I don't know the DB scenario yet.

We see it during DB shutdown when each DB process frees its memory (zap_pte_range -> mark_page_accessed).  But that's a different thing, clearly Not This Series.

>>> With this patch series, lruvec->lru_lock show no contentions
>>>           &(&lruvec->lru_l...          8          0               0       0               0               0
>>>
>>> and aim9 page_test/brk_test performance increased 5%~50%.
>>
>> Where does the 50% number come in?  The numbers below seem to only show ~4% boost.
> 
> the Setddev/CoeffVar case has about 50% performance increase. one of container's mmtests result as following:
> 
> Stddev    page_test      245.15 (   0.00%)      189.29 (  22.79%)
> Stddev    brk_test      1258.60 (   0.00%)      629.16 (  50.01%)
> CoeffVar  page_test        0.71 (   0.00%)        0.53 (  26.05%)
> CoeffVar  brk_test         1.32 (   0.00%)        0.64 (  51.14%)

Aha.  50% decrease in stdev.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-21  2:00       ` Alex Shi
@ 2019-08-24  1:59           ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2019-08-24  1:59 UTC (permalink / raw)
  To: Alex Shi
  Cc: Hugh Dickins, Michal Hocko, Cgroups, LKML, Linux MM,
	Andrew Morton, Mel Gorman, Tejun Heo, Shakeel Butt, Yu Zhao,
	Daniel Jordan, Johannes Weiner, Vlastimil Babka,
	Vladimir Davydov, Minchan Kim, Kirill Tkhai,
	Konstantin Khlebnikov, Matthew Wilcox

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3268 bytes --]

On Wed, 21 Aug 2019, Alex Shi wrote:
> On 2019/8/21 at 2:24 AM, Hugh Dickins wrote:
> > I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
> > and/or mmotm.  Then compare with what Alex has, to see if there's any
> > good reason to prefer one to the other: if no good reason to prefer ours,
> > I doubt we shall bother to repost, but just use it as basis for helping
> > to review or improve Alex's.
> 
> For your review, my patchset are pretty straight and simple.
> It just use per lruvec lru_lock to replace necessary pgdat lru_lock.
> just this.  We could talk more after I back to work. :)

Sorry to be bearer of bad news, Alex, but when you said "straight and
simple", I feared that your patchset would turn out to be fundamentally
too simple.

And that is so. I have only to see the
		lruvec = mem_cgroup_page_lruvec(page, pgdat);
line in isolate_migratepages_block() in mm/compaction.c, and check
that mem_cgroup_page_lruvec() is little changed in mm/mempolicy.c.

The central problem with per-memcg lru_lock is that you do not know
for sure what lock to take (which memcg a page belongs to) until you
have taken the expected lock, and then checked whether page->memcg
is still the same - backing out and trying again if not.
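
To make that concrete, here is a minimal sketch of the lock-then-recheck
pattern (the helper name and its exact shape are assumptions for
illustration, not code taken from either patchset):

	/*
	 * Illustrative only: resolve the page's lruvec, take its lock,
	 * then verify the page has not moved to another memcg in the
	 * meantime; back out and retry if it has.
	 */
	static struct lruvec *lock_page_lruvec_irq(struct page *page)
	{
		struct lruvec *lruvec;

		for (;;) {
			lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
			spin_lock_irq(&lruvec->lru_lock);
			if (lruvec == mem_cgroup_page_lruvec(page, page_pgdat(page)))
				return lruvec;	/* still the right lock */
			/* page->memcg changed under us: back out, try again */
			spin_unlock_irq(&lruvec->lru_lock);
		}
	}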

Fix that central problem, and you end up with a more complicated
patchset, much like ours.  It's true that when ours was first developed,
the memcg situation was more complicated in several ways, and perhaps
some aspects of our patchset could be simplified now (though I've not
identified any).  Johannes in particular has done a great deal of
simplifying work in memcg over the last few years, but there are still
situations in which a page's memcg can change (move_charge_at_immigrate
and swapin readahead spring to mind - or perhaps the latter is only an
issue when MEMCG_SWAP is not enabled, I forget; and I often wonder if
reparenting will be brought back one day).

I did not review your patchset in detail, and wasn't able to get very
far in testing it.  At first I was put off by set_task_reclaim_state
warnings from mm/vmscan.c, but those turned out to be in v5.3-rc5
itself, not from your patchset or mine (but I've not yet investigated
what's responsible for them).  Within a minute of starting swapping
load, kcompactd compact_lock_irqsave() in isolate_migratepages_block()
would deadlock, and I did not get further.  (Though I did also notice
that booting the CONFIG_MEMCG=y kernel with "cgroup_disable=memory"
froze in booting - tiresomely, one has to keep both the memcg and
no-memcg locking to cope with that case, and I guess you had not.)

Rather than duplicating effort, I would advise you to give our patchset
a try, and if it works for you, help towards getting that one merged:
but of course, it's up to you.

I've attached a tarfile of it rebased to v5.3-rc5: I do not want to
spam the list with patches yet, because I do not have any stats or
argument in support of the series, as Andrew asked for years ago and
Michal asks again now.  But aside from that I consider it ready, and
will let Shakeel take it over from here, while I get back to what I
diverted from (but of course I'll try to answer questions on it).

Thanks,
Hugh

[-- Attachment #2: per-memcg lru_lock on v5.3-rc5 --]
[-- Type: APPLICATION/x-tar, Size: 102400 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 03/14] lru/memcg: using per lruvec lock in un/lock_page_lru
  2019-08-20  9:48 ` [PATCH 03/14] lru/memcg: using per lruvec lock in un/lock_page_lru Alex Shi
@ 2019-08-26  8:30   ` Konstantin Khlebnikov
  2019-08-26 14:16     ` Alex Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Konstantin Khlebnikov @ 2019-08-26  8:30 UTC (permalink / raw)
  To: Alex Shi, cgroups, linux-kernel, linux-mm, Andrew Morton,
	Mel Gorman, Tejun Heo
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins

On 20/08/2019 12.48, Alex Shi wrote:
> Now we repeatedly assign lruvec->pgdat in memcg. The assignment in the
> lruvec getter function will be removed once all such points are protected.
> 
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>   mm/memcontrol.c | 12 +++++-------
>   1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e8a1b0d95ba8..19fd911e8098 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2550,12 +2550,12 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>   static void lock_page_lru(struct page *page, int *isolated)
>   {
>   	pg_data_t *pgdat = page_pgdat(page);
> +	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
>   

What protects the lruvec from being freed at this point?
After resolving the lruvec, the page could be moved and its cgroup deleted.

In these old patches I used RCU for that: https://lkml.org/lkml/2012/2/20/276
The pointer to the lruvec should be resolved with irqs disabled.
I'm not sure that still works these days.
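
In other words, the race window in the hunk below is between resolving the
lruvec and taking its lock (an annotated sketch of that window, not a
proposed fix):

	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	/*
	 * Window: before the lock is taken, the page can be moved to
	 * another memcg and the old memcg's lruvec freed, unless RCU
	 * (or re-checking page->memcg after locking) keeps it valid.
	 */
	spin_lock_irq(&lruvec->lru_lock);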

> -	spin_lock_irq(&pgdat->lruvec.lru_lock);
> +	spin_lock_irq(&lruvec->lru_lock);
> +	sync_lruvec_pgdat(lruvec, pgdat);
>   	if (PageLRU(page)) {
> -		struct lruvec *lruvec;
>   
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>   		ClearPageLRU(page);
>   		del_page_from_lru_list(page, lruvec, page_lru(page));
>   		*isolated = 1;
> @@ -2566,16 +2566,14 @@ static void lock_page_lru(struct page *page, int *isolated)
>   static void unlock_page_lru(struct page *page, int isolated)
>   {
>   	pg_data_t *pgdat = page_pgdat(page);
> +	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
>   
>   	if (isolated) {
> -		struct lruvec *lruvec;
> -
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>   		VM_BUG_ON_PAGE(PageLRU(page), page);
>   		SetPageLRU(page);
>   		add_page_to_lru_list(page, lruvec, page_lru(page));
>   	}
> -	spin_unlock_irq(&pgdat->lruvec.lru_lock);
> +	spin_unlock_irq(&lruvec->lru_lock);
>   }
>   
>   static void commit_charge(struct page *page, struct mem_cgroup *memcg,
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-22 15:20     ` Daniel Jordan
@ 2019-08-26  8:39       ` Konstantin Khlebnikov
  2019-08-26 14:22         ` Alex Shi
  2019-08-26 14:25       ` Alex Shi
  1 sibling, 1 reply; 43+ messages in thread
From: Konstantin Khlebnikov @ 2019-08-26  8:39 UTC (permalink / raw)
  To: Daniel Jordan, Alex Shi, cgroups, linux-kernel, linux-mm,
	Andrew Morton, Mel Gorman, Tejun Heo, Michal Hocko, Hugh Dickins

On 22/08/2019 18.20, Daniel Jordan wrote:
> On 8/22/19 7:56 AM, Alex Shi wrote:
>> On 2019/8/22 at 2:00 AM, Daniel Jordan wrote:
>>>    https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice>
>>> It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this 
>>> lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
>>
>> Thanks for the sharing, this patchset can not help the [1] case, since it's just relief the per container lock contention now.
> 
> I should've been clearer.  [1] is meant as an example of someone suffering from lru_lock during reclaim.  Wouldn't your series help 
> per-memcg reclaim?
> 
>> Yes, readtwice case could be more sensitive for this lru_lock changes in containers. I may try to use it in container with some tuning. 
>> But anyway, aim9 is also pretty good to show the problem and solutions. :)
>>>
>>> It'd be better though, as Michal suggests, to use the real workload that's causing problems.  Where are you seeing contention?
>>
>> We repeatly create or delete a lot of different containers according to servers load/usage, so normal workload could cause lots of pages 
>> alloc/remove. 
> 
> I think numbers from that scenario would help your case.
> 
>> aim9 could reflect part of scenarios. I don't know the DB scenario yet.
> 
> We see it during DB shutdown when each DB process frees its memory (zap_pte_range -> mark_page_accessed).  But that's a different thing, 
> clearly Not This Series.
> 
>>>> With this patch series, lruvec->lru_lock show no contentions
>>>>           &(&lruvec->lru_l...          8          0               0       0               0               0
>>>>
>>>> and aim9 page_test/brk_test performance increased 5%~50%.
>>>
>>> Where does the 50% number come in?  The numbers below seem to only show ~4% boost.
>> 
>> the Setddev/CoeffVar case has about 50% performance increase. one of container's mmtests result as following:
>>
>> Stddev    page_test      245.15 (   0.00%)      189.29 (  22.79%)
>> Stddev    brk_test      1258.60 (   0.00%)      629.16 (  50.01%)
>> CoeffVar  page_test        0.71 (   0.00%)        0.53 (  26.05%)
>> CoeffVar  brk_test         1.32 (   0.00%)        0.64 (  51.14%)
> 
> Aha.  50% decrease in stdev.
> 

After splitting the lru lock, the present per-cpu pagevecs do not work
so well because they mix pages from different cgroups.

pagevec_lru_move_fn and friends need a better implementation:
either sorting the pages or splitting the vectors on a per-lruvec basis.
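
One possible shape for that is a rough sketch like the one below (simplified:
the real pagevec_lru_move_fn() also takes an arg parameter, and the
page->memcg re-check discussed elsewhere in the thread is omitted here):

	static void pagevec_lru_move_fn(struct pagevec *pvec,
			void (*move_fn)(struct page *page, struct lruvec *lruvec))
	{
		struct lruvec *locked = NULL;
		int i;

		for (i = 0; i < pagevec_count(pvec); i++) {
			struct page *page = pvec->pages[i];
			struct lruvec *lruvec = mem_cgroup_page_lruvec(page,
							page_pgdat(page));

			/* only drop/retake the lock when the lruvec changes */
			if (lruvec != locked) {
				if (locked)
					spin_unlock_irq(&locked->lru_lock);
				spin_lock_irq(&lruvec->lru_lock);
				locked = lruvec;
			}
			move_fn(page, lruvec);
		}
		if (locked)
			spin_unlock_irq(&locked->lru_lock);
		release_pages(pvec->pages, pagevec_count(pvec));
		pagevec_reinit(pvec);
	}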

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 03/14] lru/memcg: using per lruvec lock in un/lock_page_lru
  2019-08-26  8:30   ` Konstantin Khlebnikov
@ 2019-08-26 14:16     ` Alex Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-26 14:16 UTC (permalink / raw)
  To: Konstantin Khlebnikov, cgroups, linux-kernel, linux-mm,
	Andrew Morton, Mel Gorman, Tejun Heo
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins



On 2019/8/26 at 4:30 PM, Konstantin Khlebnikov wrote:
>>
>>   
> 
> What protects lruvec from freeing at this point?
> After reading resolving lruvec page could be moved and cgroup deleted.
> 
> In this old patches I've used RCU for that: https://lkml.org/lkml/2012/2/20/276
> Pointer to lruvec should be resolved under disabled irq.
> Not sure this works these days.

Thanks for the reminder! I will reconsider this point and come up with changes.

Thanks
Alex

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-26  8:39       ` Konstantin Khlebnikov
@ 2019-08-26 14:22         ` Alex Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-26 14:22 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Daniel Jordan, cgroups, linux-kernel,
	linux-mm, Andrew Morton, Mel Gorman, Tejun Heo, Michal Hocko,
	Hugh Dickins



On 2019/8/26 at 4:39 PM, Konstantin Khlebnikov wrote:
>>>
> because they mixes pages from different cgroups.
> 
> pagevec_lru_move_fn and friends need better implementation:
> either sorting pages or splitting vectores in per-lruvec basis.

Right, that should be the next step to improve. Maybe we could try per-lruvec pagevecs?

Thanks
Alex

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-22 15:20     ` Daniel Jordan
  2019-08-26  8:39       ` Konstantin Khlebnikov
@ 2019-08-26 14:25       ` Alex Shi
  1 sibling, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-26 14:25 UTC (permalink / raw)
  To: Daniel Jordan, cgroups, linux-kernel, linux-mm, Andrew Morton,
	Mel Gorman, Tejun Heo, Michal Hocko



On 2019/8/22 at 11:20 PM, Daniel Jordan wrote:
>>
>>>    https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice>
>>> It's also synthetic but it stresses lru_lock more than just anon alloc/free.  It hits the page activate path, which is where we see this lock in our database, and if enough memory is configured lru_lock also gets stressed during reclaim, similar to [1].
>>
>> Thanks for the sharing, this patchset can not help the [1] case, since it's just relief the per container lock contention now.
> 
> I should've been clearer.  [1] is meant as an example of someone suffering from lru_lock during reclaim.  Wouldn't your series help per-memcg reclaim?

Yes, I got your point. Since aim9 doesn't show much improvement for that, I am trying this case in containers.

Thanks
Alex

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/14] per memcg lru_lock
  2019-08-24  1:59           ` Hugh Dickins
  (?)
@ 2019-08-26 14:35           ` Alex Shi
  -1 siblings, 0 replies; 43+ messages in thread
From: Alex Shi @ 2019-08-26 14:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Cgroups, LKML, Linux MM, Andrew Morton, Mel Gorman,
	Tejun Heo, Shakeel Butt, Yu Zhao, Daniel Jordan, Johannes Weiner,
	Vlastimil Babka, Vladimir Davydov, Minchan Kim, Kirill Tkhai,
	Konstantin Khlebnikov, Matthew Wilcox



On 2019/8/24 at 9:59 AM, Hugh Dickins wrote:
> On Wed, 21 Aug 2019, Alex Shi wrote:
>> On 2019/8/21 at 2:24 AM, Hugh Dickins wrote:
>>> I'll set aside what I'm doing, and switch to rebasing ours to v5.3-rc
>>> and/or mmotm.  Then compare with what Alex has, to see if there's any
>>> good reason to prefer one to the other: if no good reason to prefer ours,
>>> I doubt we shall bother to repost, but just use it as basis for helping
>>> to review or improve Alex's.
>>
>> For your review, my patchset are pretty straight and simple.
>> It just use per lruvec lru_lock to replace necessary pgdat lru_lock.
>> just this.  We could talk more after I back to work. :)
> 
> Sorry to be bearer of bad news, Alex, but when you said "straight and
> simple", I feared that your patchset would turn out to be fundamentally
> too simple.
> 
> And that is so. I have only to see the
> 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> line in isolate_migratepages_block() in mm/compaction.c, and check
> that mem_cgroup_page_lruvec() is little changed in mm/mempolicy.c.
> 
> The central problem with per-memcg lru_lock is that you do not know
> for sure what lock to take (which memcg a page belongs to) until you
> have taken the expected lock, and then checked whether page->memcg
> is still the same - backing out and trying again if not.
> 
> Fix that central problem, and you end up with a more complicated
> patchset, much like ours.  It's true that when ours was first developed,
> the memcg situation was more complicated in several ways, and perhaps
> some aspects of our patchset could be simplified now (though I've not
> identified any).  Johannes in particular has done a great deal of
> simplifying work in memcg over the last few years, but there are still
> situations in which a page's memcg can change (move_charge_at_immigrate
> and swapin readahead spring to mind - or perhaps the latter is only an
> issue when MEMCG_SWAP is not enabled, I forget; and I often wonder if
> reparenting will be brought back one day).
> 
> I did not review your patchset in detail, and wasn't able to get very
> far in testing it.  At first I was put off by set_task_reclaim_state
> warnings from mm/vmscan.c, but those turned out to be in v5.3-rc5
> itself, not from your patchset or mine (but I've not yet investigated
> what's responsible for them).  Within a minute of starting swapping
> load, kcompactd compact_lock_irqsave() in isolate_migratepages_block()
> would deadlock, and I did not get further.  (Though I did also notice
> that booting the CONFIG_MEMCG=y kernel with "cgroup_disable=memory"
> froze in booting - tiresomely, one has to keep both the memcg and
> no-memcg locking to cope with that case, and I guess you had not.)
> 
> Rather than duplicating effort, I would advise you to give our patchset
> a try, and if it works for you, help towards getting that one merged:
> but of course, it's up to you.

Thanks a lot for all the info and reminders! Yes, the page->memcg change would be a problem. I will study your patchset and try to merge it.

> 
> I've attached a tarfile of it rebased to v5.3-rc5: I do not want to
> spam the list with patches yet, because I do not have any stats or
> argument in support of the series, as Andrew asked for years ago and
> Michal asks again now.  But aside from that I consider it ready, and
> will let Shakeel take it over from here, while I get back to what I
> diverted from (but of course I'll try to answer questions on it).
> 
I will try to look into them. Thanks for your kind offer. :)

Thanks!
Alex

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2019-08-26 14:36 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-20  9:48 [PATCH 00/14] per memcg lru_lock Alex Shi
2019-08-20  9:48 ` [PATCH 01/14] mm/lru: move pgdat lru_lock into lruvec Alex Shi
2019-08-20  9:48   ` Alex Shi
2019-08-20 13:40   ` Matthew Wilcox
2019-08-20 13:40     ` Matthew Wilcox
2019-08-20 14:11     ` Alex Shi
2019-08-20 14:11       ` Alex Shi
2019-08-20  9:48 ` [PATCH 02/14] lru/memcg: move the lruvec->pgdat sync out lru_lock Alex Shi
2019-08-20  9:48 ` [PATCH 03/14] lru/memcg: using per lruvec lock in un/lock_page_lru Alex Shi
2019-08-26  8:30   ` Konstantin Khlebnikov
2019-08-26 14:16     ` Alex Shi
2019-08-20  9:48 ` [PATCH 04/14] lru/compaction: use per lruvec lock in isolate_migratepages_block Alex Shi
2019-08-20  9:48 ` [PATCH 05/14] lru/huge_page: use per lruvec lock in __split_huge_page Alex Shi
2019-08-20  9:48 ` [PATCH 06/14] lru/mlock: using per lruvec lock in munlock Alex Shi
2019-08-20  9:48 ` [PATCH 07/14] lru/swap: using per lruvec lock in page_cache_release Alex Shi
2019-08-20  9:48 ` [PATCH 08/14] lru/swap: uer lruvec lock in activate_page Alex Shi
2019-08-20  9:48 ` [PATCH 09/14] lru/swap: uer per lruvec lock in pagevec_lru_move_fn Alex Shi
2019-08-20  9:48 ` [PATCH 10/14] lru/swap: use per lruvec lock in release_pages Alex Shi
2019-08-20  9:48 ` [PATCH 11/14] lru/vmscan: using per lruvec lock in lists shrinking Alex Shi
2019-08-20  9:48 ` [PATCH 12/14] lru/vmscan: use pre lruvec lock in check_move_unevictable_pages Alex Shi
2019-08-20  9:48 ` [PATCH 13/14] lru/vmscan: using per lruvec lru_lock in get_scan_count Alex Shi
2019-08-20  9:48 ` [PATCH 14/14] mm/lru: fix the comments of lru_lock Alex Shi
2019-08-20  9:48   ` Alex Shi
2019-08-20 14:00   ` Matthew Wilcox
2019-08-20 14:00     ` Matthew Wilcox
2019-08-20 14:21     ` Alex Shi
2019-08-20 14:21       ` Alex Shi
2019-08-20 10:45 ` [PATCH 00/14] per memcg lru_lock Michal Hocko
2019-08-20 16:48   ` Shakeel Butt
2019-08-20 16:48     ` Shakeel Butt
2019-08-20 18:24     ` Hugh Dickins
2019-08-20 18:24       ` Hugh Dickins
2019-08-21  1:21       ` Alex Shi
2019-08-21  2:00       ` Alex Shi
2019-08-24  1:59         ` Hugh Dickins
2019-08-24  1:59           ` Hugh Dickins
2019-08-26 14:35           ` Alex Shi
2019-08-21 18:00 ` Daniel Jordan
2019-08-22 11:56   ` Alex Shi
2019-08-22 15:20     ` Daniel Jordan
2019-08-26  8:39       ` Konstantin Khlebnikov
2019-08-26 14:22         ` Alex Shi
2019-08-26 14:25       ` Alex Shi
