Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v17 00/21] per memcg lru lock
@ 2020-07-25 12:59 Alex Shi
  2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi
                   ` (26 more replies)
  0 siblings, 27 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

The new version which bases on v5.8-rc6. It includes Hugh Dickins fix in 
mm/swap.c and mm/mlock.c fix which Alexander Duyck pointed out, then
removes 'mm/mlock: reorder isolation sequence during munlock' 

Hi Johanness & Hugh & Alexander & Willy,

Could you like to give a reviewed by since you address much of issue and
give lots of suggestions! Many thanks!

Current lru_lock is one for each of node, pgdat->lru_lock, that guard for
lru lists, but now we had moved the lru lists into memcg for long time. Still
using per node lru_lock is clearly unscalable, pages on each of memcgs have
to compete each others for a whole lru_lock. This patchset try to use per
lruvec/memcg lru_lock to repleace per node lru lock to guard lru lists, make
it scalable for memcgs and get performance gain.

Currently lru_lock still guards both lru list and page's lru bit, that's ok.
but if we want to use specific lruvec lock on the page, we need to pin down
the page's lruvec/memcg during locking. Just taking lruvec lock first may be
undermined by the page's memcg charge/migration. To fix this problem, we could
take out the page's lru bit clear and use it as pin down action to block the
memcg changes. That's the reason for new atomic func TestClearPageLRU.
So now isolating a page need both actions: TestClearPageLRU and hold the
lru_lock.

The typical usage of this is isolate_migratepages_block() in compaction.c
we have to take lru bit before lru lock, that serialized the page isolation
in memcg page charge/migration which will change page's lruvec and new 
lru_lock in it.

The above solution suggested by Johannes Weiner, and based on his new memcg 
charge path, then have this patchset. (Hugh Dickins tested and contributed much
code from compaction fix to general code polish, thanks a lot!).

The patchset includes 3 parts:
1, some code cleanup and minimum optimization as a preparation.
2, use TestCleanPageLRU as page isolation's precondition
3, replace per node lru_lock with per memcg per node lru_lock

Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
containers on a 2s * 26cores * HT box with a modefied case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
With this patchset, the readtwice performance increased about 80%
in concurrent containers.

Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought this
idea 8 years ago, and others who give comments as well: Daniel Jordan, 
Mel Gorman, Shakeel Butt, Matthew Wilcox etc.

Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!


Alex Shi (19):
  mm/vmscan: remove unnecessary lruvec adding
  mm/page_idle: no unlikely double check for idle page counting
  mm/compaction: correct the comments of compact_defer_shift
  mm/compaction: rename compact_deferred as compact_should_defer
  mm/thp: move lru_add_page_tail func to huge_memory.c
  mm/thp: clean up lru_add_page_tail
  mm/thp: remove code path which never got into
  mm/thp: narrow lru locking
  mm/memcg: add debug checking in lock_page_memcg
  mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
  mm/lru: move lru_lock holding in func lru_note_cost_page
  mm/lru: move lock into lru_note_cost
  mm/lru: introduce TestClearPageLRU
  mm/compaction: do page isolation first in compaction
  mm/thp: add tail pages into lru anyway in split_huge_page()
  mm/swap: serialize memcg changes in pagevec_lru_move_fn
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/lru: introduce the relock_page_lruvec function
  mm/pgdat: remove pgdat lru_lock

Hugh Dickins (2):
  mm/vmscan: use relock for move_pages_to_lru
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
 Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 +--
 include/linux/compaction.h                         |   4 +-
 include/linux/memcontrol.h                         |  98 ++++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmzone.h                             |   6 +-
 include/linux/page-flags.h                         |   1 +
 include/linux/swap.h                               |   4 +-
 include/trace/events/compaction.h                  |   2 +-
 mm/compaction.c                                    | 113 ++++++++----
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  48 +++--
 mm/memcontrol.c                                    |  71 ++++++-
 mm/memory.c                                        |   3 -
 mm/mlock.c                                         |  43 +++--
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   8 -
 mm/rmap.c                                          |   4 +-
 mm/swap.c                                          | 203 ++++++++-------------
 mm/swap_state.c                                    |   2 -
 mm/vmscan.c                                        | 174 ++++++++++--------
 mm/workingset.c                                    |   2 -
 25 files changed, 510 insertions(+), 344 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-08-06  3:47   ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 02/21] mm/page_idle: no unlikely double check for idle page counting Alex Shi
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

We don't have to add a freeable page into lru and then remove from it.
This change saves a couple of actions and makes the moving more clear.

The SetPageLRU needs to be kept here for list intergrity.
Otherwise:
 #0 mave_pages_to_lru              #1 release_pages
                                   if (put_page_testzero())
 if !put_page_testzero
                                     !PageLRU //skip lru_lock
                                       list_add(&page->lru,)
   list_add(&page->lru,) //corrupt

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/vmscan.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749d239c62b2..ddb29d813d77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1856,26 +1856,29 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	while (!list_empty(list)) {
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
+		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			list_del(&page->lru);
 			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
 			spin_lock_irq(&pgdat->lru_lock);
 			continue;
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
+		/*
+		 * The SetPageLRU needs to be kept here for list intergrity.
+		 * Otherwise:
+		 *   #0 mave_pages_to_lru             #1 release_pages
+		 *				      if (put_page_testzero())
+		 *   if !put_page_testzero
+		 *				        !PageLRU //skip lru_lock
+		 *                                        list_add(&page->lru,)
+		 *     list_add(&page->lru,) //corrupt
+		 */
 		SetPageLRU(page);
-		lru = page_lru(page);
 
-		nr_pages = hpage_nr_pages(page);
-		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
-		list_move(&page->lru, &lruvec->lists[lru]);
-
-		if (put_page_testzero(page)) {
+		if (unlikely(put_page_testzero(page))) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
-			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&pgdat->lru_lock);
@@ -1883,11 +1886,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 				spin_lock_irq(&pgdat->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
-		} else {
-			nr_moved += nr_pages;
-			if (PageActive(page))
-				workingset_age_nonresident(lruvec, nr_pages);
+
+			continue;
 		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lru = page_lru(page);
+		nr_pages = hpage_nr_pages(page);
+
+		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
+		list_add(&page->lru, &lruvec->lists[lru]);
+		nr_moved += nr_pages;
+		if (PageActive(page))
+			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
 	/*
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 02/21] mm/page_idle: no unlikely double check for idle page counting
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
  2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift Alex Shi
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

As func comments mentioned, few isolated page missing be tolerated.
So why not do further to drop the unlikely double check. That won't
cause more idle pages, but reduce a lock contention.

This is also a preparation for later new page isolation feature.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/page_idle.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 057c61df12db..5fdd753e151a 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,19 +32,11 @@
 static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page = pfn_to_online_page(pfn);
-	pg_data_t *pgdat;
 
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
 
-	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
-	if (unlikely(!PageLRU(page))) {
-		put_page(page);
-		page = NULL;
-	}
-	spin_unlock_irq(&pgdat->lru_lock);
 	return page;
 }
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
  2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi
  2020-07-25 12:59 ` [PATCH v17 02/21] mm/page_idle: no unlikely double check for idle page counting Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-27 17:29   ` Alexander Duyck
  2020-07-25 12:59 ` [PATCH v17 04/21] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

There is no compact_defer_limit. It should be compact_defer_shift in
use. and add compact_order_failed explanation.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmzone.h | 1 +
 mm/compaction.c        | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f6f884970511..14c668b7e793 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -512,6 +512,7 @@ struct zone {
 	 * On compaction failure, 1<<compact_defer_shift compactions
 	 * are skipped before trying again. The number attempted since
 	 * last failure is tracked with compact_considered.
+	 * compact_order_failed is the minimum compaction failed order.
 	 */
 	unsigned int		compact_considered;
 	unsigned int		compact_defer_shift;
diff --git a/mm/compaction.c b/mm/compaction.c
index 86375605faa9..cd1ef9e5e638 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page)
 
 /*
  * Compaction is deferred when compaction fails to result in a page
- * allocation success. 1 << compact_defer_limit compactions are skipped up
+ * allocation success. compact_defer_shift++, compactions are skipped up
  * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
  */
 void defer_compaction(struct zone *zone, int order)
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 04/21] mm/compaction: rename compact_deferred as compact_should_defer
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (2 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 05/21] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Steven Rostedt, Ingo Molnar, Vlastimil Babka, Mike Kravetz

The compact_deferred is a defer suggestion check, deferring action does in
defer_compaction not here. so, better rename it to avoid confusing.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/compaction.h        | 4 ++--
 include/trace/events/compaction.h | 2 +-
 mm/compaction.c                   | 8 ++++----
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6fa0eea3f530..be9ed7437a38 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -100,7 +100,7 @@ extern enum compact_result compaction_suitable(struct zone *zone, int order,
 		unsigned int alloc_flags, int highest_zoneidx);
 
 extern void defer_compaction(struct zone *zone, int order);
-extern bool compaction_deferred(struct zone *zone, int order);
+extern bool compaction_should_defer(struct zone *zone, int order);
 extern void compaction_defer_reset(struct zone *zone, int order,
 				bool alloc_success);
 extern bool compaction_restarting(struct zone *zone, int order);
@@ -199,7 +199,7 @@ static inline void defer_compaction(struct zone *zone, int order)
 {
 }
 
-static inline bool compaction_deferred(struct zone *zone, int order)
+static inline bool compaction_should_defer(struct zone *zone, int order)
 {
 	return true;
 }
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 54e5bf081171..33633c71df04 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -274,7 +274,7 @@
 		1UL << __entry->defer_shift)
 );
 
-DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_deferred,
+DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_should_defer,
 
 	TP_PROTO(struct zone *zone, int order),
 
diff --git a/mm/compaction.c b/mm/compaction.c
index cd1ef9e5e638..f14780fc296a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -154,7 +154,7 @@ void defer_compaction(struct zone *zone, int order)
 }
 
 /* Returns true if compaction should be skipped this time */
-bool compaction_deferred(struct zone *zone, int order)
+bool compaction_should_defer(struct zone *zone, int order)
 {
 	unsigned long defer_limit = 1UL << zone->compact_defer_shift;
 
@@ -168,7 +168,7 @@ bool compaction_deferred(struct zone *zone, int order)
 	if (zone->compact_considered >= defer_limit)
 		return false;
 
-	trace_mm_compaction_deferred(zone, order);
+	trace_mm_compaction_should_defer(zone, order);
 
 	return true;
 }
@@ -2377,7 +2377,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		enum compact_result status;
 
 		if (prio > MIN_COMPACT_PRIORITY
-					&& compaction_deferred(zone, order)) {
+				&& compaction_should_defer(zone, order)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
 			continue;
 		}
@@ -2561,7 +2561,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		if (!populated_zone(zone))
 			continue;
 
-		if (compaction_deferred(zone, cc.order))
+		if (compaction_should_defer(zone, cc.order))
 			continue;
 
 		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 05/21] mm/thp: move lru_add_page_tail func to huge_memory.c
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (3 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 04/21] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 06/21] mm/thp: clean up lru_add_page_tail Alex Shi
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

The func is only used in huge_memory.c, defining it in other file with a
CONFIG_TRANSPARENT_HUGEPAGE macro restrict just looks weird.

Let's move it THP. And make it static as Hugh Dickin suggested.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 --
 mm/huge_memory.c     | 30 ++++++++++++++++++++++++++++++
 mm/swap.c            | 33 ---------------------------------
 3 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5b3216ba39a9..2c29399b29a0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -339,8 +339,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
 			  unsigned int nr_pages);
 extern void lru_note_cost_page(struct page *);
 extern void lru_cache_add(struct page *);
-extern void lru_add_page_tail(struct page *page, struct page *page_tail,
-			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78c84bee7e29..9e050b13f597 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2340,6 +2340,36 @@ static void remap_page(struct page *page)
 	}
 }
 
+static void lru_add_page_tail(struct page *page, struct page *page_tail,
+				struct lruvec *lruvec, struct list_head *list)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+
+	if (!list)
+		SetPageLRU(page_tail);
+
+	if (likely(PageLRU(page)))
+		list_add_tail(&page_tail->lru, &page->lru);
+	else if (list) {
+		/* page reclaim is reclaiming a huge page */
+		get_page(page_tail);
+		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Head page has not yet been counted, as an hpage,
+		 * so we must account for each subpage individually.
+		 *
+		 * Put page_tail on the list at the correct position
+		 * so they all end up in order.
+		 */
+		add_page_to_lru_list_tail(page_tail, lruvec,
+					  page_lru(page_tail));
+	}
+}
+
 static void __split_huge_page_tail(struct page *head, int tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
diff --git a/mm/swap.c b/mm/swap.c
index a82efc33411f..7701d855873d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -933,39 +933,6 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct page *page, struct page *page_tail,
-		       struct lruvec *lruvec, struct list_head *list)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
-
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
-	else if (list) {
-		/* page reclaim is reclaiming a huge page */
-		get_page(page_tail);
-		list_add_tail(&page_tail->lru, list);
-	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
-	}
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
 				 void *arg)
 {
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 06/21] mm/thp: clean up lru_add_page_tail
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (4 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 05/21] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 07/21] mm/thp: remove code path which never got into Alex Shi
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

Since the first parameter is only used by head page, it's better to make
it explicit.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9e050b13f597..b18f21da4dac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2340,19 +2340,19 @@ static void remap_page(struct page *page)
 	}
 }
 
-static void lru_add_page_tail(struct page *page, struct page *page_tail,
+static void lru_add_page_tail(struct page *head, struct page *page_tail,
 				struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
-	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	VM_BUG_ON_PAGE(!PageHead(head), head);
+	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
 	if (!list)
 		SetPageLRU(page_tail);
 
-	if (likely(PageLRU(page)))
-		list_add_tail(&page_tail->lru, &page->lru);
+	if (likely(PageLRU(head)))
+		list_add_tail(&page_tail->lru, &head->lru);
 	else if (list) {
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 07/21] mm/thp: remove code path which never got into
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (5 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 06/21] mm/thp: clean up lru_add_page_tail Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 08/21] mm/thp: narrow lru locking Alex Shi
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

split_huge_page() will never call on a page which isn't on lru list, so
this code never got a chance to run, and should not be run, to add tail
pages on a lru list which head page isn't there.

Although the bug was never triggered, it'better be removed for code
correctness.

BTW, it looks better to have a WARN() set in the wrong path, but the
path will be changed in incoming new page isolation func. So just save
it now.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b18f21da4dac..1fb4147ff854 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2357,16 +2357,6 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
-	} else {
-		/*
-		 * Head page has not yet been counted, as an hpage,
-		 * so we must account for each subpage individually.
-		 *
-		 * Put page_tail on the list at the correct position
-		 * so they all end up in order.
-		 */
-		add_page_to_lru_list_tail(page_tail, lruvec,
-					  page_lru(page_tail));
 	}
 }
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 08/21] mm/thp: narrow lru locking
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (6 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 07/21] mm/thp: remove code path which never got into Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 09/21] mm/memcg: add debug checking in lock_page_memcg Alex Shi
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Andrea Arcangeli

lru_lock and page cache xa_lock have no reason with current sequence,
put them together isn't necessary. let's narrow the lru locking, but
left the local_irq_disable to block interrupt re-entry and statistic update.

Hugh Dickins point: split_huge_page_to_list() was already silly,to be
using the _irqsave variant: it's just been taking sleeping locks, so
would already be broken if entered with interrupts enabled.
so we can save passing flags argument down to __split_huge_page().

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1fb4147ff854..d866b6e43434 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2423,7 +2423,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+			      pgoff_t end)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
@@ -2432,8 +2432,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned long offset = 0;
 	int i;
 
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -2445,6 +2443,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock(&pgdat->lru_lock);
+
+	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
@@ -2464,6 +2467,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
+	spin_unlock(&pgdat->lru_lock);
+	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
 
@@ -2481,8 +2486,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		page_ref_add(head, 2);
 		xa_unlock(&head->mapping->i_pages);
 	}
-
-	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	local_irq_enable();
 
 	remap_page(head);
 
@@ -2621,12 +2625,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct page *head = compound_head(page);
-	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
-	unsigned long flags;
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2687,9 +2689,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irqsave(&pgdata->lru_lock, flags);
-
+	/* block interrupt reentry in xa_lock and spinlock */
+	local_irq_disable();
 	if (mapping) {
 		XA_STATE(xas, &mapping->i_pages, page_index(head));
 
@@ -2719,7 +2720,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
 
@@ -2738,7 +2739,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
-		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+		local_irq_enable();
 		remap_page(head);
 		ret = -EBUSY;
 	}
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 09/21] mm/memcg: add debug checking in lock_page_memcg
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (7 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 08/21] mm/thp: narrow lru locking Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 10/21] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Michal Hocko, Vladimir Davydov

Add a debug checking in lock_page_memcg, then we could get alarm
if anything wrong here.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d45a9159af9..20c8ed69a930 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1983,6 +1983,12 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
 	if (unlikely(!memcg))
 		return NULL;
 
+#ifdef CONFIG_PROVE_LOCKING
+	local_irq_save(flags);
+	might_lock(&memcg->move_lock);
+	local_irq_restore(flags);
+#endif
+
 	if (atomic_read(&memcg->moving_account) <= 0)
 		return memcg;
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 10/21] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (8 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 09/21] mm/memcg: add debug checking in lock_page_memcg Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

Fold the PGROTATED event collection into pagevec_move_tail_fn call back
func like other funcs does in pagevec_lru_move_fn. Now all usage of
pagevec_lru_move_fn are same and no needs of the 3rd parameter.

It's simply the calling.

[lkp@intel.com: found a build issue in the original patch, thanks]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 66 +++++++++++++++++++++++----------------------------------------
 1 file changed, 24 insertions(+), 42 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 7701d855873d..dc8b02cdddcb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -204,8 +204,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
 EXPORT_SYMBOL_GPL(get_kernel_page);
 
 static void pagevec_lru_move_fn(struct pagevec *pvec,
-	void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
-	void *arg)
+	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
 	struct pglist_data *pgdat = NULL;
@@ -224,7 +223,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		(*move_fn)(page, lruvec, arg);
+		(*move_fn)(page, lruvec);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -232,35 +231,23 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	pagevec_reinit(pvec);
 }
 
-static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	int *pgmoved = arg;
-
 	if (PageLRU(page) && !PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
-		(*pgmoved) += hpage_nr_pages(page);
+		__count_vm_events(PGROTATED, hpage_nr_pages(page));
 	}
 }
 
 /*
- * pagevec_move_tail() must be called with IRQ disabled.
- * Otherwise this may cause nasty races.
- */
-static void pagevec_move_tail(struct pagevec *pvec)
-{
-	int pgmoved = 0;
-
-	pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
-	__count_vm_events(PGROTATED, pgmoved);
-}
-
-/*
  * Writeback is about to end against a page which has been marked for immediate
  * reclaim.  If it still appears to be reclaimable, move it to the tail of the
  * inactive list.
+ *
+ * pagevec_move_tail_fn() must be called with IRQ disabled.
+ * Otherwise this may cause nasty races.
  */
 void rotate_reclaimable_page(struct page *page)
 {
@@ -273,7 +260,7 @@ void rotate_reclaimable_page(struct page *page)
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_move_tail(pvec);
+			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 }
@@ -315,8 +302,7 @@ void lru_note_cost_page(struct page *page)
 		      page_is_file_lru(page), hpage_nr_pages(page));
 }
 
-static void __activate_page(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -340,7 +326,7 @@ static void activate_page_drain(int cpu)
 	struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);
 
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, __activate_page, NULL);
+		pagevec_lru_move_fn(pvec, __activate_page);
 }
 
 static bool need_activate_page_drain(int cpu)
@@ -358,7 +344,7 @@ void activate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, __activate_page, NULL);
+			pagevec_lru_move_fn(pvec, __activate_page);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -374,7 +360,7 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
+	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -526,8 +512,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
-			      void *arg)
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 {
 	int lru;
 	bool active;
@@ -574,8 +559,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
@@ -592,8 +576,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 	}
 }
 
-static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
-			    void *arg)
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
 	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
@@ -636,21 +619,21 @@ void lru_add_drain_cpu(int cpu)
 
 		/* No harm done if a racing interrupt already did this */
 		local_lock_irqsave(&lru_rotate.lock, flags);
-		pagevec_move_tail(pvec);
+		pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 
 	pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
 	activate_page_drain(cpu);
 }
@@ -679,7 +662,7 @@ void deactivate_file_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
 
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -701,7 +684,7 @@ void deactivate_page(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -723,7 +706,7 @@ void mark_page_lazyfree(struct page *page)
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
 }
@@ -933,8 +916,7 @@ void __pagevec_release(struct pagevec *pvec)
 }
 EXPORT_SYMBOL(__pagevec_release);
 
-static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
-				 void *arg)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 {
 	enum lru_list lru;
 	int was_unevictable = TestClearPageUnevictable(page);
@@ -993,7 +975,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
 }
 
 /**
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (9 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 10/21] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-08-05 21:18   ` Alexander Duyck
  2020-07-25 12:59 ` [PATCH v17 12/21] mm/lru: move lock into lru_note_cost Alex Shi
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

It's a clean up patch w/o function changes.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/memory.c     | 3 ---
 mm/swap.c       | 2 ++
 mm/swap_state.c | 2 --
 mm/workingset.c | 2 --
 4 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 87ec87cdc1ff..dafc5585517e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3150,10 +3150,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				 * XXX: Move to lru_cache_add() when it
 				 * supports new vs putback
 				 */
-				spin_lock_irq(&page_pgdat(page)->lru_lock);
 				lru_note_cost_page(page);
-				spin_unlock_irq(&page_pgdat(page)->lru_lock);
-
 				lru_cache_add(page);
 				swap_readpage(page, true);
 			}
diff --git a/mm/swap.c b/mm/swap.c
index dc8b02cdddcb..b88ca630db70 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -298,8 +298,10 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 
 void lru_note_cost_page(struct page *page)
 {
+	spin_lock_irq(&page_pgdat(page)->lru_lock);
 	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
 		      page_is_file_lru(page), hpage_nr_pages(page));
+	spin_unlock_irq(&page_pgdat(page)->lru_lock);
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 05889e8e3c97..080be52db6a8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -440,9 +440,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	}
 
 	/* XXX: Move to lru_cache_add() when it supports new vs putback */
-	spin_lock_irq(&page_pgdat(page)->lru_lock);
 	lru_note_cost_page(page);
-	spin_unlock_irq(&page_pgdat(page)->lru_lock);
 
 	/* Caller will initiate read into locked page */
 	SetPageWorkingset(page);
diff --git a/mm/workingset.c b/mm/workingset.c
index 50b7937bab32..337d5b9ad132 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -372,9 +372,7 @@ void workingset_refault(struct page *page, void *shadow)
 	if (workingset) {
 		SetPageWorkingset(page);
 		/* XXX: Move to lru_cache_add() when it supports new vs putback */
-		spin_lock_irq(&page_pgdat(page)->lru_lock);
 		lru_note_cost_page(page);
-		spin_unlock_irq(&page_pgdat(page)->lru_lock);
 		inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
 	}
 out:
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 12/21] mm/lru: move lock into lru_note_cost
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (10 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU Alex Shi
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

This patch move lru_lock into lru_note_cost. It's a bit ugly and may
cost more locking, but it's necessary for later per pgdat lru_lock to
per memcg lru_lock change.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c   | 5 +++--
 mm/vmscan.c | 4 +---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index b88ca630db70..f645965fde0e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -269,7 +269,9 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+		spin_lock_irq(&pgdat->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -293,15 +295,14 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
+		spin_unlock_irq(&pgdat->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
 void lru_note_cost_page(struct page *page)
 {
-	spin_lock_irq(&page_pgdat(page)->lru_lock);
 	lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
 		      page_is_file_lru(page), hpage_nr_pages(page));
-	spin_unlock_irq(&page_pgdat(page)->lru_lock);
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ddb29d813d77..c1c4259b4de5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1976,19 +1976,17 @@ static int current_may_throttle(void)
 				&stat, false);
 
 	spin_lock_irq(&pgdat->lru_lock);
-
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	lru_note_cost(lruvec, file, stat.nr_pageout);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-
 	spin_unlock_irq(&pgdat->lru_lock);
 
+	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
 	free_unref_page_list(&page_list);
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (11 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 12/21] mm/lru: move lock into lru_note_cost Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-29  3:53   ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alex Shi
                   ` (13 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Michal Hocko, Vladimir Davydov

Currently lru_lock still guards both lru list and page's lru bit, that's
ok. but if we want to use specific lruvec lock on the page, we need to
pin down the page's lruvec/memcg during locking. Just taking lruvec
lock first may be undermined by the page's memcg charge/migration. To
fix this problem, we could clear the lru bit out of locking and use
it as pin down action to block the page isolation in memcg changing.
So now we do page isolating by both actions:
	TestClearPageLRU and hold the lru_lock.

This patch start with the first part: TestClearPageLRU, which combines
PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This
function will be used as page isolation precondition to prevent other
isolations some where else. Then there are may !PageLRU page on lru
list, need to remove BUG() checking accordingly.

There 2 rules for lru bit now:
1, the lru bit still indicate if a page on lru list, just in some
   temporary moment(isolating), the page may have no lru bit when
   it's on lru list.  but the page still must be on lru list when the
   lru bit set.
2, have to remove lru bit before delete it from lru list.

Hugh Dickins pointed that when a page is in free path and no one is
possible to take it, non atomic lru bit clearing is better, like in
__page_cache_release and release_pages.
And no need get_page() before lru bit clear in isolate_lru_page,
since it '(1) Must be called with an elevated refcount on the page'.

As Andrew Morton mentioned this change would dirty cacheline for page
isn't on LRU. But the lost would be acceptable with Rong Chen
<rong.a.chen@intel.com> report:
https://lkml.org/lkml/2020/3/4/173

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/page-flags.h |  1 +
 mm/mlock.c                 |  3 +--
 mm/swap.c                  |  6 ++----
 mm/vmscan.c                | 18 +++++++-----------
 4 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6be1aa559b1e..9554ed1387dc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size)
 PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+	TESTCLEARFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
 PAGEFLAG(Workingset, workingset, PF_HEAD)
diff --git a/mm/mlock.c b/mm/mlock.c
index f8736136fad7..228ba5a8e0a5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -108,13 +108,12 @@ void mlock_vma_page(struct page *page)
  */
 static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
 {
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		struct lruvec *lruvec;
 
 		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
-		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		return true;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index f645965fde0e..5092fe9c8c47 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
 		struct lruvec *lruvec;
 		unsigned long flags;
 
+		__ClearPageLRU(page);
 		spin_lock_irqsave(&pgdat->lru_lock, flags);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 	}
@@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
 				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
-			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
+			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c1c4259b4de5..4183ae6b54b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
 
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-
 		nr_pages = compound_nr(page);
 		total_scan += nr_pages;
 
@@ -1769,21 +1767,19 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
+		int lru = page_lru(page);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		get_page(page);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		if (PageLRU(page)) {
-			int lru = page_lru(page);
-			get_page(page);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, lru);
-			ret = 0;
-		}
+		spin_lock_irq(&pgdat->lru_lock);
+		del_page_from_lru_list(page, lruvec, lru);
 		spin_unlock_irq(&pgdat->lru_lock);
+		ret = 0;
 	}
+
 	return ret;
 }
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (12 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-08-04 21:35   ` Alexander Duyck
                     ` (2 more replies)
  2020-07-25 12:59 ` [PATCH v17 15/21] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
                   ` (12 subsequent siblings)
  26 siblings, 3 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

Currently, compaction would get the lru_lock and then do page isolation
which works fine with pgdat->lru_lock, since any page isoltion would
compete for the lru_lock. If we want to change to memcg lru_lock, we
have to isolate the page before getting lru_lock, thus isoltion would
block page's memcg change which relay on page isoltion too. Then we
could safely use per memcg lru_lock later.

The new page isolation use previous introduced TestClearPageLRU() +
pgdat lru locking which will be changed to memcg lru lock later.

Hugh Dickins <hughd@google.com> fixed following bugs in this patch's
early version:

Fix lots of crashes under compaction load: isolate_migratepages_block()
must clean up appropriately when rejecting a page, setting PageLRU again
if it had been cleared; and a put_page() after get_page_unless_zero()
cannot safely be done while holding locked_lruvec - it may turn out to
be the final put_page(), which will take an lruvec lock when PageLRU.
And move __isolate_lru_page_prepare back after get_page_unless_zero to
make trylock_page() safe:
trylock_page() is not safe to use at this time: its setting PG_locked
can race with the page being freed or allocated ("Bad page"), and can
also erase flags being set by one of those "sole owners" of a freshly
allocated page who use non-atomic __SetPageFlag().

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 +-
 mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 46 ++++++++++++++++++++++++++--------------------
 3 files changed, 60 insertions(+), 30 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2c29399b29a0..6d23d3beeff7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
+extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index f14780fc296a..2da2933fe56b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -869,6 +869,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
 			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
 				low_pfn = end_pfn;
+				page = NULL;
 				goto isolate_abort;
 			}
 			valid_page = page;
@@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
 			goto isolate_fail;
 
+		/*
+		 * Be careful not to clear PageLRU until after we're
+		 * sure the page is not being freed elsewhere -- the
+		 * page release code relies on it.
+		 */
+		if (unlikely(!get_page_unless_zero(page)))
+			goto isolate_fail;
+
+		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
+			goto isolate_fail_put;
+
+		/* Try isolate the page */
+		if (!TestClearPageLRU(page))
+			goto isolate_fail_put;
+
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
 			locked = compact_lock_irqsave(&pgdat->lru_lock,
@@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 					goto isolate_abort;
 			}
 
-			/* Recheck PageLRU and PageCompound under lock */
-			if (!PageLRU(page))
-				goto isolate_fail;
-
 			/*
 			 * Page become compound since the non-locked check,
 			 * and it's on LRU. It can only be a THP so the order
@@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
 				low_pfn += compound_nr(page) - 1;
-				goto isolate_fail;
+				SetPageLRU(page);
+				goto isolate_fail_put;
 			}
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-		/* Try isolate the page */
-		if (__isolate_lru_page(page, isolate_mode) != 0)
-			goto isolate_fail;
-
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
 			low_pfn += compound_nr(page) - 1;
@@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		}
 
 		continue;
+
+isolate_fail_put:
+		/* Avoid potential deadlock in freeing page under lru_lock */
+		if (locked) {
+			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			locked = false;
+		}
+		put_page(page);
+
 isolate_fail:
 		if (!skip_on_failure)
 			continue;
@@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	if (unlikely(low_pfn > end_pfn))
 		low_pfn = end_pfn;
 
+	page = NULL;
+
 isolate_abort:
 	if (locked)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (page) {
+		SetPageLRU(page);
+		put_page(page);
+	}
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4183ae6b54b5..f77748adc340 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1544,20 +1544,20 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, isolate_mode_t mode)
+int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
 {
 	int ret = -EINVAL;
 
-	/* Only take pages on the LRU. */
-	if (!PageLRU(page))
-		return ret;
-
 	/* Compaction should not handle unevictable pages but CMA can do so */
 	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
 		return ret;
 
 	ret = -EBUSY;
 
+	/* Only take pages on the LRU. */
+	if (!PageLRU(page))
+		return ret;
+
 	/*
 	 * To minimise LRU disruption, the caller can indicate that it only
 	 * wants to isolate pages it will be able to operate on without
@@ -1598,20 +1598,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
 		return ret;
 
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * Be careful not to clear PageLRU until after we're
-		 * sure the page is not being freed elsewhere -- the
-		 * page release code relies on it.
-		 */
-		ClearPageLRU(page);
-		ret = 0;
-	}
-
-	return ret;
+	return 0;
 }
 
-
 /*
  * Update LRU sizes after isolating pages. The LRU size updates must
  * be complete before mem_cgroup_update_lru_size due to a sanity check.
@@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		 * only when the page is being freed somewhere else.
 		 */
 		scan += nr_pages;
-		switch (__isolate_lru_page(page, mode)) {
+		switch (__isolate_lru_page_prepare(page, mode)) {
 		case 0:
+			/*
+			 * Be careful not to clear PageLRU until after we're
+			 * sure the page is not being freed elsewhere -- the
+			 * page release code relies on it.
+			 */
+			if (unlikely(!get_page_unless_zero(page)))
+				goto busy;
+
+			if (!TestClearPageLRU(page)) {
+				/*
+				 * This page may in other isolation path,
+				 * but we still hold lru_lock.
+				 */
+				put_page(page);
+				goto busy;
+			}
+
 			nr_taken += nr_pages;
 			nr_zone_taken[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
 			break;
-
+busy:
 		case -EBUSY:
 			/* else it is being freed elsewhere */
 			list_move(&page->lru, src);
-			continue;
+			break;
 
 		default:
 			BUG();
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 15/21] mm/thp: add tail pages into lru anyway in split_huge_page()
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (13 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn Alex Shi
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Mika Penttilä

Split_huge_page() must start with PageLRU(head), and we are holding the
lru_lock here. If the head was cleared lru bit unexpected, tracking it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/huge_memory.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d866b6e43434..28538444197b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2348,15 +2348,19 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-	if (!list)
-		SetPageLRU(page_tail);
-
-	if (likely(PageLRU(head)))
-		list_add_tail(&page_tail->lru, &head->lru);
-	else if (list) {
+	if (list) {
 		/* page reclaim is reclaiming a huge page */
 		get_page(page_tail);
 		list_add_tail(&page_tail->lru, list);
+	} else {
+		/*
+		 * Split start from PageLRU(head), and we are holding the
+		 * lru_lock.
+		 * Do a warning if the head's lru bit was cleared unexpected.
+		 */
+		VM_WARN_ON(!PageLRU(head));
+		SetPageLRU(page_tail);
+		list_add_tail(&page_tail->lru, &head->lru);
 	}
 }
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (14 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 15/21] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

Hugh Dickins' found a memcg change bug on original version:
If we want to change the pgdat->lru_lock to memcg's lruvec lock, we have
to serialize mem_cgroup_move_account during pagevec_lru_move_fn. The
possible bad scenario would like:

	cpu 0					cpu 1
lruvec = mem_cgroup_page_lruvec()
					if (!isolate_lru_page())
						mem_cgroup_move_account

spin_lock_irqsave(&lruvec->lru_lock <== wrong lock.

So we need the ClearPageLRU to block isolate_lru_page(), that serializes
the memcg change. and then removing the PageLRU check in move_fn callee
as the consequence.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 46 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 36 insertions(+), 10 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 5092fe9c8c47..3029b3f74811 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -221,8 +221,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 			spin_lock_irqsave(&pgdat->lru_lock, flags);
 		}
 
+		/* block memcg migration during page moving between lru */
+		if (!TestClearPageLRU(page))
+			continue;
+
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		(*move_fn)(page, lruvec);
+
+		SetPageLRU(page);
 	}
 	if (pgdat)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -232,7 +238,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageUnevictable(page)) {
+	if (!PageUnevictable(page)) {
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		ClearPageActive(page);
 		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
@@ -306,7 +312,7 @@ void lru_note_cost_page(struct page *page)
 
 static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
+	if (!PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = hpage_nr_pages(page);
 
@@ -362,7 +368,8 @@ void activate_page(struct page *page)
 
 	page = compound_head(page);
 	spin_lock_irq(&pgdat->lru_lock);
-	__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
+	if (PageLRU(page))
+		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
 	spin_unlock_irq(&pgdat->lru_lock);
 }
 #endif
@@ -520,9 +527,6 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 	bool active;
 	int nr_pages = hpage_nr_pages(page);
 
-	if (!PageLRU(page))
-		return;
-
 	if (PageUnevictable(page))
 		return;
 
@@ -563,7 +567,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+	if (PageActive(page) && !PageUnevictable(page)) {
 		int lru = page_lru_base_type(page);
 		int nr_pages = hpage_nr_pages(page);
 
@@ -580,7 +584,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
 {
-	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	if (PageAnon(page) && PageSwapBacked(page) &&
 	    !PageSwapCache(page) && !PageUnevictable(page)) {
 		bool active = PageActive(page);
 		int nr_pages = hpage_nr_pages(page);
@@ -654,7 +658,7 @@ void deactivate_file_page(struct page *page)
 	 * In a workload with many unevictable page such as mprotect,
 	 * unevictable page deactivation for accelerating reclaim is pointless.
 	 */
-	if (PageUnevictable(page))
+	if (PageUnevictable(page) || !PageLRU(page))
 		return;
 
 	if (likely(get_page_unless_zero(page))) {
@@ -976,7 +980,29 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
+	int i;
+	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec;
+	unsigned long flags = 0;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct pglist_data *pagepgdat = page_pgdat(page);
+
+		if (pagepgdat != pgdat) {
+			if (pgdat)
+				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			pgdat = pagepgdat;
+			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		}
+
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		__pagevec_lru_add_fn(page, lruvec);
+	}
+	if (pgdat)
+		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	release_pages(pvec->pages, pvec->nr);
+	pagevec_reinit(pvec);
 }
 
 /**
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (15 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-27 23:34   ` Alexander Duyck
                     ` (2 more replies)
  2020-07-25 12:59 ` [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function Alex Shi
                   ` (9 subsequent siblings)
  26 siblings, 3 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Michal Hocko, Vladimir Davydov

This patch moves per node lru_lock into lruvec, thus bring a lru_lock for
each of memcg per node. So on a large machine, each of memcg don't
have to suffer from per node pgdat->lru_lock competition. They could go
fast with their self lru_lock.

After move memcg charge before lru inserting, page isolation could
serialize page's memcg, then per memcg lruvec lock is stable and could
replace per node lru lock.

According to Daniel Jordan's suggestion, I run 208 'dd' with on 104
containers on a 2s * 26cores * HT box with a modefied case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this and later patches, the readtwice performance increases about
80% within concurrent containers.

Also add a debug func in locking which may give some clues if there are
sth out of hands.

Hugh Dickins helped on patch polish, thanks!

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++++
 include/linux/mmzone.h     |   2 +
 mm/compaction.c            |  67 ++++++++++++++++++-----------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  63 ++++++++++++++++++++++++++-
 mm/mlock.c                 |  47 +++++++++++++-------
 mm/mmzone.c                |   1 +
 mm/swap.c                  | 104 +++++++++++++++++++++------------------------
 mm/vmscan.c                |  70 ++++++++++++++++--------------
 9 files changed, 288 insertions(+), 135 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e77197a62809..258901021c6c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -411,6 +411,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -892,6 +905,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1126,6 +1164,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1255,6 +1297,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 14c668b7e793..30b961a9a749 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -249,6 +249,8 @@ enum lruvec_flags {
 };
 
 struct lruvec {
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	struct list_head		lists[NR_LRU_LISTS];
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
diff --git a/mm/compaction.c b/mm/compaction.c
index 2da2933fe56b..88bbd2e93895 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked_lruvec = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked_lruvec) {
+				unlock_page_lruvec_irqrestore(locked_lruvec,
+									flags);
+				locked_lruvec = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -922,10 +932,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
-				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+				if (locked_lruvec) {
+					unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+					locked_lruvec = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked_lruvec) {
+			if (locked_lruvec)
+				unlock_page_lruvec_irqrestore(locked_lruvec,
+									flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked_lruvec = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -988,9 +1007,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1023,9 +1041,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
-		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+		if (locked_lruvec) {
+			unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+			locked_lruvec = NULL;
 		}
 		put_page(page);
 
@@ -1039,9 +1057,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * page anyway.
 		 */
 		if (nr_isolated) {
-			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+			if (locked_lruvec) {
+				unlock_page_lruvec_irqrestore(locked_lruvec,
+									flags);
+				locked_lruvec = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1068,8 +1087,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	page = NULL;
 
 isolate_abort:
-	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (locked_lruvec)
+		unlock_page_lruvec_irqrestore(locked_lruvec, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 28538444197b..a0cb95891ae5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2346,7 +2346,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2430,7 +2430,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 			      pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2447,10 +2446,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2471,7 +2468,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20c8ed69a930..d6746656cc39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1196,6 +1196,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1215,7 +1228,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 		goto out;
 	}
 
-	memcg = page->mem_cgroup;
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	memcg = READ_ONCE(page->mem_cgroup);
 	/*
 	 * Swapcache readahead pages are added to the LRU - and
 	 * possibly migrated - before they are charged.
@@ -1236,6 +1250,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
@@ -2999,7 +3058,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * lruvec->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 228ba5a8e0a5..5d40d259a931 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page)
  * Isolate a page from LRU with optional get_page() pin.
  * Assumes lru_lock already held and page already pinned.
  */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+static bool __munlock_isolate_lru_page(struct page *page,
+				struct lruvec *lruvec, bool getpage)
 {
 	if (TestClearPageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
@@ -181,7 +179,7 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -189,11 +187,16 @@ unsigned int munlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	/*
-	 * Serialize with any parallel __split_huge_page_refcount() which
+	 * Serialize split tail pages in __split_huge_page_tail() which
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
+	 * TestClearPageLRU can't be used here to block page isolation, since
+	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
+	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
+	 * wrong lru list here. So relay on PageLocked to stop lruvec change
+	 * in mem_cgroup_move_account().
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 
 	if (!TestClearPageMlocked(page)) {
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -204,15 +207,15 @@ unsigned int munlock_vma_page(struct page *page)
 	nr_pages = hpage_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&pgdat->lru_lock);
+	if (__munlock_isolate_lru_page(page, lruvec, true)) {
+		unlock_page_lruvec_irq(lruvec);
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&pgdat->lru_lock);
+	unlock_page_lruvec_irq(lruvec);
 
 out:
 	return nr_pages - 1;
@@ -292,23 +295,34 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
+		struct lruvec *new_lruvec;
+
+		/* block memcg change in mem_cgroup_move_account */
+		lock_page_memcg(page);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (new_lruvec != lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
+		}
 
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, false))
+			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+				unlock_page_memcg(page);
 				continue;
-			else
+			} else
 				__munlock_isolation_failed(page);
 		} else {
 			delta_munlocked++;
@@ -320,11 +334,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 * pin. We cannot do it under lru_lock however. If it's
 		 * the last pin, __page_cache_release() would deadlock.
 		 */
+		unlock_page_memcg(page);
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/swap.c b/mm/swap.c
index 3029b3f74811..09edac441eb6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,15 +79,13 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
 		__ClearPageLRU(page);
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -206,32 +204,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -365,12 +360,13 @@ static inline void activate_page_drain(int cpu)
 void activate_page(struct page *page)
 {
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+		__activate_page(page, lruvec);
+	unlock_page_lruvec_irq(lruvec);
 }
 #endif
 
@@ -817,8 +813,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long uninitialized_var(flags);
 	unsigned int uninitialized_var(lock_batch);
 
@@ -828,21 +823,20 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		if (is_huge_zero_page(page))
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -861,28 +855,28 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
 			__ClearPageLRU(page);
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
@@ -892,8 +886,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -981,26 +975,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f77748adc340..168c1659e430 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1774,15 +1774,13 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		int lru = page_lru(page);
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, lru);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
+	struct lruvec *orig_lruvec = lruvec;
 	enum lru_list lru;
 
 	while (!list_empty(list)) {
+		struct lruvec *new_lruvec = NULL;
+
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		 *                                        list_add(&page->lru,)
 		 *     list_add(&page->lru,) //corrupt
 		 */
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (new_lruvec != lruvec) {
+			if (lruvec)
+				spin_unlock_irq(&lruvec->lru_lock);
+			lruvec = lock_page_lruvec_irq(page);
+		}
 		SetPageLRU(page);
 
 		if (unlikely(put_page_testzero(page))) {
@@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		lru = page_lru(page);
 		nr_pages = hpage_nr_pages(page);
 
@@ -1902,6 +1907,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		if (PageActive(page))
 			workingset_age_nonresident(lruvec, nr_pages);
 	}
+	if (orig_lruvec != lruvec) {
+		if (lruvec)
+			spin_unlock_irq(&lruvec->lru_lock);
+		spin_lock_irq(&orig_lruvec->lru_lock);
+	}
 
 	/*
 	 * To save our caller's stack, now use input list for pages to free.
@@ -1957,7 +1967,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1969,7 +1979,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1977,7 +1987,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1986,7 +1996,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2039,7 +2049,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2049,7 +2059,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2095,7 +2105,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2106,7 +2116,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2696,10 +2706,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4275,24 +4285,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
 		pgscanned++;
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -4308,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		}
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (16 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-07-29 17:52   ` Alexander Duyck
  2020-07-31 21:14   ` [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid alexander.h.duyck
  2020-07-25 12:59 ` [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru Alex Shi
                   ` (8 subsequent siblings)
  26 siblings, 2 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Thomas Gleixner, Andrey Ryabinin

Use this new function to replace repeated same code, no func change.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/memcontrol.h | 40 ++++++++++++++++++++++++++++++++++++++++
 mm/mlock.c                 |  9 +--------
 mm/swap.c                  | 33 +++++++--------------------------
 mm/vmscan.c                |  8 +-------
 4 files changed, 49 insertions(+), 41 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 258901021c6c..6e670f991b42 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1313,6 +1313,46 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
 	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
 }
 
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+		struct lruvec *locked_lruvec)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+	bool locked;
+
+	rcu_read_lock();
+	locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
+	rcu_read_unlock();
+
+	if (locked)
+		return locked_lruvec;
+
+	if (locked_lruvec)
+		unlock_page_lruvec_irq(locked_lruvec);
+
+	return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+		struct lruvec *locked_lruvec, unsigned long *flags)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+	bool locked;
+
+	rcu_read_lock();
+	locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
+	rcu_read_unlock();
+
+	if (locked)
+		return locked_lruvec;
+
+	if (locked_lruvec)
+		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+
+	return lock_page_lruvec_irqsave(page, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/mlock.c b/mm/mlock.c
index 5d40d259a931..bc2fb3bfbe7a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -303,17 +303,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	/* Phase 1: page isolation */
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		/* block memcg change in mem_cgroup_move_account */
 		lock_page_memcg(page);
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (new_lruvec != lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
-
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
diff --git a/mm/swap.c b/mm/swap.c
index 09edac441eb6..6d9c7288f7de 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,19 +209,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
-
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
@@ -864,17 +857,12 @@ void release_pages(struct page **pages, int nr)
 		}
 
 		if (PageLRU(page)) {
-			struct lruvec *new_lruvec;
-
-			new_lruvec = mem_cgroup_page_lruvec(page,
-							page_pgdat(page));
-			if (new_lruvec != lruvec) {
-				if (lruvec)
-					unlock_page_lruvec_irqrestore(lruvec,
-									flags);
+			struct lruvec *prev_lruvec = lruvec;
+
+			lruvec = relock_page_lruvec_irqsave(page, lruvec,
+									&flags);
+			if (prev_lruvec != lruvec)
 				lock_batch = 0;
-				lruvec = lock_page_lruvec_irqsave(page, &flags);
-			}
 
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
@@ -980,15 +968,8 @@ void __pagevec_lru_add(struct pagevec *pvec)
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
-
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = lock_page_lruvec_irqsave(page, &flags);
-		}
 
+		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
 	if (lruvec)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 168c1659e430..bdb53a678e7e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4292,15 +4292,9 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct lruvec *new_lruvec;
 
 		pgscanned++;
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (lruvec != new_lruvec) {
-			if (lruvec)
-				unlock_page_lruvec_irq(lruvec);
-			lruvec = lock_page_lruvec_irq(page);
-		}
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (17 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-08-03 22:49   ` Alexander Duyck
  2020-07-25 12:59 ` [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock Alex Shi
                   ` (7 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Andrey Ryabinin, Jann Horn

From: Hugh Dickins <hughd@google.com>

Use the relock function to replace relocking action. And try to save few
lock times.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/vmscan.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bdb53a678e7e..078a1640ec60 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1854,15 +1854,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 	enum lru_list lru;
 
 	while (!list_empty(list)) {
-		struct lruvec *new_lruvec = NULL;
-
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&lruvec->lru_lock);
+			if (lruvec) {
+				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
+			}
 			putback_lru_page(page);
-			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1876,12 +1876,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		 *                                        list_add(&page->lru,)
 		 *     list_add(&page->lru,) //corrupt
 		 */
-		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
-		if (new_lruvec != lruvec) {
-			if (lruvec)
-				spin_unlock_irq(&lruvec->lru_lock);
-			lruvec = lock_page_lruvec_irq(page);
-		}
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 		SetPageLRU(page);
 
 		if (unlikely(put_page_testzero(page))) {
@@ -1890,8 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
 				destroy_compound_page(page);
-				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (18 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-08-03 22:42   ` Alexander Duyck
  2020-07-25 12:59 ` [PATCH v17 21/21] mm/lru: revise the comments of lru_lock Alex Shi
                   ` (6 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
---
 include/linux/mmzone.h | 1 -
 mm/page_alloc.c        | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30b961a9a749..8af956aa13cf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -735,7 +735,6 @@ struct deferred_split {
 
 	/* Write-intensive fields used by page reclaim */
 	ZONE_PADDING(_pad1_)
-	spinlock_t		lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87ce294..4d7df42b32d6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6721,7 +6721,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
 	pgdat_page_ext_init(pgdat);
-	spin_lock_init(&pgdat->lru_lock);
 	lruvec_init(&pgdat->__lruvec);
 }
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v17 21/21] mm/lru: revise the comments of lru_lock
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (19 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock Alex Shi
@ 2020-07-25 12:59 ` Alex Shi
  2020-08-03 22:37   ` Alexander Duyck
  2020-07-27  5:40 ` [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (5 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-25 12:59 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Andrey Ryabinin, Jann Horn

From: Hugh Dickins <hughd@google.com>

Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to
fix the incorrect comments in code. Also fixed some zone->lru_lock comment
error from ancient time. etc.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
 Documentation/admin-guide/cgroup-v1/memory.rst     | 21 +++++++++------------
 Documentation/trace/events-kmem.rst                |  2 +-
 Documentation/vm/unevictable-lru.rst               | 22 ++++++++--------------
 include/linux/mm_types.h                           |  2 +-
 include/linux/mmzone.h                             |  2 +-
 mm/filemap.c                                       |  4 ++--
 mm/memcontrol.c                                    |  2 +-
 mm/rmap.c                                          |  4 ++--
 mm/vmscan.c                                        | 12 ++++++++----
 10 files changed, 36 insertions(+), 50 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 ======
-        Each memcg has its own private LRU. Now, its handling is under global
-	VM's control (means that it's handled under global pgdat->lru_lock).
-	Almost all routines around memcg's LRU is called by global LRU's
-	list management functions under pgdat->lru_lock.
-
-	A special function is mem_cgroup_isolate_pages(). This scans
-	memcg's private LRU and call __isolate_lru_page() to extract a page
-	from LRU.
-
-	(By __isolate_lru_page(), the page is removed from both of global and
-	private LRU.)
-
+	Each memcg has its own vector of LRUs (inactive anon, active anon,
+	inactive file, active file, unevictable) of pages from each node,
+	each LRU handled under a single lru_lock for that memcg and node.
 
 9. Typical Tests.
 =================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 12757e63b26c..24450696579f 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered.
 2.6 Locking
 -----------
 
-   lock_page_cgroup()/unlock_page_cgroup() should not be called under
-   the i_pages lock.
+Lock order is as follows:
 
-   Other lock order is following:
+  Page lock (PG_locked bit of page->flags)
+    mm->page_table_lock or split pte_lock
+      lock_page_memcg (memcg->move_lock)
+        mapping->i_pages lock
+          lruvec->lru_lock.
 
-   PG_locked.
-     mm->page_table_lock
-         pgdat->lru_lock
-	   lock_page_cgroup.
-
-  In many cases, just lock_page_cgroup() is called.
-
-  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  pgdat->lru_lock, it has no lock of its own.
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 -----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
 Broadly speaking, pages are taken off the LRU lock in bulk and
 freed in batch with a page list. Significant amounts of activity here could
 indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
 
 4. Per-CPU Allocator Activity
 =============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux.  The problems have been observed at customer sites on large
 memory x86_64 systems.
 
 To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone.  When a large
+main memory will have over 32 million 4k pages in a single node.  When a large
 fraction of these pages are not evictable for any reason [see below], vmscan
 will spend a lot of time scanning the LRU lists looking for the small fraction
 of pages that are evictable.  This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
 The Unevictable Page List
 -------------------------
 
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
 called the "unevictable" list and an associated page flag, PG_unevictable, to
 indicate that the page is being managed on the unevictable list.
 
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
 swap-backed pages.  This differentiation is only important while the pages are,
 in fact, evictable.
 
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
 lists and statistics originally proposed and posted by Christoph Lameter.
 
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock.  This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
 
 Memory Control Group Interaction
 --------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
 memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
 lru_list enum.
 
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
 lru_list enum element).  The memory controller tracks the movement of pages to
 and from the unevictable list.
 
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
 active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
 pages in all of the shrink_{active|inactive|page}_list() functions and will
 "cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
 
 There may be situations where a page is mapped into a VM_LOCKED VMA, but the
 page is not marked as PG_mlocked.  Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
 page from the LRU, as it is likely on the appropriate active or inactive list
 at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
 back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list.  If
+is now mlocked and divert the page to the node's unevictable list.  If
 mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
 it later if and when it attempts to reclaim the page.
 
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
      unevictable list in mlock_vma_page().
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
 
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 64ede5f150dc..44738cdb5a55 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
 		struct {	/* Page cache and anonymous pages */
 			/**
 			 * @lru: Pageout list, eg. active_list protected by
-			 * pgdat->lru_lock.  Sometimes used as a generic list
+			 * lruvec->lru_lock.  Sometimes used as a generic list
 			 * by the page owner.
 			 */
 			struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8af956aa13cf..c92289a4e14d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;
 
 /*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
diff --git a/mm/filemap.c b/mm/filemap.c
index 385759c4ce4b..3083557a1ce6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -101,8 +101,8 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->pgdat->lru_lock		(follow_page->mark_page_accessed)
- *    ->pgdat->lru_lock		(check_pte_range->isolate_lru_page)
+ *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->i_pages lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d6746656cc39..a018d7c8a3f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3057,7 +3057,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
 /*
- * Because tail pages are not marked as "used", set it. We're under
+ * Because tail pages are not marked as "used", set it. Don't need
  * lruvec->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/rmap.c b/mm/rmap.c
index 5fe2dedce1fc..7f6e95680c47 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -28,12 +28,12 @@
  *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *           anon_vma->rwsem
  *             mm->page_table_lock or pte_lock
- *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
  *               swap_lock (in swap_duplicate, swap_info_get)
  *                 mmlist_lock (in mmput, drain_mmlist and others)
  *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
  *                     i_pages lock (widely used)
+ *                       lruvec->lru_lock (in lock_page_lruvec_irq)
  *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *                   sb_lock (within inode_lock in fs/fs-writeback.c)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 078a1640ec60..bb3ac52de058 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1620,14 +1620,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 }
 
 /**
- * pgdat->lru_lock is heavily contended.  Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
  * and working on them outside the LRU lock.
  *
  * For pagecache intensive workloads, this function is the hottest
  * spot in the kernel (apart from copy_*_user functions).
  *
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
  *
  * @nr_to_scan:	The number of eligible pages to look through on the list.
  * @lruvec:	The LRU vector to pull pages from.
@@ -1826,14 +1828,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 
 /*
  * This moves pages from @list to corresponding LRU list.
+ * The pages from @list is out of any lruvec, and in the end list reuses as
+ * pages_to_free list.
  *
  * We move them the other way if the page is referenced by one or more
  * processes, from rmap.
  *
  * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation.  But if
+ * appropriate to hold lru_lock across the whole operation.  But if
  * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page.  It's impossible to balance
+ * should drop lru_lock around each page.  It's impossible to balance
  * this, so instead we remove the pages from the LRU while processing them.
  * It is safe to rely on PG_active against the non-LRU pages in here because
  * nobody will play with that bit on a non-LRU page.
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (20 preceding siblings ...)
  2020-07-25 12:59 ` [PATCH v17 21/21] mm/lru: revise the comments of lru_lock Alex Shi
@ 2020-07-27  5:40 ` Alex Shi
  2020-07-29 14:49   ` Alex Shi
  2020-07-31 21:31 ` Alexander Duyck
                   ` (4 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-27  5:40 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen







A standard for new page isolation steps like the following:
1, get_page(); #pin the page avoid be free
2, TestClearPageLRU(); #serialize other isolation, also memcg change
3, spin_lock on lru_lock; #serialize lru list access
The step 2 could be optimzed/replaced in scenarios which page is unlikely
be accessed by others.



在 2020/7/25 下午8:59, Alex Shi 写道:
> The new version which bases on v5.8-rc6. It includes Hugh Dickins fix in 
> mm/swap.c and mm/mlock.c fix which Alexander Duyck pointed out, then
> removes 'mm/mlock: reorder isolation sequence during munlock' 
> 
> Hi Johanness & Hugh & Alexander & Willy,
> 
> Could you like to give a reviewed by since you address much of issue and
> give lots of suggestions! Many thanks!
> 
> Current lru_lock is one for each of node, pgdat->lru_lock, that guard for
> lru lists, but now we had moved the lru lists into memcg for long time. Still
> using per node lru_lock is clearly unscalable, pages on each of memcgs have
> to compete each others for a whole lru_lock. This patchset try to use per
> lruvec/memcg lru_lock to repleace per node lru lock to guard lru lists, make
> it scalable for memcgs and get performance gain.
> 
> Currently lru_lock still guards both lru list and page's lru bit, that's ok.
> but if we want to use specific lruvec lock on the page, we need to pin down
> the page's lruvec/memcg during locking. Just taking lruvec lock first may be
> undermined by the page's memcg charge/migration. To fix this problem, we could
> take out the page's lru bit clear and use it as pin down action to block the
> memcg changes. That's the reason for new atomic func TestClearPageLRU.
> So now isolating a page need both actions: TestClearPageLRU and hold the
> lru_lock.
> 
> The typical usage of this is isolate_migratepages_block() in compaction.c
> we have to take lru bit before lru lock, that serialized the page isolation
> in memcg page charge/migration which will change page's lruvec and new 
> lru_lock in it.
> 
> The above solution suggested by Johannes Weiner, and based on his new memcg 
> charge path, then have this patchset. (Hugh Dickins tested and contributed much
> code from compaction fix to general code polish, thanks a lot!).
> 
> The patchset includes 3 parts:
> 1, some code cleanup and minimum optimization as a preparation.
> 2, use TestCleanPageLRU as page isolation's precondition
> 3, replace per node lru_lock with per memcg per node lru_lock
> 
> Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
> containers on a 2s * 26cores * HT box with a modefied case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
> With this patchset, the readtwice performance increased about 80%
> in concurrent containers.
> 
> Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought this
> idea 8 years ago, and others who give comments as well: Daniel Jordan, 
> Mel Gorman, Shakeel Butt, Matthew Wilcox etc.
> 
> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
> 
> 
> Alex Shi (19):
>   mm/vmscan: remove unnecessary lruvec adding
>   mm/page_idle: no unlikely double check for idle page counting
>   mm/compaction: correct the comments of compact_defer_shift
>   mm/compaction: rename compact_deferred as compact_should_defer
>   mm/thp: move lru_add_page_tail func to huge_memory.c
>   mm/thp: clean up lru_add_page_tail
>   mm/thp: remove code path which never got into
>   mm/thp: narrow lru locking
>   mm/memcg: add debug checking in lock_page_memcg
>   mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn
>   mm/lru: move lru_lock holding in func lru_note_cost_page
>   mm/lru: move lock into lru_note_cost
>   mm/lru: introduce TestClearPageLRU
>   mm/compaction: do page isolation first in compaction
>   mm/thp: add tail pages into lru anyway in split_huge_page()
>   mm/swap: serialize memcg changes in pagevec_lru_move_fn
>   mm/lru: replace pgdat lru_lock with lruvec lock
>   mm/lru: introduce the relock_page_lruvec function
>   mm/pgdat: remove pgdat lru_lock
> 
> Hugh Dickins (2):
>   mm/vmscan: use relock for move_pages_to_lru
>   mm/lru: revise the comments of lru_lock
> 
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +-
>  Documentation/admin-guide/cgroup-v1/memory.rst     |  21 +--
>  Documentation/trace/events-kmem.rst                |   2 +-
>  Documentation/vm/unevictable-lru.rst               |  22 +--
>  include/linux/compaction.h                         |   4 +-
>  include/linux/memcontrol.h                         |  98 ++++++++++
>  include/linux/mm_types.h                           |   2 +-
>  include/linux/mmzone.h                             |   6 +-
>  include/linux/page-flags.h                         |   1 +
>  include/linux/swap.h                               |   4 +-
>  include/trace/events/compaction.h                  |   2 +-
>  mm/compaction.c                                    | 113 ++++++++----
>  mm/filemap.c                                       |   4 +-
>  mm/huge_memory.c                                   |  48 +++--
>  mm/memcontrol.c                                    |  71 ++++++-
>  mm/memory.c                                        |   3 -
>  mm/mlock.c                                         |  43 +++--
>  mm/mmzone.c                                        |   1 +
>  mm/page_alloc.c                                    |   1 -
>  mm/page_idle.c                                     |   8 -
>  mm/rmap.c                                          |   4 +-
>  mm/swap.c                                          | 203 ++++++++-------------
>  mm/swap_state.c                                    |   2 -
>  mm/vmscan.c                                        | 174 ++++++++++--------
>  mm/workingset.c                                    |   2 -
>  25 files changed, 510 insertions(+), 344 deletions(-)
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift
  2020-07-25 12:59 ` [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift Alex Shi
@ 2020-07-27 17:29   ` Alexander Duyck
  2020-07-28 11:59     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-07-27 17:29 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> There is no compact_defer_limit. It should be compact_defer_shift in
> use. and add compact_order_failed explanation.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  include/linux/mmzone.h | 1 +
>  mm/compaction.c        | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f6f884970511..14c668b7e793 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -512,6 +512,7 @@ struct zone {
>          * On compaction failure, 1<<compact_defer_shift compactions
>          * are skipped before trying again. The number attempted since
>          * last failure is tracked with compact_considered.
> +        * compact_order_failed is the minimum compaction failed order.
>          */
>         unsigned int            compact_considered;
>         unsigned int            compact_defer_shift;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 86375605faa9..cd1ef9e5e638 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page)
>
>  /*
>   * Compaction is deferred when compaction fails to result in a page
> - * allocation success. 1 << compact_defer_limit compactions are skipped up
> + * allocation success. compact_defer_shift++, compactions are skipped up
>   * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
>   */
>  void defer_compaction(struct zone *zone, int order)

So this doesn't read right. I wouldn't keep the "++," in the
explanation, and if we are going to refer to a limit of "1 <<
COMPACT_MAX_DEFER_SHIFT" then maybe this should be left as "1 <<
compact_defer_shift".


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
@ 2020-07-27 23:34   ` Alexander Duyck
  2020-07-28  7:15     ` Alex Shi
                       ` (2 more replies)
  2020-07-29  3:54   ` Alex Shi
  2020-08-06  7:41   ` Alex Shi
  2 siblings, 3 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-07-27 23:34 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov

On Sat, Jul 25, 2020 at 6:01 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> This patch moves per node lru_lock into lruvec, thus bring a lru_lock for
> each of memcg per node. So on a large machine, each of memcg don't
> have to suffer from per node pgdat->lru_lock competition. They could go
> fast with their self lru_lock.
>
> After move memcg charge before lru inserting, page isolation could
> serialize page's memcg, then per memcg lruvec lock is stable and could
> replace per node lru lock.
>
> According to Daniel Jordan's suggestion, I run 208 'dd' with on 104
> containers on a 2s * 26cores * HT box with a modefied case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this and later patches, the readtwice performance increases about
> 80% within concurrent containers.
>
> Also add a debug func in locking which may give some clues if there are
> sth out of hands.
>
> Hugh Dickins helped on patch polish, thanks!
>
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: cgroups@vger.kernel.org
> ---
>  include/linux/memcontrol.h |  58 +++++++++++++++++++++++++
>  include/linux/mmzone.h     |   2 +
>  mm/compaction.c            |  67 ++++++++++++++++++-----------
>  mm/huge_memory.c           |  11 ++---
>  mm/memcontrol.c            |  63 ++++++++++++++++++++++++++-
>  mm/mlock.c                 |  47 +++++++++++++-------
>  mm/mmzone.c                |   1 +
>  mm/swap.c                  | 104 +++++++++++++++++++++------------------------
>  mm/vmscan.c                |  70 ++++++++++++++++--------------
>  9 files changed, 288 insertions(+), 135 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e77197a62809..258901021c6c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -411,6 +411,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>
>  struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
>
> +struct lruvec *lock_page_lruvec(struct page *page);
> +struct lruvec *lock_page_lruvec_irq(struct page *page);
> +struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +                                               unsigned long *flags);
> +
> +#ifdef CONFIG_DEBUG_VM
> +void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
> +#else
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
> +#endif
> +
>  static inline
>  struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>         return css ? container_of(css, struct mem_cgroup, css) : NULL;
> @@ -892,6 +905,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
>  {
>  }
>
> +static inline struct lruvec *lock_page_lruvec(struct page *page)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       spin_lock(&pgdat->__lruvec.lru_lock);
> +       return &pgdat->__lruvec;
> +}
> +
> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       spin_lock_irq(&pgdat->__lruvec.lru_lock);
> +       return &pgdat->__lruvec;
> +}
> +
> +static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
> +               unsigned long *flagsp)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +
> +       spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
> +       return &pgdat->__lruvec;
> +}
> +
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>                 struct mem_cgroup *prev,
> @@ -1126,6 +1164,10 @@ static inline void count_memcg_page_event(struct page *page,
>  void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +
> +static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
> +{
> +}
>  #endif /* CONFIG_MEMCG */
>
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> @@ -1255,6 +1297,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
>         return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
>  }
>
> +static inline void unlock_page_lruvec(struct lruvec *lruvec)
> +{
> +       spin_unlock(&lruvec->lru_lock);
> +}
> +
> +static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
> +{
> +       spin_unlock_irq(&lruvec->lru_lock);
> +}
> +
> +static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
> +               unsigned long flags)
> +{
> +       spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> +}
> +
>  #ifdef CONFIG_CGROUP_WRITEBACK
>
>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 14c668b7e793..30b961a9a749 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -249,6 +249,8 @@ enum lruvec_flags {
>  };
>
>  struct lruvec {
> +       /* per lruvec lru_lock for memcg */
> +       spinlock_t                      lru_lock;
>         struct list_head                lists[NR_LRU_LISTS];
>         /*
>          * These track the cost of reclaiming one LRU - file or anon -
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 2da2933fe56b..88bbd2e93895 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         unsigned long nr_scanned = 0, nr_isolated = 0;
>         struct lruvec *lruvec;
>         unsigned long flags = 0;
> -       bool locked = false;
> +       struct lruvec *locked_lruvec = NULL;
>         struct page *page = NULL, *valid_page = NULL;
>         unsigned long start_pfn = low_pfn;
>         bool skip_on_failure = false;
> @@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * contention, to give chance to IRQs. Abort completely if
>                  * a fatal signal is pending.
>                  */
> -               if (!(low_pfn % SWAP_CLUSTER_MAX)
> -                   && compact_unlock_should_abort(&pgdat->lru_lock,
> -                                           flags, &locked, cc)) {
> -                       low_pfn = 0;
> -                       goto fatal_pending;
> +               if (!(low_pfn % SWAP_CLUSTER_MAX)) {
> +                       if (locked_lruvec) {
> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +                                                                       flags);
> +                               locked_lruvec = NULL;
> +                       }
> +
> +                       if (fatal_signal_pending(current)) {
> +                               cc->contended = true;
> +
> +                               low_pfn = 0;
> +                               goto fatal_pending;
> +                       }
> +
> +                       cond_resched();
>                 }
>
>                 if (!pfn_valid_within(low_pfn))

I'm noticing this patch introduces a bunch of noise. What is the
reason for getting rid of compact_unlock_should_abort? It seems like
you just open coded it here. If there is some sort of issue with it
then it might be better to replace it as part of a preparatory patch
before you introduce this one as changes like this make it harder to
review.

It might make more sense to look at modifying
compact_unlock_should_abort and compact_lock_irqsave (which always
returns true so should probably be a void) to address the deficiencies
they have that make them unusable for you.

> @@ -922,10 +932,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(__PageMovable(page)) &&
>                                         !PageIsolated(page)) {
> -                               if (locked) {
> -                                       spin_unlock_irqrestore(&pgdat->lru_lock,
> -                                                                       flags);
> -                                       locked = false;
> +                               if (locked_lruvec) {
> +                                       unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> +                                       locked_lruvec = NULL;
>                                 }
>
>                                 if (!isolate_movable_page(page, isolate_mode))
> @@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!TestClearPageLRU(page))
>                         goto isolate_fail_put;
>
> +               rcu_read_lock();
> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
>                 /* If we already hold the lock, we can skip some rechecking */
> -               if (!locked) {
> -                       locked = compact_lock_irqsave(&pgdat->lru_lock,
> -                                                               &flags, cc);
> +               if (lruvec != locked_lruvec) {
> +                       if (locked_lruvec)
> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +                                                                       flags);
> +
> +                       compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> +                       locked_lruvec = lruvec;
> +                       rcu_read_unlock();
> +
> +                       lruvec_memcg_debug(lruvec, page);
>
>                         /* Try get exclusive access under lock */
>                         if (!skip_updated) {

So this bit makes things a bit complicated. From what I can can tell
the comment about exclusive access under the lock is supposed to apply
to the pageblock via the lru_lock. However you are having to retest
the lock for each page because it is possible the page was moved to
another memory cgroup while the lru_lock was released correct? So in
this case is the lru vector lock really providing any protection for
the skip_updated portion of this code block if the lock isn't
exclusive to the pageblock? In theory this would probably make more
sense to have protected the skip bits under the zone lock, but I
imagine that was avoided due to the additional overhead.

> @@ -988,9 +1007,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                 SetPageLRU(page);
>                                 goto isolate_fail_put;
>                         }
> -               }
> -
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +               } else
> +                       rcu_read_unlock();
>
>                 /* The whole page is taken off the LRU; skip the tail pages. */
>                 if (PageCompound(page))
> @@ -1023,9 +1041,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>
>  isolate_fail_put:
>                 /* Avoid potential deadlock in freeing page under lru_lock */
> -               if (locked) {
> -                       spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                       locked = false;
> +               if (locked_lruvec) {
> +                       unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> +                       locked_lruvec = NULL;
>                 }
>                 put_page(page);
>
> @@ -1039,9 +1057,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * page anyway.
>                  */
>                 if (nr_isolated) {
> -                       if (locked) {
> -                               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> -                               locked = false;
> +                       if (locked_lruvec) {
> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +                                                                       flags);
> +                               locked_lruvec = NULL;
>                         }
>                         putback_movable_pages(&cc->migratepages);
>                         cc->nr_migratepages = 0;
> @@ -1068,8 +1087,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         page = NULL;
>
>  isolate_abort:
> -       if (locked)
> -               spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irqrestore(locked_lruvec, flags);
>         if (page) {
>                 SetPageLRU(page);
>                 put_page(page);

<snip>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f77748adc340..168c1659e430 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1774,15 +1774,13 @@ int isolate_lru_page(struct page *page)
>         WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
>
>         if (TestClearPageLRU(page)) {
> -               pg_data_t *pgdat = page_pgdat(page);
>                 struct lruvec *lruvec;
>                 int lru = page_lru(page);
>
>                 get_page(page);
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -               spin_lock_irq(&pgdat->lru_lock);
> +               lruvec = lock_page_lruvec_irq(page);
>                 del_page_from_lru_list(page, lruvec, lru);
> -               spin_unlock_irq(&pgdat->lru_lock);
> +               unlock_page_lruvec_irq(lruvec);
>                 ret = 0;
>         }
>
> @@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                                                      struct list_head *list)
>  {
> -       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         int nr_pages, nr_moved = 0;
>         LIST_HEAD(pages_to_free);
>         struct page *page;
> +       struct lruvec *orig_lruvec = lruvec;
>         enum lru_list lru;
>
>         while (!list_empty(list)) {
> +               struct lruvec *new_lruvec = NULL;
> +
>                 page = lru_to_page(list);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(&pgdat->lru_lock);
> +                       spin_unlock_irq(&lruvec->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(&pgdat->lru_lock);
> +                       spin_lock_irq(&lruvec->lru_lock);
>                         continue;
>                 }
>
> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                  *                                        list_add(&page->lru,)
>                  *     list_add(&page->lru,) //corrupt
>                  */
> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +               if (new_lruvec != lruvec) {
> +                       if (lruvec)
> +                               spin_unlock_irq(&lruvec->lru_lock);
> +                       lruvec = lock_page_lruvec_irq(page);
> +               }
>                 SetPageLRU(page);
>
>                 if (unlikely(put_page_testzero(page))) {

I was going through the code of the entire patch set and I noticed
these changes in move_pages_to_lru. What is the reason for adding the
new_lruvec logic? My understanding is that we are moving the pages to
the lruvec provided are we not?If so why do we need to add code to get
a new lruvec? The code itself seems to stand out from the rest of the
patch as it is introducing new code instead of replacing existing
locking code, and it doesn't match up with the description of what
this function is supposed to do since it changes the lruvec.

> @@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                         __ClearPageActive(page);
>
>                         if (unlikely(PageCompound(page))) {
> -                               spin_unlock_irq(&pgdat->lru_lock);
> +                               spin_unlock_irq(&lruvec->lru_lock);
>                                 destroy_compound_page(page);
> -                               spin_lock_irq(&pgdat->lru_lock);
> +                               spin_lock_irq(&lruvec->lru_lock);
>                         } else
>                                 list_add(&page->lru, &pages_to_free);
>
>                         continue;
>                 }
>
> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>                 lru = page_lru(page);
>                 nr_pages = hpage_nr_pages(page);
>
> @@ -1902,6 +1907,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>                 if (PageActive(page))
>                         workingset_age_nonresident(lruvec, nr_pages);
>         }
> +       if (orig_lruvec != lruvec) {
> +               if (lruvec)
> +                       spin_unlock_irq(&lruvec->lru_lock);
> +               spin_lock_irq(&orig_lruvec->lru_lock);
> +       }
>
>         /*
>          * To save our caller's stack, now use input list for pages to free.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-27 23:34   ` Alexander Duyck
@ 2020-07-28  7:15     ` Alex Shi
  2020-07-28 11:19     ` Alex Shi
  2020-07-28 15:39     ` Alex Shi
  2 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-28  7:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov



在 2020/7/28 上午7:34, Alexander Duyck 写道:

>> @@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>                  * contention, to give chance to IRQs. Abort completely if
>>                  * a fatal signal is pending.
>>                  */
>> -               if (!(low_pfn % SWAP_CLUSTER_MAX)
>> -                   && compact_unlock_should_abort(&pgdat->lru_lock,
>> -                                           flags, &locked, cc)) {
>> -                       low_pfn = 0;
>> -                       goto fatal_pending;
>> +               if (!(low_pfn % SWAP_CLUSTER_MAX)) {
>> +                       if (locked_lruvec) {
>> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
>> +                                                                       flags);
>> +                               locked_lruvec = NULL;
>> +                       }
>> +
>> +                       if (fatal_signal_pending(current)) {
>> +                               cc->contended = true;
>> +
>> +                               low_pfn = 0;
>> +                               goto fatal_pending;
>> +                       }
>> +
>> +                       cond_resched();
>>                 }
>>
>>                 if (!pfn_valid_within(low_pfn))
> 
> I'm noticing this patch introduces a bunch of noise. What is the
> reason for getting rid of compact_unlock_should_abort? It seems like
> you just open coded it here. If there is some sort of issue with it
> then it might be better to replace it as part of a preparatory patch
> before you introduce this one as changes like this make it harder to
> review.

Thanks for comments, Alex.

the func compact_unlock_should_abort should be removed since one of parameters
changed from 'bool *locked' to 'struct lruvec *lruvec'. So it's not applicable
now. I have to open it here instead of adding a only one user func.

> 
> It might make more sense to look at modifying
> compact_unlock_should_abort and compact_lock_irqsave (which always
> returns true so should probably be a void) to address the deficiencies
> they have that make them unusable for you.

I am wondering if people like a patch which just open compact_unlock_should_abort
func and move bool to void as a preparation patch, do you like this?


>> @@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>                 if (!TestClearPageLRU(page))
>>                         goto isolate_fail_put;
>>
>> +               rcu_read_lock();
>> +               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> +
>>                 /* If we already hold the lock, we can skip some rechecking */
>> -               if (!locked) {
>> -                       locked = compact_lock_irqsave(&pgdat->lru_lock,
>> -                                                               &flags, cc);
>> +               if (lruvec != locked_lruvec) {
>> +                       if (locked_lruvec)
>> +                               unlock_page_lruvec_irqrestore(locked_lruvec,
>> +                                                                       flags);
>> +
>> +                       compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
>> +                       locked_lruvec = lruvec;
>> +                       rcu_read_unlock();
>> +
>> +                       lruvec_memcg_debug(lruvec, page);
>>
>>                         /* Try get exclusive access under lock */
>>                         if (!skip_updated) {
> 
> So this bit makes things a bit complicated. From what I can can tell
> the comment about exclusive access under the lock is supposed to apply
> to the pageblock via the lru_lock. However you are having to retest
> the lock for each page because it is possible the page was moved to
> another memory cgroup while the lru_lock was released correct? So in

The pageblock is aligned by pfn, so pages in them maynot on same memcg
originally. and yes, page may be changed memcg also.

> this case is the lru vector lock really providing any protection for
> the skip_updated portion of this code block if the lock isn't
> exclusive to the pageblock? In theory this would probably make more
> sense to have protected the skip bits under the zone lock, but I
> imagine that was avoided due to the additional overhead.

when we change to lruvec->lru_lock, it does the same thing as pgdat->lru_lock.
just may get a bit more chance to here, and find out this is a skipable
pageblock and quit. 
Yes, logically, pgdat lru_lock seems better, but since we are holding lru_lock.
It's fine to not bother more locks.

> 
>> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>>                  *                                        list_add(&page->lru,)
>>                  *     list_add(&page->lru,) //corrupt
>>                  */
>> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>> +               if (new_lruvec != lruvec) {
>> +                       if (lruvec)
>> +                               spin_unlock_irq(&lruvec->lru_lock);
>> +                       lruvec = lock_page_lruvec_irq(page);
>> +               }
>>                 SetPageLRU(page);
>>
>>                 if (unlikely(put_page_testzero(page))) {
> 
> I was going through the code of the entire patch set and I noticed
> these changes in move_pages_to_lru. What is the reason for adding the
> new_lruvec logic? My understanding is that we are moving the pages to
> the lruvec provided are we not?If so why do we need to add code to get
> a new lruvec? The code itself seems to stand out from the rest of the
> patch as it is introducing new code instead of replacing existing
> locking code, and it doesn't match up with the description of what
> this function is supposed to do since it changes the lruvec.

A code here since some bugs happened. I will check it again anyway.

Thanks!


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-27 23:34   ` Alexander Duyck
  2020-07-28  7:15     ` Alex Shi
@ 2020-07-28 11:19     ` Alex Shi
  2020-07-28 14:54       ` Alexander Duyck
  2020-07-28 15:39     ` Alex Shi
  2 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-28 11:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov



在 2020/7/28 上午7:34, Alexander Duyck 写道:
>> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>>                  *                                        list_add(&page->lru,)
>>                  *     list_add(&page->lru,) //corrupt
>>                  */
>> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>> +               if (new_lruvec != lruvec) {
>> +                       if (lruvec)
>> +                               spin_unlock_irq(&lruvec->lru_lock);
>> +                       lruvec = lock_page_lruvec_irq(page);
>> +               }
>>                 SetPageLRU(page);
>>
>>                 if (unlikely(put_page_testzero(page))) {
> I was going through the code of the entire patch set and I noticed
> these changes in move_pages_to_lru. What is the reason for adding the
> new_lruvec logic? My understanding is that we are moving the pages to
> the lruvec provided are we not?If so why do we need to add code to get
> a new lruvec? The code itself seems to stand out from the rest of the
> patch as it is introducing new code instead of replacing existing
> locking code, and it doesn't match up with the description of what
> this function is supposed to do since it changes the lruvec.

this new_lruvec is the replacement of removed line, as following code:
>> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
This recheck is for the page move the root memcg, otherwise it cause the bug:

[ 2081.240795] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 2081.248125] #PF: supervisor read access in kernel mode
[ 2081.253627] #PF: error_code(0x0000) - not-present page
[ 2081.259124] PGD 8000000044cb0067 P4D 8000000044cb0067 PUD 95c9067 PMD 0
[ 2081.266193] Oops: 0000 [#1] PREEMPT SMP PTI
[ 2081.270740] CPU: 5 PID: 131 Comm: kswapd0 Kdump: loaded Tainted: G        W         5.8.0-rc6-00025-gc708f8a0db47 #45
[ 2081.281960] Hardware name: Alibaba X-Dragon CN 01/20G4B, BIOS 1ALSP016 05/21/2018
[ 2081.290054] RIP: 0010:do_raw_spin_trylock+0x5/0x40
[ 2081.295209] Code: 76 82 48 89 df e8 bb fe ff ff eb 8c 89 c6 48 89 df e8 4f dd ff ff 66 90 eb 8b 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 85 c0 75 28 ba 01 00 00 00 f0 0f b1 17 75 1d 65 8b 05 03 6a
[ 2081.314832] RSP: 0018:ffffc900002ebac8 EFLAGS: 00010082
[ 2081.320410] RAX: 0000000000000000 RBX: 0000000000000018 RCX: 0000000000000000
[ 2081.327907] RDX: ffff888035833480 RSI: 0000000000000000 RDI: 0000000000000000
[ 2081.335407] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
[ 2081.342907] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 2081.350405] R13: dead000000000100 R14: 0000000000000000 R15: ffffc900002ebbb0
[ 2081.357908] FS:  0000000000000000(0000) GS:ffff88807a200000(0000) knlGS:0000000000000000
[ 2081.366619] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2081.372717] CR2: 0000000000000000 CR3: 0000000031228005 CR4: 00000000003606e0
[ 2081.380215] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2081.387713] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2081.395198] Call Trace:
[ 2081.398008]  _raw_spin_lock_irq+0x47/0x80
[ 2081.402387]  ? move_pages_to_lru+0x566/0xb80
[ 2081.407028]  move_pages_to_lru+0x566/0xb80
[ 2081.411495]  shrink_active_list+0x355/0xa70
[ 2081.416054]  shrink_lruvec+0x4f7/0x810
[ 2081.420176]  ? mem_cgroup_iter+0xb6/0x410
[ 2081.424558]  shrink_node+0x1cc/0x8d0
[ 2081.428510]  balance_pgdat+0x3cf/0x760
[ 2081.432634]  kswapd+0x232/0x660
[ 2081.436147]  ? finish_wait+0x80/0x80
[ 2081.440093]  ? balance_pgdat+0x760/0x760
[ 2081.444382]  kthread+0x17e/0x1b0
[ 2081.447975]  ? kthread_park+0xc0/0xc0
[ 2081.452005]  ret_from_fork+0x22/0x30

Thanks!
Alex
> 
>> @@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>>                         __ClearPageActive(page);
>>
>>                         if (unlikely(PageCompound(page))) {
>> -                               spin_unlock_irq(&pgdat->lru_lock);
>> +                               spin_unlock_irq(&lruvec->lru_lock);
>>                                 destroy_compound_page(page);
>> -                               spin_lock_irq(&pgdat->lru_lock);
>> +                               spin_lock_irq(&lruvec->lru_lock);
>>                         } else
>>                                 list_add(&page->lru, &pages_to_free);
>>
>>                         continue;
>>                 }
>>
>> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>>                 lru = page_lru(page);
>>                 nr_pages = hpage_nr_pages(page);


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift
  2020-07-27 17:29   ` Alexander Duyck
@ 2020-07-28 11:59     ` Alex Shi
  2020-07-28 14:17       ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-28 11:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen


>>   * Compaction is deferred when compaction fails to result in a page
>> - * allocation success. 1 << compact_defer_limit compactions are skipped up
>> + * allocation success. compact_defer_shift++, compactions are skipped up
>>   * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
>>   */
>>  void defer_compaction(struct zone *zone, int order)
> 
> So this doesn't read right. I wouldn't keep the "++," in the
> explanation, and if we are going to refer to a limit of "1 <<
> COMPACT_MAX_DEFER_SHIFT" then maybe this should be left as "1 <<
> compact_defer_shift".
> 

Thanks for comments! So is the changed patch fine?
--

From 80ffde4c8e13ba2ad1ad5175dbaef245c2fe49bc Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 26 May 2020 09:47:01 +0800
Subject: [PATCH] mm/compaction: correct the comments of compact_defer_shift

There is no compact_defer_limit. It should be compact_defer_shift in
use. and add compact_order_failed explanation.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/mmzone.h | 1 +
 mm/compaction.c        | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f6f884970511..14c668b7e793 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -512,6 +512,7 @@ struct zone {
 	 * On compaction failure, 1<<compact_defer_shift compactions
 	 * are skipped before trying again. The number attempted since
 	 * last failure is tracked with compact_considered.
+	 * compact_order_failed is the minimum compaction failed order.
 	 */
 	unsigned int		compact_considered;
 	unsigned int		compact_defer_shift;
diff --git a/mm/compaction.c b/mm/compaction.c
index 86375605faa9..4950240cd455 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page)
 
 /*
  * Compaction is deferred when compaction fails to result in a page
- * allocation success. 1 << compact_defer_limit compactions are skipped up
+ * allocation success. 1 << compact_defer_shift, compactions are skipped up
  * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
  */
 void defer_compaction(struct zone *zone, int order)
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift
  2020-07-28 11:59     ` Alex Shi
@ 2020-07-28 14:17       ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-07-28 14:17 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Tue, Jul 28, 2020 at 4:59 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
> >>   * Compaction is deferred when compaction fails to result in a page
> >> - * allocation success. 1 << compact_defer_limit compactions are skipped up
> >> + * allocation success. compact_defer_shift++, compactions are skipped up
> >>   * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
> >>   */
> >>  void defer_compaction(struct zone *zone, int order)
> >
> > So this doesn't read right. I wouldn't keep the "++," in the
> > explanation, and if we are going to refer to a limit of "1 <<
> > COMPACT_MAX_DEFER_SHIFT" then maybe this should be left as "1 <<
> > compact_defer_shift".
> >
>
> Thanks for comments! So is the changed patch fine?
> --
>
> From 80ffde4c8e13ba2ad1ad5175dbaef245c2fe49bc Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linux.alibaba.com>
> Date: Tue, 26 May 2020 09:47:01 +0800
> Subject: [PATCH] mm/compaction: correct the comments of compact_defer_shift
>
> There is no compact_defer_limit. It should be compact_defer_shift in
> use. and add compact_order_failed explanation.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  include/linux/mmzone.h | 1 +
>  mm/compaction.c        | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f6f884970511..14c668b7e793 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -512,6 +512,7 @@ struct zone {
>          * On compaction failure, 1<<compact_defer_shift compactions
>          * are skipped before trying again. The number attempted since
>          * last failure is tracked with compact_considered.
> +        * compact_order_failed is the minimum compaction failed order.
>          */
>         unsigned int            compact_considered;
>         unsigned int            compact_defer_shift;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 86375605faa9..4950240cd455 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page)
>
>  /*
>   * Compaction is deferred when compaction fails to result in a page
> - * allocation success. 1 << compact_defer_limit compactions are skipped up
> + * allocation success. 1 << compact_defer_shift, compactions are skipped up
>   * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
>   */
>  void defer_compaction(struct zone *zone, int order)

Yes, that looks better to me.

Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-28 11:19     ` Alex Shi
@ 2020-07-28 14:54       ` Alexander Duyck
  2020-07-29  1:00         ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-07-28 14:54 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov

On Tue, Jul 28, 2020 at 4:20 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/7/28 上午7:34, Alexander Duyck 写道:
> >> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
> >>                  *                                        list_add(&page->lru,)
> >>                  *     list_add(&page->lru,) //corrupt
> >>                  */
> >> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> >> +               if (new_lruvec != lruvec) {
> >> +                       if (lruvec)
> >> +                               spin_unlock_irq(&lruvec->lru_lock);
> >> +                       lruvec = lock_page_lruvec_irq(page);
> >> +               }
> >>                 SetPageLRU(page);
> >>
> >>                 if (unlikely(put_page_testzero(page))) {
> > I was going through the code of the entire patch set and I noticed
> > these changes in move_pages_to_lru. What is the reason for adding the
> > new_lruvec logic? My understanding is that we are moving the pages to
> > the lruvec provided are we not?If so why do we need to add code to get
> > a new lruvec? The code itself seems to stand out from the rest of the
> > patch as it is introducing new code instead of replacing existing
> > locking code, and it doesn't match up with the description of what
> > this function is supposed to do since it changes the lruvec.
>
> this new_lruvec is the replacement of removed line, as following code:
> >> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> This recheck is for the page move the root memcg, otherwise it cause the bug:

Okay, now I see where the issue is. You moved this code so now it has
a different effect than it did before. You are relocking things before
you needed to. Don't forget that when you came into this function you
already had the lock. In addition the patch is broken as it currently
stands as you aren't using similar logic in the code just above this
addition if you encounter an evictable page. As a result this is
really difficult to review as there are subtle bugs here.

I suppose the correct fix is to get rid of this line, but  it should
be placed everywhere the original function was calling
spin_lock_irq().

In addition I would consider changing the arguments/documentation for
move_pages_to_lru. You aren't moving the pages to lruvec, so there is
probably no need to pass that as an argument. Instead I would pass
pgdat since that isn't going to be moving and is the only thing you
actually derive based on the original lruvec.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-27 23:34   ` Alexander Duyck
  2020-07-28  7:15     ` Alex Shi
  2020-07-28 11:19     ` Alex Shi
@ 2020-07-28 15:39     ` Alex Shi
  2020-07-28 15:55       ` Alexander Duyck
  2 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-28 15:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov



在 2020/7/28 上午7:34, Alexander Duyck 写道:
> It might make more sense to look at modifying
> compact_unlock_should_abort and compact_lock_irqsave (which always
> returns true so should probably be a void) to address the deficiencies
> they have that make them unusable for you.

One of possible reuse for the func compact_unlock_should_abort, could be
like the following, the locked parameter reused different in 2 places.
but, it's seems no this style usage in kernel, isn't it?

Thanks
Alex

From 41d5ce6562f20f74bc6ac2db83e226ac28d56e90 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 28 Jul 2020 21:19:32 +0800
Subject: [PATCH] compaction polishing

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
---
 mm/compaction.c | 71 ++++++++++++++++++++++++---------------------------------
 1 file changed, 30 insertions(+), 41 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c28a43481f01..36fce988de3e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -479,20 +479,20 @@ static bool test_and_set_skip(struct compact_control *cc, struct page *page,
  *
  * Always returns true which makes it easier to track lock state in callers.
  */
-static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
+static void compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
 						struct compact_control *cc)
 	__acquires(lock)
 {
 	/* Track if the lock is contended in async mode */
 	if (cc->mode == MIGRATE_ASYNC && !cc->contended) {
 		if (spin_trylock_irqsave(lock, *flags))
-			return true;
+			return;
 
 		cc->contended = true;
 	}
 
 	spin_lock_irqsave(lock, *flags);
-	return true;
+	return;
 }
 
 /*
@@ -511,11 +511,11 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
  *		scheduled)
  */
 static bool compact_unlock_should_abort(spinlock_t *lock,
-		unsigned long flags, bool *locked, struct compact_control *cc)
+		unsigned long flags, void **locked, struct compact_control *cc)
 {
 	if (*locked) {
 		spin_unlock_irqrestore(lock, flags);
-		*locked = false;
+		*locked = NULL;
 	}
 
 	if (fatal_signal_pending(current)) {
@@ -543,7 +543,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	int nr_scanned = 0, total_isolated = 0;
 	struct page *cursor;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct compact_control *locked = NULL;
 	unsigned long blockpfn = *start_pfn;
 	unsigned int order;
 
@@ -565,7 +565,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		 */
 		if (!(blockpfn % SWAP_CLUSTER_MAX)
 		    && compact_unlock_should_abort(&cc->zone->lock, flags,
-								&locked, cc))
+							(void**)&locked, cc))
 			break;
 
 		nr_scanned++;
@@ -599,8 +599,8 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		 * recheck as well.
 		 */
 		if (!locked) {
-			locked = compact_lock_irqsave(&cc->zone->lock,
-								&flags, cc);
+			compact_lock_irqsave(&cc->zone->lock, &flags, cc);
+			locked = cc;
 
 			/* Recheck this is a buddy page under lock */
 			if (!PageBuddy(page))
@@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	struct lruvec *locked_lruvec = NULL;
+	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -847,21 +847,11 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
-			if (locked_lruvec) {
-				unlock_page_lruvec_irqrestore(locked_lruvec,
-									flags);
-				locked_lruvec = NULL;
-			}
-
-			if (fatal_signal_pending(current)) {
-				cc->contended = true;
-
-				low_pfn = 0;
-				goto fatal_pending;
-			}
-
-			cond_resched();
+		if (!(low_pfn % SWAP_CLUSTER_MAX)
+		    && compact_unlock_should_abort(&locked->lru_lock, flags,
+						(void**)&locked, cc)) {
+			low_pfn = 0;
+			goto fatal_pending;
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -932,9 +922,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
-				if (locked_lruvec) {
-					unlock_page_lruvec_irqrestore(locked_lruvec, flags);
-					locked_lruvec = NULL;
+				if (locked) {
+					unlock_page_lruvec_irqrestore(locked, flags);
+					locked = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -979,13 +969,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		/* If we already hold the lock, we can skip some rechecking */
-		if (lruvec != locked_lruvec) {
-			if (locked_lruvec)
-				unlock_page_lruvec_irqrestore(locked_lruvec,
+		if (lruvec != locked) {
+			if (locked)
+				unlock_page_lruvec_irqrestore(locked,
 									flags);
 
 			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
-			locked_lruvec = lruvec;
+			locked = lruvec;
 			rcu_read_unlock();
 
 			lruvec_memcg_debug(lruvec, page);
@@ -1041,9 +1031,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
-		if (locked_lruvec) {
-			unlock_page_lruvec_irqrestore(locked_lruvec, flags);
-			locked_lruvec = NULL;
+		if (locked) {
+			unlock_page_lruvec_irqrestore(locked, flags);
+			locked = NULL;
 		}
 		put_page(page);
 
@@ -1057,10 +1047,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * page anyway.
 		 */
 		if (nr_isolated) {
-			if (locked_lruvec) {
-				unlock_page_lruvec_irqrestore(locked_lruvec,
-									flags);
-				locked_lruvec = NULL;
+			if (locked) {
+				unlock_page_lruvec_irqrestore(locked, flags);
+				locked = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1087,8 +1076,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	page = NULL;
 
 isolate_abort:
-	if (locked_lruvec)
-		unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+	if (locked)
+		unlock_page_lruvec_irqrestore(locked, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-28 15:39     ` Alex Shi
@ 2020-07-28 15:55       ` Alexander Duyck
  2020-07-29  0:48         ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-07-28 15:55 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov

On Tue, Jul 28, 2020 at 8:40 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/7/28 上午7:34, Alexander Duyck 写道:
> > It might make more sense to look at modifying
> > compact_unlock_should_abort and compact_lock_irqsave (which always
> > returns true so should probably be a void) to address the deficiencies
> > they have that make them unusable for you.
>
> One of possible reuse for the func compact_unlock_should_abort, could be
> like the following, the locked parameter reused different in 2 places.
> but, it's seems no this style usage in kernel, isn't it?
>
> Thanks
> Alex
>
> From 41d5ce6562f20f74bc6ac2db83e226ac28d56e90 Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linux.alibaba.com>
> Date: Tue, 28 Jul 2020 21:19:32 +0800
> Subject: [PATCH] compaction polishing
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> ---
>  mm/compaction.c | 71 ++++++++++++++++++++++++---------------------------------
>  1 file changed, 30 insertions(+), 41 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c28a43481f01..36fce988de3e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -479,20 +479,20 @@ static bool test_and_set_skip(struct compact_control *cc, struct page *page,
>   *
>   * Always returns true which makes it easier to track lock state in callers.
>   */
> -static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
> +static void compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
>                                                 struct compact_control *cc)
>         __acquires(lock)
>  {
>         /* Track if the lock is contended in async mode */
>         if (cc->mode == MIGRATE_ASYNC && !cc->contended) {
>                 if (spin_trylock_irqsave(lock, *flags))
> -                       return true;
> +                       return;
>
>                 cc->contended = true;
>         }
>
>         spin_lock_irqsave(lock, *flags);
> -       return true;
> +       return;
>  }
>
>  /*
> @@ -511,11 +511,11 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
>   *             scheduled)
>   */
>  static bool compact_unlock_should_abort(spinlock_t *lock,
> -               unsigned long flags, bool *locked, struct compact_control *cc)
> +               unsigned long flags, void **locked, struct compact_control *cc)

Instead of passing both a void pointer and the lock why not just pass
the pointer to the lock pointer? You could combine lock and locked
into a single argument and save yourself some extra effort.

>  {
>         if (*locked) {
>                 spin_unlock_irqrestore(lock, flags);
> -               *locked = false;
> +               *locked = NULL;
>         }
>
>         if (fatal_signal_pending(current)) {
> @@ -543,7 +543,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>         int nr_scanned = 0, total_isolated = 0;
>         struct page *cursor;
>         unsigned long flags = 0;
> -       bool locked = false;
> +       struct compact_control *locked = NULL;
>         unsigned long blockpfn = *start_pfn;
>         unsigned int order;
>
> @@ -565,7 +565,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>                  */
>                 if (!(blockpfn % SWAP_CLUSTER_MAX)
>                     && compact_unlock_should_abort(&cc->zone->lock, flags,
> -                                                               &locked, cc))
> +                                                       (void**)&locked, cc))
>                         break;
>
>                 nr_scanned++;
> @@ -599,8 +599,8 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>                  * recheck as well.
>                  */
>                 if (!locked) {
> -                       locked = compact_lock_irqsave(&cc->zone->lock,
> -                                                               &flags, cc);
> +                       compact_lock_irqsave(&cc->zone->lock, &flags, cc);
> +                       locked = cc;
>
>                         /* Recheck this is a buddy page under lock */
>                         if (!PageBuddy(page))

If you have to provide a pointer you might as well just provide a
pointer to the zone lock since that is the thing that is actually
holding the lock at this point and would be consistent with your other
uses of the locked value. One possibility would be to change the
return type so that you return a pointer to the lock you are using.
Then the code would look closer to the lruvec code you are already
using.

> @@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         unsigned long nr_scanned = 0, nr_isolated = 0;
>         struct lruvec *lruvec;
>         unsigned long flags = 0;
> -       struct lruvec *locked_lruvec = NULL;
> +       struct lruvec *locked = NULL;
>         struct page *page = NULL, *valid_page = NULL;
>         unsigned long start_pfn = low_pfn;
>         bool skip_on_failure = false;
> @@ -847,21 +847,11 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * contention, to give chance to IRQs. Abort completely if
>                  * a fatal signal is pending.
>                  */
> -               if (!(low_pfn % SWAP_CLUSTER_MAX)) {
> -                       if (locked_lruvec) {
> -                               unlock_page_lruvec_irqrestore(locked_lruvec,
> -                                                                       flags);
> -                               locked_lruvec = NULL;
> -                       }
> -
> -                       if (fatal_signal_pending(current)) {
> -                               cc->contended = true;
> -
> -                               low_pfn = 0;
> -                               goto fatal_pending;
> -                       }
> -
> -                       cond_resched();
> +               if (!(low_pfn % SWAP_CLUSTER_MAX)
> +                   && compact_unlock_should_abort(&locked->lru_lock, flags,
> +                                               (void**)&locked, cc)) {

An added advantage to making locked a pointer to a spinlock is that
you could reduce the number of pointers you have to pass. Instead of
messing with &locked->lru_lock you would just pass the pointer to
locked resulting in fewer arguments being passed and if it is NULL you
skip the whole unlock pass.

> +                       low_pfn = 0;
> +                       goto fatal_pending;
>                 }
>
>                 if (!pfn_valid_within(low_pfn))
> @@ -932,9 +922,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(__PageMovable(page)) &&
>                                         !PageIsolated(page)) {
> -                               if (locked_lruvec) {
> -                                       unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> -                                       locked_lruvec = NULL;
> +                               if (locked) {
> +                                       unlock_page_lruvec_irqrestore(locked, flags);
> +                                       locked = NULL;
>                                 }
>
>                                 if (!isolate_movable_page(page, isolate_mode))
> @@ -979,13 +969,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
>                 /* If we already hold the lock, we can skip some rechecking */
> -               if (lruvec != locked_lruvec) {
> -                       if (locked_lruvec)
> -                               unlock_page_lruvec_irqrestore(locked_lruvec,
> +               if (lruvec != locked) {
> +                       if (locked)
> +                               unlock_page_lruvec_irqrestore(locked,
>                                                                         flags);
>
>                         compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> -                       locked_lruvec = lruvec;
> +                       locked = lruvec;
>                         rcu_read_unlock();
>
>                         lruvec_memcg_debug(lruvec, page);
> @@ -1041,9 +1031,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>
>  isolate_fail_put:
>                 /* Avoid potential deadlock in freeing page under lru_lock */
> -               if (locked_lruvec) {
> -                       unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> -                       locked_lruvec = NULL;
> +               if (locked) {
> +                       unlock_page_lruvec_irqrestore(locked, flags);
> +                       locked = NULL;
>                 }
>                 put_page(page);
>
> @@ -1057,10 +1047,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                  * page anyway.
>                  */
>                 if (nr_isolated) {
> -                       if (locked_lruvec) {
> -                               unlock_page_lruvec_irqrestore(locked_lruvec,
> -                                                                       flags);
> -                               locked_lruvec = NULL;
> +                       if (locked) {
> +                               unlock_page_lruvec_irqrestore(locked, flags);
> +                               locked = NULL;
>                         }
>                         putback_movable_pages(&cc->migratepages);
>                         cc->nr_migratepages = 0;
> @@ -1087,8 +1076,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         page = NULL;
>
>  isolate_abort:
> -       if (locked_lruvec)
> -               unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> +       if (locked)
> +               unlock_page_lruvec_irqrestore(locked, flags);
>         if (page) {
>                 SetPageLRU(page);
>                 put_page(page);
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-28 15:55       ` Alexander Duyck
@ 2020-07-29  0:48         ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-29  0:48 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov



在 2020/7/28 下午11:55, Alexander Duyck 写道:
>>  /*
>> @@ -511,11 +511,11 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
>>   *             scheduled)
>>   */
>>  static bool compact_unlock_should_abort(spinlock_t *lock,
>> -               unsigned long flags, bool *locked, struct compact_control *cc)
>> +               unsigned long flags, void **locked, struct compact_control *cc)
> Instead of passing both a void pointer and the lock why not just pass
> the pointer to the lock pointer? You could combine lock and locked
> into a single argument and save yourself some extra effort.
> 

the passed locked pointer could be rewrite in the func, that is unacceptable if it is a lock which could
be used other place.

And it is alreay dangerous to NULL a local pointer. In fact, I perfer the orignal verion, not so smart
but rebust enough for future changes, right?

Thanks
Alex


>>  {
>>         if (*locked) {
>>                 spin_unlock_irqrestore(lock, flags);
>> -               *locked = false;
>> +               *locked = NULL;
>>         }
>>
>>         if (fatal_signal_pending(current)) {


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-28 14:54       ` Alexander Duyck
@ 2020-07-29  1:00         ` Alex Shi
  2020-07-29  1:27           ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-29  1:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov



在 2020/7/28 下午10:54, Alexander Duyck 写道:
> On Tue, Jul 28, 2020 at 4:20 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> 在 2020/7/28 上午7:34, Alexander Duyck 写道:
>>>> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>>>>                  *                                        list_add(&page->lru,)
>>>>                  *     list_add(&page->lru,) //corrupt
>>>>                  */
>>>> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>>>> +               if (new_lruvec != lruvec) {
>>>> +                       if (lruvec)
>>>> +                               spin_unlock_irq(&lruvec->lru_lock);
>>>> +                       lruvec = lock_page_lruvec_irq(page);
>>>> +               }
>>>>                 SetPageLRU(page);
>>>>
>>>>                 if (unlikely(put_page_testzero(page))) {
>>> I was going through the code of the entire patch set and I noticed
>>> these changes in move_pages_to_lru. What is the reason for adding the
>>> new_lruvec logic? My understanding is that we are moving the pages to
>>> the lruvec provided are we not?If so why do we need to add code to get
>>> a new lruvec? The code itself seems to stand out from the rest of the
>>> patch as it is introducing new code instead of replacing existing
>>> locking code, and it doesn't match up with the description of what
>>> this function is supposed to do since it changes the lruvec.
>>
>> this new_lruvec is the replacement of removed line, as following code:
>>>> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>> This recheck is for the page move the root memcg, otherwise it cause the bug:
> 
> Okay, now I see where the issue is. You moved this code so now it has
> a different effect than it did before. You are relocking things before
> you needed to. Don't forget that when you came into this function you
> already had the lock. In addition the patch is broken as it currently
> stands as you aren't using similar logic in the code just above this
> addition if you encounter an evictable page. As a result this is
> really difficult to review as there are subtle bugs here.

Why you think its a bug? the relock only happens if locked lruvec is different.
and unlock the old one.

> 
> I suppose the correct fix is to get rid of this line, but  it should
> be placed everywhere the original function was calling
> spin_lock_irq().
> 
> In addition I would consider changing the arguments/documentation for
> move_pages_to_lru. You aren't moving the pages to lruvec, so there is
> probably no need to pass that as an argument. Instead I would pass
> pgdat since that isn't going to be moving and is the only thing you
> actually derive based on the original lruvec.

yes, The comments should be changed with the line was introduced from long ago. :)
Anyway, I am wondering if it worth a v18 version resend?

> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-29  1:00         ` Alex Shi
@ 2020-07-29  1:27           ` Alexander Duyck
  2020-07-29  2:27             ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-07-29  1:27 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov

On Tue, Jul 28, 2020 at 6:00 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/7/28 下午10:54, Alexander Duyck 写道:
> > On Tue, Jul 28, 2020 at 4:20 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> 在 2020/7/28 上午7:34, Alexander Duyck 写道:
> >>>> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
> >>>>                  *                                        list_add(&page->lru,)
> >>>>                  *     list_add(&page->lru,) //corrupt
> >>>>                  */
> >>>> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> >>>> +               if (new_lruvec != lruvec) {
> >>>> +                       if (lruvec)
> >>>> +                               spin_unlock_irq(&lruvec->lru_lock);
> >>>> +                       lruvec = lock_page_lruvec_irq(page);
> >>>> +               }
> >>>>                 SetPageLRU(page);
> >>>>
> >>>>                 if (unlikely(put_page_testzero(page))) {
> >>> I was going through the code of the entire patch set and I noticed
> >>> these changes in move_pages_to_lru. What is the reason for adding the
> >>> new_lruvec logic? My understanding is that we are moving the pages to
> >>> the lruvec provided are we not?If so why do we need to add code to get
> >>> a new lruvec? The code itself seems to stand out from the rest of the
> >>> patch as it is introducing new code instead of replacing existing
> >>> locking code, and it doesn't match up with the description of what
> >>> this function is supposed to do since it changes the lruvec.
> >>
> >> this new_lruvec is the replacement of removed line, as following code:
> >>>> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
> >> This recheck is for the page move the root memcg, otherwise it cause the bug:
> >
> > Okay, now I see where the issue is. You moved this code so now it has
> > a different effect than it did before. You are relocking things before
> > you needed to. Don't forget that when you came into this function you
> > already had the lock. In addition the patch is broken as it currently
> > stands as you aren't using similar logic in the code just above this
> > addition if you encounter an evictable page. As a result this is
> > really difficult to review as there are subtle bugs here.
>
> Why you think its a bug? the relock only happens if locked lruvec is different.
> and unlock the old one.

The section I am talking about with the bug is this section here:
       while (!list_empty(list)) {
+               struct lruvec *new_lruvec = NULL;
+
                page = lru_to_page(list);
                VM_BUG_ON_PAGE(PageLRU(page), page);
                list_del(&page->lru);
                if (unlikely(!page_evictable(page))) {
-                       spin_unlock_irq(&pgdat->lru_lock);
+                       spin_unlock_irq(&lruvec->lru_lock);
                        putback_lru_page(page);
-                       spin_lock_irq(&pgdat->lru_lock);
+                       spin_lock_irq(&lruvec->lru_lock);
                        continue;
                }

Basically it probably is not advisable to be retaking the
lruvec->lru_lock directly as the lruvec may have changed so it
wouldn't be correct for the next page. It would make more sense to be
using your API and calling unlock_page_lruvec_irq and
lock_page_lruvec_irq instead of using the lock directly.

> >
> > I suppose the correct fix is to get rid of this line, but  it should
> > be placed everywhere the original function was calling
> > spin_lock_irq().
> >
> > In addition I would consider changing the arguments/documentation for
> > move_pages_to_lru. You aren't moving the pages to lruvec, so there is
> > probably no need to pass that as an argument. Instead I would pass
> > pgdat since that isn't going to be moving and is the only thing you
> > actually derive based on the original lruvec.
>
> yes, The comments should be changed with the line was introduced from long ago. :)
> Anyway, I am wondering if it worth a v18 version resend?

So I have been looking over the function itself and I wonder if it
isn't worth looking at rewriting this to optimize the locking behavior
to minimize the number of times we have to take the LRU lock. I have
some code I am working on that I plan to submit as an RFC in the next
day or so after I can get it smoke tested. The basic idea would be to
defer returning the evictiable pages or freeing the compound pages
until after we have processed the pages that can be moved while still
holding the lock. I would think it should reduce the lock contention
significantly while improving the throughput.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-29  1:27           ` Alexander Duyck
@ 2020-07-29  2:27             ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-29  2:27 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov



在 2020/7/29 上午9:27, Alexander Duyck 写道:
> On Tue, Jul 28, 2020 at 6:00 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> 在 2020/7/28 下午10:54, Alexander Duyck 写道:
>>> On Tue, Jul 28, 2020 at 4:20 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> 在 2020/7/28 上午7:34, Alexander Duyck 写道:
>>>>>> @@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>>>>>>                  *                                        list_add(&page->lru,)
>>>>>>                  *     list_add(&page->lru,) //corrupt
>>>>>>                  */
>>>>>> +               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>>>>>> +               if (new_lruvec != lruvec) {
>>>>>> +                       if (lruvec)
>>>>>> +                               spin_unlock_irq(&lruvec->lru_lock);
>>>>>> +                       lruvec = lock_page_lruvec_irq(page);
>>>>>> +               }
>>>>>>                 SetPageLRU(page);
>>>>>>
>>>>>>                 if (unlikely(put_page_testzero(page))) {
>>>>> I was going through the code of the entire patch set and I noticed
>>>>> these changes in move_pages_to_lru. What is the reason for adding the
>>>>> new_lruvec logic? My understanding is that we are moving the pages to
>>>>> the lruvec provided are we not?If so why do we need to add code to get
>>>>> a new lruvec? The code itself seems to stand out from the rest of the
>>>>> patch as it is introducing new code instead of replacing existing
>>>>> locking code, and it doesn't match up with the description of what
>>>>> this function is supposed to do since it changes the lruvec.
>>>>
>>>> this new_lruvec is the replacement of removed line, as following code:
>>>>>> -               lruvec = mem_cgroup_page_lruvec(page, pgdat);
>>>> This recheck is for the page move the root memcg, otherwise it cause the bug:
>>>
>>> Okay, now I see where the issue is. You moved this code so now it has
>>> a different effect than it did before. You are relocking things before
>>> you needed to. Don't forget that when you came into this function you
>>> already had the lock. In addition the patch is broken as it currently
>>> stands as you aren't using similar logic in the code just above this
>>> addition if you encounter an evictable page. As a result this is
>>> really difficult to review as there are subtle bugs here.
>>
>> Why you think its a bug? the relock only happens if locked lruvec is different.
>> and unlock the old one.
> 
> The section I am talking about with the bug is this section here:
>        while (!list_empty(list)) {
> +               struct lruvec *new_lruvec = NULL;
> +
>                 page = lru_to_page(list);
>                 VM_BUG_ON_PAGE(PageLRU(page), page);
>                 list_del(&page->lru);
>                 if (unlikely(!page_evictable(page))) {
> -                       spin_unlock_irq(&pgdat->lru_lock);
> +                       spin_unlock_irq(&lruvec->lru_lock);
>                         putback_lru_page(page);
> -                       spin_lock_irq(&pgdat->lru_lock);
> +                       spin_lock_irq(&lruvec->lru_lock);

It would be still fine. The lruvec->lru_lock will be checked again before
we take and use it. 
And this lock will optimized in patch 19th which did by Hugh Dickins.

>                         continue;
>                 }
> 
> Basically it probably is not advisable to be retaking the
> lruvec->lru_lock directly as the lruvec may have changed so it
> wouldn't be correct for the next page. It would make more sense to be
> using your API and calling unlock_page_lruvec_irq and
> lock_page_lruvec_irq instead of using the lock directly.
> 
>>>
>>> I suppose the correct fix is to get rid of this line, but  it should
>>> be placed everywhere the original function was calling
>>> spin_lock_irq().
>>>
>>> In addition I would consider changing the arguments/documentation for
>>> move_pages_to_lru. You aren't moving the pages to lruvec, so there is
>>> probably no need to pass that as an argument. Instead I would pass
>>> pgdat since that isn't going to be moving and is the only thing you
>>> actually derive based on the original lruvec.
>>
>> yes, The comments should be changed with the line was introduced from long ago. :)
>> Anyway, I am wondering if it worth a v18 version resend?
> 
> So I have been looking over the function itself and I wonder if it
> isn't worth looking at rewriting this to optimize the locking behavior
> to minimize the number of times we have to take the LRU lock. I have
> some code I am working on that I plan to submit as an RFC in the next
> day or so after I can get it smoke tested. The basic idea would be to
> defer returning the evictiable pages or freeing the compound pages
> until after we have processed the pages that can be moved while still
> holding the lock. I would think it should reduce the lock contention
> significantly while improving the throughput.
> 

I had tried once, but the freeing page cross onto release_pages which hard to deal with.
I am very glad to wait your patch, and hope it could be resovled. :)

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU
  2020-07-25 12:59 ` [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU Alex Shi
@ 2020-07-29  3:53   ` Alex Shi
  2020-08-05 22:43     ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-29  3:53 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Michal Hocko, Vladimir Davydov

rewrite the commit log.

From 9310c359b0049e3cc9827b771dc583d504bbf022 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 25 Apr 2020 12:03:30 +0800
Subject: [PATCH v17 13/23] mm/lru: introduce TestClearPageLRU

Currently lru_lock still guards both lru list and page's lru bit, that's
ok. but if we want to use specific lruvec lock on the page, we need to
pin down the page's lruvec/memcg during locking. Just taking lruvec
lock first may be undermined by the page's memcg charge/migration. To
fix this problem, we could clear the lru bit out of locking and use
it as pin down action to block the page isolation in memcg changing.

So now a standard steps of page isolation is following:
	1, get_page(); 	       #pin the page avoid to be free
	2, TestClearPageLRU(); #block other isolation like memcg change
	3, spin_lock on lru_lock; #serialize lru list access
	4, delete page from lru list;
The step 2 could be optimzed/replaced in scenarios which page is
unlikely be accessed or be moved between memcgs.

This patch start with the first part: TestClearPageLRU, which combines
PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This
function will be used as page isolation precondition to prevent other
isolations some where else. Then there are may !PageLRU page on lru
list, need to remove BUG() checking accordingly.

There 2 rules for lru bit now:
1, the lru bit still indicate if a page on lru list, just in some
   temporary moment(isolating), the page may have no lru bit when
   it's on lru list.  but the page still must be on lru list when the
   lru bit set.
2, have to remove lru bit before delete it from lru list.

Hugh Dickins pointed that when a page is in free path and no one is
possible to take it, non atomic lru bit clearing is better, like in
__page_cache_release and release_pages.
And no need get_page() before lru bit clear in isolate_lru_page,
since it '(1) Must be called with an elevated refcount on the page'.

As Andrew Morton mentioned this change would dirty cacheline for page
isn't on LRU. But the lost would be acceptable with Rong Chen
<rong.a.chen@intel.com> report:
https://lkml.org/lkml/2020/3/4/173

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/page-flags.h |  1 +
 mm/mlock.c                 |  3 +--
 mm/swap.c                  |  6 ++----
 mm/vmscan.c                | 18 +++++++-----------
 4 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6be1aa559b1e..9554ed1387dc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size)
 PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
 	__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
 PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+	TESTCLEARFLAG(LRU, lru, PF_HEAD)
 PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
 PAGEFLAG(Workingset, workingset, PF_HEAD)
diff --git a/mm/mlock.c b/mm/mlock.c
index f8736136fad7..228ba5a8e0a5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -108,13 +108,12 @@ void mlock_vma_page(struct page *page)
  */
 static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
 {
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		struct lruvec *lruvec;
 
 		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
-		ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		return true;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index f645965fde0e..5092fe9c8c47 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
 		struct lruvec *lruvec;
 		unsigned long flags;
 
+		__ClearPageLRU(page);
 		spin_lock_irqsave(&pgdat->lru_lock, flags);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-		__ClearPageLRU(page);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 	}
@@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
 				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
 			}
 
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
-			VM_BUG_ON_PAGE(!PageLRU(page), page);
 			__ClearPageLRU(page);
+			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c1c4259b4de5..4183ae6b54b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1671,8 +1671,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
 
-		VM_BUG_ON_PAGE(!PageLRU(page), page);
-
 		nr_pages = compound_nr(page);
 		total_scan += nr_pages;
 
@@ -1769,21 +1767,19 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
+		int lru = page_lru(page);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		get_page(page);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		if (PageLRU(page)) {
-			int lru = page_lru(page);
-			get_page(page);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, lru);
-			ret = 0;
-		}
+		spin_lock_irq(&pgdat->lru_lock);
+		del_page_from_lru_list(page, lruvec, lru);
 		spin_unlock_irq(&pgdat->lru_lock);
+		ret = 0;
 	}
+
 	return ret;
 }
 
-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
  2020-07-27 23:34   ` Alexander Duyck
@ 2020-07-29  3:54   ` Alex Shi
  2020-08-06  7:41   ` Alex Shi
  2 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-07-29  3:54 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Michal Hocko, Vladimir Davydov

rewrite the commit log.

From 5e9340444632d69cf10c8db521577d0637819c5f Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 26 May 2020 17:27:52 +0800
Subject: [PATCH v17 17/23] mm/lru: replace pgdat lru_lock with lruvec lock

This patch moves per node lru_lock into lruvec, thus bring a lru_lock for
each of memcg per node. So on a large machine, each of memcg don't
have to suffer from per node pgdat->lru_lock competition. They could go
fast with their self lru_lock.

After move memcg charge before lru inserting, page isolation could
serialize page's memcg, then per memcg lruvec lock is stable and could
replace per node lru lock.

In func isolate_migratepages_block, compact_unlock_should_abort is
opend, and lock_page_lruvec logical is embedded for tight process.
Also add a debug func in locking which may give some clues if there are
sth out of hands.

According to Daniel Jordan's suggestion, I run 208 'dd' with on 104
containers on a 2s * 26cores * HT box with a modefied case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this and later patches, the readtwice performance increases about
80% within concurrent containers.

On a large but non memcg machine, the extra recheck if page's lruvec
should be changed in few place, that increase a little lock holding
time, and a little regression.

Hugh Dickins helped on patch polish, thanks!

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
---
 include/linux/memcontrol.h |  58 +++++++++++++++++++++++++
 include/linux/mmzone.h     |   2 +
 mm/compaction.c            |  67 ++++++++++++++++++-----------
 mm/huge_memory.c           |  11 ++---
 mm/memcontrol.c            |  63 ++++++++++++++++++++++++++-
 mm/mlock.c                 |  47 +++++++++++++-------
 mm/mmzone.c                |   1 +
 mm/swap.c                  | 104 +++++++++++++++++++++------------------------
 mm/vmscan.c                |  70 ++++++++++++++++--------------
 9 files changed, 288 insertions(+), 135 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e77197a62809..258901021c6c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -411,6 +411,19 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+						unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -892,6 +905,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
 
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irq(&pgdat->__lruvec.lru_lock);
+	return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+		unsigned long *flagsp)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+	return &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -1126,6 +1164,10 @@ static inline void count_memcg_page_event(struct page *page,
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1255,6 +1297,22 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
 	return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
 }
 
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+	spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+	spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+		unsigned long flags)
+{
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 14c668b7e793..30b961a9a749 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -249,6 +249,8 @@ enum lruvec_flags {
 };
 
 struct lruvec {
+	/* per lruvec lru_lock for memcg */
+	spinlock_t			lru_lock;
 	struct list_head		lists[NR_LRU_LISTS];
 	/*
 	 * These track the cost of reclaiming one LRU - file or anon -
diff --git a/mm/compaction.c b/mm/compaction.c
index 72f135330f81..c28a43481f01 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct lruvec *lruvec;
 	unsigned long flags = 0;
-	bool locked = false;
+	struct lruvec *locked_lruvec = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * contention, to give chance to IRQs. Abort completely if
 		 * a fatal signal is pending.
 		 */
-		if (!(low_pfn % SWAP_CLUSTER_MAX)
-		    && compact_unlock_should_abort(&pgdat->lru_lock,
-					    flags, &locked, cc)) {
-			low_pfn = 0;
-			goto fatal_pending;
+		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+			if (locked_lruvec) {
+				unlock_page_lruvec_irqrestore(locked_lruvec,
+									flags);
+				locked_lruvec = NULL;
+			}
+
+			if (fatal_signal_pending(current)) {
+				cc->contended = true;
+
+				low_pfn = 0;
+				goto fatal_pending;
+			}
+
+			cond_resched();
 		}
 
 		if (!pfn_valid_within(low_pfn))
@@ -922,10 +932,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
-				if (locked) {
-					spin_unlock_irqrestore(&pgdat->lru_lock,
-									flags);
-					locked = false;
+				if (locked_lruvec) {
+					unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+					locked_lruvec = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -966,10 +975,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		rcu_read_lock();
+		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
 		/* If we already hold the lock, we can skip some rechecking */
-		if (!locked) {
-			locked = compact_lock_irqsave(&pgdat->lru_lock,
-								&flags, cc);
+		if (lruvec != locked_lruvec) {
+			if (locked_lruvec)
+				unlock_page_lruvec_irqrestore(locked_lruvec,
+									flags);
+
+			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+			locked_lruvec = lruvec;
+			rcu_read_unlock();
+
+			lruvec_memcg_debug(lruvec, page);
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
@@ -988,9 +1007,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		}
-
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		} else
+			rcu_read_unlock();
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1023,9 +1041,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
-		if (locked) {
-			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			locked = false;
+		if (locked_lruvec) {
+			unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+			locked_lruvec = NULL;
 		}
 		put_page(page);
 
@@ -1039,9 +1057,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * page anyway.
 		 */
 		if (nr_isolated) {
-			if (locked) {
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-				locked = false;
+			if (locked_lruvec) {
+				unlock_page_lruvec_irqrestore(locked_lruvec,
+									flags);
+				locked_lruvec = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1068,8 +1087,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	page = NULL;
 
 isolate_abort:
-	if (locked)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (locked_lruvec)
+		unlock_page_lruvec_irqrestore(locked_lruvec, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 28538444197b..a0cb95891ae5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2346,7 +2346,7 @@ static void lru_add_page_tail(struct page *head, struct page *page_tail,
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(page_tail), head);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), head);
-	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
@@ -2430,7 +2430,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 			      pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	pg_data_t *pgdat = page_pgdat(head);
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
@@ -2447,10 +2446,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock(&pgdat->lru_lock);
-
-	lruvec = mem_cgroup_page_lruvec(head, pgdat);
+	/* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+	lruvec = lock_page_lruvec(head);
 
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
@@ -2471,7 +2468,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	spin_unlock(&pgdat->lru_lock);
+	unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, HPAGE_PMD_ORDER);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20c8ed69a930..d6746656cc39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1196,6 +1196,19 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!page->mem_cgroup)
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+	else
+		VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+}
+#endif
+
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
@@ -1215,7 +1228,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 		goto out;
 	}
 
-	memcg = page->mem_cgroup;
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	memcg = READ_ONCE(page->mem_cgroup);
 	/*
 	 * Swapcache readahead pages are added to the LRU - and
 	 * possibly migrated - before they are charged.
@@ -1236,6 +1250,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	return lruvec;
 }
 
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irq(&lruvec->lru_lock);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+	struct lruvec *lruvec;
+	struct pglist_data *pgdat = page_pgdat(page);
+
+	rcu_read_lock();
+	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+	rcu_read_unlock();
+
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
 /**
  * mem_cgroup_update_lru_size - account for adding or removing an lru page
  * @lruvec: mem_cgroup per zone lru vector
@@ -2999,7 +3058,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * lruvec->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index 228ba5a8e0a5..5d40d259a931 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,12 +106,10 @@ void mlock_vma_page(struct page *page)
  * Isolate a page from LRU with optional get_page() pin.
  * Assumes lru_lock already held and page already pinned.
  */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+static bool __munlock_isolate_lru_page(struct page *page,
+				struct lruvec *lruvec, bool getpage)
 {
 	if (TestClearPageLRU(page)) {
-		struct lruvec *lruvec;
-
-		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
 		if (getpage)
 			get_page(page);
 		del_page_from_lru_list(page, lruvec, page_lru(page));
@@ -181,7 +179,7 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -189,11 +187,16 @@ unsigned int munlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	/*
-	 * Serialize with any parallel __split_huge_page_refcount() which
+	 * Serialize split tail pages in __split_huge_page_tail() which
 	 * might otherwise copy PageMlocked to part of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
+	 * TestClearPageLRU can't be used here to block page isolation, since
+	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
+	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
+	 * wrong lru list here. So relay on PageLocked to stop lruvec change
+	 * in mem_cgroup_move_account().
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 
 	if (!TestClearPageMlocked(page)) {
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
@@ -204,15 +207,15 @@ unsigned int munlock_vma_page(struct page *page)
 	nr_pages = hpage_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, true)) {
-		spin_unlock_irq(&pgdat->lru_lock);
+	if (__munlock_isolate_lru_page(page, lruvec, true)) {
+		unlock_page_lruvec_irq(lruvec);
 		__munlock_isolated_page(page);
 		goto out;
 	}
 	__munlock_isolation_failed(page);
 
 unlock_out:
-	spin_unlock_irq(&pgdat->lru_lock);
+	unlock_page_lruvec_irq(lruvec);
 
 out:
 	return nr_pages - 1;
@@ -292,23 +295,34 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
+	struct lruvec *lruvec = NULL;
 	int pgrescued = 0;
 
 	pagevec_init(&pvec_putback);
 
 	/* Phase 1: page isolation */
-	spin_lock_irq(&zone->zone_pgdat->lru_lock);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
+		struct lruvec *new_lruvec;
+
+		/* block memcg change in mem_cgroup_move_account */
+		lock_page_memcg(page);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (new_lruvec != lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
+		}
 
 		if (TestClearPageMlocked(page)) {
 			/*
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, false))
+			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+				unlock_page_memcg(page);
 				continue;
-			else
+			} else
 				__munlock_isolation_failed(page);
 		} else {
 			delta_munlocked++;
@@ -320,11 +334,14 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		 * pin. We cannot do it under lru_lock however. If it's
 		 * the last pin, __page_cache_release() would deadlock.
 		 */
+		unlock_page_memcg(page);
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
-	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+	if (lruvec) {
+		__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
+		unlock_page_lruvec_irq(lruvec);
+	}
 
 	/* Now we can release pins of pages that we are not munlocking */
 	pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
 	enum lru_list lru;
 
 	memset(lruvec, 0, sizeof(struct lruvec));
+	spin_lock_init(&lruvec->lru_lock);
 
 	for_each_lru(lru)
 		INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/swap.c b/mm/swap.c
index 3029b3f74811..09edac441eb6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,15 +79,13 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 static void __page_cache_release(struct page *page)
 {
 	if (PageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		unsigned long flags;
 
 		__ClearPageLRU(page);
-		spin_lock_irqsave(&pgdat->lru_lock, flags);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		lruvec = lock_page_lruvec_irqsave(page, &flags);
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
 	__ClearPageWaiters(page);
 }
@@ -206,32 +204,30 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
 	void (*move_fn)(struct page *page, struct lruvec *lruvec))
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
-
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
-		}
+		struct lruvec *new_lruvec;
 
 		/* block memcg migration during page moving between lru */
 		if (!TestClearPageLRU(page))
 			continue;
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
+		}
+
 		(*move_fn)(page, lruvec);
 
 		SetPageLRU(page);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
@@ -274,9 +270,8 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 {
 	do {
 		unsigned long lrusize;
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-		spin_lock_irq(&pgdat->lru_lock);
+		spin_lock_irq(&lruvec->lru_lock);
 		/* Record cost event */
 		if (file)
 			lruvec->file_cost += nr_pages;
@@ -300,7 +295,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
 			lruvec->file_cost /= 2;
 			lruvec->anon_cost /= 2;
 		}
-		spin_unlock_irq(&pgdat->lru_lock);
+		spin_unlock_irq(&lruvec->lru_lock);
 	} while ((lruvec = parent_lruvec(lruvec)));
 }
 
@@ -365,12 +360,13 @@ static inline void activate_page_drain(int cpu)
 void activate_page(struct page *page)
 {
 	pg_data_t *pgdat = page_pgdat(page);
+	struct lruvec *lruvec;
 
 	page = compound_head(page);
-	spin_lock_irq(&pgdat->lru_lock);
+	lruvec = lock_page_lruvec_irq(page);
 	if (PageLRU(page))
-		__activate_page(page, mem_cgroup_page_lruvec(page, pgdat));
-	spin_unlock_irq(&pgdat->lru_lock);
+		__activate_page(page, lruvec);
+	unlock_page_lruvec_irq(lruvec);
 }
 #endif
 
@@ -817,8 +813,7 @@ void release_pages(struct page **pages, int nr)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
-	struct pglist_data *locked_pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long uninitialized_var(flags);
 	unsigned int uninitialized_var(lock_batch);
 
@@ -828,21 +823,20 @@ void release_pages(struct page **pages, int nr)
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
-		 * same pgdat. The lock is held only if pgdat != NULL.
+		 * same lruvec. The lock is held only if lruvec != NULL.
 		 */
-		if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
-			spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-			locked_pgdat = NULL;
+		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 
 		if (is_huge_zero_page(page))
 			continue;
 
 		if (is_zone_device_page(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
-						       flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
@@ -861,28 +855,28 @@ void release_pages(struct page **pages, int nr)
 			continue;
 
 		if (PageCompound(page)) {
-			if (locked_pgdat) {
-				spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
-				locked_pgdat = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			__put_compound_page(page);
 			continue;
 		}
 
 		if (PageLRU(page)) {
-			struct pglist_data *pgdat = page_pgdat(page);
+			struct lruvec *new_lruvec;
 
-			if (pgdat != locked_pgdat) {
-				if (locked_pgdat)
-					spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+			new_lruvec = mem_cgroup_page_lruvec(page,
+							page_pgdat(page));
+			if (new_lruvec != lruvec) {
+				if (lruvec)
+					unlock_page_lruvec_irqrestore(lruvec,
 									flags);
 				lock_batch = 0;
-				locked_pgdat = pgdat;
-				spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+				lruvec = lock_page_lruvec_irqsave(page, &flags);
 			}
 
 			__ClearPageLRU(page);
-			lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
@@ -892,8 +886,8 @@ void release_pages(struct page **pages, int nr)
 
 		list_add(&page->lru, &pages_to_free);
 	}
-	if (locked_pgdat)
-		spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 
 	mem_cgroup_uncharge_list(&pages_to_free);
 	free_unref_page_list(&pages_to_free);
@@ -981,26 +975,24 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
 void __pagevec_lru_add(struct pagevec *pvec)
 {
 	int i;
-	struct pglist_data *pgdat = NULL;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
-			pgdat = pagepgdat;
-			spin_lock_irqsave(&pgdat->lru_lock, flags);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = lock_page_lruvec_irqsave(page, &flags);
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		__pagevec_lru_add_fn(page, lruvec);
 	}
-	if (pgdat)
-		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f77748adc340..168c1659e430 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1774,15 +1774,13 @@ int isolate_lru_page(struct page *page)
 	WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");
 
 	if (TestClearPageLRU(page)) {
-		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
 		int lru = page_lru(page);
 
 		get_page(page);
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		spin_lock_irq(&pgdat->lru_lock);
+		lruvec = lock_page_lruvec_irq(page);
 		del_page_from_lru_list(page, lruvec, lru);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 		ret = 0;
 	}
 
@@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 						     struct list_head *list)
 {
-	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
 	struct page *page;
+	struct lruvec *orig_lruvec = lruvec;
 	enum lru_list lru;
 
 	while (!list_empty(list)) {
+		struct lruvec *new_lruvec = NULL;
+
 		page = lru_to_page(list);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&pgdat->lru_lock);
+			spin_unlock_irq(&lruvec->lru_lock);
 			putback_lru_page(page);
-			spin_lock_irq(&pgdat->lru_lock);
+			spin_lock_irq(&lruvec->lru_lock);
 			continue;
 		}
 
@@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		 *                                        list_add(&page->lru,)
 		 *     list_add(&page->lru,) //corrupt
 		 */
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (new_lruvec != lruvec) {
+			if (lruvec)
+				spin_unlock_irq(&lruvec->lru_lock);
+			lruvec = lock_page_lruvec_irq(page);
+		}
 		SetPageLRU(page);
 
 		if (unlikely(put_page_testzero(page))) {
@@ -1883,16 +1889,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&pgdat->lru_lock);
+				spin_unlock_irq(&lruvec->lru_lock);
 				destroy_compound_page(page);
-				spin_lock_irq(&pgdat->lru_lock);
+				spin_lock_irq(&lruvec->lru_lock);
 			} else
 				list_add(&page->lru, &pages_to_free);
 
 			continue;
 		}
 
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		lru = page_lru(page);
 		nr_pages = hpage_nr_pages(page);
 
@@ -1902,6 +1907,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 		if (PageActive(page))
 			workingset_age_nonresident(lruvec, nr_pages);
 	}
+	if (orig_lruvec != lruvec) {
+		if (lruvec)
+			spin_unlock_irq(&lruvec->lru_lock);
+		spin_lock_irq(&orig_lruvec->lru_lock);
+	}
 
 	/*
 	 * To save our caller's stack, now use input list for pages to free.
@@ -1957,7 +1967,7 @@ static int current_may_throttle(void)
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, lru);
@@ -1969,7 +1979,7 @@ static int current_may_throttle(void)
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	if (nr_taken == 0)
 		return 0;
@@ -1977,7 +1987,7 @@ static int current_may_throttle(void)
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 	move_pages_to_lru(lruvec, &page_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -1986,7 +1996,7 @@ static int current_may_throttle(void)
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2039,7 +2049,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	lru_add_drain();
 
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, lru);
@@ -2049,7 +2059,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_vm_events(PGREFILL, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	while (!list_empty(&l_hold)) {
 		cond_resched();
@@ -2095,7 +2105,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&lruvec->lru_lock);
 
 	nr_activate = move_pages_to_lru(lruvec, &l_active);
 	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2106,7 +2116,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
@@ -2696,10 +2706,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	/*
 	 * Determine the scan balance between anon and file LRUs.
 	 */
-	spin_lock_irq(&pgdat->lru_lock);
+	spin_lock_irq(&target_lruvec->lru_lock);
 	sc->anon_cost = target_lruvec->anon_cost;
 	sc->file_cost = target_lruvec->file_cost;
-	spin_unlock_irq(&pgdat->lru_lock);
+	spin_unlock_irq(&target_lruvec->lru_lock);
 
 	/*
 	 * Target desirable inactive:active list ratios for the anon
@@ -4275,24 +4285,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
  */
 void check_move_unevictable_pages(struct pagevec *pvec)
 {
-	struct lruvec *lruvec;
-	struct pglist_data *pgdat = NULL;
+	struct lruvec *lruvec = NULL;
 	int pgscanned = 0;
 	int pgrescued = 0;
 	int i;
 
 	for (i = 0; i < pvec->nr; i++) {
 		struct page *page = pvec->pages[i];
-		struct pglist_data *pagepgdat = page_pgdat(page);
+		struct lruvec *new_lruvec;
 
 		pgscanned++;
-		if (pagepgdat != pgdat) {
-			if (pgdat)
-				spin_unlock_irq(&pgdat->lru_lock);
-			pgdat = pagepgdat;
-			spin_lock_irq(&pgdat->lru_lock);
+		new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+		if (lruvec != new_lruvec) {
+			if (lruvec)
+				unlock_page_lruvec_irq(lruvec);
+			lruvec = lock_page_lruvec_irq(page);
 		}
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
 		if (!PageLRU(page) || !PageUnevictable(page))
 			continue;
@@ -4308,10 +4316,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 		}
 	}
 
-	if (pgdat) {
+	if (lruvec) {
 		__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 		__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
-		spin_unlock_irq(&pgdat->lru_lock);
+		unlock_page_lruvec_irq(lruvec);
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
-- 
1.8.3.1




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-27  5:40 ` [PATCH v17 00/21] per memcg lru lock Alex Shi
@ 2020-07-29 14:49   ` Alex Shi
  2020-07-29 18:06     ` Hugh Dickins
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-29 14:49 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen


Is there any comments or suggestion for this patchset?
Any hints will be very appreciated.

Thanks
Alex

在 2020/7/27 下午1:40, Alex Shi 写道:

> A standard for new page isolation steps like the following:
> 1, get_page(); #pin the page avoid be free
> 2, TestClearPageLRU(); #serialize other isolation, also memcg change
> 3, spin_lock on lru_lock; #serialize lru list access
> The step 2 could be optimzed/replaced in scenarios which page is unlikely
> be accessed by others.
> 
> 
> 
> 在 2020/7/25 下午8:59, Alex Shi 写道:
>> The new version which bases on v5.8-rc6. It includes Hugh Dickins fix in 
>> mm/swap.c and mm/mlock.c fix which Alexander Duyck pointed out, then
>> removes 'mm/mlock: reorder isolation sequence during munlock' 
>>
>> Hi Johanness & Hugh & Alexander & Willy,
>>
>> Could you like to give a reviewed by since you address much of issue and
>> give lots of suggestions! Many thanks!


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function
  2020-07-25 12:59 ` [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function Alex Shi
@ 2020-07-29 17:52   ` Alexander Duyck
  2020-07-30  6:08     ` Alex Shi
  2020-07-31 21:14   ` [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid alexander.h.duyck
  1 sibling, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-07-29 17:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Thomas Gleixner, Andrey Ryabinin

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Use this new function to replace repeated same code, no func change.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/memcontrol.h | 40 ++++++++++++++++++++++++++++++++++++++++
>  mm/mlock.c                 |  9 +--------
>  mm/swap.c                  | 33 +++++++--------------------------
>  mm/vmscan.c                |  8 +-------
>  4 files changed, 49 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 258901021c6c..6e670f991b42 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1313,6 +1313,46 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
>         spin_unlock_irqrestore(&lruvec->lru_lock, flags);
>  }
>
> +/* Don't lock again iff page's lruvec locked */
> +static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
> +               struct lruvec *locked_lruvec)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +       bool locked;
> +
> +       rcu_read_lock();
> +       locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> +       rcu_read_unlock();
> +
> +       if (locked)
> +               return locked_lruvec;
> +
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irq(locked_lruvec);
> +
> +       return lock_page_lruvec_irq(page);
> +}
> +
> +/* Don't lock again iff page's lruvec locked */
> +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
> +               struct lruvec *locked_lruvec, unsigned long *flags)
> +{
> +       struct pglist_data *pgdat = page_pgdat(page);
> +       bool locked;
> +
> +       rcu_read_lock();
> +       locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> +       rcu_read_unlock();
> +
> +       if (locked)
> +               return locked_lruvec;
> +
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> +
> +       return lock_page_lruvec_irqsave(page, flags);
> +}
> +

So looking these over they seem to be pretty inefficient for what they
do. Basically in worst case (locked_lruvec == NULL) you end up calling
mem_cgoup_page_lruvec and all the rcu_read_lock/unlock a couple times
for a single page. It might make more sense to structure this like:
if (locked_lruvec) {
    if (lruvec_holds_page_lru_lock(page, locked_lruvec))
        return locked_lruvec;

    unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
}
return lock_page_lruvec_irqsave(page, flags);

The other piece that has me scratching my head is that I wonder if we
couldn't do this without needing the rcu_read_lock. For example, what
if we were to compare the page mem_cgroup pointer to the memcg back
pointer stored in the mem_cgroup_per_node? It seems like ordering
things this way would significantly reduce the overhead due to the
pointer chasing to see if the page is in the locked lruvec or not.

>  #ifdef CONFIG_CGROUP_WRITEBACK
>
>  struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 5d40d259a931..bc2fb3bfbe7a 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -303,17 +303,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>         /* Phase 1: page isolation */
>         for (i = 0; i < nr; i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *new_lruvec;
>
>                 /* block memcg change in mem_cgroup_move_account */
>                 lock_page_memcg(page);
> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (new_lruvec != lruvec) {
> -                       if (lruvec)
> -                               unlock_page_lruvec_irq(lruvec);
> -                       lruvec = lock_page_lruvec_irq(page);
> -               }
> -
> +               lruvec = relock_page_lruvec_irq(page, lruvec);
>                 if (TestClearPageMlocked(page)) {
>                         /*
>                          * We already have pin from follow_page_mask()
> diff --git a/mm/swap.c b/mm/swap.c
> index 09edac441eb6..6d9c7288f7de 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -209,19 +209,12 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *new_lruvec;
>
>                 /* block memcg migration during page moving between lru */
>                 if (!TestClearPageLRU(page))
>                         continue;
>
> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (lruvec != new_lruvec) {
> -                       if (lruvec)
> -                               unlock_page_lruvec_irqrestore(lruvec, flags);
> -                       lruvec = lock_page_lruvec_irqsave(page, &flags);
> -               }
> -
> +               lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>                 (*move_fn)(page, lruvec);
>
>                 SetPageLRU(page);
> @@ -864,17 +857,12 @@ void release_pages(struct page **pages, int nr)
>                 }
>
>                 if (PageLRU(page)) {
> -                       struct lruvec *new_lruvec;
> -
> -                       new_lruvec = mem_cgroup_page_lruvec(page,
> -                                                       page_pgdat(page));
> -                       if (new_lruvec != lruvec) {
> -                               if (lruvec)
> -                                       unlock_page_lruvec_irqrestore(lruvec,
> -                                                                       flags);
> +                       struct lruvec *prev_lruvec = lruvec;
> +
> +                       lruvec = relock_page_lruvec_irqsave(page, lruvec,
> +                                                                       &flags);
> +                       if (prev_lruvec != lruvec)
>                                 lock_batch = 0;
> -                               lruvec = lock_page_lruvec_irqsave(page, &flags);
> -                       }
>
>                         __ClearPageLRU(page);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> @@ -980,15 +968,8 @@ void __pagevec_lru_add(struct pagevec *pvec)
>
>         for (i = 0; i < pagevec_count(pvec); i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *new_lruvec;
> -
> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (lruvec != new_lruvec) {
> -                       if (lruvec)
> -                               unlock_page_lruvec_irqrestore(lruvec, flags);
> -                       lruvec = lock_page_lruvec_irqsave(page, &flags);
> -               }
>
> +               lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>                 __pagevec_lru_add_fn(page, lruvec);
>         }
>         if (lruvec)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 168c1659e430..bdb53a678e7e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4292,15 +4292,9 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>
>         for (i = 0; i < pvec->nr; i++) {
>                 struct page *page = pvec->pages[i];
> -               struct lruvec *new_lruvec;
>
>                 pgscanned++;
> -               new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> -               if (lruvec != new_lruvec) {
> -                       if (lruvec)
> -                               unlock_page_lruvec_irq(lruvec);
> -                       lruvec = lock_page_lruvec_irq(page);
> -               }
> +               lruvec = relock_page_lruvec_irq(page, lruvec);
>
>                 if (!PageLRU(page) || !PageUnevictable(page))
>                         continue;
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-29 14:49   ` Alex Shi
@ 2020-07-29 18:06     ` Hugh Dickins
  2020-07-30  2:16       ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Hugh Dickins @ 2020-07-29 18:06 UTC (permalink / raw)
  To: Alex Shi
  Cc: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen

On Wed, 29 Jul 2020, Alex Shi wrote:
> 
> Is there any comments or suggestion for this patchset?
> Any hints will be very appreciated.

Alex: it is now v5.8-rc7, obviously too late for this patchset to make
v5.9, so I'm currently concentrated on checking some patches headed for
v5.9 (and some bugfix patches of my own that I don't get time to send):
I'll get back to responding on lru_lock in a week or two's time.

Hugh


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-29 18:06     ` Hugh Dickins
@ 2020-07-30  2:16       ` Alex Shi
  2020-08-03 15:07         ` Michal Hocko
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-30  2:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: akpm, mgorman, tj, khlebnikov, daniel.m.jordan, yang.shi, willy,
	hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen



在 2020/7/30 上午2:06, Hugh Dickins 写道:
> On Wed, 29 Jul 2020, Alex Shi wrote:
>>
>> Is there any comments or suggestion for this patchset?
>> Any hints will be very appreciated.
> 
> Alex: it is now v5.8-rc7, obviously too late for this patchset to make
> v5.9, so I'm currently concentrated on checking some patches headed for
> v5.9 (and some bugfix patches of my own that I don't get time to send):
> I'll get back to responding on lru_lock in a week or two's time.

Hi Hugh,

Thanks a lot for response! It's fine to wait longer.
But thing would be more efficient if review get concentrated...
I am still too new in mm area.

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function
  2020-07-29 17:52   ` Alexander Duyck
@ 2020-07-30  6:08     ` Alex Shi
  2020-07-31 14:20       ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-30  6:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Thomas Gleixner, Andrey Ryabinin



在 2020/7/30 上午1:52, Alexander Duyck 写道:
>> +       rcu_read_lock();
>> +       locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
>> +       rcu_read_unlock();
>> +
>> +       if (locked)
>> +               return locked_lruvec;
>> +
>> +       if (locked_lruvec)
>> +               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
>> +
>> +       return lock_page_lruvec_irqsave(page, flags);
>> +}
>> +
> So looking these over they seem to be pretty inefficient for what they
> do. Basically in worst case (locked_lruvec == NULL) you end up calling
> mem_cgoup_page_lruvec and all the rcu_read_lock/unlock a couple times
> for a single page. It might make more sense to structure this like:
> if (locked_lruvec) {

Uh, we still need to check this page's lruvec, that needs a rcu_read_lock.
to save a mem_cgroup_page_lruvec call, we have to open lock_page_lruvec
as your mentained before.

>     if (lruvec_holds_page_lru_lock(page, locked_lruvec))
>         return locked_lruvec;
> 
>     unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> }
> return lock_page_lruvec_irqsave(page, flags);
> 
> The other piece that has me scratching my head is that I wonder if we
> couldn't do this without needing the rcu_read_lock. For example, what
> if we were to compare the page mem_cgroup pointer to the memcg back
> pointer stored in the mem_cgroup_per_node? It seems like ordering
> things this way would significantly reduce the overhead due to the
> pointer chasing to see if the page is in the locked lruvec or not.
> 

If page->mem_cgroup always be charged. the following could be better.

+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+               struct lruvec *locked_lruvec, unsigned long *flags)
+{
+       struct lruvec *lruvec;
+
+       if (mem_cgroup_disabled())
+               return locked_lruvec;
+
+       /* user page always be charged */
+       VM_BUG_ON_PAGE(!page->mem_cgroup, page);
+
+       rcu_read_lock();
+       if (likely(lruvec_memcg(locked_lruvec) == page->mem_cgroup)) {
+               rcu_read_unlock();
+               return locked_lruvec;
+       }
+
+       if (locked_lruvec)
+               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+
+       lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+       spin_lock_irqsave(&lruvec->lru_lock, *flags);
+       rcu_read_unlock();
+       lruvec_memcg_debug(lruvec, page);
+
+       return lruvec;
+}
+

The user page is always be charged since readahead page is charged now.
and looks we also can apply this patch. I will test it to see if there is
other exception.


commit 826128346e50f6c60c513e166998466b593becad
Author: Alex Shi <alex.shi@linux.alibaba.com>
Date:   Thu Jul 30 13:58:38 2020 +0800

    mm/memcg: remove useless check on page->mem_cgroup

    Since readahead page will be charged on memcg too. We don't need to
    check this exception now.

    Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af96217f2ec5..0c7f6bed199b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1336,12 +1336,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd

        VM_BUG_ON_PAGE(PageTail(page), page);
        memcg = READ_ONCE(page->mem_cgroup);
-       /*
-        * Swapcache readahead pages are added to the LRU - and
-        * possibly migrated - before they are charged.
-        */
-       if (!memcg)
-               memcg = root_mem_cgroup;

        mz = mem_cgroup_page_nodeinfo(memcg, page);
        lruvec = &mz->lruvec;
@@ -6962,10 +6956,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
        if (newpage->mem_cgroup)
                return;

-       /* Swapcache readahead pages can get replaced before being charged */
        memcg = oldpage->mem_cgroup;
-       if (!memcg)
-               return;

        /* Force-charge the new page. The old one will be freed soon */
        nr_pages = thp_nr_pages(newpage);
@@ -7160,10 +7151,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)

        memcg = page->mem_cgroup;

-       /* Readahead page, never charged */
-       if (!memcg)
-               return;
-
        /*
         * In case the memcg owning these pages has been offlined and doesn't
         * have an ID allocated to it anymore, charge the closest online


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function
  2020-07-30  6:08     ` Alex Shi
@ 2020-07-31 14:20       ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-07-31 14:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Thomas Gleixner, Andrey Ryabinin

On Wed, Jul 29, 2020 at 11:08 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/7/30 上午1:52, Alexander Duyck 写道:
> >> +       rcu_read_lock();
> >> +       locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> >> +       rcu_read_unlock();
> >> +
> >> +       if (locked)
> >> +               return locked_lruvec;
> >> +
> >> +       if (locked_lruvec)
> >> +               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> >> +
> >> +       return lock_page_lruvec_irqsave(page, flags);
> >> +}
> >> +
> > So looking these over they seem to be pretty inefficient for what they
> > do. Basically in worst case (locked_lruvec == NULL) you end up calling
> > mem_cgoup_page_lruvec and all the rcu_read_lock/unlock a couple times
> > for a single page. It might make more sense to structure this like:
> > if (locked_lruvec) {
>
> Uh, we still need to check this page's lruvec, that needs a rcu_read_lock.
> to save a mem_cgroup_page_lruvec call, we have to open lock_page_lruvec
> as your mentained before.
>
> >     if (lruvec_holds_page_lru_lock(page, locked_lruvec))
> >         return locked_lruvec;
> >
> >     unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> > }
> > return lock_page_lruvec_irqsave(page, flags);
> >
> > The other piece that has me scratching my head is that I wonder if we
> > couldn't do this without needing the rcu_read_lock. For example, what
> > if we were to compare the page mem_cgroup pointer to the memcg back
> > pointer stored in the mem_cgroup_per_node? It seems like ordering
> > things this way would significantly reduce the overhead due to the
> > pointer chasing to see if the page is in the locked lruvec or not.
> >
>
> If page->mem_cgroup always be charged. the following could be better.
>
> +/* Don't lock again iff page's lruvec locked */
> +static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
> +               struct lruvec *locked_lruvec, unsigned long *flags)
> +{
> +       struct lruvec *lruvec;
> +
> +       if (mem_cgroup_disabled())
> +               return locked_lruvec;
> +
> +       /* user page always be charged */
> +       VM_BUG_ON_PAGE(!page->mem_cgroup, page);
> +
> +       rcu_read_lock();
> +       if (likely(lruvec_memcg(locked_lruvec) == page->mem_cgroup)) {
> +               rcu_read_unlock();
> +               return locked_lruvec;
> +       }
> +
> +       if (locked_lruvec)
> +               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> +
> +       lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +       spin_lock_irqsave(&lruvec->lru_lock, *flags);
> +       rcu_read_unlock();
> +       lruvec_memcg_debug(lruvec, page);
> +
> +       return lruvec;
> +}
> +

I understand that you have to use the rcu_lock when you want to
acquire the lruvec via mem_cgroup_page_lruvec(). That is why I didn't
do away with the call to lock_page_lruvec_irqsave() at the end of the
function. However it doesn't make sense to do it when you are already
holding the locked_lruvec and simply getting the container of it in
order to compare pointer values.

One thing I was getting at with the lruvec_holds_page_lru_lock()
function I had introduced in my example is that the code baths for the
two relock functions are very similar. If we could move all the logic
for identifying if we can reuse the lock into a single function it
would dut down on the redundancy quite a bit as well. In addition by
testing for locked_lruvec != NULL before we before we do the
comparison we can save ourselves some unnecessary testing in the case
where

The thought I had was try to avoid the rcu_lock entirely in the lock
reuse case. Basically you just need to compare the pgdat value and the
memcg between the page and the lruvec. As long as they both point the
same values then you should have the correct lruvec and no need to
relock. There is no need to take the rcu_lock as long as you aren't
dereferencing something and if you are just comparing the pointers it
should be good with that. The fallback if mem_cgroup_disabled() is to
make certain the page pgdat->__lruvec is the address belonging to the
lruvec.

> The user page is always be charged since readahead page is charged now.
> and looks we also can apply this patch. I will test it to see if there is
> other exception.

Yes that would simplify things a bit as the code I had was having to
use a ternary to test for root_mem_cgroup if page->mem_cgroup was
NULL. I should be able to finish up testing today and will submit a
few clean-up patches as RFC to get your thoughts/feedback.

> commit 826128346e50f6c60c513e166998466b593becad
> Author: Alex Shi <alex.shi@linux.alibaba.com>
> Date:   Thu Jul 30 13:58:38 2020 +0800
>
>     mm/memcg: remove useless check on page->mem_cgroup
>
>     Since readahead page will be charged on memcg too. We don't need to
>     check this exception now.
>
>     Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index af96217f2ec5..0c7f6bed199b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1336,12 +1336,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>
>         VM_BUG_ON_PAGE(PageTail(page), page);
>         memcg = READ_ONCE(page->mem_cgroup);
> -       /*
> -        * Swapcache readahead pages are added to the LRU - and
> -        * possibly migrated - before they are charged.
> -        */
> -       if (!memcg)
> -               memcg = root_mem_cgroup;
>
>         mz = mem_cgroup_page_nodeinfo(memcg, page);
>         lruvec = &mz->lruvec;
> @@ -6962,10 +6956,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
>         if (newpage->mem_cgroup)
>                 return;
>
> -       /* Swapcache readahead pages can get replaced before being charged */
>         memcg = oldpage->mem_cgroup;
> -       if (!memcg)
> -               return;
>
>         /* Force-charge the new page. The old one will be freed soon */
>         nr_pages = thp_nr_pages(newpage);
> @@ -7160,10 +7151,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>
>         memcg = page->mem_cgroup;
>
> -       /* Readahead page, never charged */
> -       if (!memcg)
> -               return;
> -
>         /*
>          * In case the memcg owning these pages has been offlined and doesn't
>          * have an ID allocated to it anymore, charge the closest online
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid
  2020-07-25 12:59 ` [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function Alex Shi
  2020-07-29 17:52   ` Alexander Duyck
@ 2020-07-31 21:14   ` alexander.h.duyck
  2020-07-31 23:54     ` Alex Shi
  1 sibling, 1 reply; 102+ messages in thread
From: alexander.h.duyck @ 2020-07-31 21:14 UTC (permalink / raw)
  To: alex.shi
  Cc: akpm, alexander.duyck, aryabinin, cgroups, daniel.m.jordan,
	hannes, hughd, iamjoonsoo.kim, khlebnikov, kirill, linux-kernel,
	linux-mm, lkp, mgorman, richard.weiyang, rong.a.chen, shakeelb,
	tglx, tj, willy, yang.shi

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

When testing for relock we can avoid the need for RCU locking if we simply
compare the page pgdat and memcg pointers versus those that the lruvec is
holding. By doing this we can avoid the extra pointer walks and accesses of
the memory cgroup.

In addition we can avoid the checks entirely if lruvec is currently NULL.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/memcontrol.h |   52 +++++++++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6e670f991b42..7a02f00bf3de 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -405,6 +405,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	const struct mem_cgroup *memcg;
+	struct mem_cgroup_per_node *mz;
+
+	if (mem_cgroup_disabled())
+		return lruvec == &pgdat->__lruvec;
+
+	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+	memcg = page->mem_cgroup ? : root_mem_cgroup;
+
+	return lruvec->pgdat == pgdat && mz->memcg == memcg;
+}
+
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -880,6 +896,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &pgdat->__lruvec;
 }
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+					      struct lruvec *lruvec)
+{
+		pg_data_t *pgdat = page_pgdat(page);
+
+		return lruvec == &pgdat->__lruvec;
+}
+
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
@@ -1317,18 +1341,12 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
 static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
 		struct lruvec *locked_lruvec)
 {
-	struct pglist_data *pgdat = page_pgdat(page);
-	bool locked;
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
 
-	rcu_read_lock();
-	locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
-	rcu_read_unlock();
-
-	if (locked)
-		return locked_lruvec;
-
-	if (locked_lruvec)
 		unlock_page_lruvec_irq(locked_lruvec);
+	}
 
 	return lock_page_lruvec_irq(page);
 }
@@ -1337,18 +1355,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
 static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
 		struct lruvec *locked_lruvec, unsigned long *flags)
 {
-	struct pglist_data *pgdat = page_pgdat(page);
-	bool locked;
-
-	rcu_read_lock();
-	locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
-	rcu_read_unlock();
-
-	if (locked)
-		return locked_lruvec;
+	if (locked_lruvec) {
+		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+			return locked_lruvec;
 
-	if (locked_lruvec)
 		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+	}
 
 	return lock_page_lruvec_irqsave(page, flags);
 }




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (21 preceding siblings ...)
  2020-07-27  5:40 ` [PATCH v17 00/21] per memcg lru lock Alex Shi
@ 2020-07-31 21:31 ` Alexander Duyck
  2020-08-04  8:36 ` Alex Shi
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-07-31 21:31 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> The new version which bases on v5.8-rc6. It includes Hugh Dickins fix in
> mm/swap.c and mm/mlock.c fix which Alexander Duyck pointed out, then
> removes 'mm/mlock: reorder isolation sequence during munlock'
>
> Hi Johanness & Hugh & Alexander & Willy,
>
> Could you like to give a reviewed by since you address much of issue and
> give lots of suggestions! Many thanks!
>

I just finished getting a test pass done on the patches. I'm still
seeing a regression on the will-it-scale/page_fault3 test but it is
now only 3% instead of the 20% that it was so it may just be noise at
this point.

I'll try to make sure to get my review feedback wrapped up early next week.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid
  2020-07-31 21:14   ` [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid alexander.h.duyck
@ 2020-07-31 23:54     ` Alex Shi
  2020-08-02 18:20       ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-07-31 23:54 UTC (permalink / raw)
  To: alexander.h.duyck
  Cc: akpm, alexander.duyck, aryabinin, cgroups, daniel.m.jordan,
	hannes, hughd, iamjoonsoo.kim, khlebnikov, kirill, linux-kernel,
	linux-mm, lkp, mgorman, richard.weiyang, rong.a.chen, shakeelb,
	tglx, tj, willy, yang.shi

It looks much better than mine. and could replace 'mm/lru: introduce the relock_page_lruvec function'
with your author signed. :)

BTW,
it's the rcu_read_lock cause the will-it-scale/page_fault3 regression which you mentained in another
letter?

Thanks
Alex

在 2020/8/1 上午5:14, alexander.h.duyck@intel.com 写道:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> When testing for relock we can avoid the need for RCU locking if we simply
> compare the page pgdat and memcg pointers versus those that the lruvec is
> holding. By doing this we can avoid the extra pointer walks and accesses of
> the memory cgroup.
> 
> In addition we can avoid the checks entirely if lruvec is currently NULL.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  include/linux/memcontrol.h |   52 +++++++++++++++++++++++++++-----------------
>  1 file changed, 32 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6e670f991b42..7a02f00bf3de 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -405,6 +405,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>  
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>  
> +static inline bool lruvec_holds_page_lru_lock(struct page *page,
> +					      struct lruvec *lruvec)
> +{
> +	pg_data_t *pgdat = page_pgdat(page);
> +	const struct mem_cgroup *memcg;
> +	struct mem_cgroup_per_node *mz;
> +
> +	if (mem_cgroup_disabled())
> +		return lruvec == &pgdat->__lruvec;
> +
> +	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> +	memcg = page->mem_cgroup ? : root_mem_cgroup;
> +
> +	return lruvec->pgdat == pgdat && mz->memcg == memcg;
> +}
> +
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  
>  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
> @@ -880,6 +896,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>  	return &pgdat->__lruvec;
>  }
>  
> +static inline bool lruvec_holds_page_lru_lock(struct page *page,
> +					      struct lruvec *lruvec)
> +{
> +		pg_data_t *pgdat = page_pgdat(page);
> +
> +		return lruvec == &pgdat->__lruvec;
> +}
> +
>  static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
>  {
>  	return NULL;
> @@ -1317,18 +1341,12 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
>  static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
>  		struct lruvec *locked_lruvec)
>  {
> -	struct pglist_data *pgdat = page_pgdat(page);
> -	bool locked;
> +	if (locked_lruvec) {
> +		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
> +			return locked_lruvec;
>  
> -	rcu_read_lock();
> -	locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> -	rcu_read_unlock();
> -
> -	if (locked)
> -		return locked_lruvec;
> -
> -	if (locked_lruvec)
>  		unlock_page_lruvec_irq(locked_lruvec);
> +	}
>  
>  	return lock_page_lruvec_irq(page);
>  }
> @@ -1337,18 +1355,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
>  static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
>  		struct lruvec *locked_lruvec, unsigned long *flags)
>  {
> -	struct pglist_data *pgdat = page_pgdat(page);
> -	bool locked;
> -
> -	rcu_read_lock();
> -	locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> -	rcu_read_unlock();
> -
> -	if (locked)
> -		return locked_lruvec;
> +	if (locked_lruvec) {
> +		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
> +			return locked_lruvec;
>  
> -	if (locked_lruvec)
>  		unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> +	}
>  
>  	return lock_page_lruvec_irqsave(page, flags);
>  }
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid
  2020-07-31 23:54     ` Alex Shi
@ 2020-08-02 18:20       ` Alexander Duyck
  2020-08-04  6:13         ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-02 18:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: Duyck, Alexander H, Andrew Morton, Andrey Ryabinin, cgroups,
	Daniel Jordan, Johannes Weiner, Hugh Dickins, Joonsoo Kim,
	Konstantin Khlebnikov, Kirill A. Shutemov, LKML, linux-mm,
	kbuild test robot, Mel Gorman, Wei Yang, Rong Chen, Shakeel Butt,
	Thomas Gleixner, Tejun Heo, Matthew Wilcox, Yang Shi

Feel free to fold it into your patches if you want.

I think Hugh was the one that had submitted a patch that addressed it,
and it looks like you folded that into your v17 set. It was probably
what he had identified which was the additional LRU checks needing to
be removed from the code.

Thanks.

- Alex

On Fri, Jul 31, 2020 at 4:55 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> It looks much better than mine. and could replace 'mm/lru: introduce the relock_page_lruvec function'
> with your author signed. :)
>
> BTW,
> it's the rcu_read_lock cause the will-it-scale/page_fault3 regression which you mentained in another
> letter?
>
> Thanks
> Alex
>
> 在 2020/8/1 上午5:14, alexander.h.duyck@intel.com 写道:
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >
> > When testing for relock we can avoid the need for RCU locking if we simply
> > compare the page pgdat and memcg pointers versus those that the lruvec is
> > holding. By doing this we can avoid the extra pointer walks and accesses of
> > the memory cgroup.
> >
> > In addition we can avoid the checks entirely if lruvec is currently NULL.
> >
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > ---
> >  include/linux/memcontrol.h |   52 +++++++++++++++++++++++++++-----------------
> >  1 file changed, 32 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 6e670f991b42..7a02f00bf3de 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -405,6 +405,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> >
> >  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
> >
> > +static inline bool lruvec_holds_page_lru_lock(struct page *page,
> > +                                           struct lruvec *lruvec)
> > +{
> > +     pg_data_t *pgdat = page_pgdat(page);
> > +     const struct mem_cgroup *memcg;
> > +     struct mem_cgroup_per_node *mz;
> > +
> > +     if (mem_cgroup_disabled())
> > +             return lruvec == &pgdat->__lruvec;
> > +
> > +     mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> > +     memcg = page->mem_cgroup ? : root_mem_cgroup;
> > +
> > +     return lruvec->pgdat == pgdat && mz->memcg == memcg;
> > +}
> > +
> >  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> >
> >  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
> > @@ -880,6 +896,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> >       return &pgdat->__lruvec;
> >  }
> >
> > +static inline bool lruvec_holds_page_lru_lock(struct page *page,
> > +                                           struct lruvec *lruvec)
> > +{
> > +             pg_data_t *pgdat = page_pgdat(page);
> > +
> > +             return lruvec == &pgdat->__lruvec;
> > +}
> > +
> >  static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
> >  {
> >       return NULL;
> > @@ -1317,18 +1341,12 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
> >  static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
> >               struct lruvec *locked_lruvec)
> >  {
> > -     struct pglist_data *pgdat = page_pgdat(page);
> > -     bool locked;
> > +     if (locked_lruvec) {
> > +             if (lruvec_holds_page_lru_lock(page, locked_lruvec))
> > +                     return locked_lruvec;
> >
> > -     rcu_read_lock();
> > -     locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> > -     rcu_read_unlock();
> > -
> > -     if (locked)
> > -             return locked_lruvec;
> > -
> > -     if (locked_lruvec)
> >               unlock_page_lruvec_irq(locked_lruvec);
> > +     }
> >
> >       return lock_page_lruvec_irq(page);
> >  }
> > @@ -1337,18 +1355,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
> >  static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
> >               struct lruvec *locked_lruvec, unsigned long *flags)
> >  {
> > -     struct pglist_data *pgdat = page_pgdat(page);
> > -     bool locked;
> > -
> > -     rcu_read_lock();
> > -     locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
> > -     rcu_read_unlock();
> > -
> > -     if (locked)
> > -             return locked_lruvec;
> > +     if (locked_lruvec) {
> > +             if (lruvec_holds_page_lru_lock(page, locked_lruvec))
> > +                     return locked_lruvec;
> >
> > -     if (locked_lruvec)
> >               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
> > +     }
> >
> >       return lock_page_lruvec_irqsave(page, flags);
> >  }
> >
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-30  2:16       ` Alex Shi
@ 2020-08-03 15:07         ` Michal Hocko
  2020-08-04  6:14           ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Michal Hocko @ 2020-08-03 15:07 UTC (permalink / raw)
  To: Alex Shi
  Cc: Hugh Dickins, akpm, mgorman, tj, khlebnikov, daniel.m.jordan,
	yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups,
	shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen

On Thu 30-07-20 10:16:13, Alex Shi wrote:
> 
> 
> 在 2020/7/30 上午2:06, Hugh Dickins 写道:
> > On Wed, 29 Jul 2020, Alex Shi wrote:
> >>
> >> Is there any comments or suggestion for this patchset?
> >> Any hints will be very appreciated.
> > 
> > Alex: it is now v5.8-rc7, obviously too late for this patchset to make
> > v5.9, so I'm currently concentrated on checking some patches headed for
> > v5.9 (and some bugfix patches of my own that I don't get time to send):
> > I'll get back to responding on lru_lock in a week or two's time.
> 
> Hi Hugh,
> 
> Thanks a lot for response! It's fine to wait longer.
> But thing would be more efficient if review get concentrated...
> I am still too new in mm area.

I am sorry and owe you a review but it is hard to find time for that.
This is a large change and the review will be really far from trivial.
If this version is mostly stable then I would recommend not posting new
versions and simply remind people you expect the review from by a
targeted ping.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock
  2020-07-25 12:59 ` [PATCH v17 21/21] mm/lru: revise the comments of lru_lock Alex Shi
@ 2020-08-03 22:37   ` Alexander Duyck
  2020-08-04 10:04     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-03 22:37 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Andrey Ryabinin, Jann Horn

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> From: Hugh Dickins <hughd@google.com>
>
> Since we changed the pgdat->lru_lock to lruvec->lru_lock, it's time to
> fix the incorrect comments in code. Also fixed some zone->lru_lock comment
> error from ancient time. etc.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
>  Documentation/admin-guide/cgroup-v1/memory.rst     | 21 +++++++++------------
>  Documentation/trace/events-kmem.rst                |  2 +-
>  Documentation/vm/unevictable-lru.rst               | 22 ++++++++--------------
>  include/linux/mm_types.h                           |  2 +-
>  include/linux/mmzone.h                             |  2 +-
>  mm/filemap.c                                       |  4 ++--
>  mm/memcontrol.c                                    |  2 +-
>  mm/rmap.c                                          |  4 ++--
>  mm/vmscan.c                                        | 12 ++++++++----
>  10 files changed, 36 insertions(+), 50 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
> index 3f7115e07b5d..0b9f91589d3d 100644
> --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
> @@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
>
>  8. LRU
>  ======
> -        Each memcg has its own private LRU. Now, its handling is under global
> -       VM's control (means that it's handled under global pgdat->lru_lock).
> -       Almost all routines around memcg's LRU is called by global LRU's
> -       list management functions under pgdat->lru_lock.
> -
> -       A special function is mem_cgroup_isolate_pages(). This scans
> -       memcg's private LRU and call __isolate_lru_page() to extract a page
> -       from LRU.
> -
> -       (By __isolate_lru_page(), the page is removed from both of global and
> -       private LRU.)
> -
> +       Each memcg has its own vector of LRUs (inactive anon, active anon,
> +       inactive file, active file, unevictable) of pages from each node,
> +       each LRU handled under a single lru_lock for that memcg and node.
>
>  9. Typical Tests.
>  =================
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index 12757e63b26c..24450696579f 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -285,20 +285,17 @@ When oom event notifier is registered, event will be delivered.
>  2.6 Locking
>  -----------
>
> -   lock_page_cgroup()/unlock_page_cgroup() should not be called under
> -   the i_pages lock.
> +Lock order is as follows:
>
> -   Other lock order is following:
> +  Page lock (PG_locked bit of page->flags)
> +    mm->page_table_lock or split pte_lock
> +      lock_page_memcg (memcg->move_lock)
> +        mapping->i_pages lock
> +          lruvec->lru_lock.
>
> -   PG_locked.
> -     mm->page_table_lock
> -         pgdat->lru_lock
> -          lock_page_cgroup.
> -
> -  In many cases, just lock_page_cgroup() is called.
> -
> -  per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
> -  pgdat->lru_lock, it has no lock of its own.
> +Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
> +lruvec->lru_lock; PG_lru bit of page->flags is cleared before
> +isolating a page from its LRU under lruvec->lru_lock.
>
>  2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
>  -----------------------------------------------
> diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
> index 555484110e36..68fa75247488 100644
> --- a/Documentation/trace/events-kmem.rst
> +++ b/Documentation/trace/events-kmem.rst
> @@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
>  Broadly speaking, pages are taken off the LRU lock in bulk and
>  freed in batch with a page list. Significant amounts of activity here could
>  indicate that the system is under memory pressure and can also indicate
> -contention on the zone->lru_lock.
> +contention on the lruvec->lru_lock.
>
>  4. Per-CPU Allocator Activity
>  =============================
> diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
> index 17d0861b0f1d..0e1490524f53 100644
> --- a/Documentation/vm/unevictable-lru.rst
> +++ b/Documentation/vm/unevictable-lru.rst
> @@ -33,7 +33,7 @@ reclaim in Linux.  The problems have been observed at customer sites on large
>  memory x86_64 systems.
>
>  To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
> -main memory will have over 32 million 4k pages in a single zone.  When a large
> +main memory will have over 32 million 4k pages in a single node.  When a large
>  fraction of these pages are not evictable for any reason [see below], vmscan
>  will spend a lot of time scanning the LRU lists looking for the small fraction
>  of pages that are evictable.  This can result in a situation where all CPUs are

I'm not entirely sure this makes sense. If the system in non-NUMA you
don't have nodes then do you?

> @@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
>  The Unevictable Page List
>  -------------------------
>
> -The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
> +The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
>  called the "unevictable" list and an associated page flag, PG_unevictable, to
>  indicate that the page is being managed on the unevictable list.

This isn't quite true either is it? It is per-memcg and per-node isn't it?

> @@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
>  swap-backed pages.  This differentiation is only important while the pages are,
>  in fact, evictable.
>
> -The unevictable list benefits from the "arrayification" of the per-zone LRU
> +The unevictable list benefits from the "arrayification" of the per-node LRU
>  lists and statistics originally proposed and posted by Christoph Lameter.

Again, per-node x per-memcg. The list is not stored in just a per-node
structure, it is also per-memcg.

> -The unevictable list does not use the LRU pagevec mechanism. Rather,
> -unevictable pages are placed directly on the page's zone's unevictable list
> -under the zone lru_lock.  This allows us to prevent the stranding of pages on
> -the unevictable list when one task has the page isolated from the LRU and other
> -tasks are changing the "evictability" state of the page.
> -
>
>  Memory Control Group Interaction
>  --------------------------------
> @@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
>  memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
>  lru_list enum.
>
> -The memory controller data structure automatically gets a per-zone unevictable
> -list as a result of the "arrayification" of the per-zone LRU lists (one per
> +The memory controller data structure automatically gets a per-node unevictable
> +list as a result of the "arrayification" of the per-node LRU lists (one per
>  lru_list enum element).  The memory controller tracks the movement of pages to
>  and from the unevictable list.

Again, per-memcg and per-node.

> @@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
>  active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
>  pages in all of the shrink_{active|inactive|page}_list() functions and will
>  "cull" such pages that it encounters: that is, it diverts those pages to the
> -unevictable list for the zone being scanned.
> +unevictable list for the node being scanned.

Another spot where memcg and node apply, not just node.

>  There may be situations where a page is mapped into a VM_LOCKED VMA, but the
>  page is not marked as PG_mlocked.  Such pages will make it all the way to
> @@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
>  page from the LRU, as it is likely on the appropriate active or inactive list
>  at that time.  If the isolate_lru_page() succeeds, mlock_vma_page() will put
>  back the page - by calling putback_lru_page() - which will notice that the page
> -is now mlocked and divert the page to the zone's unevictable list.  If
> +is now mlocked and divert the page to the node's unevictable list.  If
>  mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
>  it later if and when it attempts to reclaim the page.
>

Maybe instead of just replacing zone with node it might work better to
use wording such as "node specific memcg unevictable LRU list".

> @@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
>       unevictable list in mlock_vma_page().
>
>  shrink_inactive_list() also diverts any unevictable pages that it finds on the
> -inactive lists to the appropriate zone's unevictable list.
> +inactive lists to the appropriate node's unevictable list.
>
>  shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
>  after shrink_active_list() had moved them to the inactive list, or pages mapped

Same here.

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 64ede5f150dc..44738cdb5a55 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -78,7 +78,7 @@ struct page {
>                 struct {        /* Page cache and anonymous pages */
>                         /**
>                          * @lru: Pageout list, eg. active_list protected by
> -                        * pgdat->lru_lock.  Sometimes used as a generic list
> +                        * lruvec->lru_lock.  Sometimes used as a generic list
>                          * by the page owner.
>                          */
>                         struct list_head lru;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8af956aa13cf..c92289a4e14d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>  struct pglist_data;
>
>  /*
> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>   * So add a wild amount of padding here to ensure that they fall into separate
>   * cachelines.  There are very few zone structures in the machine, so space
>   * consumption is not a concern here.

So I don't believe you are using ZONE_PADDING in any way to try and
protect the LRU lock currently. At least you aren't using it in the
lruvec. As such it might make sense to just drop the reference to the
lru_lock here. That reminds me that we still need to review the
placement of the lru_lock and determine if there might be a better
placement and/or padding that might improve performance when under
heavy stress.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 385759c4ce4b..3083557a1ce6 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -101,8 +101,8 @@
>   *    ->swap_lock              (try_to_unmap_one)
>   *    ->private_lock           (try_to_unmap_one)
>   *    ->i_pages lock           (try_to_unmap_one)
> - *    ->pgdat->lru_lock                (follow_page->mark_page_accessed)
> - *    ->pgdat->lru_lock                (check_pte_range->isolate_lru_page)
> + *    ->lruvec->lru_lock       (follow_page->mark_page_accessed)
> + *    ->lruvec->lru_lock       (check_pte_range->isolate_lru_page)
>   *    ->private_lock           (page_remove_rmap->set_page_dirty)
>   *    ->i_pages lock           (page_remove_rmap->set_page_dirty)
>   *    bdi.wb->list_lock                (page_remove_rmap->set_page_dirty)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d6746656cc39..a018d7c8a3f2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3057,7 +3057,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
>  /*
> - * Because tail pages are not marked as "used", set it. We're under
> + * Because tail pages are not marked as "used", set it. Don't need
>   * lruvec->lru_lock and migration entries setup in all page mappings.
>   */
>  void mem_cgroup_split_huge_fixup(struct page *head)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5fe2dedce1fc..7f6e95680c47 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -28,12 +28,12 @@
>   *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
>   *           anon_vma->rwsem
>   *             mm->page_table_lock or pte_lock
> - *               pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
>   *               swap_lock (in swap_duplicate, swap_info_get)
>   *                 mmlist_lock (in mmput, drain_mmlist and others)
>   *                 mapping->private_lock (in __set_page_dirty_buffers)
> - *                   mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
> + *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
>   *                     i_pages lock (widely used)
> + *                       lruvec->lru_lock (in lock_page_lruvec_irq)
>   *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
>   *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
>   *                   sb_lock (within inode_lock in fs/fs-writeback.c)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 078a1640ec60..bb3ac52de058 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1620,14 +1620,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
>  }
>
>  /**
> - * pgdat->lru_lock is heavily contended.  Some of the functions that
> + * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
> + *
> + * lruvec->lru_lock is heavily contended.  Some of the functions that
>   * shrink the lists perform better by taking out a batch of pages
>   * and working on them outside the LRU lock.
>   *
>   * For pagecache intensive workloads, this function is the hottest
>   * spot in the kernel (apart from copy_*_user functions).
>   *
> - * Appropriate locks must be held before calling this function.
> + * Lru_lock must be held before calling this function.
>   *
>   * @nr_to_scan:        The number of eligible pages to look through on the list.
>   * @lruvec:    The LRU vector to pull pages from.
> @@ -1826,14 +1828,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>
>  /*
>   * This moves pages from @list to corresponding LRU list.
> + * The pages from @list is out of any lruvec, and in the end list reuses as
> + * pages_to_free list.
>   *
>   * We move them the other way if the page is referenced by one or more
>   * processes, from rmap.
>   *
>   * If the pages are mostly unmapped, the processing is fast and it is
> - * appropriate to hold zone_lru_lock across the whole operation.  But if
> + * appropriate to hold lru_lock across the whole operation.  But if
>   * the pages are mapped, the processing is slow (page_referenced()) so we
> - * should drop zone_lru_lock around each page.  It's impossible to balance
> + * should drop lru_lock around each page.  It's impossible to balance
>   * this, so instead we remove the pages from the LRU while processing them.
>   * It is safe to rely on PG_active against the non-LRU pages in here because
>   * nobody will play with that bit on a non-LRU page.
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock
  2020-07-25 12:59 ` [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock Alex Shi
@ 2020-08-03 22:42   ` Alexander Duyck
  2020-08-03 22:45     ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-03 22:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org

I really think this would be better just squashed into patch 18
instead of as a standalone patch since you were moving all of the
locking anyway so it would be more likely to trigger build errors if
somebody didn't move a lock somewhere that was referencing this.

That said this change is harmless at this point.

Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

> ---
>  include/linux/mmzone.h | 1 -
>  mm/page_alloc.c        | 1 -
>  2 files changed, 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30b961a9a749..8af956aa13cf 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -735,7 +735,6 @@ struct deferred_split {
>
>         /* Write-intensive fields used by page reclaim */
>         ZONE_PADDING(_pad1_)
> -       spinlock_t              lru_lock;
>
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>         /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e028b87ce294..4d7df42b32d6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6721,7 +6721,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>         init_waitqueue_head(&pgdat->pfmemalloc_wait);
>
>         pgdat_page_ext_init(pgdat);
> -       spin_lock_init(&pgdat->lru_lock);
>         lruvec_init(&pgdat->__lruvec);
>  }
>
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock
  2020-08-03 22:42   ` Alexander Duyck
@ 2020-08-03 22:45     ` Alexander Duyck
  2020-08-04  6:22       ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-03 22:45 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

Just to correct a typo, I meant patch 17, not 18. in the comment below.


On Mon, Aug 3, 2020 at 3:42 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >
> > Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore.
> >
> > Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: cgroups@vger.kernel.org
>
> I really think this would be better just squashed into patch 18
> instead of as a standalone patch since you were moving all of the
> locking anyway so it would be more likely to trigger build errors if
> somebody didn't move a lock somewhere that was referencing this.
>
> That said this change is harmless at this point.
>
> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru
  2020-07-25 12:59 ` [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru Alex Shi
@ 2020-08-03 22:49   ` Alexander Duyck
  2020-08-04  6:23     ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-03 22:49 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Andrey Ryabinin, Jann Horn

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> From: Hugh Dickins <hughd@google.com>
>
> Use the relock function to replace relocking action. And try to save few
> lock times.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

I am assuming this is only separate from patch 18 because of the fact
that it is from Hugh and not yourself. Otherwise I would recommend
folding this into patch 18.

Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid
  2020-08-02 18:20       ` Alexander Duyck
@ 2020-08-04  6:13         ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-04  6:13 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, Andrew Morton, Andrey Ryabinin, cgroups,
	Daniel Jordan, Johannes Weiner, Hugh Dickins, Joonsoo Kim,
	Konstantin Khlebnikov, Kirill A. Shutemov, LKML, linux-mm,
	kbuild test robot, Mel Gorman, Wei Yang, Rong Chen, Shakeel Butt,
	Thomas Gleixner, Tejun Heo, Matthew Wilcox, Yang Shi



在 2020/8/3 上午2:20, Alexander Duyck 写道:
> Feel free to fold it into your patches if you want.
> 
> I think Hugh was the one that had submitted a patch that addressed it,
> and it looks like you folded that into your v17 set. It was probably
> what he had identified which was the additional LRU checks needing to
> be removed from the code.

Yes, Hugh's patch was folded into patch [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn
and your change is on patch 18. seems there are no interfere with each other.
Both of patches are fine.

Thanks

> 
> Thanks.
> 
> - Alex
> 
> On Fri, Jul 31, 2020 at 4:55 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>> It looks much better than mine. and could replace 'mm/lru: introduce the relock_page_lruvec function'
>> with your author signed. :)
>>
>> BTW,
>> it's the rcu_read_lock cause the will-it-scale/page_fault3 regression which you mentained in another
>> letter?
>>
>> Thanks
>> Alex
>>
>> 在 2020/8/1 上午5:14, alexander.h.duyck@intel.com 写道:
>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>
>>> When testing for relock we can avoid the need for RCU locking if we simply
>>> compare the page pgdat and memcg pointers versus those that the lruvec is
>>> holding. By doing this we can avoid the extra pointer walks and accesses of
>>> the memory cgroup.
>>>
>>> In addition we can avoid the checks entirely if lruvec is currently NULL.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>> ---
>>>  include/linux/memcontrol.h |   52 +++++++++++++++++++++++++++-----------------
>>>  1 file changed, 32 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 6e670f991b42..7a02f00bf3de 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -405,6 +405,22 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>>>
>>>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>>>
>>> +static inline bool lruvec_holds_page_lru_lock(struct page *page,
>>> +                                           struct lruvec *lruvec)
>>> +{
>>> +     pg_data_t *pgdat = page_pgdat(page);
>>> +     const struct mem_cgroup *memcg;
>>> +     struct mem_cgroup_per_node *mz;
>>> +
>>> +     if (mem_cgroup_disabled())
>>> +             return lruvec == &pgdat->__lruvec;
>>> +
>>> +     mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
>>> +     memcg = page->mem_cgroup ? : root_mem_cgroup;
>>> +
>>> +     return lruvec->pgdat == pgdat && mz->memcg == memcg;
>>> +}
>>> +
>>>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>>>
>>>  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
>>> @@ -880,6 +896,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>>>       return &pgdat->__lruvec;
>>>  }
>>>
>>> +static inline bool lruvec_holds_page_lru_lock(struct page *page,
>>> +                                           struct lruvec *lruvec)
>>> +{
>>> +             pg_data_t *pgdat = page_pgdat(page);
>>> +
>>> +             return lruvec == &pgdat->__lruvec;
>>> +}
>>> +
>>>  static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
>>>  {
>>>       return NULL;
>>> @@ -1317,18 +1341,12 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
>>>  static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
>>>               struct lruvec *locked_lruvec)
>>>  {
>>> -     struct pglist_data *pgdat = page_pgdat(page);
>>> -     bool locked;
>>> +     if (locked_lruvec) {
>>> +             if (lruvec_holds_page_lru_lock(page, locked_lruvec))
>>> +                     return locked_lruvec;
>>>
>>> -     rcu_read_lock();
>>> -     locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
>>> -     rcu_read_unlock();
>>> -
>>> -     if (locked)
>>> -             return locked_lruvec;
>>> -
>>> -     if (locked_lruvec)
>>>               unlock_page_lruvec_irq(locked_lruvec);
>>> +     }
>>>
>>>       return lock_page_lruvec_irq(page);
>>>  }
>>> @@ -1337,18 +1355,12 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
>>>  static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
>>>               struct lruvec *locked_lruvec, unsigned long *flags)
>>>  {
>>> -     struct pglist_data *pgdat = page_pgdat(page);
>>> -     bool locked;
>>> -
>>> -     rcu_read_lock();
>>> -     locked = mem_cgroup_page_lruvec(page, pgdat) == locked_lruvec;
>>> -     rcu_read_unlock();
>>> -
>>> -     if (locked)
>>> -             return locked_lruvec;
>>> +     if (locked_lruvec) {
>>> +             if (lruvec_holds_page_lru_lock(page, locked_lruvec))
>>> +                     return locked_lruvec;
>>>
>>> -     if (locked_lruvec)
>>>               unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
>>> +     }
>>>
>>>       return lock_page_lruvec_irqsave(page, flags);
>>>  }
>>>
>>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-08-03 15:07         ` Michal Hocko
@ 2020-08-04  6:14           ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-04  6:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, akpm, mgorman, tj, khlebnikov, daniel.m.jordan,
	yang.shi, willy, hannes, lkp, linux-mm, linux-kernel, cgroups,
	shakeelb, iamjoonsoo.kim, richard.weiyang, kirill,
	alexander.duyck, rong.a.chen



在 2020/8/3 下午11:07, Michal Hocko 写道:
> On Thu 30-07-20 10:16:13, Alex Shi wrote:
>>
>>
>> 在 2020/7/30 上午2:06, Hugh Dickins 写道:
>>> On Wed, 29 Jul 2020, Alex Shi wrote:
>>>>
>>>> Is there any comments or suggestion for this patchset?
>>>> Any hints will be very appreciated.
>>>
>>> Alex: it is now v5.8-rc7, obviously too late for this patchset to make
>>> v5.9, so I'm currently concentrated on checking some patches headed for
>>> v5.9 (and some bugfix patches of my own that I don't get time to send):
>>> I'll get back to responding on lru_lock in a week or two's time.
>>
>> Hi Hugh,
>>
>> Thanks a lot for response! It's fine to wait longer.
>> But thing would be more efficient if review get concentrated...
>> I am still too new in mm area.
> 
> I am sorry and owe you a review but it is hard to find time for that.
> This is a large change and the review will be really far from trivial.
> If this version is mostly stable then I would recommend not posting new
> versions and simply remind people you expect the review from by a
> targeted ping.
> 

hi Michal,

Thanks a lot for reminder!

Except a update on patch [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function
from Alexander, the patchset is stable on 5.8.

Just on linux-next, there are changes on hpage_nr_pages -> thp_nr_pages func name change, and
lru_note_cost changes that they need a new update.
And I have another 3 more patches, following this patchset which do clean up and optimzing.

Is it worth for a new patchset? or let me just update here?

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock
  2020-08-03 22:45     ` Alexander Duyck
@ 2020-08-04  6:22       ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-04  6:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen



在 2020/8/4 上午6:45, Alexander Duyck 写道:
> Just to correct a typo, I meant patch 17, not 18. in the comment below.
> 
> 
> On Mon, Aug 3, 2020 at 3:42 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>>
>> On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>
>>> Now pgdat.lru_lock was replaced by lruvec lock. It's not used anymore.
>>>
>>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> Cc: linux-mm@kvack.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: cgroups@vger.kernel.org
>>
>> I really think this would be better just squashed into patch 18
>> instead of as a standalone patch since you were moving all of the
>> locking anyway so it would be more likely to trigger build errors if
>> somebody didn't move a lock somewhere that was referencing this.

Thanks for comments!
If someone changed the lru_lock between patch 17 and this, it would cause
more troubles then build error here. :) so don't warries for that.
But on the other side, I am so insist to have a ceremony to remove this lock...
  
>>
>> That said this change is harmless at this point.
>>
>> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Thanks a lot for review!


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru
  2020-08-03 22:49   ` Alexander Duyck
@ 2020-08-04  6:23     ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-04  6:23 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Andrey Ryabinin, Jann Horn



在 2020/8/4 上午6:49, Alexander Duyck 写道:
>> Cc: linux-mm@kvack.org
> I am assuming this is only separate from patch 18 because of the fact
> that it is from Hugh and not yourself. Otherwise I would recommend
> folding this into patch 18.

Yes, that's resaon for this patch keeps.

> 
> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Thanks a lot!
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (22 preceding siblings ...)
  2020-07-31 21:31 ` Alexander Duyck
@ 2020-08-04  8:36 ` Alex Shi
  2020-08-04  8:36 ` Alex Shi
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-04  8:36 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, Michal Hocko, Vladimir Davydov

From 6f3ac2a72448291a88f50df836d829a23e7df736 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 25 Jul 2020 22:52:11 +0800
Subject: [PATCH 2/3] mm/mlock: remove __munlock_isolate_lru_page

The func only has one caller, remove it to clean up code and simplify
code.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 46a05e6ec5ba..40a8bb79c65e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -102,23 +102,6 @@ void mlock_vma_page(struct page *page)
 }
 
 /*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page,
-				struct lruvec *lruvec, bool getpage)
-{
-	if (TestClearPageLRU(page)) {
-		if (getpage)
-			get_page(page);
-		del_page_from_lru_list(page, lruvec, page_lru(page));
-		return true;
-	}
-
-	return false;
-}
-
-/*
  * Finish munlock after successful page isolation
  *
  * Page must be locked. This is a wrapper for try_to_munlock()
@@ -300,7 +283,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			 * We already have pin from follow_page_mask()
 			 * so we can spare the get_page() here.
 			 */
-			if (__munlock_isolate_lru_page(page, lruvec, false)) {
+			if (TestClearPageLRU(page)) {
+				enum lru_list lru = page_lru(page);
+
+				del_page_from_lru_list(page, lruvec, lru);
 				unlock_page_memcg(page);
 				continue;
 			} else
-- 
1.8.3.1




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (23 preceding siblings ...)
  2020-08-04  8:36 ` Alex Shi
@ 2020-08-04  8:36 ` Alex Shi
  2020-08-04  8:37 ` Alex Shi
  2020-08-04  8:37 ` Alex Shi
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-04  8:36 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, Vladimir Davydov, Michal Hocko

From e2918c8fa741442255a2f12659f95dae94fdfe5d Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 1 Aug 2020 22:49:31 +0800
Subject: [PATCH 3/3] mm/swap.c: optimizing __pagevec_lru_add lru_lock

The current relock will unlock/lock lru_lock with every time lruvec
changes, so it would cause frequency relock if 2 memcgs are reading file
simultaneously.

This patch will record the involved lru_lock and only hold them once in
above scenario. That could reduce the lock contention.

Using per cpu data intead of local stack data to avoid repeatly
INIT_LIST_HEAD action.

[lkp@intel.com: found a build issue in the original patch, thanks]
Suggested-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 51 insertions(+), 6 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index d88a6c650a7c..e227fec6983c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -72,6 +72,27 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
+struct pvlvs {
+	struct list_head lists[PAGEVEC_SIZE];
+	struct lruvec *vecs[PAGEVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct pvlvs, pvlvs);
+
+static int __init pvlvs_init(void) {
+	int i, cpu;
+	struct pvlvs *pvecs;
+
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (!cpu_possible(cpu))
+			continue;
+		pvecs = per_cpu_ptr(&pvlvs, cpu);
+		for (i = 0; i < PAGEVEC_SIZE; i++)
+			INIT_LIST_HEAD(&pvecs->lists[i]);
+	}
+	return 0;
+}
+subsys_initcall(pvlvs_init);
+
 /*
  * This path almost never happens for VM activity - pages are normally
  * freed via pagevecs.  But it gets used by networking.
@@ -963,18 +984,42 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
  */
 void __pagevec_lru_add(struct pagevec *pvec)
 {
-	int i;
+	int i, j, total = 0;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
+	struct page *page;
+	struct pvlvs *lvs = this_cpu_ptr(&pvlvs);
 
+	/* Sort the same lruvec pages on a list. */
 	for (i = 0; i < pagevec_count(pvec); i++) {
-		struct page *page = pvec->pages[i];
+		page = pvec->pages[i];
+		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+		/* Try to find a same lruvec */
+		for (j = 0; j <= total; j++)
+			if (lruvec == lvs->vecs[j])
+				break;
+		/* A new lruvec */
+		if (j > total) {
+			lvs->vecs[total] = lruvec;
+			j = total;
+			total++;
+		}
 
-		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
-		__pagevec_lru_add_fn(page, lruvec);
+		list_add(&page->lru, &lvs->lists[j]);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
+
+	for (i = 0; i < total; i++) {
+		spin_lock_irqsave(&lvs->vecs[i]->lru_lock, flags);
+		while (!list_empty(&lvs->lists[i])) {
+			page = lru_to_page(&lvs->lists[i]);
+			list_del(&page->lru);
+			__pagevec_lru_add_fn(page, lvs->vecs[i]);
+		}
+		spin_unlock_irqrestore(&lvs->vecs[i]->lru_lock, flags);
+		lvs->vecs[i] = NULL;
+	}
+
 	release_pages(pvec->pages, pvec->nr);
 	pagevec_reinit(pvec);
 }
-- 
1.8.3.1




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (24 preceding siblings ...)
  2020-08-04  8:36 ` Alex Shi
@ 2020-08-04  8:37 ` Alex Shi
  2020-08-04  8:37 ` Alex Shi
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-04  8:37 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, Vladimir Davydov, Michal Hocko

From 0696a2a4a8ca5e9bf62f208126ea4af7727d2edc Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Sat, 25 Jul 2020 22:31:03 +0800
Subject: [PATCH 1/3] mm/mlock: remove lru_lock on TestClearPageMlocked in
 munlock_vma_page

In the func munlock_vma_page, the page must be PageLocked as well as
pages in split_huge_page series funcs. Thus the PageLocked is enough
to serialize both funcs.

So we could relief the TestClearPageMlocked/hpage_nr_pages which are not
necessary under lru lock.

As to another munlock func __munlock_pagevec, which no PageLocked
protection and should remain lru protecting.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/mlock.c | 41 +++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0448409184e3..46a05e6ec5ba 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -69,9 +69,9 @@ void clear_page_mlock(struct page *page)
 	 *
 	 * See __pagevec_lru_add_fn for more explanation.
 	 */
-	if (!isolate_lru_page(page)) {
+	if (!isolate_lru_page(page))
 		putback_lru_page(page);
-	} else {
+	else {
 		/*
 		 * We lost the race. the page already moved to evictable list.
 		 */
@@ -178,7 +178,6 @@ static void __munlock_isolation_failed(struct page *page)
 unsigned int munlock_vma_page(struct page *page)
 {
 	int nr_pages;
-	struct lruvec *lruvec;
 
 	/* For try_to_munlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
@@ -186,37 +185,22 @@ unsigned int munlock_vma_page(struct page *page)
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	/*
-	 * Serialize split tail pages in __split_huge_page_tail() which
-	 * might otherwise copy PageMlocked to part of the tail pages before
-	 * we clear it in the head page. It also stabilizes thp_nr_pages().
-	 * TestClearPageLRU can't be used here to block page isolation, since
-	 * out of lock clear_page_mlock may interfer PageLRU/PageMlocked
-	 * sequence, same as __pagevec_lru_add_fn, and lead the page place to
-	 * wrong lru list here. So relay on PageLocked to stop lruvec change
-	 * in mem_cgroup_move_account().
+	 * Serialize split tail pages in __split_huge_page_tail() by
+	 * lock_page(); Do TestClearPageMlocked/PageLRU sequence like
+	 * clear_page_mlock().
 	 */
-	lruvec = lock_page_lruvec_irq(page);
-
-	if (!TestClearPageMlocked(page)) {
+	if (!TestClearPageMlocked(page))
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
-	}
+		return 0;
 
 	nr_pages = thp_nr_pages(page);
 	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 
-	if (__munlock_isolate_lru_page(page, lruvec, true)) {
-		unlock_page_lruvec_irq(lruvec);
+	if (!isolate_lru_page(page))
 		__munlock_isolated_page(page);
-		goto out;
-	}
-	__munlock_isolation_failed(page);
-
-unlock_out:
-	unlock_page_lruvec_irq(lruvec);
+	else
+		__munlock_isolation_failed(page);
 
-out:
 	return nr_pages - 1;
 }
 
@@ -305,6 +289,11 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 
 		/* block memcg change in mem_cgroup_move_account */
 		lock_page_memcg(page);
+		/*
+		 * Serialize split tail pages in __split_huge_page_tail() which
+		 * might otherwise copy PageMlocked to part of the tail pages
+		 * before we clear it in the head page.
+		 */
 		lruvec = relock_page_lruvec_irq(page, lruvec);
 		if (TestClearPageMlocked(page)) {
 			/*
-- 
1.8.3.1




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 00/21] per memcg lru lock
  2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
                   ` (25 preceding siblings ...)
  2020-08-04  8:37 ` Alex Shi
@ 2020-08-04  8:37 ` Alex Shi
  26 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-04  8:37 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen, Vladimir Davydov, Michal Hocko

From e2918c8fa741442255a2f12659f95dae94fdfe5d Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linux.alibaba.com>
Date: Tue, 4 Aug 2020 16:20:02 +0800
Subject: [PATCH 0/3] optimzing following per memcg lru_lock

The first 2 patches are code clean up. And the 3rd one is a lru_add optimize.


Alex Shi (3):
  mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page
  mm/mlock: remove __munlock_isolate_lru_page
  mm/swap.c: optimizing __pagevec_lru_add lru_lock

 mm/mlock.c | 63 +++++++++++++++++++-------------------------------------------
 mm/swap.c  | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 70 insertions(+), 50 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock
  2020-08-03 22:37   ` Alexander Duyck
@ 2020-08-04 10:04     ` Alex Shi
  2020-08-04 14:29       ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-04 10:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Andrey Ryabinin, Jann Horn



在 2020/8/4 上午6:37, Alexander Duyck 写道:
>>
>>  shrink_inactive_list() also diverts any unevictable pages that it finds on the
>> -inactive lists to the appropriate zone's unevictable list.
>> +inactive lists to the appropriate node's unevictable list.
>>
>>  shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
>>  after shrink_active_list() had moved them to the inactive list, or pages mapped
> Same here.

lruvec is used per memcg per node actually, and it fallback to node if memcg disabled.
So the comments are still right.

And most of changes just fix from zone->lru_lock to pgdat->lru_lock change.
> 
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 64ede5f150dc..44738cdb5a55 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -78,7 +78,7 @@ struct page {
>>                 struct {        /* Page cache and anonymous pages */
>>                         /**
>>                          * @lru: Pageout list, eg. active_list protected by
>> -                        * pgdat->lru_lock.  Sometimes used as a generic list
>> +                        * lruvec->lru_lock.  Sometimes used as a generic list
>>                          * by the page owner.
>>                          */
>>                         struct list_head lru;
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 8af956aa13cf..c92289a4e14d 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>>  struct pglist_data;
>>
>>  /*
>> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
>> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>>   * So add a wild amount of padding here to ensure that they fall into separate
>>   * cachelines.  There are very few zone structures in the machine, so space
>>   * consumption is not a concern here.
> So I don't believe you are using ZONE_PADDING in any way to try and
> protect the LRU lock currently. At least you aren't using it in the
> lruvec. As such it might make sense to just drop the reference to the
> lru_lock here. That reminds me that we still need to review the
> placement of the lru_lock and determine if there might be a better
> placement and/or padding that might improve performance when under
> heavy stress.
> 

Right, is it the following looks better?

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ccc76590f823..0ed520954843 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
 struct pglist_data;

 /*
- * zone->lock and the lru_lock are two of the hottest locks in the kernel.
- * So add a wild amount of padding here to ensure that they fall into separate
+ * Add a wild amount of padding here to ensure datas fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
  */

Thanks!
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock
  2020-08-04 10:04     ` Alex Shi
@ 2020-08-04 14:29       ` Alexander Duyck
  2020-08-06  1:39         ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-04 14:29 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Andrey Ryabinin, Jann Horn

On Tue, Aug 4, 2020 at 3:04 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/4 上午6:37, Alexander Duyck 写道:
> >>
> >>  shrink_inactive_list() also diverts any unevictable pages that it finds on the
> >> -inactive lists to the appropriate zone's unevictable list.
> >> +inactive lists to the appropriate node's unevictable list.
> >>
> >>  shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
> >>  after shrink_active_list() had moved them to the inactive list, or pages mapped
> > Same here.
>
> lruvec is used per memcg per node actually, and it fallback to node if memcg disabled.
> So the comments are still right.
>
> And most of changes just fix from zone->lru_lock to pgdat->lru_lock change.

Actually in my mind one thing that might work better would be to
explain what the lruvec is and where it resides. Then replace zone
with lruvec since that is really where the unevictable list resides.
Then it would be correct for both the memcg and pgdat case.

> >
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 64ede5f150dc..44738cdb5a55 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -78,7 +78,7 @@ struct page {
> >>                 struct {        /* Page cache and anonymous pages */
> >>                         /**
> >>                          * @lru: Pageout list, eg. active_list protected by
> >> -                        * pgdat->lru_lock.  Sometimes used as a generic list
> >> +                        * lruvec->lru_lock.  Sometimes used as a generic list
> >>                          * by the page owner.
> >>                          */
> >>                         struct list_head lru;
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 8af956aa13cf..c92289a4e14d 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
> >>  struct pglist_data;
> >>
> >>  /*
> >> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
> >> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
> >>   * So add a wild amount of padding here to ensure that they fall into separate
> >>   * cachelines.  There are very few zone structures in the machine, so space
> >>   * consumption is not a concern here.
> > So I don't believe you are using ZONE_PADDING in any way to try and
> > protect the LRU lock currently. At least you aren't using it in the
> > lruvec. As such it might make sense to just drop the reference to the
> > lru_lock here. That reminds me that we still need to review the
> > placement of the lru_lock and determine if there might be a better
> > placement and/or padding that might improve performance when under
> > heavy stress.
> >
>
> Right, is it the following looks better?
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ccc76590f823..0ed520954843 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>  struct pglist_data;
>
>  /*
> - * zone->lock and the lru_lock are two of the hottest locks in the kernel.
> - * So add a wild amount of padding here to ensure that they fall into separate
> + * Add a wild amount of padding here to ensure datas fall into separate
>   * cachelines.  There are very few zone structures in the machine, so space
>   * consumption is not a concern here.
>   */
>
> Thanks!
> Alex

I would maybe tweak it to make sure it is clear that we are using this
to pad out items that are likely to cause cache thrash such as various
hot spinocks and such.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-07-25 12:59 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alex Shi
@ 2020-08-04 21:35   ` Alexander Duyck
  2020-08-06 18:38   ` Alexander Duyck
  2020-08-17 22:58   ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alexander Duyck
  2 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-04 21:35 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Currently, compaction would get the lru_lock and then do page isolation
> which works fine with pgdat->lru_lock, since any page isoltion would
> compete for the lru_lock. If we want to change to memcg lru_lock, we
> have to isolate the page before getting lru_lock, thus isoltion would
> block page's memcg change which relay on page isoltion too. Then we
> could safely use per memcg lru_lock later.
>
> The new page isolation use previous introduced TestClearPageLRU() +
> pgdat lru locking which will be changed to memcg lru lock later.
>
> Hugh Dickins <hughd@google.com> fixed following bugs in this patch's
> early version:
>
> Fix lots of crashes under compaction load: isolate_migratepages_block()
> must clean up appropriately when rejecting a page, setting PageLRU again
> if it had been cleared; and a put_page() after get_page_unless_zero()
> cannot safely be done while holding locked_lruvec - it may turn out to
> be the final put_page(), which will take an lruvec lock when PageLRU.
> And move __isolate_lru_page_prepare back after get_page_unless_zero to
> make trylock_page() safe:
> trylock_page() is not safe to use at this time: its setting PG_locked
> can race with the page being freed or allocated ("Bad page"), and can
> also erase flags being set by one of those "sole owners" of a freshly
> allocated page who use non-atomic __SetPageFlag().
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/swap.h |  2 +-
>  mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
>  mm/vmscan.c          | 46 ++++++++++++++++++++++++++--------------------
>  3 files changed, 60 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2c29399b29a0..6d23d3beeff7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
> -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                   unsigned long nr_pages,
>                                                   gfp_t gfp_mask,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index f14780fc296a..2da2933fe56b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -869,6 +869,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
>                         if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
>                                 low_pfn = end_pfn;
> +                               page = NULL;
>                                 goto isolate_abort;
>                         }
>                         valid_page = page;
> @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>                         goto isolate_fail;
>
> +               /*
> +                * Be careful not to clear PageLRU until after we're
> +                * sure the page is not being freed elsewhere -- the
> +                * page release code relies on it.
> +                */
> +               if (unlikely(!get_page_unless_zero(page)))
> +                       goto isolate_fail;
> +
> +               if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> +                       goto isolate_fail_put;
> +
> +               /* Try isolate the page */
> +               if (!TestClearPageLRU(page))
> +                       goto isolate_fail_put;
> +
>                 /* If we already hold the lock, we can skip some rechecking */
>                 if (!locked) {
>                         locked = compact_lock_irqsave(&pgdat->lru_lock,

So the fact that this flow doesn't match what we have below in
isolate_lru_pages(). I went digging through the history and realize I
brought this up before and you referenced the following patch from
hugh:
https://lore.kernel.org/lkml/alpine.LSU.2.11.2006111529010.10801@eggly.anvils/

As such I am assuming this flow is needed because we aren't holding an
LRU lock, and the flow in mm/vmscan.c works because that is being
called while holding an LRU lock. I am wondering if we are
overcomplicating things by keeping the LRU check in
__isolate_lru_page_prepare(). If we were to pull it out then you could
just perform the get_page_unless_zero and TestClearPageLRU check
before you call the function and you could consolidate the code so
that it could be combined into a single function as below. So for
example you could combine them into:
static inline bool get_lru_page_unless_zero(struct page *page)
{
    /*
     * Be careful not to clear PageLRU until after we're
     * sure the page is not being freed elsewhere -- the
     * page release code relies on it.
     */
   if (unlikely(!get_page_unless_zero(page))
        return false;
    if(TestClearPageLRU(page))
        return true;
    put_page(page);
    return false;
}

Then the logic becomes that you have to either call
get_lru_page_unless_zero before calling __isolate_lru_page_prepare or
you have to be holding the LRU lock.

> @@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                         goto isolate_abort;
>                         }
>
> -                       /* Recheck PageLRU and PageCompound under lock */
> -                       if (!PageLRU(page))
> -                               goto isolate_fail;
> -
>                         /*
>                          * Page become compound since the non-locked check,
>                          * and it's on LRU. It can only be a THP so the order
> @@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>                                 low_pfn += compound_nr(page) - 1;
> -                               goto isolate_fail;
> +                               SetPageLRU(page);
> +                               goto isolate_fail_put;
>                         }
>                 }
>
>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
> -               /* Try isolate the page */
> -               if (__isolate_lru_page(page, isolate_mode) != 0)
> -                       goto isolate_fail;
> -
>                 /* The whole page is taken off the LRU; skip the tail pages. */
>                 if (PageCompound(page))
>                         low_pfn += compound_nr(page) - 1;
> @@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 }
>
>                 continue;
> +
> +isolate_fail_put:
> +               /* Avoid potential deadlock in freeing page under lru_lock */
> +               if (locked) {
> +                       spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +                       locked = false;
> +               }
> +               put_page(page);
> +
>  isolate_fail:
>                 if (!skip_on_failure)
>                         continue;
> @@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         if (unlikely(low_pfn > end_pfn))
>                 low_pfn = end_pfn;
>
> +       page = NULL;
> +
>  isolate_abort:
>         if (locked)
>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (page) {
> +               SetPageLRU(page);
> +               put_page(page);
> +       }
>
>         /*
>          * Updated the cached scanner pfn once the pageblock has been scanned
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4183ae6b54b5..f77748adc340 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1544,20 +1544,20 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   *
>   * returns 0 on success, -ve errno on failure.
>   */
> -int __isolate_lru_page(struct page *page, isolate_mode_t mode)
> +int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
>  {
>         int ret = -EINVAL;
>
> -       /* Only take pages on the LRU. */
> -       if (!PageLRU(page))
> -               return ret;
> -
>         /* Compaction should not handle unevictable pages but CMA can do so */
>         if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>                 return ret;
>
>         ret = -EBUSY;
>
> +       /* Only take pages on the LRU. */
> +       if (!PageLRU(page))
> +               return ret;
> +
>         /*
>          * To minimise LRU disruption, the caller can indicate that it only
>          * wants to isolate pages it will be able to operate on without

So the question I would have is if we really need to be checking
PageLRU here? I wonder if this isn't another spot where we would be
better served by just assuming that PageLRU has been checked while
holding the lock, or tested and cleared while holding a page
reference. The original patch from Hugh referenced above mentions a
desire to do away with __isolate_lru_page_prepare entirely, so I
wonder if it wouldn't be good to be proactive and pull out the bits we
think we might need versus the ones we don't.

> @@ -1598,20 +1598,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>         if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
>                 return ret;
>
> -       if (likely(get_page_unless_zero(page))) {
> -               /*
> -                * Be careful not to clear PageLRU until after we're
> -                * sure the page is not being freed elsewhere -- the
> -                * page release code relies on it.
> -                */
> -               ClearPageLRU(page);
> -               ret = 0;
> -       }
> -
> -       return ret;
> +       return 0;
>  }
>
> -
>  /*
>   * Update LRU sizes after isolating pages. The LRU size updates must
>   * be complete before mem_cgroup_update_lru_size due to a sanity check.
> @@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>                  * only when the page is being freed somewhere else.
>                  */
>                 scan += nr_pages;
> -               switch (__isolate_lru_page(page, mode)) {
> +               switch (__isolate_lru_page_prepare(page, mode)) {

So after looking through the code I realized that "mode" here will
always be either 0 or ISOLATE_UNMAPPED. I assume this is why we aren't
worried about the trylock_page call messing things up. With that said
it looks like the function just breaks down to three tests, first for
PageUnevictable(), then PageLRU(), and then possibly page_mapped(). As
such I believe dropping the PageLRU check from the function as I
suggested above should be safe since we are at risk for the test of it
racing anyway since it could be cleared out from under us, and the bit
isn't really protecting anything anyway since we are holding the LRU
lock and anyhow.

>                 case 0:
> +                       /*
> +                        * Be careful not to clear PageLRU until after we're
> +                        * sure the page is not being freed elsewhere -- the
> +                        * page release code relies on it.
> +                        */
> +                       if (unlikely(!get_page_unless_zero(page)))
> +                               goto busy;
> +
> +                       if (!TestClearPageLRU(page)) {
> +                               /*
> +                                * This page may in other isolation path,
> +                                * but we still hold lru_lock.
> +                                */
> +                               put_page(page);
> +                               goto busy;
> +                       }
> +

This piece could be consolidated via the single function I called out above.

>                         nr_taken += nr_pages;
>                         nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
> -
> +busy:
>                 case -EBUSY:
>                         /* else it is being freed elsewhere */
>                         list_move(&page->lru, src);
> -                       continue;
> +                       break;
>                 default:
>                         BUG();
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page
  2020-07-25 12:59 ` [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
@ 2020-08-05 21:18   ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-05 21:18 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> It's a clean up patch w/o function changes.
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org

Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

> ---
>  mm/memory.c     | 3 ---
>  mm/swap.c       | 2 ++
>  mm/swap_state.c | 2 --
>  mm/workingset.c | 2 --
>  4 files changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 87ec87cdc1ff..dafc5585517e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3150,10 +3150,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                  * XXX: Move to lru_cache_add() when it
>                                  * supports new vs putback
>                                  */
> -                               spin_lock_irq(&page_pgdat(page)->lru_lock);
>                                 lru_note_cost_page(page);
> -                               spin_unlock_irq(&page_pgdat(page)->lru_lock);
> -
>                                 lru_cache_add(page);
>                                 swap_readpage(page, true);
>                         }
> diff --git a/mm/swap.c b/mm/swap.c
> index dc8b02cdddcb..b88ca630db70 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -298,8 +298,10 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
>
>  void lru_note_cost_page(struct page *page)
>  {
> +       spin_lock_irq(&page_pgdat(page)->lru_lock);
>         lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
>                       page_is_file_lru(page), hpage_nr_pages(page));
> +       spin_unlock_irq(&page_pgdat(page)->lru_lock);
>  }
>
>  static void __activate_page(struct page *page, struct lruvec *lruvec)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 05889e8e3c97..080be52db6a8 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -440,9 +440,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>         }
>
>         /* XXX: Move to lru_cache_add() when it supports new vs putback */
> -       spin_lock_irq(&page_pgdat(page)->lru_lock);
>         lru_note_cost_page(page);
> -       spin_unlock_irq(&page_pgdat(page)->lru_lock);
>
>         /* Caller will initiate read into locked page */
>         SetPageWorkingset(page);
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 50b7937bab32..337d5b9ad132 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -372,9 +372,7 @@ void workingset_refault(struct page *page, void *shadow)
>         if (workingset) {
>                 SetPageWorkingset(page);
>                 /* XXX: Move to lru_cache_add() when it supports new vs putback */
> -               spin_lock_irq(&page_pgdat(page)->lru_lock);
>                 lru_note_cost_page(page);
> -               spin_unlock_irq(&page_pgdat(page)->lru_lock);
>                 inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
>         }
>  out:
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU
  2020-07-29  3:53   ` Alex Shi
@ 2020-08-05 22:43     ` Alexander Duyck
  2020-08-06  1:54       ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-05 22:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov

On Tue, Jul 28, 2020 at 8:53 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> rewrite the commit log.
>
> From 9310c359b0049e3cc9827b771dc583d504bbf022 Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linux.alibaba.com>
> Date: Sat, 25 Apr 2020 12:03:30 +0800
> Subject: [PATCH v17 13/23] mm/lru: introduce TestClearPageLRU
>
> Currently lru_lock still guards both lru list and page's lru bit, that's
> ok. but if we want to use specific lruvec lock on the page, we need to
> pin down the page's lruvec/memcg during locking. Just taking lruvec
> lock first may be undermined by the page's memcg charge/migration. To
> fix this problem, we could clear the lru bit out of locking and use
> it as pin down action to block the page isolation in memcg changing.
>
> So now a standard steps of page isolation is following:
>         1, get_page();         #pin the page avoid to be free
>         2, TestClearPageLRU(); #block other isolation like memcg change
>         3, spin_lock on lru_lock; #serialize lru list access
>         4, delete page from lru list;
> The step 2 could be optimzed/replaced in scenarios which page is
> unlikely be accessed or be moved between memcgs.
>
> This patch start with the first part: TestClearPageLRU, which combines
> PageLRU check and ClearPageLRU into a macro func TestClearPageLRU. This
> function will be used as page isolation precondition to prevent other
> isolations some where else. Then there are may !PageLRU page on lru
> list, need to remove BUG() checking accordingly.
>
> There 2 rules for lru bit now:
> 1, the lru bit still indicate if a page on lru list, just in some
>    temporary moment(isolating), the page may have no lru bit when
>    it's on lru list.  but the page still must be on lru list when the
>    lru bit set.
> 2, have to remove lru bit before delete it from lru list.
>
> Hugh Dickins pointed that when a page is in free path and no one is
> possible to take it, non atomic lru bit clearing is better, like in
> __page_cache_release and release_pages.
> And no need get_page() before lru bit clear in isolate_lru_page,
> since it '(1) Must be called with an elevated refcount on the page'.
>
> As Andrew Morton mentioned this change would dirty cacheline for page
> isn't on LRU. But the lost would be acceptable with Rong Chen
> <rong.a.chen@intel.com> report:
> https://lkml.org/lkml/2020/3/4/173
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/page-flags.h |  1 +
>  mm/mlock.c                 |  3 +--
>  mm/swap.c                  |  6 ++----
>  mm/vmscan.c                | 18 +++++++-----------
>  4 files changed, 11 insertions(+), 17 deletions(-)
>

<snip>

> diff --git a/mm/swap.c b/mm/swap.c
> index f645965fde0e..5092fe9c8c47 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -83,10 +83,9 @@ static void __page_cache_release(struct page *page)
>                 struct lruvec *lruvec;
>                 unsigned long flags;
>
> +               __ClearPageLRU(page);
>                 spin_lock_irqsave(&pgdat->lru_lock, flags);
>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -               VM_BUG_ON_PAGE(!PageLRU(page), page);
> -               __ClearPageLRU(page);
>                 del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>         }
> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
>                                 spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>                         }
>
> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
> -                       VM_BUG_ON_PAGE(!PageLRU(page), page);
>                         __ClearPageLRU(page);
> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
>                 }
>

The more I look at this piece it seems like this change wasn't really
necessary. If anything it seems like it could catch potential bugs as
it was testing for the PageLRU flag before and then clearing it
manually anyway. In addition it doesn't reduce the critical path by
any significant amount so I am not sure these changes are providing
any benefit.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock
  2020-08-04 14:29       ` Alexander Duyck
@ 2020-08-06  1:39         ` Alex Shi
  2020-08-06 16:27           ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-06  1:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Andrey Ryabinin, Jann Horn



在 2020/8/4 下午10:29, Alexander Duyck 写道:
> On Tue, Aug 4, 2020 at 3:04 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> 在 2020/8/4 上午6:37, Alexander Duyck 写道:
>>>>
>>>>  shrink_inactive_list() also diverts any unevictable pages that it finds on the
>>>> -inactive lists to the appropriate zone's unevictable list.
>>>> +inactive lists to the appropriate node's unevictable list.
>>>>
>>>>  shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
>>>>  after shrink_active_list() had moved them to the inactive list, or pages mapped
>>> Same here.
>>
>> lruvec is used per memcg per node actually, and it fallback to node if memcg disabled.
>> So the comments are still right.
>>
>> And most of changes just fix from zone->lru_lock to pgdat->lru_lock change.
> 
> Actually in my mind one thing that might work better would be to
> explain what the lruvec is and where it resides. Then replace zone
> with lruvec since that is really where the unevictable list resides.
> Then it would be correct for both the memcg and pgdat case.

Could you like to revise the doc as your thought?
> 
>>>
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 64ede5f150dc..44738cdb5a55 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -78,7 +78,7 @@ struct page {
>>>>                 struct {        /* Page cache and anonymous pages */
>>>>                         /**
>>>>                          * @lru: Pageout list, eg. active_list protected by
>>>> -                        * pgdat->lru_lock.  Sometimes used as a generic list
>>>> +                        * lruvec->lru_lock.  Sometimes used as a generic list
>>>>                          * by the page owner.
>>>>                          */
>>>>                         struct list_head lru;
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index 8af956aa13cf..c92289a4e14d 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
>>>> @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>>>>  struct pglist_data;
>>>>
>>>>  /*
>>>> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
>>>> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>>>>   * So add a wild amount of padding here to ensure that they fall into separate
>>>>   * cachelines.  There are very few zone structures in the machine, so space
>>>>   * consumption is not a concern here.
>>> So I don't believe you are using ZONE_PADDING in any way to try and
>>> protect the LRU lock currently. At least you aren't using it in the
>>> lruvec. As such it might make sense to just drop the reference to the
>>> lru_lock here. That reminds me that we still need to review the
>>> placement of the lru_lock and determine if there might be a better
>>> placement and/or padding that might improve performance when under
>>> heavy stress.
>>>
>>
>> Right, is it the following looks better?
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index ccc76590f823..0ed520954843 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
>>  struct pglist_data;
>>
>>  /*
>> - * zone->lock and the lru_lock are two of the hottest locks in the kernel.
>> - * So add a wild amount of padding here to ensure that they fall into separate
>> + * Add a wild amount of padding here to ensure datas fall into separate
>>   * cachelines.  There are very few zone structures in the machine, so space
>>   * consumption is not a concern here.
>>   */
>>
>> Thanks!
>> Alex
> 
> I would maybe tweak it to make sure it is clear that we are using this
> to pad out items that are likely to cause cache thrash such as various
> hot spinocks and such.
> 

I appreciate if you like to change the doc better. :)

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU
  2020-08-05 22:43     ` Alexander Duyck
@ 2020-08-06  1:54       ` Alex Shi
  2020-08-06 14:41         ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-06  1:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov



在 2020/8/6 上午6:43, Alexander Duyck 写道:
>> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
>>                                 spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
>>                         }
>>
>> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>> -                       VM_BUG_ON_PAGE(!PageLRU(page), page);
>>                         __ClearPageLRU(page);
>> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
>>                 }
>>
> The more I look at this piece it seems like this change wasn't really
> necessary. If anything it seems like it could catch potential bugs as
> it was testing for the PageLRU flag before and then clearing it
> manually anyway. In addition it doesn't reduce the critical path by
> any significant amount so I am not sure these changes are providing
> any benefit.

Don't know hat kind of bug do you mean here, since the page is no one using, means
no one could ClearPageLRU in other place,  so if you like to keep the VM_BUG_ON_PAGE,
that should be ok.

Thanks!
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding
  2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi
@ 2020-08-06  3:47   ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-06  3:47 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen



在 2020/7/25 下午8:59, Alex Shi 写道:
> We don't have to add a freeable page into lru and then remove from it.
> This change saves a couple of actions and makes the moving more clear.
> 
> The SetPageLRU needs to be kept here for list intergrity.
> Otherwise:
>  #0 mave_pages_to_lru              #1 release_pages
>                                    if (put_page_testzero())
>  if !put_page_testzero
>                                      !PageLRU //skip lru_lock
>                                        list_add(&page->lru,)
>    list_add(&page->lru,) //corrupt

The race comments should be corrected to this:
                /*
                 * The SetPageLRU needs to be kept here for list intergrity.
                 * Otherwise:
                 *   #0 mave_pages_to_lru             #1 release_pages
                 *   if !put_page_testzero
                 *                                    if (put_page_testzero())
                 *                                      !PageLRU //skip lru_lock
                 *     SetPageLRU()
                 *     list_add(&page->lru,)
                 *                                        list_add(&page->lru,)
                 */

> 
> [akpm@linux-foundation.org: coding style fixes]
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  mm/vmscan.c | 37 ++++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749d239c62b2..ddb29d813d77 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1856,26 +1856,29 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  	while (!list_empty(list)) {
>  		page = lru_to_page(list);
>  		VM_BUG_ON_PAGE(PageLRU(page), page);
> +		list_del(&page->lru);
>  		if (unlikely(!page_evictable(page))) {
> -			list_del(&page->lru);
>  			spin_unlock_irq(&pgdat->lru_lock);
>  			putback_lru_page(page);
>  			spin_lock_irq(&pgdat->lru_lock);
>  			continue;
>  		}
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  
> +		/*
> +		 * The SetPageLRU needs to be kept here for list intergrity.
> +		 * Otherwise:
> +		 *   #0 mave_pages_to_lru             #1 release_pages
> +		 *				      if (put_page_testzero())
> +		 *   if !put_page_testzero
> +		 *				        !PageLRU //skip lru_lock
> +		 *                                        list_add(&page->lru,)
> +		 *     list_add(&page->lru,) //corrupt
> +		 */

                /*
                 * The SetPageLRU needs to be kept here for list intergrity.
                 * Otherwise:
                 *   #0 mave_pages_to_lru             #1 release_pages
                 *   if !put_page_testzero
                 *                                    if (put_page_testzero())
                 *                                      !PageLRU //skip lru_lock
                 *     SetPageLRU()
                 *     list_add(&page->lru,)
                 *                                        list_add(&page->lru,)
                 */

>  		SetPageLRU(page);
> -		lru = page_lru(page);
>  
> -		nr_pages = hpage_nr_pages(page);
> -		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
> -		list_move(&page->lru, &lruvec->lists[lru]);
> -
> -		if (put_page_testzero(page)) {
> +		if (unlikely(put_page_testzero(page))) {
>  			__ClearPageLRU(page);
>  			__ClearPageActive(page);
> -			del_page_from_lru_list(page, lruvec, lru);
>  
>  			if (unlikely(PageCompound(page))) {
>  				spin_unlock_irq(&pgdat->lru_lock);
> @@ -1883,11 +1886,19 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>  				spin_lock_irq(&pgdat->lru_lock);
>  			} else
>  				list_add(&page->lru, &pages_to_free);
> -		} else {
> -			nr_moved += nr_pages;
> -			if (PageActive(page))
> -				workingset_age_nonresident(lruvec, nr_pages);
> +
> +			continue;
>  		}
> +
> +		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +		lru = page_lru(page);
> +		nr_pages = hpage_nr_pages(page);
> +
> +		update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
> +		list_add(&page->lru, &lruvec->lists[lru]);
> +		nr_moved += nr_pages;
> +		if (PageActive(page))
> +			workingset_age_nonresident(lruvec, nr_pages);
>  	}
>  
>  	/*
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock
  2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
  2020-07-27 23:34   ` Alexander Duyck
  2020-07-29  3:54   ` Alex Shi
@ 2020-08-06  7:41   ` Alex Shi
  2 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-06  7:41 UTC (permalink / raw)
  To: akpm, mgorman, tj, hughd, khlebnikov, daniel.m.jordan, yang.shi,
	willy, hannes, lkp, linux-mm, linux-kernel, cgroups, shakeelb,
	iamjoonsoo.kim, richard.weiyang, kirill, alexander.duyck,
	rong.a.chen
  Cc: Michal Hocko, Vladimir Davydov

Hi Johannes, Michal,

From page to its lruvec, a few memory access under lock cause extra cost.
Would you like to save the per memcg lruvec pointer to page->private?

Thanks
Alex



在 2020/7/25 下午8:59, Alex Shi 写道:
>  /**
>   * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
>   * @page: the page
> @@ -1215,7 +1228,8 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>  		goto out;
>  	}
>  
> -	memcg = page->mem_cgroup;
> +	VM_BUG_ON_PAGE(PageTail(page), page);
> +	memcg = READ_ONCE(page->mem_cgroup);
>  	/*
>  	 * Swapcache readahead pages are added to the LRU - and
>  	 * possibly migrated - before they are charged.
> @@ -1236,6 +1250,51 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>  	return lruvec;
>  }
>  
> +struct lruvec *lock_page_lruvec(struct page *page)
> +{
> +	struct lruvec *lruvec;
> +	struct pglist_data *pgdat = page_pgdat(page);
> +
> +	rcu_read_lock();
> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +	spin_lock(&lruvec->lru_lock);
> +	rcu_read_unlock();
> +
> +	lruvec_memcg_debug(lruvec, page);
> +
> +	return lruvec;
> +}
> +


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU
  2020-08-06  1:54       ` Alex Shi
@ 2020-08-06 14:41         ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-06 14:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Michal Hocko, Vladimir Davydov

On Wed, Aug 5, 2020 at 6:54 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/6 上午6:43, Alexander Duyck 写道:
> >> @@ -878,9 +877,8 @@ void release_pages(struct page **pages, int nr)
> >>                                 spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
> >>                         }
> >>
> >> -                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
> >> -                       VM_BUG_ON_PAGE(!PageLRU(page), page);
> >>                         __ClearPageLRU(page);
> >> +                       lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
> >>                         del_page_from_lru_list(page, lruvec, page_off_lru(page));
> >>                 }
> >>
> > The more I look at this piece it seems like this change wasn't really
> > necessary. If anything it seems like it could catch potential bugs as
> > it was testing for the PageLRU flag before and then clearing it
> > manually anyway. In addition it doesn't reduce the critical path by
> > any significant amount so I am not sure these changes are providing
> > any benefit.
>
> Don't know hat kind of bug do you mean here, since the page is no one using, means
> no one could ClearPageLRU in other place,  so if you like to keep the VM_BUG_ON_PAGE,
> that should be ok.

You kind of answered your own question. Basically the bug it would
catch is if another thread were to clear the flag without getting a
reference to the page first. My preference would be to leave this code
as is for now. There isn't much value in either moving the lruvec or
removing the VM_BUG_ON_PAGE call since the critical path size would
barely be effected as it is only one or two operations anyway. What it
comes down to is that the less unnecessary changes we make the better.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 21/21] mm/lru: revise the comments of lru_lock
  2020-08-06  1:39         ` Alex Shi
@ 2020-08-06 16:27           ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-06 16:27 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen, Andrey Ryabinin, Jann Horn

On Wed, Aug 5, 2020 at 6:39 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/4 下午10:29, Alexander Duyck 写道:
> > On Tue, Aug 4, 2020 at 3:04 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> 在 2020/8/4 上午6:37, Alexander Duyck 写道:
> >>>>
> >>>>  shrink_inactive_list() also diverts any unevictable pages that it finds on the
> >>>> -inactive lists to the appropriate zone's unevictable list.
> >>>> +inactive lists to the appropriate node's unevictable list.
> >>>>
> >>>>  shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
> >>>>  after shrink_active_list() had moved them to the inactive list, or pages mapped
> >>> Same here.
> >>
> >> lruvec is used per memcg per node actually, and it fallback to node if memcg disabled.
> >> So the comments are still right.
> >>
> >> And most of changes just fix from zone->lru_lock to pgdat->lru_lock change.
> >
> > Actually in my mind one thing that might work better would be to
> > explain what the lruvec is and where it resides. Then replace zone
> > with lruvec since that is really where the unevictable list resides.
> > Then it would be correct for both the memcg and pgdat case.
>
> Could you like to revise the doc as your thought?
> >
> >>>
> >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >>>> index 64ede5f150dc..44738cdb5a55 100644
> >>>> --- a/include/linux/mm_types.h
> >>>> +++ b/include/linux/mm_types.h
> >>>> @@ -78,7 +78,7 @@ struct page {
> >>>>                 struct {        /* Page cache and anonymous pages */
> >>>>                         /**
> >>>>                          * @lru: Pageout list, eg. active_list protected by
> >>>> -                        * pgdat->lru_lock.  Sometimes used as a generic list
> >>>> +                        * lruvec->lru_lock.  Sometimes used as a generic list
> >>>>                          * by the page owner.
> >>>>                          */
> >>>>                         struct list_head lru;
> >>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >>>> index 8af956aa13cf..c92289a4e14d 100644
> >>>> --- a/include/linux/mmzone.h
> >>>> +++ b/include/linux/mmzone.h
> >>>> @@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
> >>>>  struct pglist_data;
> >>>>
> >>>>  /*
> >>>> - * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
> >>>> + * zone->lock and the lru_lock are two of the hottest locks in the kernel.
> >>>>   * So add a wild amount of padding here to ensure that they fall into separate
> >>>>   * cachelines.  There are very few zone structures in the machine, so space
> >>>>   * consumption is not a concern here.
> >>> So I don't believe you are using ZONE_PADDING in any way to try and
> >>> protect the LRU lock currently. At least you aren't using it in the
> >>> lruvec. As such it might make sense to just drop the reference to the
> >>> lru_lock here. That reminds me that we still need to review the
> >>> placement of the lru_lock and determine if there might be a better
> >>> placement and/or padding that might improve performance when under
> >>> heavy stress.
> >>>
> >>
> >> Right, is it the following looks better?
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index ccc76590f823..0ed520954843 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
> >>  struct pglist_data;
> >>
> >>  /*
> >> - * zone->lock and the lru_lock are two of the hottest locks in the kernel.
> >> - * So add a wild amount of padding here to ensure that they fall into separate
> >> + * Add a wild amount of padding here to ensure datas fall into separate
> >>   * cachelines.  There are very few zone structures in the machine, so space
> >>   * consumption is not a concern here.
> >>   */
> >>
> >> Thanks!
> >> Alex
> >
> > I would maybe tweak it to make sure it is clear that we are using this
> > to pad out items that are likely to cause cache thrash such as various
> > hot spinocks and such.
> >
>
> I appreciate if you like to change the doc better. :)

Give me a day or so. I will submit a follow-on patch with some cleanup
for the comments.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-07-25 12:59 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alex Shi
  2020-08-04 21:35   ` Alexander Duyck
@ 2020-08-06 18:38   ` Alexander Duyck
  2020-08-07  3:24     ` Alex Shi
  2020-08-17 22:58   ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alexander Duyck
  2 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-06 18:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Currently, compaction would get the lru_lock and then do page isolation
> which works fine with pgdat->lru_lock, since any page isoltion would
> compete for the lru_lock. If we want to change to memcg lru_lock, we
> have to isolate the page before getting lru_lock, thus isoltion would
> block page's memcg change which relay on page isoltion too. Then we
> could safely use per memcg lru_lock later.
>
> The new page isolation use previous introduced TestClearPageLRU() +
> pgdat lru locking which will be changed to memcg lru lock later.
>
> Hugh Dickins <hughd@google.com> fixed following bugs in this patch's
> early version:
>
> Fix lots of crashes under compaction load: isolate_migratepages_block()
> must clean up appropriately when rejecting a page, setting PageLRU again
> if it had been cleared; and a put_page() after get_page_unless_zero()
> cannot safely be done while holding locked_lruvec - it may turn out to
> be the final put_page(), which will take an lruvec lock when PageLRU.
> And move __isolate_lru_page_prepare back after get_page_unless_zero to
> make trylock_page() safe:
> trylock_page() is not safe to use at this time: its setting PG_locked
> can race with the page being freed or allocated ("Bad page"), and can
> also erase flags being set by one of those "sole owners" of a freshly
> allocated page who use non-atomic __SetPageFlag().
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/swap.h |  2 +-
>  mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
>  mm/vmscan.c          | 46 ++++++++++++++++++++++++++--------------------
>  3 files changed, 60 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2c29399b29a0..6d23d3beeff7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
> -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                   unsigned long nr_pages,
>                                                   gfp_t gfp_mask,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index f14780fc296a..2da2933fe56b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -869,6 +869,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
>                         if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
>                                 low_pfn = end_pfn;
> +                               page = NULL;
>                                 goto isolate_abort;
>                         }
>                         valid_page = page;
> @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>                         goto isolate_fail;
>
> +               /*
> +                * Be careful not to clear PageLRU until after we're
> +                * sure the page is not being freed elsewhere -- the
> +                * page release code relies on it.
> +                */
> +               if (unlikely(!get_page_unless_zero(page)))
> +                       goto isolate_fail;
> +
> +               if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> +                       goto isolate_fail_put;
> +
> +               /* Try isolate the page */
> +               if (!TestClearPageLRU(page))
> +                       goto isolate_fail_put;
> +
>                 /* If we already hold the lock, we can skip some rechecking */
>                 if (!locked) {
>                         locked = compact_lock_irqsave(&pgdat->lru_lock,
> @@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                         goto isolate_abort;
>                         }
>
> -                       /* Recheck PageLRU and PageCompound under lock */
> -                       if (!PageLRU(page))
> -                               goto isolate_fail;
> -
>                         /*
>                          * Page become compound since the non-locked check,
>                          * and it's on LRU. It can only be a THP so the order
> @@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>                                 low_pfn += compound_nr(page) - 1;
> -                               goto isolate_fail;
> +                               SetPageLRU(page);
> +                               goto isolate_fail_put;
>                         }
>                 }
>
>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
> -               /* Try isolate the page */
> -               if (__isolate_lru_page(page, isolate_mode) != 0)
> -                       goto isolate_fail;
> -
>                 /* The whole page is taken off the LRU; skip the tail pages. */
>                 if (PageCompound(page))
>                         low_pfn += compound_nr(page) - 1;
> @@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 }
>
>                 continue;
> +
> +isolate_fail_put:
> +               /* Avoid potential deadlock in freeing page under lru_lock */
> +               if (locked) {
> +                       spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +                       locked = false;
> +               }
> +               put_page(page);
> +
>  isolate_fail:
>                 if (!skip_on_failure)
>                         continue;
> @@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         if (unlikely(low_pfn > end_pfn))
>                 low_pfn = end_pfn;
>
> +       page = NULL;
> +
>  isolate_abort:
>         if (locked)
>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (page) {
> +               SetPageLRU(page);
> +               put_page(page);
> +       }
>
>         /*
>          * Updated the cached scanner pfn once the pageblock has been scanned

We should probably be calling SetPageLRU before we release the lru
lock instead of before. It might make sense to just call it before we
get here, similar to how you did in the isolate_fail_put case a few
lines later. Otherwise this seems to violate the rules you had set up
earlier where we were only going to be setting the LRU bit while
holding the LRU lock.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-06 18:38   ` Alexander Duyck
@ 2020-08-07  3:24     ` Alex Shi
  2020-08-07 14:51       ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-07  3:24 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen



在 2020/8/7 上午2:38, Alexander Duyck 写道:
>> +
>>  isolate_abort:
>>         if (locked)
>>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>> +       if (page) {
>> +               SetPageLRU(page);
>> +               put_page(page);
>> +       }
>>
>>         /*
>>          * Updated the cached scanner pfn once the pageblock has been scanned
> We should probably be calling SetPageLRU before we release the lru
> lock instead of before. It might make sense to just call it before we
> get here, similar to how you did in the isolate_fail_put case a few
> lines later. Otherwise this seems to violate the rules you had set up
> earlier where we were only going to be setting the LRU bit while
> holding the LRU lock.

Hi Alex,

Set out of lock here should be fine. I never said we must set the bit in locking.
And this page is get by get_page_unless_zero(), no warry on release.

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-07  3:24     ` Alex Shi
@ 2020-08-07 14:51       ` Alexander Duyck
  2020-08-10 13:10         ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-07 14:51 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Thu, Aug 6, 2020 at 8:25 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/7 上午2:38, Alexander Duyck 写道:
> >> +
> >>  isolate_abort:
> >>         if (locked)
> >>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> >> +       if (page) {
> >> +               SetPageLRU(page);
> >> +               put_page(page);
> >> +       }
> >>
> >>         /*
> >>          * Updated the cached scanner pfn once the pageblock has been scanned
> > We should probably be calling SetPageLRU before we release the lru
> > lock instead of before. It might make sense to just call it before we
> > get here, similar to how you did in the isolate_fail_put case a few
> > lines later. Otherwise this seems to violate the rules you had set up
> > earlier where we were only going to be setting the LRU bit while
> > holding the LRU lock.
>
> Hi Alex,
>
> Set out of lock here should be fine. I never said we must set the bit in locking.
> And this page is get by get_page_unless_zero(), no warry on release.
>
> Thanks
> Alex

I wonder if this entire section shouldn't be restructured. This is the
only spot I can see where we are resetting the LRU flag instead of
pulling the page from the LRU list with the lock held. Looking over
the code it seems like something like that should be possible. I am
not sure the LRU lock is really protecting us in either the
PageCompound check nor the skip bits. It seems like holding a
reference on the page should prevent it from switching between
compound or not, and the skip bits are per pageblock with the LRU bits
being per node/memcg which I would think implies that we could have
multiple LRU locks that could apply to a single skip bit.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-07 14:51       ` Alexander Duyck
@ 2020-08-10 13:10         ` Alex Shi
  2020-08-10 14:41           ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-10 13:10 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen



在 2020/8/7 下午10:51, Alexander Duyck 写道:
> I wonder if this entire section shouldn't be restructured. This is the
> only spot I can see where we are resetting the LRU flag instead of
> pulling the page from the LRU list with the lock held. Looking over
> the code it seems like something like that should be possible. I am
> not sure the LRU lock is really protecting us in either the
> PageCompound check nor the skip bits. It seems like holding a
> reference on the page should prevent it from switching between
> compound or not, and the skip bits are per pageblock with the LRU bits
> being per node/memcg which I would think implies that we could have
> multiple LRU locks that could apply to a single skip bit.

Hi Alexander,

I don't find problem yet on compound or skip bit usage. Would you clarify the
issue do you concerned? 

Thanks!


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-10 13:10         ` Alex Shi
@ 2020-08-10 14:41           ` Alexander Duyck
  2020-08-11  8:22             ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-10 14:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/7 下午10:51, Alexander Duyck 写道:
> > I wonder if this entire section shouldn't be restructured. This is the
> > only spot I can see where we are resetting the LRU flag instead of
> > pulling the page from the LRU list with the lock held. Looking over
> > the code it seems like something like that should be possible. I am
> > not sure the LRU lock is really protecting us in either the
> > PageCompound check nor the skip bits. It seems like holding a
> > reference on the page should prevent it from switching between
> > compound or not, and the skip bits are per pageblock with the LRU bits
> > being per node/memcg which I would think implies that we could have
> > multiple LRU locks that could apply to a single skip bit.
>
> Hi Alexander,
>
> I don't find problem yet on compound or skip bit usage. Would you clarify the
> issue do you concerned?
>
> Thanks!

The point I was getting at is that the LRU lock is being used to
protect these and with your changes I don't think that makes sense
anymore.

The skip bits are per-pageblock bits. With your change the LRU lock is
now per memcg first and then per node. As such I do not believe it
really provides any sort of exclusive access to the skip bits. I still
have to look into this more, but it seems like you need a lock per
either section or zone that can be used to protect those bits and deal
with this sooner rather than waiting until you have found an LRU page.
The one part that is confusing though is that the definition of the
skip bits seems to call out that they are a hint since they are not
protected by a lock, but that is exactly what has been happening here.

The point I was getting at with the PageCompound check is that instead
of needing the LRU lock you should be able to look at PageCompound as
soon as you call get_page_unless_zero() and preempt the need to set
the LRU bit again. Instead of trying to rely on the LRU lock to
guarantee that the page hasn't been merged you could just rely on the
fact that you are holding a reference to it so it isn't going to
switch between being compound or order 0 since it cannot be freed. It
spoils the idea I originally had of combining the logic for
get_page_unless_zero and TestClearPageLRU into a single function, but
the advantage is you aren't clearing the LRU flag unless you are
actually going to pull the page from the LRU list.

My main worry is that this is the one spot where we appear to be
clearing the LRU bit without ever actually pulling the page off of the
LRU list, and I am thinking we would be better served by addressing
the skip and PageCompound checks earlier rather than adding code to
set the bit again if either of those cases are encountered. This way
we don't pseudo-pin pages in the LRU if they are compound or supposed
to be skipped.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-10 14:41           ` Alexander Duyck
@ 2020-08-11  8:22             ` Alex Shi
  2020-08-11 14:47               ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-11  8:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen



在 2020/8/10 下午10:41, Alexander Duyck 写道:
> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> 在 2020/8/7 下午10:51, Alexander Duyck 写道:
>>> I wonder if this entire section shouldn't be restructured. This is the
>>> only spot I can see where we are resetting the LRU flag instead of
>>> pulling the page from the LRU list with the lock held. Looking over
>>> the code it seems like something like that should be possible. I am
>>> not sure the LRU lock is really protecting us in either the
>>> PageCompound check nor the skip bits. It seems like holding a
>>> reference on the page should prevent it from switching between
>>> compound or not, and the skip bits are per pageblock with the LRU bits
>>> being per node/memcg which I would think implies that we could have
>>> multiple LRU locks that could apply to a single skip bit.
>>
>> Hi Alexander,
>>
>> I don't find problem yet on compound or skip bit usage. Would you clarify the
>> issue do you concerned?
>>
>> Thanks!
> 
> The point I was getting at is that the LRU lock is being used to
> protect these and with your changes I don't think that makes sense
> anymore.
> 
> The skip bits are per-pageblock bits. With your change the LRU lock is
> now per memcg first and then per node. As such I do not believe it
> really provides any sort of exclusive access to the skip bits. I still
> have to look into this more, but it seems like you need a lock per
> either section or zone that can be used to protect those bits and deal
> with this sooner rather than waiting until you have found an LRU page.
> The one part that is confusing though is that the definition of the
> skip bits seems to call out that they are a hint since they are not
> protected by a lock, but that is exactly what has been happening here.
> 

The skip bits are safe here, since even it race with other skip action,
It will still skip out. The skip action is try not to compaction too much,
not a exclusive action needs avoid race.


> The point I was getting at with the PageCompound check is that instead
> of needing the LRU lock you should be able to look at PageCompound as
> soon as you call get_page_unless_zero() and preempt the need to set
> the LRU bit again. Instead of trying to rely on the LRU lock to
> guarantee that the page hasn't been merged you could just rely on the
> fact that you are holding a reference to it so it isn't going to
> switch between being compound or order 0 since it cannot be freed. It
> spoils the idea I originally had of combining the logic for
> get_page_unless_zero and TestClearPageLRU into a single function, but
> the advantage is you aren't clearing the LRU flag unless you are
> actually going to pull the page from the LRU list.

Sorry, I still can not follow you here. Compound code part is unchanged
and follow the original logical. So would you like to pose a new code to
see if its works?

Thanks
Alex

> 
> My main worry is that this is the one spot where we appear to be
> clearing the LRU bit without ever actually pulling the page off of the
> LRU list, and I am thinking we would be better served by addressing
> the skip and PageCompound checks earlier rather than adding code to
> set the bit again if either of those cases are encountered. This way
> we don't pseudo-pin pages in the LRU if they are compound or supposed
> to be skipped.
> 
> Thanks.
> 
> - Alex
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-11  8:22             ` Alex Shi
@ 2020-08-11 14:47               ` Alexander Duyck
  2020-08-12 11:43                 ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-11 14:47 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/10 下午10:41, Alexander Duyck 写道:
> > On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> 在 2020/8/7 下午10:51, Alexander Duyck 写道:
> >>> I wonder if this entire section shouldn't be restructured. This is the
> >>> only spot I can see where we are resetting the LRU flag instead of
> >>> pulling the page from the LRU list with the lock held. Looking over
> >>> the code it seems like something like that should be possible. I am
> >>> not sure the LRU lock is really protecting us in either the
> >>> PageCompound check nor the skip bits. It seems like holding a
> >>> reference on the page should prevent it from switching between
> >>> compound or not, and the skip bits are per pageblock with the LRU bits
> >>> being per node/memcg which I would think implies that we could have
> >>> multiple LRU locks that could apply to a single skip bit.
> >>
> >> Hi Alexander,
> >>
> >> I don't find problem yet on compound or skip bit usage. Would you clarify the
> >> issue do you concerned?
> >>
> >> Thanks!
> >
> > The point I was getting at is that the LRU lock is being used to
> > protect these and with your changes I don't think that makes sense
> > anymore.
> >
> > The skip bits are per-pageblock bits. With your change the LRU lock is
> > now per memcg first and then per node. As such I do not believe it
> > really provides any sort of exclusive access to the skip bits. I still
> > have to look into this more, but it seems like you need a lock per
> > either section or zone that can be used to protect those bits and deal
> > with this sooner rather than waiting until you have found an LRU page.
> > The one part that is confusing though is that the definition of the
> > skip bits seems to call out that they are a hint since they are not
> > protected by a lock, but that is exactly what has been happening here.
> >
>
> The skip bits are safe here, since even it race with other skip action,
> It will still skip out. The skip action is try not to compaction too much,
> not a exclusive action needs avoid race.

That would be the case if it didn't have the impact that they
currently do on the compaction process. What I am getting at is that a
race was introduced when you placed this test between the clearing of
the LRU flag and the actual pulling of the page from the LRU list. So
if you tested the skip bits before clearing the LRU flag then I would
be okay with the code, however because it is triggering an abort after
the LRU flag is cleared then you are creating a situation where
multiple processes will be stomping all over each other as you can
have each thread essentially take a page via the LRU flag, but only
one thread will process a page and it could skip over all other pages
that preemptively had their LRU flag cleared.

If you take a look at the test_and_set_skip the function only acts on
the pageblock aligned PFN for a given range. WIth the changes you have
in place now that would mean that only one thread would ever actually
call this function anyway since the first PFN would take the LRU flag
so no other thread could follow through and test or set the bit as
well. The expectation before was that all threads would encounter this
test and either proceed after setting the bit for the first PFN or
abort after testing the first PFN. With you changes only the first
thread actually runs this test and then it and the others will likely
encounter multiple failures as they are all clearing LRU bits
simultaneously and tripping each other up. That is why the skip bit
must have a test and set done before you even get to the point of
clearing the LRU flag.

> > The point I was getting at with the PageCompound check is that instead
> > of needing the LRU lock you should be able to look at PageCompound as
> > soon as you call get_page_unless_zero() and preempt the need to set
> > the LRU bit again. Instead of trying to rely on the LRU lock to
> > guarantee that the page hasn't been merged you could just rely on the
> > fact that you are holding a reference to it so it isn't going to
> > switch between being compound or order 0 since it cannot be freed. It
> > spoils the idea I originally had of combining the logic for
> > get_page_unless_zero and TestClearPageLRU into a single function, but
> > the advantage is you aren't clearing the LRU flag unless you are
> > actually going to pull the page from the LRU list.
>
> Sorry, I still can not follow you here. Compound code part is unchanged
> and follow the original logical. So would you like to pose a new code to
> see if its works?

No there are significant changes as you reordered all of the
operations. Prior to your change the LRU bit was checked, but not
cleared before testing for PageCompound. Now you are clearing it
before you are testing if it is a compound page. So if compaction is
running we will be seeing the pages in the LRU stay put, but the
compound bit flickering off and on if the compound page is encountered
with the wrong or NULL lruvec. What I was suggesting is that the
PageCompound test probably doesn't need to be concerned with the lock
after your changes. You could test it after you call
get_page_unless_zero() and before you call
__isolate_lru_page_prepare(). Instead of relying on the LRU lock to
protect us from the page switching between compound and not we would
be relying on the fact that we are holding a reference to the page so
it should not be freed and transition between compound or not.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-11 14:47               ` Alexander Duyck
@ 2020-08-12 11:43                 ` Alex Shi
  2020-08-12 12:16                   ` Alex Shi
  2020-08-12 16:51                   ` Alexander Duyck
  0 siblings, 2 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-12 11:43 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen



在 2020/8/11 下午10:47, Alexander Duyck 写道:
> On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> 在 2020/8/10 下午10:41, Alexander Duyck 写道:
>>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> 在 2020/8/7 下午10:51, Alexander Duyck 写道:
>>>>> I wonder if this entire section shouldn't be restructured. This is the
>>>>> only spot I can see where we are resetting the LRU flag instead of
>>>>> pulling the page from the LRU list with the lock held. Looking over
>>>>> the code it seems like something like that should be possible. I am
>>>>> not sure the LRU lock is really protecting us in either the
>>>>> PageCompound check nor the skip bits. It seems like holding a
>>>>> reference on the page should prevent it from switching between
>>>>> compound or not, and the skip bits are per pageblock with the LRU bits
>>>>> being per node/memcg which I would think implies that we could have
>>>>> multiple LRU locks that could apply to a single skip bit.
>>>>
>>>> Hi Alexander,
>>>>
>>>> I don't find problem yet on compound or skip bit usage. Would you clarify the
>>>> issue do you concerned?
>>>>
>>>> Thanks!
>>>
>>> The point I was getting at is that the LRU lock is being used to
>>> protect these and with your changes I don't think that makes sense
>>> anymore.
>>>
>>> The skip bits are per-pageblock bits. With your change the LRU lock is
>>> now per memcg first and then per node. As such I do not believe it
>>> really provides any sort of exclusive access to the skip bits. I still
>>> have to look into this more, but it seems like you need a lock per
>>> either section or zone that can be used to protect those bits and deal
>>> with this sooner rather than waiting until you have found an LRU page.
>>> The one part that is confusing though is that the definition of the
>>> skip bits seems to call out that they are a hint since they are not
>>> protected by a lock, but that is exactly what has been happening here.
>>>
>>
>> The skip bits are safe here, since even it race with other skip action,
>> It will still skip out. The skip action is try not to compaction too much,
>> not a exclusive action needs avoid race.
> 
> That would be the case if it didn't have the impact that they
> currently do on the compaction process. What I am getting at is that a
> race was introduced when you placed this test between the clearing of
> the LRU flag and the actual pulling of the page from the LRU list. So
> if you tested the skip bits before clearing the LRU flag then I would
> be okay with the code, however because it is triggering an abort after

Hi Alexander,

Thanks a lot for comments and suggestions!

I have tried your suggestion:

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
---
 mm/compaction.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b99c96c4862d..6c881dee8c9a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
 			goto isolate_fail_put;

+		/* Try get exclusive access under lock */
+		if (!skip_updated) {
+			skip_updated = true;
+			if (test_and_set_skip(cc, page, low_pfn))
+				goto isolate_fail_put;
+		}
+
 		/* Try isolate the page */
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
@@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat)

 			lruvec_memcg_debug(lruvec, page);

-			/* Try get exclusive access under lock */
-			if (!skip_updated) {
-				skip_updated = true;
-				if (test_and_set_skip(cc, page, low_pfn))
-					goto isolate_abort;
-			}
-
 			/*
 			 * Page become compound since the non-locked check,
 			 * and it's on LRU. It can only be a THP so the order
--

Performance of case-lru-file-mmap-read in vm-scalibity is dropped a bit. not
helpful

> the LRU flag is cleared then you are creating a situation where
> multiple processes will be stomping all over each other as you can
> have each thread essentially take a page via the LRU flag, but only
> one thread will process a page and it could skip over all other pages
> that preemptively had their LRU flag cleared.

It increase a bit crowd here, but lru_lock do reduce some them, and skip_bit
could stop each other in a array check(bitmap). So compare to whole node 
lru_lock, the net profit is clear in patch 17.

> 
> If you take a look at the test_and_set_skip the function only acts on
> the pageblock aligned PFN for a given range. WIth the changes you have
> in place now that would mean that only one thread would ever actually
> call this function anyway since the first PFN would take the LRU flag
> so no other thread could follow through and test or set the bit as

Is this good for only one process could do test_and_set_skip? is that 
the 'skip' meaning to be?

> well. The expectation before was that all threads would encounter this
> test and either proceed after setting the bit for the first PFN or
> abort after testing the first PFN. With you changes only the first
> thread actually runs this test and then it and the others will likely
> encounter multiple failures as they are all clearing LRU bits
> simultaneously and tripping each other up. That is why the skip bit
> must have a test and set done before you even get to the point of
> clearing the LRU flag.

It make the things warse in my machine, would you like to have a try by yourself?

> 
>>> The point I was getting at with the PageCompound check is that instead
>>> of needing the LRU lock you should be able to look at PageCompound as
>>> soon as you call get_page_unless_zero() and preempt the need to set
>>> the LRU bit again. Instead of trying to rely on the LRU lock to
>>> guarantee that the page hasn't been merged you could just rely on the
>>> fact that you are holding a reference to it so it isn't going to
>>> switch between being compound or order 0 since it cannot be freed. It
>>> spoils the idea I originally had of combining the logic for
>>> get_page_unless_zero and TestClearPageLRU into a single function, but
>>> the advantage is you aren't clearing the LRU flag unless you are
>>> actually going to pull the page from the LRU list.
>>
>> Sorry, I still can not follow you here. Compound code part is unchanged
>> and follow the original logical. So would you like to pose a new code to
>> see if its works?
> 
> No there are significant changes as you reordered all of the
> operations. Prior to your change the LRU bit was checked, but not
> cleared before testing for PageCompound. Now you are clearing it
> before you are testing if it is a compound page. So if compaction is
> running we will be seeing the pages in the LRU stay put, but the
> compound bit flickering off and on if the compound page is encountered
> with the wrong or NULL lruvec. What I was suggesting is that the

The lruvec could be wrong or NULL here, that is the base stone of whole
patchset.

> PageCompound test probably doesn't need to be concerned with the lock
> after your changes. You could test it after you call
> get_page_unless_zero() and before you call
> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
> protect us from the page switching between compound and not we would
> be relying on the fact that we are holding a reference to the page so
> it should not be freed and transition between compound or not.
> 

I have tried the patch as your suggested, it has no clear help on performance
on above vm-scaliblity case. Maybe it's due to we checked the same thing
before lock already.

diff --git a/mm/compaction.c b/mm/compaction.c
index b99c96c4862d..cf2ac5148001 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (unlikely(!get_page_unless_zero(page)))
 			goto isolate_fail;

+			/*
+			 * Page become compound since the non-locked check,
+			 * and it's on LRU. It can only be a THP so the order
+			 * is safe to read and it's 0 for tail pages.
+			 */
+			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
+				low_pfn += compound_nr(page) - 1;
+				goto isolate_fail_put;
+			}
+
 		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
 			goto isolate_fail_put;

@@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 					goto isolate_abort;
 			}

-			/*
-			 * Page become compound since the non-locked check,
-			 * and it's on LRU. It can only be a THP so the order
-			 * is safe to read and it's 0 for tail pages.
-			 */
-			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
-				low_pfn += compound_nr(page) - 1;
-				SetPageLRU(page);
-				goto isolate_fail_put;
-			}
 		} else
 			rcu_read_unlock();

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-12 11:43                 ` Alex Shi
@ 2020-08-12 12:16                   ` Alex Shi
  2020-08-12 16:51                   ` Alexander Duyck
  1 sibling, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-12 12:16 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen



在 2020/8/12 下午7:43, Alex Shi 写道:
>>> Sorry, I still can not follow you here. Compound code part is unchanged
>>> and follow the original logical. So would you like to pose a new code to
>>> see if its works?
>> No there are significant changes as you reordered all of the
>> operations. Prior to your change the LRU bit was checked, but not
>> cleared before testing for PageCompound. Now you are clearing it
>> before you are testing if it is a compound page. So if compaction is
>> running we will be seeing the pages in the LRU stay put, but the
>> compound bit flickering off and on if the compound page is encountered
>> with the wrong or NULL lruvec. What I was suggesting is that the

> The lruvec could be wrong or NULL here, that is the base stone of whole
> patchset.
> 
Sorry for typo. s/could/could not/


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-12 11:43                 ` Alex Shi
  2020-08-12 12:16                   ` Alex Shi
@ 2020-08-12 16:51                   ` Alexander Duyck
  2020-08-13  1:46                     ` Alex Shi
  1 sibling, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-12 16:51 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/11 下午10:47, Alexander Duyck 写道:
> > On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> 在 2020/8/10 下午10:41, Alexander Duyck 写道:
> >>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> 在 2020/8/7 下午10:51, Alexander Duyck 写道:
> >>>>> I wonder if this entire section shouldn't be restructured. This is the
> >>>>> only spot I can see where we are resetting the LRU flag instead of
> >>>>> pulling the page from the LRU list with the lock held. Looking over
> >>>>> the code it seems like something like that should be possible. I am
> >>>>> not sure the LRU lock is really protecting us in either the
> >>>>> PageCompound check nor the skip bits. It seems like holding a
> >>>>> reference on the page should prevent it from switching between
> >>>>> compound or not, and the skip bits are per pageblock with the LRU bits
> >>>>> being per node/memcg which I would think implies that we could have
> >>>>> multiple LRU locks that could apply to a single skip bit.
> >>>>
> >>>> Hi Alexander,
> >>>>
> >>>> I don't find problem yet on compound or skip bit usage. Would you clarify the
> >>>> issue do you concerned?
> >>>>
> >>>> Thanks!
> >>>
> >>> The point I was getting at is that the LRU lock is being used to
> >>> protect these and with your changes I don't think that makes sense
> >>> anymore.
> >>>
> >>> The skip bits are per-pageblock bits. With your change the LRU lock is
> >>> now per memcg first and then per node. As such I do not believe it
> >>> really provides any sort of exclusive access to the skip bits. I still
> >>> have to look into this more, but it seems like you need a lock per
> >>> either section or zone that can be used to protect those bits and deal
> >>> with this sooner rather than waiting until you have found an LRU page.
> >>> The one part that is confusing though is that the definition of the
> >>> skip bits seems to call out that they are a hint since they are not
> >>> protected by a lock, but that is exactly what has been happening here.
> >>>
> >>
> >> The skip bits are safe here, since even it race with other skip action,
> >> It will still skip out. The skip action is try not to compaction too much,
> >> not a exclusive action needs avoid race.
> >
> > That would be the case if it didn't have the impact that they
> > currently do on the compaction process. What I am getting at is that a
> > race was introduced when you placed this test between the clearing of
> > the LRU flag and the actual pulling of the page from the LRU list. So
> > if you tested the skip bits before clearing the LRU flag then I would
> > be okay with the code, however because it is triggering an abort after
>
> Hi Alexander,
>
> Thanks a lot for comments and suggestions!
>
> I have tried your suggestion:
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> ---
>  mm/compaction.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b99c96c4862d..6c881dee8c9a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>                         goto isolate_fail_put;
>
> +               /* Try get exclusive access under lock */
> +               if (!skip_updated) {
> +                       skip_updated = true;
> +                       if (test_and_set_skip(cc, page, low_pfn))
> +                               goto isolate_fail_put;
> +               }
> +
>                 /* Try isolate the page */
>                 if (!TestClearPageLRU(page))
>                         goto isolate_fail_put;

I would have made this much sooner. Probably before you call
get_page_unless_zero so as to avoid the unnecessary atomic operations.

> @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>
>                         lruvec_memcg_debug(lruvec, page);
>
> -                       /* Try get exclusive access under lock */
> -                       if (!skip_updated) {
> -                               skip_updated = true;
> -                               if (test_and_set_skip(cc, page, low_pfn))
> -                                       goto isolate_abort;
> -                       }
> -
>                         /*
>                          * Page become compound since the non-locked check,
>                          * and it's on LRU. It can only be a THP so the order
> --
>
> Performance of case-lru-file-mmap-read in vm-scalibity is dropped a bit. not
> helpful

So one issue with this change is that it is still too late to be of
much benefit. Really you should probably be doing this much sooner,
for example somewhere before the get_page_unless_zero(). Also the
thing that still has me scratching my head is the "Try get exclusive
access under lock" comment. The function declaration says this is
supposed to be a hint, but we were using the LRU lock to synchronize
it. I'm wondering if we should really be protecting this with the zone
lock since we are modifying the pageblock flags which also contain the
migration type value for the pageblock and are only modified while
holding the zone lock.

> > the LRU flag is cleared then you are creating a situation where
> > multiple processes will be stomping all over each other as you can
> > have each thread essentially take a page via the LRU flag, but only
> > one thread will process a page and it could skip over all other pages
> > that preemptively had their LRU flag cleared.
>
> It increase a bit crowd here, but lru_lock do reduce some them, and skip_bit
> could stop each other in a array check(bitmap). So compare to whole node
> lru_lock, the net profit is clear in patch 17.

My concern is that what you can end up with is multiple threads all
working over the same pageblock for isolation. With the old code the
LRU lock was used to make certain that test_and_set_skip was being
synchronized on the first page in the pageblock so you would only have
one thread going through and working a single pageblock. However after
your changes it doesn't seem like the test_and_set_skip has that
protection since only one thread will ever be able to successfully
call it for the first page in the pageblock assuming that the LRU flag
is set on the first page in the pageblock block.

> >
> > If you take a look at the test_and_set_skip the function only acts on
> > the pageblock aligned PFN for a given range. WIth the changes you have
> > in place now that would mean that only one thread would ever actually
> > call this function anyway since the first PFN would take the LRU flag
> > so no other thread could follow through and test or set the bit as
>
> Is this good for only one process could do test_and_set_skip? is that
> the 'skip' meaning to be?

So only one thread really getting to fully use test_and_set_skip is
good, however the issue is that there is nothing to synchronize the
testing from the other threads. As a result the other threads could
have isolated other pages within the pageblock before the thread that
is calling test_and_set_skip will get to complete the setting of the
skip bit. This will result in isolation failures for the thread that
set the skip bit which may be undesirable behavior.

With the old code the threads were all synchronized on testing the
first PFN in the pageblock while holding the LRU lock and that is what
we lost. My concern is the cases where skip_on_failure == true are
going to fail much more often now as the threads can easily interfere
with each other.

> > well. The expectation before was that all threads would encounter this
> > test and either proceed after setting the bit for the first PFN or
> > abort after testing the first PFN. With you changes only the first
> > thread actually runs this test and then it and the others will likely
> > encounter multiple failures as they are all clearing LRU bits
> > simultaneously and tripping each other up. That is why the skip bit
> > must have a test and set done before you even get to the point of
> > clearing the LRU flag.
>
> It make the things warse in my machine, would you like to have a try by yourself?

I plan to do that. I have already been working on a few things to
clean up and optimize your patch set further. I will try to submit an
RFC this evening so we can discuss.

> >
> >>> The point I was getting at with the PageCompound check is that instead
> >>> of needing the LRU lock you should be able to look at PageCompound as
> >>> soon as you call get_page_unless_zero() and preempt the need to set
> >>> the LRU bit again. Instead of trying to rely on the LRU lock to
> >>> guarantee that the page hasn't been merged you could just rely on the
> >>> fact that you are holding a reference to it so it isn't going to
> >>> switch between being compound or order 0 since it cannot be freed. It
> >>> spoils the idea I originally had of combining the logic for
> >>> get_page_unless_zero and TestClearPageLRU into a single function, but
> >>> the advantage is you aren't clearing the LRU flag unless you are
> >>> actually going to pull the page from the LRU list.
> >>
> >> Sorry, I still can not follow you here. Compound code part is unchanged
> >> and follow the original logical. So would you like to pose a new code to
> >> see if its works?
> >
> > No there are significant changes as you reordered all of the
> > operations. Prior to your change the LRU bit was checked, but not
> > cleared before testing for PageCompound. Now you are clearing it
> > before you are testing if it is a compound page. So if compaction is
> > running we will be seeing the pages in the LRU stay put, but the
> > compound bit flickering off and on if the compound page is encountered
> > with the wrong or NULL lruvec. What I was suggesting is that the
>
> The lruvec could be wrong or NULL here, that is the base stone of whole
> patchset.

Sorry I had a typo in my comment as well as it is the LRU bit that
will be flickering, not the compound. The goal here is to avoid
clearing the LRU bit unless we are sure we are going to take the
lruvec lock and pull the page from the list.

> > PageCompound test probably doesn't need to be concerned with the lock
> > after your changes. You could test it after you call
> > get_page_unless_zero() and before you call
> > __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
> > protect us from the page switching between compound and not we would
> > be relying on the fact that we are holding a reference to the page so
> > it should not be freed and transition between compound or not.
> >
>
> I have tried the patch as your suggested, it has no clear help on performance
> on above vm-scaliblity case. Maybe it's due to we checked the same thing
> before lock already.
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b99c96c4862d..cf2ac5148001 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (unlikely(!get_page_unless_zero(page)))
>                         goto isolate_fail;
>
> +                       /*
> +                        * Page become compound since the non-locked check,
> +                        * and it's on LRU. It can only be a THP so the order
> +                        * is safe to read and it's 0 for tail pages.
> +                        */
> +                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> +                               low_pfn += compound_nr(page) - 1;
> +                               goto isolate_fail_put;
> +                       }
> +
>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>                         goto isolate_fail_put;
>
> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                         goto isolate_abort;
>                         }
>
> -                       /*
> -                        * Page become compound since the non-locked check,
> -                        * and it's on LRU. It can only be a THP so the order
> -                        * is safe to read and it's 0 for tail pages.
> -                        */
> -                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> -                               low_pfn += compound_nr(page) - 1;
> -                               SetPageLRU(page);
> -                               goto isolate_fail_put;
> -                       }
>                 } else
>                         rcu_read_unlock();
>

So actually there is more we could do than just this. Specifically a
few lines below the rcu_read_lock there is yet another PageCompound
check that sets low_pfn yet again. So in theory we could combine both
of those and modify the code so you end up with something more like:
@@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
*cc, unsigned long low_pfn,
                if (unlikely(!get_page_unless_zero(page)))
                        goto isolate_fail;

+               if (PageCompound(page)) {
+                       const unsigned int order = compound_order(page);
+
+                       if (likely(order < MAX_ORDER))
+                               low_pfn += (1UL << order) - 1;
+
+                       if (unlikely(!cc->alloc_contig))
+                               goto isolate_fail_put;
+               }
+
                if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
                        goto isolate_fail_put;

Doing this you would be more likely to skip over the entire compound
page in a single jump should you not be able to either take the LRU
bit or encounter a busy page in __isolate_Lru_page_prepare. I had
copied this bit from an earlier check and modified it as I was not
sure I can guarantee that this is a THP since we haven't taken the LRU
lock yet. However I believe the page cannot be split up while we are
holding the extra reference so the PageCompound flag and order should
not change until we call put_page.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-12 16:51                   ` Alexander Duyck
@ 2020-08-13  1:46                     ` Alex Shi
  2020-08-13  2:17                       ` Alexander Duyck
  2020-08-13  4:02                       ` [RFC PATCH 0/3] " Alexander Duyck
  0 siblings, 2 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-13  1:46 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen



在 2020/8/13 上午12:51, Alexander Duyck 写道:
> On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> 在 2020/8/11 下午10:47, Alexander Duyck 写道:
>>> On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> 在 2020/8/10 下午10:41, Alexander Duyck 写道:
>>>>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 在 2020/8/7 下午10:51, Alexander Duyck 写道:
>>>>>>> I wonder if this entire section shouldn't be restructured. This is the
>>>>>>> only spot I can see where we are resetting the LRU flag instead of
>>>>>>> pulling the page from the LRU list with the lock held. Looking over
>>>>>>> the code it seems like something like that should be possible. I am
>>>>>>> not sure the LRU lock is really protecting us in either the
>>>>>>> PageCompound check nor the skip bits. It seems like holding a
>>>>>>> reference on the page should prevent it from switching between
>>>>>>> compound or not, and the skip bits are per pageblock with the LRU bits
>>>>>>> being per node/memcg which I would think implies that we could have
>>>>>>> multiple LRU locks that could apply to a single skip bit.
>>>>>>
>>>>>> Hi Alexander,
>>>>>>
>>>>>> I don't find problem yet on compound or skip bit usage. Would you clarify the
>>>>>> issue do you concerned?
>>>>>>
>>>>>> Thanks!
>>>>>
>>>>> The point I was getting at is that the LRU lock is being used to
>>>>> protect these and with your changes I don't think that makes sense
>>>>> anymore.
>>>>>
>>>>> The skip bits are per-pageblock bits. With your change the LRU lock is
>>>>> now per memcg first and then per node. As such I do not believe it
>>>>> really provides any sort of exclusive access to the skip bits. I still
>>>>> have to look into this more, but it seems like you need a lock per
>>>>> either section or zone that can be used to protect those bits and deal
>>>>> with this sooner rather than waiting until you have found an LRU page.
>>>>> The one part that is confusing though is that the definition of the
>>>>> skip bits seems to call out that they are a hint since they are not
>>>>> protected by a lock, but that is exactly what has been happening here.
>>>>>
>>>>
>>>> The skip bits are safe here, since even it race with other skip action,
>>>> It will still skip out. The skip action is try not to compaction too much,
>>>> not a exclusive action needs avoid race.
>>>
>>> That would be the case if it didn't have the impact that they
>>> currently do on the compaction process. What I am getting at is that a
>>> race was introduced when you placed this test between the clearing of
>>> the LRU flag and the actual pulling of the page from the LRU list. So
>>> if you tested the skip bits before clearing the LRU flag then I would
>>> be okay with the code, however because it is triggering an abort after
>>
>> Hi Alexander,
>>
>> Thanks a lot for comments and suggestions!
>>
>> I have tried your suggestion:
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> ---
>>  mm/compaction.c | 14 +++++++-------
>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index b99c96c4862d..6c881dee8c9a 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>>                         goto isolate_fail_put;
>>
>> +               /* Try get exclusive access under lock */
>> +               if (!skip_updated) {
>> +                       skip_updated = true;
>> +                       if (test_and_set_skip(cc, page, low_pfn))
>> +                               goto isolate_fail_put;
>> +               }
>> +
>>                 /* Try isolate the page */
>>                 if (!TestClearPageLRU(page))
>>                         goto isolate_fail_put;
> 
> I would have made this much sooner. Probably before you call
> get_page_unless_zero so as to avoid the unnecessary atomic operations.
> 
>> @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>
>>                         lruvec_memcg_debug(lruvec, page);
>>
>> -                       /* Try get exclusive access under lock */
>> -                       if (!skip_updated) {
>> -                               skip_updated = true;
>> -                               if (test_and_set_skip(cc, page, low_pfn))
>> -                                       goto isolate_abort;
>> -                       }
>> -
>>                         /*
>>                          * Page become compound since the non-locked check,
>>                          * and it's on LRU. It can only be a THP so the order
>> --
>>
>> Performance of case-lru-file-mmap-read in vm-scalibity is dropped a bit. not
>> helpful
> 
> So one issue with this change is that it is still too late to be of
> much benefit. Really you should probably be doing this much sooner,
> for example somewhere before the get_page_unless_zero(). Also the
> thing that still has me scratching my head is the "Try get exclusive
> access under lock" comment. The function declaration says this is
> supposed to be a hint, but we were using the LRU lock to synchronize
> it. I'm wondering if we should really be protecting this with the zone
> lock since we are modifying the pageblock flags which also contain the
> migration type value for the pageblock and are only modified while
> holding the zone lock.

zone lock is probability better. you can try and test.
> 
>>> the LRU flag is cleared then you are creating a situation where
>>> multiple processes will be stomping all over each other as you can
>>> have each thread essentially take a page via the LRU flag, but only
>>> one thread will process a page and it could skip over all other pages
>>> that preemptively had their LRU flag cleared.
>>
>> It increase a bit crowd here, but lru_lock do reduce some them, and skip_bit
>> could stop each other in a array check(bitmap). So compare to whole node
>> lru_lock, the net profit is clear in patch 17.
> 
> My concern is that what you can end up with is multiple threads all
> working over the same pageblock for isolation. With the old code the
> LRU lock was used to make certain that test_and_set_skip was being
> synchronized on the first page in the pageblock so you would only have
> one thread going through and working a single pageblock. However after
> your changes it doesn't seem like the test_and_set_skip has that
> protection since only one thread will ever be able to successfully
> call it for the first page in the pageblock assuming that the LRU flag
> is set on the first page in the pageblock block.
> 
>>>
>>> If you take a look at the test_and_set_skip the function only acts on
>>> the pageblock aligned PFN for a given range. WIth the changes you have
>>> in place now that would mean that only one thread would ever actually
>>> call this function anyway since the first PFN would take the LRU flag
>>> so no other thread could follow through and test or set the bit as
>>
>> Is this good for only one process could do test_and_set_skip? is that
>> the 'skip' meaning to be?
> 
> So only one thread really getting to fully use test_and_set_skip is
> good, however the issue is that there is nothing to synchronize the
> testing from the other threads. As a result the other threads could
> have isolated other pages within the pageblock before the thread that
> is calling test_and_set_skip will get to complete the setting of the
> skip bit. This will result in isolation failures for the thread that
> set the skip bit which may be undesirable behavior.
> 
> With the old code the threads were all synchronized on testing the
> first PFN in the pageblock while holding the LRU lock and that is what
> we lost. My concern is the cases where skip_on_failure == true are
> going to fail much more often now as the threads can easily interfere
> with each other.

I have a patch to fix this, which is on 
	https://github.com/alexshi/linux.git lrunext
> 
>>> well. The expectation before was that all threads would encounter this
>>> test and either proceed after setting the bit for the first PFN or
>>> abort after testing the first PFN. With you changes only the first
>>> thread actually runs this test and then it and the others will likely
>>> encounter multiple failures as they are all clearing LRU bits
>>> simultaneously and tripping each other up. That is why the skip bit
>>> must have a test and set done before you even get to the point of
>>> clearing the LRU flag.
>>
>> It make the things warse in my machine, would you like to have a try by yourself?
> 
> I plan to do that. I have already been working on a few things to
> clean up and optimize your patch set further. I will try to submit an
> RFC this evening so we can discuss.
> 

Glad to see your new code soon. Would you like do it base on 
		https://github.com/alexshi/linux.git lrunext

>>>
>>>>> The point I was getting at with the PageCompound check is that instead
>>>>> of needing the LRU lock you should be able to look at PageCompound as
>>>>> soon as you call get_page_unless_zero() and preempt the need to set
>>>>> the LRU bit again. Instead of trying to rely on the LRU lock to
>>>>> guarantee that the page hasn't been merged you could just rely on the
>>>>> fact that you are holding a reference to it so it isn't going to
>>>>> switch between being compound or order 0 since it cannot be freed. It
>>>>> spoils the idea I originally had of combining the logic for
>>>>> get_page_unless_zero and TestClearPageLRU into a single function, but
>>>>> the advantage is you aren't clearing the LRU flag unless you are
>>>>> actually going to pull the page from the LRU list.
>>>>
>>>> Sorry, I still can not follow you here. Compound code part is unchanged
>>>> and follow the original logical. So would you like to pose a new code to
>>>> see if its works?
>>>
>>> No there are significant changes as you reordered all of the
>>> operations. Prior to your change the LRU bit was checked, but not
>>> cleared before testing for PageCompound. Now you are clearing it
>>> before you are testing if it is a compound page. So if compaction is
>>> running we will be seeing the pages in the LRU stay put, but the
>>> compound bit flickering off and on if the compound page is encountered
>>> with the wrong or NULL lruvec. What I was suggesting is that the
>>
>> The lruvec could be wrong or NULL here, that is the base stone of whole
>> patchset.
> 
> Sorry I had a typo in my comment as well as it is the LRU bit that
> will be flickering, not the compound. The goal here is to avoid
> clearing the LRU bit unless we are sure we are going to take the
> lruvec lock and pull the page from the list.
> 
>>> PageCompound test probably doesn't need to be concerned with the lock
>>> after your changes. You could test it after you call
>>> get_page_unless_zero() and before you call
>>> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
>>> protect us from the page switching between compound and not we would
>>> be relying on the fact that we are holding a reference to the page so
>>> it should not be freed and transition between compound or not.
>>>
>>
>> I have tried the patch as your suggested, it has no clear help on performance
>> on above vm-scaliblity case. Maybe it's due to we checked the same thing
>> before lock already.
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index b99c96c4862d..cf2ac5148001 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>                 if (unlikely(!get_page_unless_zero(page)))
>>                         goto isolate_fail;
>>
>> +                       /*
>> +                        * Page become compound since the non-locked check,
>> +                        * and it's on LRU. It can only be a THP so the order
>> +                        * is safe to read and it's 0 for tail pages.
>> +                        */
>> +                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>> +                               low_pfn += compound_nr(page) - 1;
>> +                               goto isolate_fail_put;
>> +                       }
>> +
>>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>>                         goto isolate_fail_put;
>>
>> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>                                         goto isolate_abort;
>>                         }
>>
>> -                       /*
>> -                        * Page become compound since the non-locked check,
>> -                        * and it's on LRU. It can only be a THP so the order
>> -                        * is safe to read and it's 0 for tail pages.
>> -                        */
>> -                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>> -                               low_pfn += compound_nr(page) - 1;
>> -                               SetPageLRU(page);
>> -                               goto isolate_fail_put;
>> -                       }
>>                 } else
>>                         rcu_read_unlock();
>>
> 
> So actually there is more we could do than just this. Specifically a
> few lines below the rcu_read_lock there is yet another PageCompound
> check that sets low_pfn yet again. So in theory we could combine both
> of those and modify the code so you end up with something more like:
> @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
> *cc, unsigned long low_pfn,
>                 if (unlikely(!get_page_unless_zero(page)))
>                         goto isolate_fail;
> 
> +               if (PageCompound(page)) {
> +                       const unsigned int order = compound_order(page);
> +
> +                       if (likely(order < MAX_ORDER))
> +                               low_pfn += (1UL << order) - 1;
> +
> +                       if (unlikely(!cc->alloc_contig))
> +                               goto isolate_fail_put;
> 

The current don't check this unless locked changed. But anyway check it
every page may have no performance impact.

+               }
> +
>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>                         goto isolate_fail_put;
> 
> Doing this you would be more likely to skip over the entire compound
> page in a single jump should you not be able to either take the LRU
> bit or encounter a busy page in __isolate_Lru_page_prepare. I had
> copied this bit from an earlier check and modified it as I was not
> sure I can guarantee that this is a THP since we haven't taken the LRU
> lock yet. However I believe the page cannot be split up while we are
> holding the extra reference so the PageCompound flag and order should
> not change until we call put_page.
> 

It looks like the lock_page protect this instead of get_page that just works
after split func called.

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-13  1:46                     ` Alex Shi
@ 2020-08-13  2:17                       ` Alexander Duyck
  2020-08-13  3:52                         ` Alex Shi
  2020-08-13  4:02                       ` [RFC PATCH 0/3] " Alexander Duyck
  1 sibling, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-13  2:17 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

On Wed, Aug 12, 2020 at 6:47 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/13 上午12:51, Alexander Duyck 写道:
> > On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> 在 2020/8/11 下午10:47, Alexander Duyck 写道:
> >>> On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> 在 2020/8/10 下午10:41, Alexander Duyck 写道:
> >>>>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> 在 2020/8/7 下午10:51, Alexander Duyck 写道:
> >>>>>>> I wonder if this entire section shouldn't be restructured. This is the
> >>>>>>> only spot I can see where we are resetting the LRU flag instead of
> >>>>>>> pulling the page from the LRU list with the lock held. Looking over
> >>>>>>> the code it seems like something like that should be possible. I am
> >>>>>>> not sure the LRU lock is really protecting us in either the
> >>>>>>> PageCompound check nor the skip bits. It seems like holding a
> >>>>>>> reference on the page should prevent it from switching between
> >>>>>>> compound or not, and the skip bits are per pageblock with the LRU bits
> >>>>>>> being per node/memcg which I would think implies that we could have
> >>>>>>> multiple LRU locks that could apply to a single skip bit.
> >>>>>>
> >>>>>> Hi Alexander,
> >>>>>>
> >>>>>> I don't find problem yet on compound or skip bit usage. Would you clarify the
> >>>>>> issue do you concerned?
> >>>>>>
> >>>>>> Thanks!
> >>>>>
> >>>>> The point I was getting at is that the LRU lock is being used to
> >>>>> protect these and with your changes I don't think that makes sense
> >>>>> anymore.
> >>>>>
> >>>>> The skip bits are per-pageblock bits. With your change the LRU lock is
> >>>>> now per memcg first and then per node. As such I do not believe it
> >>>>> really provides any sort of exclusive access to the skip bits. I still
> >>>>> have to look into this more, but it seems like you need a lock per
> >>>>> either section or zone that can be used to protect those bits and deal
> >>>>> with this sooner rather than waiting until you have found an LRU page.
> >>>>> The one part that is confusing though is that the definition of the
> >>>>> skip bits seems to call out that they are a hint since they are not
> >>>>> protected by a lock, but that is exactly what has been happening here.
> >>>>>
> >>>>
> >>>> The skip bits are safe here, since even it race with other skip action,
> >>>> It will still skip out. The skip action is try not to compaction too much,
> >>>> not a exclusive action needs avoid race.
> >>>
> >>> That would be the case if it didn't have the impact that they
> >>> currently do on the compaction process. What I am getting at is that a
> >>> race was introduced when you placed this test between the clearing of
> >>> the LRU flag and the actual pulling of the page from the LRU list. So
> >>> if you tested the skip bits before clearing the LRU flag then I would
> >>> be okay with the code, however because it is triggering an abort after
> >>
> >> Hi Alexander,
> >>
> >> Thanks a lot for comments and suggestions!
> >>
> >> I have tried your suggestion:
> >>
> >> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> >> ---
> >>  mm/compaction.c | 14 +++++++-------
> >>  1 file changed, 7 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index b99c96c4862d..6c881dee8c9a 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >>                         goto isolate_fail_put;
> >>
> >> +               /* Try get exclusive access under lock */
> >> +               if (!skip_updated) {
> >> +                       skip_updated = true;
> >> +                       if (test_and_set_skip(cc, page, low_pfn))
> >> +                               goto isolate_fail_put;
> >> +               }
> >> +
> >>                 /* Try isolate the page */
> >>                 if (!TestClearPageLRU(page))
> >>                         goto isolate_fail_put;
> >
> > I would have made this much sooner. Probably before you call
> > get_page_unless_zero so as to avoid the unnecessary atomic operations.
> >
> >> @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>
> >>                         lruvec_memcg_debug(lruvec, page);
> >>
> >> -                       /* Try get exclusive access under lock */
> >> -                       if (!skip_updated) {
> >> -                               skip_updated = true;
> >> -                               if (test_and_set_skip(cc, page, low_pfn))
> >> -                                       goto isolate_abort;
> >> -                       }
> >> -
> >>                         /*
> >>                          * Page become compound since the non-locked check,
> >>                          * and it's on LRU. It can only be a THP so the order
> >> --
> >>
> >> Performance of case-lru-file-mmap-read in vm-scalibity is dropped a bit. not
> >> helpful
> >
> > So one issue with this change is that it is still too late to be of
> > much benefit. Really you should probably be doing this much sooner,
> > for example somewhere before the get_page_unless_zero(). Also the
> > thing that still has me scratching my head is the "Try get exclusive
> > access under lock" comment. The function declaration says this is
> > supposed to be a hint, but we were using the LRU lock to synchronize
> > it. I'm wondering if we should really be protecting this with the zone
> > lock since we are modifying the pageblock flags which also contain the
> > migration type value for the pageblock and are only modified while
> > holding the zone lock.
>
> zone lock is probability better. you can try and test.

So I spent a good chunk of today looking the code over and what I
realized is that we probably don't even really need to have this code
protected by the zone lock since the LRU bit in the pageblock should
do most of the work for us. In addition we can get rid of the test
portion of this and just make it a set only operation if I am not
mistaken.

> >>> the LRU flag is cleared then you are creating a situation where
> >>> multiple processes will be stomping all over each other as you can
> >>> have each thread essentially take a page via the LRU flag, but only
> >>> one thread will process a page and it could skip over all other pages
> >>> that preemptively had their LRU flag cleared.
> >>
> >> It increase a bit crowd here, but lru_lock do reduce some them, and skip_bit
> >> could stop each other in a array check(bitmap). So compare to whole node
> >> lru_lock, the net profit is clear in patch 17.
> >
> > My concern is that what you can end up with is multiple threads all
> > working over the same pageblock for isolation. With the old code the
> > LRU lock was used to make certain that test_and_set_skip was being
> > synchronized on the first page in the pageblock so you would only have
> > one thread going through and working a single pageblock. However after
> > your changes it doesn't seem like the test_and_set_skip has that
> > protection since only one thread will ever be able to successfully
> > call it for the first page in the pageblock assuming that the LRU flag
> > is set on the first page in the pageblock block.
> >
> >>>
> >>> If you take a look at the test_and_set_skip the function only acts on
> >>> the pageblock aligned PFN for a given range. WIth the changes you have
> >>> in place now that would mean that only one thread would ever actually
> >>> call this function anyway since the first PFN would take the LRU flag
> >>> so no other thread could follow through and test or set the bit as
> >>
> >> Is this good for only one process could do test_and_set_skip? is that
> >> the 'skip' meaning to be?
> >
> > So only one thread really getting to fully use test_and_set_skip is
> > good, however the issue is that there is nothing to synchronize the
> > testing from the other threads. As a result the other threads could
> > have isolated other pages within the pageblock before the thread that
> > is calling test_and_set_skip will get to complete the setting of the
> > skip bit. This will result in isolation failures for the thread that
> > set the skip bit which may be undesirable behavior.
> >
> > With the old code the threads were all synchronized on testing the
> > first PFN in the pageblock while holding the LRU lock and that is what
> > we lost. My concern is the cases where skip_on_failure == true are
> > going to fail much more often now as the threads can easily interfere
> > with each other.
>
> I have a patch to fix this, which is on
>         https://github.com/alexshi/linux.git lrunext

I don't think that patch helps to address anything. You are now
failing to set the bit in the case that something modifies the
pageblock flags while you are attempting to do so. I think it would be
better to just leave the cmpxchg loop as it is.

> >
> >>> well. The expectation before was that all threads would encounter this
> >>> test and either proceed after setting the bit for the first PFN or
> >>> abort after testing the first PFN. With you changes only the first
> >>> thread actually runs this test and then it and the others will likely
> >>> encounter multiple failures as they are all clearing LRU bits
> >>> simultaneously and tripping each other up. That is why the skip bit
> >>> must have a test and set done before you even get to the point of
> >>> clearing the LRU flag.
> >>
> >> It make the things warse in my machine, would you like to have a try by yourself?
> >
> > I plan to do that. I have already been working on a few things to
> > clean up and optimize your patch set further. I will try to submit an
> > RFC this evening so we can discuss.
> >
>
> Glad to see your new code soon. Would you like do it base on
>                 https://github.com/alexshi/linux.git lrunext

I can rebase off of that tree. It may add another half hour or so. I
have barely had any time to test my code. When I enabled some of the
debugging features in the kernel related to using the vm-scalability
tests the boot time became incredibly slow so I may just make certain
I can boot and not mess the system up before submitting my patches as
an RFC. I can probably try testing them more tomorrow.

> >>>
> >>>>> The point I was getting at with the PageCompound check is that instead
> >>>>> of needing the LRU lock you should be able to look at PageCompound as
> >>>>> soon as you call get_page_unless_zero() and preempt the need to set
> >>>>> the LRU bit again. Instead of trying to rely on the LRU lock to
> >>>>> guarantee that the page hasn't been merged you could just rely on the
> >>>>> fact that you are holding a reference to it so it isn't going to
> >>>>> switch between being compound or order 0 since it cannot be freed. It
> >>>>> spoils the idea I originally had of combining the logic for
> >>>>> get_page_unless_zero and TestClearPageLRU into a single function, but
> >>>>> the advantage is you aren't clearing the LRU flag unless you are
> >>>>> actually going to pull the page from the LRU list.
> >>>>
> >>>> Sorry, I still can not follow you here. Compound code part is unchanged
> >>>> and follow the original logical. So would you like to pose a new code to
> >>>> see if its works?
> >>>
> >>> No there are significant changes as you reordered all of the
> >>> operations. Prior to your change the LRU bit was checked, but not
> >>> cleared before testing for PageCompound. Now you are clearing it
> >>> before you are testing if it is a compound page. So if compaction is
> >>> running we will be seeing the pages in the LRU stay put, but the
> >>> compound bit flickering off and on if the compound page is encountered
> >>> with the wrong or NULL lruvec. What I was suggesting is that the
> >>
> >> The lruvec could be wrong or NULL here, that is the base stone of whole
> >> patchset.
> >
> > Sorry I had a typo in my comment as well as it is the LRU bit that
> > will be flickering, not the compound. The goal here is to avoid
> > clearing the LRU bit unless we are sure we are going to take the
> > lruvec lock and pull the page from the list.
> >
> >>> PageCompound test probably doesn't need to be concerned with the lock
> >>> after your changes. You could test it after you call
> >>> get_page_unless_zero() and before you call
> >>> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
> >>> protect us from the page switching between compound and not we would
> >>> be relying on the fact that we are holding a reference to the page so
> >>> it should not be freed and transition between compound or not.
> >>>
> >>
> >> I have tried the patch as your suggested, it has no clear help on performance
> >> on above vm-scaliblity case. Maybe it's due to we checked the same thing
> >> before lock already.
> >>
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index b99c96c4862d..cf2ac5148001 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                 if (unlikely(!get_page_unless_zero(page)))
> >>                         goto isolate_fail;
> >>
> >> +                       /*
> >> +                        * Page become compound since the non-locked check,
> >> +                        * and it's on LRU. It can only be a THP so the order
> >> +                        * is safe to read and it's 0 for tail pages.
> >> +                        */
> >> +                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> >> +                               low_pfn += compound_nr(page) - 1;
> >> +                               goto isolate_fail_put;
> >> +                       }
> >> +
> >>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >>                         goto isolate_fail_put;
> >>
> >> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                                         goto isolate_abort;
> >>                         }
> >>
> >> -                       /*
> >> -                        * Page become compound since the non-locked check,
> >> -                        * and it's on LRU. It can only be a THP so the order
> >> -                        * is safe to read and it's 0 for tail pages.
> >> -                        */
> >> -                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> >> -                               low_pfn += compound_nr(page) - 1;
> >> -                               SetPageLRU(page);
> >> -                               goto isolate_fail_put;
> >> -                       }
> >>                 } else
> >>                         rcu_read_unlock();
> >>
> >
> > So actually there is more we could do than just this. Specifically a
> > few lines below the rcu_read_lock there is yet another PageCompound
> > check that sets low_pfn yet again. So in theory we could combine both
> > of those and modify the code so you end up with something more like:
> > @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
> > *cc, unsigned long low_pfn,
> >                 if (unlikely(!get_page_unless_zero(page)))
> >                         goto isolate_fail;
> >
> > +               if (PageCompound(page)) {
> > +                       const unsigned int order = compound_order(page);
> > +
> > +                       if (likely(order < MAX_ORDER))
> > +                               low_pfn += (1UL << order) - 1;
> > +
> > +                       if (unlikely(!cc->alloc_contig))
> > +                               goto isolate_fail_put;
> >
>
> The current don't check this unless locked changed. But anyway check it
> every page may have no performance impact.

Yes and no. The same code is also ran outside the lock and that is why
I suggested merging the two and creating this block of logic. It will
be clearer once I have done some initial smoke testing and submitted
my patch.

> +               }
> > +
> >                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >                         goto isolate_fail_put;
> >
> > Doing this you would be more likely to skip over the entire compound
> > page in a single jump should you not be able to either take the LRU
> > bit or encounter a busy page in __isolate_Lru_page_prepare. I had
> > copied this bit from an earlier check and modified it as I was not
> > sure I can guarantee that this is a THP since we haven't taken the LRU
> > lock yet. However I believe the page cannot be split up while we are
> > holding the extra reference so the PageCompound flag and order should
> > not change until we call put_page.
> >
>
> It looks like the lock_page protect this instead of get_page that just works
> after split func called.

So I thought that the call to page_ref_freeze that is used in
functions like split_huge_page_to_list is meant to address this case.
What it is essentially doing is setting the reference count to zero if
the count is at the expected value. So with the get_page_unless_zero
it would either fail because the value is already zero, or the
page_ref_freeze would fail because the count would be one higher than
the expected value. Either that or I am still missing another piece in
the understanding of this.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-13  2:17                       ` Alexander Duyck
@ 2020-08-13  3:52                         ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-13  3:52 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen



在 2020/8/13 上午10:17, Alexander Duyck 写道:
>> zone lock is probability better. you can try and test.
> So I spent a good chunk of today looking the code over and what I
> realized is that we probably don't even really need to have this code
> protected by the zone lock since the LRU bit in the pageblock should
> do most of the work for us. In addition we can get rid of the test
> portion of this and just make it a set only operation if I am not
> mistaken.
> 
>>>>> the LRU flag is cleared then you are creating a situation where
>>>>> multiple processes will be stomping all over each other as you can
>>>>> have each thread essentially take a page via the LRU flag, but only
>>>>> one thread will process a page and it could skip over all other pages
>>>>> that preemptively had their LRU flag cleared.
>>>> It increase a bit crowd here, but lru_lock do reduce some them, and skip_bit
>>>> could stop each other in a array check(bitmap). So compare to whole node
>>>> lru_lock, the net profit is clear in patch 17.
>>> My concern is that what you can end up with is multiple threads all
>>> working over the same pageblock for isolation. With the old code the
>>> LRU lock was used to make certain that test_and_set_skip was being
>>> synchronized on the first page in the pageblock so you would only have
>>> one thread going through and working a single pageblock. However after
>>> your changes it doesn't seem like the test_and_set_skip has that
>>> protection since only one thread will ever be able to successfully
>>> call it for the first page in the pageblock assuming that the LRU flag
>>> is set on the first page in the pageblock block.
>>>
>>>>> If you take a look at the test_and_set_skip the function only acts on
>>>>> the pageblock aligned PFN for a given range. WIth the changes you have
>>>>> in place now that would mean that only one thread would ever actually
>>>>> call this function anyway since the first PFN would take the LRU flag
>>>>> so no other thread could follow through and test or set the bit as
>>>> Is this good for only one process could do test_and_set_skip? is that
>>>> the 'skip' meaning to be?
>>> So only one thread really getting to fully use test_and_set_skip is
>>> good, however the issue is that there is nothing to synchronize the
>>> testing from the other threads. As a result the other threads could
>>> have isolated other pages within the pageblock before the thread that
>>> is calling test_and_set_skip will get to complete the setting of the
>>> skip bit. This will result in isolation failures for the thread that
>>> set the skip bit which may be undesirable behavior.
>>>
>>> With the old code the threads were all synchronized on testing the
>>> first PFN in the pageblock while holding the LRU lock and that is what
>>> we lost. My concern is the cases where skip_on_failure == true are
>>> going to fail much more often now as the threads can easily interfere
>>> with each other.
>> I have a patch to fix this, which is on
>>         https://github.com/alexshi/linux.git lrunext
> I don't think that patch helps to address anything. You are now
> failing to set the bit in the case that something modifies the
> pageblock flags while you are attempting to do so. I think it would be
> better to just leave the cmpxchg loop as it is.

It do increae the case-lru-file-mmap-read in vm-scalibity about 3% performance.
Yes, I am glad to see it can be make better.


> 
>>>>> well. The expectation before was that all threads would encounter this
>>>>> test and either proceed after setting the bit for the first PFN or
>>>>> abort after testing the first PFN. With you changes only the first
>>>>> thread actually runs this test and then it and the others will likely
>>>>> encounter multiple failures as they are all clearing LRU bits
>>>>> simultaneously and tripping each other up. That is why the skip bit
>>>>> must have a test and set done before you even get to the point of
>>>>> clearing the LRU flag.
>>>> It make the things warse in my machine, would you like to have a try by yourself?
>>> I plan to do that. I have already been working on a few things to
>>> clean up and optimize your patch set further. I will try to submit an
>>> RFC this evening so we can discuss.
>>>
>> Glad to see your new code soon. Would you like do it base on
>>                 https://github.com/alexshi/linux.git lrunext
> I can rebase off of that tree. It may add another half hour or so. I
> have barely had any time to test my code. When I enabled some of the
> debugging features in the kernel related to using the vm-scalability
> tests the boot time became incredibly slow so I may just make certain
> I can boot and not mess the system up before submitting my patches as
> an RFC. I can probably try testing them more tomorrow.
> 
>>>>>>> The point I was getting at with the PageCompound check is that instead
>>>>>>> of needing the LRU lock you should be able to look at PageCompound as
>>>>>>> soon as you call get_page_unless_zero() and preempt the need to set
>>>>>>> the LRU bit again. Instead of trying to rely on the LRU lock to
>>>>>>> guarantee that the page hasn't been merged you could just rely on the
>>>>>>> fact that you are holding a reference to it so it isn't going to
>>>>>>> switch between being compound or order 0 since it cannot be freed. It
>>>>>>> spoils the idea I originally had of combining the logic for
>>>>>>> get_page_unless_zero and TestClearPageLRU into a single function, but
>>>>>>> the advantage is you aren't clearing the LRU flag unless you are
>>>>>>> actually going to pull the page from the LRU list.
>>>>>> Sorry, I still can not follow you here. Compound code part is unchanged
>>>>>> and follow the original logical. So would you like to pose a new code to
>>>>>> see if its works?
>>>>> No there are significant changes as you reordered all of the
>>>>> operations. Prior to your change the LRU bit was checked, but not
>>>>> cleared before testing for PageCompound. Now you are clearing it
>>>>> before you are testing if it is a compound page. So if compaction is
>>>>> running we will be seeing the pages in the LRU stay put, but the
>>>>> compound bit flickering off and on if the compound page is encountered
>>>>> with the wrong or NULL lruvec. What I was suggesting is that the
>>>> The lruvec could be wrong or NULL here, that is the base stone of whole
>>>> patchset.
>>> Sorry I had a typo in my comment as well as it is the LRU bit that
>>> will be flickering, not the compound. The goal here is to avoid
>>> clearing the LRU bit unless we are sure we are going to take the
>>> lruvec lock and pull the page from the list.
>>>
>>>>> PageCompound test probably doesn't need to be concerned with the lock
>>>>> after your changes. You could test it after you call
>>>>> get_page_unless_zero() and before you call
>>>>> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
>>>>> protect us from the page switching between compound and not we would
>>>>> be relying on the fact that we are holding a reference to the page so
>>>>> it should not be freed and transition between compound or not.
>>>>>
>>>> I have tried the patch as your suggested, it has no clear help on performance
>>>> on above vm-scaliblity case. Maybe it's due to we checked the same thing
>>>> before lock already.
>>>>
>>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>>> index b99c96c4862d..cf2ac5148001 100644
>>>> --- a/mm/compaction.c
>>>> +++ b/mm/compaction.c
>>>> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>>>                 if (unlikely(!get_page_unless_zero(page)))
>>>>                         goto isolate_fail;
>>>>
>>>> +                       /*
>>>> +                        * Page become compound since the non-locked check,
>>>> +                        * and it's on LRU. It can only be a THP so the order
>>>> +                        * is safe to read and it's 0 for tail pages.
>>>> +                        */
>>>> +                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>>>> +                               low_pfn += compound_nr(page) - 1;
>>>> +                               goto isolate_fail_put;
>>>> +                       }
>>>> +
>>>>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>>>>                         goto isolate_fail_put;
>>>>
>>>> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>>>                                         goto isolate_abort;
>>>>                         }
>>>>
>>>> -                       /*
>>>> -                        * Page become compound since the non-locked check,
>>>> -                        * and it's on LRU. It can only be a THP so the order
>>>> -                        * is safe to read and it's 0 for tail pages.
>>>> -                        */
>>>> -                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>>>> -                               low_pfn += compound_nr(page) - 1;
>>>> -                               SetPageLRU(page);
>>>> -                               goto isolate_fail_put;
>>>> -                       }
>>>>                 } else
>>>>                         rcu_read_unlock();
>>>>
>>> So actually there is more we could do than just this. Specifically a
>>> few lines below the rcu_read_lock there is yet another PageCompound
>>> check that sets low_pfn yet again. So in theory we could combine both
>>> of those and modify the code so you end up with something more like:
>>> @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
>>> *cc, unsigned long low_pfn,
>>>                 if (unlikely(!get_page_unless_zero(page)))
>>>                         goto isolate_fail;
>>>
>>> +               if (PageCompound(page)) {
>>> +                       const unsigned int order = compound_order(page);
>>> +
>>> +                       if (likely(order < MAX_ORDER))
>>> +                               low_pfn += (1UL << order) - 1;
>>> +
>>> +                       if (unlikely(!cc->alloc_contig))
>>> +                               goto isolate_fail_put;
>>>
>> The current don't check this unless locked changed. But anyway check it
>> every page may have no performance impact.
> Yes and no. The same code is also ran outside the lock and that is why
> I suggested merging the two and creating this block of logic. It will
> be clearer once I have done some initial smoke testing and submitted
> my patch.
> 
>> +               }
>>> +
>>>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>>>                         goto isolate_fail_put;
>>>
>>> Doing this you would be more likely to skip over the entire compound
>>> page in a single jump should you not be able to either take the LRU
>>> bit or encounter a busy page in __isolate_Lru_page_prepare. I had
>>> copied this bit from an earlier check and modified it as I was not
>>> sure I can guarantee that this is a THP since we haven't taken the LRU
>>> lock yet. However I believe the page cannot be split up while we are
>>> holding the extra reference so the PageCompound flag and order should
>>> not change until we call put_page.
>>>
>> It looks like the lock_page protect this instead of get_page that just works
>> after split func called.
> So I thought that the call to page_ref_freeze that is used in
> functions like split_huge_page_to_list is meant to address this case.
> What it is essentially doing is setting the reference count to zero if
> the count is at the expected value. So with the get_page_unless_zero
> it would either fail because the value is already zero, or the
> page_ref_freeze would fail because the count would be one higher than
> the expected value. Either that or I am still missing another piece in
> the understanding of this.

Uh, the front xa_lock or anon_vma lock guard the -refcount, so long locking path...

Thanks
Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [RFC PATCH 0/3] Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-08-13  1:46                     ` Alex Shi
  2020-08-13  2:17                       ` Alexander Duyck
@ 2020-08-13  4:02                       ` Alexander Duyck
  2020-08-13  4:02                         ` [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block Alexander Duyck
                                           ` (2 more replies)
  1 sibling, 3 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-13  4:02 UTC (permalink / raw)
  To: alex.shi
  Cc: yang.shi, lkp, rong.a.chen, khlebnikov, kirill, hughd,
	linux-kernel, alexander.duyck, daniel.m.jordan, linux-mm,
	shakeelb, willy, hannes, tj, cgroups, akpm, richard.weiyang,
	mgorman, iamjoonsoo.kim

Here are the patches I had discussed earlier to address the issues in
isolate_migratepages_block.

They are based on the tree at:
 https://github.com/alexshi/linux.git lrunext 

The first patch is mostly cleanup to address the RCU locking in the
function. The second addresses the test_and_set_skip issue, and the third
relocates PageCompound.

I did some digging into the history of the skip bits and since they are
only supposed to be a hint I thought we could probably just drop the
testing portion of the call since the LRU flag is preventing more than one
thread from accessing the function anyway so it would make sense to just
switch it to a set operation similar to what happens when low_pfn ==
end_pfn at the end of the call.

I have only had a chance to build test these since rebasing on the tree. In
addition I am not 100% certain the PageCompound changes are correct as they
operate on the assumption that get_page_unless_zero is enough to keep a
compound page from being split up. I plan on doing some testing tomorrow,
but thought I would push these out now so that we could discuss them.

---

Alexander Duyck (3):
      mm: Drop locked from isolate_migratepages_block
      mm: Drop use of test_and_set_skip in favor of just setting skip
      mm: Identify compound pages sooner in isolate_migratepages_block


 mm/compaction.c |  126 +++++++++++++++++++------------------------------------
 1 file changed, 44 insertions(+), 82 deletions(-)

--


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block
  2020-08-13  4:02                       ` [RFC PATCH 0/3] " Alexander Duyck
@ 2020-08-13  4:02                         ` Alexander Duyck
  2020-08-13  6:56                           ` Alex Shi
  2020-08-13  7:44                           ` Alex Shi
  2020-08-13  4:02                         ` [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip Alexander Duyck
  2020-08-13  4:02                         ` [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block Alexander Duyck
  2 siblings, 2 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-13  4:02 UTC (permalink / raw)
  To: alex.shi
  Cc: yang.shi, lkp, rong.a.chen, khlebnikov, kirill, hughd,
	linux-kernel, alexander.duyck, daniel.m.jordan, linux-mm,
	shakeelb, willy, hannes, tj, cgroups, akpm, richard.weiyang,
	mgorman, iamjoonsoo.kim

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

We can drop the need for the locked variable by making use of the
lruvec_holds_page_lru_lock function. By doing this we can avoid some rcu
locking ugliness for the case where the lruvec is still holding the LRU
lock associated with the page. Instead we can just use the lruvec and if it
is NULL we assume the lock was released.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/compaction.c |   45 ++++++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b99c96c4862d..5021a18ef722 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -803,9 +803,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 {
 	pg_data_t *pgdat = cc->zone->zone_pgdat;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
-	struct lruvec *locked = NULL;
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long start_pfn = low_pfn;
 	bool skip_on_failure = false;
@@ -866,9 +865,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * a fatal signal is pending.
 		 */
 		if (!(low_pfn % SWAP_CLUSTER_MAX)) {
-			if (locked) {
-				unlock_page_lruvec_irqrestore(locked, flags);
-				locked = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 
 			if (fatal_signal_pending(current)) {
@@ -949,9 +948,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(__PageMovable(page)) &&
 					!PageIsolated(page)) {
-				if (locked) {
-					unlock_page_lruvec_irqrestore(locked, flags);
-					locked = NULL;
+				if (lruvec) {
+					unlock_page_lruvec_irqrestore(lruvec, flags);
+					lruvec = NULL;
 				}
 
 				if (!isolate_movable_page(page, isolate_mode))
@@ -992,16 +991,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
-		rcu_read_lock();
-		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-
 		/* If we already hold the lock, we can skip some rechecking */
-		if (lruvec != locked) {
-			if (locked)
-				unlock_page_lruvec_irqrestore(locked, flags);
+		if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) {
+			if (lruvec)
+				unlock_page_lruvec_irqrestore(lruvec, flags);
 
+			lruvec = mem_cgroup_page_lruvec(page, pgdat);
 			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
-			locked = lruvec;
 			rcu_read_unlock();
 
 			lruvec_memcg_debug(lruvec, page);
@@ -1023,8 +1019,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
 				SetPageLRU(page);
 				goto isolate_fail_put;
 			}
-		} else
-			rcu_read_unlock();
+		}
 
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
@@ -1057,9 +1052,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 isolate_fail_put:
 		/* Avoid potential deadlock in freeing page under lru_lock */
-		if (locked) {
-			unlock_page_lruvec_irqrestore(locked, flags);
-			locked = NULL;
+		if (lruvec) {
+			unlock_page_lruvec_irqrestore(lruvec, flags);
+			lruvec = NULL;
 		}
 		put_page(page);
 
@@ -1073,9 +1068,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		 * page anyway.
 		 */
 		if (nr_isolated) {
-			if (locked) {
-				unlock_page_lruvec_irqrestore(locked, flags);
-				locked = NULL;
+			if (lruvec) {
+				unlock_page_lruvec_irqrestore(lruvec, flags);
+				lruvec = NULL;
 			}
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
@@ -1102,8 +1097,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	page = NULL;
 
 isolate_abort:
-	if (locked)
-		unlock_page_lruvec_irqrestore(locked, flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
 	if (page) {
 		SetPageLRU(page);
 		put_page(page);



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip
  2020-08-13  4:02                       ` [RFC PATCH 0/3] " Alexander Duyck
  2020-08-13  4:02                         ` [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block Alexander Duyck
@ 2020-08-13  4:02                         ` Alexander Duyck
  2020-08-14  7:19                           ` Alex Shi
  2020-08-18  6:50                           ` Alex Shi
  2020-08-13  4:02                         ` [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block Alexander Duyck
  2 siblings, 2 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-13  4:02 UTC (permalink / raw)
  To: alex.shi
  Cc: yang.shi, lkp, rong.a.chen, khlebnikov, kirill, hughd,
	linux-kernel, alexander.duyck, daniel.m.jordan, linux-mm,
	shakeelb, willy, hannes, tj, cgroups, akpm, richard.weiyang,
	mgorman, iamjoonsoo.kim

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

The only user of test_and_set_skip was isolate_migratepages_block and it
was using it after a call that was testing and clearing the LRU flag. As
such it really didn't need to be behind the LRU lock anymore as it wasn't
really fulfilling its purpose.

With that being the case we can simply drop the bit and instead directly
just call the set_pageblock_skip function if the page we are working on is
the valid_page at the start of the pageblock. It shouldn't be possible for
us to encounter the bit being set since we obtained the LRU flag for the
first page in the pageblock which means we would have exclusive access to
setting the skip bit. As such we don't need to worry about the abort case
since no other thread will be able to call what used to be
test_and_set_skip.

Since we have dropped the late abort case we can drop the code that was
clearing the LRU flag and calling page_put since the abort case will now
not be holding a reference to a page.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/compaction.c |   50 +++++++-------------------------------------------
 1 file changed, 7 insertions(+), 43 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 5021a18ef722..c1e9918f9dd4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -399,29 +399,6 @@ void reset_isolation_suitable(pg_data_t *pgdat)
 	}
 }
 
-/*
- * Sets the pageblock skip bit if it was clear. Note that this is a hint as
- * locks are not required for read/writers. Returns true if it was already set.
- */
-static bool test_and_set_skip(struct compact_control *cc, struct page *page,
-							unsigned long pfn)
-{
-	bool skip;
-
-	/* Do no update if skip hint is being ignored */
-	if (cc->ignore_skip_hint)
-		return false;
-
-	if (!IS_ALIGNED(pfn, pageblock_nr_pages))
-		return false;
-
-	skip = get_pageblock_skip(page);
-	if (!skip && !cc->no_set_skip_hint)
-		skip = !set_pageblock_skip(page);
-
-	return skip;
-}
-
 static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
 {
 	struct zone *zone = cc->zone;
@@ -480,12 +457,6 @@ static inline void update_pageblock_skip(struct compact_control *cc,
 static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
 {
 }
-
-static bool test_and_set_skip(struct compact_control *cc, struct page *page,
-							unsigned long pfn)
-{
-	return false;
-}
 #endif /* CONFIG_COMPACTION */
 
 /*
@@ -895,7 +866,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
 			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
 				low_pfn = end_pfn;
-				page = NULL;
 				goto isolate_abort;
 			}
 			valid_page = page;
@@ -991,6 +961,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
 
+		/* Indicate that we want exclusive access to this pageblock */
+		if (page == valid_page) {
+			skip_updated = true;
+			if (!cc->ignore_skip_hint)
+				set_pageblock_skip(page);
+		}
+
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) {
 			if (lruvec)
@@ -1002,13 +979,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 
 			lruvec_memcg_debug(lruvec, page);
 
-			/* Try get exclusive access under lock */
-			if (!skip_updated) {
-				skip_updated = true;
-				if (test_and_set_skip(cc, page, low_pfn))
-					goto isolate_abort;
-			}
-
 			/*
 			 * Page become compound since the non-locked check,
 			 * and it's on LRU. It can only be a THP so the order
@@ -1094,15 +1064,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
 	if (unlikely(low_pfn > end_pfn))
 		low_pfn = end_pfn;
 
-	page = NULL;
-
 isolate_abort:
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
-	if (page) {
-		SetPageLRU(page);
-		put_page(page);
-	}
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned



^ permalink raw reply	[flat|nested] 102+ messages in thread

* [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block
  2020-08-13  4:02                       ` [RFC PATCH 0/3] " Alexander Duyck
  2020-08-13  4:02                         ` [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block Alexander Duyck
  2020-08-13  4:02                         ` [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip Alexander Duyck
@ 2020-08-13  4:02                         ` Alexander Duyck
  2020-08-14  7:20                           ` Alex Shi
  2 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-13  4:02 UTC (permalink / raw)
  To: alex.shi
  Cc: yang.shi, lkp, rong.a.chen, khlebnikov, kirill, hughd,
	linux-kernel, alexander.duyck, daniel.m.jordan, linux-mm,
	shakeelb, willy, hannes, tj, cgroups, akpm, richard.weiyang,
	mgorman, iamjoonsoo.kim

From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Since we are holding a reference to the page much sooner in
isolate_migratepages_block we could move the PageCompound check out of the
LRU locked section and instead just place it after get_page_unless_zero. By
doing this we can allow any of the items that might trigger a failure to
trigger a failure for the compound page rather than the order 0 page and as
a result we should be able to process the pageblock faster.

In addition by testing for PageCompound sooner we can avoid having the LRU
flag cleared and then reset in the exception case. As a result this should
prevent any possible races where another thread might be attempting to pull
the LRU pages from the list.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 mm/compaction.c |   33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c1e9918f9dd4..3803f129fd6a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -954,6 +954,24 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (unlikely(!get_page_unless_zero(page)))
 			goto isolate_fail;
 
+		/*
+		 * Page is compound. We know the order before we know if it is
+		 * on the LRU so we cannot assume it is THP. However since the
+		 * page will have the LRU validated shortly we can use the value
+		 * to skip over this page for now or validate the LRU is set and
+		 * then isolate the entire compound page if we are isolating to
+		 * generate a CMA page.
+		 */
+		if (PageCompound(page)) {
+			const unsigned int order = compound_order(page);
+
+			if (likely(order < MAX_ORDER))
+				low_pfn += (1UL << order) - 1;
+
+			if (!cc->alloc_contig)
+				goto isolate_fail_put;
+		}
+
 		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
 			goto isolate_fail_put;
 
@@ -978,23 +996,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
 			rcu_read_unlock();
 
 			lruvec_memcg_debug(lruvec, page);
-
-			/*
-			 * Page become compound since the non-locked check,
-			 * and it's on LRU. It can only be a THP so the order
-			 * is safe to read and it's 0 for tail pages.
-			 */
-			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
-				low_pfn += compound_nr(page) - 1;
-				SetPageLRU(page);
-				goto isolate_fail_put;
-			}
 		}
 
-		/* The whole page is taken off the LRU; skip the tail pages. */
-		if (PageCompound(page))
-			low_pfn += compound_nr(page) - 1;
-
 		/* Successfully isolated */
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		mod_node_page_state(page_pgdat(page),



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block
  2020-08-13  4:02                         ` [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block Alexander Duyck
@ 2020-08-13  6:56                           ` Alex Shi
  2020-08-13 14:32                             ` Alexander Duyck
  2020-08-13  7:44                           ` Alex Shi
  1 sibling, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-13  6:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: yang.shi, lkp, rong.a.chen, khlebnikov, kirill, hughd,
	linux-kernel, daniel.m.jordan, linux-mm, shakeelb, willy, hannes,
	tj, cgroups, akpm, richard.weiyang, mgorman, iamjoonsoo.kim



在 2020/8/13 下午12:02, Alexander Duyck 写道:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> We can drop the need for the locked variable by making use of the
> lruvec_holds_page_lru_lock function. By doing this we can avoid some rcu
> locking ugliness for the case where the lruvec is still holding the LRU
> lock associated with the page. Instead we can just use the lruvec and if it
> is NULL we assume the lock was released.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  mm/compaction.c |   45 ++++++++++++++++++++-------------------------
>  1 file changed, 20 insertions(+), 25 deletions(-)

Thanks a lot!
Don't know if community is ok if we keep the patch following whole patchset alone?



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block
  2020-08-13  4:02                         ` [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block Alexander Duyck
  2020-08-13  6:56                           ` Alex Shi
@ 2020-08-13  7:44                           ` Alex Shi
  2020-08-13 14:26                             ` Alexander Duyck
  1 sibling, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-13  7:44 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: yang.shi, lkp, rong.a.chen, khlebnikov, kirill, hughd,
	linux-kernel, daniel.m.jordan, linux-mm, shakeelb, willy, hannes,
	tj, cgroups, akpm, richard.weiyang, mgorman, iamjoonsoo.kim



在 2020/8/13 下午12:02, Alexander Duyck 写道:
> -		rcu_read_lock();
> -		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> -
>  		/* If we already hold the lock, we can skip some rechecking */
> -		if (lruvec != locked) {
> -			if (locked)
> -				unlock_page_lruvec_irqrestore(locked, flags);
> +		if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) {

Ops, lruvec_holds_page_lru_lock need rcu_read_lock. 

> +			if (lruvec)
> +				unlock_page_lruvec_irqrestore(lruvec, flags);
>  
> +			lruvec = mem_cgroup_page_lruvec(page, pgdat);
>  			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> -			locked = lruvec;
>  			rcu_read_unlock();
>  

and some bugs:
[  534.564741] CPU: 23 PID: 545 Comm: kcompactd1 Kdump: loaded Tainted: G S      W         5.8.0-next-20200803-00028-g9a7ff2cd6e5c #85
[  534.577320] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 1.0.PL.IP.P.027.02 05/29/2020
[  534.587693] Call Trace:
[  534.590522]  dump_stack+0x96/0xd0
[  534.594231]  ___might_sleep.cold.90+0xff/0x115
[  534.599102]  kcompactd+0x24b/0x370
[  534.602904]  ? finish_wait+0x80/0x80
[  534.606897]  ? kcompactd_do_work+0x3d0/0x3d0
[  534.611566]  kthread+0x14e/0x170
[  534.615182]  ? kthread_park+0x80/0x80
[  534.619252]  ret_from_fork+0x1f/0x30
[  535.629483] BUG: sleeping function called from invalid context at include/linux/freezer.h:57
[  535.638691] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 545, name: kcompactd1
[  535.647601] INFO: lockdep is turned off.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block
  2020-08-13  7:44                           ` Alex Shi
@ 2020-08-13 14:26                             ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-13 14:26 UTC (permalink / raw)
  To: Alex Shi
  Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov,
	Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm,
	Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo,
	cgroups, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim

On Thu, Aug 13, 2020 at 12:45 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/13 下午12:02, Alexander Duyck 写道:
> > -             rcu_read_lock();
> > -             lruvec = mem_cgroup_page_lruvec(page, pgdat);
> > -
> >               /* If we already hold the lock, we can skip some rechecking */
> > -             if (lruvec != locked) {
> > -                     if (locked)
> > -                             unlock_page_lruvec_irqrestore(locked, flags);
> > +             if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) {
>
> Ops, lruvec_holds_page_lru_lock need rcu_read_lock.

How so? The reason I wrote lruvec_holds_page_lru_lock the way I did is
that it is simply comparing the pointers held by the page and the
lruvec. It is never actually accessing any of the values, just the
pointers. As such we should be able to compare the two since the
lruvec is still locked and the the memcg and pgdat held by the lruvec
should not be changed. Likewise with the page pointers assuming the
values match.

> > +                     if (lruvec)
> > +                             unlock_page_lruvec_irqrestore(lruvec, flags);
> >
> > +                     lruvec = mem_cgroup_page_lruvec(page, pgdat);
> >                       compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> > -                     locked = lruvec;
> >                       rcu_read_unlock();
> >
>
> and some bugs:
> [  534.564741] CPU: 23 PID: 545 Comm: kcompactd1 Kdump: loaded Tainted: G S      W         5.8.0-next-20200803-00028-g9a7ff2cd6e5c #85
> [  534.577320] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 1.0.PL.IP.P.027.02 05/29/2020
> [  534.587693] Call Trace:
> [  534.590522]  dump_stack+0x96/0xd0
> [  534.594231]  ___might_sleep.cold.90+0xff/0x115
> [  534.599102]  kcompactd+0x24b/0x370
> [  534.602904]  ? finish_wait+0x80/0x80
> [  534.606897]  ? kcompactd_do_work+0x3d0/0x3d0
> [  534.611566]  kthread+0x14e/0x170
> [  534.615182]  ? kthread_park+0x80/0x80
> [  534.619252]  ret_from_fork+0x1f/0x30
> [  535.629483] BUG: sleeping function called from invalid context at include/linux/freezer.h:57
> [  535.638691] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 545, name: kcompactd1
> [  535.647601] INFO: lockdep is turned off.

Ah, I see the bug now. It isn't the lruvec_holds_page_lru_lock that
needs the LRU lock. This is an issue as a part of a merge conflict.
There should have been an rcu_read_lock added before
mem_cgroup_page_lruvec.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block
  2020-08-13  6:56                           ` Alex Shi
@ 2020-08-13 14:32                             ` Alexander Duyck
  2020-08-14  7:25                               ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-13 14:32 UTC (permalink / raw)
  To: Alex Shi
  Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov,
	Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm,
	Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo,
	cgroups, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim

On Wed, Aug 12, 2020 at 11:57 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/13 下午12:02, Alexander Duyck 写道:
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >
> > We can drop the need for the locked variable by making use of the
> > lruvec_holds_page_lru_lock function. By doing this we can avoid some rcu
> > locking ugliness for the case where the lruvec is still holding the LRU
> > lock associated with the page. Instead we can just use the lruvec and if it
> > is NULL we assume the lock was released.
> >
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > ---
> >  mm/compaction.c |   45 ++++++++++++++++++++-------------------------
> >  1 file changed, 20 insertions(+), 25 deletions(-)
>
> Thanks a lot!
> Don't know if community is ok if we keep the patch following whole patchset alone?

I am fine with you squashing it with another patch if you want. In
theory this could probably be squashed in with the earlier patch I
submitted that introduced lruvec_holds_page_lru_lock or some other
patch. It is mostly just a cleanup anyway as it gets us away from
needing to hold the RCU read lock in the case that we already have the
correct lruvec.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip
  2020-08-13  4:02                         ` [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip Alexander Duyck
@ 2020-08-14  7:19                           ` Alex Shi
  2020-08-14 14:24                             ` Alexander Duyck
  2020-08-18  6:50                           ` Alex Shi
  1 sibling, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-14  7:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: yang.shi, lkp, rong.a.chen, khlebnikov, kirill, hughd,
	linux-kernel, daniel.m.jordan, linux-mm, shakeelb, willy, hannes,
	tj, cgroups, akpm, richard.weiyang, mgorman, iamjoonsoo.kim



在 2020/8/13 下午12:02, Alexander Duyck 写道:
> 
> Since we have dropped the late abort case we can drop the code that was
> clearing the LRU flag and calling page_put since the abort case will now
> not be holding a reference to a page.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

seems the case-lru-file-mmap-read case drop about 3% on this patch in a rough testing.
on my 80 core machine.

Thanks
Alex



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block
  2020-08-13  4:02                         ` [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block Alexander Duyck
@ 2020-08-14  7:20                           ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-14  7:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: yang.shi, lkp, rong.a.chen, khlebnikov, kirill, hughd,
	linux-kernel, daniel.m.jordan, linux-mm, shakeelb, willy, hannes,
	tj, cgroups, akpm, richard.weiyang, mgorman, iamjoonsoo.kim

It has a slight performance drop too...

Thanks
Alex

在 2020/8/13 下午12:02, Alexander Duyck 写道:
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  mm/compaction.c |   33 ++++++++++++++++++---------------
>  1 file changed, 18 insertions(+), 15 deletions(-)


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block
  2020-08-13 14:32                             ` Alexander Duyck
@ 2020-08-14  7:25                               ` Alex Shi
  0 siblings, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-14  7:25 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov,
	Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm,
	Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo,
	cgroups, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim



在 2020/8/13 下午10:32, Alexander Duyck 写道:
>> Thanks a lot!
>> Don't know if community is ok if we keep the patch following whole patchset alone?
> I am fine with you squashing it with another patch if you want. In
> theory this could probably be squashed in with the earlier patch I
> submitted that introduced lruvec_holds_page_lru_lock or some other
> patch. It is mostly just a cleanup anyway as it gets us away from
> needing to hold the RCU read lock in the case that we already have the
> correct lruvec.

Hi Alexander,

Thanks a lot! look like it's better to fold it in patch 17.

Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip
  2020-08-14  7:19                           ` Alex Shi
@ 2020-08-14 14:24                             ` Alexander Duyck
  2020-08-14 21:15                               ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-14 14:24 UTC (permalink / raw)
  To: Alex Shi
  Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov,
	Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm,
	Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo,
	cgroups, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim

On Fri, Aug 14, 2020 at 12:19 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/13 下午12:02, Alexander Duyck 写道:
> >
> > Since we have dropped the late abort case we can drop the code that was
> > clearing the LRU flag and calling page_put since the abort case will now
> > not be holding a reference to a page.
> >
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>
> seems the case-lru-file-mmap-read case drop about 3% on this patch in a rough testing.
> on my 80 core machine.

I'm not sure how it could have that much impact on the performance
since the total effect would just be dropping what should be a
redundant test since we tested the skip bit before we took the LRU
bit, so we shouldn't need to test it again after.

I finally got my test setup working last night. I'll have to do some
testing in my environment and I can start trying to see what is going
on.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip
  2020-08-14 14:24                             ` Alexander Duyck
@ 2020-08-14 21:15                               ` Alexander Duyck
  2020-08-15  9:49                                 ` Alex Shi
  0 siblings, 1 reply; 102+ messages in thread
From: Alexander Duyck @ 2020-08-14 21:15 UTC (permalink / raw)
  To: Alex Shi
  Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov,
	Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm,
	Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo,
	cgroups, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim

On Fri, Aug 14, 2020 at 7:24 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Fri, Aug 14, 2020 at 12:19 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >
> >
> >
> > 在 2020/8/13 下午12:02, Alexander Duyck 写道:
> > >
> > > Since we have dropped the late abort case we can drop the code that was
> > > clearing the LRU flag and calling page_put since the abort case will now
> > > not be holding a reference to a page.
> > >
> > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >
> > seems the case-lru-file-mmap-read case drop about 3% on this patch in a rough testing.
> > on my 80 core machine.
>
> I'm not sure how it could have that much impact on the performance
> since the total effect would just be dropping what should be a
> redundant test since we tested the skip bit before we took the LRU
> bit, so we shouldn't need to test it again after.
>
> I finally got my test setup working last night. I'll have to do some
> testing in my environment and I can start trying to see what is going
> on.

So I ran the case-lru-file-mmap-read a few times and I don't see how
it is supposed to be testing the compaction code. It doesn't seem like
compaction is running at least on my system as a result of the test
script. I wonder if testing this code wouldn't be better done using
something like thpscale from the
mmtests(https://github.com/gormanm/mmtests)? It seems past changes to
the compaction code were tested using that, and the config script for
the test explains that it is designed specifically to stress the
compaction code. I have the test up and running now and hope to
collect results over the weekend.

There is one change I will probably make to this patch and that is to
place the new code that is setting skip_updated where the old code was
calling test_and_set_skip_bit. By doing that we can avoid extra checks
and it should help to reduce possible collisions when setting the skip
bit in the pageblock flags.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip
  2020-08-14 21:15                               ` Alexander Duyck
@ 2020-08-15  9:49                                 ` Alex Shi
  2020-08-17 15:38                                   ` Alexander Duyck
  0 siblings, 1 reply; 102+ messages in thread
From: Alex Shi @ 2020-08-15  9:49 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov,
	Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm,
	Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo,
	cgroups, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim


[-- Attachment #1: Type: text/plain, Size: 2910 bytes --]



在 2020/8/15 上午5:15, Alexander Duyck 写道:
> On Fri, Aug 14, 2020 at 7:24 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>>
>> On Fri, Aug 14, 2020 at 12:19 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>
>>>
>>>
>>> 在 2020/8/13 下午12:02, Alexander Duyck 写道:
>>>>
>>>> Since we have dropped the late abort case we can drop the code that was
>>>> clearing the LRU flag and calling page_put since the abort case will now
>>>> not be holding a reference to a page.
>>>>
>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>
>>> seems the case-lru-file-mmap-read case drop about 3% on this patch in a rough testing.
>>> on my 80 core machine.
>>
>> I'm not sure how it could have that much impact on the performance
>> since the total effect would just be dropping what should be a
>> redundant test since we tested the skip bit before we took the LRU
>> bit, so we shouldn't need to test it again after.
>>
>> I finally got my test setup working last night. I'll have to do some
>> testing in my environment and I can start trying to see what is going
>> on.
> 
> So I ran the case-lru-file-mmap-read a few times and I don't see how
> it is supposed to be testing the compaction code. It doesn't seem like
> compaction is running at least on my system as a result of the test
> script. 

atteched my kernel config, it is used on mine machine,

> I wonder if testing this code wouldn't be better done using
> something like thpscale from the
> mmtests(https://github.com/gormanm/mmtests)? It seems past changes to
> the compaction code were tested using that, and the config script for
> the test explains that it is designed specifically to stress the
> compaction code. I have the test up and running now and hope to
> collect results over the weekend.

I did the testing, but a awkward is that I failed to get result,
maybe leak some packages.

# ../../compare-kernels.sh

thpscale Fault Latencies
Can't locate List/BinarySearch.pm in @INC (@INC contains: /root/mmtests/bin/lib /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vend.
BEGIN failed--compilation aborted at /root/mmtests/bin/lib/MMTests/Stat.pm line 13.
Compilation failed in require at /root/mmtests/work/log/../../bin/compare-mmtests.pl line 13.
BEGIN failed--compilation aborted at /root/mmtests/work/log/../../bin/compare-mmtests.pl line 13.

> 
> There is one change I will probably make to this patch and that is to
> place the new code that is setting skip_updated where the old code was
> calling test_and_set_skip_bit. By doing that we can avoid extra checks
> and it should help to reduce possible collisions when setting the skip
> bit in the pageblock flags.

the problem maybe on cmpchxg pb flags, that may involved other blocks changes.


> 
> Thanks.
> 
> - Alex
> 

[-- Attachment #2: config80 --]
[-- Type: text/plain, Size: 158126 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 5.8.0 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=80301
CONFIG_LD_VERSION=230000000
CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_SIM=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING=y
# CONFIG_CONTEXT_TRACKING_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_HAVE_SCHED_AVG_IRQ=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# end of RCU Subsystem

CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=20
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_TIME_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CHECKPOINT_RESTORE=y
CONFIG_SCHED_AUTOGROUP=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_RD_ZSTD=y
CONFIG_BOOT_CONFIG=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BPF=y
CONFIG_EXPERT=y
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_HAVE_ARCH_USERFAULTFD_WP=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
# CONFIG_BPF_LSM is not set
CONFIG_BPF_SYSCALL=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_USERFAULTFD=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_RSEQ=y
# CONFIG_DEBUG_RSEQ is not set
CONFIG_EMBEDDED=y
CONFIG_HAVE_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
# end of Kernel Performance Events And Counters

CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set
# CONFIG_SHUFFLE_PAGE_ALLOCATOR is not set
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SYSTEM_DATA_VERIFICATION=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
# end of General setup

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_FILTER_PGPROT=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DYNAMIC_PHYSICAL_MASK=y
CONFIG_PGTABLE_LEVELS=5
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_FEATURE_NAMES=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
CONFIG_RETPOLINE=y
CONFIG_X86_CPU_RESCTRL=y
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
CONFIG_X86_UV=y
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_MID is not set
CONFIG_X86_INTEL_LPSS=y
CONFIG_X86_AMD_PLATFORM_DEVICE=y
CONFIG_IOSF_MBI=y
# CONFIG_IOSF_MBI_DEBUG is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_XXL=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_PARAVIRT_SPINLOCKS=y
CONFIG_X86_HV_CALLBACK_VECTOR=y
CONFIG_XEN=y
CONFIG_XEN_PV=y
CONFIG_XEN_PV_SMP=y
# CONFIG_XEN_DOM0 is not set
CONFIG_XEN_PVHVM=y
CONFIG_XEN_PVHVM_SMP=y
CONFIG_XEN_512GB=y
CONFIG_XEN_SAVE_RESTORE=y
# CONFIG_XEN_DEBUG_FS is not set
# CONFIG_XEN_PVH is not set
CONFIG_KVM_GUEST=y
CONFIG_ARCH_CPUIDLE_HALTPOLL=y
# CONFIG_PVH is not set
CONFIG_PARAVIRT_TIME_ACCOUNTING=y
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_JAILHOUSE_GUEST is not set
# CONFIG_ACRN_GUEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_IA32_FEAT_CTL=y
CONFIG_X86_VMX_FEATURE_NAMES=y
# CONFIG_PROCESSOR_SELECT is not set
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_HYGON=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_ZHAOXIN=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_MAXSMP=y
CONFIG_NR_CPUS_RANGE_BEGIN=8192
CONFIG_NR_CPUS_RANGE_END=8192
CONFIG_NR_CPUS_DEFAULT=8192
CONFIG_NR_CPUS=8192
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCELOG_LEGACY=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
# end of Performance monitoring

CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
CONFIG_X86_IOPL_IOPERM=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_5LEVEL=y
CONFIG_X86_DIRECT_GBPAGES=y
# CONFIG_X86_CPA_STATISTICS is not set
CONFIG_AMD_MEM_ENCRYPT=y
# CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=10
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_MEMORY_PROBE=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
# CONFIG_X86_PMEM_LEGACY is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
# CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK is not set
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_X86_UMIP=y
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
CONFIG_X86_INTEL_TSX_MODE_OFF=y
# CONFIG_X86_INTEL_TSX_MODE_ON is not set
# CONFIG_X86_INTEL_TSX_MODE_AUTO is not set
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_EFI_MIXED=y
CONFIG_SECCOMP=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
# CONFIG_KEXEC_SIG is not set
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y
CONFIG_X86_NEED_RELOCS=y
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_DYNAMIC_MEMORY_LAYOUT=y
CONFIG_RANDOMIZE_MEMORY=y
CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING=0xa
CONFIG_HOTPLUG_CPU=y
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
# CONFIG_DEBUG_HOTPLUG_CPU0 is not set
# CONFIG_COMPAT_VDSO is not set
CONFIG_LEGACY_VSYSCALL_EMULATE=y
# CONFIG_LEGACY_VSYSCALL_XONLY is not set
# CONFIG_LEGACY_VSYSCALL_NONE is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_MODIFY_LDT_SYSCALL=y
CONFIG_HAVE_LIVEPATCH=y
CONFIG_LIVEPATCH=y
# end of Processor type and features

CONFIG_ARCH_HAS_ADD_PAGES=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_SUSPEND_SKIP_SYNC is not set
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_HIBERNATION_SNAPSHOT_DEV=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
CONFIG_PM_ADVANCED_DEBUG=y
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
# CONFIG_DPM_WATCHDOG is not set
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_PM_CLK=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
# CONFIG_ENERGY_MODEL is not set
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SPCR_TABLE=y
CONFIG_ACPI_LPIT=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
# CONFIG_ACPI_VIDEO is not set
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_TAD is not set
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
# CONFIG_ACPI_IPMI is not set
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_PROCESSOR_AGGREGATOR=m
CONFIG_ACPI_THERMAL=y
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_MEMORY=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
CONFIG_ACPI_HED=y
# CONFIG_ACPI_CUSTOM_METHOD is not set
CONFIG_ACPI_BGRT=y
# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
CONFIG_ACPI_NFIT=m
# CONFIG_NFIT_SECURITY_DEBUG is not set
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_HMAT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
# CONFIG_ACPI_APEI_EINJ is not set
CONFIG_ACPI_APEI_ERST_DEBUG=y
# CONFIG_DPTF_POWER is not set
# CONFIG_ACPI_EXTLOG is not set
CONFIG_ACPI_ADXL=y
# CONFIG_PMIC_OPREGION is not set
# CONFIG_ACPI_CONFIGFS is not set
CONFIG_X86_PM_TIMER=y
CONFIG_SFI=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y

#
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
# CONFIG_X86_PCC_CPUFREQ is not set
# CONFIG_X86_ACPI_CPUFREQ is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# end of CPU Frequency scaling

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_CPU_IDLE_GOV_TEO is not set
# CONFIG_CPU_IDLE_GOV_HALTPOLL is not set
CONFIG_HALTPOLL_CPUIDLE=y
# end of CPU Idle

CONFIG_INTEL_IDLE=y
# end of Power management and ACPI options

#
# Bus options (PCI etc.)
#
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_XEN=y
CONFIG_MMCONF_FAM10H=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
# CONFIG_ISA_BUS is not set
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
# CONFIG_X86_SYSFB is not set
# end of Bus options (PCI etc.)

#
# Binary Emulations
#
CONFIG_IA32_EMULATION=y
# CONFIG_X86_X32 is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
# end of Binary Emulations

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y
CONFIG_DMI_SYSFS=y
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# CONFIG_ISCSI_IBFT is not set
CONFIG_FW_CFG_SYSFS=y
# CONFIG_FW_CFG_SYSFS_CMDLINE is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_VARS=y
CONFIG_EFI_ESRT=y
# CONFIG_EFI_VARS_PSTORE is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
CONFIG_EFI_GENERIC_STUB_INITRD_CMDLINE_LOADER=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_APPLE_PROPERTIES is not set
# CONFIG_RESET_ATTACK_MITIGATION is not set
# CONFIG_EFI_RCI2_TABLE is not set
# CONFIG_EFI_DISABLE_PCI_DMA is not set
# end of EFI (Extensible Firmware Interface) Support

CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y
CONFIG_EFI_EARLYCON=y
CONFIG_EFI_CUSTOM_SSDT_OVERLAYS=y

#
# Tegra firmware driver
#
# end of Tegra firmware driver
# end of Firmware Drivers

CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
CONFIG_KVM_COMPAT=y
CONFIG_HAVE_KVM_IRQ_BYPASS=y
CONFIG_HAVE_KVM_NO_POLL=y
CONFIG_KVM_XFER_TO_GUEST_WORK=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_WERROR=y
CONFIG_KVM_INTEL=m
# CONFIG_KVM_AMD is not set
CONFIG_KVM_MMU_AUDIT=y
CONFIG_AS_AVX512=y
CONFIG_AS_SHA1_NI=y
CONFIG_AS_SHA256_NI=y

#
# General architecture-dependent options
#
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_HOTPLUG_SMT=y
CONFIG_GENERIC_ENTRY=y
# CONFIG_OPROFILE is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
# CONFIG_STATIC_KEYS_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_HAS_SET_DIRECT_MAP=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_TABLE_FREE=y
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_HAVE_STACK_VALIDATION=y
CONFIG_HAVE_RELIABLE_STACKTRACE=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y
CONFIG_ARCH_USE_MEMREMAP_PROT=y
# CONFIG_LOCK_EVENT_COUNTS is not set
CONFIG_ARCH_HAS_MEM_ENCRYPT=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# end of GCOV-based kernel profiling

CONFIG_HAVE_GCC_PLUGINS=y
# end of General architecture-dependent options

CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULE_SIG_FORMAT=y
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_MODULE_SIG=y
# CONFIG_MODULE_SIG_FORCE is not set
CONFIG_MODULE_SIG_ALL=y
# CONFIG_MODULE_SIG_SHA1 is not set
# CONFIG_MODULE_SIG_SHA224 is not set
CONFIG_MODULE_SIG_SHA256=y
# CONFIG_MODULE_SIG_SHA384 is not set
# CONFIG_MODULE_SIG_SHA512 is not set
CONFIG_MODULE_SIG_HASH="sha256"
# CONFIG_MODULE_COMPRESS is not set
# CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is not set
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLK_SCSI_REQUEST=y
CONFIG_BLK_CGROUP_RWSTAT=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_DEV_ZONED=y
CONFIG_BLK_DEV_THROTTLING=y
# CONFIG_BLK_DEV_THROTTLING_LOW is not set
# CONFIG_BLK_CMDLINE_PARSER is not set
# CONFIG_BLK_WBT is not set
# CONFIG_BLK_CGROUP_IOLATENCY is not set
# CONFIG_BLK_CGROUP_IOCOST is not set
CONFIG_BLK_DEBUG_FS=y
CONFIG_BLK_DEBUG_FS_ZONED=y
# CONFIG_BLK_SED_OPAL is not set
# CONFIG_BLK_INLINE_ENCRYPTION is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_AIX_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_CMDLINE_PARTITION is not set
# end of Partition Types

CONFIG_BLOCK_COMPAT=y
CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_BLK_PM=y

#
# IO Schedulers
#
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
# end of IO Schedulers

CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_PADATA=y
CONFIG_ASN1=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
# CONFIG_BINFMT_MISC is not set
CONFIG_COREDUMP=y
# end of Executable file formats

#
# Memory Management options
#
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_HAVE_BOOTMEM_INFO_NODE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
# CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is not set
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MEMORY_BALLOON=y
CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y
CONFIG_PAGE_REPORTING=y
CONFIG_MIGRATION=y
CONFIG_CONTIG_ALLOC=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
# CONFIG_HWPOISON_INJECT is not set
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_THP_SWAP=y
CONFIG_CLEANCACHE=y
CONFIG_FRONTSWAP=y
CONFIG_CMA=y
# CONFIG_CMA_DEBUG is not set
# CONFIG_CMA_DEBUGFS is not set
CONFIG_CMA_AREAS=7
CONFIG_MEM_SOFT_DIRTY=y
CONFIG_ZSWAP=y
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_DEFLATE is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZO=y
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_842 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4HC is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_ZSTD is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT="lzo"
CONFIG_ZSWAP_ZPOOL_DEFAULT_ZBUD=y
# CONFIG_ZSWAP_ZPOOL_DEFAULT_Z3FOLD is not set
# CONFIG_ZSWAP_ZPOOL_DEFAULT_ZSMALLOC is not set
CONFIG_ZSWAP_ZPOOL_DEFAULT="zbud"
# CONFIG_ZSWAP_DEFAULT_ON is not set
CONFIG_ZPOOL=y
CONFIG_ZBUD=y
# CONFIG_Z3FOLD is not set
CONFIG_ZSMALLOC=y
# CONFIG_ZSMALLOC_PGTABLE_MAPPING is not set
# CONFIG_ZSMALLOC_STAT is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
CONFIG_IDLE_PAGE_TRACKING=y
CONFIG_ARCH_HAS_PTE_DEVMAP=y
CONFIG_ZONE_DEVICE=y
CONFIG_DEV_PAGEMAP_OPS=y
# CONFIG_DEVICE_PRIVATE is not set
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
# CONFIG_PERCPU_STATS is not set
# CONFIG_GUP_BENCHMARK is not set
CONFIG_READ_ONLY_THP_FOR_FS=y
CONFIG_ARCH_HAS_PTE_SPECIAL=y
# end of Memory Management options

CONFIG_NET=y
CONFIG_NET_INGRESS=y
CONFIG_SKB_EXTENSIONS=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_TLS is not set
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
# CONFIG_XFRM_INTERFACE is not set
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
# CONFIG_NET_KEY is not set
# CONFIG_XDP_SOCKETS is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_FIB_TRIE_STATS=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
CONFIG_IP_MROUTE_COMMON=y
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_SYN_COOKIES=y
# CONFIG_NET_IPVTI is not set
# CONFIG_NET_FOU is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_NV is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
# CONFIG_TCP_CONG_DCTCP is not set
# CONFIG_TCP_CONG_CDG is not set
# CONFIG_TCP_CONG_BBR is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_IPV6_ILA is not set
# CONFIG_IPV6_VTI is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
# CONFIG_IPV6_SUBTREES is not set
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_PIMSM_V2=y
CONFIG_IPV6_SEG6_LWTUNNEL=y
# CONFIG_IPV6_SEG6_HMAC is not set
CONFIG_IPV6_SEG6_BPF=y
# CONFIG_IPV6_RPL_LWTUNNEL is not set
CONFIG_NETLABEL=y
# CONFIG_MPTCP is not set
CONFIG_NETWORK_SECMARK=y
CONFIG_NET_PTP_CLASSIFY=y
CONFIG_NETWORK_PHY_TIMESTAMPING=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=m

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_INGRESS=y
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_FAMILY_BRIDGE=y
CONFIG_NETFILTER_NETLINK_ACCT=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NETFILTER_NETLINK_OSF=m
CONFIG_NF_CONNTRACK=m
CONFIG_NF_LOG_COMMON=m
CONFIG_NF_LOG_NETDEV=m
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_ZONES=y
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_EVENTS=y
CONFIG_NF_CONNTRACK_TIMEOUT=y
CONFIG_NF_CONNTRACK_TIMESTAMP=y
CONFIG_NF_CONNTRACK_LABELS=y
CONFIG_NF_CT_PROTO_DCCP=y
CONFIG_NF_CT_PROTO_SCTP=y
CONFIG_NF_CT_PROTO_UDPLITE=y
# CONFIG_NF_CONNTRACK_AMANDA is not set
# CONFIG_NF_CONNTRACK_FTP is not set
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_SNMP is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
# CONFIG_NF_CT_NETLINK is not set
# CONFIG_NF_CT_NETLINK_TIMEOUT is not set
CONFIG_NF_NAT=m
CONFIG_NF_NAT_MASQUERADE=y
# CONFIG_NF_TABLES is not set
CONFIG_NETFILTER_XTABLES=y

#
# Xtables combined modules
#
# CONFIG_NETFILTER_XT_MARK is not set
# CONFIG_NETFILTER_XT_CONNMARK is not set

#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_AUDIT is not set
CONFIG_NETFILTER_XT_TARGET_CHECKSUM=m
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_CONNMARK is not set
# CONFIG_NETFILTER_XT_TARGET_CONNSECMARK is not set
# CONFIG_NETFILTER_XT_TARGET_DSCP is not set
# CONFIG_NETFILTER_XT_TARGET_HL is not set
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
# CONFIG_NETFILTER_XT_TARGET_LED is not set
# CONFIG_NETFILTER_XT_TARGET_LOG is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
CONFIG_NETFILTER_XT_NAT=m
# CONFIG_NETFILTER_XT_TARGET_NETMAP is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_REDIRECT is not set
CONFIG_NETFILTER_XT_TARGET_MASQUERADE=m
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
# CONFIG_NETFILTER_XT_TARGET_TPROXY is not set
# CONFIG_NETFILTER_XT_TARGET_SECMARK is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set
# CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP is not set

#
# Xtables matches
#
# CONFIG_NETFILTER_XT_MATCH_ADDRTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_CLUSTER is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNBYTES is not set
# CONFIG_NETFILTER_XT_MATCH_CONNLABEL is not set
# CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ECN is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_HELPER is not set
# CONFIG_NETFILTER_XT_MATCH_HL is not set
# CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
# CONFIG_NETFILTER_XT_MATCH_L2TP is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OSF is not set
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=m
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_SOCKET is not set
# CONFIG_NETFILTER_XT_MATCH_STATE is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
# end of Core Netfilter Configuration

# CONFIG_IP_SET is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
# CONFIG_NF_SOCKET_IPV4 is not set
# CONFIG_NF_TPROXY_IPV4 is not set
# CONFIG_NF_DUP_IPV4 is not set
# CONFIG_NF_LOG_ARP is not set
# CONFIG_NF_LOG_IPV4 is not set
CONFIG_NF_REJECT_IPV4=m
CONFIG_IP_NF_IPTABLES=m
# CONFIG_IP_NF_MATCH_AH is not set
# CONFIG_IP_NF_MATCH_ECN is not set
# CONFIG_IP_NF_MATCH_RPFILTER is not set
# CONFIG_IP_NF_MATCH_TTL is not set
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
# CONFIG_IP_NF_TARGET_SYNPROXY is not set
CONFIG_IP_NF_NAT=m
# CONFIG_IP_NF_TARGET_MASQUERADE is not set
# CONFIG_IP_NF_TARGET_NETMAP is not set
# CONFIG_IP_NF_TARGET_REDIRECT is not set
CONFIG_IP_NF_MANGLE=m
# CONFIG_IP_NF_TARGET_CLUSTERIP is not set
# CONFIG_IP_NF_TARGET_ECN is not set
# CONFIG_IP_NF_TARGET_TTL is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_SECURITY is not set
# CONFIG_IP_NF_ARPTABLES is not set
# end of IP: Netfilter Configuration

#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_SOCKET_IPV6 is not set
# CONFIG_NF_TPROXY_IPV6 is not set
# CONFIG_NF_DUP_IPV6 is not set
# CONFIG_NF_REJECT_IPV6 is not set
# CONFIG_NF_LOG_IPV6 is not set
CONFIG_IP6_NF_IPTABLES=m
# CONFIG_IP6_NF_MATCH_AH is not set
# CONFIG_IP6_NF_MATCH_EUI64 is not set
# CONFIG_IP6_NF_MATCH_FRAG is not set
# CONFIG_IP6_NF_MATCH_OPTS is not set
# CONFIG_IP6_NF_MATCH_HL is not set
# CONFIG_IP6_NF_MATCH_IPV6HEADER is not set
# CONFIG_IP6_NF_MATCH_MH is not set
# CONFIG_IP6_NF_MATCH_RPFILTER is not set
# CONFIG_IP6_NF_MATCH_RT is not set
# CONFIG_IP6_NF_MATCH_SRH is not set
# CONFIG_IP6_NF_TARGET_HL is not set
CONFIG_IP6_NF_FILTER=m
# CONFIG_IP6_NF_TARGET_REJECT is not set
# CONFIG_IP6_NF_TARGET_SYNPROXY is not set
CONFIG_IP6_NF_MANGLE=m
# CONFIG_IP6_NF_RAW is not set
# CONFIG_IP6_NF_SECURITY is not set
# CONFIG_IP6_NF_NAT is not set
# end of IPv6: Netfilter Configuration

CONFIG_NF_DEFRAG_IPV6=m
# CONFIG_NF_CONNTRACK_BRIDGE is not set
CONFIG_BRIDGE_NF_EBTABLES=m
# CONFIG_BRIDGE_EBT_BROUTE is not set
CONFIG_BRIDGE_EBT_T_FILTER=m
# CONFIG_BRIDGE_EBT_T_NAT is not set
# CONFIG_BRIDGE_EBT_802_3 is not set
# CONFIG_BRIDGE_EBT_AMONG is not set
# CONFIG_BRIDGE_EBT_ARP is not set
# CONFIG_BRIDGE_EBT_IP is not set
# CONFIG_BRIDGE_EBT_IP6 is not set
# CONFIG_BRIDGE_EBT_LIMIT is not set
# CONFIG_BRIDGE_EBT_MARK is not set
# CONFIG_BRIDGE_EBT_PKTTYPE is not set
# CONFIG_BRIDGE_EBT_STP is not set
# CONFIG_BRIDGE_EBT_VLAN is not set
# CONFIG_BRIDGE_EBT_ARPREPLY is not set
# CONFIG_BRIDGE_EBT_DNAT is not set
# CONFIG_BRIDGE_EBT_MARK_T is not set
# CONFIG_BRIDGE_EBT_REDIRECT is not set
# CONFIG_BRIDGE_EBT_SNAT is not set
# CONFIG_BRIDGE_EBT_LOG is not set
# CONFIG_BRIDGE_EBT_NFLOG is not set
# CONFIG_BPFILTER is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
# CONFIG_BRIDGE_MRP is not set
CONFIG_HAVE_NET_DSA=y
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFB is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_CBS is not set
# CONFIG_NET_SCH_ETF is not set
# CONFIG_NET_SCH_TAPRIO is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_MQPRIO is not set
# CONFIG_NET_SCH_SKBPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_CODEL is not set
# CONFIG_NET_SCH_FQ_CODEL is not set
# CONFIG_NET_SCH_CAKE is not set
# CONFIG_NET_SCH_FQ is not set
# CONFIG_NET_SCH_HHF is not set
# CONFIG_NET_SCH_PIE is not set
# CONFIG_NET_SCH_INGRESS is not set
# CONFIG_NET_SCH_PLUG is not set
# CONFIG_NET_SCH_ETS is not set
# CONFIG_NET_SCH_DEFAULT is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_CLS_CGROUP=y
# CONFIG_NET_CLS_BPF is not set
# CONFIG_NET_CLS_FLOWER is not set
# CONFIG_NET_CLS_MATCHALL is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
# CONFIG_NET_EMATCH_IPT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_SAMPLE is not set
# CONFIG_NET_ACT_IPT is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_ACT_CSUM is not set
# CONFIG_NET_ACT_MPLS is not set
# CONFIG_NET_ACT_VLAN is not set
# CONFIG_NET_ACT_BPF is not set
# CONFIG_NET_ACT_CONNMARK is not set
# CONFIG_NET_ACT_CTINFO is not set
# CONFIG_NET_ACT_SKBMOD is not set
# CONFIG_NET_ACT_IFE is not set
# CONFIG_NET_ACT_TUNNEL_KEY is not set
# CONFIG_NET_ACT_GATE is not set
# CONFIG_NET_TC_SKB_EXT is not set
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
# CONFIG_DNS_RESOLVER is not set
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
CONFIG_MPLS=y
CONFIG_NET_MPLS_GSO=y
# CONFIG_MPLS_ROUTING is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
CONFIG_NET_SWITCHDEV=y
CONFIG_NET_L3_MASTER_DEV=y
# CONFIG_QRTR is not set
# CONFIG_NET_NCSI is not set
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
CONFIG_CGROUP_NET_CLASSID=y
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
CONFIG_NET_DROP_MONITOR=y
# end of Network testing
# end of Networking options

# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
CONFIG_STREAM_PARSER=y
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set

#
# CFG80211 needs to be enabled for MAC80211
#
CONFIG_MAC80211_STA_HASH_MAX_SIZE=0
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
# CONFIG_NET_9P_XEN is not set
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
CONFIG_LWTUNNEL=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_DST_CACHE=y
CONFIG_GRO_CELLS=y
CONFIG_NET_SOCK_MSG=y
CONFIG_FAILOVER=m
CONFIG_ETHTOOL_NETLINK=y
CONFIG_HAVE_EBPF_JIT=y

#
# Device Drivers
#
CONFIG_HAVE_EISA=y
# CONFIG_EISA is not set
CONFIG_HAVE_PCI=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIE_ECRC=y
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
# CONFIG_PCIE_DPC is not set
# CONFIG_PCIE_PTM is not set
# CONFIG_PCIE_BW is not set
CONFIG_PCI_MSI=y
CONFIG_PCI_MSI_IRQ_DOMAIN=y
CONFIG_PCI_QUIRKS=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
CONFIG_PCI_STUB=y
# CONFIG_PCI_PF_STUB is not set
# CONFIG_XEN_PCIDEV_FRONTEND is not set
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
# CONFIG_PCI_P2PDMA is not set
CONFIG_PCI_LABEL=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_ACPI=y
# CONFIG_HOTPLUG_PCI_ACPI_IBM is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
CONFIG_HOTPLUG_PCI_SHPC=y

#
# PCI controller drivers
#
CONFIG_VMD=y

#
# DesignWare PCI Core Support
#
# CONFIG_PCIE_DW_PLAT_HOST is not set
# CONFIG_PCI_MESON is not set
# end of DesignWare PCI Core Support

#
# Mobiveil PCIe Core Support
#
# end of Mobiveil PCIe Core Support

#
# Cadence PCIe controllers support
#
# end of Cadence PCIe controllers support
# end of PCI controller drivers

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set
# end of PCI Endpoint

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# end of PCI switch controller drivers

CONFIG_PCCARD=y
# CONFIG_PCMCIA is not set
CONFIG_CARDBUS=y

#
# PC-card bridges
#
# CONFIG_YENTA is not set
# CONFIG_RAPIDIO is not set

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y

#
# Firmware loader
#
CONFIG_FW_LOADER=y
CONFIG_FW_LOADER_PAGED_BUF=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_USER_HELPER=y
# CONFIG_FW_LOADER_USER_HELPER_FALLBACK is not set
# CONFIG_FW_LOADER_COMPRESS is not set
CONFIG_FW_CACHE=y
# end of Firmware loader

CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
CONFIG_SYS_HYPERVISOR=y
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_DMA_FENCE_TRACE is not set
# end of Generic Driver Options

#
# Bus devices
#
# CONFIG_MHI_BUS is not set
# end of Bus devices

CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_GNSS is not set
# CONFIG_MTD is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_ZRAM is not set
# CONFIG_BLK_DEV_UMEM is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_LOOP_MIN_COUNT=0
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SKD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_XEN_BLKDEV_FRONTEND is not set
CONFIG_VIRTIO_BLK=y
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_RSXX is not set

#
# NVME Support
#
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TARGET is not set
# end of NVME Support

#
# Misc devices
#
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_SGI_XP is not set
# CONFIG_HP_ILO is not set
# CONFIG_SGI_GRU is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_LATTICE_ECP3_CONFIG is not set
# CONFIG_SRAM is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_XILINX_SDFEC is not set
CONFIG_PVPANIC=y
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_AT25 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_EEPROM_93XX46 is not set
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_EEPROM_EE1004 is not set
# end of EEPROM support

# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_TI_ST is not set
# end of Texas Instruments shared transport line discipline

# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_ALTERA_STAPL is not set
CONFIG_INTEL_MEI=m
CONFIG_INTEL_MEI_ME=m
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set

#
# Intel MIC & related support
#
# CONFIG_INTEL_MIC_BUS is not set
# CONFIG_SCIF_BUS is not set
# CONFIG_VOP_BUS is not set
# end of Intel MIC & related support

# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_MISC_ALCOR_PCI is not set
# CONFIG_MISC_RTSX_PCI is not set
# CONFIG_MISC_RTSX_USB is not set
# CONFIG_HABANA_AI is not set
# CONFIG_UACCE is not set
# end of Misc devices

CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
# CONFIG_BLK_DEV_SD is not set
# CONFIG_CHR_DEV_ST is not set
# CONFIG_BLK_DEV_SR is not set
# CONFIG_CHR_DEV_SG is not set
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
# end of SCSI Transports

CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_CXGB4_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_SCSI_ESAS2R is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT3SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_SMARTPQI is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_MYRB is not set
# CONFIG_SCSI_MYRS is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_XEN_SCSI_FRONTEND is not set
# CONFIG_SCSI_SNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_FDOMAIN_PCI is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_WD719X is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
# CONFIG_SCSI_PM8001 is not set
# CONFIG_SCSI_VIRTIO is not set
CONFIG_SCSI_DH=y
CONFIG_SCSI_DH_RDAC=y
CONFIG_SCSI_DH_HP_SW=y
CONFIG_SCSI_DH_EMC=y
CONFIG_SCSI_DH_ALUA=y
# end of SCSI device support

CONFIG_ATA=m
CONFIG_SATA_HOST=y
CONFIG_PATA_TIMINGS=y
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_FORCE=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=m
CONFIG_SATA_MOBILE_LPM_POLICY=0
CONFIG_SATA_AHCI_PLATFORM=m
# CONFIG_SATA_INIC162X is not set
CONFIG_SATA_ACARD_AHCI=m
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
# CONFIG_ATA_PIIX is not set
# CONFIG_SATA_DWC is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PLATFORM is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
# CONFIG_PATA_ACPI is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
# CONFIG_MD_LINEAR is not set
# CONFIG_MD_RAID0 is not set
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=m
CONFIG_DM_DEBUG=y
# CONFIG_DM_UNSTRIPED is not set
# CONFIG_DM_CRYPT is not set
# CONFIG_DM_SNAPSHOT is not set
# CONFIG_DM_THIN_PROVISIONING is not set
# CONFIG_DM_CACHE is not set
# CONFIG_DM_WRITECACHE is not set
# CONFIG_DM_EBS is not set
# CONFIG_DM_ERA is not set
# CONFIG_DM_CLONE is not set
CONFIG_DM_MIRROR=m
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_RAID is not set
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_DUST is not set
CONFIG_DM_UEVENT=y
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_DM_LOG_WRITES is not set
# CONFIG_DM_INTEGRITY is not set
# CONFIG_DM_ZONED is not set
# CONFIG_TARGET_CORE is not set
CONFIG_FUSION=y
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=128
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# end of IEEE 1394 (FireWire) support

CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_MII=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_WIREGUARD is not set
# CONFIG_EQUALIZER is not set
CONFIG_NET_FC=y
# CONFIG_IFB is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_IPVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_GENEVE is not set
# CONFIG_BAREUDP is not set
# CONFIG_GTP is not set
CONFIG_MACSEC=y
# CONFIG_NETCONSOLE is not set
CONFIG_TUN=m
# CONFIG_TUN_VNET_CROSS_LE is not set
# CONFIG_VETH is not set
CONFIG_VIRTIO_NET=m
# CONFIG_NLMON is not set
CONFIG_NET_VRF=y
# CONFIG_ARCNET is not set

#
# Distributed Switch Architecture drivers
#
# end of Distributed Switch Architecture drivers

CONFIG_ETHERNET=y
CONFIG_MDIO=y
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_NET_VENDOR_ADAPTEC is not set
CONFIG_NET_VENDOR_AGERE=y
# CONFIG_ET131X is not set
CONFIG_NET_VENDOR_ALACRITECH=y
# CONFIG_SLICOSS is not set
# CONFIG_NET_VENDOR_ALTEON is not set
# CONFIG_ALTERA_TSE is not set
CONFIG_NET_VENDOR_AMAZON=y
# CONFIG_ENA_ETHERNET is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_AMD_XGBE is not set
CONFIG_NET_VENDOR_AQUANTIA=y
# CONFIG_AQTION is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
CONFIG_NET_VENDOR_AURORA=y
# CONFIG_AURORA_NB8800 is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BCMGENET is not set
# CONFIG_BNX2 is not set
# CONFIG_CNIC is not set
CONFIG_TIGON3=y
CONFIG_TIGON3_HWMON=y
# CONFIG_BNX2X is not set
# CONFIG_SYSTEMPORT is not set
# CONFIG_BNXT is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
CONFIG_NET_VENDOR_CADENCE=y
# CONFIG_MACB is not set
CONFIG_NET_VENDOR_CAVIUM=y
# CONFIG_THUNDER_NIC_PF is not set
# CONFIG_THUNDER_NIC_VF is not set
# CONFIG_THUNDER_NIC_BGX is not set
# CONFIG_THUNDER_NIC_RGX is not set
CONFIG_CAVIUM_PTP=y
# CONFIG_LIQUIDIO is not set
# CONFIG_LIQUIDIO_VF is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
# CONFIG_CHELSIO_T3 is not set
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
CONFIG_NET_VENDOR_CORTINA=y
# CONFIG_CX_ECAT is not set
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
CONFIG_TULIP=y
# CONFIG_TULIP_MWI is not set
CONFIG_TULIP_MMIO=y
# CONFIG_TULIP_NAPI is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_PCMCIA_XIRCOM is not set
# CONFIG_NET_VENDOR_DLINK is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_EZCHIP=y
CONFIG_NET_VENDOR_GOOGLE=y
# CONFIG_GVE is not set
CONFIG_NET_VENDOR_HUAWEI=y
# CONFIG_HINIC is not set
# CONFIG_NET_VENDOR_I825XX is not set
CONFIG_NET_VENDOR_INTEL=y
# CONFIG_E100 is not set
CONFIG_E1000=y
CONFIG_E1000E=y
CONFIG_E1000E_HWTS=y
CONFIG_IGB=y
CONFIG_IGB_HWMON=y
# CONFIG_IGBVF is not set
# CONFIG_IXGB is not set
CONFIG_IXGBE=y
CONFIG_IXGBE_HWMON=y
CONFIG_IXGBE_DCB=y
# CONFIG_IXGBEVF is not set
CONFIG_I40E=y
CONFIG_I40E_DCB=y
# CONFIG_I40EVF is not set
# CONFIG_ICE is not set
# CONFIG_FM10K is not set
# CONFIG_IGC is not set
# CONFIG_JME is not set
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
CONFIG_SKGE=y
# CONFIG_SKGE_DEBUG is not set
CONFIG_SKGE_GENESIS=y
# CONFIG_SKY2 is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX5_CORE is not set
# CONFIG_MLXSW_CORE is not set
# CONFIG_MLXFW is not set
# CONFIG_NET_VENDOR_MICREL is not set
# CONFIG_NET_VENDOR_MICROCHIP is not set
CONFIG_NET_VENDOR_MICROSEMI=y
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
# CONFIG_NET_VENDOR_NATSEMI is not set
CONFIG_NET_VENDOR_NETERION=y
# CONFIG_S2IO is not set
# CONFIG_VXGE is not set
CONFIG_NET_VENDOR_NETRONOME=y
# CONFIG_NFP is not set
CONFIG_NET_VENDOR_NI=y
# CONFIG_NI_XGE_MANAGEMENT_ENET is not set
# CONFIG_NET_VENDOR_NVIDIA is not set
CONFIG_NET_VENDOR_OKI=y
# CONFIG_ETHOC is not set
CONFIG_NET_VENDOR_PACKET_ENGINES=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_PENSANDO=y
# CONFIG_IONIC is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_QED is not set
CONFIG_NET_VENDOR_QUALCOMM=y
# CONFIG_QCOM_EMAC is not set
# CONFIG_RMNET is not set
# CONFIG_NET_VENDOR_RDC is not set
CONFIG_NET_VENDOR_REALTEK=y
CONFIG_8139CP=y
CONFIG_8139TOO=y
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
CONFIG_R8169=y
CONFIG_NET_VENDOR_RENESAS=y
CONFIG_NET_VENDOR_ROCKER=y
# CONFIG_ROCKER is not set
CONFIG_NET_VENDOR_SAMSUNG=y
# CONFIG_SXGBE_ETH is not set
# CONFIG_NET_VENDOR_SEEQ is not set
CONFIG_NET_VENDOR_SOLARFLARE=y
# CONFIG_SFC is not set
# CONFIG_SFC_FALCON is not set
# CONFIG_NET_VENDOR_SILAN is not set
# CONFIG_NET_VENDOR_SIS is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_EPIC100 is not set
# CONFIG_SMSC911X is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_SOCIONEXT=y
# CONFIG_NET_VENDOR_STMICRO is not set
# CONFIG_NET_VENDOR_SUN is not set
CONFIG_NET_VENDOR_SYNOPSYS=y
# CONFIG_DWC_XLGMAC is not set
# CONFIG_NET_VENDOR_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TI_CPSW_PHY_SEL is not set
# CONFIG_TLAN is not set
# CONFIG_NET_VENDOR_VIA is not set
# CONFIG_NET_VENDOR_WIZNET is not set
CONFIG_NET_VENDOR_XILINX=y
# CONFIG_XILINX_AXI_EMAC is not set
# CONFIG_XILINX_LL_TEMAC is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_MDIO_DEVICE=y
CONFIG_MDIO_BUS=y
CONFIG_MDIO_DEVRES=y
# CONFIG_MDIO_BCM_UNIMAC is not set
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MDIO_MSCC_MIIM is not set
# CONFIG_MDIO_MVUSB is not set
# CONFIG_MDIO_THUNDER is not set
# CONFIG_MDIO_XPCS is not set
CONFIG_PHYLIB=y
CONFIG_SWPHY=y
# CONFIG_LED_TRIGGER_PHY is not set

#
# MII PHY device drivers
#
# CONFIG_ADIN_PHY is not set
# CONFIG_AMD_PHY is not set
# CONFIG_AQUANTIA_PHY is not set
# CONFIG_AX88796B_PHY is not set
# CONFIG_BCM7XXX_PHY is not set
# CONFIG_BCM87XX_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_BCM54140_PHY is not set
# CONFIG_BCM84881_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_CORTINA_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_DP83822_PHY is not set
# CONFIG_DP83TC811_PHY is not set
# CONFIG_DP83848_PHY is not set
# CONFIG_DP83867_PHY is not set
# CONFIG_DP83869_PHY is not set
CONFIG_FIXED_PHY=y
# CONFIG_ICPLUS_PHY is not set
# CONFIG_INTEL_XWAY_PHY is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_MARVELL_10G_PHY is not set
# CONFIG_MICREL_PHY is not set
# CONFIG_MICROCHIP_PHY is not set
# CONFIG_MICROCHIP_T1_PHY is not set
# CONFIG_MICROSEMI_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_NXP_TJA11XX_PHY is not set
# CONFIG_QSEMI_PHY is not set
CONFIG_REALTEK_PHY=y
# CONFIG_RENESAS_PHY is not set
# CONFIG_ROCKCHIP_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_TERANETICS_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_XILINX_GMII2RGMII is not set
# CONFIG_MICREL_KS8995MA is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
CONFIG_USB_NET_DRIVERS=y
CONFIG_USB_CATC=y
CONFIG_USB_KAWETH=y
CONFIG_USB_PEGASUS=y
CONFIG_USB_RTL8150=y
# CONFIG_USB_RTL8152 is not set
# CONFIG_USB_LAN78XX is not set
CONFIG_USB_USBNET=y
CONFIG_USB_NET_AX8817X=y
# CONFIG_USB_NET_AX88179_178A is not set
CONFIG_USB_NET_CDCETHER=y
CONFIG_USB_NET_CDC_EEM=y
# CONFIG_USB_NET_CDC_NCM is not set
# CONFIG_USB_NET_HUAWEI_CDC_NCM is not set
# CONFIG_USB_NET_CDC_MBIM is not set
CONFIG_USB_NET_DM9601=y
# CONFIG_USB_NET_SR9700 is not set
# CONFIG_USB_NET_SR9800 is not set
CONFIG_USB_NET_SMSC75XX=y
CONFIG_USB_NET_SMSC95XX=y
CONFIG_USB_NET_GL620A=y
CONFIG_USB_NET_NET1080=y
CONFIG_USB_NET_PLUSB=y
CONFIG_USB_NET_MCS7830=y
CONFIG_USB_NET_RNDIS_HOST=y
CONFIG_USB_NET_CDC_SUBSET_ENABLE=y
CONFIG_USB_NET_CDC_SUBSET=y
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_KC2190=y
CONFIG_USB_NET_ZAURUS=y
# CONFIG_USB_NET_CX82310_ETH is not set
# CONFIG_USB_NET_KALMIA is not set
# CONFIG_USB_NET_QMI_WWAN is not set
CONFIG_USB_NET_INT51X1=y
CONFIG_USB_IPHETH=y
CONFIG_USB_SIERRA_NET=y
# CONFIG_USB_VL600 is not set
# CONFIG_USB_NET_CH9200 is not set
# CONFIG_USB_NET_AQC111 is not set
CONFIG_WLAN=y
# CONFIG_WIRELESS_WDS is not set
CONFIG_WLAN_VENDOR_ADMTEK=y
CONFIG_WLAN_VENDOR_ATH=y
# CONFIG_ATH_DEBUG is not set
# CONFIG_ATH5K_PCI is not set
CONFIG_WLAN_VENDOR_ATMEL=y
CONFIG_WLAN_VENDOR_BROADCOM=y
CONFIG_WLAN_VENDOR_CISCO=y
CONFIG_WLAN_VENDOR_INTEL=y
CONFIG_WLAN_VENDOR_INTERSIL=y
# CONFIG_HOSTAP is not set
# CONFIG_PRISM54 is not set
CONFIG_WLAN_VENDOR_MARVELL=y
CONFIG_WLAN_VENDOR_MEDIATEK=y
CONFIG_WLAN_VENDOR_MICROCHIP=y
CONFIG_WLAN_VENDOR_RALINK=y
CONFIG_WLAN_VENDOR_REALTEK=y
CONFIG_WLAN_VENDOR_RSI=y
CONFIG_WLAN_VENDOR_ST=y
CONFIG_WLAN_VENDOR_TI=y
CONFIG_WLAN_VENDOR_ZYDAS=y
CONFIG_WLAN_VENDOR_QUANTENNA=y

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
CONFIG_WAN=y
# CONFIG_HDLC is not set
# CONFIG_DLCI is not set
# CONFIG_SBNI is not set
# CONFIG_XEN_NETDEV_FRONTEND is not set
# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
# CONFIG_NETDEVSIM is not set
CONFIG_NET_FAILOVER=m
CONFIG_ISDN=y
# CONFIG_MISDN is not set
CONFIG_NVM=y
# CONFIG_NVM_PBLK is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_LEDS=y
CONFIG_INPUT_FF_MEMLESS=y
# CONFIG_INPUT_POLLDEV is not set
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADC is not set
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
# CONFIG_KEYBOARD_APPLESPI is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1050 is not set
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_DLINK_DIR685 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_KEYBOARD_GPIO_POLLED is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_MATRIX is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_SAMSUNG is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_TM2_TOUCHKEY is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_BYD=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_SYNAPTICS_SMBUS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_ELANTECH_SMBUS=y
CONFIG_MOUSE_PS2_SENTELIC=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_PS2_FOCALTECH=y
CONFIG_MOUSE_PS2_VMMOUSE=y
CONFIG_MOUSE_PS2_SMBUS=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_CYAPA is not set
# CONFIG_MOUSE_ELAN_I2C is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_GPIO is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
# CONFIG_INPUT_JOYSTICK is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_PEGASUS is not set
# CONFIG_TABLET_SERIAL_WACOM4 is not set
CONFIG_INPUT_TOUCHSCREEN=y
CONFIG_TOUCHSCREEN_PROPERTIES=y
# CONFIG_TOUCHSCREEN_ADS7846 is not set
# CONFIG_TOUCHSCREEN_AD7877 is not set
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ADC is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
# CONFIG_TOUCHSCREEN_AUO_PIXCIR is not set
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_BU21029 is not set
# CONFIG_TOUCHSCREEN_CHIPONE_ICN8505 is not set
# CONFIG_TOUCHSCREEN_CY8CTMA140 is not set
# CONFIG_TOUCHSCREEN_CY8CTMG110 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
# CONFIG_TOUCHSCREEN_CYTTSP4_CORE is not set
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_EGALAX_SERIAL is not set
# CONFIG_TOUCHSCREEN_EXC3000 is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_GOODIX is not set
# CONFIG_TOUCHSCREEN_HIDEEP is not set
# CONFIG_TOUCHSCREEN_ILI210X is not set
# CONFIG_TOUCHSCREEN_S6SY761 is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_EKTF2127 is not set
# CONFIG_TOUCHSCREEN_ELAN is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
# CONFIG_TOUCHSCREEN_MAX11801 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MMS114 is not set
# CONFIG_TOUCHSCREEN_MELFAS_MIP4 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_PIXCIR is not set
# CONFIG_TOUCHSCREEN_WDT87XX_I2C is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
# CONFIG_TOUCHSCREEN_TSC2004 is not set
# CONFIG_TOUCHSCREEN_TSC2005 is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
# CONFIG_TOUCHSCREEN_RM_TS is not set
# CONFIG_TOUCHSCREEN_SILEAD is not set
# CONFIG_TOUCHSCREEN_SIS_I2C is not set
# CONFIG_TOUCHSCREEN_ST1232 is not set
# CONFIG_TOUCHSCREEN_STMFTS is not set
# CONFIG_TOUCHSCREEN_SURFACE3_SPI is not set
# CONFIG_TOUCHSCREEN_SX8654 is not set
# CONFIG_TOUCHSCREEN_TPS6507X is not set
# CONFIG_TOUCHSCREEN_ZET6223 is not set
# CONFIG_TOUCHSCREEN_ZFORCE is not set
# CONFIG_TOUCHSCREEN_ROHM_BU21023 is not set
# CONFIG_TOUCHSCREEN_IQS5XX is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_E3X0_BUTTON is not set
CONFIG_INPUT_PCSPKR=m
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_GPIO_BEEPER is not set
# CONFIG_INPUT_GPIO_DECODER is not set
# CONFIG_INPUT_GPIO_VIBRA is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
# CONFIG_INPUT_UINPUT is not set
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_PWM_BEEPER is not set
# CONFIG_INPUT_PWM_VIBRA is not set
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IMS_PCU is not set
# CONFIG_INPUT_IQS269A is not set
# CONFIG_INPUT_CMA3000 is not set
# CONFIG_INPUT_XEN_KBDDEV_FRONTEND is not set
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set
# CONFIG_INPUT_DRV260X_HAPTICS is not set
# CONFIG_INPUT_DRV2665_HAPTICS is not set
# CONFIG_INPUT_DRV2667_HAPTICS is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_SERIO_GPIO_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set
# end of Hardware I/O ports
# end of Input device support

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_LDISC_AUTOLOAD=y

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
# CONFIG_SERIAL_8250_DEPRECATED_OPTIONS is not set
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_16550A_VARIANTS is not set
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_DMA=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_DWLIB=y
CONFIG_SERIAL_8250_DW=y
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_MAX3100 is not set
# CONFIG_SERIAL_MAX310X is not set
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_LANTIQ is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_IFX6X60 is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_FSL_LINFLEXUART is not set
# CONFIG_SERIAL_SPRD is not set
# end of Serial drivers

CONFIG_SERIAL_MCTRL_GPIO=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_ISI is not set
# CONFIG_N_HDLC is not set
# CONFIG_N_GSM is not set
# CONFIG_NOZOMI is not set
# CONFIG_NULL_TTY is not set
# CONFIG_TRACE_SINK is not set
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y
CONFIG_HVC_XEN_FRONTEND=y
# CONFIG_SERIAL_DEV_BUS is not set
# CONFIG_TTY_PRINTK is not set
CONFIG_VIRTIO_CONSOLE=y
CONFIG_IPMI_HANDLER=m
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PLAT_DATA=y
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_SSIF=m
# CONFIG_IPMI_WATCHDOG is not set
# CONFIG_IPMI_POWEROFF is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_BA431 is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_HW_RANDOM_VIRTIO=y
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
CONFIG_DEVMEM=y
# CONFIG_DEVKMEM is not set
CONFIG_NVRAM=y
CONFIG_RAW_DRIVER=y
CONFIG_MAX_RAW_DEVS=8192
CONFIG_DEVPORT=y
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
# CONFIG_HPET_MMAP_DEFAULT is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_UV_MMTIMER is not set
CONFIG_TCG_TPM=y
CONFIG_HW_RANDOM_TPM=y
CONFIG_TCG_TIS_CORE=y
CONFIG_TCG_TIS=y
# CONFIG_TCG_TIS_SPI is not set
# CONFIG_TCG_TIS_I2C_ATMEL is not set
# CONFIG_TCG_TIS_I2C_INFINEON is not set
# CONFIG_TCG_TIS_I2C_NUVOTON is not set
# CONFIG_TCG_NSC is not set
# CONFIG_TCG_ATMEL is not set
# CONFIG_TCG_INFINEON is not set
# CONFIG_TCG_XEN is not set
CONFIG_TCG_CRB=y
# CONFIG_TCG_VTPM_PROXY is not set
# CONFIG_TCG_TIS_ST33ZP24_I2C is not set
# CONFIG_TCG_TIS_ST33ZP24_SPI is not set
# CONFIG_TELCLOCK is not set
# CONFIG_XILLYBUS is not set
# end of Character devices

# CONFIG_RANDOM_TRUST_CPU is not set
# CONFIG_RANDOM_TRUST_BOOTLOADER is not set

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=m
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_AMD_MP2 is not set
CONFIG_I2C_I801=m
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_NVIDIA_GPU is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_CBUS_GPIO is not set
# CONFIG_I2C_DESIGNWARE_PLATFORM is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EMEV2 is not set
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# end of I2C Hardware Bus support

# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# end of I2C support

# CONFIG_I3C is not set
CONFIG_SPI=y
# CONFIG_SPI_DEBUG is not set
CONFIG_SPI_MASTER=y
# CONFIG_SPI_MEM is not set

#
# SPI Master Controller Drivers
#
# CONFIG_SPI_ALTERA is not set
# CONFIG_SPI_AXI_SPI_ENGINE is not set
# CONFIG_SPI_BITBANG is not set
# CONFIG_SPI_CADENCE is not set
# CONFIG_SPI_DESIGNWARE is not set
# CONFIG_SPI_NXP_FLEXSPI is not set
# CONFIG_SPI_GPIO is not set
# CONFIG_SPI_LANTIQ_SSC is not set
# CONFIG_SPI_OC_TINY is not set
# CONFIG_SPI_PXA2XX is not set
# CONFIG_SPI_ROCKCHIP is not set
# CONFIG_SPI_SC18IS602 is not set
# CONFIG_SPI_SIFIVE is not set
# CONFIG_SPI_MXIC is not set
# CONFIG_SPI_XCOMM is not set
# CONFIG_SPI_XILINX is not set
# CONFIG_SPI_ZYNQMP_GQSPI is not set
# CONFIG_SPI_AMD is not set

#
# SPI Multiplexer support
#
# CONFIG_SPI_MUX is not set

#
# SPI Protocol Masters
#
# CONFIG_SPI_SPIDEV is not set
# CONFIG_SPI_LOOPBACK_TEST is not set
# CONFIG_SPI_TLE62X0 is not set
# CONFIG_SPI_SLAVE is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y
# CONFIG_DP83640_PHY is not set
# CONFIG_PTP_1588_CLOCK_INES is not set
# CONFIG_PTP_1588_CLOCK_KVM is not set
# CONFIG_PTP_1588_CLOCK_IDT82P33 is not set
# CONFIG_PTP_1588_CLOCK_IDTCM is not set
# CONFIG_PTP_1588_CLOCK_VMW is not set
# end of PTP clock support

CONFIG_PINCTRL=y
CONFIG_PINMUX=y
CONFIG_PINCONF=y
CONFIG_GENERIC_PINCONF=y
# CONFIG_DEBUG_PINCTRL is not set
# CONFIG_PINCTRL_AMD is not set
# CONFIG_PINCTRL_MCP23S08 is not set
# CONFIG_PINCTRL_SX150X is not set
CONFIG_PINCTRL_BAYTRAIL=y
# CONFIG_PINCTRL_CHERRYVIEW is not set
# CONFIG_PINCTRL_LYNXPOINT is not set
# CONFIG_PINCTRL_BROXTON is not set
# CONFIG_PINCTRL_CANNONLAKE is not set
# CONFIG_PINCTRL_CEDARFORK is not set
# CONFIG_PINCTRL_DENVERTON is not set
# CONFIG_PINCTRL_EMMITSBURG is not set
# CONFIG_PINCTRL_GEMINILAKE is not set
# CONFIG_PINCTRL_ICELAKE is not set
# CONFIG_PINCTRL_JASPERLAKE is not set
# CONFIG_PINCTRL_LEWISBURG is not set
# CONFIG_PINCTRL_SUNRISEPOINT is not set
# CONFIG_PINCTRL_TIGERLAKE is not set
CONFIG_GPIOLIB=y
CONFIG_GPIOLIB_FASTPATH_LIMIT=512
CONFIG_GPIO_ACPI=y
CONFIG_GPIOLIB_IRQCHIP=y
# CONFIG_DEBUG_GPIO is not set
CONFIG_GPIO_SYSFS=y

#
# Memory mapped GPIO drivers
#
# CONFIG_GPIO_AMDPT is not set
# CONFIG_GPIO_DWAPB is not set
# CONFIG_GPIO_EXAR is not set
# CONFIG_GPIO_GENERIC_PLATFORM is not set
# CONFIG_GPIO_ICH is not set
# CONFIG_GPIO_MB86S7X is not set
# CONFIG_GPIO_VX855 is not set
# CONFIG_GPIO_XILINX is not set
# CONFIG_GPIO_AMD_FCH is not set
# end of Memory mapped GPIO drivers

#
# Port-mapped I/O GPIO drivers
#
# CONFIG_GPIO_F7188X is not set
# CONFIG_GPIO_IT87 is not set
# CONFIG_GPIO_SCH is not set
# CONFIG_GPIO_SCH311X is not set
# CONFIG_GPIO_WINBOND is not set
# CONFIG_GPIO_WS16C48 is not set
# end of Port-mapped I/O GPIO drivers

#
# I2C GPIO expanders
#
# CONFIG_GPIO_ADP5588 is not set
# CONFIG_GPIO_MAX7300 is not set
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCA9570 is not set
# CONFIG_GPIO_PCF857X is not set
# CONFIG_GPIO_TPIC2810 is not set
# end of I2C GPIO expanders

#
# MFD GPIO expanders
#
# end of MFD GPIO expanders

#
# PCI GPIO expanders
#
# CONFIG_GPIO_AMD8111 is not set
# CONFIG_GPIO_BT8XX is not set
# CONFIG_GPIO_ML_IOH is not set
# CONFIG_GPIO_PCI_IDIO_16 is not set
# CONFIG_GPIO_PCIE_IDIO_24 is not set
# CONFIG_GPIO_RDC321X is not set
# end of PCI GPIO expanders

#
# SPI GPIO expanders
#
# CONFIG_GPIO_MAX3191X is not set
# CONFIG_GPIO_MAX7301 is not set
# CONFIG_GPIO_MC33880 is not set
# CONFIG_GPIO_PISOSR is not set
# CONFIG_GPIO_XRA1403 is not set
# end of SPI GPIO expanders

#
# USB GPIO expanders
#
# end of USB GPIO expanders

# CONFIG_GPIO_AGGREGATOR is not set
CONFIG_GPIO_MOCKUP=y
# CONFIG_W1 is not set
# CONFIG_POWER_AVS is not set
CONFIG_POWER_RESET=y
# CONFIG_POWER_RESET_RESTART is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
CONFIG_POWER_SUPPLY_HWMON=y
# CONFIG_PDA_POWER is not set
# CONFIG_GENERIC_ADC_BATTERY is not set
# CONFIG_TEST_POWER is not set
# CONFIG_CHARGER_ADP5061 is not set
# CONFIG_BATTERY_CW2015 is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_CHARGER_SBS is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_GPIO is not set
# CONFIG_CHARGER_LT3651 is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_CHARGER_BQ24257 is not set
# CONFIG_CHARGER_BQ24735 is not set
# CONFIG_CHARGER_BQ2515X is not set
# CONFIG_CHARGER_BQ25890 is not set
# CONFIG_CHARGER_SMB347 is not set
# CONFIG_BATTERY_GAUGE_LTC2941 is not set
# CONFIG_CHARGER_RT9455 is not set
# CONFIG_CHARGER_BD99954 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7314 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM1177 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7310 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_AS370 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_AXI_FAN_CONTROL is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_AMD_ENERGY is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ASPEED is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_CORSAIR_CPRO is not set
# CONFIG_SENSORS_DRIVETEMP is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_DELL_SMM is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_FTSTEUTATES is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
# CONFIG_SENSORS_IIO_HWMON is not set
# CONFIG_SENSORS_I5500 is not set
CONFIG_SENSORS_CORETEMP=m
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_POWR1220 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LTC2945 is not set
# CONFIG_SENSORS_LTC2947_I2C is not set
# CONFIG_SENSORS_LTC2947_SPI is not set
# CONFIG_SENSORS_LTC2990 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4222 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4260 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_MAX1111 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX31722 is not set
# CONFIG_SENSORS_MAX31730 is not set
# CONFIG_SENSORS_MAX6621 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MAX31790 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_TC654 is not set
# CONFIG_SENSORS_ADCXX is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM70 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NTC_THERMISTOR is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NCT7802 is not set
# CONFIG_SENSORS_NCT7904 is not set
# CONFIG_SENSORS_NPCM7XX is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SHT3x is not set
# CONFIG_SENSORS_SHTC1 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_STTS751 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_ADC128D818 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_ADS7871 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_INA3221 is not set
# CONFIG_SENSORS_TC74 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP103 is not set
# CONFIG_SENSORS_TMP108 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_TMP513 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83773G is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_XGENE is not set

#
# ACPI drivers
#
CONFIG_SENSORS_ACPI_POWER=m
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
CONFIG_THERMAL_NETLINK=y
# CONFIG_THERMAL_STATISTICS is not set
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
CONFIG_THERMAL_GOV_FAIR_SHARE=y
CONFIG_THERMAL_GOV_STEP_WISE=y
CONFIG_THERMAL_GOV_BANG_BANG=y
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_DEVFREQ_THERMAL is not set
# CONFIG_THERMAL_EMULATION is not set

#
# Intel thermal drivers
#
CONFIG_INTEL_POWERCLAMP=m
CONFIG_X86_PKG_TEMP_THERMAL=m
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# end of ACPI INT340X thermal drivers

# CONFIG_INTEL_PCH_THERMAL is not set
# end of Intel thermal drivers

# CONFIG_GENERIC_ADC_THERMAL is not set
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
CONFIG_WATCHDOG_SYSFS=y

#
# Watchdog Pretimeout Governors
#
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_WDAT_WDT is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_EBC_C384_WDT is not set
# CONFIG_F71808E_WDT is not set
# CONFIG_SP5100_TCO is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
CONFIG_I6300ESB_WDT=y
# CONFIG_IE6XX_WDT is not set
CONFIG_ITCO_WDT=y
CONFIG_ITCO_VENDOR_SUPPORT=y
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_TQMX86_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_INTEL_MEI_WDT is not set
# CONFIG_NI903X_WDT is not set
# CONFIG_NIC7018_WDT is not set
# CONFIG_MEN_A21_WDT is not set
# CONFIG_XEN_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
CONFIG_MFD_CORE=y
# CONFIG_MFD_AS3711 is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_AAT2870_CORE is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_BD9571MWV is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_MADERA is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_SPI is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_SPI is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_MFD_MP2629 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_HTC_I2CPLD is not set
# CONFIG_MFD_INTEL_QUARK_I2C_GPIO is not set
CONFIG_LPC_ICH=m
# CONFIG_LPC_SCH is not set
# CONFIG_INTEL_SOC_PMIC_CHTDC_TI is not set
CONFIG_MFD_INTEL_LPSS=y
CONFIG_MFD_INTEL_LPSS_ACPI=y
CONFIG_MFD_INTEL_LPSS_PCI=y
# CONFIG_MFD_INTEL_PMC_BXT is not set
# CONFIG_MFD_IQS62X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6360 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_EZX_PCAP is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SEC_CORE is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_TI_LMU is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65910 is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS65912_SPI is not set
# CONFIG_MFD_TPS80031 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TQMX86 is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_ARIZONA_SPI is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM831X_SPI is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# end of Multifunction device drivers

# CONFIG_REGULATOR is not set
# CONFIG_RC_CORE is not set
CONFIG_MEDIA_CEC_SUPPORT=y
# CONFIG_CEC_CH7322 is not set
# CONFIG_CEC_GPIO is not set
# CONFIG_CEC_SECO is not set
# CONFIG_USB_PULSE8_CEC is not set
# CONFIG_USB_RAINSHADOW_CEC is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_INTEL_GTT=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=64
CONFIG_VGA_SWITCHEROO=y
CONFIG_DRM=m
CONFIG_DRM_DP_AUX_CHARDEV=y
# CONFIG_DRM_DEBUG_SELFTEST is not set
CONFIG_DRM_KMS_HELPER=m
CONFIG_DRM_KMS_FB_HELPER=y
# CONFIG_DRM_DEBUG_DP_MST_TOPOLOGY_REFS is not set
CONFIG_DRM_FBDEV_EMULATION=y
CONFIG_DRM_FBDEV_OVERALLOC=100
# CONFIG_DRM_FBDEV_LEAK_PHYS_SMEM is not set
CONFIG_DRM_LOAD_EDID_FIRMWARE=y
# CONFIG_DRM_DP_CEC is not set
CONFIG_DRM_TTM=m
CONFIG_DRM_TTM_DMA_PAGE_POOL=y
CONFIG_DRM_VRAM_HELPER=m
CONFIG_DRM_TTM_HELPER=m

#
# I2C encoder or helper chips
#
# CONFIG_DRM_I2C_CH7006 is not set
# CONFIG_DRM_I2C_SIL164 is not set
# CONFIG_DRM_I2C_NXP_TDA998X is not set
# CONFIG_DRM_I2C_NXP_TDA9950 is not set
# end of I2C encoder or helper chips

#
# ARM devices
#
# end of ARM devices

# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_AMDGPU is not set
# CONFIG_DRM_NOUVEAU is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_VGEM is not set
# CONFIG_DRM_VKMS is not set
# CONFIG_DRM_VMWGFX is not set
# CONFIG_DRM_GMA500 is not set
# CONFIG_DRM_UDL is not set
CONFIG_DRM_AST=m
# CONFIG_DRM_MGAG200 is not set
# CONFIG_DRM_QXL is not set
# CONFIG_DRM_BOCHS is not set
# CONFIG_DRM_VIRTIO_GPU is not set
CONFIG_DRM_PANEL=y

#
# Display Panels
#
# end of Display Panels

CONFIG_DRM_BRIDGE=y
CONFIG_DRM_PANEL_BRIDGE=y

#
# Display Interface Bridges
#
# CONFIG_DRM_ANALOGIX_ANX78XX is not set
# end of Display Interface Bridges

# CONFIG_DRM_ETNAVIV is not set
# CONFIG_DRM_CIRRUS_QEMU is not set
# CONFIG_DRM_GM12U320 is not set
# CONFIG_TINYDRM_HX8357D is not set
# CONFIG_TINYDRM_ILI9225 is not set
# CONFIG_TINYDRM_ILI9341 is not set
# CONFIG_TINYDRM_ILI9486 is not set
# CONFIG_TINYDRM_MI0283QT is not set
# CONFIG_TINYDRM_REPAPER is not set
# CONFIG_TINYDRM_ST7586 is not set
# CONFIG_TINYDRM_ST7735R is not set
# CONFIG_DRM_XEN is not set
# CONFIG_DRM_VBOXVIDEO is not set
# CONFIG_DRM_LEGACY is not set
CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y

#
# Frame buffer Devices
#
CONFIG_FB_CMDLINE=y
CONFIG_FB_NOTIFY=y
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
CONFIG_FB_SYS_FILLRECT=m
CONFIG_FB_SYS_COPYAREA=m
CONFIG_FB_SYS_IMAGEBLIT=m
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=m
CONFIG_FB_DEFERRED_IO=y
# CONFIG_FB_MODE_HELPERS is not set
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
CONFIG_FB_VESA=y
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_XEN_FBDEV_FRONTEND is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SM712 is not set
# end of Frame buffer Devices

#
# Backlight & LCD device support
#
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_PWM is not set
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_QCOM_WLED is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3630A is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LP855X is not set
# CONFIG_BACKLIGHT_GPIO is not set
# CONFIG_BACKLIGHT_LV5207LP is not set
# CONFIG_BACKLIGHT_BD6107 is not set
# CONFIG_BACKLIGHT_ARCXCNN is not set
# end of Backlight & LCD device support

CONFIG_HDMI=y

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_VGACON_SOFT_SCROLLBACK_PERSISTENT_ENABLE_BY_DEFAULT is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER is not set
# end of Console display driver support

CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# end of Graphics support

# CONFIG_SOUND is not set

#
# HID support
#
CONFIG_HID=y
CONFIG_HID_BATTERY_STRENGTH=y
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACCUTOUCH is not set
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_ASUS is not set
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
# CONFIG_HID_BETOP_FF is not set
# CONFIG_HID_BIGBEN_FF is not set
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
# CONFIG_HID_CORSAIR is not set
# CONFIG_HID_COUGAR is not set
# CONFIG_HID_MACALLY is not set
# CONFIG_HID_CMEDIA is not set
# CONFIG_HID_CP2112 is not set
# CONFIG_HID_CREATIVE_SB0540 is not set
CONFIG_HID_CYPRESS=y
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELAN is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_GLORIOUS is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_GT683R is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_VIEWSONIC is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
CONFIG_HID_ITE=y
# CONFIG_HID_JABRA is not set
# CONFIG_HID_TWINHAN is not set
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LED is not set
# CONFIG_HID_LENOVO is not set
CONFIG_HID_LOGITECH=y
# CONFIG_HID_LOGITECH_DJ is not set
# CONFIG_HID_LOGITECH_HIDPP is not set
# CONFIG_LOGITECH_FF is not set
# CONFIG_LOGIRUMBLEPAD2_FF is not set
# CONFIG_LOGIG940_FF is not set
# CONFIG_LOGIWHEELS_FF is not set
CONFIG_HID_MAGICMOUSE=y
# CONFIG_HID_MALTRON is not set
# CONFIG_HID_MAYFLASH is not set
CONFIG_HID_REDRAGON=y
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTI is not set
CONFIG_HID_NTRIG=y
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PENMOUNT is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
CONFIG_HID_PLANTRONICS=y
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_RETRODE is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SONY is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEAM is not set
# CONFIG_HID_STEELSERIES is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_THINGM is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_U2FZERO is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_WIIMOTE is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set
# CONFIG_HID_MCP2221 is not set
# end of Special HID drivers

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y
# end of USB HID support

#
# I2C HID support
#
# CONFIG_I2C_HID is not set
# end of I2C HID support

#
# Intel ISH HID support
#
CONFIG_INTEL_ISH_HID=y
# CONFIG_INTEL_ISH_FIRMWARE_DOWNLOADER is not set
# end of Intel ISH HID support
# end of HID support

CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
# CONFIG_USB_LED_TRIG is not set
# CONFIG_USB_ULPI_BUS is not set
# CONFIG_USB_CONN_GPIO is not set
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_PRODUCTLIST is not set
# CONFIG_USB_OTG_DISABLE_EXTERNAL_HUB is not set
# CONFIG_USB_LEDS_TRIGGER_USBPORT is not set
CONFIG_USB_AUTOSUSPEND_DELAY=2
CONFIG_USB_MON=y

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=y
# CONFIG_USB_XHCI_DBGCAP is not set
CONFIG_USB_XHCI_PCI=y
# CONFIG_USB_XHCI_PCI_RENESAS is not set
# CONFIG_USB_XHCI_PLATFORM is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_EHCI_PCI=y
# CONFIG_USB_EHCI_FSL is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_FOTG210_HCD is not set
# CONFIG_USB_MAX3421_HCD is not set
CONFIG_USB_OHCI_HCD=y
CONFIG_USB_OHCI_HCD_PCI=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
# CONFIG_USB_STORAGE is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_USB_CDNS3 is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_SIMPLE is not set
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_F81232 is not set
# CONFIG_USB_SERIAL_F8153X is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_METRO is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MXUPORT is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QCAUX is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_XSENS_MT is not set
# CONFIG_USB_SERIAL_WISHBONE is not set
# CONFIG_USB_SERIAL_SSU100 is not set
# CONFIG_USB_SERIAL_QT2 is not set
# CONFIG_USB_SERIAL_UPD78F0730 is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_APPLE_MFI_FASTCHARGE is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set

#
# USB Physical Layer drivers
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_USB_ISP1301 is not set
# end of USB Physical Layer drivers

# CONFIG_USB_GADGET is not set
CONFIG_TYPEC=y
# CONFIG_TYPEC_TCPM is not set
CONFIG_TYPEC_UCSI=y
# CONFIG_UCSI_CCG is not set
CONFIG_UCSI_ACPI=y

#
# USB Type-C Multiplexer/DeMultiplexer Switch support
#
# CONFIG_TYPEC_MUX_PI3USB30532 is not set
# end of USB Type-C Multiplexer/DeMultiplexer Switch support

#
# USB Type-C Alternate Mode drivers
#
# CONFIG_TYPEC_DP_ALTMODE is not set
# end of USB Type-C Alternate Mode drivers

# CONFIG_USB_ROLE_SWITCH is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y
# CONFIG_LEDS_CLASS_FLASH is not set
# CONFIG_LEDS_CLASS_MULTICOLOR is not set
# CONFIG_LEDS_BRIGHTNESS_HW_CHANGED is not set

#
# LED drivers
#
# CONFIG_LEDS_APU is not set
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3532 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_GPIO is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_LP3952 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA963X is not set
# CONFIG_LEDS_DAC124S085 is not set
# CONFIG_LEDS_PWM is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_TLC591XX is not set
# CONFIG_LEDS_LM355x is not set

#
# LED driver for blink(1) USB RGB LED is under Special HID drivers (HID_THINGM)
#
# CONFIG_LEDS_BLINKM is not set
# CONFIG_LEDS_MLXCPLD is not set
# CONFIG_LEDS_MLXREG is not set
# CONFIG_LEDS_USER is not set
# CONFIG_LEDS_NIC78BX is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_DISK is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_ACTIVITY is not set
# CONFIG_LEDS_TRIGGER_GPIO is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_LEDS_TRIGGER_CAMERA is not set
# CONFIG_LEDS_TRIGGER_PANIC is not set
# CONFIG_LEDS_TRIGGER_NETDEV is not set
# CONFIG_LEDS_TRIGGER_PATTERN is not set
# CONFIG_LEDS_TRIGGER_AUDIO is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
# CONFIG_EDAC_DECODE_MCE is not set
CONFIG_EDAC_GHES=y
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_IE31200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I7CORE is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_I7300 is not set
# CONFIG_EDAC_SBRIDGE is not set
CONFIG_EDAC_SKX=m
# CONFIG_EDAC_I10NM is not set
# CONFIG_EDAC_PND2 is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_SYSTOHC is not set
# CONFIG_RTC_DEBUG is not set
CONFIG_RTC_NVMEM=y

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABEOZ9 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF85363 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3028 is not set
# CONFIG_RTC_DRV_RV8803 is not set
# CONFIG_RTC_DRV_SD3078 is not set

#
# SPI RTC drivers
#
# CONFIG_RTC_DRV_M41T93 is not set
# CONFIG_RTC_DRV_M41T94 is not set
# CONFIG_RTC_DRV_DS1302 is not set
# CONFIG_RTC_DRV_DS1305 is not set
# CONFIG_RTC_DRV_DS1343 is not set
# CONFIG_RTC_DRV_DS1347 is not set
# CONFIG_RTC_DRV_DS1390 is not set
# CONFIG_RTC_DRV_MAX6916 is not set
# CONFIG_RTC_DRV_R9701 is not set
# CONFIG_RTC_DRV_RX4581 is not set
# CONFIG_RTC_DRV_RX6110 is not set
# CONFIG_RTC_DRV_RS5C348 is not set
# CONFIG_RTC_DRV_MAX6902 is not set
# CONFIG_RTC_DRV_PCF2123 is not set
# CONFIG_RTC_DRV_MCP795 is not set
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_FTRTC010 is not set

#
# HID Sensor RTC drivers
#
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_ACPI=y
# CONFIG_ALTERA_MSGDMA is not set
# CONFIG_INTEL_IDMA64 is not set
# CONFIG_INTEL_IDXD is not set
CONFIG_INTEL_IOATDMA=m
# CONFIG_PLX_DMA is not set
# CONFIG_XILINX_ZYNQMP_DPDMA is not set
# CONFIG_QCOM_HIDMA_MGMT is not set
# CONFIG_QCOM_HIDMA is not set
CONFIG_DW_DMAC_CORE=y
# CONFIG_DW_DMAC is not set
CONFIG_DW_DMAC_PCI=y
# CONFIG_DW_EDMA is not set
# CONFIG_DW_EDMA_PCIE is not set
CONFIG_HSU_DMA=y
# CONFIG_SF_PDMA is not set

#
# DMA Clients
#
CONFIG_ASYNC_TX_DMA=y
# CONFIG_DMATEST is not set
CONFIG_DMA_ENGINE_RAID=y

#
# DMABUF options
#
CONFIG_SYNC_FILE=y
CONFIG_SW_SYNC=y
# CONFIG_UDMABUF is not set
# CONFIG_DMABUF_MOVE_NOTIFY is not set
# CONFIG_DMABUF_SELFTESTS is not set
# CONFIG_DMABUF_HEAPS is not set
# end of DMABUF options

CONFIG_DCA=m
CONFIG_AUXDISPLAY=y
# CONFIG_HD44780 is not set
# CONFIG_IMG_ASCII_LCD is not set
# CONFIG_CHARLCD_BL_OFF is not set
# CONFIG_CHARLCD_BL_ON is not set
CONFIG_CHARLCD_BL_FLASH=y
# CONFIG_UIO is not set
# CONFIG_VFIO is not set
CONFIG_IRQ_BYPASS_MANAGER=m
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO=y
CONFIG_VIRTIO_MENU=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_PCI_LEGACY=y
# CONFIG_VIRTIO_PMEM is not set
CONFIG_VIRTIO_BALLOON=y
CONFIG_VIRTIO_MEM=m
# CONFIG_VIRTIO_INPUT is not set
# CONFIG_VIRTIO_MMIO is not set
# CONFIG_VDPA is not set
CONFIG_VHOST_MENU=y
# CONFIG_VHOST_NET is not set
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV is not set
# end of Microsoft Hyper-V guest support

#
# Xen driver support
#
CONFIG_XEN_BALLOON=y
# CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is not set
CONFIG_XEN_SCRUB_PAGES_DEFAULT=y
# CONFIG_XEN_DEV_EVTCHN is not set
# CONFIG_XEN_BACKEND is not set
# CONFIG_XENFS is not set
CONFIG_XEN_SYS_HYPERVISOR=y
CONFIG_XEN_XENBUS_FRONTEND=y
# CONFIG_XEN_GNTDEV is not set
# CONFIG_XEN_GRANT_DEV_ALLOC is not set
# CONFIG_XEN_GRANT_DMA_ALLOC is not set
CONFIG_SWIOTLB_XEN=y
# CONFIG_XEN_PVCALLS_FRONTEND is not set
CONFIG_XEN_PRIVCMD=m
CONFIG_XEN_HAVE_PVMMU=y
CONFIG_XEN_EFI=y
CONFIG_XEN_AUTO_XLATE=y
CONFIG_XEN_ACPI=y
CONFIG_XEN_HAVE_VPMU=y
# end of Xen driver support

# CONFIG_GREYBUS is not set
CONFIG_STAGING=y
# CONFIG_COMEDI is not set
# CONFIG_RTL8192U is not set
# CONFIG_RTLLIB is not set
# CONFIG_RTS5208 is not set

#
# IIO staging drivers
#

#
# Accelerometers
#
# CONFIG_ADIS16203 is not set
# CONFIG_ADIS16240 is not set
# end of Accelerometers

#
# Analog to digital converters
#
# CONFIG_AD7816 is not set
# CONFIG_AD7280 is not set
# end of Analog to digital converters

#
# Analog digital bi-direction converters
#
# CONFIG_ADT7316 is not set
# end of Analog digital bi-direction converters

#
# Capacitance to digital converters
#
# CONFIG_AD7150 is not set
# CONFIG_AD7746 is not set
# end of Capacitance to digital converters

#
# Direct Digital Synthesis
#
# CONFIG_AD9832 is not set
# CONFIG_AD9834 is not set
# end of Direct Digital Synthesis

#
# Network Analyzer, Impedance Converters
#
# CONFIG_AD5933 is not set
# end of Network Analyzer, Impedance Converters

#
# Active energy metering IC
#
# CONFIG_ADE7854 is not set
# end of Active energy metering IC

#
# Resolver to digital converters
#
# CONFIG_AD2S1210 is not set
# end of Resolver to digital converters
# end of IIO staging drivers

# CONFIG_FB_SM750 is not set
# CONFIG_STAGING_MEDIA is not set

#
# Android
#
# CONFIG_ASHMEM is not set
CONFIG_ION=y
CONFIG_ION_SYSTEM_HEAP=y
# CONFIG_ION_CMA_HEAP is not set
# end of Android

# CONFIG_LTE_GDM724X is not set
# CONFIG_GS_FPGABOOT is not set
# CONFIG_UNISYSSPAR is not set
# CONFIG_FB_TFT is not set
# CONFIG_PI433 is not set

#
# Gasket devices
#
# CONFIG_STAGING_GASKET_FRAMEWORK is not set
# end of Gasket devices

# CONFIG_FIELDBUS_DEV is not set
# CONFIG_QLGE is not set
CONFIG_X86_PLATFORM_DEVICES=y
CONFIG_ACPI_WMI=m
# CONFIG_WMI_BMOF is not set
# CONFIG_ALIENWARE_WMI is not set
# CONFIG_HUAWEI_WMI is not set
# CONFIG_INTEL_WMI_SBL_FW_UPDATE is not set
# CONFIG_INTEL_WMI_THUNDERBOLT is not set
# CONFIG_MXM_WMI is not set
# CONFIG_PEAQ_WMI is not set
# CONFIG_XIAOMI_WMI is not set
# CONFIG_ACERHDF is not set
# CONFIG_ACER_WIRELESS is not set
# CONFIG_ACER_WMI is not set
# CONFIG_APPLE_GMUX is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_ASUS_WIRELESS is not set
# CONFIG_ASUS_WMI is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_DCDBAS is not set
# CONFIG_DELL_SMBIOS is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DELL_SMO8800 is not set
# CONFIG_DELL_WMI_AIO is not set
# CONFIG_DELL_WMI_LED is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_GPD_POCKET_FAN is not set
# CONFIG_HP_ACCEL is not set
# CONFIG_HP_WIRELESS is not set
# CONFIG_HP_WMI is not set
# CONFIG_IBM_RTL is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_INTEL_ATOMISP2_PM is not set
# CONFIG_INTEL_HID_EVENT is not set
# CONFIG_INTEL_INT0002_VGPIO is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_INTEL_VBTN is not set
# CONFIG_SURFACE3_WMI is not set
# CONFIG_SURFACE_3_POWER_OPREGION is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
# CONFIG_MSI_WMI is not set
# CONFIG_PCENGINES_APU2 is not set
# CONFIG_SAMSUNG_LAPTOP is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_TOSHIBA_HAPS is not set
# CONFIG_TOSHIBA_WMI is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_LG_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_SYSTEM76_ACPI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_I2C_MULTI_INSTANTIATE is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_INTEL_RST is not set
# CONFIG_INTEL_SMARTCONNECT is not set

#
# Intel Speed Select Technology interface support
#
# CONFIG_INTEL_SPEED_SELECT_INTERFACE is not set
# end of Intel Speed Select Technology interface support

# CONFIG_INTEL_TURBO_MAX_3 is not set
# CONFIG_INTEL_UNCORE_FREQ_CONTROL is not set
# CONFIG_INTEL_PMC_CORE is not set
# CONFIG_INTEL_PUNIT_IPC is not set
# CONFIG_INTEL_SCU_PCI is not set
# CONFIG_INTEL_SCU_PLATFORM is not set
CONFIG_PMC_ATOM=y
# CONFIG_MFD_CROS_EC is not set
# CONFIG_CHROME_PLATFORMS is not set
# CONFIG_MELLANOX_PLATFORM is not set
CONFIG_HAVE_CLK=y
CONFIG_CLKDEV_LOOKUP=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y
# CONFIG_COMMON_CLK_MAX9485 is not set
# CONFIG_COMMON_CLK_SI5341 is not set
# CONFIG_COMMON_CLK_SI5351 is not set
# CONFIG_COMMON_CLK_SI544 is not set
# CONFIG_COMMON_CLK_CDCE706 is not set
# CONFIG_COMMON_CLK_CS2000_CP is not set
# CONFIG_COMMON_CLK_PWM is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# end of Clock Source drivers

CONFIG_MAILBOX=y
CONFIG_PCC=y
# CONFIG_ALTERA_MBOX is not set
CONFIG_IOMMU_IOVA=y
CONFIG_IOASID=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
# end of Generic IOMMU Pagetable Support

# CONFIG_IOMMU_DEBUGFS is not set
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_IOMMU_DMA=y
CONFIG_AMD_IOMMU=y
# CONFIG_AMD_IOMMU_V2 is not set
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON is not set
CONFIG_IRQ_REMAP=y

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set
# end of Remoteproc drivers

#
# Rpmsg drivers
#
# CONFIG_RPMSG_QCOM_GLINK_RPM is not set
# CONFIG_RPMSG_VIRTIO is not set
# end of Rpmsg drivers

# CONFIG_SOUNDWIRE is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#
# end of Amlogic SoC drivers

#
# Aspeed SoC drivers
#
# end of Aspeed SoC drivers

#
# Broadcom SoC drivers
#
# end of Broadcom SoC drivers

#
# NXP/Freescale QorIQ SoC drivers
#
# end of NXP/Freescale QorIQ SoC drivers

#
# i.MX SoC drivers
#
# end of i.MX SoC drivers

#
# Qualcomm SoC drivers
#
# end of Qualcomm SoC drivers

# CONFIG_SOC_TI is not set

#
# Xilinx SoC drivers
#
# CONFIG_XILINX_VCU is not set
# end of Xilinx SoC drivers
# end of SOC (System On Chip) specific Drivers

CONFIG_PM_DEVFREQ=y

#
# DEVFREQ Governors
#
# CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND is not set
# CONFIG_DEVFREQ_GOV_PERFORMANCE is not set
# CONFIG_DEVFREQ_GOV_POWERSAVE is not set
# CONFIG_DEVFREQ_GOV_USERSPACE is not set
# CONFIG_DEVFREQ_GOV_PASSIVE is not set

#
# DEVFREQ Drivers
#
# CONFIG_PM_DEVFREQ_EVENT is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
CONFIG_IIO=y
CONFIG_IIO_BUFFER=y
CONFIG_IIO_BUFFER_CB=y
# CONFIG_IIO_BUFFER_HW_CONSUMER is not set
CONFIG_IIO_KFIFO_BUF=y
# CONFIG_IIO_CONFIGFS is not set
CONFIG_IIO_TRIGGER=y
CONFIG_IIO_CONSUMERS_PER_TRIGGER=2
# CONFIG_IIO_SW_DEVICE is not set
# CONFIG_IIO_SW_TRIGGER is not set

#
# Accelerometers
#
# CONFIG_ADIS16201 is not set
# CONFIG_ADIS16209 is not set
# CONFIG_ADXL345_I2C is not set
# CONFIG_ADXL345_SPI is not set
# CONFIG_ADXL372_SPI is not set
# CONFIG_ADXL372_I2C is not set
# CONFIG_BMA180 is not set
# CONFIG_BMA220 is not set
# CONFIG_BMA400 is not set
# CONFIG_BMC150_ACCEL is not set
# CONFIG_DA280 is not set
# CONFIG_DA311 is not set
# CONFIG_DMARD09 is not set
# CONFIG_DMARD10 is not set
# CONFIG_IIO_ST_ACCEL_3AXIS is not set
# CONFIG_KXSD9 is not set
# CONFIG_KXCJK1013 is not set
# CONFIG_MC3230 is not set
# CONFIG_MMA7455_I2C is not set
# CONFIG_MMA7455_SPI is not set
# CONFIG_MMA7660 is not set
# CONFIG_MMA8452 is not set
# CONFIG_MMA9551 is not set
# CONFIG_MMA9553 is not set
# CONFIG_MXC4005 is not set
# CONFIG_MXC6255 is not set
# CONFIG_SCA3000 is not set
# CONFIG_STK8312 is not set
# CONFIG_STK8BA50 is not set
# end of Accelerometers

#
# Analog to digital converters
#
# CONFIG_AD7091R5 is not set
# CONFIG_AD7124 is not set
# CONFIG_AD7192 is not set
# CONFIG_AD7266 is not set
# CONFIG_AD7291 is not set
# CONFIG_AD7292 is not set
# CONFIG_AD7298 is not set
# CONFIG_AD7476 is not set
# CONFIG_AD7606_IFACE_PARALLEL is not set
# CONFIG_AD7606_IFACE_SPI is not set
# CONFIG_AD7766 is not set
# CONFIG_AD7768_1 is not set
# CONFIG_AD7780 is not set
# CONFIG_AD7791 is not set
# CONFIG_AD7793 is not set
# CONFIG_AD7887 is not set
# CONFIG_AD7923 is not set
# CONFIG_AD7949 is not set
# CONFIG_AD799X is not set
# CONFIG_AD9467 is not set
# CONFIG_ADI_AXI_ADC is not set
# CONFIG_HI8435 is not set
# CONFIG_HX711 is not set
# CONFIG_INA2XX_ADC is not set
# CONFIG_LTC2471 is not set
# CONFIG_LTC2485 is not set
# CONFIG_LTC2496 is not set
# CONFIG_LTC2497 is not set
# CONFIG_MAX1027 is not set
# CONFIG_MAX11100 is not set
# CONFIG_MAX1118 is not set
# CONFIG_MAX1241 is not set
# CONFIG_MAX1363 is not set
# CONFIG_MAX9611 is not set
# CONFIG_MCP320X is not set
# CONFIG_MCP3422 is not set
# CONFIG_MCP3911 is not set
# CONFIG_NAU7802 is not set
# CONFIG_TI_ADC081C is not set
# CONFIG_TI_ADC0832 is not set
# CONFIG_TI_ADC084S021 is not set
# CONFIG_TI_ADC12138 is not set
# CONFIG_TI_ADC108S102 is not set
# CONFIG_TI_ADC128S052 is not set
# CONFIG_TI_ADC161S626 is not set
# CONFIG_TI_ADS1015 is not set
# CONFIG_TI_ADS7950 is not set
# CONFIG_TI_TLC4541 is not set
# CONFIG_XILINX_XADC is not set
# end of Analog to digital converters

#
# Analog Front Ends
#
# end of Analog Front Ends

#
# Amplifiers
#
# CONFIG_AD8366 is not set
# CONFIG_HMC425 is not set
# end of Amplifiers

#
# Chemical Sensors
#
# CONFIG_ATLAS_PH_SENSOR is not set
# CONFIG_ATLAS_EZO_SENSOR is not set
# CONFIG_BME680 is not set
# CONFIG_CCS811 is not set
# CONFIG_IAQCORE is not set
# CONFIG_SCD30_CORE is not set
# CONFIG_SENSIRION_SGP30 is not set
# CONFIG_SPS30 is not set
# CONFIG_VZ89X is not set
# end of Chemical Sensors

#
# Hid Sensor IIO Common
#
# end of Hid Sensor IIO Common

#
# SSP Sensor Common
#
# CONFIG_IIO_SSP_SENSORHUB is not set
# end of SSP Sensor Common

#
# Digital to analog converters
#
# CONFIG_AD5064 is not set
# CONFIG_AD5360 is not set
# CONFIG_AD5380 is not set
# CONFIG_AD5421 is not set
# CONFIG_AD5446 is not set
# CONFIG_AD5449 is not set
# CONFIG_AD5592R is not set
# CONFIG_AD5593R is not set
# CONFIG_AD5504 is not set
# CONFIG_AD5624R_SPI is not set
# CONFIG_AD5686_SPI is not set
# CONFIG_AD5696_I2C is not set
# CONFIG_AD5755 is not set
# CONFIG_AD5758 is not set
# CONFIG_AD5761 is not set
# CONFIG_AD5764 is not set
# CONFIG_AD5770R is not set
# CONFIG_AD5791 is not set
# CONFIG_AD7303 is not set
# CONFIG_AD8801 is not set
# CONFIG_DS4424 is not set
# CONFIG_LTC1660 is not set
# CONFIG_LTC2632 is not set
# CONFIG_M62332 is not set
# CONFIG_MAX517 is not set
# CONFIG_MCP4725 is not set
# CONFIG_MCP4922 is not set
# CONFIG_TI_DAC082S085 is not set
# CONFIG_TI_DAC5571 is not set
# CONFIG_TI_DAC7311 is not set
# CONFIG_TI_DAC7612 is not set
# end of Digital to analog converters

#
# IIO dummy driver
#
# end of IIO dummy driver

#
# Frequency Synthesizers DDS/PLL
#

#
# Clock Generator/Distribution
#
# CONFIG_AD9523 is not set
# end of Clock Generator/Distribution

#
# Phase-Locked Loop (PLL) frequency synthesizers
#
# CONFIG_ADF4350 is not set
# CONFIG_ADF4371 is not set
# end of Phase-Locked Loop (PLL) frequency synthesizers
# end of Frequency Synthesizers DDS/PLL

#
# Digital gyroscope sensors
#
# CONFIG_ADIS16080 is not set
# CONFIG_ADIS16130 is not set
# CONFIG_ADIS16136 is not set
# CONFIG_ADIS16260 is not set
# CONFIG_ADXRS450 is not set
# CONFIG_BMG160 is not set
# CONFIG_FXAS21002C is not set
# CONFIG_MPU3050_I2C is not set
# CONFIG_IIO_ST_GYRO_3AXIS is not set
# CONFIG_ITG3200 is not set
# end of Digital gyroscope sensors

#
# Health Sensors
#

#
# Heart Rate Monitors
#
# CONFIG_AFE4403 is not set
# CONFIG_AFE4404 is not set
# CONFIG_MAX30100 is not set
# CONFIG_MAX30102 is not set
# end of Heart Rate Monitors
# end of Health Sensors

#
# Humidity sensors
#
# CONFIG_AM2315 is not set
# CONFIG_DHT11 is not set
# CONFIG_HDC100X is not set
# CONFIG_HTS221 is not set
# CONFIG_HTU21 is not set
# CONFIG_SI7005 is not set
# CONFIG_SI7020 is not set
# end of Humidity sensors

#
# Inertial measurement units
#
# CONFIG_ADIS16400 is not set
# CONFIG_ADIS16460 is not set
# CONFIG_ADIS16475 is not set
# CONFIG_ADIS16480 is not set
# CONFIG_BMI160_I2C is not set
# CONFIG_BMI160_SPI is not set
# CONFIG_FXOS8700_I2C is not set
# CONFIG_FXOS8700_SPI is not set
# CONFIG_KMX61 is not set
# CONFIG_INV_ICM42600_I2C is not set
# CONFIG_INV_ICM42600_SPI is not set
# CONFIG_INV_MPU6050_I2C is not set
# CONFIG_INV_MPU6050_SPI is not set
# CONFIG_IIO_ST_LSM6DSX is not set
# end of Inertial measurement units

#
# Light sensors
#
# CONFIG_ACPI_ALS is not set
# CONFIG_ADJD_S311 is not set
# CONFIG_ADUX1020 is not set
# CONFIG_AL3010 is not set
# CONFIG_AL3320A is not set
# CONFIG_APDS9300 is not set
# CONFIG_APDS9960 is not set
# CONFIG_BH1750 is not set
# CONFIG_BH1780 is not set
# CONFIG_CM32181 is not set
# CONFIG_CM3232 is not set
# CONFIG_CM3323 is not set
# CONFIG_CM36651 is not set
# CONFIG_GP2AP002 is not set
# CONFIG_GP2AP020A00F is not set
# CONFIG_SENSORS_ISL29018 is not set
# CONFIG_SENSORS_ISL29028 is not set
# CONFIG_ISL29125 is not set
# CONFIG_JSA1212 is not set
# CONFIG_RPR0521 is not set
# CONFIG_LTR501 is not set
# CONFIG_LV0104CS is not set
# CONFIG_MAX44000 is not set
# CONFIG_MAX44009 is not set
# CONFIG_NOA1305 is not set
# CONFIG_OPT3001 is not set
# CONFIG_PA12203001 is not set
# CONFIG_SI1133 is not set
# CONFIG_SI1145 is not set
# CONFIG_STK3310 is not set
# CONFIG_ST_UVIS25 is not set
# CONFIG_TCS3414 is not set
# CONFIG_TCS3472 is not set
# CONFIG_SENSORS_TSL2563 is not set
# CONFIG_TSL2583 is not set
# CONFIG_TSL2772 is not set
# CONFIG_TSL4531 is not set
# CONFIG_US5182D is not set
# CONFIG_VCNL4000 is not set
# CONFIG_VCNL4035 is not set
# CONFIG_VEML6030 is not set
# CONFIG_VEML6070 is not set
# CONFIG_VL6180 is not set
# CONFIG_ZOPT2201 is not set
# end of Light sensors

#
# Magnetometer sensors
#
# CONFIG_AK8975 is not set
# CONFIG_AK09911 is not set
# CONFIG_BMC150_MAGN_I2C is not set
# CONFIG_BMC150_MAGN_SPI is not set
# CONFIG_MAG3110 is not set
# CONFIG_MMC35240 is not set
# CONFIG_IIO_ST_MAGN_3AXIS is not set
# CONFIG_SENSORS_HMC5843_I2C is not set
# CONFIG_SENSORS_HMC5843_SPI is not set
# CONFIG_SENSORS_RM3100_I2C is not set
# CONFIG_SENSORS_RM3100_SPI is not set
# end of Magnetometer sensors

#
# Multiplexers
#
# end of Multiplexers

#
# Inclinometer sensors
#
# end of Inclinometer sensors

#
# Triggers - standalone
#
# CONFIG_IIO_INTERRUPT_TRIGGER is not set
# CONFIG_IIO_SYSFS_TRIGGER is not set
# end of Triggers - standalone

#
# Linear and angular position sensors
#
# end of Linear and angular position sensors

#
# Digital potentiometers
#
# CONFIG_AD5272 is not set
# CONFIG_DS1803 is not set
# CONFIG_MAX5432 is not set
# CONFIG_MAX5481 is not set
# CONFIG_MAX5487 is not set
# CONFIG_MCP4018 is not set
# CONFIG_MCP4131 is not set
# CONFIG_MCP4531 is not set
# CONFIG_MCP41010 is not set
# CONFIG_TPL0102 is not set
# end of Digital potentiometers

#
# Digital potentiostats
#
# CONFIG_LMP91000 is not set
# end of Digital potentiostats

#
# Pressure sensors
#
# CONFIG_ABP060MG is not set
# CONFIG_BMP280 is not set
# CONFIG_DLHL60D is not set
# CONFIG_DPS310 is not set
# CONFIG_HP03 is not set
# CONFIG_ICP10100 is not set
# CONFIG_MPL115_I2C is not set
# CONFIG_MPL115_SPI is not set
# CONFIG_MPL3115 is not set
# CONFIG_MS5611 is not set
# CONFIG_MS5637 is not set
# CONFIG_IIO_ST_PRESS is not set
# CONFIG_T5403 is not set
# CONFIG_HP206C is not set
# CONFIG_ZPA2326 is not set
# end of Pressure sensors

#
# Lightning sensors
#
# CONFIG_AS3935 is not set
# end of Lightning sensors

#
# Proximity and distance sensors
#
# CONFIG_ISL29501 is not set
# CONFIG_LIDAR_LITE_V2 is not set
# CONFIG_MB1232 is not set
# CONFIG_PING is not set
# CONFIG_RFD77402 is not set
# CONFIG_SRF04 is not set
# CONFIG_SX9310 is not set
# CONFIG_SX9500 is not set
# CONFIG_SRF08 is not set
# CONFIG_VCNL3020 is not set
# CONFIG_VL53L0X_I2C is not set
# end of Proximity and distance sensors

#
# Resolver to digital converters
#
# CONFIG_AD2S90 is not set
# CONFIG_AD2S1200 is not set
# end of Resolver to digital converters

#
# Temperature sensors
#
# CONFIG_LTC2983 is not set
# CONFIG_MAXIM_THERMOCOUPLE is not set
# CONFIG_MLX90614 is not set
# CONFIG_MLX90632 is not set
# CONFIG_TMP006 is not set
# CONFIG_TMP007 is not set
# CONFIG_TSYS01 is not set
# CONFIG_TSYS02D is not set
# CONFIG_MAX31856 is not set
# end of Temperature sensors

# CONFIG_NTB is not set
# CONFIG_VME_BUS is not set
CONFIG_PWM=y
CONFIG_PWM_SYSFS=y
# CONFIG_PWM_DEBUG is not set
# CONFIG_PWM_LPSS_PCI is not set
# CONFIG_PWM_LPSS_PLATFORM is not set
# CONFIG_PWM_PCA9685 is not set

#
# IRQ chip support
#
# end of IRQ chip support

# CONFIG_IPACK_BUS is not set
# CONFIG_RESET_CONTROLLER is not set

#
# PHY Subsystem
#
CONFIG_GENERIC_PHY=y
# CONFIG_BCM_KONA_USB2_PHY is not set
# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_PHY_CPCAP_USB is not set
# CONFIG_PHY_INTEL_EMMC is not set
# end of PHY Subsystem

CONFIG_POWERCAP=y
CONFIG_INTEL_RAPL_CORE=m
CONFIG_INTEL_RAPL=m
# CONFIG_IDLE_INJECT is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
# end of Performance monitor support

CONFIG_RAS=y
# CONFIG_RAS_CEC is not set
# CONFIG_USB4 is not set

#
# Android
#
CONFIG_ANDROID=y
# CONFIG_ANDROID_BINDER_IPC is not set
# end of Android

CONFIG_LIBNVDIMM=m
# CONFIG_BLK_DEV_PMEM is not set
# CONFIG_ND_BLK is not set
CONFIG_ND_CLAIM=y
CONFIG_BTT=y
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_NVDIMM_KEYS=y
CONFIG_DAX=y
CONFIG_DEV_DAX=y
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_KMEM=y
CONFIG_DEV_DAX_PMEM_COMPAT=m
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y

#
# HW tracing support
#
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# end of HW tracing support

# CONFIG_FPGA is not set
# CONFIG_TEE is not set
CONFIG_PM_OPP=y
# CONFIG_UNISYS_VISORBUS is not set
# CONFIG_SIOX is not set
# CONFIG_SLIMBUS is not set
# CONFIG_INTERCONNECT is not set
# CONFIG_COUNTER is not set
# CONFIG_MOST is not set
# end of Device Drivers

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_VALIDATE_FS_PARSER is not set
# CONFIG_FSINFO is not set
CONFIG_FS_IOMAP=y
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=m
CONFIG_EXT4_USE_FOR_EXT2=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=m
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=m
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_XFS_FS=m
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
CONFIG_XFS_ONLINE_SCRUB=y
CONFIG_XFS_ONLINE_REPAIR=y
CONFIG_XFS_DEBUG=y
CONFIG_XFS_ASSERT_FATAL=y
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
# CONFIG_ZONEFS_FS is not set
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
CONFIG_EXPORTFS_BLOCK_OPS=y
CONFIG_FILE_LOCKING=y
CONFIG_MANDATORY_FILE_LOCKING=y
CONFIG_FS_ENCRYPTION=y
CONFIG_FS_ENCRYPTION_ALGS=m
# CONFIG_FS_VERITY is not set
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
# CONFIG_MOUNT_NOTIFICATIONS is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=y
CONFIG_AUTOFS_FS=y
CONFIG_FUSE_FS=m
# CONFIG_CUSE is not set
CONFIG_VIRTIO_FS=m
CONFIG_OVERLAY_FS=m
CONFIG_OVERLAY_FS_REDIRECT_DIR=y
CONFIG_OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW=y
# CONFIG_OVERLAY_FS_INDEX is not set
# CONFIG_OVERLAY_FS_XINO_AUTO is not set
# CONFIG_OVERLAY_FS_METACOPY is not set

#
# Caches
#
# CONFIG_FSCACHE is not set
# end of Caches

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
# CONFIG_UDF_FS is not set
# end of CD-ROM/DVD Filesystems

#
# DOS/FAT/EXFAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_EXFAT_FS is not set
# CONFIG_NTFS_FS is not set
# end of DOS/FAT/EXFAT/NT Filesystems

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
# CONFIG_PROC_VMCORE_DEVICE_DUMP is not set
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_PROC_CHILDREN=y
CONFIG_PROC_PID_ARCH_STATUS=y
CONFIG_PROC_CPU_RESCTRL=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
# CONFIG_TMPFS_INODE64 is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_MEMFD_CREATE=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_EFIVAR_FS=m
# end of Pseudo filesystems

CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ORANGEFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFLATE_COMPRESS=y
# CONFIG_PSTORE_LZO_COMPRESS is not set
# CONFIG_PSTORE_LZ4_COMPRESS is not set
# CONFIG_PSTORE_LZ4HC_COMPRESS is not set
# CONFIG_PSTORE_842_COMPRESS is not set
# CONFIG_PSTORE_ZSTD_COMPRESS is not set
CONFIG_PSTORE_COMPRESS=y
CONFIG_PSTORE_DEFLATE_COMPRESS_DEFAULT=y
CONFIG_PSTORE_COMPRESS_DEFAULT="deflate"
CONFIG_PSTORE_CONSOLE=y
CONFIG_PSTORE_PMSG=y
# CONFIG_PSTORE_FTRACE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_PSTORE_BLK is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_EROFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
# CONFIG_NFS_V2 is not set
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
# CONFIG_NFS_V4 is not set
# CONFIG_NFS_SWAP is not set
CONFIG_ROOT_NFS=y
CONFIG_NFS_DEBUG=y
CONFIG_NFS_DISABLE_UDP_SUPPORT=y
# CONFIG_NFSD is not set
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_DEBUG=y
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=y
CONFIG_9P_FS_POSIX_ACL=y
# CONFIG_9P_FS_SECURITY is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
# CONFIG_NLS_UTF8 is not set
# CONFIG_DLM is not set
# CONFIG_UNICODE is not set
CONFIG_IO_WQ=y
# end of File systems

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_REQUEST_CACHE is not set
CONFIG_PERSISTENT_KEYRINGS=y
CONFIG_TRUSTED_KEYS=y
CONFIG_ENCRYPTED_KEYS=y
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITY_WRITABLE_HOOKS=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_PAGE_TABLE_ISOLATION=y
CONFIG_SECURITY_NETWORK_XFRM=y
CONFIG_SECURITY_PATH=y
CONFIG_INTEL_TXT=y
CONFIG_LSM_MMAP_MIN_ADDR=65535
CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y
CONFIG_HARDENED_USERCOPY=y
CONFIG_HARDENED_USERCOPY_FALLBACK=y
# CONFIG_HARDENED_USERCOPY_PAGESPAN is not set
# CONFIG_FORTIFY_SOURCE is not set
# CONFIG_STATIC_USERMODEHELPER is not set
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
CONFIG_SECURITY_SELINUX_SIDTAB_HASH_BITS=9
CONFIG_SECURITY_SELINUX_SID2STR_CACHE_SIZE=256
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
CONFIG_SECURITY_APPARMOR=y
CONFIG_SECURITY_APPARMOR_HASH=y
CONFIG_SECURITY_APPARMOR_HASH_DEFAULT=y
# CONFIG_SECURITY_APPARMOR_DEBUG is not set
# CONFIG_SECURITY_LOADPIN is not set
CONFIG_SECURITY_YAMA=y
# CONFIG_SECURITY_SAFESETID is not set
# CONFIG_SECURITY_LOCKDOWN_LSM is not set
CONFIG_INTEGRITY=y
CONFIG_INTEGRITY_SIGNATURE=y
CONFIG_INTEGRITY_ASYMMETRIC_KEYS=y
CONFIG_INTEGRITY_TRUSTED_KEYRING=y
# CONFIG_INTEGRITY_PLATFORM_KEYRING is not set
CONFIG_INTEGRITY_AUDIT=y
CONFIG_IMA=y
CONFIG_IMA_MEASURE_PCR_IDX=10
CONFIG_IMA_LSM_RULES=y
# CONFIG_IMA_TEMPLATE is not set
CONFIG_IMA_NG_TEMPLATE=y
# CONFIG_IMA_SIG_TEMPLATE is not set
CONFIG_IMA_DEFAULT_TEMPLATE="ima-ng"
CONFIG_IMA_DEFAULT_HASH_SHA1=y
# CONFIG_IMA_DEFAULT_HASH_SHA256 is not set
# CONFIG_IMA_DEFAULT_HASH_SHA512 is not set
CONFIG_IMA_DEFAULT_HASH="sha1"
# CONFIG_IMA_WRITE_POLICY is not set
# CONFIG_IMA_READ_POLICY is not set
CONFIG_IMA_APPRAISE=y
# CONFIG_IMA_ARCH_POLICY is not set
# CONFIG_IMA_APPRAISE_BUILD_POLICY is not set
CONFIG_IMA_APPRAISE_BOOTPARAM=y
# CONFIG_IMA_APPRAISE_MODSIG is not set
CONFIG_IMA_TRUSTED_KEYRING=y
# CONFIG_IMA_BLACKLIST_KEYRING is not set
# CONFIG_IMA_LOAD_X509 is not set
CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS=y
CONFIG_IMA_QUEUE_EARLY_BOOT_KEYS=y
# CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT is not set
CONFIG_EVM=y
CONFIG_EVM_ATTR_FSUUID=y
# CONFIG_EVM_ADD_XATTRS is not set
# CONFIG_EVM_LOAD_X509 is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_APPARMOR is not set
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_LSM="lockdown,yama,loadpin,safesetid,integrity,selinux,smack,tomoyo,apparmor"

#
# Kernel hardening options
#

#
# Memory initialization
#
CONFIG_INIT_STACK_NONE=y
# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
# end of Memory initialization
# end of Kernel hardening options
# end of Security options

CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_CRYPTD=m
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_SIMD=m
CONFIG_CRYPTO_GLUE_HELPER_X86=m

#
# Public-key cryptography
#
CONFIG_CRYPTO_RSA=y
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
# CONFIG_CRYPTO_ECRDSA is not set
# CONFIG_CRYPTO_CURVE25519 is not set
# CONFIG_CRYPTO_CURVE25519_X86 is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
CONFIG_CRYPTO_GCM=y
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
# CONFIG_CRYPTO_AEGIS128 is not set
# CONFIG_CRYPTO_AEGIS128_AESNI_SSE2 is not set
CONFIG_CRYPTO_SEQIV=y
# CONFIG_CRYPTO_ECHAINIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CFB is not set
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_OFB is not set
# CONFIG_CRYPTO_PCBC is not set
CONFIG_CRYPTO_XTS=y
# CONFIG_CRYPTO_KEYWRAP is not set
# CONFIG_CRYPTO_NHPOLY1305_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_AVX2 is not set
# CONFIG_CRYPTO_ADIANTUM is not set
# CONFIG_CRYPTO_ESSIV is not set

#
# Hash modes
#
# CONFIG_CRYPTO_CMAC is not set
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=m
# CONFIG_CRYPTO_CRC32 is not set
CONFIG_CRYPTO_CRC32_PCLMUL=m
# CONFIG_CRYPTO_XXHASH is not set
# CONFIG_CRYPTO_BLAKE2B is not set
# CONFIG_CRYPTO_BLAKE2S is not set
# CONFIG_CRYPTO_BLAKE2S_X86 is not set
CONFIG_CRYPTO_CRCT10DIF=y
CONFIG_CRYPTO_CRCT10DIF_PCLMUL=m
CONFIG_CRYPTO_GHASH=y
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA1_SSSE3=y
CONFIG_CRYPTO_SHA256_SSSE3=y
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
# CONFIG_CRYPTO_SHA3 is not set
# CONFIG_CRYPTO_SM3 is not set
# CONFIG_CRYPTO_STREEBOG is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL=m

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
CONFIG_CRYPTO_AES_NI_INTEL=m
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_CHACHA20 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_SM4 is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_LZO=y
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set
# CONFIG_CRYPTO_ZSTD is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
CONFIG_CRYPTO_DRBG_HASH=y
CONFIG_CRYPTO_DRBG_CTR=y
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CRYPTO_USER_API_SKCIPHER=y
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_HASH_INFO=y

#
# Crypto library routines
#
CONFIG_CRYPTO_LIB_AES=y
# CONFIG_CRYPTO_LIB_BLAKE2S is not set
# CONFIG_CRYPTO_LIB_CHACHA is not set
# CONFIG_CRYPTO_LIB_CURVE25519 is not set
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11
# CONFIG_CRYPTO_LIB_POLY1305 is not set
# CONFIG_CRYPTO_LIB_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_LIB_SHA256=y
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_ATMEL_ECC is not set
# CONFIG_CRYPTO_DEV_ATMEL_SHA204A is not set
CONFIG_CRYPTO_DEV_CCP=y
# CONFIG_CRYPTO_DEV_CCP_DD is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
# CONFIG_CRYPTO_DEV_NITROX_CNN55XX is not set
# CONFIG_CRYPTO_DEV_VIRTIO is not set
# CONFIG_CRYPTO_DEV_SAFEXCEL is not set
# CONFIG_CRYPTO_DEV_AMLOGIC_GXL is not set
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=y
# CONFIG_ASYMMETRIC_TPM_KEY_SUBTYPE is not set
CONFIG_X509_CERTIFICATE_PARSER=y
# CONFIG_PKCS8_PRIVATE_KEY_PARSER is not set
CONFIG_PKCS7_MESSAGE_PARSER=y
# CONFIG_PKCS7_TEST_KEY is not set
CONFIG_SIGNED_PE_FILE_VERIFICATION=y

#
# Certificates for signature checking
#
CONFIG_MODULE_SIG_KEY="certs/signing_key.pem"
CONFIG_SYSTEM_TRUSTED_KEYRING=y
CONFIG_SYSTEM_TRUSTED_KEYS=""
# CONFIG_SYSTEM_EXTRA_CERTIFICATE is not set
# CONFIG_SECONDARY_TRUSTED_KEYRING is not set
CONFIG_SYSTEM_BLACKLIST_KEYRING=y
CONFIG_SYSTEM_BLACKLIST_HASH_LIST=""
# end of Certificates for signature checking

CONFIG_BINARY_PRINTF=y

#
# Library routines
#
# CONFIG_PACKING is not set
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
# CONFIG_CORDIC is not set
# CONFIG_PRIME_NUMBERS is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_ARCH_USE_SYM_ANNOTATIONS=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC64 is not set
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=m
# CONFIG_CRC8 is not set
CONFIG_XXHASH=y
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_DECOMPRESS=y
CONFIG_ZSTD_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_DECOMPRESS_LZ4=y
CONFIG_DECOMPRESS_ZSTD=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_INTERVAL_TREE=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_DMA_OPS=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED=y
CONFIG_SWIOTLB=y
CONFIG_DMA_COHERENT_POOL=y
CONFIG_DMA_CMA=y

#
# Default contiguous memory area size:
#
CONFIG_CMA_SIZE_MBYTES=200
CONFIG_CMA_SIZE_SEL_MBYTES=y
# CONFIG_CMA_SIZE_SEL_PERCENTAGE is not set
# CONFIG_CMA_SIZE_SEL_MIN is not set
# CONFIG_CMA_SIZE_SEL_MAX is not set
CONFIG_CMA_ALIGNMENT=8
# CONFIG_DMA_API_DEBUG is not set
CONFIG_SGL_ALLOC=y
CONFIG_IOMMU_HELPER=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPUMASK_OFFSTACK=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
CONFIG_CLZ_TAB=y
CONFIG_IRQ_POLL=y
CONFIG_MPILIB=y
CONFIG_SIGNATURE=y
CONFIG_OID_REGISTRY=y
CONFIG_UCS2_STRING=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_FONT_SUPPORT=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_MEMREGION=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_ARCH_HAS_UACCESS_MCSAFE=y
CONFIG_ARCH_STACKWALK=y
CONFIG_SBITMAP=y
# CONFIG_STRING_SELFTEST is not set
# end of Library routines

#
# Kernel hacking
#

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
# CONFIG_PRINTK_CALLER is not set
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
CONFIG_BOOT_PRINTK_DELAY=y
CONFIG_DYNAMIC_DEBUG=y
CONFIG_DYNAMIC_DEBUG_CORE=y
CONFIG_SYMBOLIC_ERRNAME=y
CONFIG_DEBUG_BUGVERBOSE=y
# end of printk and dmesg options

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_REDUCED=y
# CONFIG_DEBUG_INFO_COMPRESSED is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
CONFIG_DEBUG_INFO_DWARF4=y
# CONFIG_GDB_SCRIPTS is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_STRIP_ASM_SYMS=y
# CONFIG_READABLE_ASM is not set
# CONFIG_HEADERS_INSTALL is not set
CONFIG_DEBUG_SECTION_MISMATCH=y
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
# CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_32B is not set
CONFIG_STACK_VALIDATION=y
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# end of Compile-time checks and compiler options

#
# Generic Kernel Debugging Instruments
#
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
# CONFIG_DEBUG_FS_DISALLOW_MOUNT is not set
# CONFIG_DEBUG_FS_ALLOW_NONE is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_UBSAN is not set
# end of Generic Kernel Debugging Instruments

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_PAGE_OWNER is not set
# CONFIG_PAGE_POISONING is not set
CONFIG_DEBUG_PAGE_REF=y
CONFIG_DEBUG_RODATA_TEST=y
CONFIG_ARCH_HAS_DEBUG_WX=y
# CONFIG_DEBUG_WX is not set
CONFIG_GENERIC_PTDUMP=y
# CONFIG_PTDUMP_DEBUGFS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_SCHED_STACK_END_CHECK is not set
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_VMACACHE is not set
# CONFIG_DEBUG_VM_RB is not set
CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_DEBUG_VM_PGTABLE=y
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
CONFIG_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_MEMORY_INIT is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
# CONFIG_KASAN is not set
# end of Memory Debugging

CONFIG_DEBUG_SHIRQ=y

#
# Debug Oops, Lockups and Hangs
#
CONFIG_PANIC_ON_OOPS=y
CONFIG_PANIC_ON_OOPS_VALUE=1
CONFIG_PANIC_TIMEOUT=0
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_WQ_WATCHDOG is not set
# CONFIG_TEST_LOCKUP is not set
# end of Debug Oops, Lockups and Hangs

#
# Scheduler Debugging
#
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_INFO=y
CONFIG_SCHEDSTATS=y
# end of Scheduler Debugging

# CONFIG_DEBUG_TIMEKEEPING is not set
CONFIG_DEBUG_PREEMPT=y

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RAW_LOCK_NESTING=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
# CONFIG_DEBUG_LOCKDEP is not set
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

CONFIG_TRACE_IRQFLAGS=y
CONFIG_TRACE_IRQFLAGS_NMI=y
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set

#
# Debug kernel data structures
#
CONFIG_DEBUG_LIST=y
# CONFIG_DEBUG_PLIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# end of Debug kernel data structures

# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
CONFIG_PROVE_RCU=y
# CONFIG_RCU_PERF_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
CONFIG_LATENCYTOP=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_BOOTTIME_TRACING=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_STACK_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_PREEMPT_TRACER is not set
CONFIG_SCHED_TRACER=y
CONFIG_HWLAT_TRACER=y
# CONFIG_MMIOTRACE is not set
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
# CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENTS=y
# CONFIG_KPROBE_EVENTS_ON_NOTRACE is not set
CONFIG_UPROBE_EVENTS=y
CONFIG_BPF_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
# CONFIG_BPF_KPROBE_OVERRIDE is not set
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_TRACING_MAP=y
CONFIG_SYNTH_EVENTS=y
CONFIG_HIST_TRIGGERS=y
# CONFIG_TRACE_EVENT_INJECT is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
# CONFIG_SYNTH_EVENT_GEN_TEST is not set
# CONFIG_KPROBE_EVENT_GEN_TEST is not set
# CONFIG_HIST_TRIGGERS_DEBUG is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KCSAN=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
CONFIG_STRICT_DEVMEM=y
# CONFIG_IO_STRICT_DEVMEM is not set

#
# x86 Debugging
#
# CONFIG_DEBUG_AID_FOR_SYZBOT is not set
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_EARLY_PRINTK_USB=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_EARLY_PRINTK_USB_XDBC is not set
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_TLBFLUSH is not set
# CONFIG_IOMMU_DEBUG is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
CONFIG_X86_DEBUG_FPU=y
# CONFIG_PUNIT_ATOM_DEBUG is not set
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set
# CONFIG_UNWINDER_GUESS is not set
# end of x86 Debugging

#
# Kernel Testing and Coverage
#
# CONFIG_KUNIT is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
CONFIG_FUNCTION_ERROR_INJECTION=y
CONFIG_FAULT_INJECTION=y
# CONFIG_FAILSLAB is not set
# CONFIG_FAIL_PAGE_ALLOC is not set
CONFIG_FAIL_MAKE_REQUEST=y
# CONFIG_FAIL_IO_TIMEOUT is not set
# CONFIG_FAIL_FUTEX is not set
CONFIG_FAULT_INJECTION_DEBUG_FS=y
# CONFIG_FAIL_FUNCTION is not set
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
# CONFIG_KCOV is not set
CONFIG_RUNTIME_TESTING_MENU=y
# CONFIG_LKDTM is not set
# CONFIG_TEST_LIST_SORT is not set
# CONFIG_TEST_MIN_HEAP is not set
# CONFIG_TEST_SORT is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_REED_SOLOMON_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
CONFIG_ATOMIC64_SELFTEST=y
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_TEST_STRING_HELPERS is not set
# CONFIG_TEST_STRSCPY is not set
# CONFIG_TEST_KSTRTOX is not set
# CONFIG_TEST_PRINTF is not set
# CONFIG_TEST_BITMAP is not set
# CONFIG_TEST_BITFIELD is not set
# CONFIG_TEST_UUID is not set
# CONFIG_TEST_XARRAY is not set
# CONFIG_TEST_OVERFLOW is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_HASH is not set
# CONFIG_TEST_IDA is not set
# CONFIG_TEST_LKM is not set
# CONFIG_TEST_BITOPS is not set
# CONFIG_TEST_VMALLOC is not set
# CONFIG_TEST_USER_COPY is not set
# CONFIG_TEST_BPF is not set
# CONFIG_TEST_BLACKHOLE_DEV is not set
# CONFIG_FIND_BIT_BENCHMARK is not set
# CONFIG_TEST_FIRMWARE is not set
# CONFIG_TEST_SYSCTL is not set
# CONFIG_TEST_UDELAY is not set
# CONFIG_TEST_STATIC_KEYS is not set
# CONFIG_TEST_KMOD is not set
# CONFIG_TEST_DEBUG_VIRTUAL is not set
# CONFIG_TEST_MEMCAT_P is not set
# CONFIG_TEST_LIVEPATCH is not set
# CONFIG_TEST_STACKINIT is not set
# CONFIG_TEST_MEMINIT is not set
# CONFIG_TEST_FPU is not set
# CONFIG_MEMTEST is not set
# end of Kernel Testing and Coverage
# end of Kernel hacking

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip
  2020-08-15  9:49                                 ` Alex Shi
@ 2020-08-17 15:38                                   ` Alexander Duyck
  0 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-17 15:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: Yang Shi, kbuild test robot, Rong Chen, Konstantin Khlebnikov,
	Kirill A. Shutemov, Hugh Dickins, LKML, Daniel Jordan, linux-mm,
	Shakeel Butt, Matthew Wilcox, Johannes Weiner, Tejun Heo,
	cgroups, Andrew Morton, Wei Yang, Mel Gorman, Joonsoo Kim

On Sat, Aug 15, 2020 at 2:51 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> 在 2020/8/15 上午5:15, Alexander Duyck 写道:
> > On Fri, Aug 14, 2020 at 7:24 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> >>
> >> On Fri, Aug 14, 2020 at 12:19 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>>
> >>>
> >>>
> >>> 在 2020/8/13 下午12:02, Alexander Duyck 写道:
> >>>>
> >>>> Since we have dropped the late abort case we can drop the code that was
> >>>> clearing the LRU flag and calling page_put since the abort case will now
> >>>> not be holding a reference to a page.
> >>>>
> >>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >>>
> >>> seems the case-lru-file-mmap-read case drop about 3% on this patch in a rough testing.
> >>> on my 80 core machine.
> >>
> >> I'm not sure how it could have that much impact on the performance
> >> since the total effect would just be dropping what should be a
> >> redundant test since we tested the skip bit before we took the LRU
> >> bit, so we shouldn't need to test it again after.
> >>
> >> I finally got my test setup working last night. I'll have to do some
> >> testing in my environment and I can start trying to see what is going
> >> on.
> >
> > So I ran the case-lru-file-mmap-read a few times and I don't see how
> > it is supposed to be testing the compaction code. It doesn't seem like
> > compaction is running at least on my system as a result of the test
> > script.
>
> atteched my kernel config, it is used on mine machine,

I'm just wondering what the margin of error is on the tests you are
running. What is the variance between runs? I'm just wondering if 3%
falls into the range of noise or possible changes due to just code
shifting around?

In order for the code to have shown any change it needs to be run and
I didn't see the tests triggering compaction on my test system. I'm
wondering how much memory you have available in the system you were
testing on that the test was enough to trigger compaction?

> > I wonder if testing this code wouldn't be better done using
> > something like thpscale from the
> > mmtests(https://github.com/gormanm/mmtests)? It seems past changes to
> > the compaction code were tested using that, and the config script for
> > the test explains that it is designed specifically to stress the
> > compaction code. I have the test up and running now and hope to
> > collect results over the weekend.
>
> I did the testing, but a awkward is that I failed to get result,
> maybe leak some packages.

So one thing I noticed is that if you have over 128GB of memory in the
system it will fail unless you update the sysctl value
vm.max_map_count. It defaulted to somewhere close to 64K, and I
increased it 20X to 1280K in order for the test to run without failing
on the mmap calls. The other edit I had to make was the config file as
the test system I was on had about 1TB of RAM, and my home partition
only had about 800GB to spare so I had to reduce the map size from
8/10 to 5/8.

> # ../../compare-kernels.sh
>
> thpscale Fault Latencies
> Can't locate List/BinarySearch.pm in @INC (@INC contains: /root/mmtests/bin/lib /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vend.
> BEGIN failed--compilation aborted at /root/mmtests/bin/lib/MMTests/Stat.pm line 13.
> Compilation failed in require at /root/mmtests/work/log/../../bin/compare-mmtests.pl line 13.
> BEGIN failed--compilation aborted at /root/mmtests/work/log/../../bin/compare-mmtests.pl line 13.

I had to install List::BinarySearch.pm. It required installing the
cpan perl libraries.

> >
> > There is one change I will probably make to this patch and that is to
> > place the new code that is setting skip_updated where the old code was
> > calling test_and_set_skip_bit. By doing that we can avoid extra checks
> > and it should help to reduce possible collisions when setting the skip
> > bit in the pageblock flags.
>
> the problem maybe on cmpchxg pb flags, that may involved other blocks changes.

That is the only thing I can think of just based on code review.
Although that would imply multiple compact threads are running, and as
I said in my tests I never saw kcompactd wakeup so I don't think the
tests you were mentioning were enough to stress compaction.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction
  2020-07-25 12:59 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alex Shi
  2020-08-04 21:35   ` Alexander Duyck
  2020-08-06 18:38   ` Alexander Duyck
@ 2020-08-17 22:58   ` Alexander Duyck
  2 siblings, 0 replies; 102+ messages in thread
From: Alexander Duyck @ 2020-08-17 22:58 UTC (permalink / raw)
  To: Alex Shi
  Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins,
	Konstantin Khlebnikov, Daniel Jordan, Yang Shi, Matthew Wilcox,
	Johannes Weiner, kbuild test robot, linux-mm, LKML, cgroups,
	Shakeel Butt, Joonsoo Kim, Wei Yang, Kirill A. Shutemov,
	Rong Chen

> @@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>                  * only when the page is being freed somewhere else.
>                  */
>                 scan += nr_pages;
> -               switch (__isolate_lru_page(page, mode)) {
> +               switch (__isolate_lru_page_prepare(page, mode)) {
>                 case 0:
> +                       /*
> +                        * Be careful not to clear PageLRU until after we're
> +                        * sure the page is not being freed elsewhere -- the
> +                        * page release code relies on it.
> +                        */
> +                       if (unlikely(!get_page_unless_zero(page)))
> +                               goto busy;
> +
> +                       if (!TestClearPageLRU(page)) {
> +                               /*
> +                                * This page may in other isolation path,
> +                                * but we still hold lru_lock.
> +                                */
> +                               put_page(page);
> +                               goto busy;
> +                       }
> +

So I was reviewing the code and came across this piece. It has me a
bit concerned since we are calling put_page while holding the LRU lock
which was taken before calling the function. We should be fine in
terms of not encountering a deadlock since the LRU bit is cleared the
page shouldn't grab the LRU lock again, however we could end up
grabbing the zone lock while holding the LRU lock which would be an
issue.

One other thought I had is that this might be safe because the
assumption would be that another thread is holding a reference on the
page, has already called TestClearPageLRU on the page and retrieved
the LRU bit, and is waiting on us to release the LRU lock before it
can pull the page off of the list. In that case put_page will never
decrement the reference count to 0. I believe that is the current case
but I cannot be certain.

I'm just wondering if we should just replace the put_page(page) with a
WARN_ON(put_page_testzero(page)) and a bit more documentation. If I am
not mistaken it should never be possible for the reference count to
actually hit zero here.

Thanks.

- Alex


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip
  2020-08-13  4:02                         ` [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip Alexander Duyck
  2020-08-14  7:19                           ` Alex Shi
@ 2020-08-18  6:50                           ` Alex Shi
  1 sibling, 0 replies; 102+ messages in thread
From: Alex Shi @ 2020-08-18  6:50 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: yang.shi, lkp, rong.a.chen, khlebnikov, kirill, hughd,
	linux-kernel, daniel.m.jordan, linux-mm, shakeelb, willy, hannes,
	tj, cgroups, akpm, richard.weiyang, mgorman, iamjoonsoo.kim



在 2020/8/13 下午12:02, Alexander Duyck 写道:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> The only user of test_and_set_skip was isolate_migratepages_block and it
> was using it after a call that was testing and clearing the LRU flag. As
> such it really didn't need to be behind the LRU lock anymore as it wasn't
> really fulfilling its purpose.
> 
> With that being the case we can simply drop the bit and instead directly
> just call the set_pageblock_skip function if the page we are working on is
> the valid_page at the start of the pageblock. It shouldn't be possible for
> us to encounter the bit being set since we obtained the LRU flag for the
> first page in the pageblock which means we would have exclusive access to
> setting the skip bit. As such we don't need to worry about the abort case
> since no other thread will be able to call what used to be
> test_and_set_skip.
> 
> Since we have dropped the late abort case we can drop the code that was
> clearing the LRU flag and calling page_put since the abort case will now
> not be holding a reference to a page.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

After my false sharing remove on pageblock_flags, this patch looks fine with
a minor change

> ---
>  mm/compaction.c |   50 +++++++-------------------------------------------
>  1 file changed, 7 insertions(+), 43 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5021a18ef722..c1e9918f9dd4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -399,29 +399,6 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>  	}
>  }
>  
> -/*
> - * Sets the pageblock skip bit if it was clear. Note that this is a hint as
> - * locks are not required for read/writers. Returns true if it was already set.
> - */
> -static bool test_and_set_skip(struct compact_control *cc, struct page *page,
> -							unsigned long pfn)
> -{
> -	bool skip;
> -
> -	/* Do no update if skip hint is being ignored */
> -	if (cc->ignore_skip_hint)
> -		return false;
> -
> -	if (!IS_ALIGNED(pfn, pageblock_nr_pages))
> -		return false;
> -
> -	skip = get_pageblock_skip(page);
> -	if (!skip && !cc->no_set_skip_hint)
> -		skip = !set_pageblock_skip(page);
> -
> -	return skip;
> -}
> -
>  static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
>  {
>  	struct zone *zone = cc->zone;
> @@ -480,12 +457,6 @@ static inline void update_pageblock_skip(struct compact_control *cc,
>  static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
>  {
>  }
> -
> -static bool test_and_set_skip(struct compact_control *cc, struct page *page,
> -							unsigned long pfn)
> -{
> -	return false;
> -}
>  #endif /* CONFIG_COMPACTION */
>  
>  /*
> @@ -895,7 +866,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
>  			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
>  				low_pfn = end_pfn;
> -				page = NULL;
>  				goto isolate_abort;
>  			}
>  			valid_page = page;
> @@ -991,6 +961,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		if (!TestClearPageLRU(page))
>  			goto isolate_fail_put;
>  
> +		/* Indicate that we want exclusive access to this pageblock */
> +		if (page == valid_page) {
> +			skip_updated = true;
> +			if (!cc->ignore_skip_hint)

                        if (!cc->ignore_skip_hint && !cc->no_set_skip_hint)
no_set_skip_hint needs to add here.

Thanks
Alex

> +				set_pageblock_skip(page);
> +		}
> +
>  		/* If we already hold the lock, we can skip some rechecking */
>  		if (!lruvec || !lruvec_holds_page_lru_lock(page, lruvec)) {
>  			if (lruvec)
> @@ -1002,13 +979,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  
>  			lruvec_memcg_debug(lruvec, page);
>  
> -			/* Try get exclusive access under lock */
> -			if (!skip_updated) {
> -				skip_updated = true;
> -				if (test_and_set_skip(cc, page, low_pfn))
> -					goto isolate_abort;
> -			}
> -
>  			/*
>  			 * Page become compound since the non-locked check,
>  			 * and it's on LRU. It can only be a THP so the order
> @@ -1094,15 +1064,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  	if (unlikely(low_pfn > end_pfn))
>  		low_pfn = end_pfn;
>  
> -	page = NULL;
> -
>  isolate_abort:
>  	if (lruvec)
>  		unlock_page_lruvec_irqrestore(lruvec, flags);
> -	if (page) {
> -		SetPageLRU(page);
> -		put_page(page);
> -	}
>  
>  	/*
>  	 * Updated the cached scanner pfn once the pageblock has been scanned
> 


^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread, back to index

Thread overview: 102+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-25 12:59 [PATCH v17 00/21] per memcg lru lock Alex Shi
2020-07-25 12:59 ` [PATCH v17 01/21] mm/vmscan: remove unnecessary lruvec adding Alex Shi
2020-08-06  3:47   ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 02/21] mm/page_idle: no unlikely double check for idle page counting Alex Shi
2020-07-25 12:59 ` [PATCH v17 03/21] mm/compaction: correct the comments of compact_defer_shift Alex Shi
2020-07-27 17:29   ` Alexander Duyck
2020-07-28 11:59     ` Alex Shi
2020-07-28 14:17       ` Alexander Duyck
2020-07-25 12:59 ` [PATCH v17 04/21] mm/compaction: rename compact_deferred as compact_should_defer Alex Shi
2020-07-25 12:59 ` [PATCH v17 05/21] mm/thp: move lru_add_page_tail func to huge_memory.c Alex Shi
2020-07-25 12:59 ` [PATCH v17 06/21] mm/thp: clean up lru_add_page_tail Alex Shi
2020-07-25 12:59 ` [PATCH v17 07/21] mm/thp: remove code path which never got into Alex Shi
2020-07-25 12:59 ` [PATCH v17 08/21] mm/thp: narrow lru locking Alex Shi
2020-07-25 12:59 ` [PATCH v17 09/21] mm/memcg: add debug checking in lock_page_memcg Alex Shi
2020-07-25 12:59 ` [PATCH v17 10/21] mm/swap: fold vm event PGROTATED into pagevec_move_tail_fn Alex Shi
2020-07-25 12:59 ` [PATCH v17 11/21] mm/lru: move lru_lock holding in func lru_note_cost_page Alex Shi
2020-08-05 21:18   ` Alexander Duyck
2020-07-25 12:59 ` [PATCH v17 12/21] mm/lru: move lock into lru_note_cost Alex Shi
2020-07-25 12:59 ` [PATCH v17 13/21] mm/lru: introduce TestClearPageLRU Alex Shi
2020-07-29  3:53   ` Alex Shi
2020-08-05 22:43     ` Alexander Duyck
2020-08-06  1:54       ` Alex Shi
2020-08-06 14:41         ` Alexander Duyck
2020-07-25 12:59 ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alex Shi
2020-08-04 21:35   ` Alexander Duyck
2020-08-06 18:38   ` Alexander Duyck
2020-08-07  3:24     ` Alex Shi
2020-08-07 14:51       ` Alexander Duyck
2020-08-10 13:10         ` Alex Shi
2020-08-10 14:41           ` Alexander Duyck
2020-08-11  8:22             ` Alex Shi
2020-08-11 14:47               ` Alexander Duyck
2020-08-12 11:43                 ` Alex Shi
2020-08-12 12:16                   ` Alex Shi
2020-08-12 16:51                   ` Alexander Duyck
2020-08-13  1:46                     ` Alex Shi
2020-08-13  2:17                       ` Alexander Duyck
2020-08-13  3:52                         ` Alex Shi
2020-08-13  4:02                       ` [RFC PATCH 0/3] " Alexander Duyck
2020-08-13  4:02                         ` [RFC PATCH 1/3] mm: Drop locked from isolate_migratepages_block Alexander Duyck
2020-08-13  6:56                           ` Alex Shi
2020-08-13 14:32                             ` Alexander Duyck
2020-08-14  7:25                               ` Alex Shi
2020-08-13  7:44                           ` Alex Shi
2020-08-13 14:26                             ` Alexander Duyck
2020-08-13  4:02                         ` [RFC PATCH 2/3] mm: Drop use of test_and_set_skip in favor of just setting skip Alexander Duyck
2020-08-14  7:19                           ` Alex Shi
2020-08-14 14:24                             ` Alexander Duyck
2020-08-14 21:15                               ` Alexander Duyck
2020-08-15  9:49                                 ` Alex Shi
2020-08-17 15:38                                   ` Alexander Duyck
2020-08-18  6:50                           ` Alex Shi
2020-08-13  4:02                         ` [RFC PATCH 3/3] mm: Identify compound pages sooner in isolate_migratepages_block Alexander Duyck
2020-08-14  7:20                           ` Alex Shi
2020-08-17 22:58   ` [PATCH v17 14/21] mm/compaction: do page isolation first in compaction Alexander Duyck
2020-07-25 12:59 ` [PATCH v17 15/21] mm/thp: add tail pages into lru anyway in split_huge_page() Alex Shi
2020-07-25 12:59 ` [PATCH v17 16/21] mm/swap: serialize memcg changes in pagevec_lru_move_fn Alex Shi
2020-07-25 12:59 ` [PATCH v17 17/21] mm/lru: replace pgdat lru_lock with lruvec lock Alex Shi
2020-07-27 23:34   ` Alexander Duyck
2020-07-28  7:15     ` Alex Shi
2020-07-28 11:19     ` Alex Shi
2020-07-28 14:54       ` Alexander Duyck
2020-07-29  1:00         ` Alex Shi
2020-07-29  1:27           ` Alexander Duyck
2020-07-29  2:27             ` Alex Shi
2020-07-28 15:39     ` Alex Shi
2020-07-28 15:55       ` Alexander Duyck
2020-07-29  0:48         ` Alex Shi
2020-07-29  3:54   ` Alex Shi
2020-08-06  7:41   ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 18/21] mm/lru: introduce the relock_page_lruvec function Alex Shi
2020-07-29 17:52   ` Alexander Duyck
2020-07-30  6:08     ` Alex Shi
2020-07-31 14:20       ` Alexander Duyck
2020-07-31 21:14   ` [PATCH RFC] mm: Add function for testing if the current lruvec lock is valid alexander.h.duyck
2020-07-31 23:54     ` Alex Shi
2020-08-02 18:20       ` Alexander Duyck
2020-08-04  6:13         ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 19/21] mm/vmscan: use relock for move_pages_to_lru Alex Shi
2020-08-03 22:49   ` Alexander Duyck
2020-08-04  6:23     ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 20/21] mm/pgdat: remove pgdat lru_lock Alex Shi
2020-08-03 22:42   ` Alexander Duyck
2020-08-03 22:45     ` Alexander Duyck
2020-08-04  6:22       ` Alex Shi
2020-07-25 12:59 ` [PATCH v17 21/21] mm/lru: revise the comments of lru_lock Alex Shi
2020-08-03 22:37   ` Alexander Duyck
2020-08-04 10:04     ` Alex Shi
2020-08-04 14:29       ` Alexander Duyck
2020-08-06  1:39         ` Alex Shi
2020-08-06 16:27           ` Alexander Duyck
2020-07-27  5:40 ` [PATCH v17 00/21] per memcg lru lock Alex Shi
2020-07-29 14:49   ` Alex Shi
2020-07-29 18:06     ` Hugh Dickins
2020-07-30  2:16       ` Alex Shi
2020-08-03 15:07         ` Michal Hocko
2020-08-04  6:14           ` Alex Shi
2020-07-31 21:31 ` Alexander Duyck
2020-08-04  8:36 ` Alex Shi
2020-08-04  8:36 ` Alex Shi
2020-08-04  8:37 ` Alex Shi
2020-08-04  8:37 ` Alex Shi

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git