* [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
@ 2021-10-28 11:56 Ning Zhang
  2021-10-28 11:56 ` [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Ning Zhang
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Ning Zhang @ 2021-10-28 11:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Yu Zhao

As we know, THP may lead to memory bloat, which may cause OOM.
Through testing with some applications, we found that the cause of
memory bloat is that a huge page may contain zero subpages (which
may or may not have been accessed). We also found that most zero
subpages are concentrated in a few huge pages.

The following is a text_classification_rnn case for TensorFlow:

  zero_subpages   huge_pages  waste
  [     0,     1) 186         0.00%
  [     1,     2) 23          0.01%
  [     2,     4) 36          0.02%
  [     4,     8) 67          0.08%
  [     8,    16) 80          0.23%
  [    16,    32) 109         0.61%
  [    32,    64) 44          0.49%
  [    64,   128) 12          0.30%
  [   128,   256) 28          1.54%
  [   256,   513) 159        18.03%

In this case, 187 huge pages (25% of the total huge pages) contain
more than 128 zero subpages each, and they account for 19.57% of the
total RSS. This means we can reclaim 19.57% of memory by splitting
the 187 huge pages and reclaiming the zero subpages.

This patchset introduces a new mechanism to split huge pages that
contain zero subpages and to reclaim those zero subpages.

We add the anonymous huge pages to a list to reduce the cost of
finding them. When memory reclaim is triggered, the list is walked
and huge pages that contain enough zero subpages may be reclaimed.
Meanwhile, the zero subpages are replaced by ZERO_PAGE(0).
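
In outline, the reclaim scan looks roughly like this (a simplified
sketch using the helper names introduced in patch 1, not the exact
code):

  do {
          struct page *page = NULL;

          /* pop a huge page off the memcg's per-node reclaim queue */
          if (zsr_get_hpage(hr_queue, &page))
                  break;

          if (!page)      /* not reclaimable; just dropped from queue */
                  continue;

          /* split it and free the all-zero subpages */
          nr_reclaimed += zsr_reclaim_hpage(lruvec, page);
  } while (nr_reclaimed < nr_to_reclaim);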

Yu Zhao has done similar work to speed up the cases where a huge page
is swapped out or migrated[1]. We instead do this in the normal memory
shrink path, e.g. for systems running without swap, to avoid OOM.

In the future, we will add proactive reclaim to reclaim "cold" huge
pages proactively. This is to preserve the performance benefit of THP
as far as possible. In addition, some users want memory usage with THP
to be equal to the usage with 4K pages.

[1] https://lore.kernel.org/linux-mm/20210731063938.1391602-1-yuzhao@google.com/

Ning Zhang (6):
  mm, thp: introduce thp zero subpages reclaim
  mm, thp: add a global interface for zero subpages reclaim
  mm, thp: introduce zero subpages reclaim threshold
  mm, thp: introduce a controller to trigger zero subpages reclaim
  mm, thp: add some statistics for zero subpages reclaim
  mm, thp: add document for zero subpages reclaim

 Documentation/admin-guide/mm/transhuge.rst |  75 ++++++
 include/linux/huge_mm.h                    |  13 +
 include/linux/memcontrol.h                 |  26 ++
 include/linux/mm.h                         |   1 +
 include/linux/mm_types.h                   |   6 +
 include/linux/mmzone.h                     |   9 +
 mm/huge_memory.c                           | 374 ++++++++++++++++++++++++++++-
 mm/memcontrol.c                            | 243 +++++++++++++++++++
 mm/vmscan.c                                |  61 ++++-
 9 files changed, 805 insertions(+), 3 deletions(-)

-- 
1.8.3.1




* [RFC 1/6] mm, thp: introduce thp zero subpages reclaim
  2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
@ 2021-10-28 11:56 ` Ning Zhang
  2021-10-28 12:53   ` Matthew Wilcox
  2021-10-28 20:50     ` kernel test robot
  2021-10-28 11:56 ` [RFC 2/6] mm, thp: add a global interface for zero subpages reclaim Ning Zhang
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 21+ messages in thread
From: Ning Zhang @ 2021-10-28 11:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Yu Zhao

Transparent huge pages can reduce the number of TLB misses, which can
improve performance for applications. One concern, however, is that
thp may lead to memory bloat, which may cause OOM. The reason is that
a huge page may contain zero subpages that the user never really
accessed.

This patch introduces a mechanism to reclaim these zero subpages; it
works when memory pressure is high. We first estimate whether a huge
page contains enough zero subpages, then try to split it and reclaim
the zero subpages.

Through testing with some applications, we found that the zero
subpages tend to be concentrated in a few huge pages. The following
is a text_classification_rnn case for TensorFlow:

  zero_subpages   huge_pages  waste
  [     0,     1) 186         0.00%
  [     1,     2) 23          0.01%
  [     2,     4) 36          0.02%
  [     4,     8) 67          0.08%
  [     8,    16) 80          0.23%
  [    16,    32) 109         0.61%
  [    32,    64) 44          0.49%
  [    64,   128) 12          0.30%
  [   128,   256) 28          1.54%
  [   256,   513) 159        18.03%

In this case, a lot of zero subpages are concentrated in 187 (28+159)
huge pages, which account for 19.57% of the total RSS. This means we
can reclaim 19.57% of memory by splitting the 187 huge pages and
reclaiming the zero subpages.

We store the huge pages on a new list in order to find them quickly,
and add an interface 'thp_reclaim' to switch it on or off per memory
cgroup:

  echo 1 > memory.thp_reclaim to enable.
  echo 0 > memory.thp_reclaim to disable.

Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
---
 include/linux/huge_mm.h    |   9 ++
 include/linux/memcontrol.h |  15 +++
 include/linux/mm.h         |   1 +
 include/linux/mm_types.h   |   6 +
 include/linux/mmzone.h     |   6 +
 mm/huge_memory.c           | 296 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/memcontrol.c            | 107 ++++++++++++++++
 mm/vmscan.c                |  59 ++++++++-
 8 files changed, 496 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f123e15..e1b3bf9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -185,6 +185,15 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 void free_transhuge_page(struct page *page);
 bool is_transparent_hugepage(struct page *page);
 
+#ifdef CONFIG_MEMCG
+int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page);
+unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page);
+static inline struct list_head *hpage_reclaim_list(struct page *page)
+{
+	return &page[3].hpage_reclaim_list;
+}
+#endif
+
 bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3096c9a..502a6ab 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -150,6 +150,9 @@ struct mem_cgroup_per_node {
 	unsigned long		usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
 	bool			on_tree;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	struct hpage_reclaim hpage_reclaim_queue;
+#endif
 	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
 						/* use container_of	   */
 };
@@ -228,6 +231,13 @@ struct obj_cgroup {
 	};
 };
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+enum thp_reclaim_state {
+	THP_RECLAIM_DISABLE,
+	THP_RECLAIM_ENABLE,
+	THP_RECLAIM_MEMCG, /* For global configuration */
+};
+#endif
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -345,6 +355,7 @@ struct mem_cgroup {
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	struct deferred_split deferred_split_queue;
+	int thp_reclaim;
 #endif
 
 	struct mem_cgroup_per_node *nodeinfo[];
@@ -1110,6 +1121,10 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void del_hpage_from_queue(struct page *page);
+#endif
+
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52ab..39676f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3061,6 +3061,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, void *, size_t *,
 
 void drop_slab(void);
 void drop_slab_node(int nid);
+unsigned int move_pages_to_lru(struct lruvec *lruvec, struct list_head *list);
 
 #ifndef CONFIG_MMU
 #define randomize_va_space 0
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f8ee09..9433987 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -159,6 +159,12 @@ struct page {
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
+		struct {	 /* Third tail page of compound page */
+			unsigned long _compound_pad_2;
+			unsigned long _compound_pad_3;
+			/* For zero subpages reclaim */
+			struct list_head hpage_reclaim_list;
+		};
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d..222cd4f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -787,6 +787,12 @@ struct deferred_split {
 	struct list_head split_queue;
 	unsigned long split_queue_len;
 };
+
+struct hpage_reclaim {
+	spinlock_t reclaim_queue_lock;
+	struct list_head reclaim_queue;
+	unsigned long reclaim_queue_len;
+};
 #endif
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e9ef0f..21e3c01 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -526,6 +526,9 @@ void prep_transhuge_page(struct page *page)
 	 */
 
 	INIT_LIST_HEAD(page_deferred_list(page));
+#ifdef CONFIG_MEMCG
+	INIT_LIST_HEAD(hpage_reclaim_list(page));
+#endif
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
@@ -2367,7 +2370,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_dirty)));
 
 	/* ->mapping in first tail page is compound_mapcount */
-	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
+	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING,
 			page_tail);
 	page_tail->mapping = head->mapping;
 	page_tail->index = head->index + tail;
@@ -2620,6 +2623,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(!PageLocked(head), head);
 	VM_BUG_ON_PAGE(!PageCompound(head), head);
 
+	del_hpage_from_queue(page);
 	if (PageWriteback(head))
 		return -EBUSY;
 
@@ -2779,6 +2783,7 @@ void deferred_split_huge_page(struct page *page)
 			set_shrinker_bit(memcg, page_to_nid(page),
 					 deferred_split_shrinker.id);
 #endif
+		del_hpage_from_queue(page);
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 }
@@ -3203,3 +3208,292 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
 #endif
+
+#ifdef CONFIG_MEMCG
+static inline bool is_zero_page(struct page *page)
+{
+	void *addr = kmap(page);
+	bool ret = true;
+
+	if (memchr_inv(addr, 0, PAGE_SIZE))
+		ret = false;
+	kunmap(page);
+
+	return ret;
+}
+
+/*
+ * We'll split the huge page only if it is estimated to contain at least
+ * 1/32 zero subpages, checking one discrete unsigned long per subpage.
+ */
+static bool hpage_estimate_zero(struct page *page)
+{
+	unsigned int i, maybe_zero_pages = 0, offset = 0;
+	void *addr;
+
+#define BYTES_PER_LONG (BITS_PER_LONG / BITS_PER_BYTE)
+	for (i = 0; i < HPAGE_PMD_NR; i++, page++, offset++) {
+		addr = kmap(page);
+		if (unlikely((offset + 1) * BYTES_PER_LONG > PAGE_SIZE))
+			offset = 0;
+		if (*((const unsigned long *)addr + offset) == 0UL) {
+			if (++maybe_zero_pages == HPAGE_PMD_NR >> 5) {
+				kunmap(page);
+				return true;
+			}
+		}
+		kunmap(page);
+	}
+
+	return false;
+}
+
+static bool replace_zero_pte(struct page *page, struct vm_area_struct *vma,
+			     unsigned long addr, void *zero_page)
+{
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = addr,
+		.flags = PVMW_SYNC | PVMW_MIGRATION,
+	};
+	pte_t pte;
+
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		pte = pte_mkspecial(
+			pfn_pte(page_to_pfn((struct page *)zero_page),
+			vma->vm_page_prot));
+		dec_mm_counter(vma->vm_mm, MM_ANONPAGES);
+		set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
+		update_mmu_cache(vma, pvmw.address, pvmw.pte);
+	}
+
+	return true;
+}
+
+static void replace_zero_ptes_locked(struct page *page)
+{
+	struct page *zero_page = ZERO_PAGE(0);
+	struct rmap_walk_control rwc = {
+		.rmap_one = replace_zero_pte,
+		.arg = zero_page,
+	};
+
+	rmap_walk_locked(page, &rwc);
+}
+
+static bool replace_zero_page(struct page *page)
+{
+	struct anon_vma *anon_vma = NULL;
+	bool unmap_success;
+	bool ret = true;
+
+	anon_vma = page_get_anon_vma(page);
+	if (!anon_vma)
+		return false;
+
+	anon_vma_lock_write(anon_vma);
+	try_to_migrate(page, TTU_RMAP_LOCKED);
+	unmap_success = !page_mapped(page);
+
+	if (!unmap_success || !is_zero_page(page)) {
+		/* remap the page */
+		remove_migration_ptes(page, page, true);
+		ret = false;
+	} else
+		replace_zero_ptes_locked(page);
+
+	anon_vma_unlock_write(anon_vma);
+	put_anon_vma(anon_vma);
+
+	return ret;
+}
+
+/*
+ * reclaim_zero_subpages - reclaim the zero subpages and putback the non-zero
+ * subpages.
+ *
+ * The non-zero subpages are putback to the keep_list, and will be putback to
+ * the lru list.
+ *
+ * Return the number of reclaimed zero subpages.
+ */
+static unsigned long reclaim_zero_subpages(struct list_head *list,
+					   struct list_head *keep_list)
+{
+	LIST_HEAD(zero_list);
+	struct page *page;
+	unsigned long reclaimed = 0;
+
+	while (!list_empty(list)) {
+		page = lru_to_page(list);
+		list_del_init(&page->lru);
+		if (is_zero_page(page)) {
+			if (!trylock_page(page))
+				goto keep;
+
+			if (!replace_zero_page(page)) {
+				unlock_page(page);
+				goto keep;
+			}
+
+			__ClearPageActive(page);
+			unlock_page(page);
+			if (put_page_testzero(page)) {
+				list_add(&page->lru, &zero_list);
+				reclaimed++;
+			}
+
+			/* someone may hold the zero page, we just skip it. */
+
+			continue;
+		}
+keep:
+		list_add(&page->lru, keep_list);
+	}
+
+	mem_cgroup_uncharge_list(&zero_list);
+	free_unref_page_list(&zero_list);
+
+	return reclaimed;
+
+}
+
+#ifdef CONFIG_MMU
+#define ZSR_PG_MLOCK(flag)	(1UL << flag)
+#else
+#define ZSR_PG_MLOCK(flag)	0
+#endif
+
+#ifdef CONFIG_ARCH_USES_PG_UNCACHED
+#define ZSR_PG_UNCACHED(flag)	(1UL << flag)
+#else
+#define ZSR_PG_UNCACHED(flag)	0
+#endif
+
+#ifdef CONFIG_MEMORY_FAILURE
+#define ZSR_PG_HWPOISON(flag)	(1UL << flag)
+#else
+#define ZSR_PG_HWPOISON(flag)	0
+#endif
+
+/* Filter unsupported page flags. */
+#define ZSR_FLAG_CHECK			\
+	((1UL << PG_error) |		\
+	 (1UL << PG_owner_priv_1) |	\
+	 (1UL << PG_arch_1) |		\
+	 (1UL << PG_reserved) |		\
+	 (1UL << PG_private) |		\
+	 (1UL << PG_private_2) |	\
+	 (1UL << PG_writeback) |	\
+	 (1UL << PG_swapcache) |	\
+	 (1UL << PG_mappedtodisk) |	\
+	 (1UL << PG_reclaim) |		\
+	 (1UL << PG_unevictable) |	\
+	 ZSR_PG_MLOCK(PG_mlocked) |	\
+	 ZSR_PG_UNCACHED(PG_uncached) |	\
+	 ZSR_PG_HWPOISON(PG_hwpoison))
+
+#define hpage_can_reclaim(page) \
+	(PageAnon(page) && !PageKsm(page) && !(page->flags & ZSR_FLAG_CHECK))
+
+#define hr_queue_list_to_page(head) \
+	compound_head(list_entry((head)->prev, struct page,\
+		      hpage_reclaim_list))
+
+/*
+ * zsr_get_hpage - get one huge page from huge page reclaim queue
+ *
+ * Return -EINVAL if the queue is empty; otherwise, return 0.
+ * If the queue is not empty, it will check whether the tail page of the
+ * queue can be reclaimed or not. If the page can be reclaimed, it will
+ * be stored in reclaim_page; otherwise, just delete the page from the
+ * queue.
+ */
+int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page)
+{
+	struct page *page = NULL;
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags);
+	if (list_empty(&hr_queue->reclaim_queue)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	page = hr_queue_list_to_page(&hr_queue->reclaim_queue);
+	list_del_init(hpage_reclaim_list(page));
+	hr_queue->reclaim_queue_len--;
+
+	if (!hpage_can_reclaim(page) || !get_page_unless_zero(page))
+		goto unlock;
+
+	if (!trylock_page(page)) {
+		put_page(page);
+		goto unlock;
+	}
+
+	spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags);
+
+	if (hpage_can_reclaim(page) && hpage_estimate_zero(page) &&
+	    !isolate_lru_page(page)) {
+		__mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON,
+				      HPAGE_PMD_NR);
+		/*
+		 *  dec the reference added in
+		 *  isolate_lru_page
+		 */
+		page_ref_dec(page);
+		*reclaim_page = page;
+	} else {
+		unlock_page(page);
+		put_page(page);
+	}
+
+	return ret;
+
+unlock:
+	spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags);
+	return ret;
+
+}
+
+unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page)
+{
+	struct pglist_data *pgdat = page_pgdat(page);
+	unsigned long reclaimed;
+	unsigned long flags;
+	LIST_HEAD(split_list);
+	LIST_HEAD(keep_list);
+
+	/*
+	 * Split the huge page and reclaim the zero subpages.
+	 * And putback the non-zero subpages to the lru list.
+	 */
+	if (split_huge_page_to_list(page, &split_list)) {
+		unlock_page(page);
+		putback_lru_page(page);
+		mod_node_page_state(pgdat, NR_ISOLATED_ANON,
+				    -HPAGE_PMD_NR);
+		return 0;
+	}
+
+	unlock_page(page);
+	list_add_tail(&page->lru, &split_list);
+	reclaimed = reclaim_zero_subpages(&split_list, &keep_list);
+
+	spin_lock_irqsave(&lruvec->lru_lock, flags);
+	move_pages_to_lru(lruvec, &keep_list);
+	spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+	mod_node_page_state(pgdat, NR_ISOLATED_ANON,
+			    -HPAGE_PMD_NR);
+
+	mem_cgroup_uncharge_list(&keep_list);
+	free_unref_page_list(&keep_list);
+
+	return reclaimed;
+}
+#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b762215..5df1cdd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2739,6 +2739,56 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 }
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* Need the page lock if the page is not a newly allocated page. */
+static void add_hpage_to_queue(struct page *page, struct mem_cgroup *memcg)
+{
+	struct hpage_reclaim *hr_queue;
+	unsigned long flags;
+
+	if (READ_ONCE(memcg->thp_reclaim) == THP_RECLAIM_DISABLE)
+		return;
+
+	page = compound_head(page);
+	/*
+	 * We only want to add anon pages to the queue, but whether the
+	 * page is anon is not yet known when charging to the memcg.
+	 * page_mapping() returns NULL if the page is an anon page or if
+	 * its mapping is not yet set.
+	 */
+	if (!is_transparent_hugepage(page) || page_mapping(page))
+		return;
+
+	hr_queue = &memcg->nodeinfo[page_to_nid(page)]->hpage_reclaim_queue;
+	spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags);
+	if (list_empty(hpage_reclaim_list(page))) {
+		list_add(hpage_reclaim_list(page), &hr_queue->reclaim_queue);
+		hr_queue->reclaim_queue_len++;
+	}
+	spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags);
+}
+
+void del_hpage_from_queue(struct page *page)
+{
+	struct mem_cgroup *memcg;
+	struct hpage_reclaim *hr_queue;
+	unsigned long flags;
+
+	page = compound_head(page);
+	memcg = page_memcg(page);
+	if (!memcg || !is_transparent_hugepage(page))
+		return;
+
+	hr_queue = &memcg->nodeinfo[page_to_nid(page)]->hpage_reclaim_queue;
+	spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags);
+	if (!list_empty(hpage_reclaim_list(page))) {
+		list_del_init(hpage_reclaim_list(page));
+		hr_queue->reclaim_queue_len--;
+	}
+	spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags);
+}
+#endif
+
 static void commit_charge(struct page *page, struct mem_cgroup *memcg)
 {
 	VM_BUG_ON_PAGE(page_memcg(page), page);
@@ -2751,6 +2801,10 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg)
 	 * - exclusive reference
 	 */
 	page->memcg_data = (unsigned long)memcg;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	add_hpage_to_queue(page, memcg);
+#endif
 }
 
 static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
@@ -4425,6 +4479,26 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
 
 	return 0;
 }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static u64 mem_cgroup_thp_reclaim_read(struct cgroup_subsys_state *css,
+				       struct cftype *cft)
+{
+	return READ_ONCE(mem_cgroup_from_css(css)->thp_reclaim);
+}
+
+static int mem_cgroup_thp_reclaim_write(struct cgroup_subsys_state *css,
+					struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	if (val != THP_RECLAIM_DISABLE && val != THP_RECLAIM_ENABLE)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->thp_reclaim, val);
+
+	return 0;
+}
+#endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 
@@ -4988,6 +5062,13 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 		.write = mem_cgroup_reset,
 		.read_u64 = mem_cgroup_read_u64,
 	},
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	{
+		.name = "thp_reclaim",
+		.read_u64 = mem_cgroup_thp_reclaim_read,
+		.write_u64 = mem_cgroup_thp_reclaim_write,
+	},
+#endif
 	{ },	/* terminate */
 };
 
@@ -5088,6 +5169,12 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	pn->on_tree = false;
 	pn->memcg = memcg;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	spin_lock_init(&pn->hpage_reclaim_queue.reclaim_queue_lock);
+	INIT_LIST_HEAD(&pn->hpage_reclaim_queue.reclaim_queue);
+	pn->hpage_reclaim_queue.reclaim_queue_len = 0;
+#endif
+
 	memcg->nodeinfo[node] = pn;
 	return 0;
 }
@@ -5176,6 +5263,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	spin_lock_init(&memcg->deferred_split_queue.split_queue_lock);
 	INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue);
 	memcg->deferred_split_queue.split_queue_len = 0;
+
+	memcg->thp_reclaim = THP_RECLAIM_DISABLE;
 #endif
 	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
 	return memcg;
@@ -5209,6 +5298,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		page_counter_init(&memcg->swap, &parent->swap);
 		page_counter_init(&memcg->kmem, &parent->kmem);
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		memcg->thp_reclaim = parent->thp_reclaim;
+#endif
 	} else {
 		page_counter_init(&memcg->memory, NULL);
 		page_counter_init(&memcg->swap, NULL);
@@ -5654,6 +5746,10 @@ static int mem_cgroup_move_account(struct page *page,
 		__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
 	}
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	del_hpage_from_queue(page);
+#endif
+
 	/*
 	 * All state has been migrated, let's switch to the new memcg.
 	 *
@@ -5674,6 +5770,10 @@ static int mem_cgroup_move_account(struct page *page,
 
 	page->memcg_data = (unsigned long)to;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	add_hpage_to_queue(page, to);
+#endif
+
 	__unlock_page_memcg(from);
 
 	ret = 0;
@@ -6850,6 +6950,9 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	del_hpage_from_queue(page);
+#endif
 	/*
 	 * Nobody should be changing or seriously looking at
 	 * page memcg or objcg at this point, we have fully
@@ -7196,6 +7299,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(oldid, page);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	del_hpage_from_queue(page);
+#endif
+
 	page->memcg_data = 0;
 
 	if (!mem_cgroup_is_root(memcg))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2..9be136f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2151,8 +2151,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
  *
  * Returns the number of pages moved to the given lruvec.
  */
-static unsigned int move_pages_to_lru(struct lruvec *lruvec,
-				      struct list_head *list)
+unsigned int move_pages_to_lru(struct lruvec *lruvec, struct list_head *list)
 {
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
@@ -2783,6 +2782,57 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
 	return can_demote(pgdat->node_id, sc);
 }
 
+#ifdef CONFIG_MEMCG
+#define MAX_SCAN_HPAGE 32UL
+/*
+ * Try to reclaim the zero subpages for the transparent huge page.
+ */
+static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
+						 int priority,
+						 unsigned long nr_to_reclaim)
+{
+	struct mem_cgroup *memcg;
+	struct hpage_reclaim *hr_queue;
+	int nid = lruvec->pgdat->node_id;
+	unsigned long nr_reclaimed = 0, nr_scanned = 0, nr_to_scan;
+
+	memcg = lruvec_memcg(lruvec);
+	if (!memcg)
+		goto out;
+
+	hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue;
+	if (!READ_ONCE(memcg->thp_reclaim))
+		goto out;
+
+	/* The last scan loop will scan all the huge pages. */
+	nr_to_scan = priority == 0 ? 0 : MAX_SCAN_HPAGE;
+
+	do {
+		struct page *page = NULL;
+
+		if (zsr_get_hpage(hr_queue, &page))
+			break;
+
+		if (!page)
+			continue;
+
+		nr_reclaimed += zsr_reclaim_hpage(lruvec, page);
+
+		cond_resched();
+
+	} while ((nr_reclaimed < nr_to_reclaim) && (++nr_scanned != nr_to_scan));
+out:
+	return nr_reclaimed;
+}
+#else
+static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
+						 int priority,
+						 unsigned long nr_to_reclaim)
+{
+	return 0;
+}
+#endif
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
@@ -2886,6 +2936,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		scan_adjusted = true;
 	}
 	blk_finish_plug(&plug);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (nr_reclaimed < nr_to_reclaim)
+		nr_reclaimed += reclaim_hpage_zero_subpages(lruvec,
+				sc->priority, nr_to_reclaim - nr_reclaimed);
+#endif
 	sc->nr_reclaimed += nr_reclaimed;
 
 	/*
-- 
1.8.3.1




* [RFC 2/6] mm, thp: add a global interface for zero subpages reclaim
  2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
  2021-10-28 11:56 ` [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Ning Zhang
@ 2021-10-28 11:56 ` Ning Zhang
  2021-10-29  0:44     ` kernel test robot
  2021-10-28 11:56 ` [RFC 3/6] mm, thp: introduce zero subpages reclaim threshold Ning Zhang
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Ning Zhang @ 2021-10-28 11:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Yu Zhao

Add a global interface to configure zero subpages reclaim globally:

  /sys/kernel/mm/transparent_hugepage/reclaim

It has three modes:

  memcg: every memory cgroup uses its own configuration.
  enable: every memory cgroup enables reclaim.
  disable: every memory cgroup disables reclaim.

The default mode is memcg.
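
For example (the bracketed entry marks the current mode, as printed
by reclaim_show() below):

  cat /sys/kernel/mm/transparent_hugepage/reclaim
  [memcg] enable disable

  echo enable > /sys/kernel/mm/transparent_hugepage/reclaim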

Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
---
 include/linux/huge_mm.h    |  1 +
 include/linux/memcontrol.h |  8 ++++++++
 mm/huge_memory.c           | 44 ++++++++++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c            |  2 +-
 mm/vmscan.c                |  2 +-
 5 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e1b3bf9..04607b1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -186,6 +186,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 bool is_transparent_hugepage(struct page *page);
 
 #ifdef CONFIG_MEMCG
+extern int global_thp_reclaim;
 int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page);
 unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page);
 static inline struct list_head *hpage_reclaim_list(struct page *page)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 502a6ab..f99f13f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1123,6 +1123,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void del_hpage_from_queue(struct page *page);
+
+static inline int get_thp_reclaim_mode(struct mem_cgroup *memcg)
+{
+	int reclaim = READ_ONCE(global_thp_reclaim);
+
+	return (reclaim != THP_RECLAIM_MEMCG) ? reclaim :
+			READ_ONCE(memcg->thp_reclaim);
+}
 #endif
 
 #else /* CONFIG_MEMCG */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 21e3c01..84fd738 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -60,6 +60,10 @@
 
 static struct shrinker deferred_split_shrinker;
 
+#ifdef CONFIG_MEMCG
+int global_thp_reclaim = THP_RECLAIM_MEMCG;
+#endif
+
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
@@ -330,6 +334,43 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 static struct kobj_attribute hpage_pmd_size_attr =
 	__ATTR_RO(hpage_pmd_size);
 
+#ifdef CONFIG_MEMCG
+static ssize_t reclaim_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	int thp_reclaim = READ_ONCE(global_thp_reclaim);
+
+	if (thp_reclaim == THP_RECLAIM_MEMCG)
+		return sprintf(buf, "[memcg] enable disable\n");
+	else if (thp_reclaim == THP_RECLAIM_ENABLE)
+		return sprintf(buf, "memcg [enable] disable\n");
+	else
+		return sprintf(buf, "memcg enable [disable]\n");
+}
+
+static ssize_t reclaim_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	if (!memcmp("memcg", buf,
+		    min(sizeof("memcg")-1, count)))
+		WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_MEMCG);
+	else if (!memcmp("enable", buf,
+		    min(sizeof("enable")-1, count)))
+		WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_ENABLE);
+	else if (!memcmp("disable", buf,
+		    min(sizeof("disable")-1, count)))
+		WRITE_ONCE(global_thp_reclaim, THP_RECLAIM_DISABLE);
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+static struct kobj_attribute reclaim_attr =
+	__ATTR(reclaim, 0644, reclaim_show, reclaim_store);
+#endif
+
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
@@ -338,6 +379,9 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
 #ifdef CONFIG_SHMEM
 	&shmem_enabled_attr.attr,
 #endif
+#ifdef CONFIG_MEMCG
+	&reclaim_attr.attr,
+#endif
 	NULL,
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5df1cdd..ae96781 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2746,7 +2746,7 @@ static void add_hpage_to_queue(struct page *page, struct mem_cgroup *memcg)
 	struct hpage_reclaim *hr_queue;
 	unsigned long flags;
 
-	if (READ_ONCE(memcg->thp_reclaim) == THP_RECLAIM_DISABLE)
+	if (get_thp_reclaim_mode(memcg) == THP_RECLAIM_DISABLE)
 		return;
 
 	page = compound_head(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9be136f..f4ff14d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2801,7 +2801,7 @@ static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
 		goto out;
 
 	hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue;
-	if (!READ_ONCE(memcg->thp_reclaim))
+	if (get_thp_reclaim_mode(memcg) == THP_RECLAIM_DISABLE)
 		goto out;
 
 	/* The last scan loop will scan all the huge pages. */
-- 
1.8.3.1




* [RFC 3/6] mm, thp: introduce zero subpages reclaim threshold
  2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
  2021-10-28 11:56 ` [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Ning Zhang
  2021-10-28 11:56 ` [RFC 2/6] mm, thp: add a global interface for zero subpages reclaim Ning Zhang
@ 2021-10-28 11:56 ` Ning Zhang
  2021-10-28 11:56 ` [RFC 4/6] mm, thp: introduce a controller to trigger zero subpages reclaim Ning Zhang
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Ning Zhang @ 2021-10-28 11:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Yu Zhao

In this patch, we add memory.thp_reclaim_ctrl for each memory
cgroup to control thp reclaim.

The first controller "threshold" is to set the reclaim threshold.
The default value is 16, which means if a huge page contains over
16 zero subpages (estimated), the huge page can be split and the
zero subpages can be reclaimed when the zero subpages reclaim is
enable.

You can change this value by:

  echo "threshold $v" > /sys/fs/cgroup/memory/{memcg}/thp_reclaim_ctrl

Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
---
 include/linux/huge_mm.h    |  3 ++-
 include/linux/memcontrol.h |  3 +++
 mm/huge_memory.c           |  9 ++++---
 mm/memcontrol.c            | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |  4 ++-
 5 files changed, 75 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 04607b1..304e3df 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -187,7 +187,8 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 
 #ifdef CONFIG_MEMCG
 extern int global_thp_reclaim;
-int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page);
+int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page,
+		  int threshold);
 unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page);
 static inline struct list_head *hpage_reclaim_list(struct page *page)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f99f13f..4815c56 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -237,6 +237,8 @@ enum thp_reclaim_state {
 	THP_RECLAIM_ENABLE,
 	THP_RECLAIM_MEMCG, /* For global configure*/
 };
+
+#define THP_RECLAIM_THRESHOLD_DEFAULT  16
 #endif
 /*
  * The memory controller data structure. The memory controller controls both
@@ -356,6 +358,7 @@ struct mem_cgroup {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	struct deferred_split deferred_split_queue;
 	int thp_reclaim;
+	int thp_reclaim_threshold;
 #endif
 
 	struct mem_cgroup_per_node *nodeinfo[];
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 84fd738..40a9879 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3270,7 +3270,7 @@ static inline bool is_zero_page(struct page *page)
  * We'll split the huge page only if it is estimated to contain at least
  * 1/32 zero subpages, checking one discrete unsigned long per subpage.
  */
-static bool hpage_estimate_zero(struct page *page)
+static bool hpage_estimate_zero(struct page *page, int threshold)
 {
 	unsigned int i, maybe_zero_pages = 0, offset = 0;
 	void *addr;
@@ -3281,7 +3281,7 @@ static bool hpage_estimate_zero(struct page *page)
 		if (unlikely((offset + 1) * BYTES_PER_LONG > PAGE_SIZE))
 			offset = 0;
 		if (*((const unsigned long *)addr + offset) == 0UL) {
-			if (++maybe_zero_pages == HPAGE_PMD_NR >> 5) {
+			if (++maybe_zero_pages == threshold) {
 				kunmap(page);
 				return true;
 			}
@@ -3456,7 +3456,8 @@ static unsigned long reclaim_zero_subpages(struct list_head *list,
  * be stored in reclaim_page; otherwise, just delete the page from the
  * queue.
  */
-int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page)
+int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page,
+		  int threshold)
 {
 	struct page *page = NULL;
 	unsigned long flags;
@@ -3482,7 +3483,7 @@ int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page)
 
 	spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags);
 
-	if (hpage_can_reclaim(page) && hpage_estimate_zero(page) &&
+	if (hpage_can_reclaim(page) && hpage_estimate_zero(page, threshold) &&
 	    !isolate_lru_page(page)) {
 		__mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON,
 				      HPAGE_PMD_NR);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae96781..7ba3c69 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4498,6 +4498,61 @@ static int mem_cgroup_thp_reclaim_write(struct cgroup_subsys_state *css,
 
 	return 0;
 }
+
+static inline char *strsep_s(char **s, const char *ct)
+{
+	char *p;
+
+	while ((p = strsep(s, ct))) {
+		if (*p)
+			return p;
+	}
+
+	return NULL;
+}
+
+static int memcg_thp_reclaim_ctrl_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	int thp_reclaim_threshold = READ_ONCE(memcg->thp_reclaim_threshold);
+
+	seq_printf(m, "threshold\t%d\n", thp_reclaim_threshold);
+
+	return 0;
+}
+
+static ssize_t memcg_thp_reclaim_ctrl_write(struct kernfs_open_file *of,
+					    char *buf, size_t nbytes,
+					    loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	char *key, *value;
+	int ret;
+
+	key = strsep_s(&buf, " \t\n");
+	if (!key)
+		return -EINVAL;
+
+	if (!strcmp(key, "threshold")) {
+		unsigned int threshold;
+
+		value = strsep_s(&buf, " \t\n");
+		if (!value)
+			return -EINVAL;
+
+		ret = kstrtouint(value, 0, &threshold);
+		if (ret)
+			return ret;
+
+		if (threshold > HPAGE_PMD_NR || threshold < 1)
+			return -EINVAL;
+
+		xchg(&memcg->thp_reclaim_threshold, threshold);
+	} else
+		return -EINVAL;
+
+	return nbytes;
+}
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -5068,6 +5123,11 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 		.read_u64 = mem_cgroup_thp_reclaim_read,
 		.write_u64 = mem_cgroup_thp_reclaim_write,
 	},
+	{
+		.name = "thp_reclaim_ctrl",
+		.seq_show = memcg_thp_reclaim_ctrl_show,
+		.write = memcg_thp_reclaim_ctrl_write,
+	},
 #endif
 	{ },	/* terminate */
 };
@@ -5265,6 +5325,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->deferred_split_queue.split_queue_len = 0;
 
 	memcg->thp_reclaim = THP_RECLAIM_DISABLE;
+	memcg->thp_reclaim_threshold = THP_RECLAIM_THRESHOLD_DEFAULT;
 #endif
 	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
 	return memcg;
@@ -5300,6 +5361,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		memcg->thp_reclaim = parent->thp_reclaim;
+		memcg->thp_reclaim_threshold = parent->thp_reclaim_threshold;
 #endif
 	} else {
 		page_counter_init(&memcg->memory, NULL);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f4ff14d..fcc80a6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2794,6 +2794,7 @@ static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
 	struct mem_cgroup *memcg;
 	struct hpage_reclaim *hr_queue;
 	int nid = lruvec->pgdat->node_id;
+	int threshold;
 	unsigned long nr_reclaimed = 0, nr_scanned = 0, nr_to_scan;
 
 	memcg = lruvec_memcg(lruvec);
@@ -2806,11 +2807,12 @@ static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
 
 	/* The last scan loop will scan all the huge pages. */
 	nr_to_scan = priority == 0 ? 0 : MAX_SCAN_HPAGE;
+	threshold = READ_ONCE(memcg->thp_reclaim_threshold);
 
 	do {
 		struct page *page = NULL;
 
-		if (zsr_get_hpage(hr_queue, &page))
+		if (zsr_get_hpage(hr_queue, &page, threshold))
 			break;
 
 		if (!page)
-- 
1.8.3.1




* [RFC 4/6] mm, thp: introduce a controller to trigger zero subpages reclaim
  2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
                   ` (2 preceding siblings ...)
  2021-10-28 11:56 ` [RFC 3/6] mm, thp: introduce zero subpages reclaim threshold Ning Zhang
@ 2021-10-28 11:56 ` Ning Zhang
  2021-10-28 11:56 ` [RFC 5/6] mm, thp: add some statistics for " Ning Zhang
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Ning Zhang @ 2021-10-28 11:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Yu Zhao

Add a new controller named "reclaim" for memory.thp_reclaim_ctrl
to trigger thp reclaim immediately:

  echo "reclaim 1" > memory.thp_reclaim_ctrl
  echo "reclaim 2" > memory.thp_reclaim_ctrl

"reclaim 1" means triggering reclaim only for current memcg.
"reclaim 2" means triggering reclaim for current memcg and it's
children memcgs.

Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
---
 include/linux/huge_mm.h |  1 +
 mm/huge_memory.c        | 29 +++++++++++++++++++++++++++++
 mm/memcontrol.c         | 27 +++++++++++++++++++++++++++
 3 files changed, 57 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 304e3df..f792433 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -190,6 +190,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page,
 		  int threshold);
 unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page);
+void zsr_reclaim_memcg(struct mem_cgroup *memcg);
 static inline struct list_head *hpage_reclaim_list(struct page *page)
 {
 	return &page[3].hpage_reclaim_list;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40a9879..633fd0f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3541,4 +3541,33 @@ unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page)
 
 	return reclaimed;
 }
+
+void zsr_reclaim_memcg(struct mem_cgroup *memcg)
+{
+	struct lruvec *lruvec;
+	struct hpage_reclaim *hr_queue;
+	int threshold, nid;
+
+	if (get_thp_reclaim_mode(memcg) == THP_RECLAIM_DISABLE)
+		return;
+
+	threshold = READ_ONCE(memcg->thp_reclaim_threshold);
+	for_each_online_node(nid) {
+		lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+		hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue;
+		for ( ; ; ) {
+			struct page *page = NULL;
+
+			if (zsr_get_hpage(hr_queue, &page, threshold))
+				break;
+
+			if (!page)
+				continue;
+
+			zsr_reclaim_hpage(lruvec, page);
+
+			cond_resched();
+		}
+	}
+}
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7ba3c69..a8e3ca1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4521,6 +4521,8 @@ static int memcg_thp_reclaim_ctrl_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+#define CTRL_RECLAIM_MEMCG 1 /* only reclaim the current memcg */
+#define CTRL_RECLAIM_ALL   2 /* reclaim the current memcg and all its children */
 static ssize_t memcg_thp_reclaim_ctrl_write(struct kernfs_open_file *of,
 					    char *buf, size_t nbytes,
 					    loff_t off)
@@ -4548,6 +4550,31 @@ static ssize_t memcg_thp_reclaim_ctrl_write(struct kernfs_open_file *of,
 			return -EINVAL;
 
 		xchg(&memcg->thp_reclaim_threshold, threshold);
+	} else if (!strcmp(key, "reclaim")) {
+		struct mem_cgroup *iter;
+		unsigned int mode;
+
+		value = strsep_s(&buf, " \t\n");
+		if (!value)
+			return -EINVAL;
+
+		ret = kstrtouint(value, 0, &mode);
+		if (ret)
+			return ret;
+
+		switch (mode) {
+		case CTRL_RECLAIM_MEMCG:
+			zsr_reclaim_memcg(memcg);
+			break;
+		case CTRL_RECLAIM_ALL:
+			iter = mem_cgroup_iter(memcg, NULL, NULL);
+			do {
+				zsr_reclaim_memcg(iter);
+			} while ((iter = mem_cgroup_iter(memcg, iter, NULL)));
+			break;
+		default:
+			return -EINVAL;
+		}
 	} else
 		return -EINVAL;
 
-- 
1.8.3.1




* [RFC 5/6] mm, thp: add some statistics for zero subpages reclaim
  2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
                   ` (3 preceding siblings ...)
  2021-10-28 11:56 ` [RFC 4/6] mm, thp: introduce a controller to trigger zero subpages reclaim Ning Zhang
@ 2021-10-28 11:56 ` Ning Zhang
  2021-10-28 11:56 ` [RFC 6/6] mm, thp: add document " Ning Zhang
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Ning Zhang @ 2021-10-28 11:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Yu Zhao

queue_length shows the number of huge pages in the queue.
split_hpage shows the number of huge pages split by thp reclaim.
split_failed shows the number of huge pages that failed to split.
reclaim_subpage shows the number of zero subpages reclaimed by
thp reclaim.
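
For example, on a hypothetical two-node machine the new stat file
would read like this (values are illustrative; one column per node,
as printed by memcg_thp_reclaim_stat_show() below):

  cat /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_stat
  queue_length    14                      9
  split_hpage     172                     88
  split_failed    3                       1
  reclaim_subpage 26953                   14032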

Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
---
 include/linux/huge_mm.h |  3 ++-
 include/linux/mmzone.h  |  3 +++
 mm/huge_memory.c        |  8 ++++++--
 mm/memcontrol.c         | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c             |  2 +-
 5 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f792433..5d4a038 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -189,7 +189,8 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 extern int global_thp_reclaim;
 int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page,
 		  int threshold);
-unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page);
+unsigned long zsr_reclaim_hpage(struct hpage_reclaim *hr_queue,
+				struct lruvec *lruvec, struct page *page);
 void zsr_reclaim_memcg(struct mem_cgroup *memcg);
 static inline struct list_head *hpage_reclaim_list(struct page *page)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 222cd4f..6ce6890 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -792,6 +792,9 @@ struct hpage_reclaim {
 	spinlock_t reclaim_queue_lock;
 	struct list_head reclaim_queue;
 	unsigned long reclaim_queue_len;
+	atomic_long_t split_hpage;
+	atomic_long_t split_failed;
+	atomic_long_t reclaim_subpage;
 };
 #endif
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 633fd0f..5e737d0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3506,7 +3506,8 @@ int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page,
 
 }
 
-unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page)
+unsigned long zsr_reclaim_hpage(struct hpage_reclaim *hr_queue,
+				struct lruvec *lruvec, struct page *page)
 {
 	struct pglist_data *pgdat = page_pgdat(page);
 	unsigned long reclaimed;
@@ -3523,12 +3524,15 @@ unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page)
 		putback_lru_page(page);
 		mod_node_page_state(pgdat, NR_ISOLATED_ANON,
 				    -HPAGE_PMD_NR);
+		atomic_long_inc(&hr_queue->split_failed);
 		return 0;
 	}
 
 	unlock_page(page);
 	list_add_tail(&page->lru, &split_list);
 	reclaimed = reclaim_zero_subpages(&split_list, &keep_list);
+	atomic_long_inc(&hr_queue->split_hpage);
+	atomic_long_add(reclaimed, &hr_queue->reclaim_subpage);
 
 	spin_lock_irqsave(&lruvec->lru_lock, flags);
 	move_pages_to_lru(lruvec, &keep_list);
@@ -3564,7 +3568,7 @@ void zsr_reclaim_memcg(struct mem_cgroup *memcg)
 			if (!page)
 				continue;
 
-			zsr_reclaim_hpage(lruvec, page);
+			zsr_reclaim_hpage(hr_queue, lruvec, page);
 
 			cond_resched();
 		}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a8e3ca1..f8016ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4580,6 +4580,49 @@ static ssize_t memcg_thp_reclaim_ctrl_write(struct kernfs_open_file *of,
 
 	return nbytes;
 }
+
+static int memcg_thp_reclaim_stat_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	struct mem_cgroup_per_node *mz;
+	int nid;
+	unsigned long len;
+
+	seq_puts(m, "queue_length\t");
+	for_each_node(nid) {
+		mz = memcg->nodeinfo[nid];
+		len = READ_ONCE(mz->hpage_reclaim_queue.reclaim_queue_len);
+		seq_printf(m, "%-24lu", len);
+	}
+
+	seq_puts(m, "\n");
+	seq_puts(m, "split_hpage\t");
+	for_each_node(nid) {
+		mz = memcg->nodeinfo[nid];
+		len = atomic_long_read(&mz->hpage_reclaim_queue.split_hpage);
+		seq_printf(m, "%-24lu", len);
+	}
+
+	seq_puts(m, "\n");
+	seq_puts(m, "split_failed\t");
+	for_each_node(nid) {
+		mz = memcg->nodeinfo[nid];
+		len = atomic_long_read(&mz->hpage_reclaim_queue.split_failed);
+		seq_printf(m, "%-24lu", len);
+	}
+
+	seq_puts(m, "\n");
+	seq_puts(m, "reclaim_subpage\t");
+	for_each_node(nid) {
+		mz = memcg->nodeinfo[nid];
+		len = atomic_long_read(&mz->hpage_reclaim_queue.reclaim_subpage);
+		seq_printf(m, "%-24lu", len);
+	}
+
+	seq_puts(m, "\n");
+
+	return 0;
+}
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -5155,6 +5198,10 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 		.seq_show = memcg_thp_reclaim_ctrl_show,
 		.write = memcg_thp_reclaim_ctrl_write,
 	},
+	{
+		.name = "thp_reclaim_stat",
+		.seq_show = memcg_thp_reclaim_stat_show,
+	},
 #endif
 	{ },	/* terminate */
 };
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fcc80a6..cb5f53d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2818,7 +2818,7 @@ static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
 		if (!page)
 			continue;
 
-		nr_reclaimed += zsr_reclaim_hpage(lruvec, page);
+		nr_reclaimed += zsr_reclaim_hpage(hr_queue, lruvec, page);
 
 		cond_resched();
 
-- 
1.8.3.1




* [RFC 6/6] mm, thp: add document for zero subpages reclaim
  2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
                   ` (4 preceding siblings ...)
  2021-10-28 11:56 ` [RFC 5/6] mm, thp: add some statistics for " Ning Zhang
@ 2021-10-28 11:56 ` Ning Zhang
  2021-10-28 14:13 ` [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Kirill A. Shutemov
  2021-10-29 13:38 ` Michal Hocko
  7 siblings, 0 replies; 21+ messages in thread
From: Ning Zhang @ 2021-10-28 11:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Yu Zhao

Add a user guide for thp zero subpages reclaim.

Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 75 ++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index c9c37f1..85cd3b7 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -421,3 +421,78 @@ support enabled just fine as always. No difference can be noted in
 hugetlbfs other than there will be less overall fragmentation. All
 usual features belonging to hugetlbfs are preserved and
 unaffected. libhugetlbfs will also work fine as usual.
+
+THP zero subpages reclaim
+=========================
+THP may lead to memory bloat, which may cause OOM. The reason is that a
+huge page may contain zero subpages that users never really accessed. To
+avoid this, a mechanism to reclaim these zero subpages is introduced::
+
+        echo 1 > /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim
+        echo 0 > /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim
+
+Echo 1 to enable and echo 0 to disable.
+The value is inherited from the parent memcg; the root memcg defaults
+to disable.
+
+We also add a global interface; if you don't want to configure every
+memory cgroup individually, you can use this one::
+
+        /sys/kernel/mm/transparent_hugepage/reclaim
+
+memcg
+        The default mode. Every memory cgroup uses its own
+        configuration.
+
+enable
+        Every memory cgroup enables reclaim.
+
+disable
+        Every memory cgroup disables reclaim.
+
+If zero subpages reclaim is enabled, new huge pages will be added to a
+reclaim queue in the mem_cgroup, and the queue will be scanned during
+memory reclaim. The queue stats can be checked like this::
+
+        cat /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_stat
+
+queue_length
+        The queue length on each node.
+
+split_hpage
+        The number of huge pages split by thp reclaim on each node.
+
+split_failed
+        The number of huge pages that thp reclaim failed to split on
+        each node.
+
+reclaim_subpage
+        The number of zero subpages reclaimed by thp reclaim on each
+        node.
+
+We also add a controller interface to set configs for thp reclaim::
+
+        /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_ctrl
+
+threshold
+        A huge page that is estimated to contain at least threshold zero
+        subpages will be split (the estimate checks some discrete unsigned
+        long values). The default value of threshold is 16 and is inherited
+        from the parent. The range of this value is (0, HPAGE_PMD_NR]: it
+        must be less than or equal to HPAGE_PMD_NR (512 on x86) and greater
+        than 0. We can set the reclaim threshold to 8 like this::
+
+        echo "threshold 8" > memory.thp_reclaim_ctrl
+
+reclaim
+        Triggers action immediately for the huge pages in the reclaim queue.
+        The action depends on the thp reclaim config (reclaim, swap or
+        disable; disable means the huge page is just removed from the queue).
+        This controller takes two values, 1 and 2: 1 reclaims only the
+        current memcg, and 2 reclaims the current memcg and all its child
+        memcgs.
+        Like this::
+
+        echo "reclaim 1" > memory.thp_reclaim_ctrl
+        echo "reclaim 2" > memory.thp_reclaim_ctrl
+
+Only one of the configs mentioned above can be set at a time.
-- 
1.8.3.1




* Re: [RFC 1/6] mm, thp: introduce thp zero subpages reclaim
  2021-10-28 11:56 ` [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Ning Zhang
@ 2021-10-28 12:53   ` Matthew Wilcox
  2021-10-29 12:16     ` ning zhang
  2021-10-28 20:50     ` kernel test robot
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2021-10-28 12:53 UTC (permalink / raw)
  To: Ning Zhang
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Yu Zhao

On Thu, Oct 28, 2021 at 07:56:50PM +0800, Ning Zhang wrote:
> +++ b/include/linux/huge_mm.h
> @@ -185,6 +185,15 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  void free_transhuge_page(struct page *page);
>  bool is_transparent_hugepage(struct page *page);
>  
> +#ifdef CONFIG_MEMCG
> +int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page);
> +unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page);
> +static inline struct list_head *hpage_reclaim_list(struct page *page)
> +{
> +	return &page[3].hpage_reclaim_list;
> +}
> +#endif

I don't think any of this needs to be under an ifdef.  That goes for a
lot of your other additions to header files.

> @@ -1110,6 +1121,10 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  						gfp_t gfp_mask,
>  						unsigned long *total_scanned);
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void del_hpage_from_queue(struct page *page);
> +#endif

That name is too generic.  Also, to avoid ifdefs in code, it should be:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void del_hpage_from_queue(struct page *page);
#else
static inline void del_hpage_from_queue(struct page *page) { }
#endif

> @@ -159,6 +159,12 @@ struct page {
>  			/* For both global and memcg */
>  			struct list_head deferred_list;
>  		};
> +		struct {	 /* Third tail page of compound page */
> +			unsigned long _compound_pad_2;
> +			unsigned long _compound_pad_3;
> +			/* For zero subpages reclaim */
> +			struct list_head hpage_reclaim_list;

Why do you need _compound_pad_3 here?

> +++ b/include/linux/mmzone.h
> @@ -787,6 +787,12 @@ struct deferred_split {
>  	struct list_head split_queue;
>  	unsigned long split_queue_len;
>  };
> +
> +struct hpage_reclaim {
> +	spinlock_t reclaim_queue_lock;
> +	struct list_head reclaim_queue;
> +	unsigned long reclaim_queue_len;
> +};

Have you considered using an XArray instead of a linked list?

> +static bool hpage_estimate_zero(struct page *page)
> +{
> +	unsigned int i, maybe_zero_pages = 0, offset = 0;
> +	void *addr;
> +
> +#define BYTES_PER_LONG (BITS_PER_LONG / BITS_PER_BYTE)

BYTES_PER_LONG is simply sizeof(long).
Also, I'd check the entire cacheline rather than just one word; it's
essentially free.
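
Something like this, say (completely untested; offset now stepped in
cacheline-sized units):

	if (unlikely((offset + 1) * L1_CACHE_BYTES > PAGE_SIZE))
		offset = 0;
	if (!memchr_inv(addr + offset * L1_CACHE_BYTES, 0, L1_CACHE_BYTES))
		maybe_zero_pages++;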

> +#ifdef CONFIG_MMU
> +#define ZSR_PG_MLOCK(flag)	(1UL << flag)
> +#else
> +#define ZSR_PG_MLOCK(flag)	0
> +#endif

Or use __PG_MLOCKED ?

> +#ifdef CONFIG_ARCH_USES_PG_UNCACHED
> +#define ZSR_PG_UNCACHED(flag)	(1UL << flag)
> +#else
> +#define ZSR_PG_UNCACHED(flag)	0
> +#endif

Define __PG_UNCACHED in page-flags.h?

> +#ifdef CONFIG_MEMORY_FAILURE
> +#define ZSR_PG_HWPOISON(flag)	(1UL << flag)
> +#else
> +#define ZSR_PG_HWPOISON(flag)	0
> +#endif

__PG_HWPOISON

> +#define hr_queue_list_to_page(head) \
> +	compound_head(list_entry((head)->prev, struct page,\
> +		      hpage_reclaim_list))

I think you're better off subtracting 3*sizeof(struct page) than
loading from compound_head.
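(Roughly, a sketch of that suggestion — untested:)

	#define hr_queue_list_to_page(head) \
		(list_entry((head)->prev, struct page, hpage_reclaim_list) - 3)

Since the list_head lives in the third tail page, subtracting 3 pages
yields the head page without reading page->compound_head.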

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/* Need the page lock if the page is not a newly allocated page. */
> +static void add_hpage_to_queue(struct page *page, struct mem_cgroup *memcg)
> +{
> +	struct hpage_reclaim *hr_queue;
> +	unsigned long flags;
> +
> +	if (READ_ONCE(memcg->thp_reclaim) == THP_RECLAIM_DISABLE)
> +		return;
> +
> +	page = compound_head(page);

Why do you think the caller might be passing in a tail page here?



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
  2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
                   ` (5 preceding siblings ...)
  2021-10-28 11:56 ` [RFC 6/6] mm, thp: add document " Ning Zhang
@ 2021-10-28 14:13 ` Kirill A. Shutemov
  2021-10-29 12:07   ` ning zhang
  2021-10-29 13:38 ` Michal Hocko
  7 siblings, 1 reply; 21+ messages in thread
From: Kirill A. Shutemov @ 2021-10-28 14:13 UTC (permalink / raw)
  To: Ning Zhang
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Yu Zhao

On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
> As we know, thp may lead to memory bloat which may cause OOM.
> Through testing with some apps, we found that the reason of
> memory bloat is a huge page may contain some zero subpages
> (may accessed or not). And we found that most zero subpages
> are centralized in a few huge pages.
> 
> Following is a text_classification_rnn case for tensorflow:
> 
>   zero_subpages   huge_pages  waste
>   [     0,     1) 186         0.00%
>   [     1,     2) 23          0.01%
>   [     2,     4) 36          0.02%
>   [     4,     8) 67          0.08%
>   [     8,    16) 80          0.23%
>   [    16,    32) 109         0.61%
>   [    32,    64) 44          0.49%
>   [    64,   128) 12          0.30%
>   [   128,   256) 28          1.54%
>   [   256,   513) 159        18.03%
> 
> In the case, there are 187 huge pages (25% of the total huge pages)
> which contain more then 128 zero subpages. And these huge pages
> lead to 19.57% waste of the total rss. It means we can reclaim
> 19.57% memory by splitting the 187 huge pages and reclaiming the
> zero subpages.
> 
> This patchset introduce a new mechanism to split the huge page
> which has zero subpages and reclaim these zero subpages.
> 
> We add the anonymous huge page to a list to reduce the cost of
> finding the huge page. When the memory reclaim is triggering,
> the list will be walked and the huge page contains enough zero
> subpages may be reclaimed. Meanwhile, replace the zero subpages
> by ZERO_PAGE(0). 

Does it actually help your workload?

I mean this will only be triggered via vmscan, which was going to split
pages and free them anyway.

You prioritize splitting THP and freeing zero subpages over reclaiming
other pages. It may or may not be the right thing to do, depending on the
workload.

Maybe it makes more sense to check for all-zero pages just after
split_huge_page_to_list() in vmscan and free such pages immediately rather
than add all this complexity?
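(A minimal sketch of that alternative, assuming the subpage can be mapped
with kmap_atomic(); the helper name is hypothetical, not from this series:)

	static bool subpage_is_zero(struct page *page)
	{
		void *addr = kmap_atomic(page);
		/* true if every byte of the 4K subpage is zero */
		bool zero = !memchr_inv(addr, 0, PAGE_SIZE);

		kunmap_atomic(addr);
		return zero;
	}

Each subpage could be checked like this right after
split_huge_page_to_list() and freed immediately when the check returns
true.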

> Yu Zhao has done some similar work when the huge page is swap out
> or migrated to accelerate[1]. While we do this in the normal memory
> shrink path for the swapoff scene to avoid OOM.
> 
> In the future, we will do the proactive reclaim to reclaim the "cold"
> huge page proactively. This is for keeping the performance of thp as
> for as possible. In addition to that, some users want the memory usage
> using thp is equal to the usage using 4K.

Proactive reclaim can be harmful if your max_ptes_none allows khugepaged
to recreate the THP afterwards.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 1/6] mm, thp: introduce thp zero subpages reclaim
  2021-10-28 11:56 ` [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Ning Zhang
@ 2021-10-28 20:50     ` kernel test robot
  2021-10-28 20:50     ` kernel test robot
  1 sibling, 0 replies; 21+ messages in thread
From: kernel test robot @ 2021-10-28 20:50 UTC (permalink / raw)
  To: Ning Zhang; +Cc: llvm, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 13257 bytes --]

Hi Ning,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[also build test ERROR on v5.15-rc7]
[cannot apply to hnaz-mm/master next-20211028]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ning-Zhang/Reclaim-zero-subpages-of-thp-to-avoid-memory-bloat/20211028-200001
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 1fc596a56b334f4d593a2b49e5ff55af6aaa0816
config: arm-randconfig-c002-20211028 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 5db7568a6a1fcb408eb8988abdaff2a225a8eb72)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/ba9f8c1a43c2d9ab2d2ac5696aaffbeaf043fa02
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ning-Zhang/Reclaim-zero-subpages-of-thp-to-avoid-memory-bloat/20211028-200001
        git checkout ba9f8c1a43c2d9ab2d2ac5696aaffbeaf043fa02
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/vmscan.c:1340:6: warning: variable 'err' set but not used [-Wunused-but-set-variable]
           int err;
               ^
>> mm/vmscan.c:2803:36: error: no member named 'hpage_reclaim_queue' in 'struct mem_cgroup_per_node'
           hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue;
                       ~~~~~~~~~~~~~~~~~~~~  ^
   mm/vmscan.c:2804:24: error: no member named 'thp_reclaim' in 'struct mem_cgroup'
           if (!READ_ONCE(memcg->thp_reclaim))
                          ~~~~~  ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                            ^
   include/linux/compiler_types.h:290:10: note: expanded from macro '__native_word'
           (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
                   ^
   include/linux/compiler_types.h:322:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:310:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:302:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
   mm/vmscan.c:2804:24: error: no member named 'thp_reclaim' in 'struct mem_cgroup'
           if (!READ_ONCE(memcg->thp_reclaim))
                          ~~~~~  ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                            ^
   include/linux/compiler_types.h:290:39: note: expanded from macro '__native_word'
           (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
                                                ^
   include/linux/compiler_types.h:322:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:310:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:302:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
   mm/vmscan.c:2804:24: error: no member named 'thp_reclaim' in 'struct mem_cgroup'
           if (!READ_ONCE(memcg->thp_reclaim))
                          ~~~~~  ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                            ^
   include/linux/compiler_types.h:291:10: note: expanded from macro '__native_word'
            sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
                   ^
   include/linux/compiler_types.h:322:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:310:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:302:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
   mm/vmscan.c:2804:24: error: no member named 'thp_reclaim' in 'struct mem_cgroup'
           if (!READ_ONCE(memcg->thp_reclaim))
                          ~~~~~  ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                            ^
   include/linux/compiler_types.h:291:38: note: expanded from macro '__native_word'
            sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
                                               ^
   include/linux/compiler_types.h:322:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:310:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:302:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
   mm/vmscan.c:2804:24: error: no member named 'thp_reclaim' in 'struct mem_cgroup'
           if (!READ_ONCE(memcg->thp_reclaim))
                          ~~~~~  ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:48: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                                         ^
   include/linux/compiler_types.h:322:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:310:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:302:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
   mm/vmscan.c:2804:24: error: no member named 'thp_reclaim' in 'struct mem_cgroup'
           if (!READ_ONCE(memcg->thp_reclaim))
                          ~~~~~  ^
   include/asm-generic/rwonce.h:50:14: note: expanded from macro 'READ_ONCE'
           __READ_ONCE(x);                                                 \
                       ^
   include/asm-generic/rwonce.h:44:65: note: expanded from macro '__READ_ONCE'
   #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
                                                                    ^
   include/linux/compiler_types.h:279:13: note: expanded from macro '__unqual_scalar_typeof'
                   _Generic((x),                                           \
                             ^
   mm/vmscan.c:2804:24: error: no member named 'thp_reclaim' in 'struct mem_cgroup'
           if (!READ_ONCE(memcg->thp_reclaim))
                          ~~~~~  ^
   include/asm-generic/rwonce.h:50:14: note: expanded from macro 'READ_ONCE'
           __READ_ONCE(x);                                                 \
                       ^
   include/asm-generic/rwonce.h:44:65: note: expanded from macro '__READ_ONCE'
   #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
                                                                    ^
   include/linux/compiler_types.h:286:15: note: expanded from macro '__unqual_scalar_typeof'
                            default: (x)))
                                      ^
   mm/vmscan.c:2804:24: error: no member named 'thp_reclaim' in 'struct mem_cgroup'
           if (!READ_ONCE(memcg->thp_reclaim))
                          ~~~~~  ^
   include/asm-generic/rwonce.h:50:14: note: expanded from macro 'READ_ONCE'
           __READ_ONCE(x);                                                 \
                       ^
   include/asm-generic/rwonce.h:44:72: note: expanded from macro '__READ_ONCE'
   #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
                                                                           ^
   mm/vmscan.c:2804:6: error: invalid argument type 'void' to unary expression
           if (!READ_ONCE(memcg->thp_reclaim))
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/vmscan.c:2813:7: error: implicit declaration of function 'zsr_get_hpage' [-Werror,-Wimplicit-function-declaration]
                   if (zsr_get_hpage(hr_queue, &page))
                       ^
>> mm/vmscan.c:2819:19: error: implicit declaration of function 'zsr_reclaim_hpage' [-Werror,-Wimplicit-function-declaration]
                   nr_reclaimed += zsr_reclaim_hpage(lruvec, page);
                                   ^
   1 warning and 12 errors generated.


vim +2803 mm/vmscan.c

  2784	
  2785	#ifdef CONFIG_MEMCG
  2786	#define MAX_SCAN_HPAGE 32UL
  2787	/*
  2788	 * Try to reclaim the zero subpages for the transparent huge page.
  2789	 */
  2790	static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
  2791							 int priority,
  2792							 unsigned long nr_to_reclaim)
  2793	{
  2794		struct mem_cgroup *memcg;
  2795		struct hpage_reclaim *hr_queue;
  2796		int nid = lruvec->pgdat->node_id;
  2797		unsigned long nr_reclaimed = 0, nr_scanned = 0, nr_to_scan;
  2798	
  2799		memcg = lruvec_memcg(lruvec);
  2800		if (!memcg)
  2801			goto out;
  2802	
> 2803		hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue;
  2804		if (!READ_ONCE(memcg->thp_reclaim))
  2805			goto out;
  2806	
  2807		/* The last scan loop will scan all the huge pages.*/
  2808		nr_to_scan = priority == 0 ? 0 : MAX_SCAN_HPAGE;
  2809	
  2810		do {
  2811			struct page *page = NULL;
  2812	
> 2813			if (zsr_get_hpage(hr_queue, &page))
  2814				break;
  2815	
  2816			if (!page)
  2817				continue;
  2818	
> 2819			nr_reclaimed += zsr_reclaim_hpage(lruvec, page);
  2820	
  2821			cond_resched();
  2822	
  2823		} while ((nr_reclaimed < nr_to_reclaim) && (++nr_scanned != nr_to_scan));
  2824	out:
  2825		return nr_reclaimed;
  2826	}
  2827	#else
  2828	static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
  2829							 int priority,
  2830							 unsigned long nr_to_reclaim)
  2831	{
  2832		return 0;
  2833	}
  2834	#endif
  2835	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29712 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 2/6] mm, thp: add a global interface for zero subapges reclaim
  2021-10-28 11:56 ` [RFC 2/6] mm, thp: add a global interface for zero subapges reclaim Ning Zhang
@ 2021-10-29  0:44     ` kernel test robot
  0 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2021-10-29  0:44 UTC (permalink / raw)
  To: Ning Zhang; +Cc: llvm, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4822 bytes --]

Hi Ning,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[also build test ERROR on v5.15-rc7]
[cannot apply to hnaz-mm/master next-20211028]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ning-Zhang/Reclaim-zero-subpages-of-thp-to-avoid-memory-bloat/20211028-200001
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 1fc596a56b334f4d593a2b49e5ff55af6aaa0816
config: arm-randconfig-c002-20211028 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 5db7568a6a1fcb408eb8988abdaff2a225a8eb72)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/4111289b6a222d9c2aeb7c593f0f52e0c38f3247
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ning-Zhang/Reclaim-zero-subpages-of-thp-to-avoid-memory-bloat/20211028-200001
        git checkout 4111289b6a222d9c2aeb7c593f0f52e0c38f3247
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/vmscan.c:1340:6: warning: variable 'err' set but not used [-Wunused-but-set-variable]
           int err;
               ^
   mm/vmscan.c:2803:36: error: no member named 'hpage_reclaim_queue' in 'struct mem_cgroup_per_node'
           hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue;
                       ~~~~~~~~~~~~~~~~~~~~  ^
>> mm/vmscan.c:2804:6: error: implicit declaration of function 'get_thp_reclaim_mode' [-Werror,-Wimplicit-function-declaration]
           if (get_thp_reclaim_mode(memcg) == THP_RECLAIM_DISABLE)
               ^
>> mm/vmscan.c:2804:37: error: use of undeclared identifier 'THP_RECLAIM_DISABLE'; did you mean 'WB_RECLAIMABLE'?
           if (get_thp_reclaim_mode(memcg) == THP_RECLAIM_DISABLE)
                                              ^~~~~~~~~~~~~~~~~~~
                                              WB_RECLAIMABLE
   include/linux/backing-dev-defs.h:37:2: note: 'WB_RECLAIMABLE' declared here
           WB_RECLAIMABLE,
           ^
   mm/vmscan.c:2813:7: error: implicit declaration of function 'zsr_get_hpage' [-Werror,-Wimplicit-function-declaration]
                   if (zsr_get_hpage(hr_queue, &page))
                       ^
   mm/vmscan.c:2819:19: error: implicit declaration of function 'zsr_reclaim_hpage' [-Werror,-Wimplicit-function-declaration]
                   nr_reclaimed += zsr_reclaim_hpage(lruvec, page);
                                   ^
   1 warning and 5 errors generated.


vim +/get_thp_reclaim_mode +2804 mm/vmscan.c

  2784	
  2785	#ifdef CONFIG_MEMCG
  2786	#define MAX_SCAN_HPAGE 32UL
  2787	/*
  2788	 * Try to reclaim the zero subpages for the transparent huge page.
  2789	 */
  2790	static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
  2791							 int priority,
  2792							 unsigned long nr_to_reclaim)
  2793	{
  2794		struct mem_cgroup *memcg;
  2795		struct hpage_reclaim *hr_queue;
  2796		int nid = lruvec->pgdat->node_id;
  2797		unsigned long nr_reclaimed = 0, nr_scanned = 0, nr_to_scan;
  2798	
  2799		memcg = lruvec_memcg(lruvec);
  2800		if (!memcg)
  2801			goto out;
  2802	
  2803		hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue;
> 2804		if (get_thp_reclaim_mode(memcg) == THP_RECLAIM_DISABLE)
  2805			goto out;
  2806	
  2807		/* The last scan loop will scan all the huge pages.*/
  2808		nr_to_scan = priority == 0 ? 0 : MAX_SCAN_HPAGE;
  2809	
  2810		do {
  2811			struct page *page = NULL;
  2812	
  2813			if (zsr_get_hpage(hr_queue, &page))
  2814				break;
  2815	
  2816			if (!page)
  2817				continue;
  2818	
  2819			nr_reclaimed += zsr_reclaim_hpage(lruvec, page);
  2820	
  2821			cond_resched();
  2822	
  2823		} while ((nr_reclaimed < nr_to_reclaim) && (++nr_scanned != nr_to_scan));
  2824	out:
  2825		return nr_reclaimed;
  2826	}
  2827	#else
  2828	static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
  2829							 int priority,
  2830							 unsigned long nr_to_reclaim)
  2831	{
  2832		return 0;
  2833	}
  2834	#endif
  2835	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29712 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
  2021-10-28 14:13 ` [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Kirill A. Shutemov
@ 2021-10-29 12:07   ` ning zhang
  2021-10-29 16:56     ` Yang Shi
  0 siblings, 1 reply; 21+ messages in thread
From: ning zhang @ 2021-10-29 12:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Yu Zhao, Gang Deng


On 2021/10/28 10:13 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
>> As we know, thp may lead to memory bloat which may cause OOM.
>> Through testing with some apps, we found that the reason of
>> memory bloat is a huge page may contain some zero subpages
>> (may accessed or not). And we found that most zero subpages
>> are centralized in a few huge pages.
>>
>> Following is a text_classification_rnn case for tensorflow:
>>
>>    zero_subpages   huge_pages  waste
>>    [     0,     1) 186         0.00%
>>    [     1,     2) 23          0.01%
>>    [     2,     4) 36          0.02%
>>    [     4,     8) 67          0.08%
>>    [     8,    16) 80          0.23%
>>    [    16,    32) 109         0.61%
>>    [    32,    64) 44          0.49%
>>    [    64,   128) 12          0.30%
>>    [   128,   256) 28          1.54%
>>    [   256,   513) 159        18.03%
>>
>> In the case, there are 187 huge pages (25% of the total huge pages)
>> which contain more then 128 zero subpages. And these huge pages
>> lead to 19.57% waste of the total rss. It means we can reclaim
>> 19.57% memory by splitting the 187 huge pages and reclaiming the
>> zero subpages.
>>
>> This patchset introduce a new mechanism to split the huge page
>> which has zero subpages and reclaim these zero subpages.
>>
>> We add the anonymous huge page to a list to reduce the cost of
>> finding the huge page. When the memory reclaim is triggering,
>> the list will be walked and the huge page contains enough zero
>> subpages may be reclaimed. Meanwhile, replace the zero subpages
>> by ZERO_PAGE(0).
> Does it actually help your workload?
>
> I mean this will only be triggered via vmscan, which was going to split
> pages and free them anyway.
>
> You prioritize splitting THP and freeing zero subpages over reclaiming
> other pages. It may or may not be the right thing to do, depending on
> the workload.
>
> Maybe it makes more sense to check for all-zero pages just after
> split_huge_page_to_list() in vmscan and free such pages immediately
> rather than add all this complexity?
>
The purpose of zero subpages reclaim (ZSR) is to pick out the huge pages
that contain wasted (zero) subpages and reclaim them.

We do this for two reasons:
1. If swap is off, anonymous pages will not be scanned, so we don't have
   the opportunity to split the huge page. ZSR helps in this case.
2. If swap is on, splitting first will not only split the huge page, but
   also swap out the nonzero subpages, while ZSR will only split the huge
   page. Splitting first results in more performance degradation. If ZSR
   can't reclaim enough pages, swap can still work.

Why use a separate ZSR list instead of the default LRU list?

Because it may cause high CPU overhead to scan for target huge pages when
there are a lot of both regular and huge pages, and it can be especially
bad when swap is off, since we may scan the whole LRU list many times. A
huge page is deleted from the ZSR list when it is scanned, so each page is
scanned only once. The LRU list is hard to use here, because new pages may
be added to it continuously while we are scanning.

Also, we can decrease the priority to prioritize reclaiming file-backed
pages; for example, only trigger ZSR when the priority is less than 4, as
in the sketch below.
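(An illustrative sketch only — hypothetical gating in the shrink path, not
code from this series:)

	/* DEF_PRIORITY is 12; lower values mean higher reclaim pressure. */
	if (sc->priority < 4)
		nr_reclaimed += reclaim_hpage_zero_subpages(lruvec,
					sc->priority, nr_to_reclaim);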
>> Yu Zhao has done some similar work when the huge page is swap out
>> or migrated to accelerate[1]. While we do this in the normal memory
>> shrink path for the swapoff scene to avoid OOM.
>>
>> In the future, we will do the proactive reclaim to reclaim the "cold"
>> huge page proactively. This is for keeping the performance of thp as
>> for as possible. In addition to that, some users want the memory usage
>> using thp is equal to the usage using 4K.
> Proactive reclaim can be harmful if your max_ptes_none allows khugepaged
> to recreate the THP afterwards.
Thanks! We will consider it.
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 1/6] mm, thp: introduce thp zero subpages reclaim
  2021-10-28 12:53   ` Matthew Wilcox
@ 2021-10-29 12:16     ` ning zhang
  0 siblings, 0 replies; 21+ messages in thread
From: ning zhang @ 2021-10-29 12:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Yu Zhao


On 2021/10/28 8:53 PM, Matthew Wilcox wrote:
> On Thu, Oct 28, 2021 at 07:56:50PM +0800, Ning Zhang wrote:
>> +++ b/include/linux/huge_mm.h
>> @@ -185,6 +185,15 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>>   void free_transhuge_page(struct page *page);
>>   bool is_transparent_hugepage(struct page *page);
>>   
>> +#ifdef CONFIG_MEMCG
>> +int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page);
>> +unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page);
>> +static inline struct list_head *hpage_reclaim_list(struct page *page)
>> +{
>> +	return &page[3].hpage_reclaim_list;
>> +}
>> +#endif
> I don't think any of this needs to be under an ifdef.  That goes for a
> lot of your other additions to header files.
>
>> @@ -1110,6 +1121,10 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>>   						gfp_t gfp_mask,
>>   						unsigned long *total_scanned);
>>   
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +void del_hpage_from_queue(struct page *page);
>> +#endif
> That name is too generic.  Also, to avoid ifdefs in code, it should be:
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> void del_hpage_from_queue(struct page *page);
> #else
> static inline void del_hpage_from_queue(struct page *page) { }
> #endif
>
>> @@ -159,6 +159,12 @@ struct page {
>>   			/* For both global and memcg */
>>   			struct list_head deferred_list;
>>   		};
>> +		struct {	 /* Third tail page of compound page */
>> +			unsigned long _compound_pad_2;
>> +			unsigned long _compound_pad_3;
>> +			/* For zero subpages reclaim */
>> +			struct list_head hpage_reclaim_list;
> Why do you need _compound_pad_3 here?
>
>> +++ b/include/linux/mmzone.h
>> @@ -787,6 +787,12 @@ struct deferred_split {
>>   	struct list_head split_queue;
>>   	unsigned long split_queue_len;
>>   };
>> +
>> +struct hpage_reclaim {
>> +	spinlock_t reclaim_queue_lock;
>> +	struct list_head reclaim_queue;
>> +	unsigned long reclaim_queue_len;
>> +};
> Have you considered using an XArray instead of a linked list?
>
>> +static bool hpage_estimate_zero(struct page *page)
>> +{
>> +	unsigned int i, maybe_zero_pages = 0, offset = 0;
>> +	void *addr;
>> +
>> +#define BYTES_PER_LONG (BITS_PER_LONG / BITS_PER_BYTE)
> BYTES_PER_LONG is simply sizeof(long).
> Also, I'd check the entire cacheline rather than just one word; it's
> essentially free.
>
>> +#ifdef CONFIG_MMU
>> +#define ZSR_PG_MLOCK(flag)	(1UL << flag)
>> +#else
>> +#define ZSR_PG_MLOCK(flag)	0
>> +#endif
> Or use __PG_MLOCKED ?
>
>> +#ifdef CONFIG_ARCH_USES_PG_UNCACHED
>> +#define ZSR_PG_UNCACHED(flag)	(1UL << flag)
>> +#else
>> +#define ZSR_PG_UNCACHED(flag)	0
>> +#endif
> Define __PG_UNCACHED in page-flags.h?
>
>> +#ifdef CONFIG_MEMORY_FAILURE
>> +#define ZSR_PG_HWPOISON(flag)	(1UL << flag)
>> +#else
>> +#define ZSR_PG_HWPOISON(flag)	0
>> +#endif
> __PG_HWPOISON
>
>> +#define hr_queue_list_to_page(head) \
>> +	compound_head(list_entry((head)->prev, struct page,\
>> +		      hpage_reclaim_list))
> I think you're better off subtracting 3*sizeof(struct page) than
> loading from compound_head.
>
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +/* Need the page lock if the page is not a newly allocated page. */
>> +static void add_hpage_to_queue(struct page *page, struct mem_cgroup *memcg)
>> +{
>> +	struct hpage_reclaim *hr_queue;
>> +	unsigned long flags;
>> +
>> +	if (READ_ONCE(memcg->thp_reclaim) == THP_RECLAIM_DISABLE)
>> +		return;
>> +
>> +	page = compound_head(page);
> Why do you think the caller might be passing in a tail page here?

Thanks for the comments!  I will modify it.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
  2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
                   ` (6 preceding siblings ...)
  2021-10-28 14:13 ` [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Kirill A. Shutemov
@ 2021-10-29 13:38 ` Michal Hocko
  2021-10-29 16:12   ` ning zhang
  7 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2021-10-29 13:38 UTC (permalink / raw)
  To: Ning Zhang
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vladimir Davydov, Yu Zhao

On Thu 28-10-21 19:56:49, Ning Zhang wrote:
> As we know, thp may lead to memory bloat which may cause OOM.
> Through testing with some apps, we found that the reason of
> memory bloat is a huge page may contain some zero subpages
> (may accessed or not). And we found that most zero subpages
> are centralized in a few huge pages.
> 
> Following is a text_classification_rnn case for tensorflow:
> 
>   zero_subpages   huge_pages  waste
>   [     0,     1) 186         0.00%
>   [     1,     2) 23          0.01%
>   [     2,     4) 36          0.02%
>   [     4,     8) 67          0.08%
>   [     8,    16) 80          0.23%
>   [    16,    32) 109         0.61%
>   [    32,    64) 44          0.49%
>   [    64,   128) 12          0.30%
>   [   128,   256) 28          1.54%
>   [   256,   513) 159        18.03%
> 
> In the case, there are 187 huge pages (25% of the total huge pages)
> which contain more then 128 zero subpages. And these huge pages
> lead to 19.57% waste of the total rss. It means we can reclaim
> 19.57% memory by splitting the 187 huge pages and reclaiming the
> zero subpages.

What is the THP policy configuration in your testing? I assume you are
using the defaults, right? That would be "always" for THP and "madvise"
for defrag. Would it make more sense to use madvise mode for THP for your
workload? The THP code is rather complex, and just by looking at the
diffstat this adds quite a lot on top. Is this really worth it?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
  2021-10-29 13:38 ` Michal Hocko
@ 2021-10-29 16:12   ` ning zhang
  2021-11-01  9:20     ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: ning zhang @ 2021-10-29 16:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vladimir Davydov, Yu Zhao


On 2021/10/29 9:38 PM, Michal Hocko wrote:
> On Thu 28-10-21 19:56:49, Ning Zhang wrote:
>> As we know, thp may lead to memory bloat which may cause OOM.
>> Through testing with some apps, we found that the reason of
>> memory bloat is a huge page may contain some zero subpages
>> (may accessed or not). And we found that most zero subpages
>> are centralized in a few huge pages.
>>
>> Following is a text_classification_rnn case for tensorflow:
>>
>>    zero_subpages   huge_pages  waste
>>    [     0,     1) 186         0.00%
>>    [     1,     2) 23          0.01%
>>    [     2,     4) 36          0.02%
>>    [     4,     8) 67          0.08%
>>    [     8,    16) 80          0.23%
>>    [    16,    32) 109         0.61%
>>    [    32,    64) 44          0.49%
>>    [    64,   128) 12          0.30%
>>    [   128,   256) 28          1.54%
>>    [   256,   513) 159        18.03%
>>
>> In the case, there are 187 huge pages (25% of the total huge pages)
>> which contain more then 128 zero subpages. And these huge pages
>> lead to 19.57% waste of the total rss. It means we can reclaim
>> 19.57% memory by splitting the 187 huge pages and reclaiming the
>> zero subpages.
> What is the THP policy configuration in your testing? I assume you are
> using the defaults, right? That would be "always" for THP and "madvise"
> for defrag. Would it make more sense to use madvise mode for THP for your
> workload? The THP code is rather complex, and just by looking at the
> diffstat this adds quite a lot on top. Is this really worth it?

The THP configuration is "always".

Madvise requires users to set MADV_HUGEPAGE themselves if they want to use
huge pages, but many users don't set this, and they can't control it well.

In Java, for example, users can set the heap and metaspace to use huge
pages with madvise, but memory bloat still occurs, and users still need to
test whether their app can accept the waste.

For the case above, if we set the THP configuration to madvise, all the
pages it uses will be 4K pages. (A minimal userspace sketch of the madvise
opt-in is shown below.)

Memory bloat is one of the most important reasons users disable THP. We do
this work so that THP can be enabled by default more widely.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
  2021-10-29 12:07   ` ning zhang
@ 2021-10-29 16:56     ` Yang Shi
  2021-11-01  2:50       ` ning zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Yang Shi @ 2021-10-29 16:56 UTC (permalink / raw)
  To: ning zhang
  Cc: Kirill A. Shutemov, Linux MM, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Yu Zhao, Gang Deng

On Fri, Oct 29, 2021 at 5:08 AM ning zhang <ningzhang@linux.alibaba.com> wrote:
>
>
> On 2021/10/28 10:13 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
> >> As we know, thp may lead to memory bloat which may cause OOM.
> >> Through testing with some apps, we found that the reason of
> >> memory bloat is a huge page may contain some zero subpages
> >> (may accessed or not). And we found that most zero subpages
> >> are centralized in a few huge pages.
> >>
> >> Following is a text_classification_rnn case for tensorflow:
> >>
> >>    zero_subpages   huge_pages  waste
> >>    [     0,     1) 186         0.00%
> >>    [     1,     2) 23          0.01%
> >>    [     2,     4) 36          0.02%
> >>    [     4,     8) 67          0.08%
> >>    [     8,    16) 80          0.23%
> >>    [    16,    32) 109         0.61%
> >>    [    32,    64) 44          0.49%
> >>    [    64,   128) 12          0.30%
> >>    [   128,   256) 28          1.54%
> >>    [   256,   513) 159        18.03%
> >>
> >> In the case, there are 187 huge pages (25% of the total huge pages)
> >> which contain more then 128 zero subpages. And these huge pages
> >> lead to 19.57% waste of the total rss. It means we can reclaim
> >> 19.57% memory by splitting the 187 huge pages and reclaiming the
> >> zero subpages.
> >>
> >> This patchset introduce a new mechanism to split the huge page
> >> which has zero subpages and reclaim these zero subpages.
> >>
> >> We add the anonymous huge page to a list to reduce the cost of
> >> finding the huge page. When the memory reclaim is triggering,
> >> the list will be walked and the huge page contains enough zero
> >> subpages may be reclaimed. Meanwhile, replace the zero subpages
> >> by ZERO_PAGE(0).
> > Does it actually help your workload?
> >
> > I mean this will only be triggered via vmscan that was going to split
> > pages and free anyway.
> >
> > You prioritize splitting THP and freeing zero subpages over reclaiming
> > other pages. It may or may not be the right thing to do, depending on
> > workload.
> >
> > Maybe it makes more sense to check for all-zero pages just after
> > split_huge_page_to_list() in vmscan and free such pages immediately rather
> > than adding all this complexity?
> >
> The purpose of zero subpages reclaim (ZSR) is to pick out the huge
> pages which have waste and reclaim them.
>
> We do this for two reasons:
> 1. If swap is off, anonymous pages will not be scanned, and we don't
>    have the opportunity to split the huge page. ZSR can be helpful
>    for this.
> 2. If swap is on, splitting first will not only split the huge page,
>    but also swap out the nonzero subpages, while ZSR will only split
>    the huge page. Splitting first will result in more performance
>    degradation. If ZSR can't reclaim enough pages, swap can still
>    work.
>
> Why use a separate ZSR list instead of the default LRU list?
>
> Because it may cause high CPU overhead to scan for target huge pages
> if there are a lot of both regular and huge pages, and it may be
> especially bad when swap is off, where we may scan the whole LRU list
> many times. A huge page is deleted from the ZSR list once it has been
> scanned, so each page is scanned only once. It's hard to use the LRU
> list, because new pages may keep being added to it while we scan.
>
> Also, we can decrease the priority to prioritize reclaiming
> file-backed pages. For example, only trigger ZSR when the priority is
> less than 4.

I'm not sure if this will help workloads in general or not. The
problem is it doesn't check whether the huge page is "hot" or not. It
just picks up the first huge page from the list, which seems to be a
FIFO list IIUC. But if the huge page is "hot", even though there is
some internal access imbalance, it may be better to keep the huge page
since the performance gain may outweigh the memory saving. And if the
huge page is not "hot", then I think the question is why it was a THP
in the first place.

Let's step back to think about whether allocating THP upon first
access for such an area or workload is good or not. We should be able
to check the access imbalance at allocation stage instead of reclaim
stage. Currently anonymous THP just supports 3 modes: always, madvise
and none. Both always and madvise try to allocate THP in the page
fault path (assuming anonymous THP) upon first access. I'm wondering
if we could add a "defer" mode or not. It defers THP
allocation/collapse to khugepaged instead of the page fault path. Then
all the knobs used by khugepaged could be applied, particularly
max_ptes_none in your case. You could set a low max_ptes_none if you
prefer memory saving. IMHO, this seems much simpler than scanning a
list (which may be quite long) to find a suitable candidate, then
splitting, then replacing with the zero page.
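
For context, the existing khugepaged gate is roughly the following (a
simplified sketch of the mm/khugepaged.c logic; locking and other
state elided):

    /* Simplified sketch: khugepaged does not collapse a PMD range
     * containing more than max_ptes_none empty PTEs, so a low setting
     * avoids creating mostly-empty THPs. The real code in
     * mm/khugepaged.c tracks more state than this. */
    static bool worth_collapsing(pte_t *ptep)
    {
            int i, none = 0;

            for (i = 0; i < HPAGE_PMD_NR; i++, ptep++) {
                    if (pte_none(*ptep) && ++none > khugepaged_max_ptes_none)
                            return false;   /* too sparse to collapse */
            }
            return true;
    }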

Of course this may have some potential performance impact since the
THP installation is delayed for some time. This could be optimized by
respecting MADV_HUGEPAGE.

Anyway, just some wild idea.

> >> Yu Zhao has done some similar work when the huge page is swap out
> >> or migrated to accelerate[1]. While we do this in the normal memory
> >> shrink path for the swapoff scene to avoid OOM.
> >>
> >> In the future, we will do the proactive reclaim to reclaim the "cold"
> >> huge page proactively. This is for keeping the performance of thp as
> >> for as possible. In addition to that, some users want the memory usage
> >> using thp is equal to the usage using 4K.
> > Proactive reclaim can be harmful if your max_ptes_none allows to recreate
> > THP back.
> Thanks! We will consider it.
> >
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
  2021-10-29 16:56     ` Yang Shi
@ 2021-11-01  2:50       ` ning zhang
  0 siblings, 0 replies; 21+ messages in thread
From: ning zhang @ 2021-11-01  2:50 UTC (permalink / raw)
  To: Yang Shi
  Cc: Kirill A. Shutemov, Linux MM, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Yu Zhao, Gang Deng


On 2021/10/30 at 12:56 AM, Yang Shi wrote:
> On Fri, Oct 29, 2021 at 5:08 AM ning zhang <ningzhang@linux.alibaba.com> wrote:
>>
>> On 2021/10/28 at 10:13 PM, Kirill A. Shutemov wrote:
>>> On Thu, Oct 28, 2021 at 07:56:49PM +0800, Ning Zhang wrote:
>>>> As we know, thp may lead to memory bloat which may cause OOM.
>>>> Through testing with some apps, we found that the reason of
>>>> memory bloat is a huge page may contain some zero subpages
>>>> (may accessed or not). And we found that most zero subpages
>>>> are centralized in a few huge pages.
>>>>
>>>> Following is a text_classification_rnn case for tensorflow:
>>>>
>>>>     zero_subpages   huge_pages  waste
>>>>     [     0,     1) 186         0.00%
>>>>     [     1,     2) 23          0.01%
>>>>     [     2,     4) 36          0.02%
>>>>     [     4,     8) 67          0.08%
>>>>     [     8,    16) 80          0.23%
>>>>     [    16,    32) 109         0.61%
>>>>     [    32,    64) 44          0.49%
>>>>     [    64,   128) 12          0.30%
>>>>     [   128,   256) 28          1.54%
>>>>     [   256,   513) 159        18.03%
>>>>
>>>> In the case, there are 187 huge pages (25% of the total huge pages)
>>>> which contain more then 128 zero subpages. And these huge pages
>>>> lead to 19.57% waste of the total rss. It means we can reclaim
>>>> 19.57% memory by splitting the 187 huge pages and reclaiming the
>>>> zero subpages.
>>>>
>>>> This patchset introduce a new mechanism to split the huge page
>>>> which has zero subpages and reclaim these zero subpages.
>>>>
>>>> We add the anonymous huge page to a list to reduce the cost of
>>>> finding the huge page. When the memory reclaim is triggering,
>>>> the list will be walked and the huge page contains enough zero
>>>> subpages may be reclaimed. Meanwhile, replace the zero subpages
>>>> by ZERO_PAGE(0).
>>> Does it actually help your workload?
>>>
>>> I mean this will only be triggered via vmscan that was going to split
>>> pages and free anyway.
>>>
>>> You prioritize splitting THP and freeing zero subpages over reclaiming
>>> other pages. It may or may not be the right thing to do, depending on
>>> workload.
>>>
>>> Maybe it makes more sense to check for all-zero pages just after
>>> split_huge_page_to_list() in vmscan and free such pages immediately rather
>>> than adding all this complexity?
>>>
>> The purpose of zero subpages reclaim (ZSR) is to pick out the huge
>> pages which have waste and reclaim them.
>>
>> We do this for two reasons:
>> 1. If swap is off, anonymous pages will not be scanned, and we don't
>>    have the opportunity to split the huge page. ZSR can be helpful
>>    for this.
>> 2. If swap is on, splitting first will not only split the huge page,
>>    but also swap out the nonzero subpages, while ZSR will only split
>>    the huge page. Splitting first will result in more performance
>>    degradation. If ZSR can't reclaim enough pages, swap can still
>>    work.
>>
>> Why use a separate ZSR list instead of the default LRU list?
>>
>> Because it may cause high CPU overhead to scan for target huge pages
>> if there are a lot of both regular and huge pages, and it may be
>> especially bad when swap is off, where we may scan the whole LRU list
>> many times. A huge page is deleted from the ZSR list once it has been
>> scanned, so each page is scanned only once. It's hard to use the LRU
>> list, because new pages may keep being added to it while we scan.
>>
>> Also, we can decrease the priority to prioritize reclaiming
>> file-backed pages. For example, only trigger ZSR when the priority is
>> less than 4.
> I'm not sure if this will help workloads in general or not. The
> problem is it doesn't check whether the huge page is "hot" or not. It
> just picks up the first huge page from the list, which seems to be a
> FIFO list IIUC. But if the huge page is "hot", even though there is
> some internal access imbalance, it may be better to keep the huge
> page since the performance gain may outweigh the memory saving. And
> if the huge page is not "hot", then I think the question is why it
> was a THP in the first place.
We don't split all the huge pages, only those that contain enough
zero subpages. It's hard to check whether an anonymous page is hot or
cold, and we are working on that.
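
To make "enough zero subpages" concrete, the check is conceptually
like the following (a hedged sketch, not the patch's actual code;
thp_zero_subpages() is a made-up name):

    /* Conceptual sketch only: count the zero-filled 4K subpages of a
     * THP. thp_zero_subpages() is a made-up name for illustration. */
    #include <linux/highmem.h>
    #include <linux/huge_mm.h>
    #include <linux/string.h>

    static int thp_zero_subpages(struct page *head)
    {
            int i, zeros = 0;

            for (i = 0; i < HPAGE_PMD_NR; i++) {
                    void *addr = kmap_local_page(head + i);

                    /* memchr_inv() returns NULL if all bytes are zero */
                    if (!memchr_inv(addr, 0, PAGE_SIZE))
                            zeros++;
                    kunmap_local(addr);
            }
            return zeros;
    }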

We scan at most 32 huge pages per round when reclaiming, except in
the last loop. I think we can start ZSR only when the priority is 1
or 2, or maybe only when the priority is 0; at that point, if we
don't start ZSR, the process will be killed by the OOM killer.
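
A rough sketch of the gate I have in mind (hypothetical;
zsr_enabled() and reclaim_zsr_list() are placeholder names, not the
actual patch interfaces):

    /* Hypothetical sketch: fall back to ZSR only when reclaim is
     * nearly desperate. zsr_enabled() and reclaim_zsr_list() are
     * placeholders, not the patch's real interfaces. */
    #define ZSR_SCAN_BATCH  32

    static void maybe_reclaim_zero_subpages(struct lruvec *lruvec,
                                            struct scan_control *sc)
    {
            if (!zsr_enabled() || sc->priority > 2)
                    return;

            /* the last loop (priority 0) scans without the batch limit */
            reclaim_zsr_list(lruvec, sc->priority ? ZSR_SCAN_BATCH : INT_MAX);
    }
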
>
> Let's step back to think about whether allocating THP upon first
> access for such an area or workload is good or not. We should be able
> to check the access imbalance at allocation stage instead of reclaim
> stage. Currently anonymous THP just supports 3 modes: always, madvise
> and none. Both always and madvise try to allocate THP in the page
> fault path (assuming anonymous THP) upon first access. I'm wondering
> if we could add a "defer" mode or not. It defers THP
> allocation/collapse to khugepaged instead of the page fault path.
> Then all the knobs used by khugepaged could be applied, particularly
> max_ptes_none in your case. You could set a low max_ptes_none if you
> prefer memory saving. IMHO, this seems much simpler than scanning a
> list (which may be quite long) to find a suitable candidate, then
> splitting, then replacing with the zero page.
>
> Of course this may have some potential performance impact since the
> THP installation is delayed for some time. This could be optimized by
> respecting MADV_HUGEPAGE.
>
> Anyway, just some wild idea.
>
>>>> Yu Zhao has done some similar work when the huge page is swap out
>>>> or migrated to accelerate[1]. While we do this in the normal memory
>>>> shrink path for the swapoff scene to avoid OOM.
>>>>
>>>> In the future, we will do the proactive reclaim to reclaim the "cold"
>>>> huge page proactively. This is for keeping the performance of thp as
>>>> for as possible. In addition to that, some users want the memory usage
>>>> using thp is equal to the usage using 4K.
>>> Proactive reclaim can be harmful if your max_ptes_none allows to recreate
>>> THP back.
>> Thanks! We will consider it.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
  2021-10-29 16:12   ` ning zhang
@ 2021-11-01  9:20     ` Michal Hocko
  2021-11-08  3:24       ` ning zhang
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2021-11-01  9:20 UTC (permalink / raw)
  To: ning zhang
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vladimir Davydov, Yu Zhao

On Sat 30-10-21 00:12:53, ning zhang wrote:
> 
> On 2021/10/29 at 9:38 PM, Michal Hocko wrote:
> > On Thu 28-10-21 19:56:49, Ning Zhang wrote:
> > > As we know, thp may lead to memory bloat which may cause OOM.
> > > Through testing with some apps, we found that the reason of
> > > memory bloat is a huge page may contain some zero subpages
> > > (may accessed or not). And we found that most zero subpages
> > > are centralized in a few huge pages.
> > > 
> > > Following is a text_classification_rnn case for tensorflow:
> > > 
> > >    zero_subpages   huge_pages  waste
> > >    [     0,     1) 186         0.00%
> > >    [     1,     2) 23          0.01%
> > >    [     2,     4) 36          0.02%
> > >    [     4,     8) 67          0.08%
> > >    [     8,    16) 80          0.23%
> > >    [    16,    32) 109         0.61%
> > >    [    32,    64) 44          0.49%
> > >    [    64,   128) 12          0.30%
> > >    [   128,   256) 28          1.54%
> > >    [   256,   513) 159        18.03%
> > > 
> > > In the case, there are 187 huge pages (25% of the total huge pages)
> > > which contain more then 128 zero subpages. And these huge pages
> > > lead to 19.57% waste of the total rss. It means we can reclaim
> > > 19.57% memory by splitting the 187 huge pages and reclaiming the
> > > zero subpages.
> > What is the THP policy configuration in your testing? I assume you are
> > using defaults right? That would be always for THP and madvise for
> > defrag. Would it make more sense to use madvise mode for THP for your
> > workload? The THP code is rather complex and just by looking at the
> > diffstat this adds quite a lot on top. Is this really worth it?
> 
> The THP configuration is "always".
>
> Madvise requires users to set MADV_HUGEPAGE themselves if they want
> to use huge pages, but many users don't set it, and they can't
> control it well.

What do you mean they can't control this well?

> Take Java as an example: users can set the heap and metaspace to use
> huge pages with madvise, but memory bloat still occurs. Users still
> need to test whether their app can tolerate the waste.

There will always be some internal fragmentation when huge pages are
used. The amount will depend on how well the memory is used but huge
pages give a performance boost in return.

If the memory bloat is a significant problem then overeager THP usage
is certainly not good and I would argue that applying the THP always
policy is not a proper configuration. No matter how much the MM code
tries to fix up the situation, it will always be a catch-up game.
 
> For the case above, if we set the THP configuration to madvise, all
> the pages it uses will be 4K pages.
> 
> Memory bloat is one of the most important reasons users disable THP.
> We do this work to help THP become enabled by default.

To my knowledge the most popular reason to disable THP is the runtime
overhead. A large part of that overhead has been reduced by not doing
heavy compaction during the page fault allocations by default. Memory
overhead is certainly an important aspect as well but there is always
a possibility to reduce that by reducing it to madvised regions for
page fault (i.e. those where the author of the code has considered the
costs vs. benefits of the huge page) and setting up a conservative
khugepaged policy. So there are existing tools available. You are trying
to add quite a lot of code so you should have good arguments to add more
complexity. I am not sure that popularizing THP is a strong one TBH.
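
Concretely, the existing tools amount to something like this (an
illustrative userspace sketch; the paths are the stock THP sysfs
knobs, and 64 is just an example value):

    /* Illustrative sketch: apply a conservative THP policy via the
     * stock sysfs knobs; no new kernel code is needed for this. */
    #include <stdio.h>

    static void set_knob(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (f) {
                    fputs(val, f);
                    fclose(f);
            }
    }

    int main(void)
    {
            /* THP only for madvised regions at fault time */
            set_knob("/sys/kernel/mm/transparent_hugepage/enabled", "madvise");
            /* no synchronous compaction stalls in the fault path */
            set_knob("/sys/kernel/mm/transparent_hugepage/defrag", "defer");
            /* collapse only mostly-populated ranges (default is 511) */
            set_knob("/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none",
                     "64");
            return 0;
    }
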
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat
  2021-11-01  9:20     ` Michal Hocko
@ 2021-11-08  3:24       ` ning zhang
  0 siblings, 0 replies; 21+ messages in thread
From: ning zhang @ 2021-11-08  3:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vladimir Davydov, Yu Zhao



On 2021/11/1 at 5:20 PM, Michal Hocko wrote:
> On Sat 30-10-21 00:12:53, ning zhang wrote:
>> On 2021/10/29 at 9:38 PM, Michal Hocko wrote:
>>> On Thu 28-10-21 19:56:49, Ning Zhang wrote:
>>>> As we know, thp may lead to memory bloat which may cause OOM.
>>>> Through testing with some apps, we found that the reason of
>>>> memory bloat is a huge page may contain some zero subpages
>>>> (may accessed or not). And we found that most zero subpages
>>>> are centralized in a few huge pages.
>>>>
>>>> Following is a text_classification_rnn case for tensorflow:
>>>>
>>>>     zero_subpages   huge_pages  waste
>>>>     [     0,     1) 186         0.00%
>>>>     [     1,     2) 23          0.01%
>>>>     [     2,     4) 36          0.02%
>>>>     [     4,     8) 67          0.08%
>>>>     [     8,    16) 80          0.23%
>>>>     [    16,    32) 109         0.61%
>>>>     [    32,    64) 44          0.49%
>>>>     [    64,   128) 12          0.30%
>>>>     [   128,   256) 28          1.54%
>>>>     [   256,   513) 159        18.03%
>>>>
>>>> In the case, there are 187 huge pages (25% of the total huge pages)
>>>> which contain more then 128 zero subpages. And these huge pages
>>>> lead to 19.57% waste of the total rss. It means we can reclaim
>>>> 19.57% memory by splitting the 187 huge pages and reclaiming the
>>>> zero subpages.
>>> What is the THP policy configuration in your testing? I assume you are
>>> using defaults right? That would be always for THP and madvise for
>>> defrag. Would it make more sense to use madvise mode for THP for your
>>> workload? The THP code is rather complex and just by looking at the
>>> diffstat this adds quite a lot on top. Is this really worth it?
>> The THP configuration is "always".
>>
>> Madvise requires users to set MADV_HUGEPAGE themselves if they want
>> to use huge pages, but many users don't set it, and they can't
>> control it well.
> What do you mean they can't control this well?

I mean they don't know where they should use THP.

And even if they use madvise, memory bloat still exists.

>
>> Take Java as an example: users can set the heap and metaspace to use
>> huge pages with madvise, but memory bloat still occurs. Users still
>> need to test whether their app can tolerate the waste.
> There will always be some internal fragmentation when huge pages are
> used. The amount will depend on how well the memory is used but huge
> pages give a performance boost in return.
>
> If the memory bloat is a significant problem then overeager THP usage
> is certainly not good and I would argue that applying the THP always
> policy is not a proper configuration. No matter how much the MM code
> tries to fix up the situation, it will always be a catch-up game.
>   
>> For the case above, if we set the THP configuration to madvise, all
>> the pages it uses will be 4K pages.
>>
>> Memory bloat is one of the most important reasons users disable THP.
>> We do this work to help THP become enabled by default.
> To my knowledge the most popular reason to disable THP is the runtime
> overhead. A large part of that overhead has been reduced by not doing
> heavy compaction during the page fault allocations by default. Memory
> overhead is certainly an important aspect as well but there is always
> a possibility to reduce that by reducing it to madvised regions for
> page fault (i.e. those where the author of the code has considered the
> costs vs. benefits of the huge page) and setting up a conservative
> khugepaged policy. So there are existing tools available. You are trying
> to add quite a lot of code so you should have good arguments to add more
> complexity. I am not sure that popularizing THP is a strong one TBH.

Sorry for replying late. For compaction, we can set THP defrag to
defer or never to avoid the overhead produced by direct reclaim.
However, there is no way to reduce memory bloat.

If memory usage reaches the limit and we can't reclaim enough pages,
OOM will be triggered and the process will be killed. Our patchset
aims to avoid that OOM.

Much of the code is interface for controlling ZSR, and we will try to
reduce the complexity.



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
2021-10-28 11:56 ` [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Ning Zhang
2021-10-28 12:53   ` Matthew Wilcox
2021-10-29 12:16     ` ning zhang
2021-10-28 20:50   ` kernel test robot
2021-10-28 20:50     ` kernel test robot
2021-10-28 11:56 ` [RFC 2/6] mm, thp: add a global interface for zero subapges reclaim Ning Zhang
2021-10-29  0:44   ` kernel test robot
2021-10-29  0:44     ` kernel test robot
2021-10-28 11:56 ` [RFC 3/6] mm, thp: introduce zero subpages reclaim threshold Ning Zhang
2021-10-28 11:56 ` [RFC 4/6] mm, thp: introduce a controller to trigger zero subpages reclaim Ning Zhang
2021-10-28 11:56 ` [RFC 5/6] mm, thp: add some statistics for " Ning Zhang
2021-10-28 11:56 ` [RFC 6/6] mm, thp: add document " Ning Zhang
2021-10-28 14:13 ` [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Kirill A. Shutemov
2021-10-29 12:07   ` ning zhang
2021-10-29 16:56     ` Yang Shi
2021-11-01  2:50       ` ning zhang
2021-10-29 13:38 ` Michal Hocko
2021-10-29 16:12   ` ning zhang
2021-11-01  9:20     ` Michal Hocko
2021-11-08  3:24       ` ning zhang
