[PATCH 0/2] Add a new scheme to support demotion on tiered memory system

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] Add a new scheme to support demotion on tiered memory system
@ 2021-12-21  9:18 Baolin Wang
  2021-12-21  9:18 ` [PATCH 1/2] mm: Export the alloc_demote_page() function Baolin Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Baolin Wang @ 2021-12-21  9:18 UTC (permalink / raw)
  To: sj, akpm
  Cc: ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang,
	baolin.wang, linux-mm, linux-kernel

Hi,

Now on tiered memory system with different memory types, the reclaim path in
shrink_page_list() already support demoting pages to slow memory node instead
of discarding the pages. However, at that time the fast memory node memory
wartermark is already tense, which will increase the memory allocation latency
during page demotion. So a new method from user space demoting cold pages
proactively will be more helpful.

We can rely on the DAMON in user space to help to monitor the cold memory on
fast memory node, and demote the cold pages to slow memory node proactively to
keep the fast memory node in a healthy state.

This patch set introduces a new scheme named DAMOS_DEMOTE to support this feature,
and works well from my testing. Any comments are welcome. Thanks.

Baolin Wang (2):
  mm: Export the alloc_demote_page() function
  mm/damon: Add a new scheme to support demotion on tiered memory system

 include/linux/damon.h |   3 +
 mm/damon/dbgfs.c      |   1 +
 mm/damon/vaddr.c      | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h         |   1 +
 mm/vmscan.c           |   2 +-
 5 files changed, 162 insertions(+), 1 deletion(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] mm: Export the alloc_demote_page() function
  2021-12-21  9:18 [PATCH 0/2] Add a new scheme to support demotion on tiered memory system Baolin Wang
@ 2021-12-21  9:18 ` Baolin Wang
  2021-12-21 10:13   ` SeongJae Park
  2021-12-21  9:18 ` [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system Baolin Wang
  2021-12-21 13:26 ` [PATCH 0/2] " SeongJae Park
  2 siblings, 1 reply; 13+ messages in thread
From: Baolin Wang @ 2021-12-21  9:18 UTC (permalink / raw)
  To: sj, akpm
  Cc: ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang,
	baolin.wang, linux-mm, linux-kernel

Export the alloc_demote_page() function to the head file as a
preparation to support page demotion for DAMON monitor.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/internal.h | 1 +
 mm/vmscan.c   | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index deb9bda..99ea5fb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -181,6 +181,7 @@ static inline void set_page_refcounted(struct page *page)
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
 extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+extern struct page *alloc_demote_page(struct page *page, unsigned long node);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f3162a5..bf38327 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1458,7 +1458,7 @@ static void page_check_dirty_writeback(struct page *page,
 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+struct page *alloc_demote_page(struct page *page, unsigned long node)
 {
 	struct migration_target_control mtc = {
 		/*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm: Export the alloc_demote_page() function
  2021-12-21  9:18 ` [PATCH 1/2] mm: Export the alloc_demote_page() function Baolin Wang
@ 2021-12-21 10:13   ` SeongJae Park
  0 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2021-12-21 10:13 UTC (permalink / raw)
  To: Baolin Wang
  Cc: sj, akpm, ying.huang, dave.hansen, ziy, shy828301,
	zhongjiang-ali, xlpang, linux-mm, linux-kernel

On Tue, 21 Dec 2021 17:18:03 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> Export the alloc_demote_page() function to the head file as a
> preparation to support page demotion for DAMON monitor.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

> ---
>  mm/internal.h | 1 +
>  mm/vmscan.c   | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index deb9bda..99ea5fb 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -181,6 +181,7 @@ static inline void set_page_refcounted(struct page *page)
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
>  extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
> +extern struct page *alloc_demote_page(struct page *page, unsigned long node);
>  
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f3162a5..bf38327 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1458,7 +1458,7 @@ static void page_check_dirty_writeback(struct page *page,
>  		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
>  }
>  
> -static struct page *alloc_demote_page(struct page *page, unsigned long node)
> +struct page *alloc_demote_page(struct page *page, unsigned long node)
>  {
>  	struct migration_target_control mtc = {
>  		/*
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system
  2021-12-21  9:18 [PATCH 0/2] Add a new scheme to support demotion on tiered memory system Baolin Wang
  2021-12-21  9:18 ` [PATCH 1/2] mm: Export the alloc_demote_page() function Baolin Wang
@ 2021-12-21  9:18 ` Baolin Wang
  2021-12-21 13:24   ` SeongJae Park
  2021-12-21 13:26 ` [PATCH 0/2] " SeongJae Park
  2 siblings, 1 reply; 13+ messages in thread
From: Baolin Wang @ 2021-12-21  9:18 UTC (permalink / raw)
  To: sj, akpm
  Cc: ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang,
	baolin.wang, linux-mm, linux-kernel

On tiered memory system, the reclaim path in shrink_page_list() already
support demoting pages to slow memory node instead of discarding the
pages. However, at that time the fast memory node memory wartermark is
already tense, which will increase the memory allocation latency during
page demotion.

We can rely on the DAMON in user space to help to monitor the cold
memory on fast memory node, and demote the cold pages to slow memory
node proactively to keep the fast memory node in a healthy state.
Thus this patch introduces a new scheme named DAMOS_DEMOTE to support
this feature.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/damon.h |   3 +
 mm/damon/dbgfs.c      |   1 +
 mm/damon/vaddr.c      | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 160 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index af64838..da9957c 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -87,6 +87,8 @@ struct damon_target {
  * @DAMOS_PAGEOUT:	Call ``madvise()`` for the region with MADV_PAGEOUT.
  * @DAMOS_HUGEPAGE:	Call ``madvise()`` for the region with MADV_HUGEPAGE.
  * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
+ * @DAMOS_DEMOTE:	Migate cold pages from fast memory type (DRAM) to slow
+ * memory type (persistent memory).
  * @DAMOS_STAT:		Do nothing but count the stat.
  */
 enum damos_action {
@@ -95,6 +97,7 @@ enum damos_action {
 	DAMOS_PAGEOUT,
 	DAMOS_HUGEPAGE,
 	DAMOS_NOHUGEPAGE,
+	DAMOS_DEMOTE,
 	DAMOS_STAT,		/* Do nothing but only record the stat */
 };
 
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
index 58dbb96..43355ab 100644
--- a/mm/damon/dbgfs.c
+++ b/mm/damon/dbgfs.c
@@ -168,6 +168,7 @@ static bool damos_action_valid(int action)
 	case DAMOS_PAGEOUT:
 	case DAMOS_HUGEPAGE:
 	case DAMOS_NOHUGEPAGE:
+	case DAMOS_DEMOTE:
 	case DAMOS_STAT:
 		return true;
 	default:
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 9e213a1..b354d3e 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -14,6 +14,10 @@
 #include <linux/page_idle.h>
 #include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
+#include <linux/swapops.h>
+#include "../internal.h"
 
 #include "prmtv-common.h"
 
@@ -693,6 +697,156 @@ static unsigned long damos_madvise(struct damon_target *target,
 }
 #endif	/* CONFIG_ADVISE_SYSCALLS */
 
+static bool damos_isolate_page(struct page *page, struct list_head *demote_list)
+{
+	struct page *head = compound_head(page);
+
+	/* Do not interfere with other mappings of this page */
+	if (page_mapcount(head) != 1)
+		return false;
+
+	/* No need migration if the target demotion node is empty. */
+	if (next_demotion_node(page_to_nid(head)) == NUMA_NO_NODE)
+		return false;
+
+	if (isolate_lru_page(head))
+		return false;
+
+	list_add_tail(&head->lru, demote_list);
+	mod_node_page_state(page_pgdat(head),
+			    NR_ISOLATED_ANON + page_is_file_lru(head),
+			    thp_nr_pages(head));
+	return true;
+}
+
+static int damos_migrate_pmd_entry(pmd_t *pmd, unsigned long addr,
+				   unsigned long end, struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct list_head *demote_list = walk->private;
+	spinlock_t *ptl;
+	struct page *page;
+	pte_t *pte, *mapped_pte;
+
+	if (!vma_migratable(vma))
+		return -EFAULT;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		/* Bail out if THP migration is not supported. */
+		if (!thp_migration_supported())
+			goto thp_out;
+
+		/* If the THP pte is under migration, do not bother it. */
+		if (unlikely(is_pmd_migration_entry(*pmd)))
+			goto thp_out;
+
+		page = damon_get_page(pmd_pfn(*pmd));
+		if (!page)
+			goto thp_out;
+
+		damos_isolate_page(page, demote_list);
+
+		put_page(page);
+thp_out:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	/* regular page handling */
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		return -EINVAL;
+
+	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		if (pte_none(*pte) || !pte_present(*pte))
+			continue;
+
+		page = damon_get_page(pte_pfn(*pte));
+		if (!page)
+			continue;
+
+		damos_isolate_page(page, demote_list);
+		put_page(page);
+	}
+	pte_unmap_unlock(mapped_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static const struct mm_walk_ops damos_migrate_pages_walk_ops = {
+	.pmd_entry              = damos_migrate_pmd_entry,
+};
+
+/*
+ * damos_demote() - demote cold pages from fast memory to slow memory
+ * @target:    the given target
+ * @r:         region of the target
+ *
+ * On tiered memory system, if DAMON monitored cold pages on fast memory
+ * node (DRAM), we can demote them to slow memory node proactively in case
+ * accumulating much more cold memory on fast memory node (DRAM) to reclaim.
+ *
+ * Return:
+ * = 0 means no pages were demoted.
+ * > 0 means how many cold pages were demoted successfully.
+ * < 0 means errors happened.
+ */
+static int damos_demote(struct damon_target *target, struct damon_region *r)
+{
+	struct mm_struct *mm;
+	LIST_HEAD(demote_pages);
+	LIST_HEAD(pagelist);
+	int target_nid, nr_pages, ret = 0;
+	unsigned int nr_succeeded, demoted_pages = 0;
+	struct page *page, *page2;
+
+	/* Validate if allowing to do page demotion */
+	if (!numa_demotion_enabled)
+		return -EINVAL;
+
+	mm = damon_get_mm(target);
+	if (!mm)
+		return -ENOMEM;
+
+	mmap_read_lock(mm);
+	walk_page_range(mm, PAGE_ALIGN(r->ar.start), PAGE_ALIGN(r->ar.end),
+			&damos_migrate_pages_walk_ops, &demote_pages);
+	mmap_read_unlock(mm);
+
+	mmput(mm);
+	if (list_empty(&demote_pages))
+		return 0;
+
+	list_for_each_entry_safe(page, page2, &demote_pages, lru) {
+		list_add(&page->lru, &pagelist);
+		target_nid = next_demotion_node(page_to_nid(page));
+		nr_pages = thp_nr_pages(page);
+
+		ret = migrate_pages(&pagelist, alloc_demote_page, NULL,
+				    target_nid, MIGRATE_SYNC, MR_DEMOTION,
+				    &nr_succeeded);
+		if (ret) {
+			if (!list_empty(&pagelist)) {
+				list_del(&page->lru);
+				mod_node_page_state(page_pgdat(page),
+						    NR_ISOLATED_ANON + page_is_file_lru(page),
+						    -nr_pages);
+				putback_lru_page(page);
+			}
+		} else {
+			__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
+			demoted_pages += nr_succeeded;
+		}
+
+		INIT_LIST_HEAD(&pagelist);
+		cond_resched();
+	}
+
+	return demoted_pages;
+}
+
 static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 		struct damon_target *t, struct damon_region *r,
 		struct damos *scheme)
@@ -715,6 +869,8 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
 	case DAMOS_NOHUGEPAGE:
 		madv_action = MADV_NOHUGEPAGE;
 		break;
+	case DAMOS_DEMOTE:
+		return damos_demote(t, r);
 	case DAMOS_STAT:
 		return 0;
 	default:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system
  2021-12-21  9:18 ` [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system Baolin Wang
@ 2021-12-21 13:24   ` SeongJae Park
  2021-12-21 14:18     ` Baolin Wang
  0 siblings, 1 reply; 13+ messages in thread
From: SeongJae Park @ 2021-12-21 13:24 UTC (permalink / raw)
  To: Baolin Wang
  Cc: sj, akpm, ying.huang, dave.hansen, ziy, shy828301,
	zhongjiang-ali, xlpang, linux-mm, linux-kernel

Hi Baolin,


Thank you for this great patch!  I left some comments below.

On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> On tiered memory system, the reclaim path in shrink_page_list() already
> support demoting pages to slow memory node instead of discarding the
> pages. However, at that time the fast memory node memory wartermark is
> already tense, which will increase the memory allocation latency during
> page demotion.
> 
> We can rely on the DAMON in user space to help to monitor the cold
> memory on fast memory node, and demote the cold pages to slow memory
> node proactively to keep the fast memory node in a healthy state.
> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support
> this feature.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  include/linux/damon.h |   3 +
>  mm/damon/dbgfs.c      |   1 +
>  mm/damon/vaddr.c      | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 160 insertions(+)
> 
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index af64838..da9957c 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -87,6 +87,8 @@ struct damon_target {
>   * @DAMOS_PAGEOUT:	Call ``madvise()`` for the region with MADV_PAGEOUT.
>   * @DAMOS_HUGEPAGE:	Call ``madvise()`` for the region with MADV_HUGEPAGE.
>   * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
> + * @DAMOS_DEMOTE:	Migate cold pages from fast memory type (DRAM) to slow
> + * memory type (persistent memory).

s/Migate/Migrate/

Also, could you make the second line of the description aligned with the first
line? e.g.,

    + * @DAMOS_DEMOTE:	Migate cold pages from fast memory type (DRAM) to slow
    + *                 memory type (persistent memory).

>   * @DAMOS_STAT:		Do nothing but count the stat.
>   */
>  enum damos_action {
> @@ -95,6 +97,7 @@ enum damos_action {
>  	DAMOS_PAGEOUT,
>  	DAMOS_HUGEPAGE,
>  	DAMOS_NOHUGEPAGE,
> +	DAMOS_DEMOTE,
>  	DAMOS_STAT,		/* Do nothing but only record the stat */

This would make user space people who were using 'DAMOS_STAT' be confused.
Could you put 'DAMOS_DEMOTE' after 'DAMOS_STAT'?

>  };
>  
> diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
> index 58dbb96..43355ab 100644
> --- a/mm/damon/dbgfs.c
> +++ b/mm/damon/dbgfs.c
> @@ -168,6 +168,7 @@ static bool damos_action_valid(int action)
>  	case DAMOS_PAGEOUT:
>  	case DAMOS_HUGEPAGE:
>  	case DAMOS_NOHUGEPAGE:
> +	case DAMOS_DEMOTE:
>  	case DAMOS_STAT:
>  		return true;
>  	default:
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index 9e213a1..b354d3e 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -14,6 +14,10 @@
>  #include <linux/page_idle.h>
>  #include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_inline.h>
> +#include <linux/swapops.h>
> +#include "../internal.h"
>  
>  #include "prmtv-common.h"
>  
> @@ -693,6 +697,156 @@ static unsigned long damos_madvise(struct damon_target *target,
>  }
>  #endif	/* CONFIG_ADVISE_SYSCALLS */
>  
> +static bool damos_isolate_page(struct page *page, struct list_head *demote_list)
> +{
> +	struct page *head = compound_head(page);
> +
> +	/* Do not interfere with other mappings of this page */
> +	if (page_mapcount(head) != 1)
> +		return false;
> +
> +	/* No need migration if the target demotion node is empty. */
> +	if (next_demotion_node(page_to_nid(head)) == NUMA_NO_NODE)
> +		return false;
> +
> +	if (isolate_lru_page(head))
> +		return false;
> +
> +	list_add_tail(&head->lru, demote_list);
> +	mod_node_page_state(page_pgdat(head),
> +			    NR_ISOLATED_ANON + page_is_file_lru(head),
> +			    thp_nr_pages(head));
> +	return true;

The return value is not used by callers.  If not needed, let's remove it.

> +}
> +
> +static int damos_migrate_pmd_entry(pmd_t *pmd, unsigned long addr,
> +				   unsigned long end, struct mm_walk *walk)
> +{
> +	struct vm_area_struct *vma = walk->vma;
> +	struct list_head *demote_list = walk->private;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	pte_t *pte, *mapped_pte;
> +
> +	if (!vma_migratable(vma))
> +		return -EFAULT;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		/* Bail out if THP migration is not supported. */
> +		if (!thp_migration_supported())
> +			goto thp_out;
> +
> +		/* If the THP pte is under migration, do not bother it. */
> +		if (unlikely(is_pmd_migration_entry(*pmd)))
> +			goto thp_out;
> +
> +		page = damon_get_page(pmd_pfn(*pmd));
> +		if (!page)
> +			goto thp_out;
> +
> +		damos_isolate_page(page, demote_list);
> +
> +		put_page(page);
> +thp_out:
> +		spin_unlock(ptl);
> +		return 0;
> +	}

Could we wrap above THP handling code with '#ifdef CONFIG_TRANSPARENT_HUGEPAGE'?

> +
> +	/* regular page handling */
> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +		return -EINVAL;
> +
> +	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		if (pte_none(*pte) || !pte_present(*pte))
> +			continue;
> +
> +		page = damon_get_page(pte_pfn(*pte));
> +		if (!page)
> +			continue;
> +
> +		damos_isolate_page(page, demote_list);
> +		put_page(page);
> +	}
> +	pte_unmap_unlock(mapped_pte, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops damos_migrate_pages_walk_ops = {
> +	.pmd_entry              = damos_migrate_pmd_entry,

The names are little bit confusing to me.  How about 's/migrate/isolate/'?

> +};
> +
> +/*
> + * damos_demote() - demote cold pages from fast memory to slow memory
> + * @target:    the given target
> + * @r:         region of the target
> + *
> + * On tiered memory system, if DAMON monitored cold pages on fast memory
> + * node (DRAM), we can demote them to slow memory node proactively in case
> + * accumulating much more cold memory on fast memory node (DRAM) to reclaim.
> + *
> + * Return:
> + * = 0 means no pages were demoted.
> + * > 0 means how many cold pages were demoted successfully.
> + * < 0 means errors happened.

damon_va_apply_scheme(), which returns what this function returns when
DAMOS_DEMOTE action is given, should return bytes of the region that the action
is successfully applied.

> + */
> +static int damos_demote(struct damon_target *target, struct damon_region *r)
> +{
> +	struct mm_struct *mm;
> +	LIST_HEAD(demote_pages);
> +	LIST_HEAD(pagelist);
> +	int target_nid, nr_pages, ret = 0;
> +	unsigned int nr_succeeded, demoted_pages = 0;
> +	struct page *page, *page2;
> +
> +	/* Validate if allowing to do page demotion */
> +	if (!numa_demotion_enabled)
> +		return -EINVAL;
> +
> +	mm = damon_get_mm(target);
> +	if (!mm)
> +		return -ENOMEM;
> +
> +	mmap_read_lock(mm);
> +	walk_page_range(mm, PAGE_ALIGN(r->ar.start), PAGE_ALIGN(r->ar.end),

I guess PAGE_ALIGN()s are not necessary here?

> +			&damos_migrate_pages_walk_ops, &demote_pages);
> +	mmap_read_unlock(mm);
> +
> +	mmput(mm);
> +	if (list_empty(&demote_pages))
> +		return 0;
> +
> +	list_for_each_entry_safe(page, page2, &demote_pages, lru) {

I'd prefer 'next' or 'next_page' instead of 'page2'.

> +		list_add(&page->lru, &pagelist);
> +		target_nid = next_demotion_node(page_to_nid(page));
> +		nr_pages = thp_nr_pages(page);
> +
> +		ret = migrate_pages(&pagelist, alloc_demote_page, NULL,
> +				    target_nid, MIGRATE_SYNC, MR_DEMOTION,
> +				    &nr_succeeded);
> +		if (ret) {
> +			if (!list_empty(&pagelist)) {
> +				list_del(&page->lru);
> +				mod_node_page_state(page_pgdat(page),
> +						    NR_ISOLATED_ANON + page_is_file_lru(page),
> +						    -nr_pages);
> +				putback_lru_page(page);
> +			}
> +		} else {
> +			__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
> +			demoted_pages += nr_succeeded;
> +		}

Why should we do above work for each page on our own instead of exporting and
calling 'demote_page_list()'?


Thanks,
SJ

> +
> +		INIT_LIST_HEAD(&pagelist);
> +		cond_resched();
> +	}
> +
> +	return demoted_pages;
> +}
> +
>  static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
>  		struct damon_target *t, struct damon_region *r,
>  		struct damos *scheme)
> @@ -715,6 +869,8 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx,
>  	case DAMOS_NOHUGEPAGE:
>  		madv_action = MADV_NOHUGEPAGE;
>  		break;
> +	case DAMOS_DEMOTE:
> +		return damos_demote(t, r);
>  	case DAMOS_STAT:
>  		return 0;
>  	default:
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system
  2021-12-21 13:24   ` SeongJae Park
@ 2021-12-21 14:18     ` Baolin Wang
  2021-12-22  8:43       ` SeongJae Park
  0 siblings, 1 reply; 13+ messages in thread
From: Baolin Wang @ 2021-12-21 14:18 UTC (permalink / raw)
  To: SeongJae Park
  Cc: akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali,
	xlpang, linux-mm, linux-kernel

Hi SeongJae,

On 12/21/2021 9:24 PM, SeongJae Park Wrote:
> Hi Baolin,
> 
> 
> Thank you for this great patch!  I left some comments below.
> 
> On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> 
>> On tiered memory system, the reclaim path in shrink_page_list() already
>> support demoting pages to slow memory node instead of discarding the
>> pages. However, at that time the fast memory node memory wartermark is
>> already tense, which will increase the memory allocation latency during
>> page demotion.
>>
>> We can rely on the DAMON in user space to help to monitor the cold
>> memory on fast memory node, and demote the cold pages to slow memory
>> node proactively to keep the fast memory node in a healthy state.
>> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support
>> this feature.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   include/linux/damon.h |   3 +
>>   mm/damon/dbgfs.c      |   1 +
>>   mm/damon/vaddr.c      | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 160 insertions(+)
>>
>> diff --git a/include/linux/damon.h b/include/linux/damon.h
>> index af64838..da9957c 100644
>> --- a/include/linux/damon.h
>> +++ b/include/linux/damon.h
>> @@ -87,6 +87,8 @@ struct damon_target {
>>    * @DAMOS_PAGEOUT:	Call ``madvise()`` for the region with MADV_PAGEOUT.
>>    * @DAMOS_HUGEPAGE:	Call ``madvise()`` for the region with MADV_HUGEPAGE.
>>    * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
>> + * @DAMOS_DEMOTE:	Migate cold pages from fast memory type (DRAM) to slow
>> + * memory type (persistent memory).
> 
> s/Migate/Migrate/
> 
> Also, could you make the second line of the description aligned with the first
> line? e.g.,
> 
>      + * @DAMOS_DEMOTE:	Migate cold pages from fast memory type (DRAM) to slow
>      + *                 memory type (persistent memory).

Sure.

> 
>>    * @DAMOS_STAT:		Do nothing but count the stat.
>>    */
>>   enum damos_action {
>> @@ -95,6 +97,7 @@ enum damos_action {
>>   	DAMOS_PAGEOUT,
>>   	DAMOS_HUGEPAGE,
>>   	DAMOS_NOHUGEPAGE,
>> +	DAMOS_DEMOTE,
>>   	DAMOS_STAT,		/* Do nothing but only record the stat */
> 
> This would make user space people who were using 'DAMOS_STAT' be confused.
> Could you put 'DAMOS_DEMOTE' after 'DAMOS_STAT'?

Ah, right, will do.

> 
>>   };
>>   
>> diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
>> index 58dbb96..43355ab 100644
>> --- a/mm/damon/dbgfs.c
>> +++ b/mm/damon/dbgfs.c
>> @@ -168,6 +168,7 @@ static bool damos_action_valid(int action)
>>   	case DAMOS_PAGEOUT:
>>   	case DAMOS_HUGEPAGE:
>>   	case DAMOS_NOHUGEPAGE:
>> +	case DAMOS_DEMOTE:
>>   	case DAMOS_STAT:
>>   		return true;
>>   	default:
>> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
>> index 9e213a1..b354d3e 100644
>> --- a/mm/damon/vaddr.c
>> +++ b/mm/damon/vaddr.c
>> @@ -14,6 +14,10 @@
>>   #include <linux/page_idle.h>
>>   #include <linux/pagewalk.h>
>>   #include <linux/sched/mm.h>
>> +#include <linux/migrate.h>
>> +#include <linux/mm_inline.h>
>> +#include <linux/swapops.h>
>> +#include "../internal.h"
>>   
>>   #include "prmtv-common.h"
>>   
>> @@ -693,6 +697,156 @@ static unsigned long damos_madvise(struct damon_target *target,
>>   }
>>   #endif	/* CONFIG_ADVISE_SYSCALLS */
>>   
>> +static bool damos_isolate_page(struct page *page, struct list_head *demote_list)
>> +{
>> +	struct page *head = compound_head(page);
>> +
>> +	/* Do not interfere with other mappings of this page */
>> +	if (page_mapcount(head) != 1)
>> +		return false;
>> +
>> +	/* No need migration if the target demotion node is empty. */
>> +	if (next_demotion_node(page_to_nid(head)) == NUMA_NO_NODE)
>> +		return false;
>> +
>> +	if (isolate_lru_page(head))
>> +		return false;
>> +
>> +	list_add_tail(&head->lru, demote_list);
>> +	mod_node_page_state(page_pgdat(head),
>> +			    NR_ISOLATED_ANON + page_is_file_lru(head),
>> +			    thp_nr_pages(head));
>> +	return true;
> 
> The return value is not used by callers.  If not needed, let's remove it.

Yes.

> 
>> +}
>> +
>> +static int damos_migrate_pmd_entry(pmd_t *pmd, unsigned long addr,
>> +				   unsigned long end, struct mm_walk *walk)
>> +{
>> +	struct vm_area_struct *vma = walk->vma;
>> +	struct list_head *demote_list = walk->private;
>> +	spinlock_t *ptl;
>> +	struct page *page;
>> +	pte_t *pte, *mapped_pte;
>> +
>> +	if (!vma_migratable(vma))
>> +		return -EFAULT;
>> +
>> +	ptl = pmd_trans_huge_lock(pmd, vma);
>> +	if (ptl) {
>> +		/* Bail out if THP migration is not supported. */
>> +		if (!thp_migration_supported())
>> +			goto thp_out;
>> +
>> +		/* If the THP pte is under migration, do not bother it. */
>> +		if (unlikely(is_pmd_migration_entry(*pmd)))
>> +			goto thp_out;
>> +
>> +		page = damon_get_page(pmd_pfn(*pmd));
>> +		if (!page)
>> +			goto thp_out;
>> +
>> +		damos_isolate_page(page, demote_list);
>> +
>> +		put_page(page);
>> +thp_out:
>> +		spin_unlock(ptl);
>> +		return 0;
>> +	}
> 
> Could we wrap above THP handling code with '#ifdef CONFIG_TRANSPARENT_HUGEPAGE'?

The pmd_trans_huge_lock() will return NULL if the 
CONFIG_TRANSPARENT_HUGEPAGE is not defined, meanwhile I think the 
compiler will optimize the code if the CONFIG_TRANSPARENT_HUGEPAGE is 
not select. Se we can use it usually instead of adding annoying "#ifdef 
CONFIG_XXX" in many places to handle THP, which looks simpler and more 
readable for me.

> 
>> +
>> +	/* regular page handling */
>> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>> +		return -EINVAL;
>> +
>> +	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +		if (pte_none(*pte) || !pte_present(*pte))
>> +			continue;
>> +
>> +		page = damon_get_page(pte_pfn(*pte));
>> +		if (!page)
>> +			continue;
>> +
>> +		damos_isolate_page(page, demote_list);
>> +		put_page(page);
>> +	}
>> +	pte_unmap_unlock(mapped_pte, ptl);
>> +	cond_resched();
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct mm_walk_ops damos_migrate_pages_walk_ops = {
>> +	.pmd_entry              = damos_migrate_pmd_entry,
> 
> The names are little bit confusing to me.  How about 's/migrate/isolate/'?

Sure.

> 
>> +};
>> +
>> +/*
>> + * damos_demote() - demote cold pages from fast memory to slow memory
>> + * @target:    the given target
>> + * @r:         region of the target
>> + *
>> + * On tiered memory system, if DAMON monitored cold pages on fast memory
>> + * node (DRAM), we can demote them to slow memory node proactively in case
>> + * accumulating much more cold memory on fast memory node (DRAM) to reclaim.
>> + *
>> + * Return:
>> + * = 0 means no pages were demoted.
>> + * > 0 means how many cold pages were demoted successfully.
>> + * < 0 means errors happened.
> 
> damon_va_apply_scheme(), which returns what this function returns when
> DAMOS_DEMOTE action is given, should return bytes of the region that the action
> is successfully applied.

OK, I will change the return values as you suggested in next version.

>> + */
>> +static int damos_demote(struct damon_target *target, struct damon_region *r)
>> +{
>> +	struct mm_struct *mm;
>> +	LIST_HEAD(demote_pages);
>> +	LIST_HEAD(pagelist);
>> +	int target_nid, nr_pages, ret = 0;
>> +	unsigned int nr_succeeded, demoted_pages = 0;
>> +	struct page *page, *page2;
>> +
>> +	/* Validate if allowing to do page demotion */
>> +	if (!numa_demotion_enabled)
>> +		return -EINVAL;
>> +
>> +	mm = damon_get_mm(target);
>> +	if (!mm)
>> +		return -ENOMEM;
>> +
>> +	mmap_read_lock(mm);
>> +	walk_page_range(mm, PAGE_ALIGN(r->ar.start), PAGE_ALIGN(r->ar.end),
> 
> I guess PAGE_ALIGN()s are not necessary here?

Yes, I think so too, will remove it.

> 
>> +			&damos_migrate_pages_walk_ops, &demote_pages);
>> +	mmap_read_unlock(mm);
>> +
>> +	mmput(mm);
>> +	if (list_empty(&demote_pages))
>> +		return 0;
>> +
>> +	list_for_each_entry_safe(page, page2, &demote_pages, lru) {
> 
> I'd prefer 'next' or 'next_page' instead of 'page2'.

Sure.

> 
>> +		list_add(&page->lru, &pagelist);
>> +		target_nid = next_demotion_node(page_to_nid(page));
>> +		nr_pages = thp_nr_pages(page);
>> +
>> +		ret = migrate_pages(&pagelist, alloc_demote_page, NULL,
>> +				    target_nid, MIGRATE_SYNC, MR_DEMOTION,
>> +				    &nr_succeeded);
>> +		if (ret) {
>> +			if (!list_empty(&pagelist)) {
>> +				list_del(&page->lru);
>> +				mod_node_page_state(page_pgdat(page),
>> +						    NR_ISOLATED_ANON + page_is_file_lru(page),
>> +						    -nr_pages);
>> +				putback_lru_page(page);
>> +			}
>> +		} else {
>> +			__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
>> +			demoted_pages += nr_succeeded;
>> +		}
> 
> Why should we do above work for each page on our own instead of exporting and
> calling 'demote_page_list()'?

Cuase demote_page_list() is used for demote cold pages which are from 
one same fast memory node, they can have one same target demotion node.

But for the regions for the process, we can get some cold pages from 
different fast memory nodes (suppose we have mulptiple dram node on the 
system), so its target demotion node may be different. Thus we should 
migrate each cold pages with getting the correct target demotion node.

Thanks for your comments.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system
  2021-12-21 14:18     ` Baolin Wang
@ 2021-12-22  8:43       ` SeongJae Park
  2021-12-22  9:15         ` Baolin Wang
  0 siblings, 1 reply; 13+ messages in thread
From: SeongJae Park @ 2021-12-22  8:43 UTC (permalink / raw)
  To: Baolin Wang
  Cc: SeongJae Park, akpm, ying.huang, dave.hansen, ziy, shy828301,
	zhongjiang-ali, xlpang, linux-mm, linux-kernel

On Tue, 21 Dec 2021 22:18:06 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> Hi SeongJae,
> 
> On 12/21/2021 9:24 PM, SeongJae Park Wrote:
> > Hi Baolin,
> > 
> > 
> > Thank you for this great patch!  I left some comments below.
> > 
> > On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> > 
> >> On tiered memory system, the reclaim path in shrink_page_list() already
> >> support demoting pages to slow memory node instead of discarding the
> >> pages. However, at that time the fast memory node memory wartermark is
> >> already tense, which will increase the memory allocation latency during
> >> page demotion.
> >>
> >> We can rely on the DAMON in user space to help to monitor the cold
> >> memory on fast memory node, and demote the cold pages to slow memory
> >> node proactively to keep the fast memory node in a healthy state.
> >> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support
> >> this feature.
> >>
> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> ---
[...]
> >> + */
> >> +static int damos_demote(struct damon_target *target, struct damon_region *r)
> >> +{
> >> +	struct mm_struct *mm;
> >> +	LIST_HEAD(demote_pages);
> >> +	LIST_HEAD(pagelist);
> >> +	int target_nid, nr_pages, ret = 0;
> >> +	unsigned int nr_succeeded, demoted_pages = 0;
> >> +	struct page *page, *page2;
> >> +
> >> +	/* Validate if allowing to do page demotion */
> >> +	if (!numa_demotion_enabled)
> >> +		return -EINVAL;
> >> +
> >> +	mm = damon_get_mm(target);
> >> +	if (!mm)
> >> +		return -ENOMEM;
> >> +
> >> +	mmap_read_lock(mm);
> >> +	walk_page_range(mm, PAGE_ALIGN(r->ar.start), PAGE_ALIGN(r->ar.end),
[...]
> >> +			&damos_migrate_pages_walk_ops, &demote_pages);
> >> +	mmap_read_unlock(mm);
> >> +
> >> +	mmput(mm);
> >> +	if (list_empty(&demote_pages))
> >> +		return 0;
> >> +
> >> +	list_for_each_entry_safe(page, page2, &demote_pages, lru) {
> > 
> > I'd prefer 'next' or 'next_page' instead of 'page2'.
> 
> Sure.
> 
> > 
> >> +		list_add(&page->lru, &pagelist);
> >> +		target_nid = next_demotion_node(page_to_nid(page));
> >> +		nr_pages = thp_nr_pages(page);
> >> +
> >> +		ret = migrate_pages(&pagelist, alloc_demote_page, NULL,
> >> +				    target_nid, MIGRATE_SYNC, MR_DEMOTION,
> >> +				    &nr_succeeded);
> >> +		if (ret) {
> >> +			if (!list_empty(&pagelist)) {
> >> +				list_del(&page->lru);
> >> +				mod_node_page_state(page_pgdat(page),
> >> +						    NR_ISOLATED_ANON + page_is_file_lru(page),
> >> +						    -nr_pages);
> >> +				putback_lru_page(page);
> >> +			}
> >> +		} else {
> >> +			__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
> >> +			demoted_pages += nr_succeeded;
> >> +		}
> > 
> > Why should we do above work for each page on our own instead of exporting and
> > calling 'demote_page_list()'?
> 
> Cuase demote_page_list() is used for demote cold pages which are from 
> one same fast memory node, they can have one same target demotion node.
> 
> But for the regions for the process, we can get some cold pages from 
> different fast memory nodes (suppose we have mulptiple dram node on the 
> system), so its target demotion node may be different. Thus we should 
> migrate each cold pages with getting the correct target demotion node.

Thank you for the kind explanation.  But, I still want to reuse the code rather
than copying if possible.  How about below dumb ideas off the top of my head?

1. Call demote_page_list() for each page from here
2. Call demote_page_list() for each page from damos_migrate_pmd_entry()
3. Make damos_migrate_pages_walk_ops to configure multiple demote_pages lists,
   each per fast memory node, and calling demote_page_list() here for each
   per-fast-memory-node demote_pages list.
4. Make demote_page_list() handles this case and use it.  e.g., NULL pgdat
   parameter means the pages are not from same node.

Please let me know if I'm missing something.


Thanks,
SJ

> 
> Thanks for your comments.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system
  2021-12-22  8:43       ` SeongJae Park
@ 2021-12-22  9:15         ` Baolin Wang
  2021-12-23  8:53           ` SeongJae Park
  0 siblings, 1 reply; 13+ messages in thread
From: Baolin Wang @ 2021-12-22  9:15 UTC (permalink / raw)
  To: SeongJae Park
  Cc: akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali,
	xlpang, linux-mm, linux-kernel



On 12/22/2021 4:43 PM, SeongJae Park wrote:
> On Tue, 21 Dec 2021 22:18:06 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> 
>> Hi SeongJae,
>>
>> On 12/21/2021 9:24 PM, SeongJae Park Wrote:
>>> Hi Baolin,
>>>
>>>
>>> Thank you for this great patch!  I left some comments below.
>>>
>>> On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>> On tiered memory system, the reclaim path in shrink_page_list() already
>>>> support demoting pages to slow memory node instead of discarding the
>>>> pages. However, at that time the fast memory node memory wartermark is
>>>> already tense, which will increase the memory allocation latency during
>>>> page demotion.
>>>>
>>>> We can rely on the DAMON in user space to help to monitor the cold
>>>> memory on fast memory node, and demote the cold pages to slow memory
>>>> node proactively to keep the fast memory node in a healthy state.
>>>> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support
>>>> this feature.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---

[snip]

>>
>>>
>>>> +		list_add(&page->lru, &pagelist);
>>>> +		target_nid = next_demotion_node(page_to_nid(page));
>>>> +		nr_pages = thp_nr_pages(page);
>>>> +
>>>> +		ret = migrate_pages(&pagelist, alloc_demote_page, NULL,
>>>> +				    target_nid, MIGRATE_SYNC, MR_DEMOTION,
>>>> +				    &nr_succeeded);
>>>> +		if (ret) {
>>>> +			if (!list_empty(&pagelist)) {
>>>> +				list_del(&page->lru);
>>>> +				mod_node_page_state(page_pgdat(page),
>>>> +						    NR_ISOLATED_ANON + page_is_file_lru(page),
>>>> +						    -nr_pages);
>>>> +				putback_lru_page(page);
>>>> +			}
>>>> +		} else {
>>>> +			__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
>>>> +			demoted_pages += nr_succeeded;
>>>> +		}
>>>
>>> Why should we do above work for each page on our own instead of exporting and
>>> calling 'demote_page_list()'?
>>
>> Cuase demote_page_list() is used for demote cold pages which are from
>> one same fast memory node, they can have one same target demotion node.
>>
>> But for the regions for the process, we can get some cold pages from
>> different fast memory nodes (suppose we have mulptiple dram node on the
>> system), so its target demotion node may be different. Thus we should
>> migrate each cold pages with getting the correct target demotion node.
> 
> Thank you for the kind explanation.  But, I still want to reuse the code rather
> than copying if possible.  How about below dumb ideas off the top of my head?
> 
> 1. Call demote_page_list() for each page from here

Sounds reasonable.

> 2. Call demote_page_list() for each page from damos_migrate_pmd_entry()

We are under mmap lock in damos_migrate_pmd_entry(), it is not suitable 
to do page migration.

> 3. Make damos_migrate_pages_walk_ops to configure multiple demote_pages lists,
>     each per fast memory node, and calling demote_page_list() here for each
>     per-fast-memory-node demote_pages list.

Which will bring more complexity I think, and we should avoid doing 
migration under mmap lock.

> 4. Make demote_page_list() handles this case and use it.  e.g., NULL pgdat
>     parameter means the pages are not from same node.

Thanks for your suggestion, actually after more thinking, I think we can 
reuse the demote_page_list() and it can be easy to change. Somethink 
like below on top of my current patch, how do you think? Thanks.

--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -796,7 +796,7 @@ static unsigned long damos_demote(struct 
damon_target *target,
         struct mm_struct *mm;
         LIST_HEAD(demote_pages);
         LIST_HEAD(pagelist);
-       int target_nid, nr_pages, ret = 0;
+       int nr_pages;
         unsigned int nr_succeeded, demoted_pages = 0;
         struct page *page, *next;

@@ -818,14 +818,11 @@ static unsigned long damos_demote(struct 
damon_target *target,
                 return 0;

         list_for_each_entry_safe(page, next, &demote_pages, lru) {
-               list_add(&page->lru, &pagelist);
-               target_nid = next_demotion_node(page_to_nid(page));
                 nr_pages = thp_nr_pages(page);
+               list_add(&page->lru, &pagelist);

-               ret = migrate_pages(&pagelist, alloc_demote_page, NULL,
-                                   target_nid, MIGRATE_SYNC, MR_DEMOTION,
-                                   &nr_succeeded);
-               if (ret) {
+               nr_succeeded = demote_page_list(pagelist, page_pgdat(page));
+               if (!nr_succeeded) {
                         if (!list_empty(&pagelist)) {
                                 list_del(&page->lru);
                                 mod_node_page_state(page_pgdat(page),
@@ -834,11 +831,9 @@ static unsigned long damos_demote(struct 
damon_target *target,
                                 putback_lru_page(page);
                         }
                 } else {
-                       __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
                         demoted_pages += nr_succeeded;
                 }

-               INIT_LIST_HEAD(&pagelist);
                 cond_resched();
         }

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system
  2021-12-22  9:15         ` Baolin Wang
@ 2021-12-23  8:53           ` SeongJae Park
  0 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2021-12-23  8:53 UTC (permalink / raw)
  To: Baolin Wang
  Cc: SeongJae Park, akpm, ying.huang, dave.hansen, ziy, shy828301,
	zhongjiang-ali, xlpang, linux-mm, linux-kernel

On Wed, 22 Dec 2021 17:15:18 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> 
> 
> On 12/22/2021 4:43 PM, SeongJae Park wrote:
> > On Tue, 21 Dec 2021 22:18:06 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> > 
> >> Hi SeongJae,
> >>
> >> On 12/21/2021 9:24 PM, SeongJae Park Wrote:
> >>> Hi Baolin,
> >>>
> >>>
> >>> Thank you for this great patch!  I left some comments below.
> >>>
> >>> On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> >>>
> >>>> On tiered memory system, the reclaim path in shrink_page_list() already
> >>>> support demoting pages to slow memory node instead of discarding the
> >>>> pages. However, at that time the fast memory node memory wartermark is
> >>>> already tense, which will increase the memory allocation latency during
> >>>> page demotion.
> >>>>
> >>>> We can rely on the DAMON in user space to help to monitor the cold
> >>>> memory on fast memory node, and demote the cold pages to slow memory
> >>>> node proactively to keep the fast memory node in a healthy state.
> >>>> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support
> >>>> this feature.
> >>>>
> >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>>> ---
> 
> [snip]
> 
> >>
> >>>
> >>>> +		list_add(&page->lru, &pagelist);
> >>>> +		target_nid = next_demotion_node(page_to_nid(page));
> >>>> +		nr_pages = thp_nr_pages(page);
> >>>> +
> >>>> +		ret = migrate_pages(&pagelist, alloc_demote_page, NULL,
> >>>> +				    target_nid, MIGRATE_SYNC, MR_DEMOTION,
> >>>> +				    &nr_succeeded);
> >>>> +		if (ret) {
> >>>> +			if (!list_empty(&pagelist)) {
> >>>> +				list_del(&page->lru);
> >>>> +				mod_node_page_state(page_pgdat(page),
> >>>> +						    NR_ISOLATED_ANON + page_is_file_lru(page),
> >>>> +						    -nr_pages);
> >>>> +				putback_lru_page(page);
> >>>> +			}
> >>>> +		} else {
> >>>> +			__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
> >>>> +			demoted_pages += nr_succeeded;
> >>>> +		}
> >>>
> >>> Why should we do above work for each page on our own instead of exporting and
> >>> calling 'demote_page_list()'?
> >>
> >> Cuase demote_page_list() is used for demote cold pages which are from
> >> one same fast memory node, they can have one same target demotion node.
> >>
> >> But for the regions for the process, we can get some cold pages from
> >> different fast memory nodes (suppose we have mulptiple dram node on the
> >> system), so its target demotion node may be different. Thus we should
> >> migrate each cold pages with getting the correct target demotion node.
> > 
> > Thank you for the kind explanation.  But, I still want to reuse the code rather
> > than copying if possible.  How about below dumb ideas off the top of my head?
> > 
> > 1. Call demote_page_list() for each page from here
> 
> Sounds reasonable.
> 
> > 2. Call demote_page_list() for each page from damos_migrate_pmd_entry()
> 
> We are under mmap lock in damos_migrate_pmd_entry(), it is not suitable 
> to do page migration.
> 
> > 3. Make damos_migrate_pages_walk_ops to configure multiple demote_pages lists,
> >     each per fast memory node, and calling demote_page_list() here for each
> >     per-fast-memory-node demote_pages list.
> 
> Which will bring more complexity I think, and we should avoid doing 
> migration under mmap lock.
> 
> > 4. Make demote_page_list() handles this case and use it.  e.g., NULL pgdat
> >     parameter means the pages are not from same node.
> 
> Thanks for your suggestion, actually after more thinking, I think we can 
> reuse the demote_page_list() and it can be easy to change. Somethink 
> like below on top of my current patch, how do you think? Thanks.

So, you selected the option 1.  I personally think option 3 or 4 would be more
optimal, but also agree that it could increase the complexity.  As we already
have the time/space quota feature for schemes overhead control, I think
starting with this simple approach makes sense to me.


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Add a new scheme to support demotion on tiered memory system
  2021-12-21  9:18 [PATCH 0/2] Add a new scheme to support demotion on tiered memory system Baolin Wang
  2021-12-21  9:18 ` [PATCH 1/2] mm: Export the alloc_demote_page() function Baolin Wang
  2021-12-21  9:18 ` [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system Baolin Wang
@ 2021-12-21 13:26 ` SeongJae Park
  2021-12-21 14:32   ` Baolin Wang
  2 siblings, 1 reply; 13+ messages in thread
From: SeongJae Park @ 2021-12-21 13:26 UTC (permalink / raw)
  To: Baolin Wang
  Cc: sj, akpm, ying.huang, dave.hansen, ziy, shy828301,
	zhongjiang-ali, xlpang, linux-mm, linux-kernel

Hi Baolin,

On Tue, 21 Dec 2021 17:18:02 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> Hi,
> 
> Now on tiered memory system with different memory types, the reclaim path in
> shrink_page_list() already support demoting pages to slow memory node instead
> of discarding the pages. However, at that time the fast memory node memory
> wartermark is already tense, which will increase the memory allocation latency
> during page demotion. So a new method from user space demoting cold pages
> proactively will be more helpful.
> 
> We can rely on the DAMON in user space to help to monitor the cold memory on
> fast memory node, and demote the cold pages to slow memory node proactively to
> keep the fast memory node in a healthy state.
> 
> This patch set introduces a new scheme named DAMOS_DEMOTE to support this feature,
> and works well from my testing. Any comments are welcome. Thanks.

I like the idea, thank you for these patches!  If possible, could you share
some details about your tests?


Thanks,
SJ

> 
> 
> Baolin Wang (2):
>   mm: Export the alloc_demote_page() function
>   mm/damon: Add a new scheme to support demotion on tiered memory system
> 
>  include/linux/damon.h |   3 +
>  mm/damon/dbgfs.c      |   1 +
>  mm/damon/vaddr.c      | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/internal.h         |   1 +
>  mm/vmscan.c           |   2 +-
>  5 files changed, 162 insertions(+), 1 deletion(-)
> 
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Add a new scheme to support demotion on tiered memory system
  2021-12-21 13:26 ` [PATCH 0/2] " SeongJae Park
@ 2021-12-21 14:32   ` Baolin Wang
  2021-12-22  8:54     ` SeongJae Park
  0 siblings, 1 reply; 13+ messages in thread
From: Baolin Wang @ 2021-12-21 14:32 UTC (permalink / raw)
  To: SeongJae Park
  Cc: akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali,
	xlpang, linux-mm, linux-kernel



On 12/21/2021 9:26 PM, SeongJae Park wrote:
> Hi Baolin,
> 
> On Tue, 21 Dec 2021 17:18:02 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> 
>> Hi,
>>
>> Now on tiered memory system with different memory types, the reclaim path in
>> shrink_page_list() already support demoting pages to slow memory node instead
>> of discarding the pages. However, at that time the fast memory node memory
>> wartermark is already tense, which will increase the memory allocation latency
>> during page demotion. So a new method from user space demoting cold pages
>> proactively will be more helpful.
>>
>> We can rely on the DAMON in user space to help to monitor the cold memory on
>> fast memory node, and demote the cold pages to slow memory node proactively to
>> keep the fast memory node in a healthy state.
>>
>> This patch set introduces a new scheme named DAMOS_DEMOTE to support this feature,
>> and works well from my testing. Any comments are welcome. Thanks.
> 
> I like the idea, thank you for these patches!  If possible, could you share
> some details about your tests?

Sure, sorry for not adding more information about my tests.

My machine contains 64G DRAM + 256G AEP(persistent memory), and you 
should enable the demotion firstly by:
echo "true" > /sys/kernel/mm/numa/demotion_enabled

Then I just write a simple test case like below to mmap some anon 
memory, and then just read and write half of the mmap buffer to let 
another half to be cold enough to demote.

int main()
{
         int len = 50 * 1024 * 1024;
         int scan_len = len / 2;
         int i, ret, j;
         unsigned long *p;

         p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
         if (p == MAP_FAILED) {
                 printf("failed to get memory\n");
                 return -1;
         }

         for (i = 0; i < len / sizeof(*p); i++)
                 p[i] = 0x55aa;

         /* Let another half of buffer to be cold */
         do {
                 for (i = 0; i < scan_len / sizeof(*p); i++)
                         p[i] = 0x55aa;

                 sleep(2);

                 for (i = 0; i < scan_len / sizeof(*p); i++)
                         j +=  p[i] >> 2;
         } while (1);

         munmap(p, len);
         return 0;
}

After setting the atts/schemes/target_ids, then start monitoring:
echo 100000 1000000 1000000 10 1000 > /sys/kernel/debug/damon/attrs
echo 4096 8192000 0 5 10 2000 5 1000 2097152 5000 0 0 0 0 0 3 2 1 > 
/sys/kernel/debug/damon/schemes

After a while, you can check the demote statictics by below command, and 
you can find the demote scheme is applied by demoting some cold pages to 
slow memory (AEP) node.

cat /proc/vmstat | grep "demote"
pgdemote_direct 6881

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Add a new scheme to support demotion on tiered memory system
  2021-12-21 14:32   ` Baolin Wang
@ 2021-12-22  8:54     ` SeongJae Park
  2021-12-22  9:57       ` Baolin Wang
  0 siblings, 1 reply; 13+ messages in thread
From: SeongJae Park @ 2021-12-22  8:54 UTC (permalink / raw)
  To: Baolin Wang
  Cc: SeongJae Park, akpm, ying.huang, dave.hansen, ziy, shy828301,
	zhongjiang-ali, xlpang, linux-mm, linux-kernel

On Tue, 21 Dec 2021 22:32:24 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> 
> 
> On 12/21/2021 9:26 PM, SeongJae Park wrote:
> > Hi Baolin,
> > 
> > On Tue, 21 Dec 2021 17:18:02 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> > 
> >> Hi,
> >>
> >> Now on tiered memory system with different memory types, the reclaim path in
> >> shrink_page_list() already support demoting pages to slow memory node instead
> >> of discarding the pages. However, at that time the fast memory node memory
> >> wartermark is already tense, which will increase the memory allocation latency
> >> during page demotion. So a new method from user space demoting cold pages
> >> proactively will be more helpful.
> >>
> >> We can rely on the DAMON in user space to help to monitor the cold memory on
> >> fast memory node, and demote the cold pages to slow memory node proactively to
> >> keep the fast memory node in a healthy state.
> >>
> >> This patch set introduces a new scheme named DAMOS_DEMOTE to support this feature,
> >> and works well from my testing. Any comments are welcome. Thanks.
> > 
> > I like the idea, thank you for these patches!  If possible, could you share
> > some details about your tests?
> 
> Sure, sorry for not adding more information about my tests.

No problem!

> 
> My machine contains 64G DRAM + 256G AEP(persistent memory), and you 
> should enable the demotion firstly by:
> echo "true" > /sys/kernel/mm/numa/demotion_enabled
> 
> Then I just write a simple test case like below to mmap some anon 
> memory, and then just read and write half of the mmap buffer to let 
> another half to be cold enough to demote.
> 
> int main()
> {
>          int len = 50 * 1024 * 1024;
>          int scan_len = len / 2;
>          int i, ret, j;
>          unsigned long *p;
> 
>          p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>          if (p == MAP_FAILED) {
>                  printf("failed to get memory\n");
>                  return -1;
>          }
> 
>          for (i = 0; i < len / sizeof(*p); i++)
>                  p[i] = 0x55aa;
> 
>          /* Let another half of buffer to be cold */
>          do {
>                  for (i = 0; i < scan_len / sizeof(*p); i++)
>                          p[i] = 0x55aa;
> 
>                  sleep(2);
> 
>                  for (i = 0; i < scan_len / sizeof(*p); i++)
>                          j +=  p[i] >> 2;
>          } while (1);
> 
>          munmap(p, len);
>          return 0;
> }
> 
> After setting the atts/schemes/target_ids, then start monitoring:
> echo 100000 1000000 1000000 10 1000 > /sys/kernel/debug/damon/attrs
> echo 4096 8192000 0 5 10 2000 5 1000 2097152 5000 0 0 0 0 0 3 2 1 > 
> /sys/kernel/debug/damon/schemes
> 
> After a while, you can check the demote statictics by below command, and 
> you can find the demote scheme is applied by demoting some cold pages to 
> slow memory (AEP) node.
> 
> cat /proc/vmstat | grep "demote"
> pgdemote_direct 6881

Thank you for sharing this great details!

I was just wondering if you have tested and measured the effects of the memory
allocation latency increase during the page demotion, which invoked by
shrink_page_list(), and also if you have measured how much improvement can be
achieved with DAMON-based demotion in the scenario.  Seems that's not the case,
and I personally think that information is not essential for this patch, so I
see no problem here.  But, if you have tested or have a plan to do that, and if
you could, I think sharing the results on this cover letter would make this
even greater.


Thanks,
SJ

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/2] Add a new scheme to support demotion on tiered memory system
  2021-12-22  8:54     ` SeongJae Park
@ 2021-12-22  9:57       ` Baolin Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Baolin Wang @ 2021-12-22  9:57 UTC (permalink / raw)
  To: SeongJae Park
  Cc: akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali,
	xlpang, linux-mm, linux-kernel



On 12/22/2021 4:54 PM, SeongJae Park wrote:
[snip]

>>
>> My machine contains 64G DRAM + 256G AEP(persistent memory), and you
>> should enable the demotion firstly by:
>> echo "true" > /sys/kernel/mm/numa/demotion_enabled
>>
>> Then I just write a simple test case like below to mmap some anon
>> memory, and then just read and write half of the mmap buffer to let
>> another half to be cold enough to demote.
>>
>> int main()
>> {
>>           int len = 50 * 1024 * 1024;
>>           int scan_len = len / 2;
>>           int i, ret, j;
>>           unsigned long *p;
>>
>>           p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>           if (p == MAP_FAILED) {
>>                   printf("failed to get memory\n");
>>                   return -1;
>>           }
>>
>>           for (i = 0; i < len / sizeof(*p); i++)
>>                   p[i] = 0x55aa;
>>
>>           /* Let another half of buffer to be cold */
>>           do {
>>                   for (i = 0; i < scan_len / sizeof(*p); i++)
>>                           p[i] = 0x55aa;
>>
>>                   sleep(2);
>>
>>                   for (i = 0; i < scan_len / sizeof(*p); i++)
>>                           j +=  p[i] >> 2;
>>           } while (1);
>>
>>           munmap(p, len);
>>           return 0;
>> }
>>
>> After setting the atts/schemes/target_ids, then start monitoring:
>> echo 100000 1000000 1000000 10 1000 > /sys/kernel/debug/damon/attrs
>> echo 4096 8192000 0 5 10 2000 5 1000 2097152 5000 0 0 0 0 0 3 2 1 >
>> /sys/kernel/debug/damon/schemes
>>
>> After a while, you can check the demote statictics by below command, and
>> you can find the demote scheme is applied by demoting some cold pages to
>> slow memory (AEP) node.
>>
>> cat /proc/vmstat | grep "demote"
>> pgdemote_direct 6881
> 
> Thank you for sharing this great details!
> 
> I was just wondering if you have tested and measured the effects of the memory
> allocation latency increase during the page demotion, which invoked by
> shrink_page_list(), and also if you have measured how much improvement can be
> achieved with DAMON-based demotion in the scenario.  Seems that's not the case,

Not yet testing on the real workload with DAMON demote scheme now, and I 
think DAMON is lack of some functions to tune performance on tiered 
memory system. At least I think we also need add a new promotion scheme 
for DAMON to promote hot memory from slow memory node to the fast memory 
node, which is on my TODO list.

> and I personally think that information is not essential for this patch, so I
> see no problem here.  But, if you have tested or have a plan to do that, and if
> you could, I think sharing the results on this cover letter would make this
> even greater.

Sure, will do if we find some funny results with DAMON on tiered memory 
system in future. Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-12-23  8:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-21  9:18 [PATCH 0/2] Add a new scheme to support demotion on tiered memory system Baolin Wang
2021-12-21  9:18 ` [PATCH 1/2] mm: Export the alloc_demote_page() function Baolin Wang
2021-12-21 10:13   ` SeongJae Park
2021-12-21  9:18 ` [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system Baolin Wang
2021-12-21 13:24   ` SeongJae Park
2021-12-21 14:18     ` Baolin Wang
2021-12-22  8:43       ` SeongJae Park
2021-12-22  9:15         ` Baolin Wang
2021-12-23  8:53           ` SeongJae Park
2021-12-21 13:26 ` [PATCH 0/2] " SeongJae Park
2021-12-21 14:32   ` Baolin Wang
2021-12-22  8:54     ` SeongJae Park
2021-12-22  9:57       ` Baolin Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).