* [PATCH 0/2] Add a new scheme to support demotion on tiered memory system @ 2021-12-21 9:18 Baolin Wang 2021-12-21 9:18 ` [PATCH 1/2] mm: Export the alloc_demote_page() function Baolin Wang ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Baolin Wang @ 2021-12-21 9:18 UTC (permalink / raw) To: sj, akpm Cc: ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, baolin.wang, linux-mm, linux-kernel Hi, On a tiered memory system with different memory types, the reclaim path in shrink_page_list() already supports demoting pages to a slow memory node instead of discarding them. However, at that point the fast memory node's watermark is already under pressure, which increases the memory allocation latency during page demotion. So a mechanism that proactively demotes cold pages from user space would be more helpful. We can rely on DAMON in user space to monitor the cold memory on the fast memory node, and proactively demote the cold pages to the slow memory node to keep the fast memory node in a healthy state. This patch set introduces a new scheme named DAMOS_DEMOTE to support this feature, and it works well in my testing. Any comments are welcome. Thanks. Baolin Wang (2): mm: Export the alloc_demote_page() function mm/damon: Add a new scheme to support demotion on tiered memory system include/linux/damon.h | 3 + mm/damon/dbgfs.c | 1 + mm/damon/vaddr.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++ mm/internal.h | 1 + mm/vmscan.c | 2 +- 5 files changed, 162 insertions(+), 1 deletion(-) -- 1.8.3.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/2] mm: Export the alloc_demote_page() function 2021-12-21 9:18 [PATCH 0/2] Add a new scheme to support demotion on tiered memory system Baolin Wang @ 2021-12-21 9:18 ` Baolin Wang 2021-12-21 10:13 ` SeongJae Park 2021-12-21 9:18 ` [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system Baolin Wang 2021-12-21 13:26 ` [PATCH 0/2] " SeongJae Park 2 siblings, 1 reply; 13+ messages in thread From: Baolin Wang @ 2021-12-21 9:18 UTC (permalink / raw) To: sj, akpm Cc: ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, baolin.wang, linux-mm, linux-kernel Export the alloc_demote_page() function in the header file, in preparation for supporting page demotion in the DAMON monitor. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> --- mm/internal.h | 1 + mm/vmscan.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/internal.h b/mm/internal.h index deb9bda..99ea5fb 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -181,6 +181,7 @@ static inline void set_page_refcounted(struct page *page) extern int isolate_lru_page(struct page *page); extern void putback_lru_page(struct page *page); extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason); +extern struct page *alloc_demote_page(struct page *page, unsigned long node); /* * in mm/rmap.c: diff --git a/mm/vmscan.c b/mm/vmscan.c index f3162a5..bf38327 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1458,7 +1458,7 @@ static void page_check_dirty_writeback(struct page *page, mapping->a_ops->is_dirty_writeback(page, dirty, writeback); } -static struct page *alloc_demote_page(struct page *page, unsigned long node) +struct page *alloc_demote_page(struct page *page, unsigned long node) { struct migration_target_control mtc = { /* -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] mm: Export the alloc_demote_page() function 2021-12-21 9:18 ` [PATCH 1/2] mm: Export the alloc_demote_page() function Baolin Wang @ 2021-12-21 10:13 ` SeongJae Park 0 siblings, 0 replies; 13+ messages in thread From: SeongJae Park @ 2021-12-21 10:13 UTC (permalink / raw) To: Baolin Wang Cc: sj, akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, linux-mm, linux-kernel On Tue, 21 Dec 2021 17:18:03 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > Export the alloc_demote_page() function to the head file as a > preparation to support page demotion for DAMON monitor. > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: SeongJae Park <sj@kernel.org> Thanks, SJ > --- > mm/internal.h | 1 + > mm/vmscan.c | 2 +- > 2 files changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/internal.h b/mm/internal.h > index deb9bda..99ea5fb 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -181,6 +181,7 @@ static inline void set_page_refcounted(struct page *page) > extern int isolate_lru_page(struct page *page); > extern void putback_lru_page(struct page *page); > extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason); > +extern struct page *alloc_demote_page(struct page *page, unsigned long node); > > /* > * in mm/rmap.c: > diff --git a/mm/vmscan.c b/mm/vmscan.c > index f3162a5..bf38327 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1458,7 +1458,7 @@ static void page_check_dirty_writeback(struct page *page, > mapping->a_ops->is_dirty_writeback(page, dirty, writeback); > } > > -static struct page *alloc_demote_page(struct page *page, unsigned long node) > +struct page *alloc_demote_page(struct page *page, unsigned long node) > { > struct migration_target_control mtc = { > /* > -- > 1.8.3.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system 2021-12-21 9:18 [PATCH 0/2] Add a new scheme to support demotion on tiered memory system Baolin Wang 2021-12-21 9:18 ` [PATCH 1/2] mm: Export the alloc_demote_page() function Baolin Wang @ 2021-12-21 9:18 ` Baolin Wang 2021-12-21 13:24 ` SeongJae Park 2021-12-21 13:26 ` [PATCH 0/2] " SeongJae Park 2 siblings, 1 reply; 13+ messages in thread From: Baolin Wang @ 2021-12-21 9:18 UTC (permalink / raw) To: sj, akpm Cc: ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, baolin.wang, linux-mm, linux-kernel On a tiered memory system, the reclaim path in shrink_page_list() already supports demoting pages to a slow memory node instead of discarding them. However, at that point the fast memory node's watermark is already under pressure, which increases the memory allocation latency during page demotion. We can rely on DAMON in user space to monitor the cold memory on the fast memory node, and proactively demote the cold pages to the slow memory node to keep the fast memory node in a healthy state. Thus this patch introduces a new scheme named DAMOS_DEMOTE to support this feature. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> --- include/linux/damon.h | 3 + mm/damon/dbgfs.c | 1 + mm/damon/vaddr.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 160 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index af64838..da9957c 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -87,6 +87,8 @@ struct damon_target { * @DAMOS_PAGEOUT: Call ``madvise()`` for the region with MADV_PAGEOUT. * @DAMOS_HUGEPAGE: Call ``madvise()`` for the region with MADV_HUGEPAGE. * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. + * @DAMOS_DEMOTE: Migate cold pages from fast memory type (DRAM) to slow + * memory type (persistent memory). * @DAMOS_STAT: Do nothing but count the stat. 
*/ enum damos_action { @@ -95,6 +97,7 @@ enum damos_action { DAMOS_PAGEOUT, DAMOS_HUGEPAGE, DAMOS_NOHUGEPAGE, + DAMOS_DEMOTE, DAMOS_STAT, /* Do nothing but only record the stat */ }; diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 58dbb96..43355ab 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -168,6 +168,7 @@ static bool damos_action_valid(int action) case DAMOS_PAGEOUT: case DAMOS_HUGEPAGE: case DAMOS_NOHUGEPAGE: + case DAMOS_DEMOTE: case DAMOS_STAT: return true; default: diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index 9e213a1..b354d3e 100644 --- a/mm/damon/vaddr.c +++ b/mm/damon/vaddr.c @@ -14,6 +14,10 @@ #include <linux/page_idle.h> #include <linux/pagewalk.h> #include <linux/sched/mm.h> +#include <linux/migrate.h> +#include <linux/mm_inline.h> +#include <linux/swapops.h> +#include "../internal.h" #include "prmtv-common.h" @@ -693,6 +697,156 @@ static unsigned long damos_madvise(struct damon_target *target, } #endif /* CONFIG_ADVISE_SYSCALLS */ +static bool damos_isolate_page(struct page *page, struct list_head *demote_list) +{ + struct page *head = compound_head(page); + + /* Do not interfere with other mappings of this page */ + if (page_mapcount(head) != 1) + return false; + + /* No need migration if the target demotion node is empty. 
*/ + if (next_demotion_node(page_to_nid(head)) == NUMA_NO_NODE) + return false; + + if (isolate_lru_page(head)) + return false; + + list_add_tail(&head->lru, demote_list); + mod_node_page_state(page_pgdat(head), + NR_ISOLATED_ANON + page_is_file_lru(head), + thp_nr_pages(head)); + return true; +} + +static int damos_migrate_pmd_entry(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + struct list_head *demote_list = walk->private; + spinlock_t *ptl; + struct page *page; + pte_t *pte, *mapped_pte; + + if (!vma_migratable(vma)) + return -EFAULT; + + ptl = pmd_trans_huge_lock(pmd, vma); + if (ptl) { + /* Bail out if THP migration is not supported. */ + if (!thp_migration_supported()) + goto thp_out; + + /* If the THP pte is under migration, do not bother it. */ + if (unlikely(is_pmd_migration_entry(*pmd))) + goto thp_out; + + page = damon_get_page(pmd_pfn(*pmd)); + if (!page) + goto thp_out; + + damos_isolate_page(page, demote_list); + + put_page(page); +thp_out: + spin_unlock(ptl); + return 0; + } + + /* regular page handling */ + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) + return -EINVAL; + + mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); + for (; addr != end; pte++, addr += PAGE_SIZE) { + if (pte_none(*pte) || !pte_present(*pte)) + continue; + + page = damon_get_page(pte_pfn(*pte)); + if (!page) + continue; + + damos_isolate_page(page, demote_list); + put_page(page); + } + pte_unmap_unlock(mapped_pte, ptl); + cond_resched(); + + return 0; +} + +static const struct mm_walk_ops damos_migrate_pages_walk_ops = { + .pmd_entry = damos_migrate_pmd_entry, +}; + +/* + * damos_demote() - demote cold pages from fast memory to slow memory + * @target: the given target + * @r: region of the target + * + * On tiered memory system, if DAMON monitored cold pages on fast memory + * node (DRAM), we can demote them to slow memory node proactively in case + * accumulating much more cold 
memory on fast memory node (DRAM) to reclaim. + * + * Return: + * = 0 means no pages were demoted. + * > 0 means how many cold pages were demoted successfully. + * < 0 means errors happened. + */ +static int damos_demote(struct damon_target *target, struct damon_region *r) +{ + struct mm_struct *mm; + LIST_HEAD(demote_pages); + LIST_HEAD(pagelist); + int target_nid, nr_pages, ret = 0; + unsigned int nr_succeeded, demoted_pages = 0; + struct page *page, *page2; + + /* Validate if allowing to do page demotion */ + if (!numa_demotion_enabled) + return -EINVAL; + + mm = damon_get_mm(target); + if (!mm) + return -ENOMEM; + + mmap_read_lock(mm); + walk_page_range(mm, PAGE_ALIGN(r->ar.start), PAGE_ALIGN(r->ar.end), + &damos_migrate_pages_walk_ops, &demote_pages); + mmap_read_unlock(mm); + + mmput(mm); + if (list_empty(&demote_pages)) + return 0; + + list_for_each_entry_safe(page, page2, &demote_pages, lru) { + list_add(&page->lru, &pagelist); + target_nid = next_demotion_node(page_to_nid(page)); + nr_pages = thp_nr_pages(page); + + ret = migrate_pages(&pagelist, alloc_demote_page, NULL, + target_nid, MIGRATE_SYNC, MR_DEMOTION, + &nr_succeeded); + if (ret) { + if (!list_empty(&pagelist)) { + list_del(&page->lru); + mod_node_page_state(page_pgdat(page), + NR_ISOLATED_ANON + page_is_file_lru(page), + -nr_pages); + putback_lru_page(page); + } + } else { + __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded); + demoted_pages += nr_succeeded; + } + + INIT_LIST_HEAD(&pagelist); + cond_resched(); + } + + return demoted_pages; +} + static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx, struct damon_target *t, struct damon_region *r, struct damos *scheme) @@ -715,6 +869,8 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx, case DAMOS_NOHUGEPAGE: madv_action = MADV_NOHUGEPAGE; break; + case DAMOS_DEMOTE: + return damos_demote(t, r); case DAMOS_STAT: return 0; default: -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system 2021-12-21 9:18 ` [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system Baolin Wang @ 2021-12-21 13:24 ` SeongJae Park 2021-12-21 14:18 ` Baolin Wang 0 siblings, 1 reply; 13+ messages in thread From: SeongJae Park @ 2021-12-21 13:24 UTC (permalink / raw) To: Baolin Wang Cc: sj, akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, linux-mm, linux-kernel Hi Baolin, Thank you for this great patch! I left some comments below. On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > On tiered memory system, the reclaim path in shrink_page_list() already > support demoting pages to slow memory node instead of discarding the > pages. However, at that time the fast memory node memory wartermark is > already tense, which will increase the memory allocation latency during > page demotion. > > We can rely on the DAMON in user space to help to monitor the cold > memory on fast memory node, and demote the cold pages to slow memory > node proactively to keep the fast memory node in a healthy state. > Thus this patch introduces a new scheme named DAMOS_DEMOTE to support > this feature. > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > --- > include/linux/damon.h | 3 + > mm/damon/dbgfs.c | 1 + > mm/damon/vaddr.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 160 insertions(+) > > diff --git a/include/linux/damon.h b/include/linux/damon.h > index af64838..da9957c 100644 > --- a/include/linux/damon.h > +++ b/include/linux/damon.h > @@ -87,6 +87,8 @@ struct damon_target { > * @DAMOS_PAGEOUT: Call ``madvise()`` for the region with MADV_PAGEOUT. > * @DAMOS_HUGEPAGE: Call ``madvise()`` for the region with MADV_HUGEPAGE. > * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. 
> + * @DAMOS_DEMOTE: Migate cold pages from fast memory type (DRAM) to slow > + * memory type (persistent memory). s/Migate/Migrate/ Also, could you make the second line of the description aligned with the first line? e.g., + * @DAMOS_DEMOTE: Migate cold pages from fast memory type (DRAM) to slow + * memory type (persistent memory). > * @DAMOS_STAT: Do nothing but count the stat. > */ > enum damos_action { > @@ -95,6 +97,7 @@ enum damos_action { > DAMOS_PAGEOUT, > DAMOS_HUGEPAGE, > DAMOS_NOHUGEPAGE, > + DAMOS_DEMOTE, > DAMOS_STAT, /* Do nothing but only record the stat */ This would make user space people who were using 'DAMOS_STAT' be confused. Could you put 'DAMOS_DEMOTE' after 'DAMOS_STAT'? > }; > > diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c > index 58dbb96..43355ab 100644 > --- a/mm/damon/dbgfs.c > +++ b/mm/damon/dbgfs.c > @@ -168,6 +168,7 @@ static bool damos_action_valid(int action) > case DAMOS_PAGEOUT: > case DAMOS_HUGEPAGE: > case DAMOS_NOHUGEPAGE: > + case DAMOS_DEMOTE: > case DAMOS_STAT: > return true; > default: > diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c > index 9e213a1..b354d3e 100644 > --- a/mm/damon/vaddr.c > +++ b/mm/damon/vaddr.c > @@ -14,6 +14,10 @@ > #include <linux/page_idle.h> > #include <linux/pagewalk.h> > #include <linux/sched/mm.h> > +#include <linux/migrate.h> > +#include <linux/mm_inline.h> > +#include <linux/swapops.h> > +#include "../internal.h" > > #include "prmtv-common.h" > > @@ -693,6 +697,156 @@ static unsigned long damos_madvise(struct damon_target *target, > } > #endif /* CONFIG_ADVISE_SYSCALLS */ > > +static bool damos_isolate_page(struct page *page, struct list_head *demote_list) > +{ > + struct page *head = compound_head(page); > + > + /* Do not interfere with other mappings of this page */ > + if (page_mapcount(head) != 1) > + return false; > + > + /* No need migration if the target demotion node is empty. 
*/ > + if (next_demotion_node(page_to_nid(head)) == NUMA_NO_NODE) > + return false; > + > + if (isolate_lru_page(head)) > + return false; > + > + list_add_tail(&head->lru, demote_list); > + mod_node_page_state(page_pgdat(head), > + NR_ISOLATED_ANON + page_is_file_lru(head), > + thp_nr_pages(head)); > + return true; The return value is not used by callers. If not needed, let's remove it. > +} > + > +static int damos_migrate_pmd_entry(pmd_t *pmd, unsigned long addr, > + unsigned long end, struct mm_walk *walk) > +{ > + struct vm_area_struct *vma = walk->vma; > + struct list_head *demote_list = walk->private; > + spinlock_t *ptl; > + struct page *page; > + pte_t *pte, *mapped_pte; > + > + if (!vma_migratable(vma)) > + return -EFAULT; > + > + ptl = pmd_trans_huge_lock(pmd, vma); > + if (ptl) { > + /* Bail out if THP migration is not supported. */ > + if (!thp_migration_supported()) > + goto thp_out; > + > + /* If the THP pte is under migration, do not bother it. */ > + if (unlikely(is_pmd_migration_entry(*pmd))) > + goto thp_out; > + > + page = damon_get_page(pmd_pfn(*pmd)); > + if (!page) > + goto thp_out; > + > + damos_isolate_page(page, demote_list); > + > + put_page(page); > +thp_out: > + spin_unlock(ptl); > + return 0; > + } Could we wrap above THP handling code with '#ifdef CONFIG_TRANSPARENT_HUGEPAGE'? 
> + > + /* regular page handling */ > + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) > + return -EINVAL; > + > + mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); > + for (; addr != end; pte++, addr += PAGE_SIZE) { > + if (pte_none(*pte) || !pte_present(*pte)) > + continue; > + > + page = damon_get_page(pte_pfn(*pte)); > + if (!page) > + continue; > + > + damos_isolate_page(page, demote_list); > + put_page(page); > + } > + pte_unmap_unlock(mapped_pte, ptl); > + cond_resched(); > + > + return 0; > +} > + > +static const struct mm_walk_ops damos_migrate_pages_walk_ops = { > + .pmd_entry = damos_migrate_pmd_entry, The names are little bit confusing to me. How about 's/migrate/isolate/'? > +}; > + > +/* > + * damos_demote() - demote cold pages from fast memory to slow memory > + * @target: the given target > + * @r: region of the target > + * > + * On tiered memory system, if DAMON monitored cold pages on fast memory > + * node (DRAM), we can demote them to slow memory node proactively in case > + * accumulating much more cold memory on fast memory node (DRAM) to reclaim. > + * > + * Return: > + * = 0 means no pages were demoted. > + * > 0 means how many cold pages were demoted successfully. > + * < 0 means errors happened. damon_va_apply_scheme(), which returns what this function returns when DAMOS_DEMOTE action is given, should return bytes of the region that the action is successfully applied. 
> + */ > +static int damos_demote(struct damon_target *target, struct damon_region *r) > +{ > + struct mm_struct *mm; > + LIST_HEAD(demote_pages); > + LIST_HEAD(pagelist); > + int target_nid, nr_pages, ret = 0; > + unsigned int nr_succeeded, demoted_pages = 0; > + struct page *page, *page2; > + > + /* Validate if allowing to do page demotion */ > + if (!numa_demotion_enabled) > + return -EINVAL; > + > + mm = damon_get_mm(target); > + if (!mm) > + return -ENOMEM; > + > + mmap_read_lock(mm); > + walk_page_range(mm, PAGE_ALIGN(r->ar.start), PAGE_ALIGN(r->ar.end), I guess PAGE_ALIGN()s are not necessary here? > + &damos_migrate_pages_walk_ops, &demote_pages); > + mmap_read_unlock(mm); > + > + mmput(mm); > + if (list_empty(&demote_pages)) > + return 0; > + > + list_for_each_entry_safe(page, page2, &demote_pages, lru) { I'd prefer 'next' or 'next_page' instead of 'page2'. > + list_add(&page->lru, &pagelist); > + target_nid = next_demotion_node(page_to_nid(page)); > + nr_pages = thp_nr_pages(page); > + > + ret = migrate_pages(&pagelist, alloc_demote_page, NULL, > + target_nid, MIGRATE_SYNC, MR_DEMOTION, > + &nr_succeeded); > + if (ret) { > + if (!list_empty(&pagelist)) { > + list_del(&page->lru); > + mod_node_page_state(page_pgdat(page), > + NR_ISOLATED_ANON + page_is_file_lru(page), > + -nr_pages); > + putback_lru_page(page); > + } > + } else { > + __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded); > + demoted_pages += nr_succeeded; > + } Why should we do above work for each page on our own instead of exporting and calling 'demote_page_list()'? 
Thanks, SJ > + > + INIT_LIST_HEAD(&pagelist); > + cond_resched(); > + } > + > + return demoted_pages; > +} > + > static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx, > struct damon_target *t, struct damon_region *r, > struct damos *scheme) > @@ -715,6 +869,8 @@ static unsigned long damon_va_apply_scheme(struct damon_ctx *ctx, > case DAMOS_NOHUGEPAGE: > madv_action = MADV_NOHUGEPAGE; > break; > + case DAMOS_DEMOTE: > + return damos_demote(t, r); > case DAMOS_STAT: > return 0; > default: > -- > 1.8.3.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system 2021-12-21 13:24 ` SeongJae Park @ 2021-12-21 14:18 ` Baolin Wang 2021-12-22 8:43 ` SeongJae Park 0 siblings, 1 reply; 13+ messages in thread From: Baolin Wang @ 2021-12-21 14:18 UTC (permalink / raw) To: SeongJae Park Cc: akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, linux-mm, linux-kernel Hi SeongJae, On 12/21/2021 9:24 PM, SeongJae Park Wrote: > Hi Baolin, > > > Thank you for this great patch! I left some comments below. > > On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > >> On tiered memory system, the reclaim path in shrink_page_list() already >> support demoting pages to slow memory node instead of discarding the >> pages. However, at that time the fast memory node memory wartermark is >> already tense, which will increase the memory allocation latency during >> page demotion. >> >> We can rely on the DAMON in user space to help to monitor the cold >> memory on fast memory node, and demote the cold pages to slow memory >> node proactively to keep the fast memory node in a healthy state. >> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support >> this feature. >> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >> --- >> include/linux/damon.h | 3 + >> mm/damon/dbgfs.c | 1 + >> mm/damon/vaddr.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++ >> 3 files changed, 160 insertions(+) >> >> diff --git a/include/linux/damon.h b/include/linux/damon.h >> index af64838..da9957c 100644 >> --- a/include/linux/damon.h >> +++ b/include/linux/damon.h >> @@ -87,6 +87,8 @@ struct damon_target { >> * @DAMOS_PAGEOUT: Call ``madvise()`` for the region with MADV_PAGEOUT. >> * @DAMOS_HUGEPAGE: Call ``madvise()`` for the region with MADV_HUGEPAGE. >> * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. 
>> + * @DAMOS_DEMOTE: Migate cold pages from fast memory type (DRAM) to slow >> + * memory type (persistent memory). > > s/Migate/Migrate/ > > Also, could you make the second line of the description aligned with the first > line? e.g., > > + * @DAMOS_DEMOTE: Migate cold pages from fast memory type (DRAM) to slow > + * memory type (persistent memory). Sure. > >> * @DAMOS_STAT: Do nothing but count the stat. >> */ >> enum damos_action { >> @@ -95,6 +97,7 @@ enum damos_action { >> DAMOS_PAGEOUT, >> DAMOS_HUGEPAGE, >> DAMOS_NOHUGEPAGE, >> + DAMOS_DEMOTE, >> DAMOS_STAT, /* Do nothing but only record the stat */ > > This would make user space people who were using 'DAMOS_STAT' be confused. > Could you put 'DAMOS_DEMOTE' after 'DAMOS_STAT'? Ah, right, will do. > >> }; >> >> diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c >> index 58dbb96..43355ab 100644 >> --- a/mm/damon/dbgfs.c >> +++ b/mm/damon/dbgfs.c >> @@ -168,6 +168,7 @@ static bool damos_action_valid(int action) >> case DAMOS_PAGEOUT: >> case DAMOS_HUGEPAGE: >> case DAMOS_NOHUGEPAGE: >> + case DAMOS_DEMOTE: >> case DAMOS_STAT: >> return true; >> default: >> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c >> index 9e213a1..b354d3e 100644 >> --- a/mm/damon/vaddr.c >> +++ b/mm/damon/vaddr.c >> @@ -14,6 +14,10 @@ >> #include <linux/page_idle.h> >> #include <linux/pagewalk.h> >> #include <linux/sched/mm.h> >> +#include <linux/migrate.h> >> +#include <linux/mm_inline.h> >> +#include <linux/swapops.h> >> +#include "../internal.h" >> >> #include "prmtv-common.h" >> >> @@ -693,6 +697,156 @@ static unsigned long damos_madvise(struct damon_target *target, >> } >> #endif /* CONFIG_ADVISE_SYSCALLS */ >> >> +static bool damos_isolate_page(struct page *page, struct list_head *demote_list) >> +{ >> + struct page *head = compound_head(page); >> + >> + /* Do not interfere with other mappings of this page */ >> + if (page_mapcount(head) != 1) >> + return false; >> + >> + /* No need migration if the target demotion node is empty. 
*/ >> + if (next_demotion_node(page_to_nid(head)) == NUMA_NO_NODE) >> + return false; >> + >> + if (isolate_lru_page(head)) >> + return false; >> + >> + list_add_tail(&head->lru, demote_list); >> + mod_node_page_state(page_pgdat(head), >> + NR_ISOLATED_ANON + page_is_file_lru(head), >> + thp_nr_pages(head)); >> + return true; > > The return value is not used by callers. If not needed, let's remove it. Yes. > >> +} >> + >> +static int damos_migrate_pmd_entry(pmd_t *pmd, unsigned long addr, >> + unsigned long end, struct mm_walk *walk) >> +{ >> + struct vm_area_struct *vma = walk->vma; >> + struct list_head *demote_list = walk->private; >> + spinlock_t *ptl; >> + struct page *page; >> + pte_t *pte, *mapped_pte; >> + >> + if (!vma_migratable(vma)) >> + return -EFAULT; >> + >> + ptl = pmd_trans_huge_lock(pmd, vma); >> + if (ptl) { >> + /* Bail out if THP migration is not supported. */ >> + if (!thp_migration_supported()) >> + goto thp_out; >> + >> + /* If the THP pte is under migration, do not bother it. */ >> + if (unlikely(is_pmd_migration_entry(*pmd))) >> + goto thp_out; >> + >> + page = damon_get_page(pmd_pfn(*pmd)); >> + if (!page) >> + goto thp_out; >> + >> + damos_isolate_page(page, demote_list); >> + >> + put_page(page); >> +thp_out: >> + spin_unlock(ptl); >> + return 0; >> + } > > Could we wrap above THP handling code with '#ifdef CONFIG_TRANSPARENT_HUGEPAGE'? pmd_trans_huge_lock() will return NULL if CONFIG_TRANSPARENT_HUGEPAGE is not defined; meanwhile, I think the compiler will optimize the code away if CONFIG_TRANSPARENT_HUGEPAGE is not selected. So we can use it directly instead of adding annoying "#ifdef CONFIG_XXX" in many places to handle THP, which looks simpler and more readable to me. 
> >> + >> + /* regular page handling */ >> + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) >> + return -EINVAL; >> + >> + mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); >> + for (; addr != end; pte++, addr += PAGE_SIZE) { >> + if (pte_none(*pte) || !pte_present(*pte)) >> + continue; >> + >> + page = damon_get_page(pte_pfn(*pte)); >> + if (!page) >> + continue; >> + >> + damos_isolate_page(page, demote_list); >> + put_page(page); >> + } >> + pte_unmap_unlock(mapped_pte, ptl); >> + cond_resched(); >> + >> + return 0; >> +} >> + >> +static const struct mm_walk_ops damos_migrate_pages_walk_ops = { >> + .pmd_entry = damos_migrate_pmd_entry, > > The names are little bit confusing to me. How about 's/migrate/isolate/'? Sure. > >> +}; >> + >> +/* >> + * damos_demote() - demote cold pages from fast memory to slow memory >> + * @target: the given target >> + * @r: region of the target >> + * >> + * On tiered memory system, if DAMON monitored cold pages on fast memory >> + * node (DRAM), we can demote them to slow memory node proactively in case >> + * accumulating much more cold memory on fast memory node (DRAM) to reclaim. >> + * >> + * Return: >> + * = 0 means no pages were demoted. >> + * > 0 means how many cold pages were demoted successfully. >> + * < 0 means errors happened. > > damon_va_apply_scheme(), which returns what this function returns when > DAMOS_DEMOTE action is given, should return bytes of the region that the action > is successfully applied. OK, I will change the return values as you suggested in next version. 
>> + */ >> +static int damos_demote(struct damon_target *target, struct damon_region *r) >> +{ >> + struct mm_struct *mm; >> + LIST_HEAD(demote_pages); >> + LIST_HEAD(pagelist); >> + int target_nid, nr_pages, ret = 0; >> + unsigned int nr_succeeded, demoted_pages = 0; >> + struct page *page, *page2; >> + >> + /* Validate if allowing to do page demotion */ >> + if (!numa_demotion_enabled) >> + return -EINVAL; >> + >> + mm = damon_get_mm(target); >> + if (!mm) >> + return -ENOMEM; >> + >> + mmap_read_lock(mm); >> + walk_page_range(mm, PAGE_ALIGN(r->ar.start), PAGE_ALIGN(r->ar.end), > > I guess PAGE_ALIGN()s are not necessary here? Yes, I think so too, will remove it. > >> + &damos_migrate_pages_walk_ops, &demote_pages); >> + mmap_read_unlock(mm); >> + >> + mmput(mm); >> + if (list_empty(&demote_pages)) >> + return 0; >> + >> + list_for_each_entry_safe(page, page2, &demote_pages, lru) { > > I'd prefer 'next' or 'next_page' instead of 'page2'. Sure. > >> + list_add(&page->lru, &pagelist); >> + target_nid = next_demotion_node(page_to_nid(page)); >> + nr_pages = thp_nr_pages(page); >> + >> + ret = migrate_pages(&pagelist, alloc_demote_page, NULL, >> + target_nid, MIGRATE_SYNC, MR_DEMOTION, >> + &nr_succeeded); >> + if (ret) { >> + if (!list_empty(&pagelist)) { >> + list_del(&page->lru); >> + mod_node_page_state(page_pgdat(page), >> + NR_ISOLATED_ANON + page_is_file_lru(page), >> + -nr_pages); >> + putback_lru_page(page); >> + } >> + } else { >> + __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded); >> + demoted_pages += nr_succeeded; >> + } > > Why should we do above work for each page on our own instead of exporting and > calling 'demote_page_list()'? Because demote_page_list() is used to demote cold pages that all come from the same fast memory node, so they can all share the same target demotion node. 
But for a process's regions, we can get cold pages from different fast memory nodes (suppose we have multiple DRAM nodes on the system), so their target demotion nodes may differ. Thus we should migrate each cold page after looking up its correct target demotion node. Thanks for your comments. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system 2021-12-21 14:18 ` Baolin Wang @ 2021-12-22 8:43 ` SeongJae Park 2021-12-22 9:15 ` Baolin Wang 0 siblings, 1 reply; 13+ messages in thread From: SeongJae Park @ 2021-12-22 8:43 UTC (permalink / raw) To: Baolin Wang Cc: SeongJae Park, akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, linux-mm, linux-kernel On Tue, 21 Dec 2021 22:18:06 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > Hi SeongJae, > > On 12/21/2021 9:24 PM, SeongJae Park Wrote: > > Hi Baolin, > > > > > > Thank you for this great patch! I left some comments below. > > > > On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > > > >> On tiered memory system, the reclaim path in shrink_page_list() already > >> support demoting pages to slow memory node instead of discarding the > >> pages. However, at that time the fast memory node memory wartermark is > >> already tense, which will increase the memory allocation latency during > >> page demotion. > >> > >> We can rely on the DAMON in user space to help to monitor the cold > >> memory on fast memory node, and demote the cold pages to slow memory > >> node proactively to keep the fast memory node in a healthy state. > >> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support > >> this feature. > >> > >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > >> --- [...] 
> >> + */ > >> +static int damos_demote(struct damon_target *target, struct damon_region *r) > >> +{ > >> + struct mm_struct *mm; > >> + LIST_HEAD(demote_pages); > >> + LIST_HEAD(pagelist); > >> + int target_nid, nr_pages, ret = 0; > >> + unsigned int nr_succeeded, demoted_pages = 0; > >> + struct page *page, *page2; > >> + > >> + /* Validate if allowing to do page demotion */ > >> + if (!numa_demotion_enabled) > >> + return -EINVAL; > >> + > >> + mm = damon_get_mm(target); > >> + if (!mm) > >> + return -ENOMEM; > >> + > >> + mmap_read_lock(mm); > >> + walk_page_range(mm, PAGE_ALIGN(r->ar.start), PAGE_ALIGN(r->ar.end), [...] > >> + &damos_migrate_pages_walk_ops, &demote_pages); > >> + mmap_read_unlock(mm); > >> + > >> + mmput(mm); > >> + if (list_empty(&demote_pages)) > >> + return 0; > >> + > >> + list_for_each_entry_safe(page, page2, &demote_pages, lru) { > > > > I'd prefer 'next' or 'next_page' instead of 'page2'. > > Sure. > > > > >> + list_add(&page->lru, &pagelist); > >> + target_nid = next_demotion_node(page_to_nid(page)); > >> + nr_pages = thp_nr_pages(page); > >> + > >> + ret = migrate_pages(&pagelist, alloc_demote_page, NULL, > >> + target_nid, MIGRATE_SYNC, MR_DEMOTION, > >> + &nr_succeeded); > >> + if (ret) { > >> + if (!list_empty(&pagelist)) { > >> + list_del(&page->lru); > >> + mod_node_page_state(page_pgdat(page), > >> + NR_ISOLATED_ANON + page_is_file_lru(page), > >> + -nr_pages); > >> + putback_lru_page(page); > >> + } > >> + } else { > >> + __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded); > >> + demoted_pages += nr_succeeded; > >> + } > > > > Why should we do above work for each page on our own instead of exporting and > > calling 'demote_page_list()'? > > Cuase demote_page_list() is used for demote cold pages which are from > one same fast memory node, they can have one same target demotion node. 
>
> But for the regions of a monitored process, we can get cold pages from
> different fast memory nodes (suppose we have multiple DRAM nodes on the
> system), so their target demotion nodes may be different. Thus we should
> look up the correct target demotion node for each cold page before
> migrating it.

Thank you for the kind explanation. But, I still want to reuse the code rather
than copying it if possible. How about the below dumb ideas off the top of my
head?

1. Call demote_page_list() for each page from here
2. Call demote_page_list() for each page from damos_migrate_pmd_entry()
3. Make damos_migrate_pages_walk_ops configure multiple demote_pages lists,
   one per fast memory node, and call demote_page_list() here for each
   per-fast-memory-node demote_pages list.
4. Make demote_page_list() handle this case and use it. e.g., a NULL pgdat
   parameter means the pages are not from the same node.

Please let me know if I'm missing something.


Thanks,
SJ

>
> Thanks for your comments.
* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system 2021-12-22 8:43 ` SeongJae Park @ 2021-12-22 9:15 ` Baolin Wang 2021-12-23 8:53 ` SeongJae Park 0 siblings, 1 reply; 13+ messages in thread From: Baolin Wang @ 2021-12-22 9:15 UTC (permalink / raw) To: SeongJae Park Cc: akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, linux-mm, linux-kernel On 12/22/2021 4:43 PM, SeongJae Park wrote: > On Tue, 21 Dec 2021 22:18:06 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > >> Hi SeongJae, >> >> On 12/21/2021 9:24 PM, SeongJae Park Wrote: >>> Hi Baolin, >>> >>> >>> Thank you for this great patch! I left some comments below. >>> >>> On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: >>> >>>> On tiered memory system, the reclaim path in shrink_page_list() already >>>> support demoting pages to slow memory node instead of discarding the >>>> pages. However, at that time the fast memory node memory wartermark is >>>> already tense, which will increase the memory allocation latency during >>>> page demotion. >>>> >>>> We can rely on the DAMON in user space to help to monitor the cold >>>> memory on fast memory node, and demote the cold pages to slow memory >>>> node proactively to keep the fast memory node in a healthy state. >>>> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support >>>> this feature. 
>>>> >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>> --- [snip] >> >>> >>>> + list_add(&page->lru, &pagelist); >>>> + target_nid = next_demotion_node(page_to_nid(page)); >>>> + nr_pages = thp_nr_pages(page); >>>> + >>>> + ret = migrate_pages(&pagelist, alloc_demote_page, NULL, >>>> + target_nid, MIGRATE_SYNC, MR_DEMOTION, >>>> + &nr_succeeded); >>>> + if (ret) { >>>> + if (!list_empty(&pagelist)) { >>>> + list_del(&page->lru); >>>> + mod_node_page_state(page_pgdat(page), >>>> + NR_ISOLATED_ANON + page_is_file_lru(page), >>>> + -nr_pages); >>>> + putback_lru_page(page); >>>> + } >>>> + } else { >>>> + __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded); >>>> + demoted_pages += nr_succeeded; >>>> + } >>> >>> Why should we do above work for each page on our own instead of exporting and >>> calling 'demote_page_list()'? >> >> Cuase demote_page_list() is used for demote cold pages which are from >> one same fast memory node, they can have one same target demotion node. >> >> But for the regions for the process, we can get some cold pages from >> different fast memory nodes (suppose we have mulptiple dram node on the >> system), so its target demotion node may be different. Thus we should >> migrate each cold pages with getting the correct target demotion node. > > Thank you for the kind explanation. But, I still want to reuse the code rather > than copying if possible. How about below dumb ideas off the top of my head? > > 1. Call demote_page_list() for each page from here Sounds reasonable. > 2. Call demote_page_list() for each page from damos_migrate_pmd_entry() We are under mmap lock in damos_migrate_pmd_entry(), it is not suitable to do page migration. > 3. Make damos_migrate_pages_walk_ops to configure multiple demote_pages lists, > each per fast memory node, and calling demote_page_list() here for each > per-fast-memory-node demote_pages list. Which will bring more complexity I think, and we should avoid doing migration under mmap lock. > 4. 
Make demote_page_list() handle this case and use it. e.g., a NULL pgdat
> parameter means the pages are not from the same node.

Thanks for your suggestion; actually, after more thinking, I think we can
reuse demote_page_list() and it should be easy to change. Something like
the below on top of my current patch; what do you think? Thanks.

--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -796,7 +796,7 @@ static unsigned long damos_demote(struct damon_target *target,
 	struct mm_struct *mm;
 	LIST_HEAD(demote_pages);
 	LIST_HEAD(pagelist);
-	int target_nid, nr_pages, ret = 0;
+	int nr_pages;
 	unsigned int nr_succeeded, demoted_pages = 0;
 	struct page *page, *next;

@@ -818,14 +818,11 @@ static unsigned long damos_demote(struct damon_target *target,
 		return 0;

 	list_for_each_entry_safe(page, next, &demote_pages, lru) {
-		list_add(&page->lru, &pagelist);
-		target_nid = next_demotion_node(page_to_nid(page));
 		nr_pages = thp_nr_pages(page);
+		list_add(&page->lru, &pagelist);

-		ret = migrate_pages(&pagelist, alloc_demote_page, NULL,
-				    target_nid, MIGRATE_SYNC, MR_DEMOTION,
-				    &nr_succeeded);
-		if (ret) {
+		nr_succeeded = demote_page_list(&pagelist, page_pgdat(page));
+		if (!nr_succeeded) {
 			if (!list_empty(&pagelist)) {
 				list_del(&page->lru);
 				mod_node_page_state(page_pgdat(page),
@@ -834,11 +831,9 @@ static unsigned long damos_demote(struct damon_target *target,
 				putback_lru_page(page);
 			}
 		} else {
-			__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
 			demoted_pages += nr_succeeded;
 		}

-		INIT_LIST_HEAD(&pagelist);
 		cond_resched();
 	}
* Re: [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system 2021-12-22 9:15 ` Baolin Wang @ 2021-12-23 8:53 ` SeongJae Park 0 siblings, 0 replies; 13+ messages in thread From: SeongJae Park @ 2021-12-23 8:53 UTC (permalink / raw) To: Baolin Wang Cc: SeongJae Park, akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, linux-mm, linux-kernel On Wed, 22 Dec 2021 17:15:18 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > > > On 12/22/2021 4:43 PM, SeongJae Park wrote: > > On Tue, 21 Dec 2021 22:18:06 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > > > >> Hi SeongJae, > >> > >> On 12/21/2021 9:24 PM, SeongJae Park Wrote: > >>> Hi Baolin, > >>> > >>> > >>> Thank you for this great patch! I left some comments below. > >>> > >>> On Tue, 21 Dec 2021 17:18:04 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > >>> > >>>> On tiered memory system, the reclaim path in shrink_page_list() already > >>>> support demoting pages to slow memory node instead of discarding the > >>>> pages. However, at that time the fast memory node memory wartermark is > >>>> already tense, which will increase the memory allocation latency during > >>>> page demotion. > >>>> > >>>> We can rely on the DAMON in user space to help to monitor the cold > >>>> memory on fast memory node, and demote the cold pages to slow memory > >>>> node proactively to keep the fast memory node in a healthy state. > >>>> Thus this patch introduces a new scheme named DAMOS_DEMOTE to support > >>>> this feature. 
> >>>> > >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > >>>> --- > > [snip] > > >> > >>> > >>>> + list_add(&page->lru, &pagelist); > >>>> + target_nid = next_demotion_node(page_to_nid(page)); > >>>> + nr_pages = thp_nr_pages(page); > >>>> + > >>>> + ret = migrate_pages(&pagelist, alloc_demote_page, NULL, > >>>> + target_nid, MIGRATE_SYNC, MR_DEMOTION, > >>>> + &nr_succeeded); > >>>> + if (ret) { > >>>> + if (!list_empty(&pagelist)) { > >>>> + list_del(&page->lru); > >>>> + mod_node_page_state(page_pgdat(page), > >>>> + NR_ISOLATED_ANON + page_is_file_lru(page), > >>>> + -nr_pages); > >>>> + putback_lru_page(page); > >>>> + } > >>>> + } else { > >>>> + __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded); > >>>> + demoted_pages += nr_succeeded; > >>>> + } > >>> > >>> Why should we do above work for each page on our own instead of exporting and > >>> calling 'demote_page_list()'? > >> > >> Cuase demote_page_list() is used for demote cold pages which are from > >> one same fast memory node, they can have one same target demotion node. > >> > >> But for the regions for the process, we can get some cold pages from > >> different fast memory nodes (suppose we have mulptiple dram node on the > >> system), so its target demotion node may be different. Thus we should > >> migrate each cold pages with getting the correct target demotion node. > > > > Thank you for the kind explanation. But, I still want to reuse the code rather > > than copying if possible. How about below dumb ideas off the top of my head? > > > > 1. Call demote_page_list() for each page from here > > Sounds reasonable. > > > 2. Call demote_page_list() for each page from damos_migrate_pmd_entry() > > We are under mmap lock in damos_migrate_pmd_entry(), it is not suitable > to do page migration. > > > 3. 
Make damos_migrate_pages_walk_ops configure multiple demote_pages lists,
> > one per fast memory node, and call demote_page_list() here for each
> > per-fast-memory-node demote_pages list.
>
> Which will bring more complexity I think, and we should avoid doing
> migration under mmap lock.
>
> > 4. Make demote_page_list() handle this case and use it. e.g., a NULL pgdat
> > parameter means the pages are not from the same node.
>
> Thanks for your suggestion; actually, after more thinking, I think we can
> reuse demote_page_list() and it should be easy to change. Something like
> the below on top of my current patch; what do you think? Thanks.

So, you selected option 1. I personally think option 3 or 4 would be more
optimal, but I also agree that it could increase the complexity. As we
already have the time/space quota feature for schemes overhead control, I
think starting with this simple approach makes sense to me.


Thanks,
SJ

[...]
* Re: [PATCH 0/2] Add a new scheme to support demotion on tiered memory system 2021-12-21 9:18 [PATCH 0/2] Add a new scheme to support demotion on tiered memory system Baolin Wang 2021-12-21 9:18 ` [PATCH 1/2] mm: Export the alloc_demote_page() function Baolin Wang 2021-12-21 9:18 ` [PATCH 2/2] mm/damon: Add a new scheme to support demotion on tiered memory system Baolin Wang @ 2021-12-21 13:26 ` SeongJae Park 2021-12-21 14:32 ` Baolin Wang 2 siblings, 1 reply; 13+ messages in thread From: SeongJae Park @ 2021-12-21 13:26 UTC (permalink / raw) To: Baolin Wang Cc: sj, akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, linux-mm, linux-kernel Hi Baolin, On Tue, 21 Dec 2021 17:18:02 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > Hi, > > Now on tiered memory system with different memory types, the reclaim path in > shrink_page_list() already support demoting pages to slow memory node instead > of discarding the pages. However, at that time the fast memory node memory > wartermark is already tense, which will increase the memory allocation latency > during page demotion. So a new method from user space demoting cold pages > proactively will be more helpful. > > We can rely on the DAMON in user space to help to monitor the cold memory on > fast memory node, and demote the cold pages to slow memory node proactively to > keep the fast memory node in a healthy state. > > This patch set introduces a new scheme named DAMOS_DEMOTE to support this feature, > and works well from my testing. Any comments are welcome. Thanks. I like the idea, thank you for these patches! If possible, could you share some details about your tests? 
Thanks,
SJ

>
>
> Baolin Wang (2):
>   mm: Export the alloc_demote_page() function
>   mm/damon: Add a new scheme to support demotion on tiered memory system
>
>  include/linux/damon.h |   3 +
>  mm/damon/dbgfs.c      |   1 +
>  mm/damon/vaddr.c      | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/internal.h         |   1 +
>  mm/vmscan.c           |   2 +-
>  5 files changed, 162 insertions(+), 1 deletion(-)
>
> --
> 1.8.3.1
* Re: [PATCH 0/2] Add a new scheme to support demotion on tiered memory system
  2021-12-21 13:26 ` [PATCH 0/2] " SeongJae Park
@ 2021-12-21 14:32   ` Baolin Wang
  2021-12-22  8:54     ` SeongJae Park
  0 siblings, 1 reply; 13+ messages in thread
From: Baolin Wang @ 2021-12-21 14:32 UTC (permalink / raw)
  To: SeongJae Park
  Cc: akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali,
	xlpang, linux-mm, linux-kernel

On 12/21/2021 9:26 PM, SeongJae Park wrote:
> Hi Baolin,
>
> On Tue, 21 Dec 2021 17:18:02 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
>> Hi,
>>
>> Now on tiered memory system with different memory types, the reclaim path in
>> shrink_page_list() already support demoting pages to slow memory node instead
>> of discarding the pages. However, at that time the fast memory node memory
>> wartermark is already tense, which will increase the memory allocation latency
>> during page demotion. So a new method from user space demoting cold pages
>> proactively will be more helpful.
>>
>> We can rely on the DAMON in user space to help to monitor the cold memory on
>> fast memory node, and demote the cold pages to slow memory node proactively to
>> keep the fast memory node in a healthy state.
>>
>> This patch set introduces a new scheme named DAMOS_DEMOTE to support this feature,
>> and works well from my testing. Any comments are welcome. Thanks.
>
> I like the idea, thank you for these patches! If possible, could you share
> some details about your tests?

Sure, sorry for not adding more information about my tests.

My machine contains 64G DRAM + 256G AEP (persistent memory), and you
should first enable demotion by:
echo "true" > /sys/kernel/mm/numa/demotion_enabled

Then I just wrote a simple test case like below to mmap some anon
memory, and then read and write half of the mmap buffer to let
the other half become cold enough to demote.
int main()
{
	int len = 50 * 1024 * 1024;
	int scan_len = len / 2;
	int i;
	unsigned long j = 0;
	unsigned long *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		printf("failed to get memory\n");
		return -1;
	}

	for (i = 0; i < len / sizeof(*p); i++)
		p[i] = 0x55aa;

	/* Let the other half of the buffer become cold */
	do {
		for (i = 0; i < scan_len / sizeof(*p); i++)
			p[i] = 0x55aa;

		sleep(2);

		for (i = 0; i < scan_len / sizeof(*p); i++)
			j += p[i] >> 2;
	} while (1);

	munmap(p, len);
	return 0;
}

After setting the attrs/schemes/target_ids, start monitoring:
echo 100000 1000000 1000000 10 1000 > /sys/kernel/debug/damon/attrs
echo 4096 8192000 0 5 10 2000 5 1000 2097152 5000 0 0 0 0 0 3 2 1 > /sys/kernel/debug/damon/schemes

After a while, you can check the demotion statistics with the command
below, and you can see that the demote scheme was applied, demoting
some cold pages to the slow memory (AEP) node.

cat /proc/vmstat | grep "demote"
pgdemote_direct 6881
* Re: [PATCH 0/2] Add a new scheme to support demotion on tiered memory system 2021-12-21 14:32 ` Baolin Wang @ 2021-12-22 8:54 ` SeongJae Park 2021-12-22 9:57 ` Baolin Wang 0 siblings, 1 reply; 13+ messages in thread From: SeongJae Park @ 2021-12-22 8:54 UTC (permalink / raw) To: Baolin Wang Cc: SeongJae Park, akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, linux-mm, linux-kernel On Tue, 21 Dec 2021 22:32:24 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > > > On 12/21/2021 9:26 PM, SeongJae Park wrote: > > Hi Baolin, > > > > On Tue, 21 Dec 2021 17:18:02 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote: > > > >> Hi, > >> > >> Now on tiered memory system with different memory types, the reclaim path in > >> shrink_page_list() already support demoting pages to slow memory node instead > >> of discarding the pages. However, at that time the fast memory node memory > >> wartermark is already tense, which will increase the memory allocation latency > >> during page demotion. So a new method from user space demoting cold pages > >> proactively will be more helpful. > >> > >> We can rely on the DAMON in user space to help to monitor the cold memory on > >> fast memory node, and demote the cold pages to slow memory node proactively to > >> keep the fast memory node in a healthy state. > >> > >> This patch set introduces a new scheme named DAMOS_DEMOTE to support this feature, > >> and works well from my testing. Any comments are welcome. Thanks. > > > > I like the idea, thank you for these patches! If possible, could you share > > some details about your tests? > > Sure, sorry for not adding more information about my tests. No problem! 
>
> My machine contains 64G DRAM + 256G AEP (persistent memory), and you
> should first enable demotion by:
> echo "true" > /sys/kernel/mm/numa/demotion_enabled
>
> Then I just wrote a simple test case like below to mmap some anon
> memory, and then read and write half of the mmap buffer to let
> the other half become cold enough to demote.
>
> int main()
> {
> 	int len = 50 * 1024 * 1024;
> 	int scan_len = len / 2;
> 	int i;
> 	unsigned long j = 0;
> 	unsigned long *p;
>
> 	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
> 		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if (p == MAP_FAILED) {
> 		printf("failed to get memory\n");
> 		return -1;
> 	}
>
> 	for (i = 0; i < len / sizeof(*p); i++)
> 		p[i] = 0x55aa;
>
> 	/* Let the other half of the buffer become cold */
> 	do {
> 		for (i = 0; i < scan_len / sizeof(*p); i++)
> 			p[i] = 0x55aa;
>
> 		sleep(2);
>
> 		for (i = 0; i < scan_len / sizeof(*p); i++)
> 			j += p[i] >> 2;
> 	} while (1);
>
> 	munmap(p, len);
> 	return 0;
> }
>
> After setting the attrs/schemes/target_ids, start monitoring:
> echo 100000 1000000 1000000 10 1000 > /sys/kernel/debug/damon/attrs
> echo 4096 8192000 0 5 10 2000 5 1000 2097152 5000 0 0 0 0 0 3 2 1 > /sys/kernel/debug/damon/schemes
>
> After a while, you can check the demotion statistics with the command
> below, and you can see that the demote scheme was applied, demoting
> some cold pages to the slow memory (AEP) node.
>
> cat /proc/vmstat | grep "demote"
> pgdemote_direct 6881

Thank you for sharing these great details!

I was just wondering whether you have tested and measured the effects of the
memory allocation latency increase during the page demotion that is invoked
by shrink_page_list(), and also whether you have measured how much
improvement can be achieved with DAMON-based demotion in that scenario. It
seems that's not the case,
But, if you have tested or have a plan to do that, and if you could, I think sharing the results on this cover letter would make this even greater. Thanks, SJ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 0/2] Add a new scheme to support demotion on tiered memory system 2021-12-22 8:54 ` SeongJae Park @ 2021-12-22 9:57 ` Baolin Wang 0 siblings, 0 replies; 13+ messages in thread From: Baolin Wang @ 2021-12-22 9:57 UTC (permalink / raw) To: SeongJae Park Cc: akpm, ying.huang, dave.hansen, ziy, shy828301, zhongjiang-ali, xlpang, linux-mm, linux-kernel On 12/22/2021 4:54 PM, SeongJae Park wrote: [snip] >> >> My machine contains 64G DRAM + 256G AEP(persistent memory), and you >> should enable the demotion firstly by: >> echo "true" > /sys/kernel/mm/numa/demotion_enabled >> >> Then I just write a simple test case like below to mmap some anon >> memory, and then just read and write half of the mmap buffer to let >> another half to be cold enough to demote. >> >> int main() >> { >> int len = 50 * 1024 * 1024; >> int scan_len = len / 2; >> int i, ret, j; >> unsigned long *p; >> >> p = mmap(NULL, len, PROT_READ | PROT_WRITE, >> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); >> if (p == MAP_FAILED) { >> printf("failed to get memory\n"); >> return -1; >> } >> >> for (i = 0; i < len / sizeof(*p); i++) >> p[i] = 0x55aa; >> >> /* Let another half of buffer to be cold */ >> do { >> for (i = 0; i < scan_len / sizeof(*p); i++) >> p[i] = 0x55aa; >> >> sleep(2); >> >> for (i = 0; i < scan_len / sizeof(*p); i++) >> j += p[i] >> 2; >> } while (1); >> >> munmap(p, len); >> return 0; >> } >> >> After setting the atts/schemes/target_ids, then start monitoring: >> echo 100000 1000000 1000000 10 1000 > /sys/kernel/debug/damon/attrs >> echo 4096 8192000 0 5 10 2000 5 1000 2097152 5000 0 0 0 0 0 3 2 1 > >> /sys/kernel/debug/damon/schemes >> >> After a while, you can check the demote statictics by below command, and >> you can find the demote scheme is applied by demoting some cold pages to >> slow memory (AEP) node. >> >> cat /proc/vmstat | grep "demote" >> pgdemote_direct 6881 > > Thank you for sharing this great details! 
>
> I was just wondering whether you have tested and measured the effects of the
> memory allocation latency increase during the page demotion that is invoked
> by shrink_page_list(), and also whether you have measured how much
> improvement can be achieved with DAMON-based demotion in that scenario. It
> seems that's not the case,

I have not tested with a real workload and the DAMON demote scheme yet, and I
think DAMON still lacks some functions for tuning performance on tiered
memory systems. At least I think we also need to add a new promotion scheme
for DAMON to promote hot memory from the slow memory node to the fast memory
node, which is on my TODO list.

> and I personally think that information is not essential for this patch, so I
> see no problem here. But if you have tested it, or have a plan to do so, I
> think sharing the results in the cover letter would make this even greater.

Sure, will do if we find some interesting results with DAMON on tiered memory
systems in the future. Thanks.