From: Jingqi Liu

With PMEM nodes, the demotion path could be

1) DRAM pages: migrate to PMEM node
2) PMEM pages: swap out

This patch does (1) for anonymous pages only, since we cannot detect
the hotness of (unmapped) page cache pages for now.

The user space daemon can do migration in both directions:
- PMEM=>DRAM hot page migration
- DRAM=>PMEM cold page migration

However, it's more natural for user space to do hot page migration
and for the kernel to do cold page migration. In particular, only the
kernel can guarantee on-demand migration when there is memory pressure.

So the big picture will look like this: the user space daemon does
regular hot page migration to DRAM, creating memory pressure on DRAM
nodes, which triggers kernel cold page migration to PMEM nodes.

Du Fan:
- Support multiple NUMA nodes.
- Don't migrate clean MADV_FREE pages to PMEM node.

With the madvise(MADV_FREE) syscall, both the vma structure and its
corresponding page entries still live, but we get a MADV_FREE page:
anonymous but WITHOUT SwapBacked. On page reclaim, clean MADV_FREE
pages are freed and returned to the buddy system; the dirty ones turn
into canonical anonymous pages with PageSwapBacked(page) set, and are
put on the LRU_INACTIVE_FILE list, falling into the standard aging
routine.

The point is that clean MADV_FREE pages should not be migrated: they
hold stale (useless) user data once madvise(MADV_FREE) has been
called, so guard against such scenarios.

P.S. MADV_FREE is heavily used by the jemalloc engine and by workloads
like redis; refer to [1] for detailed background, use case, and
benchmark results.

[1] https://lore.kernel.org/patchwork/patch/622179/

Fengguang:
- detect migrate thp and hugetlb
- avoid moving pages to a non-existent node

Signed-off-by: Fan Du
Signed-off-by: Jingqi Liu
Signed-off-by: Fengguang Wu
---
 mm/vmscan.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

--- linux.orig/mm/vmscan.c	2018-12-23 20:37:58.305551976 +0800
+++ linux/mm/vmscan.c	2018-12-23 20:37:58.305551976 +0800
@@ -1112,6 +1112,7 @@ static unsigned long shrink_page_list(st
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(move_pages);
 	int pgactivate = 0;
 	unsigned nr_unqueued_dirty = 0;
 	unsigned nr_dirty = 0;
@@ -1121,6 +1122,7 @@ static unsigned long shrink_page_list(st
 	unsigned nr_immediate = 0;
 	unsigned nr_ref_keep = 0;
 	unsigned nr_unmap_fail = 0;
+	int page_on_dram = is_node_dram(pgdat->node_id);
 
 	cond_resched();
 
@@ -1275,6 +1277,21 @@ static unsigned long shrink_page_list(st
 		}
 
 		/*
+		 * Check if the page is in DRAM numa node.
+		 * Skip MADV_FREE pages as it might be freed
+		 * immediately to buddy system if it's clean.
+		 */
+		if (node_online(pgdat->peer_node) &&
+			PageAnon(page) && (PageSwapBacked(page) || PageTransHuge(page))) {
+			if (page_on_dram) {
+				/* Add to the page list which will be moved to pmem numa node. */
+				list_add(&page->lru, &move_pages);
+				unlock_page(page);
+				continue;
+			}
+		}
+
+		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 * Lazyfree page could be freed directly
@@ -1496,6 +1513,22 @@ keep:
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
+	/* Move the anonymous pages to PMEM numa node. */
+	if (!list_empty(&move_pages)) {
+		int err;
+
+		/* Could not block. */
+		err = migrate_pages(&move_pages, alloc_new_node_page, NULL,
+				pgdat->peer_node,
+				MIGRATE_ASYNC, MR_NUMA_MISPLACED);
+		if (err) {
+			putback_movable_pages(&move_pages);
+
+			/* Join the pages which were not migrated. */
+			list_splice(&ret_pages, &move_pages);
+		}
+	}
+
 	mem_cgroup_uncharge_list(&free_pages);
 	try_to_unmap_flush();
 	free_unref_page_list(&free_pages);
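
For readers following the demotion check outside the kernel tree, here is
a minimal user-space sketch of the condition this patch adds to
shrink_page_list(). It is not kernel code: struct page_model,
should_demote() and demotion_examples() are hypothetical names invented
for illustration, and the three booleans only stand in for PageAnon(),
PageSwapBacked() and PageTransHuge(). The sketch mirrors the condition as
written in the hunk above and assumes nothing about the rest of reclaim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page flags the patch tests. */
struct page_model {
	bool anon;        /* models PageAnon() */
	bool swap_backed; /* models PageSwapBacked(); clear for MADV_FREE pages */
	bool trans_huge;  /* models PageTransHuge() */
};

/*
 * Mirrors the condition added to shrink_page_list():
 * the peer node must be online, the page must be anonymous and
 * either swap-backed or a transparent huge page, and the node
 * being reclaimed must be a DRAM node.
 */
static bool should_demote(const struct page_model *page,
			  bool page_on_dram, bool peer_node_online)
{
	if (!peer_node_online)
		return false;
	if (!page->anon)
		return false;
	if (!page->swap_backed && !page->trans_huge)
		return false;
	return page_on_dram;
}

/* A few cases from the commit message, expressed as assertions. */
static void demotion_examples(void)
{
	struct page_model anon_page  = { .anon = true,  .swap_backed = true  };
	struct page_model lazy_free  = { .anon = true,  .swap_backed = false };
	struct page_model page_cache = { .anon = false, .swap_backed = false };

	assert(should_demote(&anon_page, true, true));   /* DRAM anon page: demote */
	assert(!should_demote(&anon_page, false, true)); /* already on PMEM: fall through */
	assert(!should_demote(&anon_page, true, false)); /* peer node offline: skip */
	assert(!should_demote(&lazy_free, true, true));  /* MADV_FREE page: skip */
	assert(!should_demote(&page_cache, true, true)); /* page cache: not handled */
}
```

In the sketch, a page for which should_demote() returns false simply
continues down the existing reclaim path, which is where PMEM pages reach
the swap-out case (2) described in the commit message.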