* [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management
@ 2019-04-04  2:00 Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 01/25] mm: migrate: Change migrate_mode to support combination migration modes Zi Yan
                   ` (26 more replies)
  0 siblings, 27 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Thanks to Dave Hansen's patches, PMEM can now be exposed as part of system
memory in the form of NUMA nodes. How to use PMEM together with normal DRAM
remains an open problem. Several patchsets posted on the mailing list propose
using page migration to move pages between PMEM and DRAM, driven by the Linux
page replacement policy [1,2,3]. Two important problems are not addressed by
these patches:
1. Page migration in Linux does not provide high enough throughput for us to
fully exploit PMEM or to serve other use cases.
2. Linux page replacement runs too infrequently to distinguish hot and cold
pages.

I am trying to attack these problems with this patch series. This is not a
final solution, but I would like to gather more feedback and comments from the
mailing list.

Page migration throughput problem
====

For example, in my recent email [4], I reported the page migration throughput
numbers for several migration cases, none of which achieves more than 2.5GB/s
(throughput is measured around the kernel functions migrate_pages() and
migrate_page_copy()):

                             |  migrate_pages() |    migrate_page_copy()
migrating single 4KB page:   |  0.312GB/s       |   1.385GB/s
migrating 512 4KB pages:     |  0.854GB/s       |   1.983GB/s
migrating single 2MB THP:    |  2.387GB/s       |   2.481GB/s

In reality, microbenchmarks show that Intel PMEM can provide ~65GB/s read
throughput and ~16GB/s write throughput [5], which are much higher than
the throughput achieved by Linux page migration.

It is also desirable to use page migration to move data between
high-bandwidth memory and DRAM, e.g. on IBM Summit, which exposes
high-performance GPU memory as NUMA nodes [6]. This requires even higher page
migration throughput.

In this patch series, I propose four different ways of improving page migration
throughput (mostly on 2MB THP migration):
1. multi-threaded page migration: Patch 03 to 06.
2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08.
3. concurrent (batched) page migration: Patch 09, 10, and 11.
4. exchange pages: Patch 12 to 17. (This is a repost of part of [7])
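
As a rough sketch of how the first three options plug into the existing
migration API: the copy-acceleration flags introduced by Patch 04 (MIGRATE_MT)
and Patch 08 (MIGRATE_DMA) are OR'ed into the existing migrate_mode, so the
call below mirrors the do_move_pages_to_node() change in Patch 04. Exchange
pages (option 4) go through a separate exchange path instead of
migrate_pages().

	/* pagelist: isolated source pages; node: destination NUMA node */
	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
			    MIGRATE_SYNC | MIGRATE_MT, MR_SYSCALL);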

Here are some throughput numbers showing clear improvements on a two-socket
NUMA machine with two Xeon E5-2650 v3 CPUs @ 2.30GHz and a 19.2GB/s QPI link
(the same machine as mentioned in [4]):

                                    |  migrate_pages() |   migrate_page_copy()
=> migrating single 2MB THP         |  2.387GB/s       |   2.481GB/s
 2-thread single THP migration      |  3.478GB/s       |   3.704GB/s
 4-thread single THP migration      |  5.474GB/s       |   6.054GB/s
 8-thread single THP migration      |  7.846GB/s       |   9.029GB/s
16-thread single THP migration      |  7.423GB/s       |   8.464GB/s
16-ch. DMA single THP migration     |  4.322GB/s       |   4.536GB/s

 2-thread 16-THP migration          |  3.610GB/s       |   3.838GB/s
 2-thread 16-THP batched migration  |  4.138GB/s       |   4.344GB/s
 4-thread 16-THP migration          |  6.385GB/s       |   7.031GB/s
 4-thread 16-THP batched migration  |  7.382GB/s       |   8.072GB/s
 8-thread 16-THP migration          |  8.039GB/s       |   9.029GB/s
 8-thread 16-THP batched migration  |  9.023GB/s       |   10.056GB/s
16-thread 16-THP migration          |  8.137GB/s       |   9.137GB/s
16-thread 16-THP batched migration  |  9.907GB/s       |   11.175GB/s

 1-thread 16-THP exchange           |  4.135GB/s       |   4.225GB/s
 2-thread 16-THP batched exchange   |  7.061GB/s       |   7.325GB/s
 4-thread 16-THP batched exchange   |  9.729GB/s       |   10.237GB/s
 8-thread 16-THP batched exchange   |  9.992GB/s       |   10.533GB/s
16-thread 16-THP batched exchange   |  9.520GB/s       |   10.056GB/s

=> migrating 512 4KB pages          |  0.854GB/s       |   1.983GB/s
 1-thread 512-4KB batched exchange  |  1.271GB/s       |   3.433GB/s
 2-thread 512-4KB batched exchange  |  1.240GB/s       |   3.190GB/s
 4-thread 512-4KB batched exchange  |  1.255GB/s       |   3.823GB/s
 8-thread 512-4KB batched exchange  |  1.336GB/s       |   3.921GB/s
16-thread 512-4KB batched exchange  |  1.334GB/s       |   3.897GB/s

Concerns have been raised about how to avoid CPU resource competition between
page migration and user applications, and how to stay power-aware. The
multi-threaded ktask patch series that Daniel Jordan posted recently could be
a solution [8].


Infrequent page list update problem
====

Currently, the page lists are updated by calling shrink_list() when memory
pressure arises, which might not be frequent enough to keep track of hot and
cold pages: all pages are on the active lists the first time shrink_list() is
called, and the reference bits on these pages might not reflect their
up-to-date access status. On the other hand, we do not want to shrink the
global page lists periodically, which would add unnecessary overhead to the
whole system. So I propose to actively shrink the page lists of the memcg we
are interested in.

Patches 18 to 25 add a new system call to shrink the page lists of a given
application's memcg and migrate its pages between two NUMA nodes, which
isolates the impact from the rest of the system. To share DRAM among different
applications, Patches 18 and 19 add a per-node memcg size limit, so you can
bound the memory usage on particular NUMA node(s).


Patch structure
====
1. multi-threaded page migration: Patch 01 to 06.
2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08.
3. concurrent (batched) page migration: Patch 09, 10, and 11.
4. exchange pages: Patch 12 to 17. (This is a repost of part of [7])
5. per-node size limit in memcg: Patch 18 and 19.
6. actively shrink page lists and perform page migration in a given memcg: Patch 20 to 25.


Any comment is welcome.

[1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
[2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/
[3]: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com/
[4]: https://lore.kernel.org/linux-mm/6A903D34-A293-4056-B135-6FA227DE1828@nvidia.com/
[5]: https://www.storagereview.com/supermicro_superserver_with_intel_optane_dc_persistent_memory_first_look_review
[6]: https://www.ibm.com/thought-leadership/summit-supercomputer/
[7]: https://lore.kernel.org/linux-mm/20190215220856.29749-1-zi.yan@sent.com/
[8]: https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jordan@oracle.com/

Zi Yan (25):
  mm: migrate: Change migrate_mode to support combination migration
    modes.
  mm: migrate: Add mode parameter to support future page copy routines.
  mm: migrate: Add a multi-threaded page migration function.
  mm: migrate: Add copy_page_multithread into migrate_pages.
  mm: migrate: Add vm.accel_page_copy in sysfs to control page copy
    acceleration.
  mm: migrate: Make the number of copy threads adjustable via sysctl.
  mm: migrate: Add copy_page_dma to use DMA Engine to copy pages.
  mm: migrate: Add copy_page_dma into migrate_page_copy.
  mm: migrate: Add copy_page_lists_dma_always to support copy a list of
       pages.
  mm: migrate: copy_page_lists_mt() to copy a page list using
    multi-threads.
  mm: migrate: Add concurrent page migration into move_pages syscall.
  exchange pages: new page migration mechanism: exchange_pages()
  exchange pages: add multi-threaded exchange pages.
  exchange pages: concurrent exchange pages.
  exchange pages: exchange anonymous page and file-backed page.
  exchange page: Add THP exchange support.
  exchange page: Add exchange_page() syscall.
  memcg: Add per node memory usage&max stats in memcg.
  mempolicy: add MPOL_F_MEMCG flag, enforcing memcg memory limit.
  memory manage: Add memory manage syscall.
  mm: move update_lru_sizes() to mm_inline.h for broader use.
  memory manage: active/inactive page list manipulation in memcg.
  memory manage: page migration based page manipulation between NUMA
    nodes.
  memory manage: limit migration batch size.
  memory manage: use exchange pages to memory manage to improve
    throughput.

 arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 fs/aio.c                               |   12 +-
 fs/f2fs/data.c                         |    6 +-
 fs/hugetlbfs/inode.c                   |    4 +-
 fs/iomap.c                             |    4 +-
 fs/ubifs/file.c                        |    4 +-
 include/linux/cgroup-defs.h            |    1 +
 include/linux/exchange.h               |   27 +
 include/linux/highmem.h                |    3 +
 include/linux/ksm.h                    |    4 +
 include/linux/memcontrol.h             |   67 ++
 include/linux/migrate.h                |   12 +-
 include/linux/migrate_mode.h           |    8 +
 include/linux/mm_inline.h              |   21 +
 include/linux/sched/coredump.h         |    1 +
 include/linux/sched/sysctl.h           |    3 +
 include/linux/syscalls.h               |   10 +
 include/uapi/linux/mempolicy.h         |    9 +-
 kernel/sysctl.c                        |   47 +
 mm/Makefile                            |    5 +
 mm/balloon_compaction.c                |    2 +-
 mm/compaction.c                        |   22 +-
 mm/copy_page.c                         |  708 +++++++++++++++
 mm/exchange.c                          | 1560 ++++++++++++++++++++++++++++++++
 mm/exchange_page.c                     |  228 +++++
 mm/internal.h                          |  113 +++
 mm/ksm.c                               |   35 +
 mm/memcontrol.c                        |   80 ++
 mm/memory_manage.c                     |  649 +++++++++++++
 mm/mempolicy.c                         |   38 +-
 mm/migrate.c                           |  621 ++++++++++++-
 mm/vmscan.c                            |  115 +--
 mm/zsmalloc.c                          |    2 +-
 33 files changed, 4261 insertions(+), 162 deletions(-)
 create mode 100644 include/linux/exchange.h
 create mode 100644 mm/copy_page.c
 create mode 100644 mm/exchange.c
 create mode 100644 mm/exchange_page.c
 create mode 100644 mm/memory_manage.c

--
2.7.4



* [RFC PATCH 01/25] mm: migrate: Change migrate_mode to support combination migration modes.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 02/25] mm: migrate: Add mode parameter to support future page copy routines Zi Yan
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

No functionality is changed. This prepares for the following patches, which
add parallel and concurrent page migration modes that can be combined with
the existing modes.
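
A minimal illustration of the intended encoding (MIGRATE_MT is only introduced
later, in Patch 04): the low bits keep the existing enum values, the higher
bits will carry copy-acceleration flags, and existing comparisons simply mask
the flag bits off first.

	enum migrate_mode mode = MIGRATE_SYNC | MIGRATE_MT;	/* base mode + flag bit */
	bool is_sync = (mode & MIGRATE_MODE_MASK) == MIGRATE_SYNC;	/* true */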

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/aio.c                     | 10 +++++-----
 fs/f2fs/data.c               |  4 ++--
 fs/hugetlbfs/inode.c         |  2 +-
 fs/iomap.c                   |  2 +-
 fs/ubifs/file.c              |  2 +-
 include/linux/migrate_mode.h |  2 ++
 mm/balloon_compaction.c      |  2 +-
 mm/compaction.c              | 22 +++++++++++-----------
 mm/migrate.c                 | 18 +++++++++---------
 mm/zsmalloc.c                |  2 +-
 10 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 38b741a..0a88dfd 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -389,7 +389,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
 	 * happen under the ctx->completion_lock. That does not work with the
 	 * migration workflow of MIGRATE_SYNC_NO_COPY.
 	 */
-	if (mode == MIGRATE_SYNC_NO_COPY)
+	if ((mode & MIGRATE_MODE_MASK) == MIGRATE_SYNC_NO_COPY)
 		return -EINVAL;
 
 	rc = 0;
@@ -1300,10 +1300,10 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
  *	Create an aio_context capable of receiving at least nr_events.
  *	ctxp must not point to an aio_context that already exists, and
  *	must be initialized to 0 prior to the call.  On successful
- *	creation of the aio_context, *ctxp is filled in with the resulting 
+ *	creation of the aio_context, *ctxp is filled in with the resulting
  *	handle.  May fail with -EINVAL if *ctxp is not initialized,
- *	if the specified nr_events exceeds internal limits.  May fail 
- *	with -EAGAIN if the specified nr_events exceeds the user's limit 
+ *	if the specified nr_events exceeds internal limits.  May fail
+ *	with -EAGAIN if the specified nr_events exceeds the user's limit
  *	of available events.  May fail with -ENOMEM if insufficient kernel
  *	resources are available.  May fail with -EFAULT if an invalid
  *	pointer is passed for ctxp.  Will fail with -ENOSYS if not
@@ -1373,7 +1373,7 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, u32 __user *, ctx32p)
 #endif
 
 /* sys_io_destroy:
- *	Destroy the aio_context specified.  May cancel any outstanding 
+ *	Destroy the aio_context specified.  May cancel any outstanding
  *	AIOs and block on completion.  Will fail with -ENOSYS if not
  *	implemented.  May fail with -EINVAL if the context pointed to
  *	is invalid.
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 97279441..e7f0e3a 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2792,7 +2792,7 @@ int f2fs_migrate_page(struct address_space *mapping,
 
 	/* migrating an atomic written page is safe with the inmem_lock hold */
 	if (atomic_written) {
-		if (mode != MIGRATE_SYNC)
+		if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC)
 			return -EBUSY;
 		if (!mutex_trylock(&fi->inmem_lock))
 			return -EAGAIN;
@@ -2825,7 +2825,7 @@ int f2fs_migrate_page(struct address_space *mapping,
 		f2fs_clear_page_private(page);
 	}
 
-	if (mode != MIGRATE_SYNC_NO_COPY)
+	if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC_NO_COPY)
 		migrate_page_copy(newpage, page);
 	else
 		migrate_page_states(newpage, page);
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ec32fec..04ba8bb 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -885,7 +885,7 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
 		set_page_private(page, 0);
 	}
 
-	if (mode != MIGRATE_SYNC_NO_COPY)
+	if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC_NO_COPY)
 		migrate_page_copy(newpage, page);
 	else
 		migrate_page_states(newpage, page);
diff --git a/fs/iomap.c b/fs/iomap.c
index abdd18e..8ee3f9f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -584,7 +584,7 @@ iomap_migrate_page(struct address_space *mapping, struct page *newpage,
 		SetPagePrivate(newpage);
 	}
 
-	if (mode != MIGRATE_SYNC_NO_COPY)
+	if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC_NO_COPY)
 		migrate_page_copy(newpage, page);
 	else
 		migrate_page_states(newpage, page);
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 5d2ffb1..2bb8788 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1490,7 +1490,7 @@ static int ubifs_migrate_page(struct address_space *mapping,
 		SetPagePrivate(newpage);
 	}
 
-	if (mode != MIGRATE_SYNC_NO_COPY)
+	if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC_NO_COPY)
 		migrate_page_copy(newpage, page);
 	else
 		migrate_page_states(newpage, page);
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 883c992..59d75fc 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -17,6 +17,8 @@ enum migrate_mode {
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
 	MIGRATE_SYNC_NO_COPY,
+
+	MIGRATE_MODE_MASK = 3,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index ef858d5..5acb55f 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -158,7 +158,7 @@ int balloon_page_migrate(struct address_space *mapping,
 	 * is unlikely to be use with ballon pages. See include/linux/hmm.h for
 	 * user of the MIGRATE_SYNC_NO_COPY mode.
 	 */
-	if (mode == MIGRATE_SYNC_NO_COPY)
+	if ((mode & MIGRATE_MODE_MASK) == MIGRATE_SYNC_NO_COPY)
 		return -EINVAL;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
diff --git a/mm/compaction.c b/mm/compaction.c
index f171a83..bfcbe08 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -408,7 +408,7 @@ static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
 
 	if (pfn > zone->compact_cached_migrate_pfn[0])
 		zone->compact_cached_migrate_pfn[0] = pfn;
-	if (cc->mode != MIGRATE_ASYNC &&
+	if ((cc->mode & MIGRATE_MODE_MASK) != MIGRATE_ASYNC &&
 	    pfn > zone->compact_cached_migrate_pfn[1])
 		zone->compact_cached_migrate_pfn[1] = pfn;
 }
@@ -475,7 +475,7 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
 						struct compact_control *cc)
 {
 	/* Track if the lock is contended in async mode */
-	if (cc->mode == MIGRATE_ASYNC && !cc->contended) {
+	if (((cc->mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC) && !cc->contended) {
 		if (spin_trylock_irqsave(lock, *flags))
 			return true;
 
@@ -792,7 +792,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	 */
 	while (unlikely(too_many_isolated(pgdat))) {
 		/* async migration should just abort */
-		if (cc->mode == MIGRATE_ASYNC)
+		if ((cc->mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC)
 			return 0;
 
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -803,7 +803,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 	cond_resched();
 
-	if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
+	if (cc->direct_compaction && ((cc->mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC)) {
 		skip_on_failure = true;
 		next_skip_pfn = block_end_pfn(low_pfn, cc->order);
 	}
@@ -1117,7 +1117,7 @@ static bool suitable_migration_source(struct compact_control *cc,
 	if (pageblock_skip_persistent(page))
 		return false;
 
-	if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
+	if (((cc->mode & MIGRATE_MODE_MASK) != MIGRATE_ASYNC) || !cc->direct_compaction)
 		return true;
 
 	block_mt = get_pageblock_migratetype(page);
@@ -1216,7 +1216,7 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long
 		return;
 
 	/* Minimise scanning during async compaction */
-	if (cc->direct_compaction && cc->mode == MIGRATE_ASYNC)
+	if (cc->direct_compaction && (cc->mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC)
 		return;
 
 	/* Pageblock boundaries */
@@ -1448,7 +1448,7 @@ static void isolate_freepages(struct compact_control *cc)
 	block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
 						zone_end_pfn(zone));
 	low_pfn = pageblock_end_pfn(cc->migrate_pfn);
-	stride = cc->mode == MIGRATE_ASYNC ? COMPACT_CLUSTER_MAX : 1;
+	stride = (cc->mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC ? COMPACT_CLUSTER_MAX : 1;
 
 	/*
 	 * Isolate free pages until enough are available to migrate the
@@ -1734,7 +1734,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	struct page *page;
 	const isolate_mode_t isolate_mode =
 		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
-		(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
+		(((cc->mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC) ? ISOLATE_ASYNC_MIGRATE : 0);
 	bool fast_find_block;
 
 	/*
@@ -1907,7 +1907,7 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 			 * to sync compaction, as async compaction operates
 			 * on pageblocks of the same migratetype.
 			 */
-			if (cc->mode == MIGRATE_ASYNC ||
+			if ((cc->mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC ||
 					IS_ALIGNED(cc->migrate_pfn,
 							pageblock_nr_pages)) {
 				return COMPACT_SUCCESS;
@@ -2063,7 +2063,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	unsigned long start_pfn = cc->zone->zone_start_pfn;
 	unsigned long end_pfn = zone_end_pfn(cc->zone);
 	unsigned long last_migrated_pfn;
-	const bool sync = cc->mode != MIGRATE_ASYNC;
+	const bool sync = (cc->mode & MIGRATE_MODE_MASK) != MIGRATE_ASYNC;
 	bool update_cached;
 
 	cc->migratetype = gfpflags_to_migratetype(cc->gfp_mask);
@@ -2195,7 +2195,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 			 * order-aligned block, so skip the rest of it.
 			 */
 			if (cc->direct_compaction &&
-						(cc->mode == MIGRATE_ASYNC)) {
+						((cc->mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC)) {
 				cc->migrate_pfn = block_end_pfn(
 						cc->migrate_pfn - 1, cc->order);
 				/* Draining pcplists is useless in this case */
diff --git a/mm/migrate.c b/mm/migrate.c
index ac6f493..c161c03 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -691,7 +691,7 @@ int migrate_page(struct address_space *mapping,
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
 
-	if (mode != MIGRATE_SYNC_NO_COPY)
+	if ((mode & MIGRATE_MODE_MASK) !=  MIGRATE_SYNC_NO_COPY)
 		migrate_page_copy(newpage, page);
 	else
 		migrate_page_states(newpage, page);
@@ -707,7 +707,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 	struct buffer_head *bh = head;
 
 	/* Simple case, sync compaction */
-	if (mode != MIGRATE_ASYNC) {
+	if ((mode & MIGRATE_MODE_MASK) != MIGRATE_ASYNC) {
 		do {
 			lock_buffer(bh);
 			bh = bh->b_this_page;
@@ -804,7 +804,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
 
 	SetPagePrivate(newpage);
 
-	if (mode != MIGRATE_SYNC_NO_COPY)
+	if ((mode & MIGRATE_MODE_MASK) !=  MIGRATE_SYNC_NO_COPY)
 		migrate_page_copy(newpage, page);
 	else
 		migrate_page_states(newpage, page);
@@ -895,7 +895,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 {
 	if (PageDirty(page)) {
 		/* Only writeback pages in full synchronous migration */
-		switch (mode) {
+		switch (mode & MIGRATE_MODE_MASK) {
 		case MIGRATE_SYNC:
 		case MIGRATE_SYNC_NO_COPY:
 			break;
@@ -911,7 +911,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	 */
 	if (page_has_private(page) &&
 	    !try_to_release_page(page, GFP_KERNEL))
-		return mode == MIGRATE_SYNC ? -EAGAIN : -EBUSY;
+		return (mode & MIGRATE_MODE_MASK) == MIGRATE_SYNC ? -EAGAIN : -EBUSY;
 
 	return migrate_page(mapping, newpage, page, mode);
 }
@@ -1009,7 +1009,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	bool is_lru = !__PageMovable(page);
 
 	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC)
+		if (!force || ((mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC))
 			goto out;
 
 		/*
@@ -1038,7 +1038,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * the retry loop is too short and in the sync-light case,
 		 * the overhead of stalling is too much
 		 */
-		switch (mode) {
+		switch (mode & MIGRATE_MODE_MASK) {
 		case MIGRATE_SYNC:
 		case MIGRATE_SYNC_NO_COPY:
 			break;
@@ -1303,9 +1303,9 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		return -ENOMEM;
 
 	if (!trylock_page(hpage)) {
-		if (!force)
+		if (!force || ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC))
 			goto out;
-		switch (mode) {
+		switch (mode & MIGRATE_MODE_MASK) {
 		case MIGRATE_SYNC:
 		case MIGRATE_SYNC_NO_COPY:
 			break;
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 0787d33..018bb51 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1981,7 +1981,7 @@ static int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	 * happen under the zs lock, which does not work with
 	 * MIGRATE_SYNC_NO_COPY workflow.
 	 */
-	if (mode == MIGRATE_SYNC_NO_COPY)
+	if ((mode & MIGRATE_MODE_MASK) == MIGRATE_SYNC_NO_COPY)
 		return -EINVAL;
 
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
-- 
2.7.4



* [RFC PATCH 02/25] mm: migrate: Add mode parameter to support future page copy routines.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 01/25] mm: migrate: Change migrate_mode to support combination migration modes Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 03/25] mm: migrate: Add a multi-threaded page migration function Zi Yan
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

MIGRATE_SINGLETHREAD is added as the default behavior, and migrate_page_copy()
and copy_huge_page() are changed to take the migration mode as a parameter.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/aio.c                     |  2 +-
 fs/f2fs/data.c               |  2 +-
 fs/hugetlbfs/inode.c         |  2 +-
 fs/iomap.c                   |  2 +-
 fs/ubifs/file.c              |  2 +-
 include/linux/migrate.h      |  6 ++++--
 include/linux/migrate_mode.h |  3 +++
 mm/migrate.c                 | 14 ++++++++------
 8 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 0a88dfd..986d21e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -437,7 +437,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
 	 * events from being lost.
 	 */
 	spin_lock_irqsave(&ctx->completion_lock, flags);
-	migrate_page_copy(new, old);
+	migrate_page_copy(new, old, MIGRATE_SINGLETHREAD);
 	BUG_ON(ctx->ring_pages[idx] != old);
 	ctx->ring_pages[idx] = new;
 	spin_unlock_irqrestore(&ctx->completion_lock, flags);
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index e7f0e3a..6a419a9 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2826,7 +2826,7 @@ int f2fs_migrate_page(struct address_space *mapping,
 	}
 
 	if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC_NO_COPY)
-		migrate_page_copy(newpage, page);
+		migrate_page_copy(newpage, page, MIGRATE_SINGLETHREAD);
 	else
 		migrate_page_states(newpage, page);
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 04ba8bb..03dfa49 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -886,7 +886,7 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
 	}
 
 	if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC_NO_COPY)
-		migrate_page_copy(newpage, page);
+		migrate_page_copy(newpage, page, MIGRATE_SINGLETHREAD);
 	else
 		migrate_page_states(newpage, page);
 
diff --git a/fs/iomap.c b/fs/iomap.c
index 8ee3f9f..a6e0456 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -585,7 +585,7 @@ iomap_migrate_page(struct address_space *mapping, struct page *newpage,
 	}
 
 	if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC_NO_COPY)
-		migrate_page_copy(newpage, page);
+		migrate_page_copy(newpage, page, MIGRATE_SINGLETHREAD);
 	else
 		migrate_page_states(newpage, page);
 	return MIGRATEPAGE_SUCCESS;
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 2bb8788..3a3dbbd 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1491,7 +1491,7 @@ static int ubifs_migrate_page(struct address_space *mapping,
 	}
 
 	if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC_NO_COPY)
-		migrate_page_copy(newpage, page);
+		migrate_page_copy(newpage, page, MIGRATE_SINGLETHREAD);
 	else
 		migrate_page_states(newpage, page);
 	return MIGRATEPAGE_SUCCESS;
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf..5218a07 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -73,7 +73,8 @@ extern void putback_movable_page(struct page *page);
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
 extern void migrate_page_states(struct page *newpage, struct page *page);
-extern void migrate_page_copy(struct page *newpage, struct page *page);
+extern void migrate_page_copy(struct page *newpage, struct page *page,
+				  enum migrate_mode mode);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
@@ -97,7 +98,8 @@ static inline void migrate_page_states(struct page *newpage, struct page *page)
 }
 
 static inline void migrate_page_copy(struct page *newpage,
-				     struct page *page) {}
+				     struct page *page,
+				     enum migrate_mode mode) {}
 
 static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page)
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 59d75fc..da44940 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -11,6 +11,8 @@
  *	with the CPU. Instead, page copy happens outside the migratepage()
  *	callback and is likely using a DMA engine. See migrate_vma() and HMM
  *	(mm/hmm.c) for users of this mode.
+ * MIGRATE_SINGLETHREAD uses a single thread to move pages, it is the default
+ *	behavior
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
@@ -19,6 +21,7 @@ enum migrate_mode {
 	MIGRATE_SYNC_NO_COPY,
 
 	MIGRATE_MODE_MASK = 3,
+	MIGRATE_SINGLETHREAD	= 0,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/migrate.c b/mm/migrate.c
index c161c03..2b2653e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -567,7 +567,8 @@ static void __copy_gigantic_page(struct page *dst, struct page *src,
 	}
 }
 
-static void copy_huge_page(struct page *dst, struct page *src)
+static void copy_huge_page(struct page *dst, struct page *src,
+				enum migrate_mode mode)
 {
 	int i;
 	int nr_pages;
@@ -657,10 +658,11 @@ void migrate_page_states(struct page *newpage, struct page *page)
 }
 EXPORT_SYMBOL(migrate_page_states);
 
-void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_copy(struct page *newpage, struct page *page,
+		enum migrate_mode mode)
 {
 	if (PageHuge(page) || PageTransHuge(page))
-		copy_huge_page(newpage, page);
+		copy_huge_page(newpage, page, mode);
 	else
 		copy_highpage(newpage, page);
 
@@ -692,7 +694,7 @@ int migrate_page(struct address_space *mapping,
 		return rc;
 
 	if ((mode & MIGRATE_MODE_MASK) !=  MIGRATE_SYNC_NO_COPY)
-		migrate_page_copy(newpage, page);
+		migrate_page_copy(newpage, page, mode);
 	else
 		migrate_page_states(newpage, page);
 	return MIGRATEPAGE_SUCCESS;
@@ -805,7 +807,7 @@ static int __buffer_migrate_page(struct address_space *mapping,
 	SetPagePrivate(newpage);
 
 	if ((mode & MIGRATE_MODE_MASK) !=  MIGRATE_SYNC_NO_COPY)
-		migrate_page_copy(newpage, page);
+		migrate_page_copy(newpage, page, MIGRATE_SINGLETHREAD);
 	else
 		migrate_page_states(newpage, page);
 
@@ -2024,7 +2026,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	new_page->index = page->index;
 	/* flush the cache before copying using the kernel virtual address */
 	flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
-	migrate_page_copy(new_page, page);
+	migrate_page_copy(new_page, page, MIGRATE_SINGLETHREAD);
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-- 
2.7.4



* [RFC PATCH 03/25] mm: migrate: Add a multi-threaded page migration function.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 01/25] mm: migrate: Change migrate_mode to support combination migration modes Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 02/25] mm: migrate: Add mode parameter to support future page copy routines Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 04/25] mm: migrate: Add copy_page_multithread into migrate_pages Zi Yan
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

A copy_page_multithread() function is added to migrate huge pages in a
multi-threaded way, which provides higher throughput than the single-threaded
copy.

Internally, copy_page_multithread() splits a huge page into chunks and
distributes the chunks across multiple threads by sending them as jobs to
system_highpri_wq.
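
As a rough worked example (assuming 4KB base pages; the call below mirrors how
Patch 04 wires this into copy_huge_page()):

	/* A 2MB THP is 512 base pages; with the default limit_mt_num = 4,
	 * each worker thread copies PAGE_SIZE * 512 / 4 = 512KB. */
	rc = copy_page_multithread(dst, src, hpage_nr_pages(src));
	/* A non-zero rc (-ENOMEM or -ENODEV) means the caller falls back
	 * to the serial copy_highpage() loop, as Patch 04 does. */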

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/highmem.h |   2 +
 mm/Makefile             |   2 +
 mm/copy_page.c          | 128 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 132 insertions(+)
 create mode 100644 mm/copy_page.c

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ea5cdbd8c..0f50dc5 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -276,4 +276,6 @@ static inline void copy_highpage(struct page *to, struct page *from)
 
 #endif
 
+int copy_page_multithread(struct page *to, struct page *from, int nr_pages);
+
 #endif /* _LINUX_HIGHMEM_H */
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9..fa02a9f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -44,6 +44,8 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 obj-y += init-mm.o
 obj-y += memblock.o
 
+obj-y += copy_page.o
+
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
 endif
diff --git a/mm/copy_page.c b/mm/copy_page.c
new file mode 100644
index 0000000..9cf849c
--- /dev/null
+++ b/mm/copy_page.c
@@ -0,0 +1,128 @@
+/*
+ * Enhanced page copy routine.
+ *
+ * Copyright 2019 by NVIDIA.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Zi Yan <ziy@nvidia.com>
+ *
+ */
+
+#include <linux/highmem.h>
+#include <linux/workqueue.h>
+#include <linux/slab.h>
+#include <linux/freezer.h>
+
+
+const unsigned int limit_mt_num = 4;
+
+/* ======================== multi-threaded copy page ======================== */
+
+struct copy_item {
+	char *to;
+	char *from;
+	unsigned long chunk_size;
+};
+
+struct copy_page_info {
+	struct work_struct copy_page_work;
+	unsigned long num_items;
+	struct copy_item item_list[0];
+};
+
+static void copy_page_routine(char *vto, char *vfrom,
+	unsigned long chunk_size)
+{
+	memcpy(vto, vfrom, chunk_size);
+}
+
+static void copy_page_work_queue_thread(struct work_struct *work)
+{
+	struct copy_page_info *my_work = (struct copy_page_info *)work;
+	int i;
+
+	for (i = 0; i < my_work->num_items; ++i)
+		copy_page_routine(my_work->item_list[i].to,
+						  my_work->item_list[i].from,
+						  my_work->item_list[i].chunk_size);
+}
+
+int copy_page_multithread(struct page *to, struct page *from, int nr_pages)
+{
+	unsigned int total_mt_num = limit_mt_num;
+	int to_node = page_to_nid(to);
+	int i;
+	struct copy_page_info *work_items[NR_CPUS] = {0};
+	char *vto, *vfrom;
+	unsigned long chunk_size;
+	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
+	int cpu_id_list[NR_CPUS] = {0};
+	int cpu;
+	int err = 0;
+
+	total_mt_num = min_t(unsigned int, total_mt_num,
+						 cpumask_weight(per_node_cpumask));
+	if (total_mt_num > 1)
+		total_mt_num = (total_mt_num / 2) * 2;
+
+	if (total_mt_num > num_online_cpus() || total_mt_num <=1)
+		return -ENODEV;
+
+	for (cpu = 0; cpu < total_mt_num; ++cpu) {
+		work_items[cpu] = kzalloc(sizeof(struct copy_page_info)
+						+ sizeof(struct copy_item), GFP_KERNEL);
+		if (!work_items[cpu]) {
+			err = -ENOMEM;
+			goto free_work_items;
+		}
+	}
+
+	i = 0;
+	for_each_cpu(cpu, per_node_cpumask) {
+		if (i >= total_mt_num)
+			break;
+		cpu_id_list[i] = cpu;
+		++i;
+	}
+
+	vfrom = kmap(from);
+	vto = kmap(to);
+	chunk_size = PAGE_SIZE*nr_pages / total_mt_num;
+
+	for (i = 0; i < total_mt_num; ++i) {
+		INIT_WORK((struct work_struct *)work_items[i],
+				  copy_page_work_queue_thread);
+
+		work_items[i]->num_items = 1;
+		work_items[i]->item_list[0].to = vto + i * chunk_size;
+		work_items[i]->item_list[0].from = vfrom + i * chunk_size;
+		work_items[i]->item_list[0].chunk_size = chunk_size;
+
+		queue_work_on(cpu_id_list[i],
+					  system_highpri_wq,
+					  (struct work_struct *)work_items[i]);
+	}
+
+	/* Wait until it finishes  */
+	for (i = 0; i < total_mt_num; ++i)
+		flush_work((struct work_struct *)work_items[i]);
+
+	kunmap(to);
+	kunmap(from);
+
+free_work_items:
+	for (cpu = 0; cpu < total_mt_num; ++cpu)
+		if (work_items[cpu])
+			kfree(work_items[cpu]);
+
+	return err;
+}
-- 
2.7.4



* [RFC PATCH 04/25] mm: migrate: Add copy_page_multithread into migrate_pages.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (2 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 03/25] mm: migrate: Add a multi-threaded page migration function Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 05/25] mm: migrate: Add vm.accel_page_copy in sysfs to control page copy acceleration Zi Yan
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

An option (MPOL_MF_MOVE_MT) is added to the move_pages() syscall to request
multi-threaded page migration.
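
A userspace sketch of how this option could be exercised, assuming a kernel
with this series applied; MPOL_MF_MOVE_MT is the value this patch adds to the
uapi header and is defined locally here in case installed headers predate it:

#define _GNU_SOURCE
#include <linux/mempolicy.h>	/* MPOL_MF_MOVE */
#include <sys/syscall.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef MPOL_MF_MOVE_MT
#define MPOL_MF_MOVE_MT	(1 << 6)	/* added by this patch */
#endif

int main(void)
{
	/* one 2MB (THP-sized) buffer, touched so it is populated */
	void *buf = aligned_alloc(2UL << 20, 2UL << 20);
	void *pages[1] = { buf };
	int nodes[1] = { 1 };		/* example destination NUMA node */
	int status[1];

	if (!buf)
		return 1;
	((volatile char *)buf)[0] = 1;

	if (syscall(SYS_move_pages, 0 /* self */, 1UL, pages, nodes, status,
		    MPOL_MF_MOVE | MPOL_MF_MOVE_MT) < 0)
		perror("move_pages");
	else
		printf("page is now on node %d\n", status[0]);
	return 0;
}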

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/migrate_mode.h   |  1 +
 include/uapi/linux/mempolicy.h |  2 ++
 mm/migrate.c                   | 29 +++++++++++++++++++----------
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index da44940..5bc8a77 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -22,6 +22,7 @@ enum migrate_mode {
 
 	MIGRATE_MODE_MASK = 3,
 	MIGRATE_SINGLETHREAD	= 0,
+	MIGRATE_MT				= 1<<4,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3354774..890269b 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -48,6 +48,8 @@ enum {
 #define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
 #define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
 
+#define MPOL_MF_MOVE_MT  (1<<6)	/* Use multi-threaded page copy routine */
+
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
 			 MPOL_MF_MOVE_ALL)
diff --git a/mm/migrate.c b/mm/migrate.c
index 2b2653e..dd6ccbe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -572,6 +572,7 @@ static void copy_huge_page(struct page *dst, struct page *src,
 {
 	int i;
 	int nr_pages;
+	int rc = -EFAULT;
 
 	if (PageHuge(src)) {
 		/* hugetlbfs page */
@@ -588,10 +589,14 @@ static void copy_huge_page(struct page *dst, struct page *src,
 		nr_pages = hpage_nr_pages(src);
 	}
 
-	for (i = 0; i < nr_pages; i++) {
-		cond_resched();
-		copy_highpage(dst + i, src + i);
-	}
+	if (mode & MIGRATE_MT)
+		rc = copy_page_multithread(dst, src, nr_pages);
+
+	if (rc)
+		for (i = 0; i < nr_pages; i++) {
+			cond_resched();
+			copy_highpage(dst + i, src + i);
+		}
 }
 
 /*
@@ -1500,7 +1505,7 @@ static int store_status(int __user *status, int start, int value, int nr)
 }
 
 static int do_move_pages_to_node(struct mm_struct *mm,
-		struct list_head *pagelist, int node)
+		struct list_head *pagelist, int node, bool migrate_mt)
 {
 	int err;
 
@@ -1508,7 +1513,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
 		return 0;
 
 	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
-			MIGRATE_SYNC, MR_SYSCALL);
+			MIGRATE_SYNC | (migrate_mt ? MIGRATE_MT : MIGRATE_SINGLETHREAD),
+			MR_SYSCALL);
 	if (err)
 		putback_movable_pages(pagelist);
 	return err;
@@ -1629,7 +1635,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			current_node = node;
 			start = i;
 		} else if (node != current_node) {
-			err = do_move_pages_to_node(mm, &pagelist, current_node);
+			err = do_move_pages_to_node(mm, &pagelist, current_node,
+				flags & MPOL_MF_MOVE_MT);
 			if (err)
 				goto out;
 			err = store_status(status, start, current_node, i - start);
@@ -1652,7 +1659,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 		if (err)
 			goto out_flush;
 
-		err = do_move_pages_to_node(mm, &pagelist, current_node);
+		err = do_move_pages_to_node(mm, &pagelist, current_node,
+				flags & MPOL_MF_MOVE_MT);
 		if (err)
 			goto out;
 		if (i > start) {
@@ -1667,7 +1675,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 		return err;
 
 	/* Make sure we do not overwrite the existing error */
-	err1 = do_move_pages_to_node(mm, &pagelist, current_node);
+	err1 = do_move_pages_to_node(mm, &pagelist, current_node,
+				flags & MPOL_MF_MOVE_MT);
 	if (!err1)
 		err1 = store_status(status, start, current_node, i - start);
 	if (!err)
@@ -1763,7 +1772,7 @@ static int kernel_move_pages(pid_t pid, unsigned long nr_pages,
 	nodemask_t task_nodes;
 
 	/* Check flags */
-	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_MOVE_MT))
 		return -EINVAL;
 
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
-- 
2.7.4



* [RFC PATCH 05/25] mm: migrate: Add vm.accel_page_copy in sysfs to control page copy acceleration.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (3 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 04/25] mm: migrate: Add copy_page_multithread into migrate_pages Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 06/25] mm: migrate: Make the number of copy threads adjustable via sysctl Zi Yan
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Since base (4KB) page migration did not gain any speedup from the
multi-threaded method, only the huge page case is accelerated.
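
The knob lands under /proc/sys/vm/ via the vm_table entry below and defaults
to 1 (acceleration on). A small sketch of flipping it from userspace; the
knobs added by Patches 06 and 07 (vm.limit_mt_num, vm.use_all_dma_chans,
vm.limit_dma_chans) are toggled the same way:

#include <stdio.h>

int main(void)
{
	/* Disable multi-threaded huge page copy (requires root). */
	FILE *f = fopen("/proc/sys/vm/accel_page_copy", "w");

	if (!f) {
		perror("accel_page_copy");
		return 1;
	}
	fputs("0\n", f);
	fclose(f);
	return 0;
}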

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 kernel/sysctl.c | 11 +++++++++++
 mm/migrate.c    |  6 ++++++
 2 files changed, 17 insertions(+)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e5da394..3d8490e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -101,6 +101,8 @@
 
 #if defined(CONFIG_SYSCTL)
 
+extern int accel_page_copy;
+
 /* External variables not in a header file. */
 extern int suid_dumpable;
 #ifdef CONFIG_COREDUMP
@@ -1430,6 +1432,15 @@ static struct ctl_table vm_table[] = {
 		.extra2			= &one,
 	},
 #endif
+	{
+		.procname	= "accel_page_copy",
+		.data		= &accel_page_copy,
+		.maxlen		= sizeof(accel_page_copy),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 	 {
 		.procname	= "hugetlb_shm_group",
 		.data		= &sysctl_hugetlb_shm_group,
diff --git a/mm/migrate.c b/mm/migrate.c
index dd6ccbe..8a344e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -55,6 +55,8 @@
 
 #include "internal.h"
 
+int accel_page_copy = 1;
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -589,6 +591,10 @@ static void copy_huge_page(struct page *dst, struct page *src,
 		nr_pages = hpage_nr_pages(src);
 	}
 
+	/* Try to accelerate page migration if it is not specified in mode  */
+	if (accel_page_copy)
+		mode |= MIGRATE_MT;
+
 	if (mode & MIGRATE_MT)
 		rc = copy_page_multithread(dst, src, nr_pages);
 
-- 
2.7.4



* [RFC PATCH 06/25] mm: migrate: Make the number of copy threads adjustable via sysctl.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (4 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 05/25] mm: migrate: Add vm.accel_page_copy in sysfs to control page copy acceleration Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 07/25] mm: migrate: Add copy_page_dma to use DMA Engine to copy pages Zi Yan
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 kernel/sysctl.c | 9 +++++++++
 mm/copy_page.c  | 2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3d8490e..0eae0b8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -102,6 +102,7 @@
 #if defined(CONFIG_SYSCTL)
 
 extern int accel_page_copy;
+extern unsigned int limit_mt_num;
 
 /* External variables not in a header file. */
 extern int suid_dumpable;
@@ -1441,6 +1442,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "limit_mt_num",
+		.data		= &limit_mt_num,
+		.maxlen		= sizeof(limit_mt_num),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
 	 {
 		.procname	= "hugetlb_shm_group",
 		.data		= &sysctl_hugetlb_shm_group,
diff --git a/mm/copy_page.c b/mm/copy_page.c
index 9cf849c..6665e3d 100644
--- a/mm/copy_page.c
+++ b/mm/copy_page.c
@@ -23,7 +23,7 @@
 #include <linux/freezer.h>
 
 
-const unsigned int limit_mt_num = 4;
+unsigned int limit_mt_num = 4;
 
 /* ======================== multi-threaded copy page ======================== */
 
-- 
2.7.4



* [RFC PATCH 07/25] mm: migrate: Add copy_page_dma to use DMA Engine to copy pages.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (5 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 06/25] mm: migrate: Make the number of copy threads adjustable via sysctl Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 08/25] mm: migrate: Add copy_page_dma into migrate_page_copy Zi Yan
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

vm.use_all_dma_chans, when set, grabs all usable DMA channels.
vm.limit_dma_chans limits how many DMA channels are used.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/highmem.h      |   1 +
 include/linux/sched/sysctl.h |   3 +
 kernel/sysctl.c              |  19 +++
 mm/copy_page.c               | 291 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 314 insertions(+)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 0f50dc5..119bb39 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -277,5 +277,6 @@ static inline void copy_highpage(struct page *to, struct page *from)
 #endif
 
 int copy_page_multithread(struct page *to, struct page *from, int nr_pages);
+int copy_page_dma(struct page *to, struct page *from, int nr_pages);
 
 #endif /* _LINUX_HIGHMEM_H */
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 99ce6d7..ce11241 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -90,4 +90,7 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
 				 loff_t *ppos);
 #endif
 
+extern int sysctl_dma_page_migration(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
 #endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 0eae0b8..b8712eb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -103,6 +103,8 @@
 
 extern int accel_page_copy;
 extern unsigned int limit_mt_num;
+extern int use_all_dma_chans;
+extern int limit_dma_chans;
 
 /* External variables not in a header file. */
 extern int suid_dumpable;
@@ -1451,6 +1453,23 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 	 {
+		.procname	= "use_all_dma_chans",
+		.data		= &use_all_dma_chans,
+		.maxlen		= sizeof(use_all_dma_chans),
+		.mode		= 0644,
+		.proc_handler	= sysctl_dma_page_migration,
+		.extra1		= &zero,
+		.extra2		= &one,
+	 },
+	 {
+		.procname	= "limit_dma_chans",
+		.data		= &limit_dma_chans,
+		.maxlen		= sizeof(limit_dma_chans),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	 },
+	 {
 		.procname	= "hugetlb_shm_group",
 		.data		= &sysctl_hugetlb_shm_group,
 		.maxlen		= sizeof(gid_t),
diff --git a/mm/copy_page.c b/mm/copy_page.c
index 6665e3d..5e7a797 100644
--- a/mm/copy_page.c
+++ b/mm/copy_page.c
@@ -126,3 +126,294 @@ int copy_page_multithread(struct page *to, struct page *from, int nr_pages)
 
 	return err;
 }
+/* ======================== DMA copy page ======================== */
+#include <linux/dmaengine.h>
+#include <linux/dma-mapping.h>
+
+#define NUM_AVAIL_DMA_CHAN 16
+
+
+int use_all_dma_chans = 0;
+int limit_dma_chans = NUM_AVAIL_DMA_CHAN;
+
+
+struct dma_chan *copy_chan[NUM_AVAIL_DMA_CHAN] = {0};
+struct dma_device *copy_dev[NUM_AVAIL_DMA_CHAN] = {0};
+
+
+
+#ifdef CONFIG_PROC_SYSCTL
+extern int proc_dointvec_minmax(struct ctl_table *table, int write,
+		  void __user *buffer, size_t *lenp, loff_t *ppos);
+int sysctl_dma_page_migration(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos)
+{
+	int err = 0;
+	int use_all_dma_chans_prior_val = use_all_dma_chans;
+	dma_cap_mask_t copy_mask;
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (err < 0)
+		return err;
+	if (write) {
+		/* Grab all DMA channels  */
+		if (use_all_dma_chans_prior_val == 0 && use_all_dma_chans == 1) {
+			int i;
+
+			dma_cap_zero(copy_mask);
+			dma_cap_set(DMA_MEMCPY, copy_mask);
+
+			dmaengine_get();
+			for (i = 0; i < NUM_AVAIL_DMA_CHAN; ++i) {
+				if (!copy_chan[i]) {
+					copy_chan[i] = dma_request_channel(copy_mask, NULL, NULL);
+				}
+				if (!copy_chan[i]) {
+					pr_err("%s: cannot grab channel: %d\n", __func__, i);
+					continue;
+				}
+
+				copy_dev[i] = copy_chan[i]->device;
+
+				if (!copy_dev[i]) {
+					pr_err("%s: no device: %d\n", __func__, i);
+					continue;
+				}
+			}
+
+		}
+		/* Release all DMA channels  */
+		else if (use_all_dma_chans_prior_val == 1 && use_all_dma_chans == 0) {
+			int i;
+
+			for (i = 0; i < NUM_AVAIL_DMA_CHAN; ++i) {
+				if (copy_chan[i]) {
+					dma_release_channel(copy_chan[i]);
+					copy_chan[i] = NULL;
+					copy_dev[i] = NULL;
+				}
+			}
+
+			dmaengine_put();
+		}
+
+		if (err)
+			use_all_dma_chans = use_all_dma_chans_prior_val;
+	}
+	return err;
+}
+
+#endif
+
+static int copy_page_dma_once(struct page *to, struct page *from, int nr_pages)
+{
+	static struct dma_chan *copy_chan = NULL;
+	struct dma_device *device = NULL;
+	struct dma_async_tx_descriptor *tx = NULL;
+	dma_cookie_t cookie;
+	enum dma_ctrl_flags flags = 0;
+	struct dmaengine_unmap_data *unmap = NULL;
+	dma_cap_mask_t mask;
+	int ret_val = 0;
+
+
+	dma_cap_zero(mask);
+	dma_cap_set(DMA_MEMCPY, mask);
+
+	dmaengine_get();
+
+	copy_chan = dma_request_channel(mask, NULL, NULL);
+
+	if (!copy_chan) {
+		pr_err("%s: cannot get a channel\n", __func__);
+		ret_val = -1;
+		goto no_chan;
+	}
+
+	device = copy_chan->device;
+
+	if (!device) {
+		pr_err("%s: cannot get a device\n", __func__);
+		ret_val = -2;
+		goto release;
+	}
+
+	unmap = dmaengine_get_unmap_data(device->dev, 2, GFP_NOWAIT);
+
+	if (!unmap) {
+		pr_err("%s: cannot get unmap data\n", __func__);
+		ret_val = -3;
+		goto release;
+	}
+
+	unmap->to_cnt = 1;
+	unmap->addr[0] = dma_map_page(device->dev, from, 0, PAGE_SIZE*nr_pages,
+					  DMA_TO_DEVICE);
+	unmap->from_cnt = 1;
+	unmap->addr[1] = dma_map_page(device->dev, to, 0, PAGE_SIZE*nr_pages,
+					  DMA_FROM_DEVICE);
+	unmap->len = PAGE_SIZE*nr_pages;
+
+	tx = device->device_prep_dma_memcpy(copy_chan,
+						unmap->addr[1],
+						unmap->addr[0], unmap->len,
+						flags);
+
+	if (!tx) {
+		pr_err("%s: null tx descriptor\n", __func__);
+		ret_val = -4;
+		goto unmap_dma;
+	}
+
+	cookie = tx->tx_submit(tx);
+
+	if (dma_submit_error(cookie)) {
+		pr_err("%s: submission error\n", __func__);
+		ret_val = -5;
+		goto unmap_dma;
+	}
+
+	if (dma_sync_wait(copy_chan, cookie) != DMA_COMPLETE) {
+		pr_err("%s: dma does not complete properly\n", __func__);
+		ret_val = -6;
+	}
+
+unmap_dma:
+	dmaengine_unmap_put(unmap);
+release:
+	if (copy_chan) {
+		dma_release_channel(copy_chan);
+	}
+no_chan:
+	dmaengine_put();
+
+	return ret_val;
+}
+
+static int copy_page_dma_always(struct page *to, struct page *from, int nr_pages)
+{
+	struct dma_async_tx_descriptor *tx[NUM_AVAIL_DMA_CHAN] = {0};
+	dma_cookie_t cookie[NUM_AVAIL_DMA_CHAN];
+	enum dma_ctrl_flags flags[NUM_AVAIL_DMA_CHAN] = {0};
+	struct dmaengine_unmap_data *unmap[NUM_AVAIL_DMA_CHAN] = {0};
+	int ret_val = 0;
+	int total_available_chans = NUM_AVAIL_DMA_CHAN;
+	int i;
+	size_t page_offset;
+
+	for (i = 0; i < NUM_AVAIL_DMA_CHAN; ++i) {
+		if (!copy_chan[i]) {
+			total_available_chans = i;
+		}
+	}
+	if (total_available_chans != NUM_AVAIL_DMA_CHAN) {
+		pr_err("%d channels are missing", NUM_AVAIL_DMA_CHAN - total_available_chans);
+	}
+
+	total_available_chans = min_t(int, total_available_chans, limit_dma_chans);
+
+	/* round down to closest 2^x value  */
+	total_available_chans = 1<<ilog2(total_available_chans);
+
+	if ((nr_pages != 1) && (nr_pages % total_available_chans != 0))
+		return -5;
+
+	for (i = 0; i < total_available_chans; ++i) {
+		unmap[i] = dmaengine_get_unmap_data(copy_dev[i]->dev, 2, GFP_NOWAIT);
+		if (!unmap[i]) {
+			pr_err("%s: no unmap data at chan %d\n", __func__, i);
+			ret_val = -3;
+			goto unmap_dma;
+		}
+	}
+
+	for (i = 0; i < total_available_chans; ++i) {
+		if (nr_pages == 1) {
+			page_offset = PAGE_SIZE / total_available_chans;
+
+			unmap[i]->to_cnt = 1;
+			unmap[i]->addr[0] = dma_map_page(copy_dev[i]->dev, from, page_offset*i,
+							  page_offset,
+							  DMA_TO_DEVICE);
+			unmap[i]->from_cnt = 1;
+			unmap[i]->addr[1] = dma_map_page(copy_dev[i]->dev, to, page_offset*i,
+							  page_offset,
+							  DMA_FROM_DEVICE);
+			unmap[i]->len = page_offset;
+		} else {
+			page_offset = nr_pages / total_available_chans;
+
+			unmap[i]->to_cnt = 1;
+			unmap[i]->addr[0] = dma_map_page(copy_dev[i]->dev,
+								from + page_offset*i,
+								0,
+								PAGE_SIZE*page_offset,
+								DMA_TO_DEVICE);
+			unmap[i]->from_cnt = 1;
+			unmap[i]->addr[1] = dma_map_page(copy_dev[i]->dev,
+								to + page_offset*i,
+								0,
+								PAGE_SIZE*page_offset,
+								DMA_FROM_DEVICE);
+			unmap[i]->len = PAGE_SIZE*page_offset;
+		}
+	}
+
+	for (i = 0; i < total_available_chans; ++i) {
+		tx[i] = copy_dev[i]->device_prep_dma_memcpy(copy_chan[i],
+							unmap[i]->addr[1],
+							unmap[i]->addr[0],
+							unmap[i]->len,
+							flags[i]);
+		if (!tx[i]) {
+			pr_err("%s: no tx descriptor at chan %d\n", __func__, i);
+			ret_val = -4;
+			goto unmap_dma;
+		}
+	}
+
+	for (i = 0; i < total_available_chans; ++i) {
+		cookie[i] = tx[i]->tx_submit(tx[i]);
+
+		if (dma_submit_error(cookie[i])) {
+			pr_err("%s: submission error at chan %d\n", __func__, i);
+			ret_val = -5;
+			goto unmap_dma;
+		}
+
+		dma_async_issue_pending(copy_chan[i]);
+	}
+
+	for (i = 0; i < total_available_chans; ++i) {
+		if (dma_sync_wait(copy_chan[i], cookie[i]) != DMA_COMPLETE) {
+			ret_val = -6;
+			pr_err("%s: dma does not complete at chan %d\n", __func__, i);
+		}
+	}
+
+unmap_dma:
+
+	for (i = 0; i < total_available_chans; ++i) {
+		if (unmap[i])
+			dmaengine_unmap_put(unmap[i]);
+	}
+
+	return ret_val;
+}
+
+int copy_page_dma(struct page *to, struct page *from, int nr_pages)
+{
+	BUG_ON(hpage_nr_pages(from) != nr_pages);
+	BUG_ON(hpage_nr_pages(to) != nr_pages);
+
+	if (!use_all_dma_chans) {
+		return copy_page_dma_once(to, from, nr_pages);
+	}
+
+	return copy_page_dma_always(to, from, nr_pages);
+}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 08/25] mm: migrate: Add copy_page_dma into migrate_page_copy.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (6 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 07/25] mm: migrate: Add copy_page_dma to use DMA Engine to copy pages Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 09/25] mm: migrate: Add copy_page_lists_dma_always to support copy a list of pages Zi Yan
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Use copy_page_dma() in migrate_page_copy() when MIGRATE_DMA is
requested, and fall back to copy_highpage() if the DMA copy fails.
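
For reference, a minimal sketch of the resulting copy path in
copy_huge_page() (the hunk below is authoritative; this is only a
restatement):

	int rc = -EFAULT;

	if (mode & MIGRATE_MT)
		rc = copy_page_multithread(dst, src, nr_pages);
	else if (mode & MIGRATE_DMA)
		rc = copy_page_dma(dst, src, nr_pages);

	if (rc)		/* accelerated copy failed or was not requested */
		for (i = 0; i < nr_pages; i++) {
			cond_resched();
			copy_highpage(dst + i, src + i);
		}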

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/migrate_mode.h   |  1 +
 include/uapi/linux/mempolicy.h |  1 +
 mm/migrate.c                   | 31 +++++++++++++++++++++----------
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 5bc8a77..4f7f5557 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -23,6 +23,7 @@ enum migrate_mode {
 	MIGRATE_MODE_MASK = 3,
 	MIGRATE_SINGLETHREAD	= 0,
 	MIGRATE_MT				= 1<<4,
+	MIGRATE_DMA				= 1<<5,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 890269b..49573a6 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -48,6 +48,7 @@ enum {
 #define MPOL_MF_LAZY	 (1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
 #define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
 
+#define MPOL_MF_MOVE_DMA (1<<5)	/* Use DMA page copy routine */
 #define MPOL_MF_MOVE_MT  (1<<6)	/* Use multi-threaded page copy routine */
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
diff --git a/mm/migrate.c b/mm/migrate.c
index 8a344e2..09114d3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -553,15 +553,21 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
  * specialized.
  */
 static void __copy_gigantic_page(struct page *dst, struct page *src,
-				int nr_pages)
+				int nr_pages, enum migrate_mode mode)
 {
 	int i;
 	struct page *dst_base = dst;
 	struct page *src_base = src;
+	int rc = -EFAULT;
 
 	for (i = 0; i < nr_pages; ) {
 		cond_resched();
-		copy_highpage(dst, src);
+
+		if (mode & MIGRATE_DMA)
+			rc = copy_page_dma(dst, src, 1);
+
+		if (rc)
+			copy_highpage(dst, src);
 
 		i++;
 		dst = mem_map_next(dst, dst_base, i);
@@ -582,7 +588,7 @@ static void copy_huge_page(struct page *dst, struct page *src,
 		nr_pages = pages_per_huge_page(h);
 
 		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
-			__copy_gigantic_page(dst, src, nr_pages);
+			__copy_gigantic_page(dst, src, nr_pages, mode);
 			return;
 		}
 	} else {
@@ -597,6 +603,8 @@ static void copy_huge_page(struct page *dst, struct page *src,
 
 	if (mode & MIGRATE_MT)
 		rc = copy_page_multithread(dst, src, nr_pages);
+	else if (mode & MIGRATE_DMA)
+		rc = copy_page_dma(dst, src, nr_pages);
 
 	if (rc)
 		for (i = 0; i < nr_pages; i++) {
@@ -674,8 +682,9 @@ void migrate_page_copy(struct page *newpage, struct page *page,
 {
 	if (PageHuge(page) || PageTransHuge(page))
 		copy_huge_page(newpage, page, mode);
-	else
+	else {
 		copy_highpage(newpage, page);
+	}
 
 	migrate_page_states(newpage, page);
 }
@@ -1511,7 +1520,8 @@ static int store_status(int __user *status, int start, int value, int nr)
 }
 
 static int do_move_pages_to_node(struct mm_struct *mm,
-		struct list_head *pagelist, int node, bool migrate_mt)
+		struct list_head *pagelist, int node,
+		bool migrate_mt, bool migrate_dma)
 {
 	int err;
 
@@ -1519,7 +1529,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
 		return 0;
 
 	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
-			MIGRATE_SYNC | (migrate_mt ? MIGRATE_MT : MIGRATE_SINGLETHREAD),
+			MIGRATE_SYNC | (migrate_mt ? MIGRATE_MT : MIGRATE_SINGLETHREAD) |
+			(migrate_dma ? MIGRATE_DMA : MIGRATE_SINGLETHREAD),
 			MR_SYSCALL);
 	if (err)
 		putback_movable_pages(pagelist);
@@ -1642,7 +1653,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			start = i;
 		} else if (node != current_node) {
 			err = do_move_pages_to_node(mm, &pagelist, current_node,
-				flags & MPOL_MF_MOVE_MT);
+				flags & MPOL_MF_MOVE_MT, flags & MPOL_MF_MOVE_DMA);
 			if (err)
 				goto out;
 			err = store_status(status, start, current_node, i - start);
@@ -1666,7 +1677,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			goto out_flush;
 
 		err = do_move_pages_to_node(mm, &pagelist, current_node,
-				flags & MPOL_MF_MOVE_MT);
+				flags & MPOL_MF_MOVE_MT, flags & MPOL_MF_MOVE_DMA);
 		if (err)
 			goto out;
 		if (i > start) {
@@ -1682,7 +1693,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 
 	/* Make sure we do not overwrite the existing error */
 	err1 = do_move_pages_to_node(mm, &pagelist, current_node,
-				flags & MPOL_MF_MOVE_MT);
+				flags & MPOL_MF_MOVE_MT, flags & MPOL_MF_MOVE_DMA);
 	if (!err1)
 		err1 = store_status(status, start, current_node, i - start);
 	if (!err)
@@ -1778,7 +1789,7 @@ static int kernel_move_pages(pid_t pid, unsigned long nr_pages,
 	nodemask_t task_nodes;
 
 	/* Check flags */
-	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_MOVE_MT))
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_MOVE_MT|MPOL_MF_MOVE_DMA))
 		return -EINVAL;
 
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 09/25] mm: migrate: Add copy_page_lists_dma_always to support copy a list of pages.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (7 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 08/25] mm: migrate: Add copy_page_dma into migrate_page_copy Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 10/25] mm: migrate: copy_page_lists_mt() to copy a page list using multi-threads Zi Yan
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

The src and dst page lists must have the same length, and the pages
at each index of the two lists must have the same size.
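
A minimal, hypothetical caller sketch illustrating that contract
(src[], dst[] and their setup are assumed and not part of this
patch):

	/* Hypothetical caller: nr_items entries in each list, with
	 * matching sizes at every index.
	 */
	static int copy_list_with_dma(struct page **dst, struct page **src,
				      int nr_items)
	{
		int i, err;

		for (i = 0; i < nr_items; i++)
			BUG_ON(hpage_nr_pages(dst[i]) != hpage_nr_pages(src[i]));

		err = copy_page_lists_dma_always(dst, src, nr_items);
		if (err)
			pr_warn("DMA list copy failed (%d), fall back to CPU copy\n",
				err);
		return err;
	}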

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/copy_page.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h  |   4 ++
 2 files changed, 170 insertions(+)

diff --git a/mm/copy_page.c b/mm/copy_page.c
index 5e7a797..84f1c02 100644
--- a/mm/copy_page.c
+++ b/mm/copy_page.c
@@ -417,3 +417,169 @@ int copy_page_dma(struct page *to, struct page *from, int nr_pages)
 
 	return copy_page_dma_always(to, from, nr_pages);
 }
+
+/*
+ * Use DMA to copy a list of pages to a new location.
+ *
+ * Each page is handed to an individual DMA channel.
+ *
+ */
+int copy_page_lists_dma_always(struct page **to, struct page **from, int nr_items)
+{
+	struct dma_async_tx_descriptor **tx = NULL;
+	dma_cookie_t *cookie = NULL;
+	enum dma_ctrl_flags flags[NUM_AVAIL_DMA_CHAN] = {0};
+	struct dmaengine_unmap_data *unmap[NUM_AVAIL_DMA_CHAN] = {0};
+	int ret_val = 0;
+	int total_available_chans = NUM_AVAIL_DMA_CHAN;
+	int i;
+	int page_idx;
+
+	for (i = 0; i < NUM_AVAIL_DMA_CHAN; ++i) {
+		if (!copy_chan[i]) {
+			total_available_chans = i;
+		}
+	}
+	if (total_available_chans != NUM_AVAIL_DMA_CHAN) {
+		pr_err("%d channels are missing\n", NUM_AVAIL_DMA_CHAN - total_available_chans);
+	}
+	if (limit_dma_chans < total_available_chans)
+		total_available_chans = limit_dma_chans;
+
+	/* round down to closest 2^x value  */
+	total_available_chans = 1<<ilog2(total_available_chans);
+
+	total_available_chans = min_t(int, total_available_chans, nr_items);
+
+
+	tx = kzalloc(sizeof(struct dma_async_tx_descriptor*)*nr_items, GFP_KERNEL);
+	if (!tx) {
+		ret_val = -ENOMEM;
+		goto out;
+	}
+	cookie = kzalloc(sizeof(dma_cookie_t)*nr_items, GFP_KERNEL);
+	if (!cookie) {
+		ret_val = -ENOMEM;
+		goto out_free_tx;
+	}
+
+	for (i = 0; i < total_available_chans; ++i) {
+		int num_xfer_per_dev = nr_items / total_available_chans;
+
+		if (i < (nr_items % total_available_chans))
+			num_xfer_per_dev += 1;
+
+		if (num_xfer_per_dev > 128) {
+			ret_val = -ENOMEM;
+			pr_err("%s: too many pages to be transferred\n", __func__);
+			goto out_free_both;
+		}
+
+		unmap[i] = dmaengine_get_unmap_data(copy_dev[i]->dev,
+						2 * num_xfer_per_dev, GFP_NOWAIT);
+		if (!unmap[i]) {
+			pr_err("%s: no unmap data at chan %d\n", __func__, i);
+			ret_val = -ENODEV;
+			goto unmap_dma;
+		}
+	}
+
+	page_idx = 0;
+	for (i = 0; i < total_available_chans; ++i) {
+		int num_xfer_per_dev = nr_items / total_available_chans;
+		int xfer_idx;
+
+		if (i < (nr_items % total_available_chans))
+			num_xfer_per_dev += 1;
+
+		unmap[i]->to_cnt = num_xfer_per_dev;
+		unmap[i]->from_cnt = num_xfer_per_dev;
+		unmap[i]->len = hpage_nr_pages(from[i]) * PAGE_SIZE;
+
+		for (xfer_idx = 0; xfer_idx < num_xfer_per_dev; ++xfer_idx, ++page_idx) {
+			size_t page_len = hpage_nr_pages(from[page_idx]) * PAGE_SIZE;
+
+			BUG_ON(page_len != hpage_nr_pages(to[page_idx]) * PAGE_SIZE);
+			BUG_ON(unmap[i]->len != page_len);
+
+			unmap[i]->addr[xfer_idx] =
+				 dma_map_page(copy_dev[i]->dev, from[page_idx],
+							  0,
+							  page_len,
+							  DMA_TO_DEVICE);
+
+			unmap[i]->addr[xfer_idx+num_xfer_per_dev] =
+				 dma_map_page(copy_dev[i]->dev, to[page_idx],
+							  0,
+							  page_len,
+							  DMA_FROM_DEVICE);
+		}
+	}
+
+	page_idx = 0;
+	for (i = 0; i < total_available_chans; ++i) {
+		int num_xfer_per_dev = nr_items / total_available_chans;
+		int xfer_idx;
+
+		if (i < (nr_items % total_available_chans))
+			num_xfer_per_dev += 1;
+
+		for (xfer_idx = 0; xfer_idx < num_xfer_per_dev; ++xfer_idx, ++page_idx) {
+
+			tx[page_idx] = copy_dev[i]->device_prep_dma_memcpy(copy_chan[i],
+								unmap[i]->addr[xfer_idx + num_xfer_per_dev],
+								unmap[i]->addr[xfer_idx],
+								unmap[i]->len,
+								flags[i]);
+			if (!tx[page_idx]) {
+				pr_err("%s: no tx descriptor at chan %d xfer %d\n",
+					   __func__, i, xfer_idx);
+				ret_val = -ENODEV;
+				goto unmap_dma;
+			}
+
+			cookie[page_idx] = tx[page_idx]->tx_submit(tx[page_idx]);
+
+			if (dma_submit_error(cookie[page_idx])) {
+				pr_err("%s: submission error at chan %d xfer %d\n",
+					   __func__, i, xfer_idx);
+				ret_val = -ENODEV;
+				goto unmap_dma;
+			}
+		}
+
+		dma_async_issue_pending(copy_chan[i]);
+	}
+
+	page_idx = 0;
+	for (i = 0; i < total_available_chans; ++i) {
+		int num_xfer_per_dev = nr_items / total_available_chans;
+		int xfer_idx;
+
+		if (i < (nr_items % total_available_chans))
+			num_xfer_per_dev += 1;
+
+		for (xfer_idx = 0; xfer_idx < num_xfer_per_dev; ++xfer_idx, ++page_idx) {
+
+			if (dma_sync_wait(copy_chan[i], cookie[page_idx]) != DMA_COMPLETE) {
+				ret_val = -6;
+				pr_err("%s: dma does not complete at chan %d, xfer %d\n",
+					   __func__, i, xfer_idx);
+			}
+		}
+	}
+
+unmap_dma:
+	for (i = 0; i < total_available_chans; ++i) {
+		if (unmap[i])
+			dmaengine_unmap_put(unmap[i]);
+	}
+
+out_free_both:
+	kfree(cookie);
+out_free_tx:
+	kfree(tx);
+out:
+
+	return ret_val;
+}
diff --git a/mm/internal.h b/mm/internal.h
index 9eeaf2b..cb1a610 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -555,4 +555,8 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 
 void setup_zone_pageset(struct zone *zone);
 extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+
+extern int copy_page_lists_dma_always(struct page **to,
+			struct page **from, int nr_pages);
+
 #endif	/* __MM_INTERNAL_H */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 10/25] mm: migrate: copy_page_lists_mt() to copy a page list using multi-threads.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (8 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 09/25] mm: migrate: Add copy_page_lists_dma_always to support copy a list of pages Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 11/25] mm: migrate: Add concurrent page migration into move_pages syscall Zi Yan
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

This prepares the support for migrate_pages_concur(), which migrates
multiple pages at the same time.
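
The work split it implements can be summarized by the stand-alone
helper below (an illustrative restatement, not part of the patch):
when nr_items is at least the number of worker threads, whole pages
are dealt out to workers; otherwise every worker copies an equal
slice of every page.

	/* Hypothetical helper restating the per-worker split. */
	static int items_for_worker(int worker, int nr_items, int nr_workers)
	{
		/* whole pages per worker; the first nr_items % nr_workers
		 * workers take one extra page
		 */
		return nr_items / nr_workers +
		       (worker < nr_items % nr_workers ? 1 : 0);
	}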

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/copy_page.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h  |   2 +
 2 files changed, 125 insertions(+)

diff --git a/mm/copy_page.c b/mm/copy_page.c
index 84f1c02..d2fd67e 100644
--- a/mm/copy_page.c
+++ b/mm/copy_page.c
@@ -126,6 +126,129 @@ int copy_page_multithread(struct page *to, struct page *from, int nr_pages)
 
 	return err;
 }
+
+int copy_page_lists_mt(struct page **to, struct page **from, int nr_items)
+{
+	int err = 0;
+	unsigned int total_mt_num = limit_mt_num;
+	int to_node = page_to_nid(*to);
+	int i;
+	struct copy_page_info *work_items[NR_CPUS] = {0};
+	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
+	int cpu_id_list[NR_CPUS] = {0};
+	int cpu;
+	int max_items_per_thread;
+	int item_idx;
+
+	total_mt_num = min_t(unsigned int, total_mt_num,
+						 cpumask_weight(per_node_cpumask));
+
+
+	if (total_mt_num > num_online_cpus())
+		return -ENODEV;
+
+	/* Each thread gets part of each page, if nr_items < total_mt_num */
+	if (nr_items < total_mt_num)
+		max_items_per_thread = nr_items;
+	else
+		max_items_per_thread = (nr_items / total_mt_num) +
+				((nr_items % total_mt_num)?1:0);
+
+
+	for (cpu = 0; cpu < total_mt_num; ++cpu) {
+		work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
+					sizeof(struct copy_item)*max_items_per_thread, GFP_KERNEL);
+		if (!work_items[cpu]) {
+			err = -ENOMEM;
+			goto free_work_items;
+		}
+	}
+
+	i = 0;
+	for_each_cpu(cpu, per_node_cpumask) {
+		if (i >= total_mt_num)
+			break;
+		cpu_id_list[i] = cpu;
+		++i;
+	}
+
+	if (nr_items < total_mt_num) {
+		for (cpu = 0; cpu < total_mt_num; ++cpu) {
+			INIT_WORK((struct work_struct *)work_items[cpu],
+					  copy_page_work_queue_thread);
+			work_items[cpu]->num_items = max_items_per_thread;
+		}
+
+		for (item_idx = 0; item_idx < nr_items; ++item_idx) {
+			unsigned long chunk_size = PAGE_SIZE * hpage_nr_pages(from[item_idx]) / total_mt_num;
+			char *vfrom = kmap(from[item_idx]);
+			char *vto = kmap(to[item_idx]);
+			VM_BUG_ON(PAGE_SIZE * hpage_nr_pages(from[item_idx]) % total_mt_num);
+			BUG_ON(hpage_nr_pages(to[item_idx]) !=
+				   hpage_nr_pages(from[item_idx]));
+
+			for (cpu = 0; cpu < total_mt_num; ++cpu) {
+				work_items[cpu]->item_list[item_idx].to = vto + chunk_size * cpu;
+				work_items[cpu]->item_list[item_idx].from = vfrom + chunk_size * cpu;
+				work_items[cpu]->item_list[item_idx].chunk_size =
+					chunk_size;
+			}
+		}
+
+		for (cpu = 0; cpu < total_mt_num; ++cpu)
+			queue_work_on(cpu_id_list[cpu],
+						  system_highpri_wq,
+						  (struct work_struct *)work_items[cpu]);
+	} else {
+		item_idx = 0;
+		for (cpu = 0; cpu < total_mt_num; ++cpu) {
+			int num_xfer_per_thread = nr_items / total_mt_num;
+			int per_cpu_item_idx;
+
+			if (cpu < (nr_items % total_mt_num))
+				num_xfer_per_thread += 1;
+
+			INIT_WORK((struct work_struct *)work_items[cpu],
+					  copy_page_work_queue_thread);
+
+			work_items[cpu]->num_items = num_xfer_per_thread;
+			for (per_cpu_item_idx = 0; per_cpu_item_idx < work_items[cpu]->num_items;
+				 ++per_cpu_item_idx, ++item_idx) {
+				work_items[cpu]->item_list[per_cpu_item_idx].to = kmap(to[item_idx]);
+				work_items[cpu]->item_list[per_cpu_item_idx].from =
+					kmap(from[item_idx]);
+				work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
+					PAGE_SIZE * hpage_nr_pages(from[item_idx]);
+
+				BUG_ON(hpage_nr_pages(to[item_idx]) !=
+					   hpage_nr_pages(from[item_idx]));
+			}
+
+			queue_work_on(cpu_id_list[cpu],
+						  system_highpri_wq,
+						  (struct work_struct *)work_items[cpu]);
+		}
+		if (item_idx != nr_items)
+			pr_err("%s: only %d out of %d pages are transferred\n", __func__,
+				item_idx - 1, nr_items);
+	}
+
+	/* Wait until it finishes  */
+	for (i = 0; i < total_mt_num; ++i)
+		flush_work((struct work_struct *)work_items[i]);
+
+	for (i = 0; i < nr_items; ++i) {
+		kunmap(to[i]);
+		kunmap(from[i]);
+	}
+
+free_work_items:
+	for (cpu = 0; cpu < total_mt_num; ++cpu)
+		if (work_items[cpu])
+			kfree(work_items[cpu]);
+
+	return err;
+}
 /* ======================== DMA copy page ======================== */
 #include <linux/dmaengine.h>
 #include <linux/dma-mapping.h>
diff --git a/mm/internal.h b/mm/internal.h
index cb1a610..51f5e1b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -558,5 +558,7 @@ extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
 
 extern int copy_page_lists_dma_always(struct page **to,
 			struct page **from, int nr_pages);
+extern int copy_page_lists_mt(struct page **to,
+			struct page **from, int nr_pages);
 
 #endif	/* __MM_INTERNAL_H */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 11/25] mm: migrate: Add concurrent page migration into move_pages syscall.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (9 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 10/25] mm: migrate: copy_page_lists_mt() to copy a page list using multi-threads Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 12/25] exchange pages: new page migration mechanism: exchange_pages() Zi Yan
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Concurrent page migration unmaps all pages in a list, copies all of
them in one function call (copy_page_lists_*()), and finally remaps
all the new pages. This differs from the existing page migration
process, which migrates one page at a time.

Only anonymous pages are supported. File-backed pages are still
migrated sequentially, because locking becomes more complicated when
a list of file-backed pages belongs to different files, which might
cause deadlocks if the per-file locks are not taken in a proper order.
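
As a user-space sketch (hypothetical; the MPOL_MF_MOVE_MT and
MPOL_MF_MOVE_CONCUR values come from this series and are not in
glibc's <numaif.h>, so they are spelled out here), a batched move
could be requested like this:

	#include <numaif.h>
	#include <stdio.h>

	#ifndef MPOL_MF_MOVE_MT
	#define MPOL_MF_MOVE_MT     (1 << 6)
	#endif
	#ifndef MPOL_MF_MOVE_CONCUR
	#define MPOL_MF_MOVE_CONCUR (1 << 7)
	#endif

	static long move_batch_concur(void **pages, int *nodes, int *status,
				      unsigned long count)
	{
		/* pid 0 means the calling process */
		long ret = move_pages(0, count, pages, nodes, status,
				      MPOL_MF_MOVE | MPOL_MF_MOVE_MT |
				      MPOL_MF_MOVE_CONCUR);
		if (ret < 0)
			perror("move_pages");
		return ret;
	}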

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/migrate.h        |   6 +
 include/linux/migrate_mode.h   |   1 +
 include/uapi/linux/mempolicy.h |   1 +
 mm/migrate.c                   | 543 ++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 542 insertions(+), 9 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 5218a07..1001a1c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -67,6 +67,8 @@ extern int migrate_page(struct address_space *mapping,
 			enum migrate_mode mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
+extern int migrate_pages_concur(struct list_head *l, new_page_t new, free_page_t free,
+		unsigned long private, enum migrate_mode mode, int reason);
 extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
 
@@ -87,6 +89,10 @@ static inline int migrate_pages(struct list_head *l, new_page_t new,
 		free_page_t free, unsigned long private, enum migrate_mode mode,
 		int reason)
 	{ return -ENOSYS; }
+static inline int migrate_pages_concur(struct list_head *l, new_page_t new,
+		free_page_t free, unsigned long private, enum migrate_mode mode,
+		int reason)
+	{ return -ENOSYS; }
 static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
 	{ return -EBUSY; }
 
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 4f7f5557..68263da 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -24,6 +24,7 @@ enum migrate_mode {
 	MIGRATE_SINGLETHREAD	= 0,
 	MIGRATE_MT				= 1<<4,
 	MIGRATE_DMA				= 1<<5,
+	MIGRATE_CONCUR			= 1<<6,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 49573a6..eb6560e 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -50,6 +50,7 @@ enum {
 
 #define MPOL_MF_MOVE_DMA (1<<5)	/* Use DMA page copy routine */
 #define MPOL_MF_MOVE_MT  (1<<6)	/* Use multi-threaded page copy routine */
+#define MPOL_MF_MOVE_CONCUR  (1<<7)	/* Move pages in a batch */
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
diff --git a/mm/migrate.c b/mm/migrate.c
index 09114d3..ad02797 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,15 @@
 
 int accel_page_copy = 1;
 
+
+struct page_migration_work_item {
+	struct list_head list;
+	struct page *old_page;
+	struct page *new_page;
+	struct anon_vma *anon_vma;
+	int page_was_mapped;
+};
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -1396,6 +1405,509 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	return rc;
 }
 
+static int __unmap_page_concur(struct page *page, struct page *newpage,
+				struct anon_vma **anon_vma,
+				int *page_was_mapped,
+				int force, enum migrate_mode mode)
+{
+	int rc = -EAGAIN;
+	bool is_lru = !__PageMovable(page);
+
+	*anon_vma = NULL;
+	*page_was_mapped = 0;
+
+	if (!trylock_page(page)) {
+		if (!force || ((mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC))
+			goto out;
+
+		/*
+		 * It's not safe for direct compaction to call lock_page.
+		 * For example, during page readahead pages are added locked
+		 * to the LRU. Later, when the IO completes the pages are
+		 * marked uptodate and unlocked. However, the queueing
+		 * could be merging multiple pages for one bio (e.g.
+		 * mpage_readpages). If an allocation happens for the
+		 * second or third page, the process can end up locking
+		 * the same page twice and deadlocking. Rather than
+		 * trying to be clever about what pages can be locked,
+		 * avoid the use of lock_page for direct compaction
+		 * altogether.
+		 */
+		if (current->flags & PF_MEMALLOC)
+			goto out;
+
+		lock_page(page);
+	}
+
+	/* We are working on page_mapping(page) == NULL */
+	VM_BUG_ON_PAGE(PageWriteback(page), page);
+#if 0
+	if (PageWriteback(page)) {
+		/*
+		 * Only in the case of a full synchronous migration is it
+		 * necessary to wait for PageWriteback. In the async case,
+		 * the retry loop is too short and in the sync-light case,
+		 * the overhead of stalling is too much
+		 */
+		if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC) {
+			rc = -EBUSY;
+			goto out_unlock;
+		}
+		if (!force)
+			goto out_unlock;
+		wait_on_page_writeback(page);
+	}
+#endif
+
+	/*
+	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
+	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * This get_anon_vma() delays freeing anon_vma pointer until the end
+	 * of migration. File cache pages are no problem because of page_lock()
+	 * File Caches may use write_page() or lock_page() in migration, then,
+	 * just care Anon page here.
+	 *
+	 * Only page_get_anon_vma() understands the subtleties of
+	 * getting a hold on an anon_vma from outside one of its mms.
+	 * But if we cannot get anon_vma, then we won't need it anyway,
+	 * because that implies that the anon page is no longer mapped
+	 * (and cannot be remapped so long as we hold the page lock).
+	 */
+	if (PageAnon(page) && !PageKsm(page))
+		*anon_vma = page_get_anon_vma(page);
+
+	/*
+	 * Block others from accessing the new page when we get around to
+	 * establishing additional references. We are usually the only one
+	 * holding a reference to newpage at this point. We used to have a BUG
+	 * here if trylock_page(newpage) fails, but would like to allow for
+	 * cases where there might be a race with the previous use of newpage.
+	 * This is much like races on refcount of oldpage: just don't BUG().
+	 */
+	if (unlikely(!trylock_page(newpage)))
+		goto out_unlock;
+
+	if (unlikely(!is_lru)) {
+		/* Just migrate the page and remove it from item list */
+		VM_BUG_ON(1);
+		rc = move_to_new_page(newpage, page, mode);
+		goto out_unlock_both;
+	}
+
+	/*
+	 * Corner case handling:
+	 * 1. When a new swap-cache page is read into, it is added to the LRU
+	 * and treated as swapcache but it has no rmap yet.
+	 * Calling try_to_unmap() against a page->mapping==NULL page will
+	 * trigger a BUG.  So handle it here.
+	 * 2. An orphaned page (see truncate_complete_page) might have
+	 * fs-private metadata. The page can be picked up due to memory
+	 * offlining.  Everywhere else except page reclaim, the page is
+	 * invisible to the vm, so the page can not be migrated.  So try to
+	 * free the metadata, so the page can be freed.
+	 */
+	if (!page->mapping) {
+		VM_BUG_ON_PAGE(PageAnon(page), page);
+		if (page_has_private(page)) {
+			try_to_free_buffers(page);
+			goto out_unlock_both;
+		}
+	} else if (page_mapped(page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !*anon_vma,
+				page);
+		try_to_unmap(page,
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		*page_was_mapped = 1;
+	}
+
+	return MIGRATEPAGE_SUCCESS;
+
+out_unlock_both:
+	unlock_page(newpage);
+out_unlock:
+	/* Drop an anon_vma reference if we took one */
+	if (*anon_vma)
+		put_anon_vma(*anon_vma);
+	unlock_page(page);
+out:
+	return rc;
+}
+
+static int unmap_pages_and_get_new_concur(new_page_t get_new_page,
+				free_page_t put_new_page, unsigned long private,
+				struct page_migration_work_item *item,
+				int force,
+				enum migrate_mode mode, enum migrate_reason reason)
+{
+	int rc = MIGRATEPAGE_SUCCESS;
+
+	if (!thp_migration_supported() && PageTransHuge(item->old_page))
+		return -ENOMEM;
+
+	item->new_page = get_new_page(item->old_page, private);
+	if (!item->new_page)
+		return -ENOMEM;
+
+	if (page_count(item->old_page) == 1) {
+		/* page was freed from under us. So we are done. */
+		ClearPageActive(item->old_page);
+		ClearPageUnevictable(item->old_page);
+		if (unlikely(__PageMovable(item->old_page))) {
+			lock_page(item->old_page);
+			if (!PageMovable(item->old_page))
+				__ClearPageIsolated(item->old_page);
+			unlock_page(item->old_page);
+		}
+		if (put_new_page)
+			put_new_page(item->new_page, private);
+		else
+			put_page(item->new_page);
+		item->new_page = NULL;
+		goto out;
+	}
+
+	rc = __unmap_page_concur(item->old_page, item->new_page, &item->anon_vma,
+							&item->page_was_mapped,
+							force, mode);
+	if (rc == MIGRATEPAGE_SUCCESS)
+		return rc;
+
+out:
+	if (rc != -EAGAIN) {
+		list_del(&item->old_page->lru);
+
+		if (likely(!__PageMovable(item->old_page)))
+			mod_node_page_state(page_pgdat(item->old_page), NR_ISOLATED_ANON +
+					page_is_file_cache(item->old_page),
+					-hpage_nr_pages(item->old_page));
+	}
+
+	if (rc == MIGRATEPAGE_SUCCESS) {
+		/* only for pages freed under us  */
+		VM_BUG_ON(page_count(item->old_page) != 1);
+		put_page(item->old_page);
+		item->old_page = NULL;
+
+	} else {
+		if (rc != -EAGAIN) {
+			if (likely(!__PageMovable(item->old_page))) {
+				putback_lru_page(item->old_page);
+				goto put_new;
+			}
+
+			lock_page(item->old_page);
+			if (PageMovable(item->old_page))
+				putback_movable_page(item->old_page);
+			else
+				__ClearPageIsolated(item->old_page);
+			unlock_page(item->old_page);
+			put_page(item->old_page);
+		}
+
+		/*
+		 * If migration was not successful and there's a freeing callback, use
+		 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
+		 * during isolation.
+		 */
+put_new:
+		if (put_new_page)
+			put_new_page(item->new_page, private);
+		else
+			put_page(item->new_page);
+		item->new_page = NULL;
+
+	}
+
+	return rc;
+}
+
+static int move_mapping_concurr(struct list_head *unmapped_list_ptr,
+					   struct list_head *wip_list_ptr,
+					   free_page_t put_new_page, unsigned long private,
+					   enum migrate_mode mode)
+{
+	struct page_migration_work_item *iterator, *iterator2;
+	struct address_space *mapping;
+
+	list_for_each_entry_safe(iterator, iterator2, unmapped_list_ptr, list) {
+		VM_BUG_ON_PAGE(!PageLocked(iterator->old_page), iterator->old_page);
+		VM_BUG_ON_PAGE(!PageLocked(iterator->new_page), iterator->new_page);
+
+		mapping = page_mapping(iterator->old_page);
+
+		VM_BUG_ON(mapping);
+
+		VM_BUG_ON(PageWriteback(iterator->old_page));
+
+		if (page_count(iterator->old_page) != 1) {
+			list_move(&iterator->list, wip_list_ptr);
+			if (iterator->page_was_mapped)
+				remove_migration_ptes(iterator->old_page,
+					iterator->old_page, false);
+			unlock_page(iterator->new_page);
+			if (iterator->anon_vma)
+				put_anon_vma(iterator->anon_vma);
+			unlock_page(iterator->old_page);
+
+			if (put_new_page)
+				put_new_page(iterator->new_page, private);
+			else
+				put_page(iterator->new_page);
+			iterator->new_page = NULL;
+			continue;
+		}
+
+		iterator->new_page->index = iterator->old_page->index;
+		iterator->new_page->mapping = iterator->old_page->mapping;
+		if (PageSwapBacked(iterator->old_page))
+			SetPageSwapBacked(iterator->new_page);
+	}
+
+	return 0;
+}
+
+static int copy_to_new_pages_concur(struct list_head *unmapped_list_ptr,
+				enum migrate_mode mode)
+{
+	struct page_migration_work_item *iterator;
+	int num_pages = 0, idx = 0;
+	struct page **src_page_list = NULL, **dst_page_list = NULL;
+	unsigned long size = 0;
+	int rc = -EFAULT;
+
+	if (list_empty(unmapped_list_ptr))
+		return 0;
+
+	list_for_each_entry(iterator, unmapped_list_ptr, list) {
+		++num_pages;
+		size += PAGE_SIZE * hpage_nr_pages(iterator->old_page);
+	}
+
+	src_page_list = kzalloc(sizeof(struct page *)*num_pages, GFP_KERNEL);
+	if (!src_page_list) {
+		BUG();
+		return -ENOMEM;
+	}
+	dst_page_list = kzalloc(sizeof(struct page *)*num_pages, GFP_KERNEL);
+	if (!dst_page_list) {
+		BUG();
+		return -ENOMEM;
+	}
+
+	list_for_each_entry(iterator, unmapped_list_ptr, list) {
+		src_page_list[idx] = iterator->old_page;
+		dst_page_list[idx] = iterator->new_page;
+		++idx;
+	}
+
+	BUG_ON(idx != num_pages);
+
+	if (mode & MIGRATE_DMA)
+		rc = copy_page_lists_dma_always(dst_page_list, src_page_list,
+							num_pages);
+	else if (mode & MIGRATE_MT)
+		rc = copy_page_lists_mt(dst_page_list, src_page_list,
+							num_pages);
+
+	if (rc) {
+		list_for_each_entry(iterator, unmapped_list_ptr, list) {
+			if (PageHuge(iterator->old_page) ||
+				PageTransHuge(iterator->old_page))
+				copy_huge_page(iterator->new_page, iterator->old_page, 0);
+			else
+				copy_highpage(iterator->new_page, iterator->old_page);
+		}
+	}
+
+	kfree(src_page_list);
+	kfree(dst_page_list);
+
+	list_for_each_entry(iterator, unmapped_list_ptr, list) {
+		migrate_page_states(iterator->new_page, iterator->old_page);
+	}
+
+	return 0;
+}
+
+static int remove_migration_ptes_concurr(struct list_head *unmapped_list_ptr)
+{
+	struct page_migration_work_item *iterator, *iterator2;
+
+	list_for_each_entry_safe(iterator, iterator2, unmapped_list_ptr, list) {
+		if (iterator->page_was_mapped)
+			remove_migration_ptes(iterator->old_page, iterator->new_page, false);
+
+		unlock_page(iterator->new_page);
+
+		if (iterator->anon_vma)
+			put_anon_vma(iterator->anon_vma);
+
+		unlock_page(iterator->old_page);
+
+		list_del(&iterator->old_page->lru);
+		mod_node_page_state(page_pgdat(iterator->old_page), NR_ISOLATED_ANON +
+				page_is_file_cache(iterator->old_page),
+				-hpage_nr_pages(iterator->old_page));
+
+		put_page(iterator->old_page);
+		iterator->old_page = NULL;
+
+		if (unlikely(__PageMovable(iterator->new_page)))
+			put_page(iterator->new_page);
+		else
+			putback_lru_page(iterator->new_page);
+		iterator->new_page = NULL;
+	}
+
+	return 0;
+}
+
+int migrate_pages_concur(struct list_head *from, new_page_t get_new_page,
+		free_page_t put_new_page, unsigned long private,
+		enum migrate_mode mode, int reason)
+{
+	int retry = 1;
+	int nr_failed = 0;
+	int nr_succeeded = 0;
+	int pass = 0;
+	struct page *page;
+	int swapwrite = current->flags & PF_SWAPWRITE;
+	int rc;
+	int total_num_pages = 0, idx;
+	struct page_migration_work_item *item_list;
+	struct page_migration_work_item *iterator, *iterator2;
+	int item_list_order = 0;
+
+	LIST_HEAD(wip_list);
+	LIST_HEAD(unmapped_list);
+	LIST_HEAD(serialized_list);
+	LIST_HEAD(failed_list);
+
+	if (!swapwrite)
+		current->flags |= PF_SWAPWRITE;
+
+	list_for_each_entry(page, from, lru)
+		++total_num_pages;
+
+	item_list_order = get_order(total_num_pages *
+		sizeof(struct page_migration_work_item));
+
+	if (item_list_order >= MAX_ORDER) {
+		item_list = alloc_pages_exact(total_num_pages *
+			sizeof(struct page_migration_work_item), GFP_ATOMIC);
+		memset(item_list, 0, total_num_pages *
+			sizeof(struct page_migration_work_item));
+	} else {
+		item_list = (struct page_migration_work_item *)__get_free_pages(GFP_ATOMIC,
+						item_list_order);
+		memset(item_list, 0, PAGE_SIZE<<item_list_order);
+	}
+
+	idx = 0;
+	list_for_each_entry(page, from, lru) {
+		item_list[idx].old_page = page;
+		item_list[idx].new_page = NULL;
+		INIT_LIST_HEAD(&item_list[idx].list);
+		list_add_tail(&item_list[idx].list, &wip_list);
+		idx += 1;
+	}
+
+	for(pass = 0; pass < 1 && retry; pass++) {
+		retry = 0;
+
+		/* unmap and get new page for page_mapping(page) == NULL */
+		list_for_each_entry_safe(iterator, iterator2, &wip_list, list) {
+			cond_resched();
+
+			if (iterator->new_page) {
+				pr_info("%s: iterator already has a new page?\n", __func__);
+				VM_BUG_ON_PAGE(1, iterator->old_page);
+			}
+
+			/* We do not migrate huge pages, file-backed, or swapcached pages */
+			if (PageHuge(iterator->old_page)) {
+				rc = -ENODEV;
+			}
+			else if ((page_mapping(iterator->old_page) != NULL)) {
+				rc = -ENODEV;
+			}
+			else
+				rc = unmap_pages_and_get_new_concur(get_new_page, put_new_page,
+						private, iterator, pass > 2, mode,
+						reason);
+
+			switch(rc) {
+			case -ENODEV:
+				list_move(&iterator->list, &serialized_list);
+				break;
+			case -ENOMEM:
+				if (PageTransHuge(iterator->old_page))
+					list_move(&iterator->list, &serialized_list);
+				else
+					goto out;
+				break;
+			case -EAGAIN:
+				retry++;
+				break;
+			case MIGRATEPAGE_SUCCESS:
+				if (iterator->old_page) {
+					list_move(&iterator->list, &unmapped_list);
+					nr_succeeded++;
+				} else { /* pages are freed under us */
+					list_del(&iterator->list);
+				}
+				break;
+			default:
+				/*
+				 * Permanent failure (-EBUSY, -ENOSYS, etc.):
+				 * unlike -EAGAIN case, the failed page is
+				 * removed from migration page list and not
+				 * retried in the next outer loop.
+				 */
+				list_move(&iterator->list, &failed_list);
+				nr_failed++;
+				break;
+			}
+		}
+out:
+		if (list_empty(&unmapped_list))
+			continue;
+
+		/* move page->mapping to new page, only -EAGAIN could happen  */
+		move_mapping_concurr(&unmapped_list, &wip_list, put_new_page, private, mode);
+		/* copy pages in unmapped_list */
+		copy_to_new_pages_concur(&unmapped_list, mode);
+		/* remove migration ptes (what if old_page is NULL?), unlock
+		 * old and new pages, put anon_vma, put old and new pages */
+		remove_migration_ptes_concurr(&unmapped_list);
+	}
+	nr_failed += retry;
+	rc = nr_failed;
+
+	if (!list_empty(from))
+		rc = migrate_pages(from, get_new_page, put_new_page, 
+				private, mode, reason);
+
+	if (nr_succeeded)
+		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+	if (nr_failed)
+		count_vm_events(PGMIGRATE_FAIL, nr_failed);
+	trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
+
+	if (item_list_order >= MAX_ORDER) {
+		free_pages_exact(item_list, total_num_pages *
+			sizeof(struct page_migration_work_item));
+	} else {
+		free_pages((unsigned long)item_list, item_list_order);
+	}
+
+	if (!swapwrite)
+		current->flags &= ~PF_SWAPWRITE;
+
+	return rc;
+}
+
 /*
  * migrate_pages - migrate the pages specified in a list, to the free pages
  *		   supplied as the target for the page migration
@@ -1521,17 +2033,25 @@ static int store_status(int __user *status, int start, int value, int nr)
 
 static int do_move_pages_to_node(struct mm_struct *mm,
 		struct list_head *pagelist, int node,
-		bool migrate_mt, bool migrate_dma)
+		bool migrate_mt, bool migrate_dma, bool migrate_concur)
 {
 	int err;
 
 	if (list_empty(pagelist))
 		return 0;
 
-	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
-			MIGRATE_SYNC | (migrate_mt ? MIGRATE_MT : MIGRATE_SINGLETHREAD) |
-			(migrate_dma ? MIGRATE_DMA : MIGRATE_SINGLETHREAD),
-			MR_SYSCALL);
+	if (migrate_concur) {
+		err = migrate_pages_concur(pagelist, alloc_new_node_page, NULL, node,
+				MIGRATE_SYNC | (migrate_mt ? MIGRATE_MT : MIGRATE_SINGLETHREAD) |
+				(migrate_dma ? MIGRATE_DMA : MIGRATE_SINGLETHREAD),
+				MR_SYSCALL);
+
+	} else {
+		err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
+				MIGRATE_SYNC | (migrate_mt ? MIGRATE_MT : MIGRATE_SINGLETHREAD) |
+				(migrate_dma ? MIGRATE_DMA : MIGRATE_SINGLETHREAD),
+				MR_SYSCALL);
+	}
 	if (err)
 		putback_movable_pages(pagelist);
 	return err;
@@ -1653,7 +2173,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			start = i;
 		} else if (node != current_node) {
 			err = do_move_pages_to_node(mm, &pagelist, current_node,
-				flags & MPOL_MF_MOVE_MT, flags & MPOL_MF_MOVE_DMA);
+				flags & MPOL_MF_MOVE_MT, flags & MPOL_MF_MOVE_DMA,
+				flags & MPOL_MF_MOVE_CONCUR);
 			if (err)
 				goto out;
 			err = store_status(status, start, current_node, i - start);
@@ -1677,7 +2198,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			goto out_flush;
 
 		err = do_move_pages_to_node(mm, &pagelist, current_node,
-				flags & MPOL_MF_MOVE_MT, flags & MPOL_MF_MOVE_DMA);
+				flags & MPOL_MF_MOVE_MT, flags & MPOL_MF_MOVE_DMA,
+				flags & MPOL_MF_MOVE_CONCUR);
 		if (err)
 			goto out;
 		if (i > start) {
@@ -1693,7 +2215,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 
 	/* Make sure we do not overwrite the existing error */
 	err1 = do_move_pages_to_node(mm, &pagelist, current_node,
-				flags & MPOL_MF_MOVE_MT, flags & MPOL_MF_MOVE_DMA);
+				flags & MPOL_MF_MOVE_MT, flags & MPOL_MF_MOVE_DMA,
+				flags & MPOL_MF_MOVE_CONCUR);
 	if (!err1)
 		err1 = store_status(status, start, current_node, i - start);
 	if (!err)
@@ -1789,7 +2312,9 @@ static int kernel_move_pages(pid_t pid, unsigned long nr_pages,
 	nodemask_t task_nodes;
 
 	/* Check flags */
-	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_MOVE_MT|MPOL_MF_MOVE_DMA))
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|
+				  MPOL_MF_MOVE_DMA|MPOL_MF_MOVE_MT|
+				  MPOL_MF_MOVE_CONCUR))
 		return -EINVAL;
 
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 12/25] exchange pages: new page migration mechanism: exchange_pages()
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (10 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 11/25] mm: migrate: Add concurrent page migration into move_pages syscall Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 13/25] exchange pages: add multi-threaded exchange pages Zi Yan
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

exchange_pages() exchanges two in-use pages by unmapping both first,
then swapping the page data 64 bits at a time through a u64 register,
and finally remapping both pages.

This saves the overhead of allocating two new pages that two
back-to-back migrate_pages() calls would otherwise incur.
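
A minimal sketch of how a caller elsewhere in the series might drive
the new interface (illustrative only; the two pages and the
MR_NUMA_MISPLACED reason are placeholders):

	struct exchange_page_info *info;
	LIST_HEAD(exchange_list);
	int err;

	info = kzalloc(sizeof(*info), GFP_KERNEL);
	if (!info)
		return -ENOMEM;
	info->from_page = from_page;	/* anonymous page, e.g. in DRAM */
	info->to_page = to_page;	/* anonymous page, e.g. in PMEM */
	list_add_tail(&info->list, &exchange_list);

	err = exchange_pages(&exchange_list, MIGRATE_SYNC, MR_NUMA_MISPLACED);
	kfree(info);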

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/exchange.h |  23 ++
 include/linux/ksm.h      |   6 +
 mm/Makefile              |   1 +
 mm/exchange.c            | 597 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/ksm.c                 |  35 +++
 5 files changed, 662 insertions(+)
 create mode 100644 include/linux/exchange.h
 create mode 100644 mm/exchange.c

diff --git a/include/linux/exchange.h b/include/linux/exchange.h
new file mode 100644
index 0000000..778068e
--- /dev/null
+++ b/include/linux/exchange.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXCHANGE_H
+#define _LINUX_EXCHANGE_H
+
+#include <linux/migrate.h>
+
+struct exchange_page_info {
+	struct page *from_page;
+	struct page *to_page;
+
+	struct anon_vma *from_anon_vma;
+	struct anon_vma *to_anon_vma;
+
+	int from_page_was_mapped;
+	int to_page_was_mapped;
+
+	struct list_head list;
+};
+
+int exchange_pages(struct list_head *exchange_list,
+			enum migrate_mode mode,
+			int reason);
+#endif /* _LINUX_EXCHANGE_H */
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index e48b1e4..170312d 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -55,6 +55,7 @@ void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);
 bool reuse_ksm_page(struct page *page,
 			struct vm_area_struct *vma, unsigned long address);
+void ksm_exchange_page(struct page *to_page, struct page *from_page);
 
 #else  /* !CONFIG_KSM */
 
@@ -92,6 +93,11 @@ static inline bool reuse_ksm_page(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
 	return false;
 }
+
+static inline void ksm_exchange_page(struct page *to_page,
+				struct page *from_page)
+{
+}
 #endif /* CONFIG_MMU */
 #endif /* !CONFIG_KSM */
diff --git a/mm/Makefile b/mm/Makefile
index fa02a9f..5e6c591 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -45,6 +45,7 @@ obj-y += init-mm.o
 obj-y += memblock.o
 
 obj-y += copy_page.o
+obj-y += exchange.o
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
diff --git a/mm/exchange.c b/mm/exchange.c
new file mode 100644
index 0000000..626bbea
--- /dev/null
+++ b/mm/exchange.c
@@ -0,0 +1,597 @@
+/*
+ * Exchange two in-use pages. Page flags and page->mapping are exchanged
+ * as well. Only anonymous pages are supported.
+ *
+ * Copyright (C) 2016 NVIDIA, Zi Yan <ziy@nvidia.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/migrate.h>
+#include <linux/exchange.h>
+#include <linux/security.h>
+#include <linux/cpuset.h>
+#include <linux/hugetlb.h>
+#include <linux/mm_inline.h>
+#include <linux/page_idle.h>
+#include <linux/page-flags.h>
+#include <linux/ksm.h>
+#include <linux/memcontrol.h>
+#include <linux/balloon_compaction.h>
+#include <linux/buffer_head.h>
+
+
+#include "internal.h"
+
+/*
+ * Move a list of individual pages
+ */
+struct pages_to_node {
+	unsigned long from_addr;
+	int from_status;
+
+	unsigned long to_addr;
+	int to_status;
+};
+
+struct page_flags {
+	unsigned int page_error :1;
+	unsigned int page_referenced:1;
+	unsigned int page_uptodate:1;
+	unsigned int page_active:1;
+	unsigned int page_unevictable:1;
+	unsigned int page_checked:1;
+	unsigned int page_mappedtodisk:1;
+	unsigned int page_dirty:1;
+	unsigned int page_is_young:1;
+	unsigned int page_is_idle:1;
+	unsigned int page_swapcache:1;
+	unsigned int page_writeback:1;
+	unsigned int page_private:1;
+	unsigned int __pad:3;
+};
+
+
+static void exchange_page(char *to, char *from)
+{
+	u64 tmp;
+	int i;
+
+	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
+		tmp = *((u64*)(from + i));
+		*((u64*)(from + i)) = *((u64*)(to + i));
+		*((u64*)(to + i)) = tmp;
+	}
+}
+
+static inline void exchange_highpage(struct page *to, struct page *from)
+{
+	char *vfrom, *vto;
+
+	vfrom = kmap_atomic(from);
+	vto = kmap_atomic(to);
+	exchange_page(vto, vfrom);
+	kunmap_atomic(vto);
+	kunmap_atomic(vfrom);
+}
+
+static void __exchange_gigantic_page(struct page *dst, struct page *src,
+				int nr_pages)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < nr_pages; ) {
+		cond_resched();
+		exchange_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+static void exchange_huge_page(struct page *dst, struct page *src)
+{
+	int i;
+	int nr_pages;
+
+	if (PageHuge(src)) {
+		/* hugetlbfs page */
+		struct hstate *h = page_hstate(src);
+		nr_pages = pages_per_huge_page(h);
+
+		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
+			__exchange_gigantic_page(dst, src, nr_pages);
+			return;
+		}
+	} else {
+		/* thp page */
+		BUG_ON(!PageTransHuge(src));
+		nr_pages = hpage_nr_pages(src);
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		cond_resched();
+		exchange_highpage(dst + i, src + i);
+	}
+}
+
+/*
+ * Exchange the page flags (and NUMA/memcg metadata) of the two pages.
+ */
+static void exchange_page_flags(struct page *to_page, struct page *from_page)
+{
+	int from_cpupid, to_cpupid;
+	struct page_flags from_page_flags, to_page_flags;
+	struct mem_cgroup *to_memcg = page_memcg(to_page),
+					  *from_memcg = page_memcg(from_page);
+
+	from_cpupid = page_cpupid_xchg_last(from_page, -1);
+
+	from_page_flags.page_error = TestClearPageError(from_page);
+	from_page_flags.page_referenced = TestClearPageReferenced(from_page);
+	from_page_flags.page_uptodate = PageUptodate(from_page);
+	ClearPageUptodate(from_page);
+	from_page_flags.page_active = TestClearPageActive(from_page);
+	from_page_flags.page_unevictable = TestClearPageUnevictable(from_page);
+	from_page_flags.page_checked = PageChecked(from_page);
+	ClearPageChecked(from_page);
+	from_page_flags.page_mappedtodisk = PageMappedToDisk(from_page);
+	ClearPageMappedToDisk(from_page);
+	from_page_flags.page_dirty = PageDirty(from_page);
+	ClearPageDirty(from_page);
+	from_page_flags.page_is_young = test_and_clear_page_young(from_page);
+	from_page_flags.page_is_idle = page_is_idle(from_page);
+	clear_page_idle(from_page);
+	from_page_flags.page_swapcache = PageSwapCache(from_page);
+	from_page_flags.page_private = PagePrivate(from_page);
+	ClearPagePrivate(from_page);
+	from_page_flags.page_writeback = test_clear_page_writeback(from_page);
+
+
+	to_cpupid = page_cpupid_xchg_last(to_page, -1);
+
+	to_page_flags.page_error = TestClearPageError(to_page);
+	to_page_flags.page_referenced = TestClearPageReferenced(to_page);
+	to_page_flags.page_uptodate = PageUptodate(to_page);
+	ClearPageUptodate(to_page);
+	to_page_flags.page_active = TestClearPageActive(to_page);
+	to_page_flags.page_unevictable = TestClearPageUnevictable(to_page);
+	to_page_flags.page_checked = PageChecked(to_page);
+	ClearPageChecked(to_page);
+	to_page_flags.page_mappedtodisk = PageMappedToDisk(to_page);
+	ClearPageMappedToDisk(to_page);
+	to_page_flags.page_dirty = PageDirty(to_page);
+	ClearPageDirty(to_page);
+	to_page_flags.page_is_young = test_and_clear_page_young(to_page);
+	to_page_flags.page_is_idle = page_is_idle(to_page);
+	clear_page_idle(to_page);
+	to_page_flags.page_swapcache = PageSwapCache(to_page);
+	to_page_flags.page_private = PagePrivate(to_page);
+	ClearPagePrivate(to_page);
+	to_page_flags.page_writeback = test_clear_page_writeback(to_page);
+
+	/* set to_page */
+	if (from_page_flags.page_error)
+		SetPageError(to_page);
+	if (from_page_flags.page_referenced)
+		SetPageReferenced(to_page);
+	if (from_page_flags.page_uptodate)
+		SetPageUptodate(to_page);
+	if (from_page_flags.page_active) {
+		VM_BUG_ON_PAGE(from_page_flags.page_unevictable, from_page);
+		SetPageActive(to_page);
+	} else if (from_page_flags.page_unevictable)
+		SetPageUnevictable(to_page);
+	if (from_page_flags.page_checked)
+		SetPageChecked(to_page);
+	if (from_page_flags.page_mappedtodisk)
+		SetPageMappedToDisk(to_page);
+
+	/* Move dirty on pages not done by migrate_page_move_mapping() */
+	if (from_page_flags.page_dirty)
+		SetPageDirty(to_page);
+
+	if (from_page_flags.page_is_young)
+		set_page_young(to_page);
+	if (from_page_flags.page_is_idle)
+		set_page_idle(to_page);
+
+	/* set from_page */
+	if (to_page_flags.page_error)
+		SetPageError(from_page);
+	if (to_page_flags.page_referenced)
+		SetPageReferenced(from_page);
+	if (to_page_flags.page_uptodate)
+		SetPageUptodate(from_page);
+	if (to_page_flags.page_active) {
+		VM_BUG_ON_PAGE(to_page_flags.page_unevictable, from_page);
+		SetPageActive(from_page);
+	} else if (to_page_flags.page_unevictable)
+		SetPageUnevictable(from_page);
+	if (to_page_flags.page_checked)
+		SetPageChecked(from_page);
+	if (to_page_flags.page_mappedtodisk)
+		SetPageMappedToDisk(from_page);
+
+	/* Move dirty on pages not done by migrate_page_move_mapping() */
+	if (to_page_flags.page_dirty)
+		SetPageDirty(from_page);
+
+	if (to_page_flags.page_is_young)
+		set_page_young(from_page);
+	if (to_page_flags.page_is_idle)
+		set_page_idle(from_page);
+
+	/*
+	 * Copy NUMA information to the new page, to prevent over-eager
+	 * future migrations of this same page.
+	 */
+	page_cpupid_xchg_last(to_page, from_cpupid);
+	page_cpupid_xchg_last(from_page, to_cpupid);
+
+	ksm_exchange_page(to_page, from_page);
+	/*
+	 * Please do not reorder this without considering how mm/ksm.c's
+	 * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
+	 */
+	ClearPageSwapCache(to_page);
+	ClearPageSwapCache(from_page);
+	if (from_page_flags.page_swapcache)
+		SetPageSwapCache(to_page);
+	if (to_page_flags.page_swapcache)
+		SetPageSwapCache(from_page);
+
+
+#ifdef CONFIG_PAGE_OWNER
+	/* exchange page owner  */
+	BUG();
+#endif
+	/* exchange mem cgroup  */
+	to_page->mem_cgroup = from_memcg;
+	from_page->mem_cgroup = to_memcg;
+
+}
+
+/*
+ * Replace the page in the mapping.
+ *
+ * The number of remaining references must be:
+ * 1 for anonymous pages without a mapping
+ * 2 for pages with a mapping
+ * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
+ */
+
+static int exchange_page_move_mapping(struct address_space *to_mapping,
+			struct address_space *from_mapping,
+			struct page *to_page, struct page *from_page,
+			enum migrate_mode mode,
+			int to_extra_count, int from_extra_count)
+{
+	int to_expected_count = 1 + to_extra_count,
+		from_expected_count = 1 + from_extra_count;
+	unsigned long from_page_index = page_index(from_page),
+				  to_page_index = page_index(to_page);
+	int to_swapbacked = PageSwapBacked(to_page),
+		from_swapbacked = PageSwapBacked(from_page);
+	struct address_space *to_mapping_value = to_page->mapping,
+						 *from_mapping_value = from_page->mapping;
+
+
+	if (!to_mapping) {
+		/* Anonymous page without mapping */
+		if (page_count(to_page) != to_expected_count)
+			return -EAGAIN;
+	}
+
+	if (!from_mapping) {
+		/* Anonymous page without mapping */
+		if (page_count(from_page) != from_expected_count)
+			return -EAGAIN;
+	}
+
+	/*
+	 * Now we know that no one else is looking at the page:
+	 * no turning back from here.
+	 */
+	/* from_page  */
+	from_page->index = to_page_index;
+	from_page->mapping = to_mapping_value;
+
+	ClearPageSwapBacked(from_page);
+	if (to_swapbacked)
+		SetPageSwapBacked(from_page);
+
+
+	/* to_page  */
+	to_page->index = from_page_index;
+	to_page->mapping = from_mapping_value;
+
+	ClearPageSwapBacked(to_page);
+	if (from_swapbacked)
+		SetPageSwapBacked(to_page);
+
+	return MIGRATEPAGE_SUCCESS;
+}
+
+static int exchange_from_to_pages(struct page *to_page, struct page *from_page,
+				enum migrate_mode mode)
+{
+	int rc = -EBUSY;
+	struct address_space *to_page_mapping, *from_page_mapping;
+
+	VM_BUG_ON_PAGE(!PageLocked(from_page), from_page);
+	VM_BUG_ON_PAGE(!PageLocked(to_page), to_page);
+
+	/* copy page->mapping not use page_mapping()  */
+	to_page_mapping = page_mapping(to_page);
+	from_page_mapping = page_mapping(from_page);
+
+	BUG_ON(from_page_mapping);
+	BUG_ON(to_page_mapping);
+
+	BUG_ON(PageWriteback(from_page));
+	BUG_ON(PageWriteback(to_page));
+
+	/* actual page mapping exchange */
+	rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
+						to_page, from_page, mode, 0, 0);
+	/* actual page data exchange  */
+	if (rc != MIGRATEPAGE_SUCCESS)
+		return rc;
+
+	rc = -EFAULT;
+
+	if (PageHuge(from_page) || PageTransHuge(from_page))
+		exchange_huge_page(to_page, from_page);
+	else
+		exchange_highpage(to_page, from_page);
+	rc = 0;
+
+	exchange_page_flags(to_page, from_page);
+
+	return rc;
+}
+
+static int unmap_and_exchange(struct page *from_page, struct page *to_page,
+				enum migrate_mode mode)
+{
+	int rc = -EAGAIN;
+	int from_page_was_mapped = 0, to_page_was_mapped = 0;
+	pgoff_t from_index, to_index;
+	struct anon_vma *from_anon_vma = NULL, *to_anon_vma = NULL;
+
+	/* from_page lock down  */
+	if (!trylock_page(from_page)) {
+		if ((mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC)
+			goto out;
+
+		lock_page(from_page);
+	}
+
+	BUG_ON(PageWriteback(from_page));
+
+	/*
+	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
+	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * This get_anon_vma() delays freeing anon_vma pointer until the end
+	 * of migration. File cache pages are no problem because of page_lock()
+	 * File Caches may use write_page() or lock_page() in migration, then,
+	 * just care Anon page here.
+	 *
+	 * Only page_get_anon_vma() understands the subtleties of
+	 * getting a hold on an anon_vma from outside one of its mms.
+	 * But if we cannot get anon_vma, then we won't need it anyway,
+	 * because that implies that the anon page is no longer mapped
+	 * (and cannot be remapped so long as we hold the page lock).
+	 */
+	if (PageAnon(from_page) && !PageKsm(from_page))
+		from_anon_vma = page_get_anon_vma(from_page);
+
+	/* to_page lock down  */
+	if (!trylock_page(to_page)) {
+		if ((mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC)
+			goto out_unlock;
+
+		lock_page(to_page);
+	}
+
+	BUG_ON(PageWriteback(to_page));
+
+	/*
+	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
+	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * This get_anon_vma() delays freeing anon_vma pointer until the end
+	 * of migration. File cache pages are no problem because of page_lock()
+	 * File Caches may use write_page() or lock_page() in migration, then,
+	 * just care Anon page here.
+	 *
+	 * Only page_get_anon_vma() understands the subtleties of
+	 * getting a hold on an anon_vma from outside one of its mms.
+	 * But if we cannot get anon_vma, then we won't need it anyway,
+	 * because that implies that the anon page is no longer mapped
+	 * (and cannot be remapped so long as we hold the page lock).
+	 */
+	if (PageAnon(to_page) && !PageKsm(to_page))
+		to_anon_vma = page_get_anon_vma(to_page);
+
+	from_index = from_page->index;
+	to_index = to_page->index;
+
+	/*
+	 * Corner case handling:
+	 * 1. When a new swap-cache page is read into, it is added to the LRU
+	 * and treated as swapcache but it has no rmap yet.
+	 * Calling try_to_unmap() against a page->mapping==NULL page will
+	 * trigger a BUG.  So handle it here.
+	 * 2. An orphaned page (see truncate_complete_page) might have
+	 * fs-private metadata. The page can be picked up due to memory
+	 * offlining.  Everywhere else except page reclaim, the page is
+	 * invisible to the vm, so the page can not be migrated.  So try to
+	 * free the metadata, so the page can be freed.
+	 */
+	if (!from_page->mapping) {
+		VM_BUG_ON_PAGE(PageAnon(from_page), from_page);
+		if (page_has_private(from_page)) {
+			try_to_free_buffers(from_page);
+			goto out_unlock_both;
+		}
+	} else if (page_mapped(from_page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(from_page) && !PageKsm(from_page) &&
+					   !from_anon_vma, from_page);
+		try_to_unmap(from_page,
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		from_page_was_mapped = 1;
+	}
+
+	if (!to_page->mapping) {
+		VM_BUG_ON_PAGE(PageAnon(to_page), to_page);
+		if (page_has_private(to_page)) {
+			try_to_free_buffers(to_page);
+			goto out_unlock_both_remove_from_migration_pte;
+		}
+	} else if (page_mapped(to_page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(to_page) && !PageKsm(to_page) &&
+					   !to_anon_vma, to_page);
+		try_to_unmap(to_page,
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		to_page_was_mapped = 1;
+	}
+
+	if (!page_mapped(from_page) && !page_mapped(to_page))
+		rc = exchange_from_to_pages(to_page, from_page, mode);
+
+	/* In remove_migration_ptes(), page_vma_mapped_walk() assumes
+	 * from_page and to_page have the same index.
+	 * Thus, we temporarily restore the original index of to_page
+	 * (the old page) around the call below.
+	 */
+	if (to_page_was_mapped) {
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(to_page->index, to_index);
+
+		remove_migration_ptes(to_page,
+			rc == MIGRATEPAGE_SUCCESS ? from_page : to_page, false);
+
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(to_page->index, to_index);
+	}
+
+out_unlock_both_remove_from_migration_pte:
+	if (from_page_was_mapped) {
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(from_page->index, from_index);
+
+		remove_migration_ptes(from_page,
+			rc == MIGRATEPAGE_SUCCESS ? to_page : from_page, false);
+
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(from_page->index, from_index);
+	}
+
+
+
+out_unlock_both:
+	if (to_anon_vma)
+		put_anon_vma(to_anon_vma);
+	unlock_page(to_page);
+out_unlock:
+	/* Drop an anon_vma reference if we took one */
+	if (from_anon_vma)
+		put_anon_vma(from_anon_vma);
+	unlock_page(from_page);
+out:
+
+	return rc;
+}
+
+/*
+ * Exchange pages in the exchange_list
+ *
+ * Caller should release the exchange_list resource.
+ *
+ */
+int exchange_pages(struct list_head *exchange_list,
+			enum migrate_mode mode,
+			int reason)
+{
+	struct exchange_page_info *one_pair, *one_pair2;
+	int failed = 0;
+
+	list_for_each_entry_safe(one_pair, one_pair2, exchange_list, list) {
+		struct page *from_page = one_pair->from_page;
+		struct page *to_page = one_pair->to_page;
+		int rc;
+		int retry = 0;
+
+again:
+		if (page_count(from_page) == 1) {
+			/* page was freed from under us. So we are done  */
+			ClearPageActive(from_page);
+			ClearPageUnevictable(from_page);
+
+			put_page(from_page);
+			dec_node_page_state(from_page, NR_ISOLATED_ANON +
+					page_is_file_cache(from_page));
+
+			if (page_count(to_page) == 1) {
+				ClearPageActive(to_page);
+				ClearPageUnevictable(to_page);
+				put_page(to_page);
+			} else
+				goto putback_to_page;
+
+			continue;
+		}
+
+		if (page_count(to_page) == 1) {
+			/* page was freed from under us. So we are done  */
+			ClearPageActive(to_page);
+			ClearPageUnevictable(to_page);
+
+			put_page(to_page);
+
+			dec_node_page_state(to_page, NR_ISOLATED_ANON +
+					page_is_file_cache(to_page));
+
+			dec_node_page_state(from_page, NR_ISOLATED_ANON +
+					page_is_file_cache(from_page));
+			putback_lru_page(from_page);
+			continue;
+		}
+
+		/* TODO: compound page not supported */
+		if (PageCompound(from_page) || page_mapping(from_page)) {
+			++failed;
+			goto putback;
+		}
+
+		rc = unmap_and_exchange(from_page, to_page, mode);
+
+		if (rc == -EAGAIN && retry < 3) {
+			++retry;
+			goto again;
+		}
+
+		if (rc != MIGRATEPAGE_SUCCESS)
+			++failed;
+
+putback:
+		dec_node_page_state(from_page, NR_ISOLATED_ANON +
+				page_is_file_cache(from_page));
+
+		putback_lru_page(from_page);
+putback_to_page:
+		dec_node_page_state(to_page, NR_ISOLATED_ANON +
+				page_is_file_cache(to_page));
+
+		putback_lru_page(to_page);
+
+	}
+	return failed;
+}
diff --git a/mm/ksm.c b/mm/ksm.c
index fc64874..e5b492b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2716,6 +2716,41 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 		set_page_stable_node(oldpage, NULL);
 	}
 }
+
+void ksm_exchange_page(struct page *to_page, struct page *from_page)
+{
+	struct stable_node *to_stable_node, *from_stable_node;
+
+	VM_BUG_ON_PAGE(!PageLocked(to_page), to_page);
+	VM_BUG_ON_PAGE(!PageLocked(from_page), from_page);
+
+	to_stable_node = page_stable_node(to_page);
+	from_stable_node = page_stable_node(from_page);
+	if (to_stable_node) {
+		VM_BUG_ON_PAGE(to_stable_node->kpfn != page_to_pfn(from_page),
+					from_page);
+		to_stable_node->kpfn = page_to_pfn(to_page);
+		/*
+		 * newpage->mapping was set in advance; now we need smp_wmb()
+		 * to make sure that the new stable_node->kpfn is visible
+		 * to get_ksm_page() before it can see that oldpage->mapping
+		 * has gone stale (or that PageSwapCache has been cleared).
+		 */
+		smp_wmb();
+	}
+	if (from_stable_node) {
+		VM_BUG_ON_PAGE(from_stable_node->kpfn != page_to_pfn(to_page),
+					to_page);
+		from_stable_node->kpfn = page_to_pfn(from_page);
+		/*
+		 * newpage->mapping was set in advance; now we need smp_wmb()
+		 * to make sure that the new stable_node->kpfn is visible
+		 * to get_ksm_page() before it can see that oldpage->mapping
+		 * has gone stale (or that PageSwapCache has been cleared).
+		 */
+		smp_wmb();
+	}
+}
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 13/25] exchange pages: add multi-threaded exchange pages.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (11 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 12/25] exchange pages: new page migration mechanism: exchange_pages() Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 14/25] exchange pages: concurrent " Zi Yan
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Exchange two pages using multiple threads, and exchange two lists of
pages using multiple threads. A short worked example of the per-worker
chunking follows below.
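
To make the split concrete, here is a small worked sketch (illustration only,
not part of the patch) of the chunking arithmetic in exchange_page_mthread():
each work item gets an equal slice of the (possibly huge) page and swaps it
8 bytes at a time on a CPU of the destination node. The THP size and thread
count below are assumed values.

	/* Illustration only: per-worker chunking used by exchange_page_mthread(). */
	#include <stdio.h>

	int main(void)
	{
		unsigned long page_size = 4096;  /* PAGE_SIZE */
		unsigned long nr_pages = 512;    /* a 2MB THP = 512 base pages */
		unsigned int nr_threads = 8;     /* limit_mt_num, capped by node CPUs */
		unsigned long chunk = page_size * nr_pages / nr_threads;

		/* Each work item runs exchange_page_routine() on one such chunk. */
		printf("per-worker chunk: %lu KB\n", chunk / 1024); /* 256 KB */
		return 0;
	}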

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/Makefile        |   1 +
 mm/exchange.c      |  15 ++--
 mm/exchange_page.c | 229 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h      |   5 ++
 4 files changed, 245 insertions(+), 5 deletions(-)
 create mode 100644 mm/exchange_page.c

diff --git a/mm/Makefile b/mm/Makefile
index 5e6c591..2f1f1ad 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -46,6 +46,7 @@ obj-y += memblock.o
 
 obj-y += copy_page.o
 obj-y += exchange.o
+obj-y += exchange_page.o
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
diff --git a/mm/exchange.c b/mm/exchange.c
index 626bbea..ce2c899 100644
--- a/mm/exchange.c
+++ b/mm/exchange.c
@@ -345,11 +345,16 @@ static int exchange_from_to_pages(struct page *to_page, struct page *from_page,
 
 	rc = -EFAULT;
 
-	if (PageHuge(from_page) || PageTransHuge(from_page))
-		exchange_huge_page(to_page, from_page);
-	else
-		exchange_highpage(to_page, from_page);
-	rc = 0;
+	if (mode & MIGRATE_MT)
+		rc = exchange_page_mthread(to_page, from_page,
+				hpage_nr_pages(from_page));
+	if (rc) {
+		if (PageHuge(from_page) || PageTransHuge(from_page))
+			exchange_huge_page(to_page, from_page);
+		else
+			exchange_highpage(to_page, from_page);
+		rc = 0;
+	}
 
 	exchange_page_flags(to_page, from_page);
 
diff --git a/mm/exchange_page.c b/mm/exchange_page.c
new file mode 100644
index 0000000..6054697
--- /dev/null
+++ b/mm/exchange_page.c
@@ -0,0 +1,229 @@
+/*
+ * Exchange page copy routine.
+ *
+ * Copyright 2019 by NVIDIA.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Zi Yan <ziy@nvidia.com>
+ *
+ */
+#include <linux/highmem.h>
+#include <linux/workqueue.h>
+#include <linux/slab.h>
+#include <linux/freezer.h>
+
+/*
+ * limit_mt_num caps the number of copy threads for a given node on any
+ * architecture. The actual number of copy threads will be further
+ * limited by the cpumask weight of the target node.
+ */
+extern unsigned int limit_mt_num;
+
+struct copy_page_info {
+	struct work_struct copy_page_work;
+	char *to;
+	char *from;
+	unsigned long chunk_size;
+};
+
+static void exchange_page_routine(char *to, char *from, unsigned long chunk_size)
+{
+	u64 tmp;
+	int i;
+
+	for (i = 0; i < chunk_size; i += sizeof(tmp)) {
+		tmp = *((u64*)(from + i));
+		*((u64*)(from + i)) = *((u64*)(to + i));
+		*((u64*)(to + i)) = tmp;
+	}
+}
+
+static void exchange_page_work_queue_thread(struct work_struct *work)
+{
+	struct copy_page_info *my_work = (struct copy_page_info*)work;
+
+	exchange_page_routine(my_work->to,
+							  my_work->from,
+							  my_work->chunk_size);
+}
+
+int exchange_page_mthread(struct page *to, struct page *from, int nr_pages)
+{
+	int total_mt_num = limit_mt_num;
+	int to_node = page_to_nid(to);
+	int i;
+	struct copy_page_info *work_items;
+	char *vto, *vfrom;
+	unsigned long chunk_size;
+	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
+	int cpu_id_list[32] = {0};
+	int cpu;
+
+	total_mt_num = min_t(unsigned int, total_mt_num,
+						 cpumask_weight(per_node_cpumask));
+
+	if (total_mt_num > 1)
+		total_mt_num = (total_mt_num / 2) * 2;
+
+	if (total_mt_num > 32 || total_mt_num < 1)
+		return -ENODEV;
+
+	work_items = kvzalloc(sizeof(struct copy_page_info)*total_mt_num,
+						 GFP_KERNEL);
+	if (!work_items)
+		return -ENOMEM;
+
+	i = 0;
+	for_each_cpu(cpu, per_node_cpumask) {
+		if (i >= total_mt_num)
+			break;
+		cpu_id_list[i] = cpu;
+		++i;
+	}
+
+	/* XXX: assume no highmem  */
+	vfrom = kmap(from);
+	vto = kmap(to);
+	chunk_size = PAGE_SIZE*nr_pages / total_mt_num;
+
+	for (i = 0; i < total_mt_num; ++i) {
+		INIT_WORK((struct work_struct *)&work_items[i],
+				exchange_page_work_queue_thread);
+
+		work_items[i].to = vto + i * chunk_size;
+		work_items[i].from = vfrom + i * chunk_size;
+		work_items[i].chunk_size = chunk_size;
+
+		queue_work_on(cpu_id_list[i],
+					  system_highpri_wq,
+					  (struct work_struct *)&work_items[i]);
+	}
+
+	/* Wait until it finishes  */
+	flush_workqueue(system_highpri_wq);
+
+	kunmap(to);
+	kunmap(from);
+
+	kvfree(work_items);
+
+	return 0;
+}
+
+int exchange_page_lists_mthread(struct page **to, struct page **from, int nr_pages)
+{
+	int err = 0;
+	unsigned int total_mt_num = limit_mt_num;
+	int to_node = page_to_nid(*to);
+	int i;
+	struct copy_page_info *work_items;
+	int nr_pages_per_page = hpage_nr_pages(*from);
+	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
+	int cpu_id_list[32] = {0};
+	int cpu;
+	int item_idx;
+
+
+	total_mt_num = min_t(unsigned int, total_mt_num,
+						 cpumask_weight(per_node_cpumask));
+
+	if (total_mt_num > 32 || total_mt_num < 1)
+		return -ENODEV;
+
+	if (nr_pages < total_mt_num) {
+		int residual_nr_pages = nr_pages - rounddown_pow_of_two(nr_pages);
+
+		if (residual_nr_pages) {
+			for (i = 0; i < residual_nr_pages; ++i) {
+				BUG_ON(hpage_nr_pages(to[i]) != hpage_nr_pages(from[i]));
+				err = exchange_page_mthread(to[i], from[i], hpage_nr_pages(to[i]));
+				VM_BUG_ON(err);
+			}
+			nr_pages = rounddown_pow_of_two(nr_pages);
+			to = &to[residual_nr_pages];
+			from = &from[residual_nr_pages];
+		}
+
+		work_items = kvzalloc(sizeof(struct copy_page_info)*total_mt_num,
+							 GFP_KERNEL);
+	} else
+		work_items = kvzalloc(sizeof(struct copy_page_info)*nr_pages,
+							 GFP_KERNEL);
+	if (!work_items)
+		return -ENOMEM;
+
+	i = 0;
+	for_each_cpu(cpu, per_node_cpumask) {
+		if (i >= total_mt_num)
+			break;
+		cpu_id_list[i] = cpu;
+		++i;
+	}
+
+	if (nr_pages < total_mt_num) {
+		for (cpu = 0; cpu < total_mt_num; ++cpu)
+			INIT_WORK((struct work_struct *)&work_items[cpu],
+					  exchange_page_work_queue_thread);
+		cpu = 0;
+		for (item_idx = 0; item_idx < nr_pages; ++item_idx) {
+			unsigned long chunk_size = nr_pages * PAGE_SIZE * hpage_nr_pages(from[item_idx]) / total_mt_num;
+			char *vfrom = kmap(from[item_idx]);
+			char *vto = kmap(to[item_idx]);
+			VM_BUG_ON(PAGE_SIZE * hpage_nr_pages(from[item_idx]) % total_mt_num);
+			VM_BUG_ON(total_mt_num % nr_pages);
+			BUG_ON(hpage_nr_pages(to[item_idx]) !=
+				   hpage_nr_pages(from[item_idx]));
+
+			for (i = 0; i < (total_mt_num/nr_pages); ++cpu, ++i) {
+				work_items[cpu].to = vto + chunk_size * i;
+				work_items[cpu].from = vfrom + chunk_size * i;
+				work_items[cpu].chunk_size = chunk_size;
+			}
+		}
+		if (cpu != total_mt_num)
+			pr_err("%s: only %d out of %d pages are transferred\n", __func__,
+				cpu - 1, total_mt_num);
+
+		for (cpu = 0; cpu < total_mt_num; ++cpu)
+			queue_work_on(cpu_id_list[cpu],
+						  system_highpri_wq,
+						  (struct work_struct *)&work_items[cpu]);
+	} else {
+		for (i = 0; i < nr_pages; ++i) {
+			int thread_idx = i % total_mt_num;
+
+			INIT_WORK((struct work_struct *)&work_items[i], exchange_page_work_queue_thread);
+
+			/* XXX: assume no highmem  */
+			work_items[i].to = kmap(to[i]);
+			work_items[i].from = kmap(from[i]);
+			work_items[i].chunk_size = PAGE_SIZE * hpage_nr_pages(from[i]);
+
+			BUG_ON(hpage_nr_pages(to[i]) != hpage_nr_pages(from[i]));
+
+			queue_work_on(cpu_id_list[thread_idx], system_highpri_wq, (struct work_struct *)&work_items[i]);
+		}
+	}
+
+	/* Wait until it finishes  */
+	flush_workqueue(system_highpri_wq);
+
+	for (i = 0; i < nr_pages; ++i) {
+		kunmap(to[i]);
+		kunmap(from[i]);
+	}
+
+	kvfree(work_items);
+
+	return err;
+}
+
diff --git a/mm/internal.h b/mm/internal.h
index 51f5e1b..a039459 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -561,4 +561,9 @@ extern int copy_page_lists_dma_always(struct page **to,
 extern int copy_page_lists_mt(struct page **to,
 			struct page **from, int nr_pages);
 
+extern int exchange_page_mthread(struct page *to, struct page *from,
+			int nr_pages);
+extern int exchange_page_lists_mthread(struct page **to,
+						  struct page **from, 
+						  int nr_pages);
 #endif	/* __MM_INTERNAL_H */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 14/25] exchange pages: concurrent exchange pages.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (12 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 13/25] exchange pages: add multi-threaded exchange pages Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 15/25] exchange pages: exchange anonymous page and file-backed page Zi Yan
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

It unmaps two lists of pages, then exchanges their contents in
exchange_page_lists_mthread(), and finally remaps both lists of pages.
A minimal caller sketch follows below.
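
For context, a minimal caller sketch (not part of the patch; page isolation,
error handling and freeing of the pair structs are omitted, and the helper
name is made up for illustration) showing how two lists of already-isolated
pages could be paired up and handed to exchange_pages_concur(). It mirrors
what do_exchange_page_list() does later in this series; MR_SYSCALL is only an
illustrative reason code.

	static int exchange_isolated_lists(struct list_head *from_pages,
					   struct list_head *to_pages)
	{
		LIST_HEAD(exchange_list);

		while (!list_empty(from_pages) && !list_empty(to_pages)) {
			struct exchange_page_info *pair;
			struct page *from = list_first_entry(from_pages, struct page, lru);
			struct page *to = list_first_entry(to_pages, struct page, lru);

			pair = kzalloc(sizeof(*pair), GFP_KERNEL);
			if (!pair)
				return -ENOMEM;

			list_del(&from->lru);
			list_del(&to->lru);
			pair->from_page = from;
			pair->to_page = to;
			list_add_tail(&pair->list, &exchange_list);
		}

		/* MIGRATE_MT selects the multi-threaded copy path from patch 13. */
		return exchange_pages_concur(&exchange_list,
					     MIGRATE_SYNC | MIGRATE_MT, MR_SYSCALL);
	}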

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/exchange.h |   2 +
 mm/exchange.c            | 397 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/exchange_page.c       |   1 -
 3 files changed, 399 insertions(+), 1 deletion(-)

diff --git a/include/linux/exchange.h b/include/linux/exchange.h
index 778068e..20d2184 100644
--- a/include/linux/exchange.h
+++ b/include/linux/exchange.h
@@ -20,4 +20,6 @@ struct exchange_page_info {
 int exchange_pages(struct list_head *exchange_list,
 			enum migrate_mode mode,
 			int reason);
+int exchange_pages_concur(struct list_head *exchange_list,
+		enum migrate_mode mode, int reason);
 #endif /* _LINUX_EXCHANGE_H */
diff --git a/mm/exchange.c b/mm/exchange.c
index ce2c899..bbada58 100644
--- a/mm/exchange.c
+++ b/mm/exchange.c
@@ -600,3 +600,400 @@ int exchange_pages(struct list_head *exchange_list,
 	}
 	return failed;
 }
+
+
+static int unmap_pair_pages_concur(struct exchange_page_info *one_pair,
+				int force, enum migrate_mode mode)
+{
+	int rc = -EAGAIN;
+	struct anon_vma *anon_vma_from_page = NULL, *anon_vma_to_page = NULL;
+	struct page *from_page = one_pair->from_page;
+	struct page *to_page = one_pair->to_page;
+
+	/* from_page lock down  */
+	if (!trylock_page(from_page)) {
+		if (!force || ((mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC))
+			goto out;
+
+		lock_page(from_page);
+	}
+
+	BUG_ON(PageWriteback(from_page));
+
+	/*
+	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
+	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * This get_anon_vma() delays freeing anon_vma pointer until the end
+	 * of migration. File cache pages are no problem because of page_lock()
+	 * File Caches may use write_page() or lock_page() in migration, then,
+	 * just care Anon page here.
+	 *
+	 * Only page_get_anon_vma() understands the subtleties of
+	 * getting a hold on an anon_vma from outside one of its mms.
+	 * But if we cannot get anon_vma, then we won't need it anyway,
+	 * because that implies that the anon page is no longer mapped
+	 * (and cannot be remapped so long as we hold the page lock).
+	 */
+	if (PageAnon(from_page) && !PageKsm(from_page))
+		one_pair->from_anon_vma = anon_vma_from_page
+					= page_get_anon_vma(from_page);
+
+	/* to_page lock down  */
+	if (!trylock_page(to_page)) {
+		if (!force || ((mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC))
+			goto out_unlock;
+
+		lock_page(to_page);
+	}
+
+	BUG_ON(PageWriteback(to_page));
+
+	/*
+	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
+	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * This get_anon_vma() delays freeing anon_vma pointer until the end
+	 * of migration. File cache pages are no problem because of page_lock()
+	 * File Caches may use write_page() or lock_page() in migration, then,
+	 * just care Anon page here.
+	 *
+	 * Only page_get_anon_vma() understands the subtleties of
+	 * getting a hold on an anon_vma from outside one of its mms.
+	 * But if we cannot get anon_vma, then we won't need it anyway,
+	 * because that implies that the anon page is no longer mapped
+	 * (and cannot be remapped so long as we hold the page lock).
+	 */
+	if (PageAnon(to_page) && !PageKsm(to_page))
+		one_pair->to_anon_vma = anon_vma_to_page = page_get_anon_vma(to_page);
+
+	/*
+	 * Corner case handling:
+	 * 1. When a new swap-cache page is read into, it is added to the LRU
+	 * and treated as swapcache but it has no rmap yet.
+	 * Calling try_to_unmap() against a page->mapping==NULL page will
+	 * trigger a BUG.  So handle it here.
+	 * 2. An orphaned page (see truncate_complete_page) might have
+	 * fs-private metadata. The page can be picked up due to memory
+	 * offlining.  Everywhere else except page reclaim, the page is
+	 * invisible to the vm, so the page can not be migrated.  So try to
+	 * free the metadata, so the page can be freed.
+	 */
+	if (!from_page->mapping) {
+		VM_BUG_ON_PAGE(PageAnon(from_page), from_page);
+		if (page_has_private(from_page)) {
+			try_to_free_buffers(from_page);
+			goto out_unlock_both;
+		}
+	} else if (page_mapped(from_page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(from_page) && !PageKsm(from_page) &&
+					   !anon_vma_from_page, from_page);
+		try_to_unmap(from_page,
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+
+		one_pair->from_page_was_mapped = 1;
+	}
+
+	if (!to_page->mapping) {
+		VM_BUG_ON_PAGE(PageAnon(to_page), to_page);
+		if (page_has_private(to_page)) {
+			try_to_free_buffers(to_page);
+			goto out_unlock_both;
+		}
+	} else if (page_mapped(to_page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(to_page) && !PageKsm(to_page) &&
+					   !anon_vma_to_page, to_page);
+		try_to_unmap(to_page,
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+
+		one_pair->to_page_was_mapped = 1;
+	}
+
+	return MIGRATEPAGE_SUCCESS;
+
+out_unlock_both:
+	if (anon_vma_to_page)
+		put_anon_vma(anon_vma_to_page);
+	unlock_page(to_page);
+out_unlock:
+	/* Drop an anon_vma reference if we took one */
+	if (anon_vma_from_page)
+		put_anon_vma(anon_vma_from_page);
+	unlock_page(from_page);
+out:
+
+	return rc;
+}
+
+static int exchange_page_mapping_concur(struct list_head *unmapped_list_ptr,
+					   struct list_head *exchange_list_ptr,
+						enum migrate_mode mode)
+{
+	int rc = -EBUSY;
+	int nr_failed = 0;
+	struct address_space *to_page_mapping, *from_page_mapping;
+	struct exchange_page_info *one_pair, *one_pair2;
+
+	list_for_each_entry_safe(one_pair, one_pair2, unmapped_list_ptr, list) {
+		struct page *from_page = one_pair->from_page;
+		struct page *to_page = one_pair->to_page;
+
+		VM_BUG_ON_PAGE(!PageLocked(from_page), from_page);
+		VM_BUG_ON_PAGE(!PageLocked(to_page), to_page);
+
+		/* page_mapping() is used here; it returns NULL for anonymous pages  */
+		to_page_mapping = page_mapping(to_page);
+		from_page_mapping = page_mapping(from_page);
+
+		BUG_ON(from_page_mapping);
+		BUG_ON(to_page_mapping);
+
+		BUG_ON(PageWriteback(from_page));
+		BUG_ON(PageWriteback(to_page));
+
+		/* actual page mapping exchange */
+		rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
+							to_page, from_page, mode, 0, 0);
+
+		if (rc) {
+			if (one_pair->from_page_was_mapped)
+				remove_migration_ptes(from_page, from_page, false);
+			if (one_pair->to_page_was_mapped)
+				remove_migration_ptes(to_page, to_page, false);
+
+			if (one_pair->from_anon_vma)
+				put_anon_vma(one_pair->from_anon_vma);
+			unlock_page(from_page);
+
+			if (one_pair->to_anon_vma)
+				put_anon_vma(one_pair->to_anon_vma);
+			unlock_page(to_page);
+
+			mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON +
+					page_is_file_cache(from_page), -hpage_nr_pages(from_page));
+			putback_lru_page(from_page);
+
+			mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON +
+					page_is_file_cache(to_page), -hpage_nr_pages(to_page));
+			putback_lru_page(to_page);
+
+			one_pair->from_page = NULL;
+			one_pair->to_page = NULL;
+
+			list_move(&one_pair->list, exchange_list_ptr);
+			++nr_failed;
+		}
+	}
+
+	return nr_failed;
+}
+
+static int exchange_page_data_concur(struct list_head *unmapped_list_ptr,
+									enum migrate_mode mode)
+{
+	struct exchange_page_info *one_pair;
+	int num_pages = 0, idx = 0;
+	struct page **src_page_list = NULL, **dst_page_list = NULL;
+	unsigned long size = 0;
+	int rc = -EFAULT;
+
+	if (list_empty(unmapped_list_ptr))
+		return 0;
+
+	/* form page list  */
+	list_for_each_entry(one_pair, unmapped_list_ptr, list) {
+		++num_pages;
+		size += PAGE_SIZE * hpage_nr_pages(one_pair->from_page);
+	}
+
+	src_page_list = kzalloc(sizeof(struct page *)*num_pages, GFP_KERNEL);
+	if (!src_page_list)
+		return -ENOMEM;
+	dst_page_list = kzalloc(sizeof(struct page *)*num_pages, GFP_KERNEL);
+	if (!dst_page_list) {
+		kfree(src_page_list);
+		return -ENOMEM;
+	}
+
+	list_for_each_entry(one_pair, unmapped_list_ptr, list) {
+		src_page_list[idx] = one_pair->from_page;
+		dst_page_list[idx] = one_pair->to_page;
+		++idx;
+	}
+
+	BUG_ON(idx != num_pages);
+
+
+	if (mode & MIGRATE_MT)
+		rc = exchange_page_lists_mthread(dst_page_list, src_page_list,
+				num_pages);
+
+	if (rc) {
+		list_for_each_entry(one_pair, unmapped_list_ptr, list) {
+			if (PageHuge(one_pair->from_page) ||
+				PageTransHuge(one_pair->from_page)) {
+				exchange_huge_page(one_pair->to_page, one_pair->from_page);
+			} else {
+				exchange_highpage(one_pair->to_page, one_pair->from_page);
+			}
+		}
+	}
+
+	kfree(src_page_list);
+	kfree(dst_page_list);
+
+	list_for_each_entry(one_pair, unmapped_list_ptr, list) {
+		exchange_page_flags(one_pair->to_page, one_pair->from_page);
+	}
+
+	return rc;
+}
+
+static int remove_migration_ptes_concur(struct list_head *unmapped_list_ptr)
+{
+	struct exchange_page_info *iterator;
+
+	list_for_each_entry(iterator, unmapped_list_ptr, list) {
+		remove_migration_ptes(iterator->from_page, iterator->to_page, false);
+		remove_migration_ptes(iterator->to_page, iterator->from_page, false);
+
+
+		if (iterator->from_anon_vma)
+			put_anon_vma(iterator->from_anon_vma);
+		unlock_page(iterator->from_page);
+
+
+		if (iterator->to_anon_vma)
+			put_anon_vma(iterator->to_anon_vma);
+		unlock_page(iterator->to_page);
+
+
+		putback_lru_page(iterator->from_page);
+		iterator->from_page = NULL;
+
+		putback_lru_page(iterator->to_page);
+		iterator->to_page = NULL;
+	}
+
+	return 0;
+}
+
+int exchange_pages_concur(struct list_head *exchange_list,
+		enum migrate_mode mode, int reason)
+{
+	struct exchange_page_info *one_pair, *one_pair2;
+	int pass = 0;
+	int retry = 1;
+	int nr_failed = 0;
+	int nr_succeeded = 0;
+	int rc = 0;
+	LIST_HEAD(serialized_list);
+	LIST_HEAD(unmapped_list);
+
+	for(pass = 0; pass < 1 && retry; pass++) {
+		retry = 0;
+
+		/* unmap and get new page for page_mapping(page) == NULL */
+		list_for_each_entry_safe(one_pair, one_pair2, exchange_list, list) {
+			struct page *from_page = one_pair->from_page;
+			struct page *to_page = one_pair->to_page;
+			cond_resched();
+
+			if (page_count(from_page) == 1) {
+				/* page was freed from under us. So we are done  */
+				ClearPageActive(from_page);
+				ClearPageUnevictable(from_page);
+
+				put_page(from_page);
+				dec_node_page_state(from_page, NR_ISOLATED_ANON +
+						page_is_file_cache(from_page));
+
+				if (page_count(to_page) == 1) {
+					ClearPageActive(to_page);
+					ClearPageUnevictable(to_page);
+					put_page(to_page);
+				} else {
+					mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON +
+							page_is_file_cache(to_page), -hpage_nr_pages(to_page));
+					putback_lru_page(to_page);
+				}
+				list_del(&one_pair->list);
+
+				continue;
+			}
+
+			if (page_count(to_page) == 1) {
+				/* page was freed from under us. So we are done  */
+				ClearPageActive(to_page);
+				ClearPageUnevictable(to_page);
+
+				put_page(to_page);
+
+				dec_node_page_state(to_page, NR_ISOLATED_ANON +
+						page_is_file_cache(to_page));
+
+				mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON +
+						page_is_file_cache(from_page), -hpage_nr_pages(from_page));
+				putback_lru_page(from_page);
+
+				list_del(&one_pair->list);
+				continue;
+			}
+			/* We do not exchange huge pages and file-backed pages concurrently */
+			if (PageHuge(one_pair->from_page) || PageHuge(one_pair->to_page)) {
+				rc = -ENODEV;
+			}
+			else if ((page_mapping(one_pair->from_page) != NULL) ||
+					 (page_mapping(one_pair->to_page) != NULL)) {
+				rc = -ENODEV;
+			}
+			else
+				rc = unmap_pair_pages_concur(one_pair, 1, mode);
+
+			switch(rc) {
+			case -ENODEV:
+				list_move(&one_pair->list, &serialized_list);
+				break;
+			case -ENOMEM:
+				goto out;
+			case -EAGAIN:
+				retry++;
+				break;
+			case MIGRATEPAGE_SUCCESS:
+				list_move(&one_pair->list, &unmapped_list);
+				nr_succeeded++;
+				break;
+			default:
+				/*
+				 * Permanent failure (-EBUSY, -ENOSYS, etc.):
+				 * unlike -EAGAIN case, the failed page is
+				 * removed from migration page list and not
+				 * retried in the next outer loop.
+				 */
+				list_move(&one_pair->list, &serialized_list);
+				nr_failed++;
+				break;
+			}
+		}
+
+		/* move page->mapping to new page, only -EAGAIN could happen  */
+		exchange_page_mapping_concur(&unmapped_list, exchange_list, mode);
+
+
+		/* copy pages in unmapped_list */
+		exchange_page_data_concur(&unmapped_list, mode);
+
+
+		/* remove migration ptes, unlock old and new pages,
+		 * put anon_vma, and put back old and new pages */
+		remove_migration_ptes_concur(&unmapped_list);
+	}
+
+	nr_failed += retry;
+	rc = nr_failed;
+
+	exchange_pages(&serialized_list, mode, reason);
+out:
+	list_splice(&unmapped_list, exchange_list);
+	list_splice(&serialized_list, exchange_list);
+
+	return nr_failed?-EFAULT:0;
+}
diff --git a/mm/exchange_page.c b/mm/exchange_page.c
index 6054697..5dba0a6 100644
--- a/mm/exchange_page.c
+++ b/mm/exchange_page.c
@@ -126,7 +126,6 @@ int exchange_page_lists_mthread(struct page **to, struct page **from, int nr_pag
 	int to_node = page_to_nid(*to);
 	int i;
 	struct copy_page_info *work_items;
-	int nr_pages_per_page = hpage_nr_pages(*from);
 	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
 	int cpu_id_list[32] = {0};
 	int cpu;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 15/25] exchange pages: exchange anonymous page and file-backed page.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (13 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 14/25] exchange pages: concurrent " Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 16/25] exchange page: Add THP exchange support Zi Yan
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

This is only done in the basic exchange_pages() path, because concurrent
exchange pages might need to lock multiple files at the same time, which
could easily cause deadlocks.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/exchange.c | 284 ++++++++++++++++++++++++++++++++++++++++++++++------------
 mm/internal.h |   9 ++
 mm/migrate.c  |   6 +-
 3 files changed, 241 insertions(+), 58 deletions(-)

diff --git a/mm/exchange.c b/mm/exchange.c
index bbada58..555a72c 100644
--- a/mm/exchange.c
+++ b/mm/exchange.c
@@ -20,6 +20,8 @@
 #include <linux/memcontrol.h>
 #include <linux/balloon_compaction.h>
 #include <linux/buffer_head.h>
+#include <linux/fs.h> /* buffer_migrate_page  */
+#include <linux/backing-dev.h>
 
 
 #include "internal.h"
@@ -147,8 +149,6 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 	from_page_flags.page_is_idle = page_is_idle(from_page);
 	clear_page_idle(from_page);
 	from_page_flags.page_swapcache = PageSwapCache(from_page);
-	from_page_flags.page_private = PagePrivate(from_page);
-	ClearPagePrivate(from_page);
 	from_page_flags.page_writeback = test_clear_page_writeback(from_page);
 
 
@@ -170,8 +170,6 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 	to_page_flags.page_is_idle = page_is_idle(to_page);
 	clear_page_idle(to_page);
 	to_page_flags.page_swapcache = PageSwapCache(to_page);
-	to_page_flags.page_private = PagePrivate(to_page);
-	ClearPagePrivate(to_page);
 	to_page_flags.page_writeback = test_clear_page_writeback(to_page);
 
 	/* set to_page */
@@ -268,18 +266,22 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 static int exchange_page_move_mapping(struct address_space *to_mapping,
 			struct address_space *from_mapping,
 			struct page *to_page, struct page *from_page,
+			struct buffer_head *to_head, struct buffer_head *from_head,
 			enum migrate_mode mode,
 			int to_extra_count, int from_extra_count)
 {
-	int to_expected_count = 1 + to_extra_count,
-		from_expected_count = 1 + from_extra_count;
-	unsigned long from_page_index = page_index(from_page),
-				  to_page_index = page_index(to_page);
+	int to_expected_count = expected_page_refs(to_mapping, to_page) + to_extra_count,
+		from_expected_count = expected_page_refs(from_mapping, from_page) + from_extra_count;
+	unsigned long from_page_index = from_page->index;
+	unsigned long to_page_index = to_page->index;
 	int to_swapbacked = PageSwapBacked(to_page),
 		from_swapbacked = PageSwapBacked(from_page);
-	struct address_space *to_mapping_value = to_page->mapping,
-						 *from_mapping_value = from_page->mapping;
+	struct address_space *to_mapping_value = to_page->mapping;
+	struct address_space *from_mapping_value = from_page->mapping;
 
+	VM_BUG_ON_PAGE(to_mapping != page_mapping(to_page), to_page);
+	VM_BUG_ON_PAGE(from_mapping != page_mapping(from_page), from_page);
+	VM_BUG_ON(PageCompound(from_page) != PageCompound(to_page));
 
 	if (!to_mapping) {
 		/* Anonymous page without mapping */
@@ -293,26 +295,125 @@ static int exchange_page_move_mapping(struct address_space *to_mapping,
 			return -EAGAIN;
 	}
 
-	/*
-	 * Now we know that no one else is looking at the page:
-	 * no turning back from here.
-	 */
-	/* from_page  */
-	from_page->index = to_page_index;
-	from_page->mapping = to_mapping_value;
+	/* both are anonymous pages  */
+	if (!from_mapping && !to_mapping) {
+		/* from_page  */
+		from_page->index = to_page_index;
+		from_page->mapping = to_mapping_value;
+
+		ClearPageSwapBacked(from_page);
+		if (to_swapbacked)
+			SetPageSwapBacked(from_page);
+
+
+		/* to_page  */
+		to_page->index = from_page_index;
+		to_page->mapping = from_mapping_value;
+
+		ClearPageSwapBacked(to_page);
+		if (from_swapbacked)
+			SetPageSwapBacked(to_page);
+	} else if (!from_mapping && to_mapping) {
+		/* from is anonymous, to is file-backed  */
+		XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));
+		struct zone *from_zone, *to_zone;
+		int dirty;
+
+		from_zone = page_zone(from_page);
+		to_zone = page_zone(to_page);
+
+		xas_lock_irq(&to_xas);
+
+		if (page_count(to_page) != to_expected_count ||
+			xas_load(&to_xas) != to_page) {
+			xas_unlock_irq(&to_xas);
+			return -EAGAIN;
+		}
+
+		if (!page_ref_freeze(to_page, to_expected_count)) {
+			xas_unlock_irq(&to_xas);
+			pr_debug("cannot freeze page count\n");
+			return -EAGAIN;
+		}
+
+		if (!page_ref_freeze(from_page, from_expected_count)) {
+			page_ref_unfreeze(to_page, to_expected_count);
+			xas_unlock_irq(&to_xas);
+
+			return -EAGAIN;
+		}
+		/*
+		 * Now we know that no one else is looking at the page:
+		 * no turning back from here.
+		 */
+		ClearPageSwapBacked(from_page);
+		ClearPageSwapBacked(to_page);
+
+		/* from_page  */
+		from_page->index = to_page_index;
+		from_page->mapping = to_mapping_value;
+		/* to_page  */
+		to_page->index = from_page_index;
+		to_page->mapping = from_mapping_value;
+
+		if (to_swapbacked)
+			__SetPageSwapBacked(from_page);
+		else
+			VM_BUG_ON_PAGE(PageSwapCache(to_page), to_page);
 
-	ClearPageSwapBacked(from_page);
-	if (to_swapbacked)
-		SetPageSwapBacked(from_page);
+		if (from_swapbacked)
+			__SetPageSwapBacked(to_page);
+		else
+			VM_BUG_ON_PAGE(PageSwapCache(from_page), from_page);
 
+		dirty = PageDirty(to_page);
 
-	/* to_page  */
-	to_page->index = from_page_index;
-	to_page->mapping = from_mapping_value;
+		xas_store(&to_xas, from_page);
+		if (PageTransHuge(to_page)) {
+			int i;
+			for (i = 1; i < HPAGE_PMD_NR; i++) {
+				xas_next(&to_xas);
+				xas_store(&to_xas, from_page + i);
+			}
+		}
+
+		/* move cache reference */
+		page_ref_unfreeze(to_page, to_expected_count - hpage_nr_pages(to_page));
+		page_ref_unfreeze(from_page, from_expected_count + hpage_nr_pages(from_page));
+
+		xas_unlock(&to_xas);
+
+		/*
+		 * If moved to a different zone then also account
+		 * the page for that zone. Other VM counters will be
+		 * taken care of when we establish references to the
+		 * new page and drop references to the old page.
+		 *
+		 * Note that anonymous pages are accounted for
+		 * via NR_FILE_PAGES and NR_ANON_MAPPED if they
+		 * are mapped to swap space.
+		 */
+		if (to_zone != from_zone) {
+			__dec_node_state(to_zone->zone_pgdat, NR_FILE_PAGES);
+			__inc_node_state(from_zone->zone_pgdat, NR_FILE_PAGES);
+			if (PageSwapBacked(to_page) && !PageSwapCache(to_page)) {
+				__dec_node_state(to_zone->zone_pgdat, NR_SHMEM);
+				__inc_node_state(from_zone->zone_pgdat, NR_SHMEM);
+			}
+			if (dirty && mapping_cap_account_dirty(to_mapping)) {
+				__dec_node_state(to_zone->zone_pgdat, NR_FILE_DIRTY);
+				__dec_zone_state(to_zone, NR_ZONE_WRITE_PENDING);
+				__inc_node_state(from_zone->zone_pgdat, NR_FILE_DIRTY);
+				__inc_zone_state(from_zone, NR_ZONE_WRITE_PENDING);
+			}
+		}
+		local_irq_enable();
 
-	ClearPageSwapBacked(to_page);
-	if (from_swapbacked)
-		SetPageSwapBacked(to_page);
+	} else {
+		/* from is file-backed, to is anonymous: fold this into the case above */
+		/* both file-backed: not supported  */
+		VM_BUG_ON(1);
+	}
 
 	return MIGRATEPAGE_SUCCESS;
 }
@@ -322,6 +423,7 @@ static int exchange_from_to_pages(struct page *to_page, struct page *from_page,
 {
 	int rc = -EBUSY;
 	struct address_space *to_page_mapping, *from_page_mapping;
+	struct buffer_head *to_head = NULL, *to_bh = NULL;
 
 	VM_BUG_ON_PAGE(!PageLocked(from_page), from_page);
 	VM_BUG_ON_PAGE(!PageLocked(to_page), to_page);
@@ -330,15 +432,71 @@ static int exchange_from_to_pages(struct page *to_page, struct page *from_page,
 	to_page_mapping = page_mapping(to_page);
 	from_page_mapping = page_mapping(from_page);
 
+	/* from_page has to be an anonymous page  */
 	BUG_ON(from_page_mapping);
-	BUG_ON(to_page_mapping);
-
 	BUG_ON(PageWriteback(from_page));
+	/* writeback has to finish */
 	BUG_ON(PageWriteback(to_page));
 
-	/* actual page mapping exchange */
-	rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
-						to_page, from_page, mode, 0, 0);
+	/* to_page is anonymous  */
+	if (!to_page_mapping) {
+exchange_mappings:
+		/* actual page mapping exchange */
+		rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
+							to_page, from_page, NULL, NULL, mode, 0, 0);
+	} else {
+		if (to_page_mapping->a_ops->migratepage == buffer_migrate_page) {
+			if (!page_has_buffers(to_page))
+				goto exchange_mappings;
+
+			to_head = page_buffers(to_page);
+
+			rc = exchange_page_move_mapping(to_page_mapping,
+					from_page_mapping, to_page, from_page,
+					to_head, NULL, mode, 0, 0);
+
+			if (rc != MIGRATEPAGE_SUCCESS)
+				return rc;
+
+			/*
+			 * In the async case, migrate_page_move_mapping locked the buffers
+			 * with an IRQ-safe spinlock held. In the sync case, the buffers
+			 * need to be locked now
+			 */
+			if ((mode & MIGRATE_MODE_MASK) != MIGRATE_ASYNC)
+				BUG_ON(!buffer_migrate_lock_buffers(to_head, mode));
+
+			ClearPagePrivate(to_page);
+			set_page_private(from_page, page_private(to_page));
+			set_page_private(to_page, 0);
+			/* transfer private page count  */
+			put_page(to_page);
+			get_page(from_page);
+
+			to_bh = to_head;
+			do {
+				set_bh_page(to_bh, from_page, bh_offset(to_bh));
+				to_bh = to_bh->b_this_page;
+
+			} while (to_bh != to_head);
+
+			SetPagePrivate(from_page);
+
+			to_bh = to_head;
+		} else if (!to_page_mapping->a_ops->migratepage) {
+			/* fallback_migrate_page  */
+			if (PageDirty(to_page)) {
+				if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC)
+					return -EBUSY;
+				return writeout(to_page_mapping, to_page);
+			}
+			if (page_has_private(to_page) &&
+				!try_to_release_page(to_page, GFP_KERNEL))
+				return -EAGAIN;
+
+			goto exchange_mappings;
+		}
+	}
 	/* actual page data exchange  */
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
@@ -356,8 +514,28 @@ static int exchange_from_to_pages(struct page *to_page, struct page *from_page,
 		rc = 0;
 	}
 
+	/*
+	 * 1. buffer_migrate_page:
+	 *   private flag should be transferred from to_page to from_page
+	 *
+	 * 2. anon<->anon, fallback_migrate_page:
+	 *   neither page has PagePrivate set, or to_page's has been cleared.
+	 */
+	VM_BUG_ON(!((page_has_private(from_page) && !page_has_private(to_page)) ||
+				(!page_has_private(from_page) && !page_has_private(to_page))));
+
 	exchange_page_flags(to_page, from_page);
 
+	if (to_bh) {
+		VM_BUG_ON(to_bh != to_head);
+		do {
+			unlock_buffer(to_bh);
+			put_bh(to_bh);
+			to_bh = to_bh->b_this_page;
+
+		} while (to_bh != to_head);
+	}
+
 	return rc;
 }
 
@@ -369,34 +547,12 @@ static int unmap_and_exchange(struct page *from_page, struct page *to_page,
 	pgoff_t from_index, to_index;
 	struct anon_vma *from_anon_vma = NULL, *to_anon_vma = NULL;
 
-	/* from_page lock down  */
 	if (!trylock_page(from_page)) {
 		if ((mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC)
 			goto out;
-
 		lock_page(from_page);
 	}
 
-	BUG_ON(PageWriteback(from_page));
-
-	/*
-	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-	 * we cannot notice that anon_vma is freed while we migrates a page.
-	 * This get_anon_vma() delays freeing anon_vma pointer until the end
-	 * of migration. File cache pages are no problem because of page_lock()
-	 * File Caches may use write_page() or lock_page() in migration, then,
-	 * just care Anon page here.
-	 *
-	 * Only page_get_anon_vma() understands the subtleties of
-	 * getting a hold on an anon_vma from outside one of its mms.
-	 * But if we cannot get anon_vma, then we won't need it anyway,
-	 * because that implies that the anon page is no longer mapped
-	 * (and cannot be remapped so long as we hold the page lock).
-	 */
-	if (PageAnon(from_page) && !PageKsm(from_page))
-		from_anon_vma = page_get_anon_vma(from_page);
-
-	/* to_page lock down  */
 	if (!trylock_page(to_page)) {
 		if ((mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC)
 			goto out_unlock;
@@ -404,7 +560,22 @@ static int unmap_and_exchange(struct page *from_page, struct page *to_page,
 		lock_page(to_page);
 	}
 
-	BUG_ON(PageWriteback(to_page));
+	/* from_page is supposed to be an anonymous page */
+	VM_BUG_ON_PAGE(PageWriteback(from_page), from_page);
+
+	if (PageWriteback(to_page)) {
+		/*
+		 * Only in the case of a full synchronous migration is it
+		 * necessary to wait for PageWriteback. In the async case,
+		 * the retry loop is too short and in the sync-light case,
+		 * the overhead of stalling is too much
+		 */
+		if ((mode & MIGRATE_MODE_MASK) != MIGRATE_SYNC) {
+			rc = -EBUSY;
+			goto out_unlock;
+		}
+		wait_on_page_writeback(to_page);
+	}
 
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
@@ -420,6 +591,9 @@ static int unmap_and_exchange(struct page *from_page, struct page *to_page,
 	 * because that implies that the anon page is no longer mapped
 	 * (and cannot be remapped so long as we hold the page lock).
 	 */
+	if (PageAnon(from_page) && !PageKsm(from_page))
+		from_anon_vma = page_get_anon_vma(from_page);
+
 	if (PageAnon(to_page) && !PageKsm(to_page))
 		to_anon_vma = page_get_anon_vma(to_page);
 
@@ -753,7 +927,7 @@ static int exchange_page_mapping_concur(struct list_head *unmapped_list_ptr,
 
 		/* actual page mapping exchange */
 		rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
-							to_page, from_page, mode, 0, 0);
+							to_page, from_page, NULL, NULL, mode, 0, 0);
 
 		if (rc) {
 			if (one_pair->from_page_was_mapped)
diff --git a/mm/internal.h b/mm/internal.h
index a039459..cf63bf6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -566,4 +566,13 @@ extern int exchange_page_mthread(struct page *to, struct page *from,
 extern int exchange_page_lists_mthread(struct page **to,
 						  struct page **from, 
 						  int nr_pages);
+
+extern int exchange_two_pages(struct page *page1, struct page *page2);
+
+bool buffer_migrate_lock_buffers(struct buffer_head *head,
+							enum migrate_mode mode);
+int writeout(struct address_space *mapping, struct page *page);
+int expected_page_refs(struct address_space *mapping, struct page *page);
+
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index ad02797..a0ca817 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -385,7 +385,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
 }
 #endif
 
-static int expected_page_refs(struct address_space *mapping, struct page *page)
+int expected_page_refs(struct address_space *mapping, struct page *page)
 {
 	int expected_count = 1;
 
@@ -732,7 +732,7 @@ EXPORT_SYMBOL(migrate_page);
 
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
-static bool buffer_migrate_lock_buffers(struct buffer_head *head,
+bool buffer_migrate_lock_buffers(struct buffer_head *head,
 							enum migrate_mode mode)
 {
 	struct buffer_head *bh = head;
@@ -880,7 +880,7 @@ int buffer_migrate_page_norefs(struct address_space *mapping,
 /*
  * Writeback a page to clean the dirty state
  */
-static int writeout(struct address_space *mapping, struct page *page)
+int writeout(struct address_space *mapping, struct page *page)
 {
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 16/25] exchange page: Add THP exchange support.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (14 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 15/25] exchange pages: exchange anonymous page and file-backed page Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 17/25] exchange page: Add exchange_page() syscall Zi Yan
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Enable exchanging THPs in the process. It also needs to take care of
exchanging PTE-mapped THPs, so the PageDoubleMap flag is now carried
across the exchange.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/exchange.h |  2 ++
 mm/exchange.c            | 73 +++++++++++++++++++++++++++++++++++++-----------
 mm/migrate.c             |  2 +-
 3 files changed, 60 insertions(+), 17 deletions(-)

diff --git a/include/linux/exchange.h b/include/linux/exchange.h
index 20d2184..8785d08 100644
--- a/include/linux/exchange.h
+++ b/include/linux/exchange.h
@@ -14,6 +14,8 @@ struct exchange_page_info {
 	int from_page_was_mapped;
 	int to_page_was_mapped;
 
+	pgoff_t from_index, to_index;
+
 	struct list_head list;
 };
 
diff --git a/mm/exchange.c b/mm/exchange.c
index 555a72c..45c7013 100644
--- a/mm/exchange.c
+++ b/mm/exchange.c
@@ -51,7 +51,8 @@ struct page_flags {
 	unsigned int page_swapcache:1;
 	unsigned int page_writeback:1;
 	unsigned int page_private:1;
-	unsigned int __pad:3;
+	unsigned int page_doublemap:1;
+	unsigned int __pad:2;
 };
 
 
@@ -127,20 +128,23 @@ static void exchange_huge_page(struct page *dst, struct page *src)
 static void exchange_page_flags(struct page *to_page, struct page *from_page)
 {
 	int from_cpupid, to_cpupid;
-	struct page_flags from_page_flags, to_page_flags;
+	struct page_flags from_page_flags = {0}, to_page_flags = {0};
 	struct mem_cgroup *to_memcg = page_memcg(to_page),
 					  *from_memcg = page_memcg(from_page);
 
 	from_cpupid = page_cpupid_xchg_last(from_page, -1);
 
-	from_page_flags.page_error = TestClearPageError(from_page);
+	from_page_flags.page_error = PageError(from_page);
+	if (from_page_flags.page_error)
+		ClearPageError(from_page);
 	from_page_flags.page_referenced = TestClearPageReferenced(from_page);
 	from_page_flags.page_uptodate = PageUptodate(from_page);
 	ClearPageUptodate(from_page);
 	from_page_flags.page_active = TestClearPageActive(from_page);
 	from_page_flags.page_unevictable = TestClearPageUnevictable(from_page);
 	from_page_flags.page_checked = PageChecked(from_page);
-	ClearPageChecked(from_page);
+	if (from_page_flags.page_checked)
+		ClearPageChecked(from_page);
 	from_page_flags.page_mappedtodisk = PageMappedToDisk(from_page);
 	ClearPageMappedToDisk(from_page);
 	from_page_flags.page_dirty = PageDirty(from_page);
@@ -150,18 +154,22 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 	clear_page_idle(from_page);
 	from_page_flags.page_swapcache = PageSwapCache(from_page);
 	from_page_flags.page_writeback = test_clear_page_writeback(from_page);
+	from_page_flags.page_doublemap = PageDoubleMap(from_page);
 
 
 	to_cpupid = page_cpupid_xchg_last(to_page, -1);
 
-	to_page_flags.page_error = TestClearPageError(to_page);
+	to_page_flags.page_error = PageError(to_page);
+	if (to_page_flags.page_error)
+		ClearPageError(to_page);
 	to_page_flags.page_referenced = TestClearPageReferenced(to_page);
 	to_page_flags.page_uptodate = PageUptodate(to_page);
 	ClearPageUptodate(to_page);
 	to_page_flags.page_active = TestClearPageActive(to_page);
 	to_page_flags.page_unevictable = TestClearPageUnevictable(to_page);
 	to_page_flags.page_checked = PageChecked(to_page);
-	ClearPageChecked(to_page);
+	if (to_page_flags.page_checked)
+		ClearPageChecked(to_page);
 	to_page_flags.page_mappedtodisk = PageMappedToDisk(to_page);
 	ClearPageMappedToDisk(to_page);
 	to_page_flags.page_dirty = PageDirty(to_page);
@@ -171,6 +179,7 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 	clear_page_idle(to_page);
 	to_page_flags.page_swapcache = PageSwapCache(to_page);
 	to_page_flags.page_writeback = test_clear_page_writeback(to_page);
+	to_page_flags.page_doublemap = PageDoubleMap(to_page);
 
 	/* set to_page */
 	if (from_page_flags.page_error)
@@ -197,6 +206,8 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 		set_page_young(to_page);
 	if (from_page_flags.page_is_idle)
 		set_page_idle(to_page);
+	if (from_page_flags.page_doublemap)
+		SetPageDoubleMap(to_page);
 
 	/* set from_page */
 	if (to_page_flags.page_error)
@@ -223,6 +234,8 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 		set_page_young(from_page);
 	if (to_page_flags.page_is_idle)
 		set_page_idle(from_page);
+	if (to_page_flags.page_doublemap)
+		SetPageDoubleMap(from_page);
 
 	/*
 	 * Copy NUMA information to the new page, to prevent over-eager
@@ -599,7 +612,6 @@ static int unmap_and_exchange(struct page *from_page, struct page *to_page,
 
 	from_index = from_page->index;
 	to_index = to_page->index;
-
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -673,8 +685,6 @@ static int unmap_and_exchange(struct page *from_page, struct page *to_page,
 			swap(from_page->index, from_index);
 	}
 
-
-
 out_unlock_both:
 	if (to_anon_vma)
 		put_anon_vma(to_anon_vma);
@@ -689,6 +699,23 @@ static int unmap_and_exchange(struct page *from_page, struct page *to_page,
 	return rc;
 }
 
+static bool can_be_exchanged(struct page *from, struct page *to)
+{
+	if (PageCompound(from) != PageCompound(to))
+		return false;
+
+	if (PageHuge(from) != PageHuge(to))
+		return false;
+
+	if (PageHuge(from) || PageHuge(to))
+		return false;
+
+	if (compound_order(from) != compound_order(to))
+		return false;
+
+	return true;
+}
+
 /*
  * Exchange pages in the exchange_list
  *
@@ -745,7 +772,8 @@ int exchange_pages(struct list_head *exchange_list,
 		}
 
 		/* TODO: compound page not supported */
-		if (PageCompound(from_page) || page_mapping(from_page)) {
+		if (!can_be_exchanged(from_page, to_page) ||
+		    page_mapping(from_page)) {
 			++failed;
 			goto putback;
 		}
@@ -784,6 +812,8 @@ static int unmap_pair_pages_concur(struct exchange_page_info *one_pair,
 	struct page *from_page = one_pair->from_page;
 	struct page *to_page = one_pair->to_page;
 
+	one_pair->from_index = from_page->index;
+	one_pair->to_index = to_page->index;
 	/* from_page lock down  */
 	if (!trylock_page(from_page)) {
 		if (!force || ((mode & MIGRATE_MODE_MASK) == MIGRATE_ASYNC))
@@ -903,7 +933,6 @@ static int exchange_page_mapping_concur(struct list_head *unmapped_list_ptr,
 					   struct list_head *exchange_list_ptr,
 						enum migrate_mode mode)
 {
-	int rc = -EBUSY;
 	int nr_failed = 0;
 	struct address_space *to_page_mapping, *from_page_mapping;
 	struct exchange_page_info *one_pair, *one_pair2;
@@ -911,6 +940,7 @@ static int exchange_page_mapping_concur(struct list_head *unmapped_list_ptr,
 	list_for_each_entry_safe(one_pair, one_pair2, unmapped_list_ptr, list) {
 		struct page *from_page = one_pair->from_page;
 		struct page *to_page = one_pair->to_page;
+		int rc = -EBUSY;
 
 		VM_BUG_ON_PAGE(!PageLocked(from_page), from_page);
 		VM_BUG_ON_PAGE(!PageLocked(to_page), to_page);
@@ -926,8 +956,9 @@ static int exchange_page_mapping_concur(struct list_head *unmapped_list_ptr,
 		BUG_ON(PageWriteback(to_page));
 
 		/* actual page mapping exchange */
-		rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
-							to_page, from_page, NULL, NULL, mode, 0, 0);
+		if (!page_mapped(from_page) && !page_mapped(to_page))
+			rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
+								to_page, from_page, NULL, NULL, mode, 0, 0);
 
 		if (rc) {
 			if (one_pair->from_page_was_mapped)
@@ -954,7 +985,7 @@ static int exchange_page_mapping_concur(struct list_head *unmapped_list_ptr,
 			one_pair->from_page = NULL;
 			one_pair->to_page = NULL;
 
-			list_move(&one_pair->list, exchange_list_ptr);
+			list_del(&one_pair->list);
 			++nr_failed;
 		}
 	}
@@ -1026,8 +1057,18 @@ static int remove_migration_ptes_concur(struct list_head *unmapped_list_ptr)
 	struct exchange_page_info *iterator;
 
 	list_for_each_entry(iterator, unmapped_list_ptr, list) {
-		remove_migration_ptes(iterator->from_page, iterator->to_page, false);
-		remove_migration_ptes(iterator->to_page, iterator->from_page, false);
+		struct page *from_page = iterator->from_page;
+		struct page *to_page = iterator->to_page;
+
+		swap(from_page->index, iterator->from_index);
+		if (iterator->from_page_was_mapped)
+			remove_migration_ptes(iterator->from_page, iterator->to_page, false);
+		swap(from_page->index, iterator->from_index);
+
+		swap(to_page->index, iterator->to_index);
+		if (iterator->to_page_was_mapped)
+			remove_migration_ptes(iterator->to_page, iterator->from_page, false);
+		swap(to_page->index, iterator->to_index);
 
 
 		if (iterator->from_anon_vma)
diff --git a/mm/migrate.c b/mm/migrate.c
index a0ca817..da7af68 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -229,7 +229,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (PageKsm(page))
 			new = page;
 		else
-			new = page - pvmw.page->index +
+			new = page - page->index +
 				linear_page_index(vma, pvmw.address);
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 17/25] exchange page: Add exchange_page() syscall.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (15 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 16/25] exchange page: Add THP exchange support Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 18/25] memcg: Add per node memory usage&max stats in memcg Zi Yan
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Users can use this syscall to exchange the pages backing two lists of
virtual addresses, similar to the move_pages() syscall.
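
A minimal userspace sketch of the intended usage, assuming a kernel built
with this series (335 is the x86_64 syscall number added below; pid 0 means
"the calling process", and MPOL_MF_MOVE is the standard uapi flag value):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

#define __NR_exchange_pages 335        /* x86_64 number from this patch */
#define MPOL_MF_MOVE        (1 << 1)   /* from uapi/linux/mempolicy.h */

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        /* two anonymous pages whose backing page frames will be swapped */
        void *a = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        void *b = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        void *from[1] = { a };
        void *to[1] = { b };
        int status[1] = { 0 };
        long ret;

        if (a == MAP_FAILED || b == MAP_FAILED)
                return 1;
        ret = syscall(__NR_exchange_pages, 0, 1UL, from, to, status,
                      MPOL_MF_MOVE);
        printf("exchange_pages: ret=%ld status[0]=%d\n", ret, status[0]);
        return ret ? 1 : 0;
}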

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/syscalls.h               |   5 +
 mm/exchange.c                          | 346 +++++++++++++++++++++++++++++++++
 3 files changed, 352 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 92ee0b4..863a21e 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+335	common	exchange_pages	__x64_sys_exchange_pages
 # don't use numbers 387 through 423, add new calls after the last
 # 'common' entry
 424	common	pidfd_send_signal	__x64_sys_pidfd_send_signal
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e446806..2c1eb49 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1203,6 +1203,11 @@ asmlinkage long sys_mmap_pgoff(unsigned long addr, unsigned long len,
 			unsigned long fd, unsigned long pgoff);
 asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);
 
+asmlinkage long sys_exchange_pages(pid_t pid, unsigned long nr_pages,
+				const void __user * __user *from_pages,
+				const void __user * __user *to_pages,
+				int __user *status,
+				int flags);
 
 /*
  * Not a real system call, but a placeholder for syscalls which are
diff --git a/mm/exchange.c b/mm/exchange.c
index 45c7013..48e344e 100644
--- a/mm/exchange.c
+++ b/mm/exchange.c
@@ -22,6 +22,7 @@
 #include <linux/buffer_head.h>
 #include <linux/fs.h> /* buffer_migrate_page  */
 #include <linux/backing-dev.h>
+#include <linux/sched/mm.h>
 
 
 #include "internal.h"
@@ -1212,3 +1213,348 @@ int exchange_pages_concur(struct list_head *exchange_list,
 
 	return nr_failed?-EFAULT:0;
 }
+
+static int store_status(int __user *status, int start, int value, int nr)
+{
+	while (nr-- > 0) {
+		if (put_user(value, status + start))
+			return -EFAULT;
+		start++;
+	}
+
+	return 0;
+}
+
+static int do_exchange_page_list(struct mm_struct *mm,
+		struct list_head *from_pagelist, struct list_head *to_pagelist,
+		bool migrate_mt, bool migrate_concur)
+{
+	int err;
+	struct exchange_page_info *one_pair;
+	LIST_HEAD(exchange_page_list);
+
+	while (!list_empty(from_pagelist)) {
+		struct page *from_page, *to_page;
+
+		from_page = list_first_entry_or_null(from_pagelist, struct page, lru);
+		to_page = list_first_entry_or_null(to_pagelist, struct page, lru);
+
+		if (!from_page || !to_page)
+			break;
+
+		one_pair = kzalloc(sizeof(struct exchange_page_info), GFP_ATOMIC);
+		if (!one_pair) {
+			err = -ENOMEM;
+			break;
+		}
+
+		list_del(&from_page->lru);
+		list_del(&to_page->lru);
+
+		one_pair->from_page = from_page;
+		one_pair->to_page = to_page;
+
+		list_add_tail(&one_pair->list, &exchange_page_list);
+	}
+
+	if (migrate_concur)
+		err = exchange_pages_concur(&exchange_page_list,
+			MIGRATE_SYNC | (migrate_mt ? MIGRATE_MT : MIGRATE_SINGLETHREAD),
+			MR_SYSCALL);
+	else
+		err = exchange_pages(&exchange_page_list,
+			MIGRATE_SYNC | (migrate_mt ? MIGRATE_MT : MIGRATE_SINGLETHREAD),
+			MR_SYSCALL);
+
+	while (!list_empty(&exchange_page_list)) {
+		struct exchange_page_info *one_pair =
+			list_first_entry(&exchange_page_list,
+							 struct exchange_page_info, list);
+
+		list_del(&one_pair->list);
+		kfree(one_pair);
+	}
+
+	if (!list_empty(from_pagelist))
+		putback_movable_pages(from_pagelist);
+
+	if (!list_empty(to_pagelist))
+		putback_movable_pages(to_pagelist);
+
+	return err;
+}
+
+static int add_page_for_exchange(struct mm_struct *mm,
+		unsigned long from_addr, unsigned long to_addr,
+		struct list_head *from_pagelist, struct list_head *to_pagelist,
+		bool migrate_all)
+{
+	struct vm_area_struct *from_vma, *to_vma;
+	struct page *from_page, *to_page;
+	LIST_HEAD(err_page_list);
+	unsigned int follflags;
+	int err;
+
+	err = -EFAULT;
+	from_vma = find_vma(mm, from_addr);
+	if (!from_vma || from_addr < from_vma->vm_start ||
+		!vma_migratable(from_vma))
+		goto set_from_status;
+
+	/* FOLL_DUMP to ignore special (like zero) pages */
+	follflags = FOLL_GET | FOLL_DUMP;
+	from_page = follow_page(from_vma, from_addr, follflags);
+
+	err = PTR_ERR(from_page);
+	if (IS_ERR(from_page))
+		goto set_from_status;
+
+	err = -ENOENT;
+	if (!from_page)
+		goto set_from_status;
+
+	err = -EACCES;
+	if (page_mapcount(from_page) > 1 && !migrate_all)
+		goto put_and_set_from_page;
+
+	if (PageHuge(from_page)) {
+		if (PageHead(from_page))
+			if (isolate_huge_page(from_page, &err_page_list)) {
+				err = 0;
+			}
+		goto put_and_set_from_page;
+	} else if (PageTransCompound(from_page)) {
+		if (PageTail(from_page)) {
+			err = -EACCES;
+			goto put_and_set_from_page;
+		}
+	}
+
+	err = isolate_lru_page(from_page);
+	if (!err)
+		mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON +
+					page_is_file_cache(from_page), hpage_nr_pages(from_page));
+put_and_set_from_page:
+	/*
+	 * Either remove the duplicate refcount from
+	 * isolate_lru_page() or drop the page ref if it was
+	 * not isolated.
+	 *
+	 * Since FOLL_GET calls get_page(), and isolate_lru_page()
+	 * also calls get_page()
+	 */
+	put_page(from_page);
+set_from_status:
+	if (err)
+		goto out;
+
+	/* to pages  */
+	err = -EFAULT;
+	to_vma = find_vma(mm, to_addr);
+	if (!to_vma ||
+		to_addr < to_vma->vm_start ||
+		!vma_migratable(to_vma))
+		goto set_to_status;
+
+	/* FOLL_DUMP to ignore special (like zero) pages */
+	to_page = follow_page(to_vma, to_addr, follflags);
+
+	err = PTR_ERR(to_page);
+	if (IS_ERR(to_page))
+		goto set_to_status;
+
+	err = -ENOENT;
+	if (!to_page)
+		goto set_to_status;
+
+	err = -EACCES;
+	if (page_mapcount(to_page) > 1 &&
+			!migrate_all)
+		goto put_and_set_to_page;
+
+	if (PageHuge(to_page)) {
+		if (PageHead(to_page))
+			if (isolate_huge_page(to_page, &err_page_list)) {
+				err = 0;
+			}
+		goto put_and_set_to_page;
+	} else if (PageTransCompound(to_page)) {
+		if (PageTail(to_page)) {
+			err = -EACCES;
+			goto put_and_set_to_page;
+		}
+	}
+
+	err = isolate_lru_page(to_page);
+	if (!err)
+		mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON +
+					page_is_file_cache(to_page), hpage_nr_pages(to_page));
+put_and_set_to_page:
+	/*
+	 * Either remove the duplicate refcount from
+	 * isolate_lru_page() or drop the page ref if it was
+	 * not isolated.
+	 *
+	 * Since FOLL_GET calls get_page(), and isolate_lru_page()
+	 * also calls get_page()
+	 */
+	put_page(to_page);
+set_to_status:
+	if (!err) {
+		if ((PageHuge(from_page) != PageHuge(to_page)) ||
+			(PageTransHuge(from_page) != PageTransHuge(to_page))) {
+			list_add(&from_page->lru, &err_page_list);
+			list_add(&to_page->lru, &err_page_list);
+		} else {
+			list_add_tail(&from_page->lru, from_pagelist);
+			list_add_tail(&to_page->lru, to_pagelist);
+		}
+	} else
+		list_add(&from_page->lru, &err_page_list);
+out:
+	if (!list_empty(&err_page_list))
+		putback_movable_pages(&err_page_list);
+	return err;
+}
+/*
+ * Migrate an array of page address onto an array of nodes and fill
+ * the corresponding array of status.
+ */
+static int do_pages_exchange(struct mm_struct *mm, nodemask_t task_nodes,
+			 unsigned long nr_pages,
+			 const void __user * __user *from_pages,
+			 const void __user * __user *to_pages,
+			 int __user *status, int flags)
+{
+	LIST_HEAD(from_pagelist);
+	LIST_HEAD(to_pagelist);
+	int start, i;
+	int err = 0, err1;
+
+	migrate_prep();
+
+	down_read(&mm->mmap_sem);
+	for (i = start = 0; i < nr_pages; i++) {
+		const void __user *from_p, *to_p;
+		unsigned long from_addr, to_addr;
+
+		err = -EFAULT;
+		if (get_user(from_p, from_pages + i))
+			goto out_flush;
+		if (get_user(to_p, to_pages + i))
+			goto out_flush;
+
+		from_addr = (unsigned long)from_p;
+		to_addr = (unsigned long)to_p;
+
+		err = -EACCES;
+		/*
+		 * Errors in the page lookup or isolation are not fatal and we simply
+		 * report them via status
+		 */
+		err = add_page_for_exchange(mm, from_addr, to_addr,
+				&from_pagelist, &to_pagelist,
+				flags & MPOL_MF_MOVE_ALL);
+
+		if (!err)
+			continue;
+
+		err = store_status(status, i, err, 1);
+		if (err)
+			goto out_flush;
+
+		err = do_exchange_page_list(mm, &from_pagelist, &to_pagelist,
+				flags & MPOL_MF_MOVE_MT,
+				flags & MPOL_MF_MOVE_CONCUR);
+		if (err)
+			goto out;
+		if (i > start) {
+			err = store_status(status, start, 0, i - start);
+			if (err)
+				goto out;
+		}
+		start = i;
+	}
+out_flush:
+	/* Make sure we do not overwrite the existing error */
+	err1 = do_exchange_page_list(mm, &from_pagelist, &to_pagelist,
+				flags & MPOL_MF_MOVE_MT,
+				flags & MPOL_MF_MOVE_CONCUR);
+	if (!err1)
+		err1 = store_status(status, start, 0, i - start);
+	if (!err)
+		err = err1;
+out:
+	up_read(&mm->mmap_sem);
+	return err;
+}
+
+SYSCALL_DEFINE6(exchange_pages, pid_t, pid, unsigned long, nr_pages,
+		const void __user * __user *, from_pages,
+		const void __user * __user *, to_pages,
+		int __user *, status, int, flags)
+{
+	const struct cred *cred = current_cred(), *tcred;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int err;
+	nodemask_t task_nodes;
+
+	/* Check flags */
+	if (flags & ~(MPOL_MF_MOVE|
+				  MPOL_MF_MOVE_ALL|
+				  MPOL_MF_MOVE_MT|
+				  MPOL_MF_MOVE_CONCUR))
+		return -EINVAL;
+
+	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
+		return -EPERM;
+
+	/* Find the mm_struct */
+	rcu_read_lock();
+	task = pid ? find_task_by_vpid(pid) : current;
+	if (!task) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+	get_task_struct(task);
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. The right exists if the process has administrative
+	 * capabilities, superuser privileges or the same
+	 * userid as the target process.
+	 */
+	tcred = __task_cred(task);
+	if (!uid_eq(cred->euid, tcred->suid) && !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->uid,  tcred->suid) && !uid_eq(cred->uid,  tcred->uid) &&
+	    !capable(CAP_SYS_NICE)) {
+		rcu_read_unlock();
+		err = -EPERM;
+		goto out;
+	}
+	rcu_read_unlock();
+
+	err = security_task_movememory(task);
+	if (err)
+		goto out;
+
+	task_nodes = cpuset_mems_allowed(task);
+	mm = get_task_mm(task);
+	put_task_struct(task);
+
+	if (!mm)
+		return -EINVAL;
+
+	err = do_pages_exchange(mm, task_nodes, nr_pages, from_pages,
+				    to_pages, status, flags);
+
+	mmput(mm);
+
+	return err;
+
+out:
+	put_task_struct(task);
+
+	return err;
+}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 18/25] memcg: Add per node memory usage&max stats in memcg.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (16 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 17/25] exchange page: Add exchange_page() syscall Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 19/25] mempolicy: add MPOL_F_MEMCG flag, enforcing memcg memory limit Zi Yan
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

This prepares for the following patches to enable memcg-based NUMA
node page migration: memory usage in each node will be limited on a
per-memcg basis.
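
A rough sketch of how the new per-node files could be used from userspace.
The file names (size_at_node:<nid>, max_at_node:<nid>) come from this patch;
the cgroup path /sys/fs/cgroup/app and node id 1 are assumptions made only
for illustration:

#include <stdio.h>

int main(void)
{
        /* Cap the cgroup's usage on node 1 at 4 GiB.  The value is in
         * bytes (parsed by page_counter_memparse()); writing "max"
         * removes the limit again. */
        FILE *f = fopen("/sys/fs/cgroup/app/max_at_node:1", "w");
        unsigned long long bytes = 0;

        if (!f)
                return 1;
        fprintf(f, "%llu", 4ULL << 30);
        fclose(f);

        /* Read back the current usage on node 1, reported in bytes. */
        f = fopen("/sys/fs/cgroup/app/size_at_node:1", "r");
        if (f && fscanf(f, "%llu", &bytes) == 1)
                printf("node 1 usage: %llu bytes\n", bytes);
        if (f)
                fclose(f);
        return 0;
}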

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/cgroup-defs.h |  1 +
 include/linux/memcontrol.h  | 67 +++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c             | 80 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 148 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1c70803..7e87f5e 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -531,6 +531,7 @@ struct cftype {
 	struct cgroup_subsys *ss;	/* NULL for cgroup core files */
 	struct list_head node;		/* anchored at ss->cfts */
 	struct kernfs_ops *kf_ops;
+	int numa_node_id;
 
 	int (*open)(struct kernfs_open_file *of);
 	void (*release)(struct kernfs_open_file *of);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f3d880..3e40321 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -130,6 +130,7 @@ struct mem_cgroup_per_node {
 	atomic_long_t		lruvec_stat[NR_VM_NODE_STAT_ITEMS];
 
 	unsigned long		lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
+	unsigned long		max_nr_base_pages;
 
 	struct mem_cgroup_reclaim_iter	iter[DEF_PRIORITY + 1];
 
@@ -797,6 +798,51 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
 
+static inline unsigned long lruvec_size_memcg_node(enum lru_list lru,
+	struct mem_cgroup *memcg, int nid)
+{
+	if (nid == MAX_NUMNODES)
+		return 0;
+
+	VM_BUG_ON(lru < 0 || lru >= NR_LRU_LISTS);
+	return mem_cgroup_node_nr_lru_pages(memcg, nid, BIT(lru));
+}
+
+static inline unsigned long active_inactive_size_memcg_node(struct mem_cgroup *memcg, int nid, bool active)
+{
+	unsigned long val = 0;
+	enum lru_list lru;
+
+	for_each_evictable_lru(lru) {
+		if ((active  && is_active_lru(lru)) ||
+			(!active && !is_active_lru(lru)))
+			val += mem_cgroup_node_nr_lru_pages(memcg, nid, BIT(lru));
+	}
+
+	return val;
+}
+
+static inline unsigned long memcg_size_node(struct mem_cgroup *memcg, int nid)
+{
+	unsigned long val = 0;
+	int i;
+
+	if (nid == MAX_NUMNODES)
+		return val;
+
+	for (i = 0; i < NR_LRU_LISTS; i++)
+		val += mem_cgroup_node_nr_lru_pages(memcg, nid, BIT(i));
+
+	return val;
+}
+
+static inline unsigned long memcg_max_size_node(struct mem_cgroup *memcg, int nid)
+{
+	if (nid == MAX_NUMNODES)
+		return 0;
+	return memcg->nodeinfo[nid]->max_nr_base_pages;
+}
+
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
@@ -1123,6 +1169,27 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline unsigned long lruvec_size_memcg_node(enum lru_list lru,
+	struct mem_cgroup *memcg, int nid)
+{
+	return 0;
+}
+
+static inline unsigned long active_inactive_size_memcg_node(struct mem_cgroup *memcg, int nid, bool active)
+{
+	return 0;
+}
+
+static inline unsigned long memcg_size_node(struct mem_cgroup *memcg, int nid)
+{
+	return 0;
+}
+static inline unsigned long memcg_max_size_node(struct mem_cgroup *memcg, int nid)
+{
+	return 0;
+}
+
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 532e0e2..478d216 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4394,6 +4394,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	pn->usage_in_excess = 0;
 	pn->on_tree = false;
 	pn->memcg = memcg;
+	pn->max_nr_base_pages = PAGE_COUNTER_MAX;
 
 	memcg->nodeinfo[node] = pn;
 	return 0;
@@ -6700,4 +6701,83 @@ static int __init mem_cgroup_swap_init(void)
 }
 subsys_initcall(mem_cgroup_swap_init);
 
+static int memory_per_node_stat_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	struct cftype *cur_file = seq_cft(m);
+	int nid = cur_file->numa_node_id;
+	unsigned long val = 0;
+	int i;
+
+	for (i = 0; i < NR_LRU_LISTS; i++)
+		val += mem_cgroup_node_nr_lru_pages(memcg, nid, BIT(i));
+
+	seq_printf(m, "%llu\n", (u64)val * PAGE_SIZE);
+
+	return 0;
+}
+
+static int memory_per_node_max_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	struct cftype *cur_file = seq_cft(m);
+	int nid = cur_file->numa_node_id;
+	unsigned long max = READ_ONCE(memcg->nodeinfo[nid]->max_nr_base_pages);
+
+	if (max == PAGE_COUNTER_MAX)
+		seq_puts(m, "max\n");
+	else
+		seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memory_per_node_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	struct cftype *cur_file = of_cft(of);
+	int nid = cur_file->numa_node_id;
+	unsigned long max;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &max);
+	if (err)
+		return err;
+
+	xchg(&memcg->nodeinfo[nid]->max_nr_base_pages, max);
+
+	return nbytes;
+}
+
+static struct cftype memcg_per_node_stats_files[N_MEMORY];
+static struct cftype memcg_per_node_max_files[N_MEMORY];
+
+static int __init mem_cgroup_per_node_init(void)
+{
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY) {
+		snprintf(memcg_per_node_stats_files[nid].name, MAX_CFTYPE_NAME,
+				"size_at_node:%d", nid);
+		memcg_per_node_stats_files[nid].flags = CFTYPE_NOT_ON_ROOT;
+		memcg_per_node_stats_files[nid].seq_show = memory_per_node_stat_show;
+		memcg_per_node_stats_files[nid].numa_node_id = nid;
+
+		snprintf(memcg_per_node_max_files[nid].name, MAX_CFTYPE_NAME,
+				"max_at_node:%d", nid);
+		memcg_per_node_max_files[nid].flags = CFTYPE_NOT_ON_ROOT;
+		memcg_per_node_max_files[nid].seq_show = memory_per_node_max_show;
+		memcg_per_node_max_files[nid].write = memory_per_node_max_write;
+		memcg_per_node_max_files[nid].numa_node_id = nid;
+	}
+	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys,
+				memcg_per_node_stats_files));
+	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys,
+				memcg_per_node_max_files));
+	return 0;
+}
+subsys_initcall(mem_cgroup_per_node_init);
+
 #endif /* CONFIG_MEMCG_SWAP */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 19/25] mempolicy: add MPOL_F_MEMCG flag, enforcing memcg memory limit.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (17 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 18/25] memcg: Add per node memory usage&max stats in memcg Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 20/25] memory manage: Add memory manage syscall Zi Yan
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

With MPOL_F_MEMCG set and MPOL_PREFERRED used, the memory limit set in
the corresponding memcg is enforced: if an allocation would push the
preferred node past the memcg's per-node limit, the kernel falls back
to the next memory node.
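
A sketch of how a task might opt in, assuming a kernel with this series and
libnuma's set_mempolicy() wrapper (link with -lnuma); MPOL_F_MEMCG is defined
locally because this patch introduces it and it is not in distribution
headers:

#include <numaif.h>
#include <stdio.h>

#define MPOL_F_MEMCG    (1 << 13)       /* added by this patch */

int main(void)
{
        /* Prefer node 0, but let the kernel fall back to the next memory
         * node whenever this task's memcg limit on node 0 would be
         * exceeded (see the new branch in alloc_pages_vma() below). */
        unsigned long nodemask = 1UL << 0;

        if (set_mempolicy(MPOL_PREFERRED | MPOL_F_MEMCG, &nodemask,
                          sizeof(nodemask) * 8)) {
                perror("set_mempolicy");
                return 1;
        }
        /* ... allocate and touch memory as usual ... */
        return 0;
}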

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/uapi/linux/mempolicy.h |  3 ++-
 mm/mempolicy.c                 | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index eb6560e..a9d03e5 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -28,12 +28,13 @@ enum {
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
+#define MPOL_F_MEMCG		(1 << 13)
 
 /*
  * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
  * either set_mempolicy() or mbind().
  */
-#define MPOL_MODE_FLAGS	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
+#define MPOL_MODE_FLAGS	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_MEMCG)
 
 /* Flags for get_mempolicy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index af171cc..0e30049 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2040,6 +2040,42 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	if (pol->mode == MPOL_PREFERRED && (pol->flags & MPOL_F_MEMCG)) {
+		struct task_struct *p = current;
+		struct mem_cgroup *memcg = mem_cgroup_from_task(p);
+		int nid = pol->v.preferred_node;
+		unsigned long nr_memcg_node_size;
+		struct mm_struct *mm = get_task_mm(p);
+		unsigned long nr_pages = hugepage?HPAGE_PMD_NR:1;
+
+		if (!(memcg && mm)) {
+			if (mm)
+				mmput(mm);
+			goto use_other_policy;
+		}
+
+		/* skip preferred node if mm_manage is going on */
+		if (test_bit(MMF_MM_MANAGE, &mm->flags)) {
+			nid = next_memory_node(nid);
+			if (nid == MAX_NUMNODES)
+				nid = first_memory_node;
+		}
+		mmput(mm);
+
+		nr_memcg_node_size = memcg_max_size_node(memcg, nid);
+
+		while (nr_memcg_node_size != ULONG_MAX &&
+			   nr_memcg_node_size <= (memcg_size_node(memcg, nid) + nr_pages)) {
+			if ((nid = next_memory_node(nid)) == MAX_NUMNODES)
+				nid = first_memory_node;
+			nr_memcg_node_size = memcg_max_size_node(memcg, nid);
+		}
+
+		mpol_cond_put(pol);
+		page = __alloc_pages_node(nid, gfp | __GFP_THISNODE, order);
+		goto out;
+	}
+use_other_policy:
 	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
 		int hpage_node = node;
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 20/25] memory manage: Add memory manage syscall.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (18 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 19/25] mempolicy: add MPOL_F_MEMCG flag, enforcing memcg memory limit Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 21/25] mm: move update_lru_sizes() to mm_inline.h for broader use Zi Yan
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

This prepares for the following patches to provide a user API to
manipulate pages in two memory nodes with the help of memcg.

missing memcg_max_size_node()
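
For illustration, the raw syscall might be wrapped as below; this is only a
sketch that assumes the patched kernel (336 is the x86_64 number added here),
and nodes 0/1 merely stand in for a DRAM/PMEM pair:

#include <sys/syscall.h>
#include <unistd.h>

#define __NR_mm_manage  336     /* x86_64 number from this patch */

/* slow_nodes/fast_nodes are bitmaps of NUMA node ids, maxnode bits long */
static long mm_manage(pid_t pid, unsigned long nr_pages,
                      unsigned long maxnode,
                      const unsigned long *slow_nodes,
                      const unsigned long *fast_nodes,
                      int flags)
{
        return syscall(__NR_mm_manage, pid, nr_pages, maxnode,
                       slow_nodes, fast_nodes, flags);
}

int main(void)
{
        unsigned long slow = 1UL << 1;  /* node 1: slow memory */
        unsigned long fast = 1UL << 0;  /* node 0: fast memory */

        /* Operate on the calling process; the MPOL_MF_* flags that select
         * the actual work are wired up by the following patches. */
        return mm_manage(0, 4096, 64, &slow, &fast, 0) ? 1 : 0;
}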

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/sched/coredump.h         |   1 +
 include/linux/syscalls.h               |   5 ++
 include/uapi/linux/mempolicy.h         |   1 +
 mm/Makefile                            |   1 +
 mm/internal.h                          |   2 +
 mm/memory_manage.c                     | 109 +++++++++++++++++++++++++++++++++
 mm/mempolicy.c                         |   2 +-
 8 files changed, 121 insertions(+), 1 deletion(-)
 create mode 100644 mm/memory_manage.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 863a21e..fa8def3 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -344,6 +344,7 @@
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
 335	common	exchange_pages	__x64_sys_exchange_pages
+336	common	mm_manage		__x64_sys_mm_manage
 # don't use numbers 387 through 423, add new calls after the last
 # 'common' entry
 424	common	pidfd_send_signal	__x64_sys_pidfd_send_signal
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index ecdc654..9aa9d94b 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -73,6 +73,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
 #define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
+#define MMF_MM_MANAGE		27
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2c1eb49..47d56c5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1208,6 +1208,11 @@ asmlinkage long sys_exchange_pages(pid_t pid, unsigned long nr_pages,
 				const void __user * __user *to_pages,
 				int __user *status,
 				int flags);
+asmlinkage long sys_mm_manage(pid_t pid, unsigned long nr_pages,
+				unsigned long maxnode,
+				const unsigned long __user *old_nodes,
+				const unsigned long __user *new_nodes,
+				int flags);
 
 /*
  * Not a real system call, but a placeholder for syscalls which are
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a9d03e5..4722bb7 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -52,6 +52,7 @@ enum {
 #define MPOL_MF_MOVE_DMA (1<<5)	/* Use DMA page copy routine */
 #define MPOL_MF_MOVE_MT  (1<<6)	/* Use multi-threaded page copy routine */
 #define MPOL_MF_MOVE_CONCUR  (1<<7)	/* Move pages in a batch */
+#define MPOL_MF_EXCHANGE	(1<<8)	/* Exchange pages */
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
diff --git a/mm/Makefile b/mm/Makefile
index 2f1f1ad..5302d79 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,6 +47,7 @@ obj-y += memblock.o
 obj-y += copy_page.o
 obj-y += exchange.o
 obj-y += exchange_page.o
+obj-y += memory_manage.o
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
diff --git a/mm/internal.h b/mm/internal.h
index cf63bf6..94feb14 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -574,5 +574,7 @@ bool buffer_migrate_lock_buffers(struct buffer_head *head,
 int writeout(struct address_space *mapping, struct page *page);
 int expected_page_refs(struct address_space *mapping, struct page *page);
 
+int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
+		     unsigned long maxnode);
 
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory_manage.c b/mm/memory_manage.c
new file mode 100644
index 0000000..b8f3654
--- /dev/null
+++ b/mm/memory_manage.c
@@ -0,0 +1,109 @@
+/*
+ * A syscall used to move pages between two nodes.
+ */
+
+#include <linux/sched/mm.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/nodemask.h>
+#include <linux/security.h>
+#include <linux/syscalls.h>
+
+#include "internal.h"
+
+
+SYSCALL_DEFINE6(mm_manage, pid_t, pid, unsigned long, nr_pages,
+		unsigned long, maxnode,
+		const unsigned long __user *, slow_nodes,
+		const unsigned long __user *, fast_nodes,
+		int, flags)
+{
+	const struct cred *cred = current_cred(), *tcred;
+	struct task_struct *task;
+	struct mm_struct *mm = NULL;
+	int err;
+	nodemask_t task_nodes;
+	nodemask_t *slow;
+	nodemask_t *fast;
+	NODEMASK_SCRATCH(scratch);
+
+	if (!scratch)
+		return -ENOMEM;
+
+	slow = &scratch->mask1;
+	fast = &scratch->mask2;
+
+	err = get_nodes(slow, slow_nodes, maxnode);
+	if (err)
+		goto out;
+
+	err = get_nodes(fast, fast_nodes, maxnode);
+	if (err)
+		goto out;
+
+	/* Check flags */
+	if (flags & ~(MPOL_MF_MOVE_MT|
+				  MPOL_MF_MOVE_DMA|
+				  MPOL_MF_MOVE_CONCUR|
+				  MPOL_MF_EXCHANGE))
+		return -EINVAL;
+
+	/* Find the mm_struct */
+	rcu_read_lock();
+	task = pid ? find_task_by_vpid(pid) : current;
+	if (!task) {
+		rcu_read_unlock();
+		err = -ESRCH;
+		goto out;
+	}
+	get_task_struct(task);
+
+	err = -EINVAL;
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. The right exists if the process has administrative
+	 * capabilities, superuser privileges or the same
+	 * userid as the target process.
+	 */
+	tcred = __task_cred(task);
+	if (!uid_eq(cred->euid, tcred->suid) && !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->uid,  tcred->suid) && !uid_eq(cred->uid,  tcred->uid) &&
+	    !capable(CAP_SYS_NICE)) {
+		rcu_read_unlock();
+		err = -EPERM;
+		goto out_put;
+	}
+	rcu_read_unlock();
+
+	err = security_task_movememory(task);
+	if (err)
+		goto out_put;
+
+	task_nodes = cpuset_mems_allowed(task);
+	mm = get_task_mm(task);
+	put_task_struct(task);
+
+	if (!mm) {
+		err = -EINVAL;
+		goto out;
+	}
+	if (test_bit(MMF_MM_MANAGE, &mm->flags)) {
+		mmput(mm);
+		goto out;
+	} else {
+		set_bit(MMF_MM_MANAGE, &mm->flags);
+	}
+
+
+	clear_bit(MMF_MM_MANAGE, &mm->flags);
+	mmput(mm);
+out:
+	NODEMASK_SCRATCH_FREE(scratch);
+
+	return err;
+
+out_put:
+	put_task_struct(task);
+	goto out;
+
+}
\ No newline at end of file
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e30049..168d17f8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1249,7 +1249,7 @@ static long do_mbind(unsigned long start, unsigned long len,
  */
 
 /* Copy a node mask from user space. */
-static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
+int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
 		     unsigned long maxnode)
 {
 	unsigned long k;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 21/25] mm: move update_lru_sizes() to mm_inline.h for broader use.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (19 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 20/25] memory manage: Add memory manage syscall Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 22/25] memory manage: active/inactive page list manipulation in memcg Zi Yan
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/mm_inline.h | 21 +++++++++++++++++++++
 mm/vmscan.c               | 25 ++-----------------------
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 04ec454..b9fbd0b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -44,6 +44,27 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
 #endif
 }
 
+/*
+ * Update LRU sizes after isolating pages. The LRU size updates must
+ * be complete before mem_cgroup_update_lru_size due to a sanity check.
+ */
+static __always_inline void update_lru_sizes(struct lruvec *lruvec,
+			enum lru_list lru, unsigned long *nr_zone_taken)
+{
+	int zid;
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		if (!nr_zone_taken[zid])
+			continue;
+
+		__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
+#ifdef CONFIG_MEMCG
+		mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
+#endif
+	}
+
+}
+
 static __always_inline void add_page_to_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5ad0b3..1d539d6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1593,27 +1593,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 }
 
 
-/*
- * Update LRU sizes after isolating pages. The LRU size updates must
- * be complete before mem_cgroup_update_lru_size due to a santity check.
- */
-static __always_inline void update_lru_sizes(struct lruvec *lruvec,
-			enum lru_list lru, unsigned long *nr_zone_taken)
-{
-	int zid;
-
-	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-		if (!nr_zone_taken[zid])
-			continue;
-
-		__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
-#ifdef CONFIG_MEMCG
-		mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
-#endif
-	}
-
-}
-
 /**
  * pgdat->lru_lock is heavily contended.  Some of the functions that
  * shrink the lists perform better by taking out a batch of pages
@@ -1804,7 +1783,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	return isolated > inactive;
 }
 
-static noinline_for_stack void
+noinline_for_stack void
 putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
@@ -2003,7 +1982,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
  * Returns the number of pages moved to the given lru.
  */
 
-static unsigned move_active_pages_to_lru(struct lruvec *lruvec,
+unsigned move_active_pages_to_lru(struct lruvec *lruvec,
 				     struct list_head *list,
 				     struct list_head *pages_to_free,
 				     enum lru_list lru)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 22/25] memory manage: active/inactive page list manipulation in memcg.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (20 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 21/25] mm: move update_lru_sizes() to mm_inline.h for broader use Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 23/25] memory manage: page migration based page manipulation between NUMA nodes Zi Yan
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

The syscall allows users to trigger page list scanning that actively
moves pages between the active and inactive lists according to page
references. This is limited to the memcg the process belongs to and
does not impact the global LRU lists, i.e., the root memcg.
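
A possible invocation, sketched under the assumption of the patched kernel;
the syscall number and the MPOL_MF_SHRINK_LISTS value simply mirror the
definitions added in this series:

#include <sys/syscall.h>
#include <unistd.h>

#define __NR_mm_manage          336             /* from the mm_manage patch */
#define MPOL_MF_SHRINK_LISTS    (1 << 9)        /* added by this patch */

int main(void)
{
        unsigned long slow = 1UL << 1;  /* single slow node */
        unsigned long fast = 1UL << 0;  /* single fast node */

        /* Scan the calling process's memcg LRU lists on both nodes and
         * re-sort pages between the active and inactive lists without
         * reclaiming anything. */
        return syscall(__NR_mm_manage, 0, 65536UL, 64UL, &slow, &fast,
                       MPOL_MF_SHRINK_LISTS) ? 1 : 0;
}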

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/uapi/linux/mempolicy.h |  1 +
 mm/internal.h                  | 93 +++++++++++++++++++++++++++++++++++++++++-
 mm/memory_manage.c             | 76 +++++++++++++++++++++++++++++++++-
 mm/vmscan.c                    | 90 ++++++++--------------------------------
 4 files changed, 184 insertions(+), 76 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 4722bb7..dac474a 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -53,6 +53,7 @@ enum {
 #define MPOL_MF_MOVE_MT  (1<<6)	/* Use multi-threaded page copy routine */
 #define MPOL_MF_MOVE_CONCUR  (1<<7)	/* Move pages in a batch */
 #define MPOL_MF_EXCHANGE	(1<<8)	/* Exchange pages */
+#define MPOL_MF_SHRINK_LISTS	(1<<9)	/* Shrink page lists */
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
diff --git a/mm/internal.h b/mm/internal.h
index 94feb14..eec88de 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -564,7 +564,7 @@ extern int copy_page_lists_mt(struct page **to,
 extern int exchange_page_mthread(struct page *to, struct page *from,
 			int nr_pages);
 extern int exchange_page_lists_mthread(struct page **to,
-						  struct page **from, 
+						  struct page **from,
 						  int nr_pages);
 
 extern int exchange_two_pages(struct page *page1, struct page *page2);
@@ -577,4 +577,95 @@ int expected_page_refs(struct address_space *mapping, struct page *page);
 int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
 		     unsigned long maxnode);
 
+unsigned move_active_pages_to_lru(struct lruvec *lruvec,
+				     struct list_head *list,
+				     struct list_head *pages_to_free,
+				     enum lru_list lru);
+void putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list);
+
+struct scan_control {
+	/* How many pages shrink_list() should reclaim */
+	unsigned long nr_to_reclaim;
+
+	/*
+	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
+	 * are scanned.
+	 */
+	nodemask_t	*nodemask;
+
+	/*
+	 * The memory cgroup that hit its limit and as a result is the
+	 * primary target of this reclaim invocation.
+	 */
+	struct mem_cgroup *target_mem_cgroup;
+
+	/* Writepage batching in laptop mode; RECLAIM_WRITE */
+	unsigned int may_writepage:1;
+
+	/* Can mapped pages be reclaimed? */
+	unsigned int may_unmap:1;
+
+	/* Can pages be swapped as part of reclaim? */
+	unsigned int may_swap:1;
+
+	/* e.g. boosted watermark reclaim leaves slabs alone */
+	unsigned int may_shrinkslab:1;
+
+	/*
+	 * Cgroups are not reclaimed below their configured memory.low,
+	 * unless we threaten to OOM. If any cgroups are skipped due to
+	 * memory.low and nothing was reclaimed, go back for memory.low.
+	 */
+	unsigned int memcg_low_reclaim:1;
+	unsigned int memcg_low_skipped:1;
+
+	unsigned int hibernation_mode:1;
+
+	/* One of the zones is ready for compaction */
+	unsigned int compaction_ready:1;
+
+	unsigned int isolate_only_huge_page:1;
+	unsigned int isolate_only_base_page:1;
+	unsigned int no_reclaim:1;
+
+	/* Allocation order */
+	s8 order;
+
+	/* Scan (total_size >> priority) pages at once */
+	s8 priority;
+
+	/* The highest zone to isolate pages for reclaim from */
+	s8 reclaim_idx;
+
+	/* This context's GFP mask */
+	gfp_t gfp_mask;
+
+	/* Incremented by the number of inactive pages that were scanned */
+	unsigned long nr_scanned;
+
+	/* Number of pages freed so far during a call to shrink_zones() */
+	unsigned long nr_reclaimed;
+
+	struct {
+		unsigned int dirty;
+		unsigned int unqueued_dirty;
+		unsigned int congested;
+		unsigned int writeback;
+		unsigned int immediate;
+		unsigned int file_taken;
+		unsigned int taken;
+	} nr;
+};
+
+unsigned long isolate_lru_pages(unsigned long nr_to_scan,
+		struct lruvec *lruvec, struct list_head *dst,
+		unsigned long *nr_scanned, struct scan_control *sc,
+		enum lru_list lru);
+void shrink_active_list(unsigned long nr_to_scan,
+			       struct lruvec *lruvec,
+			       struct scan_control *sc,
+			       enum lru_list lru);
+unsigned long shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
+		     struct scan_control *sc, enum lru_list lru);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory_manage.c b/mm/memory_manage.c
index b8f3654..e8dddbf 100644
--- a/mm/memory_manage.c
+++ b/mm/memory_manage.c
@@ -5,13 +5,79 @@
 #include <linux/sched/mm.h>
 #include <linux/cpuset.h>
 #include <linux/mempolicy.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
 #include <linux/nodemask.h>
+#include <linux/rmap.h>
 #include <linux/security.h>
+#include <linux/swap.h>
 #include <linux/syscalls.h>
 
 #include "internal.h"
 
 
+static unsigned long shrink_lists_node_memcg(pg_data_t *pgdat,
+	struct mem_cgroup *memcg, unsigned long nr_to_scan)
+{
+	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
+	enum lru_list lru;
+
+	for_each_evictable_lru(lru) {
+		unsigned long nr_to_scan_local = lruvec_size_memcg_node(lru, memcg,
+				pgdat->node_id) / 2;
+		struct scan_control sc = {.may_unmap = 1, .no_reclaim = 1};
+		/*nr_reclaimed += shrink_list(lru, nr_to_scan, lruvec, memcg, sc);*/
+		/*
+		 * for slow node, we want active list, we start from the top of
+		 * the active list. For pages in the bottom of
+		 * the inactive list, we can place it to the top of inactive list
+		 */
+		/*
+		 * for fast node, we want inactive list, we start from the bottom of
+		 * the inactive list. For pages in the active list, we just keep them.
+		 */
+		/*
+		 * A key question is how many pages to scan each time, and what criteria
+		 * to use to move pages between active/inactive page lists.
+		 *  */
+		if (is_active_lru(lru))
+			shrink_active_list(nr_to_scan_local, lruvec, &sc, lru);
+		else
+			shrink_inactive_list(nr_to_scan_local, lruvec, &sc, lru);
+	}
+	cond_resched();
+
+	return 0;
+}
+
+static int shrink_lists(struct task_struct *p, struct mm_struct *mm,
+		const nodemask_t *slow, const nodemask_t *fast, unsigned long nr_to_scan)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_task(p);
+	int slow_nid, fast_nid;
+	int err = 0;
+
+	if (!memcg)
+		return 0;
+	/* Let's handle simplest situation first */
+	if (!(nodes_weight(*slow) == 1 && nodes_weight(*fast) == 1))
+		return 0;
+
+	if (memcg == root_mem_cgroup)
+		return 0;
+
+	slow_nid = first_node(*slow);
+	fast_nid = first_node(*fast);
+
+	/* move pages between page lists in slow node */
+	shrink_lists_node_memcg(NODE_DATA(slow_nid), memcg, nr_to_scan);
+
+	/* move pages between page lists in fast node */
+	shrink_lists_node_memcg(NODE_DATA(fast_nid), memcg, nr_to_scan);
+
+	return err;
+}
+
 SYSCALL_DEFINE6(mm_manage, pid_t, pid, unsigned long, nr_pages,
 		unsigned long, maxnode,
 		const unsigned long __user *, slow_nodes,
@@ -42,10 +108,14 @@ SYSCALL_DEFINE6(mm_manage, pid_t, pid, unsigned long, nr_pages,
 		goto out;
 
 	/* Check flags */
-	if (flags & ~(MPOL_MF_MOVE_MT|
+	if (flags & ~(
+				  MPOL_MF_MOVE|
+				  MPOL_MF_MOVE_MT|
 				  MPOL_MF_MOVE_DMA|
 				  MPOL_MF_MOVE_CONCUR|
-				  MPOL_MF_EXCHANGE))
+				  MPOL_MF_EXCHANGE|
+				  MPOL_MF_SHRINK_LISTS|
+				  MPOL_MF_MOVE_ALL))
 		return -EINVAL;
 
 	/* Find the mm_struct */
@@ -94,6 +164,8 @@ SYSCALL_DEFINE6(mm_manage, pid_t, pid, unsigned long, nr_pages,
 		set_bit(MMF_MM_MANAGE, &mm->flags);
 	}
 
+	if (flags & MPOL_MF_SHRINK_LISTS)
+		shrink_lists(task, mm, slow, fast, nr_pages);
 
 	clear_bit(MMF_MM_MANAGE, &mm->flags);
 	mmput(mm);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1d539d6..3693550 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -63,75 +63,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
-struct scan_control {
-	/* How many pages shrink_list() should reclaim */
-	unsigned long nr_to_reclaim;
-
-	/*
-	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
-	 * are scanned.
-	 */
-	nodemask_t	*nodemask;
-
-	/*
-	 * The memory cgroup that hit its limit and as a result is the
-	 * primary target of this reclaim invocation.
-	 */
-	struct mem_cgroup *target_mem_cgroup;
-
-	/* Writepage batching in laptop mode; RECLAIM_WRITE */
-	unsigned int may_writepage:1;
-
-	/* Can mapped pages be reclaimed? */
-	unsigned int may_unmap:1;
-
-	/* Can pages be swapped as part of reclaim? */
-	unsigned int may_swap:1;
-
-	/* e.g. boosted watermark reclaim leaves slabs alone */
-	unsigned int may_shrinkslab:1;
-
-	/*
-	 * Cgroups are not reclaimed below their configured memory.low,
-	 * unless we threaten to OOM. If any cgroups are skipped due to
-	 * memory.low and nothing was reclaimed, go back for memory.low.
-	 */
-	unsigned int memcg_low_reclaim:1;
-	unsigned int memcg_low_skipped:1;
-
-	unsigned int hibernation_mode:1;
-
-	/* One of the zones is ready for compaction */
-	unsigned int compaction_ready:1;
-
-	/* Allocation order */
-	s8 order;
-
-	/* Scan (total_size >> priority) pages at once */
-	s8 priority;
-
-	/* The highest zone to isolate pages for reclaim from */
-	s8 reclaim_idx;
-
-	/* This context's GFP mask */
-	gfp_t gfp_mask;
-
-	/* Incremented by the number of inactive pages that were scanned */
-	unsigned long nr_scanned;
-
-	/* Number of pages freed so far during a call to shrink_zones() */
-	unsigned long nr_reclaimed;
-
-	struct {
-		unsigned int dirty;
-		unsigned int unqueued_dirty;
-		unsigned int congested;
-		unsigned int writeback;
-		unsigned int immediate;
-		unsigned int file_taken;
-		unsigned int taken;
-	} nr;
-};
 
 #ifdef ARCH_HAS_PREFETCH
 #define prefetch_prev_lru_page(_page, _base, _field)			\
@@ -1261,6 +1192,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			; /* try to reclaim the page below */
 		}
 
+		/* Keep the page on the inactive list so that it can be
+		 * migrated in the next step. */
+		if (sc->no_reclaim) {
+			stat->nr_ref_keep++;
+			goto keep_locked;
+		}
+
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
@@ -1613,7 +1551,7 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
  *
  * returns how many pages were moved onto *@dst.
  */
-static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
+unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		struct lruvec *lruvec, struct list_head *dst,
 		unsigned long *nr_scanned, struct scan_control *sc,
 		enum lru_list lru)
@@ -1634,6 +1572,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		struct page *page;
 
 		page = lru_to_page(src);
+		nr_pages = hpage_nr_pages(page);
+
+		if (sc->isolate_only_base_page && nr_pages != 1)
+			continue;
+		if (sc->isolate_only_huge_page && nr_pages == 1)
+			continue;
+
 		prefetchw_prev_lru_page(page, src, flags);
 
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
@@ -1653,7 +1598,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		scan++;
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
-			nr_pages = hpage_nr_pages(page);
 			nr_taken += nr_pages;
 			nr_zone_taken[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
@@ -1855,7 +1799,7 @@ static int current_may_throttle(void)
  * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
  */
-static noinline_for_stack unsigned long
+noinline_for_stack unsigned long
 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		     struct scan_control *sc, enum lru_list lru)
 {
@@ -2029,7 +1973,7 @@ unsigned move_active_pages_to_lru(struct lruvec *lruvec,
 	return nr_moved;
 }
 
-static void shrink_active_list(unsigned long nr_to_scan,
+void shrink_active_list(unsigned long nr_to_scan,
 			       struct lruvec *lruvec,
 			       struct scan_control *sc,
 			       enum lru_list lru)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 23/25] memory manage: page migration based page manipulation between NUMA nodes.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (21 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 22/25] memory manage: active/inactive page list manipulation in memcg Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 24/25] memory manage: limit migration batch size Zi Yan
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Users are expected to set the memcg max size to reflect their memory
resource allocation policy. The syscall simply migrates pages belonging
to the application's memcg between from_node and to_node, where
from_node is considered fast memory and to_node is considered slow
memory. In common cases, active (hot) pages are migrated from to_node
to from_node and inactive (cold) pages are migrated from from_node to
to_node.

Migration of base pages and huge pages is separated to achieve high
throughput:

1. They are migrated via different calls.
2. 4KB base pages are not transferred via the multi-threaded copy path.
3. All pages are migrated together if no optimization is used.
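
A sketch of driving this migration path from userspace, assuming the patched
kernel; the syscall number and flag values mirror the definitions added
earlier in the series, and nodes 0/1 are placeholders for DRAM/PMEM:

#include <sys/syscall.h>
#include <unistd.h>

#define __NR_mm_manage          336             /* from the mm_manage patch */
#define MPOL_MF_MOVE            (1 << 1)        /* from uapi/linux/mempolicy.h */
#define MPOL_MF_MOVE_MT         (1 << 6)        /* multi-threaded page copy */
#define MPOL_MF_MOVE_CONCUR     (1 << 7)        /* batched page migration */

int main(void)
{
        unsigned long slow = 1UL << 1;  /* node 1: slow memory (to_node) */
        unsigned long fast = 1UL << 0;  /* node 0: fast memory (from_node) */

        /* Migrate up to 16k pages of the calling process's memcg: hot
         * pages slow -> fast and, if the fast node is short on space,
         * cold pages fast -> slow, using multi-threaded, batched copies
         * for huge pages. */
        return syscall(__NR_mm_manage, 0, 16384UL, 64UL, &slow, &fast,
                       MPOL_MF_MOVE | MPOL_MF_MOVE_MT | MPOL_MF_MOVE_CONCUR)
                ? 1 : 0;
}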

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/memory_manage.c | 275 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 275 insertions(+)

diff --git a/mm/memory_manage.c b/mm/memory_manage.c
index e8dddbf..d63ad25 100644
--- a/mm/memory_manage.c
+++ b/mm/memory_manage.c
@@ -6,6 +6,7 @@
 #include <linux/cpuset.h>
 #include <linux/mempolicy.h>
 #include <linux/memcontrol.h>
+#include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/nodemask.h>
 #include <linux/rmap.h>
@@ -15,6 +16,11 @@
 
 #include "internal.h"
 
+enum isolate_action {
+	ISOLATE_COLD_PAGES = 1,
+	ISOLATE_HOT_PAGES,
+	ISOLATE_HOT_AND_COLD_PAGES,
+};
 
 static unsigned long shrink_lists_node_memcg(pg_data_t *pgdat,
 	struct mem_cgroup *memcg, unsigned long nr_to_scan)
@@ -78,6 +84,272 @@ static int shrink_lists(struct task_struct *p, struct mm_struct *mm,
 	return err;
 }
 
+static unsigned long isolate_pages_from_lru_list(pg_data_t *pgdat,
+		struct mem_cgroup *memcg, unsigned long nr_pages,
+		struct list_head *base_page_list,
+		struct list_head *huge_page_list,
+		unsigned long *nr_taken_base_page,
+		unsigned long *nr_taken_huge_page,
+		enum isolate_action action)
+{
+	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
+	enum lru_list lru;
+	unsigned long nr_all_taken = 0;
+
+	if (nr_pages == ULONG_MAX)
+		nr_pages = memcg_size_node(memcg, pgdat->node_id);
+
+	lru_add_drain_all();
+
+	for_each_evictable_lru(lru) {
+		unsigned long nr_scanned, nr_taken;
+		int file = is_file_lru(lru);
+		struct scan_control sc = {.may_unmap = 1};
+
+		if (action == ISOLATE_COLD_PAGES && is_active_lru(lru))
+			continue;
+		if (action == ISOLATE_HOT_PAGES && !is_active_lru(lru))
+			continue;
+
+		spin_lock_irq(&pgdat->lru_lock);
+
+		/* Isolate base pages */
+		sc.isolate_only_base_page = 1;
+		nr_taken = isolate_lru_pages(nr_pages, lruvec, base_page_list,
+					&nr_scanned, &sc, lru);
+		/* Isolate huge pages */
+		sc.isolate_only_base_page = 0;
+		sc.isolate_only_huge_page = 1;
+		nr_taken += isolate_lru_pages(nr_pages - nr_scanned, lruvec,
+					huge_page_list, &nr_scanned, &sc, lru);
+
+		__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
+
+		spin_unlock_irq(&pgdat->lru_lock);
+
+		nr_all_taken += nr_taken;
+
+		if (nr_all_taken > nr_pages)
+			break;
+	}
+
+	return nr_all_taken;
+}
+
+static int migrate_to_node(struct list_head *page_list, int nid,
+		enum migrate_mode mode)
+{
+	bool migrate_concur = mode & MIGRATE_CONCUR;
+	int num = 0;
+	int from_nid;
+	int err;
+
+	if (list_empty(page_list))
+		return num;
+
+	from_nid = page_to_nid(list_first_entry(page_list, struct page, lru));
+
+	if (migrate_concur)
+		err = migrate_pages_concur(page_list, alloc_new_node_page,
+			NULL, nid, mode, MR_SYSCALL);
+	else
+		err = migrate_pages(page_list, alloc_new_node_page,
+			NULL, nid, mode, MR_SYSCALL);
+
+	if (err) {
+		struct page *page;
+
+		list_for_each_entry(page, page_list, lru)
+			num += hpage_nr_pages(page);
+		pr_debug("%d pages failed to migrate from %d to %d\n",
+			num, from_nid, nid);
+
+		putback_movable_pages(page_list);
+	}
+	return num;
+}
+
+static inline int _putback_overflow_pages(unsigned long max_nr_pages,
+		struct list_head *page_list, unsigned long *nr_remaining_pages)
+{
+	struct page *page;
+	LIST_HEAD(putback_list);
+
+	if (list_empty(page_list))
+		return max_nr_pages;
+
+	*nr_remaining_pages = 0;
+	/* in case we need to drop the whole list */
+	page = list_first_entry(page_list, struct page, lru);
+	if (max_nr_pages <= (2 * hpage_nr_pages(page))) {
+		max_nr_pages = 0;
+		putback_movable_pages(page_list);
+		goto out;
+	}
+
+	list_for_each_entry(page, page_list, lru) {
+		int nr_pages = hpage_nr_pages(page);
+		/* drop just one more page to avoid using up free space  */
+		if (max_nr_pages <= (2 * nr_pages)) {
+			max_nr_pages = 0;
+			break;
+		}
+		max_nr_pages -= nr_pages;
+		*nr_remaining_pages += nr_pages;
+	}
+
+	/* we did not scan all pages in page_list, we need to put back some */
+	if (&page->lru != page_list) {
+		list_cut_position(&putback_list, page_list, &page->lru);
+		putback_movable_pages(page_list);
+		list_splice(&putback_list, page_list);
+	}
+out:
+	return max_nr_pages;
+}
+
+static int putback_overflow_pages(unsigned long max_nr_base_pages,
+		unsigned long max_nr_huge_pages,
+		long nr_free_pages,
+		struct list_head *base_page_list,
+		struct list_head *huge_page_list,
+		unsigned long *nr_base_pages,
+		unsigned long *nr_huge_pages)
+{
+	if (nr_free_pages < 0) {
+		if ((-nr_free_pages) > max_nr_base_pages) {
+			nr_free_pages += max_nr_base_pages;
+			max_nr_base_pages = 0;
+		}
+
+		if ((-nr_free_pages) > max_nr_huge_pages) {
+			nr_free_pages = 0;
+			max_nr_base_pages = 0;
+		}
+	}
+	/*
+	 * counting pages in page lists and subtract the number from max_nr_*
+	 * when max_nr_* go to zero, drop the remaining pages
+	 */
+	max_nr_huge_pages += _putback_overflow_pages(nr_free_pages/2 + max_nr_base_pages,
+			base_page_list, nr_base_pages);
+	return _putback_overflow_pages(nr_free_pages/2 + max_nr_huge_pages,
+			huge_page_list, nr_huge_pages);
+}
+
+static int do_mm_manage(struct task_struct *p, struct mm_struct *mm,
+		const nodemask_t *slow, const nodemask_t *fast,
+		unsigned long nr_pages, int flags)
+{
+	bool migrate_mt = flags & MPOL_MF_MOVE_MT;
+	bool migrate_concur = flags & MPOL_MF_MOVE_CONCUR;
+	bool migrate_dma = flags & MPOL_MF_MOVE_DMA;
+	bool move_hot_and_cold_pages = flags & MPOL_MF_MOVE_ALL;
+	struct mem_cgroup *memcg = mem_cgroup_from_task(p);
+	int err = 0;
+	unsigned long nr_isolated_slow_pages;
+	unsigned long nr_isolated_slow_base_pages = 0;
+	unsigned long nr_isolated_slow_huge_pages = 0;
+	unsigned long nr_isolated_fast_pages;
+	/* if no pages are isolated from the fast node below, we migrate all
+	 * isolated pages from the slow node */
+	unsigned long nr_isolated_fast_base_pages = ULONG_MAX;
+	unsigned long nr_isolated_fast_huge_pages = ULONG_MAX;
+	unsigned long max_nr_pages_fast_node, nr_pages_fast_node;
+	unsigned long nr_pages_slow_node, nr_active_pages_slow_node;
+	long nr_free_pages_fast_node;
+	int slow_nid, fast_nid;
+	enum migrate_mode mode = MIGRATE_SYNC |
+		(migrate_mt ? MIGRATE_MT : MIGRATE_SINGLETHREAD) |
+		(migrate_dma ? MIGRATE_DMA : MIGRATE_SINGLETHREAD) |
+		(migrate_concur ? MIGRATE_CONCUR : MIGRATE_SINGLETHREAD);
+	enum isolate_action isolate_action =
+		move_hot_and_cold_pages?ISOLATE_HOT_AND_COLD_PAGES:ISOLATE_HOT_PAGES;
+	LIST_HEAD(slow_base_page_list);
+	LIST_HEAD(slow_huge_page_list);
+
+	if (!memcg)
+		return 0;
+	/* Let's handle simplest situation first */
+	if (!(nodes_weight(*slow) == 1 && nodes_weight(*fast) == 1))
+		return 0;
+
+	/* Only work on specific cgroup not the global root */
+	if (memcg == root_mem_cgroup)
+		return 0;
+
+	slow_nid = first_node(*slow);
+	fast_nid = first_node(*fast);
+
+	max_nr_pages_fast_node = memcg_max_size_node(memcg, fast_nid);
+	nr_pages_fast_node = memcg_size_node(memcg, fast_nid);
+	nr_active_pages_slow_node = active_inactive_size_memcg_node(memcg,
+			slow_nid, true);
+	nr_pages_slow_node = memcg_size_node(memcg, slow_nid);
+
+	nr_free_pages_fast_node = max_nr_pages_fast_node - nr_pages_fast_node;
+
+	/* do not migrate in more pages than fast node can hold */
+	nr_pages = min_t(unsigned long, max_nr_pages_fast_node, nr_pages);
+	/* do not migrate away more pages than slow node has */
+	nr_pages = min_t(unsigned long, nr_pages_slow_node, nr_pages);
+
+	/* if fast node has enough space, migrate all possible pages in slow node */
+	if (nr_pages != ULONG_MAX &&
+		nr_free_pages_fast_node > 0 &&
+		nr_active_pages_slow_node < nr_free_pages_fast_node) {
+		isolate_action = ISOLATE_HOT_AND_COLD_PAGES;
+	}
+
+	nr_isolated_slow_pages = isolate_pages_from_lru_list(NODE_DATA(slow_nid),
+			memcg, nr_pages, &slow_base_page_list, &slow_huge_page_list,
+			&nr_isolated_slow_base_pages, &nr_isolated_slow_huge_pages,
+			isolate_action);
+
+	if (max_nr_pages_fast_node != ULONG_MAX &&
+		(nr_free_pages_fast_node < 0 ||
+		 nr_free_pages_fast_node < nr_isolated_slow_pages)) {
+		LIST_HEAD(fast_base_page_list);
+		LIST_HEAD(fast_huge_page_list);
+
+		nr_isolated_fast_base_pages = 0;
+		nr_isolated_fast_huge_pages = 0;
+		/* isolate pages on fast node to make space */
+		nr_isolated_fast_pages = isolate_pages_from_lru_list(NODE_DATA(fast_nid),
+			memcg,
+			nr_isolated_slow_pages - nr_free_pages_fast_node,
+			&fast_base_page_list, &fast_huge_page_list,
+			&nr_isolated_fast_base_pages, &nr_isolated_fast_huge_pages,
+			move_hot_and_cold_pages?ISOLATE_HOT_AND_COLD_PAGES:ISOLATE_COLD_PAGES);
+
+		/* Migrate pages to slow node */
+		/* No multi-threaded migration for base pages */
+		nr_isolated_fast_base_pages -=
+			migrate_to_node(&fast_base_page_list, slow_nid, mode & ~MIGRATE_MT);
+
+		nr_isolated_fast_huge_pages -=
+			migrate_to_node(&fast_huge_page_list, slow_nid, mode);
+	}
+
+	if (nr_isolated_fast_base_pages != ULONG_MAX &&
+		nr_isolated_fast_huge_pages != ULONG_MAX)
+		putback_overflow_pages(nr_isolated_fast_base_pages,
+				nr_isolated_fast_huge_pages, nr_free_pages_fast_node,
+				&slow_base_page_list, &slow_huge_page_list,
+				&nr_isolated_slow_base_pages,
+				&nr_isolated_slow_huge_pages);
+
+	/* Migrate pages to fast node */
+	/* No multi-threaded migration for base pages */
+	nr_isolated_slow_base_pages -=
+		migrate_to_node(&slow_base_page_list, fast_nid, mode & ~MIGRATE_MT);
+
+	nr_isolated_slow_huge_pages -=
+		migrate_to_node(&slow_huge_page_list, fast_nid, mode);
+
+	return err;
+}
+
 SYSCALL_DEFINE6(mm_manage, pid_t, pid, unsigned long, nr_pages,
 		unsigned long, maxnode,
 		const unsigned long __user *, slow_nodes,
@@ -167,6 +439,9 @@ SYSCALL_DEFINE6(mm_manage, pid_t, pid, unsigned long, nr_pages,
 	if (flags & MPOL_MF_SHRINK_LISTS)
 		shrink_lists(task, mm, slow, fast, nr_pages);
 
+	if (flags & MPOL_MF_MOVE)
+		err = do_mm_manage(task, mm, slow, fast, nr_pages, flags);
+
 	clear_bit(MMF_MM_MANAGE, &mm->flags);
 	mmput(mm);
 out:
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 24/25] memory manage: limit migration batch size.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (22 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 23/25] memory manage: page migration based page manipulation between NUMA nodes Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  2:00 ` [RFC PATCH 25/25] memory manage: use exchange pages to memory manage to improve throughput Zi Yan
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Make the migration batch size adjustable via a new sysctl,
vm.migration_batch_size, to avoid excessive migration overheads when a
large number of pages are under migration.
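
The default batch size is 16 and, as implemented in migrate_to_node() below,
the limit only takes effect for concurrent (batched) migration. The vm_table
entry exposes the knob as /proc/sys/vm/migration_batch_size; a minimal
user-space sketch for reading and tuning it could look like the following
(illustration only; it assumes nothing beyond the proc path implied by the
sysctl entry, and writing requires root):

/* Read and retune vm.migration_batch_size -- illustration only. */
#include <stdio.h>

int main(void)
{
	const char *knob = "/proc/sys/vm/migration_batch_size";
	FILE *f = fopen(knob, "r+");	/* opening for write needs root */
	int batch;

	if (!f) {
		perror(knob);
		return 1;
	}
	if (fscanf(f, "%d", &batch) == 1)
		printf("current migration batch size: %d pages\n", batch);
	rewind(f);
	fprintf(f, "%d\n", 32);		/* e.g. raise the batch size to 32 */
	return fclose(f) ? 1 : 0;
}

Setting the knob to zero or a negative value falls back to an unlimited
batch size, matching the batch_size <= 0 check in migrate_to_node().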

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 kernel/sysctl.c    |  8 ++++++++
 mm/memory_manage.c | 60 ++++++++++++++++++++++++++++++++++++------------------
 2 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b8712eb..b92e2da9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -105,6 +105,7 @@ extern int accel_page_copy;
 extern unsigned int limit_mt_num;
 extern int use_all_dma_chans;
 extern int limit_dma_chans;
+extern int migration_batch_size;
 
 /* External variables not in a header file. */
 extern int suid_dumpable;
@@ -1470,6 +1471,13 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	 },
 	 {
+		.procname	= "migration_batch_size",
+		.data		= &migration_batch_size,
+		.maxlen		= sizeof(migration_batch_size),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	 },
+	 {
 		.procname	= "hugetlb_shm_group",
 		.data		= &sysctl_hugetlb_shm_group,
 		.maxlen		= sizeof(gid_t),
diff --git a/mm/memory_manage.c b/mm/memory_manage.c
index d63ad25..8b76fcf 100644
--- a/mm/memory_manage.c
+++ b/mm/memory_manage.c
@@ -16,6 +16,8 @@
 
 #include "internal.h"
 
+int migration_batch_size = 16;
+
 enum isolate_action {
 	ISOLATE_COLD_PAGES = 1,
 	ISOLATE_HOT_PAGES,
@@ -137,35 +139,49 @@ static unsigned long isolate_pages_from_lru_list(pg_data_t *pgdat,
 }
 
 static int migrate_to_node(struct list_head *page_list, int nid,
-		enum migrate_mode mode)
+		enum migrate_mode mode, int batch_size)
 {
 	bool migrate_concur = mode & MIGRATE_CONCUR;
+	bool unlimited_batch_size = (batch_size <= 0 || !migrate_concur);
 	int num = 0;
-	int from_nid;
+	int from_nid = -1;
 	int err;
 
 	if (list_empty(page_list))
 		return num;
 
-	from_nid = page_to_nid(list_first_entry(page_list, struct page, lru));
+	while (!list_empty(page_list)) {
+		LIST_HEAD(batch_page_list);
+		int i;
 
-	if (migrate_concur)
-		err = migrate_pages_concur(page_list, alloc_new_node_page,
-			NULL, nid, mode, MR_SYSCALL);
-	else
-		err = migrate_pages(page_list, alloc_new_node_page,
-			NULL, nid, mode, MR_SYSCALL);
+		/* it should move all pages to batch_page_list if !migrate_concur */
+		for (i = 0; i < batch_size || unlimited_batch_size; i++) {
+			struct page *item = list_first_entry_or_null(page_list, struct page, lru);
+			if (!item)
+				break;
+			list_move(&item->lru, &batch_page_list);
+		}
 
-	if (err) {
-		struct page *page;
+		from_nid = page_to_nid(list_first_entry(&batch_page_list, struct page, lru));
 
-		list_for_each_entry(page, page_list, lru)
-			num += hpage_nr_pages(page);
-		pr_debug("%d pages failed to migrate from %d to %d\n",
-			num, from_nid, nid);
+		if (migrate_concur)
+			err = migrate_pages_concur(&batch_page_list, alloc_new_node_page,
+				NULL, nid, mode, MR_SYSCALL);
+		else
+			err = migrate_pages(&batch_page_list, alloc_new_node_page,
+				NULL, nid, mode, MR_SYSCALL);
 
-		putback_movable_pages(page_list);
+		if (err) {
+			struct page *page;
+
+			list_for_each_entry(page, &batch_page_list, lru)
+				num += hpage_nr_pages(page);
+
+			putback_movable_pages(&batch_page_list);
+		}
 	}
+	pr_debug("%d pages failed to migrate from %d to %d\n",
+		num, from_nid, nid);
 	return num;
 }
 
@@ -325,10 +341,12 @@ static int do_mm_manage(struct task_struct *p, struct mm_struct *mm,
 		/* Migrate pages to slow node */
 		/* No multi-threaded migration for base pages */
 		nr_isolated_fast_base_pages -=
-			migrate_to_node(&fast_base_page_list, slow_nid, mode & ~MIGRATE_MT);
+			migrate_to_node(&fast_base_page_list, slow_nid,
+				mode & ~MIGRATE_MT, migration_batch_size);
 
 		nr_isolated_fast_huge_pages -=
-			migrate_to_node(&fast_huge_page_list, slow_nid, mode);
+			migrate_to_node(&fast_huge_page_list, slow_nid, mode,
+				migration_batch_size);
 	}
 
 	if (nr_isolated_fast_base_pages != ULONG_MAX &&
@@ -342,10 +360,12 @@ static int do_mm_manage(struct task_struct *p, struct mm_struct *mm,
 	/* Migrate pages to fast node */
 	/* No multi-threaded migration for base pages */
 	nr_isolated_slow_base_pages -=
-		migrate_to_node(&slow_base_page_list, fast_nid, mode & ~MIGRATE_MT);
+		migrate_to_node(&slow_base_page_list, fast_nid, mode & ~MIGRATE_MT,
+				migration_batch_size);
 
 	nr_isolated_slow_huge_pages -=
-		migrate_to_node(&slow_huge_page_list, fast_nid, mode);
+		migrate_to_node(&slow_huge_page_list, fast_nid, mode,
+				migration_batch_size);
 
 	return err;
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 25/25] memory manage: use exchange pages to memory manage to improve throughput.
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (23 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 24/25] memory manage: limit migration batch size Zi Yan
@ 2019-04-04  2:00 ` Zi Yan
  2019-04-04  7:13 ` [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Michal Hocko
  2019-04-05  0:32 ` Yang Shi
  26 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-04  2:00 UTC (permalink / raw)
  To: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

1. Exclude file-backed base pages from exchanging.
2. Split THPs when adding pages to the exchange list if THP migration is
   not supported.
3. If THP migration is supported, only exchange THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/memory_manage.c | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 173 insertions(+)

diff --git a/mm/memory_manage.c b/mm/memory_manage.c
index 8b76fcf..d3d07b7 100644
--- a/mm/memory_manage.c
+++ b/mm/memory_manage.c
@@ -7,6 +7,7 @@
 #include <linux/mempolicy.h>
 #include <linux/memcontrol.h>
 #include <linux/migrate.h>
+#include <linux/exchange.h>
 #include <linux/mm_inline.h>
 #include <linux/nodemask.h>
 #include <linux/rmap.h>
@@ -253,6 +254,147 @@ static int putback_overflow_pages(unsigned long max_nr_base_pages,
 			huge_page_list, nr_huge_pages);
 }
 
+static int add_pages_to_exchange_list(struct list_head *from_pagelist,
+	struct list_head *to_pagelist, struct exchange_page_info *info_list,
+	struct list_head *exchange_list, unsigned long info_list_size)
+{
+	unsigned long info_list_index = 0;
+	LIST_HEAD(failed_from_list);
+	LIST_HEAD(failed_to_list);
+
+	while (!list_empty(from_pagelist) && !list_empty(to_pagelist)) {
+		struct page *from_page, *to_page;
+		struct exchange_page_info *one_pair = &info_list[info_list_index];
+		int rc;
+
+		from_page = list_first_entry_or_null(from_pagelist, struct page, lru);
+		to_page = list_first_entry_or_null(to_pagelist, struct page, lru);
+
+		if (!from_page || !to_page)
+			break;
+
+		if (!thp_migration_supported() && PageTransHuge(from_page)) {
+			lock_page(from_page);
+			rc = split_huge_page_to_list(from_page, &from_page->lru);
+			unlock_page(from_page);
+			if (rc) {
+				list_move(&from_page->lru, &failed_from_list);
+				continue;
+			}
+		}
+
+		if (!thp_migration_supported() && PageTransHuge(to_page)) {
+			lock_page(to_page);
+			rc = split_huge_page_to_list(to_page, &to_page->lru);
+			unlock_page(to_page);
+			if (rc) {
+				list_move(&to_page->lru, &failed_to_list);
+				continue;
+			}
+		}
+
+		if (hpage_nr_pages(from_page) != hpage_nr_pages(to_page)) {
+			if (!(hpage_nr_pages(from_page) == 1 && hpage_nr_pages(from_page) == HPAGE_PMD_NR)) {
+				list_del(&from_page->lru);
+				list_add(&from_page->lru, &failed_from_list);
+			}
+			if (!(hpage_nr_pages(to_page) == 1 && hpage_nr_pages(to_page) == HPAGE_PMD_NR)) {
+				list_del(&to_page->lru);
+				list_add(&to_page->lru, &failed_to_list);
+			}
+			continue;
+		}
+
+		/* Exclude file-backed pages; exchanging them concurrently is
+		 * not implemented yet. */
+		if (page_mapping(from_page)) {
+			list_del(&from_page->lru);
+			list_add(&from_page->lru, &failed_from_list);
+			continue;
+		}
+		if (page_mapping(to_page)) {
+			list_del(&to_page->lru);
+			list_add(&to_page->lru, &failed_to_list);
+			continue;
+		}
+
+		list_del(&from_page->lru);
+		list_del(&to_page->lru);
+
+		one_pair->from_page = from_page;
+		one_pair->to_page = to_page;
+
+		list_add_tail(&one_pair->list, exchange_list);
+
+		info_list_index++;
+		if (info_list_index >= info_list_size)
+			break;
+	}
+	list_splice(&failed_from_list, from_pagelist);
+	list_splice(&failed_to_list, to_pagelist);
+
+	return info_list_index;
+}
+
+static unsigned long exchange_pages_between_nodes(unsigned long nr_from_pages,
+	unsigned long nr_to_pages, struct list_head *from_page_list,
+	struct list_head *to_page_list, int batch_size,
+	bool huge_page, enum migrate_mode mode)
+{
+	struct exchange_page_info *info_list;
+	unsigned long info_list_size = min_t(unsigned long,
+		nr_from_pages, nr_to_pages) / (huge_page?HPAGE_PMD_NR:1);
+	unsigned long added_size = 0;
+	bool migrate_concur = mode & MIGRATE_CONCUR;
+	LIST_HEAD(exchange_list);
+
+	/* non-concurrent exchange does not need to be split into batches */
+	if (!migrate_concur || batch_size <= 0)
+		batch_size = info_list_size;
+
+	/* prepare for huge page split  */
+	if (!thp_migration_supported() && huge_page) {
+		batch_size = batch_size * HPAGE_PMD_NR;
+		info_list_size = info_list_size * HPAGE_PMD_NR;
+	}
+
+	info_list = kvzalloc(sizeof(struct exchange_page_info)*batch_size,
+			GFP_KERNEL);
+	if (!info_list)
+		return 0;
+
+	while (!list_empty(from_page_list) && !list_empty(to_page_list)) {
+		unsigned long nr_added_pages;
+		INIT_LIST_HEAD(&exchange_list);
+
+		nr_added_pages = add_pages_to_exchange_list(from_page_list, to_page_list,
+			info_list, &exchange_list, batch_size);
+
+		/*
+		 * Nothing to exchange, so we bail out. This happens when both
+		 * from_page_list and to_page_list only have file-backed pages
+		 * left.
+		 */
+		if (!nr_added_pages)
+			break;
+
+		added_size += nr_added_pages;
+
+		VM_BUG_ON(added_size > info_list_size);
+
+		if (migrate_concur)
+			exchange_pages_concur(&exchange_list, mode, MR_SYSCALL);
+		else
+			exchange_pages(&exchange_list, mode, MR_SYSCALL);
+
+		memset(info_list, 0, sizeof(struct exchange_page_info)*batch_size);
+	}
+
+	kvfree(info_list);
+
+	return info_list_size;
+}
+
 static int do_mm_manage(struct task_struct *p, struct mm_struct *mm,
 		const nodemask_t *slow, const nodemask_t *fast,
 		unsigned long nr_pages, int flags)
@@ -261,6 +403,7 @@ static int do_mm_manage(struct task_struct *p, struct mm_struct *mm,
 	bool migrate_concur = flags & MPOL_MF_MOVE_CONCUR;
 	bool migrate_dma = flags & MPOL_MF_MOVE_DMA;
 	bool move_hot_and_cold_pages = flags & MPOL_MF_MOVE_ALL;
+	bool migrate_exchange_pages = flags & MPOL_MF_EXCHANGE;
 	struct mem_cgroup *memcg = mem_cgroup_from_task(p);
 	int err = 0;
 	unsigned long nr_isolated_slow_pages;
@@ -338,6 +481,35 @@ static int do_mm_manage(struct task_struct *p, struct mm_struct *mm,
 			&nr_isolated_fast_base_pages, &nr_isolated_fast_huge_pages,
 			move_hot_and_cold_pages?ISOLATE_HOT_AND_COLD_PAGES:ISOLATE_COLD_PAGES);
 
+		if (migrate_exchange_pages) {
+			unsigned long nr_exchange_pages;
+
+			/*
+			 * base pages can include file-backed ones, we do not handle them
+			 * at the moment
+			 */
+			if (!thp_migration_supported()) {
+				nr_exchange_pages =  exchange_pages_between_nodes(nr_isolated_slow_base_pages,
+					nr_isolated_fast_base_pages, &slow_base_page_list,
+					&fast_base_page_list, migration_batch_size, false, mode);
+
+				nr_isolated_fast_base_pages -= nr_exchange_pages;
+			}
+
+			/* THP page exchange */
+			nr_exchange_pages =  exchange_pages_between_nodes(nr_isolated_slow_huge_pages,
+				nr_isolated_fast_huge_pages, &slow_huge_page_list,
+				&fast_huge_page_list, migration_batch_size, true, mode);
+
+			/* split THP above, so we do not need to multiply the counter */
+			if (!thp_migration_supported())
+				nr_isolated_fast_huge_pages -= nr_exchange_pages;
+			else
+				nr_isolated_fast_huge_pages -= nr_exchange_pages * HPAGE_PMD_NR;
+
+			goto migrate_out;
+		} else {
+migrate_out:
 		/* Migrate pages to slow node */
 		/* No multi-threaded migration for base pages */
 		nr_isolated_fast_base_pages -=
@@ -347,6 +519,7 @@ static int do_mm_manage(struct task_struct *p, struct mm_struct *mm,
 		nr_isolated_fast_huge_pages -=
 			migrate_to_node(&fast_huge_page_list, slow_nid, mode,
 				migration_batch_size);
+		}
 	}
 
 	if (nr_isolated_fast_base_pages != ULONG_MAX &&
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (24 preceding siblings ...)
  2019-04-04  2:00 ` [RFC PATCH 25/25] memory manage: use exchange pages to memory manage to improve throughput Zi Yan
@ 2019-04-04  7:13 ` Michal Hocko
  2019-04-05  0:32 ` Yang Shi
  26 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2019-04-04  7:13 UTC (permalink / raw)
  To: ziy
  Cc: Dave Hansen, Yang Shi, Keith Busch, Fengguang Wu, linux-mm,
	linux-kernel, Daniel Jordan, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans

On Wed 03-04-19 19:00:21, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Thanks to Dave Hansen's patches, which make PMEM as part of memory as NUMA nodes.
> How to use PMEM along with normal DRAM remains an open problem. There are
> several patchsets posted on the mailing list, proposing to use page migration to
> move pages between PMEM and DRAM using Linux page replacement policy [1,2,3].
> There are some important problems not addressed in these patches:
> 1. The page migration in Linux does not provide high enough throughput for us to
> fully exploit PMEM or other use cases.
> 2. Linux page replacement is running too infrequent to distinguish hot and cold
> pages.
[...]
>  33 files changed, 4261 insertions(+), 162 deletions(-)

For a patchset _this_ large you should really start with a real world
use case hitting bottlenecks with the current implementation. Sure,
microbenchmarks can trigger bottlenecks much more easily, but do real
applications do the same? Please give us some numbers.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management
  2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
                   ` (25 preceding siblings ...)
  2019-04-04  7:13 ` [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Michal Hocko
@ 2019-04-05  0:32 ` Yang Shi
  2019-04-05 17:20   ` Zi Yan
  26 siblings, 1 reply; 29+ messages in thread
From: Yang Shi @ 2019-04-05  0:32 UTC (permalink / raw)
  To: ziy, Dave Hansen, Keith Busch, Fengguang Wu, linux-mm, linux-kernel
  Cc: Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans



On 4/3/19 7:00 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
>
> Thanks to Dave Hansen's patches, which make PMEM as part of memory as NUMA nodes.
> How to use PMEM along with normal DRAM remains an open problem. There are
> several patchsets posted on the mailing list, proposing to use page migration to
> move pages between PMEM and DRAM using Linux page replacement policy [1,2,3].
> There are some important problems not addressed in these patches:
> 1. The page migration in Linux does not provide high enough throughput for us to
> fully exploit PMEM or other use cases.
> 2. Linux page replacement is running too infrequent to distinguish hot and cold
> pages.
>
> I am trying to attack the problems with this patch series. This is not a final
> solution, but I would like to gather more feedback and comments from the mailing
> list.
>
> Page migration throughput problem
> ====
>
> For example, in my recent email [4], I gave the page migration throughput numbers
> for different page migrations, none of which can achieve > 2.5GB/s throughput
> (the throughput is measured around kernel functions: migrate_pages() and
> migrate_page_copy()):
>
>                               |  migrate_pages() |    migrate_page_copy()
> migrating single 4KB page:   |  0.312GB/s       |   1.385GB/s
> migrating 512 4KB pages:     |  0.854GB/s       |   1.983GB/s
> migrating single 2MB THP:    |  2.387GB/s       |   2.481GB/s
>
> In reality, microbenchmarks show that Intel PMEM can provide ~65GB/s read
> throughput and ~16GB/s write throughput [5], which are much higher than
> the throughput achieved by Linux page migration.
>
> In addition, it is also desirable to use page migration to move data
> between high-bandwidth memory and DRAM, like IBM Summit, which exposes
> high-performance GPU memories as NUMA nodes [6]. This requires even higher page
> migration throughput.
>
> In this patch series, I propose four different ways of improving page migration
> throughput (mostly on 2MB THP migration):
> 1. multi-threaded page migration: Patch 03 to 06.
> 2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08.
> 3. concurrent (batched) page migration: Patch 09, 10, and 11.
> 4. exchange pages: Patch 12 to 17. (This is a repost of part of [7])
>
> Here are some throughput numbers showing clear throughput improvements on
> a two-socket NUMA machine with two Xeon E5-2650 v3 @ 2.30GHz and a 19.2GB/s
> bandwidth QPI link (the same machine as mentioned in [4]):
>
>                                      |  migrate_pages() |   migrate_page_copy()
> => migrating single 2MB THP         |  2.387GB/s       |   2.481GB/s
>   2-thread single THP migration      |  3.478GB/s       |   3.704GB/s
>   4-thread single THP migration      |  5.474GB/s       |   6.054GB/s
>   8-thread single THP migration      |  7.846GB/s       |   9.029GB/s
> 16-thread single THP migration      |  7.423GB/s       |   8.464GB/s
> 16-ch. DMA single THP migration     |  4.322GB/s       |   4.536GB/s
>
>   2-thread 16-THP migration          |  3.610GB/s       |   3.838GB/s
>   2-thread 16-THP batched migration  |  4.138GB/s       |   4.344GB/s
>   4-thread 16-THP migration          |  6.385GB/s       |   7.031GB/s
>   4-thread 16-THP batched migration  |  7.382GB/s       |   8.072GB/s
>   8-thread 16-THP migration          |  8.039GB/s       |   9.029GB/s
>   8-thread 16-THP batched migration  |  9.023GB/s       |   10.056GB/s
> 16-thread 16-THP migration          |  8.137GB/s       |   9.137GB/s
> 16-thread 16-THP batched migration  |  9.907GB/s       |   11.175GB/s
>
>   1-thread 16-THP exchange           |  4.135GB/s       |   4.225GB/s
>   2-thread 16-THP batched exchange   |  7.061GB/s       |   7.325GB/s
>   4-thread 16-THP batched exchange   |  9.729GB/s       |   10.237GB/s
>   8-thread 16-THP batched exchange   |  9.992GB/s       |   10.533GB/s
> 16-thread 16-THP batched exchange   |  9.520GB/s       |   10.056GB/s
>
> => migrating 512 4KB pages          |  0.854GB/s       |   1.983GB/s
>   1-thread 512-4KB batched exchange  |  1.271GB/s       |   3.433GB/s
>   2-thread 512-4KB batched exchange  |  1.240GB/s       |   3.190GB/s
>   4-thread 512-4KB batched exchange  |  1.255GB/s       |   3.823GB/s
>   8-thread 512-4KB batched exchange  |  1.336GB/s       |   3.921GB/s
> 16-thread 512-4KB batched exchange  |  1.334GB/s       |   3.897GB/s
>
> Concerns were raised on how to avoid CPU resource competition between
> page migration and user applications and have power awareness.
> Daniel Jordan recently posted a multi-threaded ktask patch series could be
> a solution [8].
>
>
> Infrequent page list update problem
> ====
>
> Current page lists are updated by calling shrink_list() when memory pressure
> comes,  which might not be frequent enough to keep track of hot and cold pages.
> Because all pages are on active lists at the first time shrink_list() is called
> and the reference bit on the pages might not reflect the up to date access status
> of these pages. But we also do not want to periodically shrink the global page
> lists, which adds unnecessary overheads to the whole system. So I propose to
> actively shrink page lists on the memcg we are interested in.
>
> Patch 18 to 25 add a new system call to shrink page lists on given application's
> memcg and migrate pages between two NUMA nodes. It isolates the impact from the
> rest of the system. To share DRAM among different applications, Patch 18 and 19
> add per-node memcg size limit, so you can limit the memory usage for particular
> NUMA node(s).

This sounds a little bit confusing to me. Is it entirely the user's decision
when to call the syscall to shrink page lists? But how would the user know
when the timing is right? Could you please elaborate on the use case?

Thanks,
Yang

>
>
> Patch structure
> ====
> 1. multi-threaded page migration: Patch 01 to 06.
> 2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08.
> 3. concurrent (batched) page migration: Patch 09, 10, and 11.
> 4. exchange pages: Patch 12 to 17. (This is a repost of part of [7])
> 5. per-node size limit in memcg: Patch 18 and 19.
> 6. actively shrink page lists and perform page migration in given memcg: Patch 20 to 25.
>
>
> Any comment is welcome.
>
> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
> [2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/
> [3]: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com/
> [4]: https://lore.kernel.org/linux-mm/6A903D34-A293-4056-B135-6FA227DE1828@nvidia.com/
> [5]: https://www.storagereview.com/supermicro_superserver_with_intel_optane_dc_persistent_memory_first_look_review
> [6]: https://www.ibm.com/thought-leadership/summit-supercomputer/
> [7]: https://lore.kernel.org/linux-mm/20190215220856.29749-1-zi.yan@sent.com/
> [8]: https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jordan@oracle.com/
>
> Zi Yan (25):
>    mm: migrate: Change migrate_mode to support combination migration
>      modes.
>    mm: migrate: Add mode parameter to support future page copy routines.
>    mm: migrate: Add a multi-threaded page migration function.
>    mm: migrate: Add copy_page_multithread into migrate_pages.
>    mm: migrate: Add vm.accel_page_copy in sysfs to control page copy
>      acceleration.
>    mm: migrate: Make the number of copy threads adjustable via sysctl.
>    mm: migrate: Add copy_page_dma to use DMA Engine to copy pages.
>    mm: migrate: Add copy_page_dma into migrate_page_copy.
>    mm: migrate: Add copy_page_lists_dma_always to support copy a list of
>         pages.
>    mm: migrate: copy_page_lists_mt() to copy a page list using
>      multi-threads.
>    mm: migrate: Add concurrent page migration into move_pages syscall.
>    exchange pages: new page migration mechanism: exchange_pages()
>    exchange pages: add multi-threaded exchange pages.
>    exchange pages: concurrent exchange pages.
>    exchange pages: exchange anonymous page and file-backed page.
>    exchange page: Add THP exchange support.
>    exchange page: Add exchange_page() syscall.
>    memcg: Add per node memory usage&max stats in memcg.
>    mempolicy: add MPOL_F_MEMCG flag, enforcing memcg memory limit.
>    memory manage: Add memory manage syscall.
>    mm: move update_lru_sizes() to mm_inline.h for broader use.
>    memory manage: active/inactive page list manipulation in memcg.
>    memory manage: page migration based page manipulation between NUMA
>      nodes.
>    memory manage: limit migration batch size.
>    memory manage: use exchange pages to memory manage to improve
>      throughput.
>
>   arch/x86/entry/syscalls/syscall_64.tbl |    2 +
>   fs/aio.c                               |   12 +-
>   fs/f2fs/data.c                         |    6 +-
>   fs/hugetlbfs/inode.c                   |    4 +-
>   fs/iomap.c                             |    4 +-
>   fs/ubifs/file.c                        |    4 +-
>   include/linux/cgroup-defs.h            |    1 +
>   include/linux/exchange.h               |   27 +
>   include/linux/highmem.h                |    3 +
>   include/linux/ksm.h                    |    4 +
>   include/linux/memcontrol.h             |   67 ++
>   include/linux/migrate.h                |   12 +-
>   include/linux/migrate_mode.h           |    8 +
>   include/linux/mm_inline.h              |   21 +
>   include/linux/sched/coredump.h         |    1 +
>   include/linux/sched/sysctl.h           |    3 +
>   include/linux/syscalls.h               |   10 +
>   include/uapi/linux/mempolicy.h         |    9 +-
>   kernel/sysctl.c                        |   47 +
>   mm/Makefile                            |    5 +
>   mm/balloon_compaction.c                |    2 +-
>   mm/compaction.c                        |   22 +-
>   mm/copy_page.c                         |  708 +++++++++++++++
>   mm/exchange.c                          | 1560 ++++++++++++++++++++++++++++++++
>   mm/exchange_page.c                     |  228 +++++
>   mm/internal.h                          |  113 +++
>   mm/ksm.c                               |   35 +
>   mm/memcontrol.c                        |   80 ++
>   mm/memory_manage.c                     |  649 +++++++++++++
>   mm/mempolicy.c                         |   38 +-
>   mm/migrate.c                           |  621 ++++++++++++-
>   mm/vmscan.c                            |  115 +--
>   mm/zsmalloc.c                          |    2 +-
>   33 files changed, 4261 insertions(+), 162 deletions(-)
>   create mode 100644 include/linux/exchange.h
>   create mode 100644 mm/copy_page.c
>   create mode 100644 mm/exchange.c
>   create mode 100644 mm/exchange_page.c
>   create mode 100644 mm/memory_manage.c
>
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management
  2019-04-05  0:32 ` Yang Shi
@ 2019-04-05 17:20   ` Zi Yan
  0 siblings, 0 replies; 29+ messages in thread
From: Zi Yan @ 2019-04-05 17:20 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Hansen, Keith Busch, Fengguang Wu, linux-mm, linux-kernel,
	Daniel Jordan, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, Javier Cabezas, David Nellans

[-- Attachment #1: Type: text/plain, Size: 2283 bytes --]


>> Infrequent page list update problem
>> ====
>>
>> Current page lists are updated by calling shrink_list() when memory pressure
>> comes,  which might not be frequent enough to keep track of hot and cold pages.
>> Because all pages are on active lists at the first time shrink_list() is called
>> and the reference bit on the pages might not reflect the up to date access status
>> of these pages. But we also do not want to periodically shrink the global page
>> lists, which adds unnecessary overheads to the whole system. So I propose to
>> actively shrink page lists on the memcg we are interested in.
>>
>> Patch 18 to 25 add a new system call to shrink page lists on given application's
>> memcg and migrate pages between two NUMA nodes. It isolates the impact from the
>> rest of the system. To share DRAM among different applications, Patch 18 and 19
>> add per-node memcg size limit, so you can limit the memory usage for particular
>> NUMA node(s).
>
> This sounds a little bit confusing to me. Is it entirely the user's decision when to call the syscall to shrink page lists? But how would the user know when the timing is right? Could you please elaborate on the use case?

Sure. We would set up a daemon that monitors user applications and calls the syscall
to shuffle the page lists for those applications, although the daemon’s concrete
action plan is still under exploration. It might not be ideal, but the page access
information could be refreshed periodically and page migration would happen in the
background of application execution.
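
Just to illustrate the intended usage, a rough sketch of such a daemon's main
loop is below. It is only a sketch: the syscall number, the value of
MPOL_MF_SHRINK_LISTS, and the assumption that the last two mm_manage arguments
are the fast nodemask and the flags are placeholders for illustration, not the
ABI as posted in this series.

/* mm_manage daemon sketch -- illustration only, not part of this series. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#ifndef __NR_mm_manage
#define __NR_mm_manage 434			/* hypothetical syscall number */
#endif

#define MPOL_MF_MOVE		(1 << 1)	/* from include/uapi/linux/mempolicy.h */
#define MPOL_MF_SHRINK_LISTS	(1 << 7)	/* value assumed for this sketch */

int main(int argc, char **argv)
{
	unsigned long slow_nodes = 1UL << 1;	/* e.g. node 1 = PMEM */
	unsigned long fast_nodes = 1UL << 0;	/* e.g. node 0 = DRAM */
	unsigned long nr_pages = 4096;		/* per-invocation migration budget */
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = atoi(argv[1]);

	for (;;) {
		/* shrink the memcg page lists, then migrate hot pages to DRAM */
		if (syscall(__NR_mm_manage, pid, nr_pages,
			    8 * sizeof(unsigned long),
			    &slow_nodes, &fast_nodes,
			    MPOL_MF_SHRINK_LISTS | MPOL_MF_MOVE))
			perror("mm_manage");
		sleep(5);	/* refresh page access information periodically */
	}
	return 0;
}

The polling interval and the nr_pages budget are among the parts of the
daemon's action plan that are still under exploration.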

On the other hand, if we wait until DRAM is full and use page migration to make room in DRAM
for either page promotion or new page allocation, page migration sits on the critical path
of application execution. Considering that the bandwidth and access latency gaps between
DRAM and PMEM are not as large as the gaps between DRAM and SSD, the cost of page migration
(4KB/0.312GB/s = 12us or 2MB/2.387GB/s = 818us) might defeat the benefit of using DRAM over PMEM.
I just wonder which would be better: waiting 12us or 818us and then reading 4KB or 2MB of data
from DRAM, or directly accessing the data in PMEM without waiting.

Let me know if this makes sense to you.

Thanks.

--
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-04-05 17:20 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-04  2:00 [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Zi Yan
2019-04-04  2:00 ` [RFC PATCH 01/25] mm: migrate: Change migrate_mode to support combination migration modes Zi Yan
2019-04-04  2:00 ` [RFC PATCH 02/25] mm: migrate: Add mode parameter to support future page copy routines Zi Yan
2019-04-04  2:00 ` [RFC PATCH 03/25] mm: migrate: Add a multi-threaded page migration function Zi Yan
2019-04-04  2:00 ` [RFC PATCH 04/25] mm: migrate: Add copy_page_multithread into migrate_pages Zi Yan
2019-04-04  2:00 ` [RFC PATCH 05/25] mm: migrate: Add vm.accel_page_copy in sysfs to control page copy acceleration Zi Yan
2019-04-04  2:00 ` [RFC PATCH 06/25] mm: migrate: Make the number of copy threads adjustable via sysctl Zi Yan
2019-04-04  2:00 ` [RFC PATCH 07/25] mm: migrate: Add copy_page_dma to use DMA Engine to copy pages Zi Yan
2019-04-04  2:00 ` [RFC PATCH 08/25] mm: migrate: Add copy_page_dma into migrate_page_copy Zi Yan
2019-04-04  2:00 ` [RFC PATCH 09/25] mm: migrate: Add copy_page_lists_dma_always to support copy a list of pages Zi Yan
2019-04-04  2:00 ` [RFC PATCH 10/25] mm: migrate: copy_page_lists_mt() to copy a page list using multi-threads Zi Yan
2019-04-04  2:00 ` [RFC PATCH 11/25] mm: migrate: Add concurrent page migration into move_pages syscall Zi Yan
2019-04-04  2:00 ` [RFC PATCH 12/25] exchange pages: new page migration mechanism: exchange_pages() Zi Yan
2019-04-04  2:00 ` [RFC PATCH 13/25] exchange pages: add multi-threaded exchange pages Zi Yan
2019-04-04  2:00 ` [RFC PATCH 14/25] exchange pages: concurrent " Zi Yan
2019-04-04  2:00 ` [RFC PATCH 15/25] exchange pages: exchange anonymous page and file-backed page Zi Yan
2019-04-04  2:00 ` [RFC PATCH 16/25] exchange page: Add THP exchange support Zi Yan
2019-04-04  2:00 ` [RFC PATCH 17/25] exchange page: Add exchange_page() syscall Zi Yan
2019-04-04  2:00 ` [RFC PATCH 18/25] memcg: Add per node memory usage&max stats in memcg Zi Yan
2019-04-04  2:00 ` [RFC PATCH 19/25] mempolicy: add MPOL_F_MEMCG flag, enforcing memcg memory limit Zi Yan
2019-04-04  2:00 ` [RFC PATCH 20/25] memory manage: Add memory manage syscall Zi Yan
2019-04-04  2:00 ` [RFC PATCH 21/25] mm: move update_lru_sizes() to mm_inline.h for broader use Zi Yan
2019-04-04  2:00 ` [RFC PATCH 22/25] memory manage: active/inactive page list manipulation in memcg Zi Yan
2019-04-04  2:00 ` [RFC PATCH 23/25] memory manage: page migration based page manipulation between NUMA nodes Zi Yan
2019-04-04  2:00 ` [RFC PATCH 24/25] memory manage: limit migration batch size Zi Yan
2019-04-04  2:00 ` [RFC PATCH 25/25] memory manage: use exchange pages to memory manage to improve throughput Zi Yan
2019-04-04  7:13 ` [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management Michal Hocko
2019-04-05  0:32 ` Yang Shi
2019-04-05 17:20   ` Zi Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).